In both daily life and many research fields, answering the "where" question is an essential task. A place is a concept carrying rich semantic information bounded within a specific geographic space. One of the most straightforward ways of perceiving a place and sharing its information with others is to take pictures of it. Images often contain both semantic and spatial properties at various levels of detail. An image of a visible scene, such as a desert or a beach, may help us localize the image at a relatively coarse level of detail. An image containing tangible objects, such as buildings, street signs, or topological features of a terrain, provides strong clues to support geo-localization at a relatively fine level of detail. The spatial property, on the one hand, covers the spatial locations of tangible objects in a place; on the other hand, it reflects the location of the camera from which those objects are photographed. Rapid advances in deep learning have improved machine understanding of image contents. In this research, we address the challenge of understanding a place by detecting visual and spatial properties in street-level images with a specific type of deep learning model, the variational autoencoder (VAE).
In this research we test two place representations based on latent variable models. The first directly preserves all visual content in a latent place representation; by adding categorical labels during training, the latent representations are distinguished according to different places. The second starts from a holistic latent place representation and disentangles the visual information from the view information. For this representation, we use the camera pose in the learning process, so that the learned place representation is disentangled from the camera pose.
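The second, pose-disentangled representation can be sketched as a VAE whose encoder produces a latent place code while a separate head predicts the camera pose, and whose decoder reconstructs the image from the place code concatenated with the pose. The sketch below is a minimal, hypothetical illustration of this idea in PyTorch; the architecture, dimensions, and loss weighting are assumptions for clarity, not the implementation used in this work.

```python
import torch
import torch.nn as nn

class DisentangledPlaceVAE(nn.Module):
    """Hypothetical sketch: encode an image feature into a place code z,
    predict the camera pose separately, and reconstruct from [z, pose],
    so that appearance (place) and view (pose) are disentangled."""

    def __init__(self, img_dim=1024, place_dim=32, pose_dim=6):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, place_dim)        # mean of latent place code
        self.logvar = nn.Linear(256, place_dim)    # log-variance of place code
        self.pose_head = nn.Linear(256, pose_dim)  # camera pose (e.g. xyz + rotation)
        self.decoder = nn.Sequential(
            nn.Linear(place_dim + pose_dim, 256), nn.ReLU(),
            nn.Linear(256, img_dim))

    def forward(self, x, pose):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # reparameterization trick: sample z while keeping gradients
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # decoding is conditioned on pose, so z need not encode the view
        recon = self.decoder(torch.cat([z, pose], dim=-1))
        return recon, mu, logvar, self.pose_head(h)

def loss_fn(recon, x, mu, logvar, pose_pred, pose):
    """ELBO-style loss plus a pose regression term (weights are assumed)."""
    rec = ((recon - x) ** 2).mean()                                  # reconstruction
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # KL divergence
    pose_loss = ((pose_pred - pose) ** 2).mean()                     # pose supervision
    return rec + kld + pose_loss
```

Conditioning the decoder on the ground-truth pose during training is what encourages disentanglement: since the view information is supplied externally, the latent code is free to specialize in place appearance.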
Although understanding the visual and spatial properties is only a first step toward the concept of a place, the learned place representation can serve as evidence or hypotheses to support further deep learning tasks dedicated to more complex spatial intelligence problems.
The base map is produced with kepler.gl; the map style is adopted from Mapbox styles (see: https://docs.mapbox.com/studio-manual/reference/styles/).
The images are from the Cambridge Landmark Dataset (see: http://mi.eng.cam.ac.uk/projects/relocalisation/).