Millions of camera phones and digital capture devices are sold annually worldwide. The even greater number of photos and videos captured by these devices carries significant value both for consumers and, often, for our culture. Improving the metadata attached to these resources will make the collections more accessible, searchable and, as a result, relevant for personal and public use.
This challenge focuses on capture device location and orientation, one dimension of content metadata. The problem can be stated simply: try to derive exact camera poses (location and orientation) of given photos that are lacking location annotation. This kind of technology could potentially be used to add metadata to existing or newly captured photos.
Assumptions: You can assume the availability of nearby photos/video with known location that can be used to derive unknown camera poses; other ideas that do not require existing content will be welcome. While a “clean” solution is ideal, other models that help could be used, for example, exploiting inertial sensor data, properties of personal collections, or the presence of textual descriptors. Sub-solutions can assume the existence, for example, of some fuzzy specification of location (e.g., via cell tower ID) for the content.
Objective
Embedded GPS and orientation sensors make it possible to create a variety of interesting spatial photo browsing experiences (e.g., Nokia ImageSpace). A sensor-based approach, however, suffers from a lack of accuracy. GPS has an error on the order of several meters, and accelerometer/magnetometer-based orientation measurement is sensitive to motion and magnetic disturbances. In practice, sensor-based orientation can at best reach an accuracy of plus or minus a few degrees.
On the other hand, it has also been demonstrated that computer vision techniques can be used to obtain the same pose information directly from the images. The technical challenge here is to combine the best of both worlds: sensor-based camera poses can be corrected by accurate image matching, while GPS and sensors can make the computer vision problem tractable on a large scale.
Figure 1 illustrates an example set of images and noisy sensor-based camera poses. Your pose correction system should automatically correct the poses and output something similar to Figure 2.

Figure 1: Input data set

Figure 2: Highlighted cameras have a corrected pose. Current solutions can only correct the pose for cameras which have significant overlap, which is clearly visible here. The challenge is to extend computer vision based corrections to images with little or no overlap.
Submitted solutions are evaluated based on accuracy and robustness:
- The number of photos that can be successfully registered.
- The error in recovered camera poses relative to the ground truth.
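The two criteria above can be sketched in Python as follows. This is only an illustration, not the official evaluation procedure, which is not specified here. It assumes poses are given as hypothetical `(east, north, up, yaw)` tuples in a local metric frame, that a photo counts as registered whenever a solution returns any estimate for it, and that only yaw is compared; the function names are made up for this example.

```python
import math

def angle_diff_deg(a, b):
    """Smallest signed difference a - b in degrees, in [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0

def evaluate(estimates, ground_truth):
    """Compare estimated poses with ground truth.

    Both arguments map an image id to (east, north, up, yaw_deg).
    An estimate of None (or a missing id) means the photo was not registered.
    """
    pos_errors, yaw_errors = [], []
    for image_id, truth in ground_truth.items():
        est = estimates.get(image_id)
        if est is None:
            continue  # this photo could not be registered
        # Euclidean distance between estimated and true camera positions.
        pos_errors.append(math.dist(est[:3], truth[:3]))
        # Absolute heading error, wrapped so that 350 vs 10 counts as 20.
        yaw_errors.append(abs(angle_diff_deg(est[3], truth[3])))
    n = len(pos_errors)
    return {
        "registered": n,
        "total": len(ground_truth),
        "mean_position_error_m": sum(pos_errors) / n if n else None,
        "mean_yaw_error_deg": sum(yaw_errors) / n if n else None,
    }
```

Wrapping the heading difference matters because compass-style angles are periodic: a naive subtraction would report a 340 degree error for two headings that are only 20 degrees apart.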
Current methods typically fail to recover a corrected pose for a number of cameras and are fairly slow, so speed is taken into account alongside accuracy when evaluating submitted solutions.
The pose correction system should be completely automatic.
Many methods for pose correction may automatically be suitable for reconstructing 3D information of the world. Such reconstructions can be a valuable part of a browsing system and reconstructions of any kind are considered when evaluating winners.
Data Sets
For this challenge, several data sets will be provided. These data sets are captured with the Nokia 6210 Navigator and each image has an associated GPS measurement and orientation estimate based on the accelerometers and magnetometers embedded in the phone.
To get you started, a small example data set is already provided. Data sets come as zip files containing a directory of images and an XML file describing the relevant metadata. The XML is fairly self-explanatory, and looks like this:
<?xml version="1.0" encoding="utf-8"?>
<imageset>
<image height="600" id="0" src="demoset/image0.jpg" width="800">
<geolocation alt="0" lat="43.731777279237" lng="7.421039803853"/>
<orientation pitch="0" roll="0" yaw="41"/>
</image>
<image height="600" id="1" src="demoset/image1.jpg" width="800">
<geolocation alt="0" lat="43.731745092741" lng="7.421010886298"/>
<orientation pitch="15" roll="0" yaw="102"/>
</image>
<image height="600" id="2" src="demoset/image2.jpg" width="800">
<geolocation alt="0" lat="43.73173595647" lng="7.421045168269"/>
<orientation pitch="7.5" roll="0" yaw="111"/>
</image>
.
.
.
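For those not using the provided Matlab tools, a file in this format could be read along the following lines. This is a minimal Python sketch using only the standard library, not part of the challenge tool set. The function name `parse_image_set` and the dictionary layout are assumptions made for illustration, and the closing `</imageset>` tag in the embedded sample is also assumed, since the excerpt above is truncated.

```python
import xml.etree.ElementTree as ET

# A complete miniature document in the format shown above.
# The closing </imageset> tag is an assumption (the excerpt is truncated).
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<imageset>
<image height="600" id="0" src="demoset/image0.jpg" width="800">
<geolocation alt="0" lat="43.731777279237" lng="7.421039803853"/>
<orientation pitch="0" roll="0" yaw="41"/>
</image>
<image height="600" id="1" src="demoset/image1.jpg" width="800">
<geolocation alt="0" lat="43.731745092741" lng="7.421010886298"/>
<orientation pitch="15" roll="0" yaw="102"/>
</image>
</imageset>
"""

def parse_image_set(xml_bytes):
    """Return one dict per <image> element with its sensor-based pose."""
    root = ET.fromstring(xml_bytes)
    images = []
    for node in root.iter("image"):
        geo = node.find("geolocation")
        ori = node.find("orientation")
        images.append({
            "id": int(node.get("id")),
            "src": node.get("src"),
            "width": int(node.get("width")),
            "height": int(node.get("height")),
            # Position attributes, taken as-is from the file.
            "lat": float(geo.get("lat")),
            "lng": float(geo.get("lng")),
            "alt": float(geo.get("alt")),
            # Orientation attributes, taken as-is from the file.
            "yaw": float(ori.get("yaw")),
            "pitch": float(ori.get("pitch")),
            "roll": float(ori.get("roll")),
        })
    return images

images = parse_image_set(SAMPLE.encode("utf-8"))
```

To read a real data set, the bytes of the XML file from the zip archive would be passed in place of `SAMPLE.encode("utf-8")`.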
Available data sets
- demoset
This is to get you started. We are working to provide larger data sets as well as data sets with ground truth poses. Closer to the conference date, a final competition set is planned to be published without ground truth, and competitors can submit corrected pose sets, which will be evaluated for accuracy.
- lausanne
This set represents fairly realistic, but extremely challenging conditions. It is our hope that researchers will soon find methods that are able to fairly accurately reconstruct the camera poses from a set like this. For the time being, though, we are not expecting submissions which can fully recover all the camera poses in this set. The images were taken in the city of Lausanne, Switzerland, starting from Place Centrale, going around at Place Pepinet, continuing on Rue Centrale and taking a turn to Rue du Pont. From the fountain the picture set continues on Escaliers du Marche towards the Cathedrale de Lausanne. There are good satellite and aerial images of the city publicly available, e.g. at Live.com, which may help you to visually inspect your results.
It is recommended, but not required, that you use these data sets. Submissions that address the challenge problem in a novel way which may not be compatible with provided data are also accepted.
Tools
A set of small tools is provided to help contestants get started. These tools are Matlab functions that read, write and display the data sets. You can download the tool set ("tools").
Example:
>> S = load_set('demoset.xml');
>> view_set(S)
This should produce the results in Figure 2. The red arrow points north and the blue arrow points east. Orientation measurements are based on accelerometer and magnetometer data and are very sensitive to disturbances.
In this example, some of the orientations are clearly wrong. Such errors are to be expected and the idea in this challenge is to exploit the connections between the images to correct for sensor errors and to average out GPS error to get improved pose estimates of all the cameras in a georeferenced coordinate system.
You may feel more comfortable working with a Cartesian coordinate system, rather than the geodetic system. The functions geo2ecef and ecef2geo are example functions that can be used to convert between the geodetic and the Earth Centered Earth Fixed (ECEF) systems.
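For readers working outside Matlab, the geodetic/ECEF conversion can be sketched in Python as below. This is not the code of the provided geo2ecef and ecef2geo functions, whose internals are not shown here. The sketch assumes the WGS84 ellipsoid, latitude and longitude in degrees, and altitude in meters above the ellipsoid; the function names are hypothetical. The inverse uses a simple fixed-point iteration that is not suitable for points at the poles.

```python
import math

# WGS84 ellipsoid constants (assumed datum for this sketch).
WGS84_A = 6378137.0                   # semi-major axis in meters
WGS84_F = 1.0 / 298.257223563         # flattening
WGS84_E2 = WGS84_F * (2.0 - WGS84_F)  # first eccentricity squared

def geodetic_to_ecef(lat_deg, lng_deg, alt_m):
    """Convert geodetic coordinates to ECEF (x, y, z) in meters."""
    lat = math.radians(lat_deg)
    lng = math.radians(lng_deg)
    # Prime vertical radius of curvature at this latitude.
    n = WGS84_A / math.sqrt(1.0 - WGS84_E2 * math.sin(lat) ** 2)
    x = (n + alt_m) * math.cos(lat) * math.cos(lng)
    y = (n + alt_m) * math.cos(lat) * math.sin(lng)
    z = (n * (1.0 - WGS84_E2) + alt_m) * math.sin(lat)
    return x, y, z

def ecef_to_geodetic(x, y, z, iterations=10):
    """Convert ECEF meters to (lat_deg, lng_deg, alt_m) by iteration.

    Not valid at the poles, where the distance from the axis is zero.
    """
    lng = math.atan2(y, x)
    p = math.hypot(x, y)  # distance from the rotation axis
    lat = math.atan2(z, p * (1.0 - WGS84_E2))  # exact when altitude is zero
    alt = 0.0
    for _ in range(iterations):
        n = WGS84_A / math.sqrt(1.0 - WGS84_E2 * math.sin(lat) ** 2)
        alt = p / math.cos(lat) - n
        lat = math.atan2(z, p * (1.0 - WGS84_E2 * n / (n + alt)))
    return math.degrees(lat), math.degrees(lng), alt
```

A quick sanity check is that a point on the equator at zero longitude and zero altitude maps to the semi-major axis on the x axis, and that converting there and back returns the starting coordinates.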
It may also be useful at times to work in the local “map” coordinates, or East North Up (ENU) system. The local_frame function can be used to obtain such a local system. It may be useful to study the view_set function as an example of the use of these different coordinate conversions.
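The idea of a local ENU frame can be illustrated with a short Python sketch. It is not the provided local_frame function, whose interface is not documented here. It assumes the WGS84 ellipsoid and degrees for angles, and the function names are hypothetical. A point is first converted to ECEF, then its offset from a chosen reference point is rotated into east, north and up components.

```python
import math

# WGS84 ellipsoid constants (assumed datum for this sketch).
WGS84_A = 6378137.0
WGS84_F = 1.0 / 298.257223563
WGS84_E2 = WGS84_F * (2.0 - WGS84_F)

def geodetic_to_ecef(lat_deg, lng_deg, alt_m):
    """Convert geodetic coordinates to ECEF (x, y, z) in meters."""
    lat = math.radians(lat_deg)
    lng = math.radians(lng_deg)
    n = WGS84_A / math.sqrt(1.0 - WGS84_E2 * math.sin(lat) ** 2)
    return ((n + alt_m) * math.cos(lat) * math.cos(lng),
            (n + alt_m) * math.cos(lat) * math.sin(lng),
            (n * (1.0 - WGS84_E2) + alt_m) * math.sin(lat))

def geodetic_to_enu(lat_deg, lng_deg, alt_m, ref_lat_deg, ref_lng_deg, ref_alt_m):
    """Express a point in the East North Up frame anchored at a reference."""
    x, y, z = geodetic_to_ecef(lat_deg, lng_deg, alt_m)
    x0, y0, z0 = geodetic_to_ecef(ref_lat_deg, ref_lng_deg, ref_alt_m)
    dx, dy, dz = x - x0, y - y0, z - z0
    lat0 = math.radians(ref_lat_deg)
    lng0 = math.radians(ref_lng_deg)
    # Rotate the ECEF offset into the tangent plane at the reference point.
    east = -math.sin(lng0) * dx + math.cos(lng0) * dy
    north = (-math.sin(lat0) * math.cos(lng0) * dx
             - math.sin(lat0) * math.sin(lng0) * dy
             + math.cos(lat0) * dz)
    up = (math.cos(lat0) * math.cos(lng0) * dx
          + math.cos(lat0) * math.sin(lng0) * dy
          + math.sin(lat0) * dz)
    return east, north, up

# Offset in meters of demoset image 1 from image 0, using the GPS values
# from the sample XML, with image 0 as the reference point.
offset = geodetic_to_enu(43.731745092741, 7.421010886298, 0.0,
                         43.731777279237, 7.421039803853, 0.0)
```

Working in such a frame keeps all camera positions in meters around a nearby origin, which tends to be more convenient for image matching and for plotting than raw latitude and longitude.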
Awards
Awards will be announced at a later time. Stay tuned for details on the Multimedia Grand Challenge site.
Feel free to correspond with the challenge authors via the comments form below.
For private correspondence, consult the About page for contact details.



