ACM MM 2009 Header Image

Current TV Challenge: Media Production in the Age of Community

Community media is all around us. News is broadcast everywhere, from websites to Facebook feeds. In many cases, the conversation about news is as important as the news event itself. News media providers are seeking ways to aid the production of content through the merger and analysis of video and social streams. This raises many questions:
•    What kinds of social streams can be aligned in real time to live media?
•    What video content features align to social streams? For example, what is the relationship between stadium video of camera flashbulbs and an onset of social status messages, such as Twitter tweets about someone’s dress?
•    How could social streams be used to find highlights and summarizations of events?
•    How is this video segment important to a community?  What deeper insight can the analysis of social streams add to news reportage?

Application

We are seeking applications that produce insightful social media analysis of news and events, e.g. breaking news stories, political events (speeches, elections, press conferences, legislative voting), and major media events (Oscars, Grammys, Pulitzer Prize announcements, etc.). These applications can be autonomous or semi-autonomous. In the latter case, the applications could require some minimal human intervention to ‘curate’ the production of new media. For a live broadcast scenario, the application could, for example, find relevant commentary from social sites aligned in real time. For recorded broadcasts, applications can align the show to a set of time-delivered social comments and discover highlights. Given the plurality of social sources, performing video and text analysis at scale will be an issue for live events, and perhaps even for recorded broadcasts.

Input—Any video stream of the type described above (or others) plus any combination of social sources (blog posts, Flickr Photos, Twitter tweets, Facebook Status messages, etc.). Content can be pulled from any number of public sources via public APIs or via any other feasible mechanisms. Live broadcasts can be ‘simulated’ for demonstration purposes.

Output—A filtering of social media aligned to the video stream. Once aligned, the social media should be analyzed to illuminate why a video segment is meaningful. There could be various paths for the analysis, perhaps based on categories. For example, a category of visual commentary includes information that relates to items and objects in the video: does this relate to the motorcade or the speech? Is it the speaker or their attire? What is the most insightful commentary about the identified objects? A topical commentary will show: Is this about a domestic or foreign issue? Is this about national security or health care? In another example, a sentiment analysis might show mood and reaction. For all analyses, being able to identify why an event is meaningful via the social commentary is key.

Evaluation

We will look for success in the following areas:

  • Efficacy of the filtration and categorization.
  • Speed of the application.
  • Presentation novelty or attractiveness.
  • Curator user interface for the cases that require human curation.
  • Quality of aggregation of community contributions over time, with best representative samples. That is, for any segment, find representative topics and categories and the best sample social media excerpts for:
    • What did the majority of people say? (80-100%)
    • What did the core population say? (20-79%)
    • What did the outliers say? (< 20%)
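
One way to read the three percentage bands above is as the share of distinct commenters in a segment who raised a given topic. The following is a minimal sketch under that assumption; the function name, the tuple layout, and the idea that topic labels come from some upstream text classifier are all hypothetical and not part of the challenge definition.

```python
from collections import defaultdict

# Hypothetical input: (seconds_into_broadcast, author, topic) tuples.
# Topic labels are assumed to come from an upstream text classifier.
def bucket_topics(comments, segment_start, segment_end):
    """Group a segment's topics by the share of distinct authors who raised them."""
    authors_by_topic = defaultdict(set)
    all_authors = set()
    for timestamp, author, topic in comments:
        # Keep only comments whose timestamp falls inside the video segment.
        if segment_start <= timestamp < segment_end:
            authors_by_topic[topic].add(author)
            all_authors.add(author)

    buckets = {"majority": [], "core": [], "outliers": []}
    if not all_authors:
        return buckets
    for topic, authors in sorted(authors_by_topic.items()):
        share = len(authors) / len(all_authors)
        if share >= 0.80:        # raised by 80-100% of commenters
            buckets["majority"].append(topic)
        elif share >= 0.20:      # raised by 20% up to (but not including) 80%
            buckets["core"].append(topic)
        else:                    # raised by fewer than 20%
            buckets["outliers"].append(topic)
    return buckets
```

Picking the best sample excerpt per bucket, which the criterion also asks for, is left out here; it would need a ranking signal that the challenge text does not specify.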

Sample past and future events:
•    Election night: November 4th, 2008
•    Inauguration day: January 20th, 2009
•    Oscar night: February 22nd, 2009
•    Macworld
•    CES
•    Graduation Day, high school and college
•    American Idol finale
•    Baseball World Series


Sample past breaking news:
•    Hudson River plane crash, January 15th, 2009
•    Australian bushfires: early February 2009
•    Major stock market fluctuations (past and future, starting in Sept 2008)

Feel free to correspond with the challenge authors via the comments form below.

For private correspondence, consult the About page for contact details.

Accenture Challenge: Analysis of Video Footage Captured in Uncontrolled Environments

The proliferation of cameras has led to an explosion of video content. Often, it is necessary to analyze this corpus after an event has occurred. An event might involve one or more objects (e.g. people, cars, etc.) and the objects’ interactions with each other. We might then want to search the corpus for similar events or for objects that were part of the event. Often, we might not know the objects of interest until we see the events. Sample questions that may arise, then, include:

•    What are some categories of objects and a set of higher-level events that the objects could help identify?
•    Can the system identify key objects if given footage of an event?
•    Can we use the system or parts of it (recognition algorithms, event inference, etc.) to analyze real-time camera feeds?
•    In many cases, one needs to identify the ‘onset’ of an event, rather than the event itself. How would a system find event onsets?

Application

We are looking for applications that can address the questions listed above. How would one build such an application/interface that was event and concept centric? What is the ideal interface to search and navigate the video corpus? Can we use the video of the event to allow users to select certain objects and then track/detect those objects?

Input – A video corpus such as data from surveillance cameras, and knowledge about the camera networks (if any). However, since this data might be hard to get, any available video dataset can be used.

Output – Categories of objects and events that we can identify based on these objects and their interactions, and a good representation for these objects or events. We would like to see what events we can reliably identify based on the objects, which may take domain-specific or task-specific knowledge into account. We would like to track higher-level semantic events (in the context of the dataset) as opposed to visual events. For example, in the context of a dataset of sports videos, we do not want to stop at tracking a soccer ball or a red shirt. Rather, we would define the event (at the least) as a soccer game between team X and team Y. In the context of surveillance, we would not stop at detecting faces and silhouettes. Rather, we would like to define the event (which is a result of the objects and their interactions) at a higher level, e.g. an assault. More importantly, we would like to see what events the community can define in the context of surveillance.

Evaluation

We will look for performance in the following areas and the ability to work in uncontrolled environments.
•    Categories of objects and events
•    Ability for system to incorporate new objects
•    Precision and timing (retrieval of similar videos)
•    Application: Ease of use in identifying events and categorical assets

Systems can make assumptions regarding video quality (resolution, etc.), and the comparison of these systems will be consistent with the assumptions they make.

Multimedia Grand Challenge: Kick-off!

Are you ready to help shape the future of multimedia?


We have just posted eight challenges from six companies: HP, Google, CeWe, Yahoo!, Radvision, and Nokia. We invite you to examine the challenges, think about the solutions, and take part in the fun. Submissions are due in June; you can find the gory details on this page.


Most important, make sure you grab the RSS feed or join the group / mailing list to get the latest updates (for example, prizes have not been announced yet, but they might be coming!).


Finally, if you represent a company and want to participate, there is still time! Simply contact the organizers via email (see here) or post a comment right here.


Yahoo! Challenge: Robust Automatic Segmentation of Video According to Narrative Themes

Video search today relies mostly on textual metadata associated with the video, such as the title, tags or surrounding page text. This approach falls severely short by ignoring the richness of information within the video medium; an engine should ideally use this information to help a user search and navigate content. As video content explodes and user attention spans shrink, a next-generation video search engine needs to let users search for sections within a video, consume the bits and pieces of a video that interest them, and kill time during lunch breaks in creative ways. In addition, instead of offering just one thumbnail as the representation of a whole video, it would be great to partition a video into its constituent narrative themes and allow users to navigate through a video at a more granular level with better video surrogates.

The challenge to researchers in the multi-media community is to develop methods, techniques, and algorithms to automatically generate narrative themes for a given video, as well as present the content in an easy-to-consume manner to end-users in a search engine experience. Naturally, the themes that emerge depend entirely on the video itself – so the methods / algorithms have to be generic. Still, there could be approaches developed for certain types and genres of videos. For instance, one approach could be employed for sitcoms, sports content could have another, educational content could have another, etc.

Input/output

To use a pop-culture example, imagine the input is an episode of Seinfeld (an NBC TV show popular during the 1990s). The output will ideally be 3 or 4 narrative themes around each character and the corresponding video start & end ranges. The themes could overlap, but they do not have to. A way to navigate (user interface) through these narrative themes should also be presented. This output, of course, should be searchable (in that it generates a better representation of content to the search engines, as well). For example, if the output was a character name, “Kramer”, then if a user entered just “Kramer” as a search term, this Seinfeld episode video should surface in the results with the corresponding narrative themes surrounding Kramer also being presented to the user to enable them to browse/click.

In other examples, if the input is a financial news video talking about the economy, a bailout package, etc., then the themes that emerge could be the company names, executives, government officials, etc. that are mentioned. If the input is a sports game, then the output themes could be the major points in the game; for baseball, these might be home runs, hits, walks, strikeouts, inning changes, etc.
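
To make the expected output more concrete, the sketch below shows one possible representation of narrative themes with start and end times, plus a simple lookup so that a query such as “Kramer” surfaces the matching segments. The class name, field names, and exact-match search are illustrative assumptions only; a real search engine pipeline would look quite different.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ThemeSegment:
    """One narrative-theme range within a video (all fields are hypothetical)."""
    video_id: str
    theme: str
    start_s: float  # segment start, in seconds from the beginning of the video
    end_s: float    # segment end, in seconds

def build_theme_index(segments):
    """Map each lower-cased theme name to its segments, ordered by video and start time."""
    index = {}
    for seg in segments:
        index.setdefault(seg.theme.lower(), []).append(seg)
    for hits in index.values():
        hits.sort(key=lambda s: (s.video_id, s.start_s))
    return index

def search(index, query):
    """Return the segments whose theme exactly matches the query (case-insensitive)."""
    return index.get(query.strip().lower(), [])
```

Themes are allowed to overlap in this representation, since each segment carries its own time range.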

Yahoo! can provide a few sample videos in various domains for which it holds copyright permissions for this research purpose. More information will become available on this blog.

Metrics/Evaluation

There will be 3 criteria for evaluation:
-    Relevance of narrative themes
-    Innovative presentation & navigation of sub-themes for a video
-    Efficiency of the underlying algorithm

The key criterion for evaluation will be the relevance of the themes that are extracted from a particular video. When evaluating such services at Yahoo!, we would have human judges (usually editors or product managers) rate the relevance of the narrative themes to a given video.

The second important criterion is creativity in presenting the sub-themes of a video in a way that allows for ease of browsing. We are looking for solutions that will increase both findability and engagement with content found via search engines or while browsing.

Lastly, the elegance of the solution will be evaluated by its ease of integration into a search engine’s pipeline and by the efficiency with which it can process a video and output the narrative themes; the latter refers to processing speed. If one technology takes a day to chew through a 20-minute video and spit out the narrative themes while another can process the video in real time or faster (20 minutes or less for a 20-minute video), then the latter is clearly much more attractive. New approaches and algorithms that reduce or optimize computation may be required.
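
The speed comparison above can be expressed as a ratio of processing time to video length. This is only a small illustration of that arithmetic; the function name is made up, and the challenge does not prescribe this particular measure.

```python
def real_time_factor(processing_seconds, video_seconds):
    """Processing time divided by video length; 1.0 or less means real time or faster."""
    if video_seconds <= 0:
        raise ValueError("video_seconds must be positive")
    return processing_seconds / video_seconds

# A 20-minute video that takes a full day to process, versus one processed in real time.
slow = real_time_factor(24 * 60 * 60, 20 * 60)
fast = real_time_factor(20 * 60, 20 * 60)
```

Under this measure, the day-long run in the example is 72 times slower than real time.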

Radvision Challenge: Video Conferencing To Surpass “In-Person” Meeting Experience

Video conferencing is part of a $5 billion real-time collaboration market that includes audio, video and web conferencing products and services.

The great challenge for video conferencing vendors is to supply users with a meeting experience that equals or surpasses “in-person” meetings. It is assumed that once the meeting experience is good enough, or even better, the technology could minimize the need for “physical” meetings (at least for business purposes). Such a reduction would mean less travel, lower cost (to people, organizations, and the planet), better efficiency and better communication.

This challenge focuses on developing new technologies and ideas to surpass the “in-person” meeting experience. In the process a set of subjective and objective measures to evaluate “meeting” experience will be developed. With these measures, alternative solutions could be compared to each other and to in-person meetings, and optimized accordingly.

Dataset

Not required.

Metrics/Evaluation

As noted above, we are hoping for new metrics, objective and subjective, to be developed that capture the meeting experience. Ideally, the objective and subjective metrics will be highly correlated, and the metrics will be robust and reliable. Those metrics could be used to compare existing video conferencing solutions, in-person meetings and newly proposed technologies.
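
As one possible way to check how well an objective metric tracks a subjective one, the sketch below computes the Pearson correlation coefficient between two lists of per-meeting scores. Pearson correlation is an assumption on our part (the challenge does not name a statistic), and the score lists are presumed to be aligned meeting by meeting.

```python
import math

def pearson(objective_scores, subjective_scores):
    """Pearson correlation between two equally long lists of per-meeting scores."""
    n = len(objective_scores)
    if n != len(subjective_scores) or n < 2:
        raise ValueError("need two score lists of equal length, at least 2 items each")
    mean_x = sum(objective_scores) / n
    mean_y = sum(subjective_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(objective_scores, subjective_scores))
    var_x = sum((x - mean_x) ** 2 for x in objective_scores)
    var_y = sum((y - mean_y) ** 2 for y in subjective_scores)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)
```

A value close to 1 would suggest that the objective measure rises and falls with participants’ ratings; a rank-based statistic may be a better fit when subjective scores are ordinal.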

About Radvision

Radvision (Nasdaq: RVSN) is the industry’s leading provider of products and technologies for unified visual communications over IP, 3G and emerging IMS/Next Generation networks - enabling high definition video-conferencing, converged video telephony services, and scalable desktop–based visual communications.

Nokia Challenge: Where was this Photo Taken, and How?

Millions of camera phones and digital capture devices are sold annually worldwide. The even-greater number of photos and videos captured by these devices carry a significant value both for consumers and, often, for our culture. Improving the metadata attached to these resources will make the collections more accessible, searchable and, as a result, relevant for personal and public use.

This challenge focuses on capture device location and orientation, one dimension of content metadata. The problem can be stated simply: try to derive exact camera poses (location and orientation) of given photos that are lacking location annotation. This kind of technology could potentially be used to add metadata to existing or newly captured photos.

Assumptions: You can assume the availability of nearby photos/video with known location that can be used to derive unknown camera poses; other ideas that do not require existing content will be welcome. While a “clean” solution is ideal, other models that help could be used, for example, exploiting inertial sensor data, properties of personal collections, or the presence of textual descriptors. Sub-solutions can assume, for example, the existence of some fuzzy specification of location (e.g., via cell tower ID) for the content.

Objective

Embedded GPS and orientation sensors make it possible to create a variety of interesting spatial photo browsing experiences (e.g., Nokia ImageSpace). The sensor-based approach, however, suffers from a lack of accuracy. GPS has an error on the order of several meters, and accelerometer/magnetometer-based orientation measurement is sensitive to motion and magnetic disturbances. In practice, sensor-based orientation can, at best, reach an accuracy of plus or minus a few degrees.

On the other hand, it has also been demonstrated that computer vision techniques can be used to obtain the same pose information directly from the images. The technical challenge here is to combine the best of both worlds: sensor-based camera poses can be corrected by accurate image matching, while GPS and sensors can make the computer vision problem tractable on a large scale.

Figure 1 illustrates an example set of images and noisy sensor-based camera poses. Your pose correction system should automatically correct the poses and output something similar to Figure 2.

Figure 1: Input data set

Figure 2: Highlighted cameras have a corrected pose. Current solutions can only correct the pose for cameras which have significant overlap, which is clearly visible here. The challenge is to extend computer vision based corrections to images with little or no overlap.

Submitted solutions are evaluated based on accuracy and robustness:

  1. The number of photos that can be successfully registered.
  2. The error in recovered camera poses relative to the ground truth.
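
The second criterion could be measured in many ways; the sketch below shows one simple option, assuming ground truth is given as latitude/longitude in degrees and an orientation angle (such as yaw) in degrees. The helper names are hypothetical, the distance uses a spherical-Earth (haversine) approximation, and altitude is ignored.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; a spherical approximation

def horizontal_error_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two lat/lng points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def angle_error_deg(estimated, truth):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    diff = (estimated - truth) % 360.0
    return min(diff, 360.0 - diff)
```

Wrapping the angle difference matters for headings near north: an estimate of 350 degrees against a true yaw of 10 degrees is off by 20 degrees, not 340.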

Current methods typically fail to recover a corrected pose for a number of cameras and are fairly slow. Submitted solutions are evaluated based on speed and accuracy.

The pose correction system should be completely automatic.

Many methods for pose correction may automatically be suitable for reconstructing 3D information of the world. Such reconstructions can be a valuable part of a browsing system and reconstructions of any kind are considered when evaluating winners.

Data Sets

For this challenge, several data sets will be provided. These data sets are captured with the Nokia 6210 Navigator and each image has an associated GPS measurement and orientation estimate based on the accelerometers and magnetometers embedded in the phone.

To get you started, a small example data set is already provided. Data sets come as zip files containing a directory of images and an XML file describing the relevant metadata. The XML is fairly self-explanatory, and looks like this:

<?xml version="1.0" encoding="utf-8"?>
<imageset>
   <image height="600" id="0" src="demoset/image0.jpg" width="800">
      <geolocation alt="0" lat="43.731777279237" lng="7.421039803853"/>
      <orientation pitch="0" roll="0" yaw="41"/>
   </image>

   <image height="600" id="1" src="demoset/image1.jpg" width="800">
      <geolocation alt="0" lat="43.731745092741" lng="7.421010886298"/>
      <orientation pitch="15" roll="0" yaw="102"/>
   </image>
   <image height="600" id="2" src="demoset/image2.jpg" width="800">
      <geolocation alt="0" lat="43.73173595647" lng="7.421045168269"/>

      <orientation pitch="7.5" roll="0" yaw="111"/>
   </image>
.
.
.
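
For readers not using the Matlab tools described later, the excerpt above can be read with a few lines of Python. This is a sketch only: the excerpt is truncated, so the closing imageset tag in the sample string is our assumption, and the attribute values are treated as plain floats without any claim about their units.

```python
import xml.etree.ElementTree as ET

# The first two entries of the excerpt above; the closing </imageset> tag is
# assumed, because the original listing is truncated.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<imageset>
   <image height="600" id="0" src="demoset/image0.jpg" width="800">
      <geolocation alt="0" lat="43.731777279237" lng="7.421039803853"/>
      <orientation pitch="0" roll="0" yaw="41"/>
   </image>
   <image height="600" id="1" src="demoset/image1.jpg" width="800">
      <geolocation alt="0" lat="43.731745092741" lng="7.421010886298"/>
      <orientation pitch="15" roll="0" yaw="102"/>
   </image>
</imageset>
"""

def parse_imageset(xml_bytes):
    """Return one dict per image element with its source path, location and orientation."""
    root = ET.fromstring(xml_bytes)
    poses = []
    for image in root.findall("image"):
        geo = image.find("geolocation")
        ori = image.find("orientation")
        poses.append({
            "id": image.get("id"),
            "src": image.get("src"),
            "lat": float(geo.get("lat")),
            "lng": float(geo.get("lng")),
            "alt": float(geo.get("alt")),
            "pitch": float(ori.get("pitch")),
            "roll": float(ori.get("roll")),
            "yaw": float(ori.get("yaw")),
        })
    return poses

poses = parse_imageset(SAMPLE.encode("utf-8"))
```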

Available data sets

  1. demoset
    This is to get you started. We are working to provide larger data sets as well as data sets with ground truth poses. Closer to the conference date, a final competition set is planned to be published without ground truth, and competitors can submit corrected pose sets, which will be evaluated for accuracy.
  2. lausanne
    This set represents fairly realistic, but extremely challenging conditions. It is our hope that researchers will soon find methods that are able to fairly accurately reconstruct the camera poses from a set like this.  For the time being, though, we are not expecting submissions which can fully recover all the camera poses in this set.

    The images were taken in the city of Lausanne, Switzerland, starting from Place Centrale, going around at Place Pepinet, continuing on Rue Centrale and taking a turn to Rue du Pont. From the fountain, the picture set continues on Escaliers du Marche towards the Cathedrale de Lausanne. There are good satellite and aerial images of the city publicly available, e.g. at Live.com, which may help you to visually inspect your results.

    It is recommended, but not required, that you use these data sets. Submissions that address the challenge problem in a novel way which may not be compatible with provided data are also accepted.

    Tools

    A set of small tools is provided to help contestants get started. These tools are Matlab functions that read, write and display the data sets. The tool set is available for download.

    Example:

    >> S = load_set('demoset.xml');
    >> view_set(S)

    This should produce the results in Figure 2. The red arrow points north and the blue arrow points east. Orientation measurements are based on accelerometer and magnetometer data and are very sensitive to disturbances.

    In this example, some of the orientations are clearly wrong. Such errors are to be expected and the idea in this challenge is to exploit the connections between the images to correct for sensor errors and to average out GPS error to get improved pose estimates of all the cameras in a georeferenced coordinate system.

    You may feel more comfortable working with a Cartesian coordinate system, rather than the geodetic system. The functions geo2ecef and ecef2geo are example functions that can be used to convert between the geodetic and the Earth Centered Earth Fixed (ECEF) systems.

    It may also be useful at times to work in the local “map” coordinates, or East North Up (ENU) system. The local_frame-function can be used to obtain such local system. It may be useful to study the view_set function as an example on the use of these different coordinate conversions.
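
For illustration, the two conversions mentioned above can be sketched in Python as follows. These are not the provided Matlab functions; the Python names are hypothetical analogues, and the code assumes the WGS84 ellipsoid with latitude and longitude in degrees and altitude in meters.

```python
import math

WGS84_A = 6378137.0                   # semi-major axis in meters
WGS84_F = 1 / 298.257223563           # flattening
WGS84_E2 = WGS84_F * (2 - WGS84_F)    # first eccentricity squared

def geodetic_to_ecef(lat_deg, lng_deg, alt_m):
    """Convert geodetic coordinates to Earth Centered Earth Fixed (x, y, z) in meters."""
    lat, lng = math.radians(lat_deg), math.radians(lng_deg)
    # Prime vertical radius of curvature at this latitude.
    n = WGS84_A / math.sqrt(1 - WGS84_E2 * math.sin(lat) ** 2)
    x = (n + alt_m) * math.cos(lat) * math.cos(lng)
    y = (n + alt_m) * math.cos(lat) * math.sin(lng)
    z = (n * (1 - WGS84_E2) + alt_m) * math.sin(lat)
    return x, y, z

def ecef_to_enu(point, origin_lat_deg, origin_lng_deg, origin_alt_m):
    """Express an ECEF point as (east, north, up) relative to a geodetic origin."""
    ox, oy, oz = geodetic_to_ecef(origin_lat_deg, origin_lng_deg, origin_alt_m)
    dx, dy, dz = point[0] - ox, point[1] - oy, point[2] - oz
    lat, lng = math.radians(origin_lat_deg), math.radians(origin_lng_deg)
    east = -math.sin(lng) * dx + math.cos(lng) * dy
    north = (-math.sin(lat) * math.cos(lng) * dx
             - math.sin(lat) * math.sin(lng) * dy
             + math.cos(lat) * dz)
    up = (math.cos(lat) * math.cos(lng) * dx
          + math.cos(lat) * math.sin(lng) * dy
          + math.sin(lat) * dz)
    return east, north, up
```

Choosing one camera (or the centroid of the set) as the local origin keeps all coordinates small, which is convenient when averaging out GPS error over nearby images.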

    Awards

    Awards will be announced at a later time; stay tuned for details on the Multimedia Grand Challenge site.

    HP Challenge: Robust Identification of Informative Multimedia Content in Web Pages

    Today’s web pages, particularly from news and Web 2.0 sites (e.g. CNN, Yahoo, MySpace, Facebook, YouTube, etc.), are usually media-rich, containing both images and video. This trend is expected to continue as media-rich web pages become increasingly popular.

    In addition to the main content, web pages typically contain various advertisements and other content that are peripherally related to the main content of the page. For the purposes of this application, multimedia content on a web page is classified as either informative or “auxiliary” content. Multimedia such as advertisements, navigation aids, decorative graphics, or any other content peripherally related to the informative portions of the page is considered as auxiliary content. For the most part, users visit a web page mainly for its informative content.

    In most web data mining applications, the inclusion of auxiliary content can significantly degrade performance. In recent years, research in web content analysis and extraction has attempted to tackle similar problems, but much of it emphasizes textual information rather than the associated multimedia data. Thus, this Grand Challenge invites solutions to the robust identification and extraction of informative multimedia content for any arbitrary web page authored in any language, not just English. Ideally, we would like to have a Grand Challenge solution that is over 99% accurate for any web page in any language.

    Input/Output

    Example:

    Input would be a web page from CNN or the Amazon shopping website, with the associated URL. The page will have one or more images and videos related to the main content (e.g. a news story) as well as other images and videos showing advertisements or acting as navigation aids. This auxiliary content, such as advertisements or navigation aids, may be in various formats, e.g. GIF, PNG, JPEG, MPEG, or SWF/FLV played by Flash player. The informative images and video content can be in any of these formats as well.

    In the above-mentioned case, output would be a set of all images and videos from the web page along with the characterization of each multimedia item as “informative content” or “auxiliary content”. Further characterization of “informative content” into categories such as news, sports, etc. would be of additional interest but is not essential.

    Metrics/Evaluation

    The following criteria would be used in judging submissions:

    • Accuracy of identification
    • Performance of algorithm in terms of computational time
    • Merits of approach taken in terms of ensuring robust detection

    The most important criteria are accuracy and robustness. False positives (i.e. incorrectly detecting informative as auxiliary content) and false negatives (i.e. incorrectly detecting auxiliary as informative content) would be considered in determining accuracy.
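
A minimal scoring sketch for the accuracy criterion follows, assuming each multimedia item on a page carries exactly one of the two labels “informative” or “auxiliary”. The function name and the dictionary layout (item identifier mapped to label) are hypothetical. To stay neutral on naming, the two error types are reported by what was confused with what.

```python
def score_labels(predicted, truth):
    """Compare predicted and true labels for the media items of one web page."""
    if set(predicted) != set(truth):
        raise ValueError("predicted and truth must cover the same media items")
    informative_as_auxiliary = 0  # informative items wrongly marked auxiliary
    auxiliary_as_informative = 0  # auxiliary items wrongly marked informative
    correct = 0
    for item, true_label in truth.items():
        guess = predicted[item]
        if guess == true_label:
            correct += 1
        elif true_label == "informative":
            informative_as_auxiliary += 1
        else:
            auxiliary_as_informative += 1
    return {
        "accuracy": correct / len(truth) if truth else 0.0,
        "informative_as_auxiliary": informative_as_auxiliary,
        "auxiliary_as_informative": auxiliary_as_informative,
    }
```

Reporting the two error counts separately is useful because their costs differ: dropping an informative image loses content, while keeping an advertisement only adds noise.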

    Google Challenge: Robust, As-Accurate-As-Human Genre Classification for Video

    A notion of browsing collections is naturally associated with videos. Having videos classified into a pre-existing hierarchy of genres is one way to make the browsing task easier. The goal of this task would be to take user generated videos (along with their sparse and noisy metadata) and automatically classify them into genres. A public genre hierarchy like ODP (Open Directory Project) can be used as a target for this task.

    Evaluations can be based on purely video content features as well as a combination of content and metadata features. Features that bring in information from other public data sources can also be used (e.g. object detectors trained on a separate public dataset). Thinking of new (and surprising) features is recommended!

    Any dataset that reflects a breadth of content is acceptable, and of course, YouTube and Google Video are recommended sources. In particular, the data should cover most of the common video genres. If the dataset consists of web videos, sharing a list of links and corresponding labels would be ideal for researchers to compare notes. You may want to consult the YouTube Data API for retrieving video data.

    Evaluation

    We propose two types of evaluations for this challenge:

    • Offline (direct evaluation): Use a labeled test set to measure precision/recall for the ODP categories.
    • Online (indirect): Allow users a browse interface for your classifiers and measure how easily they can find some target concepts (e.g., find a basketball scoring scene). Note that the errors of the classifier can be compensated here since a video can appear in multiple categories, so one could conceive of training for different loss functions here.
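
The offline evaluation can be sketched as per-genre precision and recall over a labeled test set. The example below assumes each video has exactly one true genre and one predicted genre, which is a simplification (the online setting above explicitly allows a video to appear in several categories); the function name is hypothetical.

```python
from collections import Counter

def precision_recall_by_genre(predicted, truth):
    """Return {genre: (precision, recall)} for single-label genre predictions."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for guess, actual in zip(predicted, truth):
        if guess == actual:
            true_pos[actual] += 1
        else:
            false_pos[guess] += 1   # guessed genre received a wrong video
            false_neg[actual] += 1  # actual genre missed one of its videos
    results = {}
    for genre in sorted(set(predicted) | set(truth)):
        p_den = true_pos[genre] + false_pos[genre]
        r_den = true_pos[genre] + false_neg[genre]
        precision = true_pos[genre] / p_den if p_den else 0.0
        recall = true_pos[genre] / r_den if r_den else 0.0
        results[genre] = (precision, recall)
    return results
```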

    The ideal target in this case would match the optimal score for human agreement on the dataset.  If 5 raters categorize each video and we have agreement in 92% of the cases, we expect the automatic classifier to hit the same agreement rate.

    CeWe Challenge: The Next Generation of Tangible Multimedia Products

    Thematic Photo Story Generation from Personal Photo Collections

    With the advent of digital photography, the number of photos taken has increased tremendously. While only recently, in the analogue days, a small number of films documented a two-week holiday, we are nowadays taking and storing hundreds or even thousands of digital photos. This capture rate has an enormous impact on the way users deal with their photographs. Often they are simply overwhelmed by the mass of photos and give up on carefully organizing and selecting them. Users may want a selection that best represents the event to browse and share with family and friends. Manually creating such a selection requires much time and effort. In the end, the precious memories reside on hard disks and are not shared with others or made into prints or other products such as calendars or photo books.

    The open issue is how to help the user determine a meaningful subset of photos out of a collection that best summarizes and represents the specific event. This is still not satisfactorily solved after years of research in multimedia analysis and retrieval. However, such methods could significantly ease the process of designing products and services from personal media, and therefore attract more users to order such products from photo finishing companies like CeWe Color.

    The multimedia challenge is to take realistic photo sets of users as a basis to (semi-) automatically determine those photos that best summarize the underlying event, such as a two-week holiday. This can also incorporate video snippets, often taken with digital still cameras, for which a suitable representation for a printed product has to be developed (e.g., extraction of suitable key frames or a representative fraction). For the media selection, the system should take into account the target use of the selection, which should be oriented toward commercial print products such as calendars, collages, posters or photo books. Additionally, the process could incorporate the exploitation and addition of shared media from social community platforms to augment the personal collection. The solution should not only consist of an approach for the selection but could also be embedded in an authoring system with the user in the loop.

    Metrics/Evaluation

    The primary measure for the quality of the approach will be the user’s satisfaction with the summarization result and process. Following the assumption that a user can only evaluate the summarization quality of his or her own photos, researchers should work with the users themselves to provide their own photos as data sets and evaluate the results. One aspect of this evaluation could be a questionnaire. For the evaluation, an exemplary version of a questionnaire with guideline questions will be provided and posted on the Multimedia Grand Challenge website.

    The evaluation should cover qualitative questions like:

    • How well does the summary reflect the personal memory of the event?
    • How satisfied is the user with the selection according to different criteria such as photo quality or presentation quality?
    • How satisfied is the user with the overall (semi-) automatic design process? Is it too complex? Does it significantly ease the authoring process? Does it lead to results even better than with manual authoring and selection?

    Additionally, several quantitative measurements can be taken by observing the user when using the system. This can be for example:

    • Number of photos in the summary in relation to the costs of the product
    • Number of clicks / time effort from photo selection to achieved summary
    • The automatically selected photo set compared to the set the user would have selected manually
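
The last measurement, comparing the automatic selection with the user’s manual one, could for instance be quantified with set overlap. The sketch below is one hypothetical way to do so; the function name and the choice of precision, recall and Jaccard index are our assumptions, not part of the challenge definition.

```python
def selection_overlap(automatic, manual):
    """Overlap between automatically and manually selected photo identifiers."""
    automatic, manual = set(automatic), set(manual)
    shared = automatic & manual
    union = automatic | manual
    return {
        # Share of automatically chosen photos the user would also have chosen.
        "precision": len(shared) / len(automatic) if automatic else 0.0,
        # Share of the user's own choices that the system found.
        "recall": len(shared) / len(manual) if manual else 0.0,
        # Intersection over union; two empty selections count as identical.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }
```

Exact-match overlap is strict: near-duplicate shots of the same scene count as misses, so a real evaluation might first group similar photos.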

    In addition, the evaluation may involve:

    • Performance efficiency of the underlying algorithm.
    • Commercial potential, that is, whether the presented solution leads to an increase in sales of related digital print products such as photo books or calendars. Obviously, this cannot be measured directly, but the presented solution should have a strong focus on potential commercial exploitation, and realistic estimates of it should be made.

    Data set

    For the evaluation, the data set has to be chosen and provided by the participants of the challenge. Besides data sets for training, statistical evaluation and performance evaluation, we expect the researchers to use representative personal photo collections of at least 5 users. Additional metadata can be attached to these photo collections, but it has to be realistic, that is, it could have been created by standard photo management tools (e.g. descriptions, tags, Exif header) or by current state-of-the-art metadata extraction.

    In the evaluation, the data sets should include the following types of events:

    • Birthday: Short event (1-2 days), at least 100 photos, more than 5 persons who recur across the photos of the collection.
    • Vacation: A vacation of at least ten days documented by 300 photos from different places and locations. Additional metadata, such as a GPS track or location information attached to the photos, can be associated with the collection.
    • Yearbook: Should consist of at least 1000 photos of different types of events (birthday, vacation, family, fun, …) over a period of 12 months.
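
    Before running an evaluation, participants may want to check their collections against the event types above. The following is a minimal sketch under stated assumptions: the `Photo` class, the `REQUIREMENTS` table and `check_collection` are hypothetical names, the 300 vacation photos are read as a minimum, and "more than 5 persons" is read as at least 6 people who each appear on two or more photos.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Photo:
    taken_at: datetime               # e.g. read from the Exif header
    people: frozenset = frozenset()  # hypothetical person labels


# Thresholds read off the event list above; key names are illustrative.
# The yearbook's 12-month period is not checked here, only its size.
REQUIREMENTS = {
    "birthday": {"min_photos": 100, "max_days": 2, "min_recurring_people": 6},
    "vacation": {"min_photos": 300, "min_days": 10},
    "yearbook": {"min_photos": 1000},
}


def check_collection(event_type, photos):
    """Return a list of problems; an empty list means all checks passed."""
    req = REQUIREMENTS[event_type]
    problems = []
    if len(photos) < req["min_photos"]:
        problems.append(f"{len(photos)} photos, need at least {req['min_photos']}")
    if photos:
        days = [p.taken_at.date() for p in photos]
        span = (max(days) - min(days)).days + 1  # calendar days covered
        if span > req.get("max_days", span):
            problems.append(f"spans {span} days, at most {req['max_days']} allowed")
        if span < req.get("min_days", span):
            problems.append(f"spans {span} days, need at least {req['min_days']}")
    # Count people who appear on two or more photos of the collection.
    counts = Counter(person for p in photos for person in p.people)
    recurring = sum(1 for n in counts.values() if n >= 2)
    if recurring < req.get("min_recurring_people", 0):
        problems.append(f"{recurring} recurring people, too few for this event")
    return problems


# A deliberately undersized vacation: 48 hourly photos over three days.
start = datetime(2009, 7, 1, 9, 0)
vacation = [Photo(start + timedelta(hours=h)) for h in range(48)]
for problem in check_collection("vacation", vacation):
    print(problem)
```

    The sample collection fails both the photo-count and the duration check. Person labels would in practice come from manual tags or face recognition, which this sketch does not attempt.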

    About CeWe

    CeWe Color is the Number One services partner for first-class trade brands on the European photographic market. CeWe supplies both stores and Internet retailers (e-commerce) with photographic products.

    Feel free to correspond with the challenge authors via the comments form below.

    For private correspondence, consult the About page for contact details.

    Yahoo! Challenge: Robust Clustering Guided by User Intent in Image Search

    There are over 100 billion images on the Internet today, and the number continues to grow every day. Image search engines often surface only a portion of those images and often rely on the text surrounding an image on a webpage, or on the image file name. With the growing number of images on the Internet, it is important to be able to organize and surface images in the most efficient, meaningful way possible so that more images can be surfaced to searchers.

    The challenge to researchers in the multimedia community is to 1) develop a robust way of understanding user intent and 2) generate highly relevant clusters for the given intent and query.

    Metrics/Evaluation

    There will be 4 criteria for evaluation:

    • Precision of estimating successful user intent (goal: 90% success)
    • Relevance of clusters.
    • Performance efficiency of the underlying algorithm.
    • Time it takes the user to find an image or groups of images.
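
    The challenge leaves the exact measurement of these criteria to the researchers. The following is a minimal sketch of one possible reading, assuming each query has a single labeled intent and each image a binary relevance judgment; the function names and sample data are hypothetical, and `timed` measures algorithm run time only, not the time a user needs to find an image.

```python
import time

INTENT_GOAL = 0.90  # "goal: 90% success" from the criteria above


def intent_precision(predicted, labeled):
    """Fraction of queries whose predicted intent matches the labeled one."""
    if len(predicted) != len(labeled):
        raise ValueError("predicted and labeled must have the same length")
    if not labeled:
        return 0.0
    hits = sum(p == t for p, t in zip(predicted, labeled))
    return hits / len(labeled)


def cluster_relevance(clusters, relevant):
    """Mean fraction of relevant images per non-empty cluster."""
    fractions = [
        sum(img in relevant for img in cluster) / len(cluster)
        for cluster in clusters
        if cluster
    ]
    return sum(fractions) / len(fractions) if fractions else 0.0


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    begin = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - begin


# Hypothetical judgments for five issues of an ambiguous query.
predicted = ["animal", "car", "animal", "car", "animal"]
labeled = ["animal", "car", "car", "car", "animal"]
precision = intent_precision(predicted, labeled)
meets_goal = precision >= INTENT_GOAL

clusters = [["a1", "a2", "c1"], ["c2", "c3"]]
relevance, seconds = timed(cluster_relevance, clusters, {"a1", "a2", "c2"})
print(precision, meets_goal, relevance)
```

    In this sample, four of five intents are predicted correctly, which falls short of the 90% goal. Other definitions of relevance, such as graded judgments or per-cluster purity against reference groupings, would fit the criteria equally well.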

    The system will be evaluated on how efficiently it estimates user intent and then surfaces relevant, meaningful clusters in the shortest possible time.

    Researchers working on this challenge will develop a way to successfully measure user intent, relevance of clusters and performance efficiency. In addition, creativity in presenting clusters and allowing for ease of searching and browsing will also be a key criterion. Lastly, the elegance of the solution will be judged by its ease of integration into a search engine’s pipeline, and by the efficiency with which it can understand user intent and produce one or more meaningful clusters (the latter refers to processing speed). If one technology takes too long to provide meaningful clusters while another can produce them very quickly, then the latter is much more attractive.

    Dataset/Suggested Queries

    We may be able to provide a sample dataset of queries. Stay tuned on the Multimedia Grand Challenge website.
