ACM MM 2009 Header Image

What to Submit? Some Answers

We have received a number of queries about what kind of content is expected in the two-page submission to the Multimedia Grand Challenge. While there are no specific guidelines, there is one main thing to keep in mind:

  1. Convince us that your method is interesting, innovative, promising and that you will have something to show if we invite you to participate in October.

In other words, if the reviewers (including those from the relevant companies) are excited about your approach and your idea, that’s enough. Obviously, two pages cannot hold the details of the implementation, the evaluation or the interface, so focus on the high-level aspects of each. It will not hurt to show high-level results for a specific dataset if one is available.

If you can provide a link to online demos or videos, even better.

That’s it! Submit your paper, make us excited, and do not forget that the June 15th deadline is quickly approaching!

Evaluation Metrics & Example Test Data for HP Web Content Identification Challenge

The goal of the algorithms for the HP web content identification challenge is to retrieve or label all the informative multimedia content in web pages. The performance of the algorithms will be measured by comparing the automatically computed results with the manually labeled ground truth. Ground truth is generated by manually labeling all the informative multimedia content (i.e. images/video/flash objects) in a pre-selected set of web pages (in various languages). The algorithm is expected to retrieve nearly all the informative multimedia content in the web pages.

The precision is defined as the number of informative images classified/labeled correctly by the algorithm divided by the total number of images labeled as informative by the algorithm. In other words, precision is the number of true positives divided by the sum of true positives and false positives. Recall is defined as the number of informative images classified/labeled correctly by the algorithm divided by the total number of informative images (which should have been labeled as informative). Recall is the number of true positives divided by the sum of true positives and false negatives. The final comparisons between the algorithms will be made by computing the F-measures using the precision and recall.
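As a rough illustration of the definitions above, here is a minimal Python sketch that computes precision, recall and an F-measure from two sets of item identifiers. The function name and the example identifiers are made up for illustration, and the balanced F1 score is an assumption: the challenge text does not say which F-measure weighting will be used.

```python
def precision_recall_f1(predicted, ground_truth):
    """Compare items labeled informative by an algorithm with the manual labels.

    predicted:    identifiers the algorithm labeled as informative
    ground_truth: identifiers manually labeled as informative
    """
    predicted = set(predicted)
    ground_truth = set(ground_truth)

    true_pos = len(predicted & ground_truth)   # correctly labeled informative
    false_pos = len(predicted - ground_truth)  # labeled informative, but not
    false_neg = len(ground_truth - predicted)  # informative, but missed

    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if ground_truth else 0.0

    # Balanced F-measure (F1); the weighting is an assumption, not from the challenge text.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical page: four informative images, the algorithm finds three plus one ad.
truth = {"map.png", "route.png", "photo1.jpg", "photo2.jpg"}
found = {"map.png", "route.png", "photo1.jpg", "ad_banner.gif"}
p, r, f = precision_recall_f1(found, truth)
```

In this made-up case there are three true positives, one false positive and one false negative, so precision, recall and F1 all come out to 0.75.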

 

Here are examples of the types of web pages that we plan to use to evaluate the submissions:

* English

http://www.mapquest.com/maps?1c=Palo+Alto&1s=CA&2c=San+Francisco+&2s=CA (driving directions)

http://www.buy.com/specialty_store_1/promotions/33379.html (shopping; note that images of similar products or other product recommendations are considered informative)

http://edition.cnn.com/2009/TECH/05/27/ship.sinking.reef/index.html (news)

* Chinese

http://news.sina.com/oth/phoenixtv/502-104-103-108/2009-05-27/01323899156.html (entertainment articles)

http://www.china.travel/sym/lyhd/2009-05-14/274878.shtml (travel)

http://www.yahtour.com/destination/province.php?id=2238 (travel)

* Arabic

http://www.aljazeera.net/NR/exeres/5CC37A8B-39E7-4692-BCD9-2D8807ACE580.htm

http://www.marma.net/content.prt-CID=16303

* Korean

http://news.chosun.com/site/data/html_dir/2009/05/28/2009052800708.html (news)

http://blog.naver.com/honeykja/40045216645 (blog, recipe)

$3K prizes announced, Submission Site Open

We are delighted to announce that thanks to our generous sponsors (Google, HP Labs, Nokia and Yahoo!), the Multimedia Grand Challenge 2009 will offer a total of $3000 in prizes, including a first prize of $1500. See the Prizes section for (limited) details.

In other news, the submission site is now open, with submissions due on June 15th! Submissions should be made in the 2-page ACM format and must address one of the corporate challenges listed below. Details are available on the SUBMIT page.

Submission, Prize, Sponsors all coming soon!

Stay tuned for information about submission and (most excitingly) a grand prize for the Grand Challenge coming very soon on this blog. In the meantime, Jay just posted about the Google Challenge on the Google Research Blog.

Posted: Dataset for Nokia Photo Location Challenge

We are still working on creating a ground truth set for Nokia’s Photo location & orientation Challenge. For the time being, we are glad to provide a larger data set (zip file), for which we unfortunately do not have the ground truth yet. This set also includes calibration grid images. All the images were taken with the same camera.


This set represents fairly realistic, but extremely challenging conditions. It is our hope that researchers will soon find methods that are able to fairly accurately reconstruct the camera poses from a set like this.  For the time being, though, we are not expecting submissions which can fully recover all the camera poses in this set.


The images were taken in the city of Lausanne, Switzerland, starting from Place Centrale, going around Place Pepinet, continuing on Rue Centrale and taking a turn onto Rue du Pont. From the fountain, the picture set continues on Escaliers du Marche towards the Cathedrale de Lausanne. There are good satellite and aerial images of the city publicly available (for example, from Live.com), which may help you to visually inspect your results.


The image set can be downloaded here.

Full Paper Deadline Approaching

As we mentioned earlier, the Grand Challenge committee will give preference to submissions that are accompanied by other conference submissions. When we say “other conference submission” we primarily refer, of course, to full paper submissions.

With that in mind, we just wanted to remind everyone that the MM2009 full paper deadline is April 17th; April 10th is the deadline for paper registration (abstract submission).

Other deadlines are May 8th for short paper submissions, and, of course, June 15th for the Grand Challenge.

Current TV Challenge: Media Production in the Age of Community

Community media is all around us.  News is broadcast everywhere: from websites to Facebook feeds.  In many cases, the conversation about news is as important as the news event itself.  News media providers are seeking ways to aid the production of content through the merger and analysis of video and social streams.  This raises many questions:
•    What kinds of social streams can be aligned in real time to live media?
•    What video content features align with social streams? For example, what is the relationship between stadium video of camera flashbulbs and an onset of social status messages, such as Twitter tweets about someone’s dress?
•    How could social streams be used to find highlights and summarizations of events?
•    How is this video segment important to a community?  What deeper insight can the analysis of social streams add to news reportage?

Application

We are seeking applications that produce insightful social media analysis of news and events, e.g. breaking news stories, political events (speeches, elections, press conferences, legislative voting) and major media events (the Oscars, the Grammys, Pulitzer Prize announcements, etc.). These applications can be autonomous or semi-autonomous. In the latter case, the applications could require some minimal human intervention to ‘curate’ the production of new media. For a live broadcast scenario, the application could, for example, find relevant commentary from social sites aligned in real time. For recorded broadcasts, applications can align the show to a set of time-delivered social comments and discover highlights. Given the plurality of social sources, performing video and text analysis at scale will be an issue for live events, and perhaps even for recorded broadcasts.

Input—Any video stream of the type described above (or others) plus any combination of social sources (blog posts, Flickr Photos, Twitter tweets, Facebook Status messages, etc.). Content can be pulled from any number of public sources via public APIs or via any other feasible mechanisms. Live broadcasts can be ‘simulated’ for demonstration purposes.

Output—A filtering of social media aligned to the video stream. Once aligned, the social media should be analyzed to illuminate why a video segment is meaningful. There could be various paths for the analysis, perhaps based on categories. For example, a category of visual commentary includes information that relates to items and objects in the video: does this relate to the motorcade or the speech? Is it the speaker or their attire? What is the most insightful commentary about the identified objects? A topical commentary will show: Is this about a domestic or foreign issue? Is this about national security or health care? In another example, a sentiment analysis might show mood and reaction. For all analyses, being able to identify why an event is meaningful via the social commentary is key.
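To make the alignment step a little more concrete, here is a minimal Python sketch of one simple way to group time-stamped social comments by the video segment they fall into. It assumes the video has already been cut into segments and that comment timestamps have been converted to seconds from the start of the broadcast; the function name, the segment boundaries and the sample comments are all hypothetical and are not part of the challenge definition.

```python
from bisect import bisect_right
from collections import defaultdict


def align_comments(segment_starts, comments):
    """Group comments by the index of the video segment that contains them.

    segment_starts: sorted start times of the segments, in seconds
    comments:       iterable of (timestamp_seconds, text) pairs
    """
    aligned = defaultdict(list)
    for timestamp, text in comments:
        # Index of the last segment starting at or before the comment time.
        index = bisect_right(segment_starts, timestamp) - 1
        if index >= 0:  # skip comments posted before the first segment
            aligned[index].append(text)
    return dict(aligned)


# Hypothetical broadcast with segments starting at 0 s, 60 s and 180 s.
starts = [0, 60, 180]
posts = [
    (5, "Here comes the motorcade"),
    (60, "The speech is starting"),
    (200, "Great line about health care"),
]
by_segment = align_comments(starts, posts)
```

A real system would probably also need to allow for the delay between an on-screen moment and the reactions to it; this sketch ignores that and only shows the basic time-based grouping that the filtering and analysis could build on.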

Evaluation

We will look for success in the following areas:

  • Efficacy of the filtration and categorization
  • Speed of the application
  • Presentation novelty or attractiveness
  • Curator user interface for the cases that require human curation
  • Quality of aggregation of community contributions over time, with best representative samples. I.e., for any segment, find representative topics and categories and the best sample social media excerpts for:
    • What did the majority of people say? (100-80%)
    • What did the core population say? (79% - 20%)
    • What did the outliers say? (< 20%)
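
One possible reading of the three bands above is as thresholds on the share of comments in a segment that mention a given topic. The Python sketch below follows that reading; the interpretation, the function name and the sample topics are assumptions made for illustration, and topic extraction itself is taken as given.

```python
from collections import Counter


def bucket_topics(comment_topics, majority=0.80, core=0.20):
    """Sort topics into majority / core / outlier bands by share of comments.

    comment_topics: one collection of topic labels per comment in the segment
    """
    buckets = {"majority": [], "core": [], "outliers": []}
    total = len(comment_topics)
    if total == 0:
        return buckets

    # Count each topic at most once per comment.
    counts = Counter(topic for topics in comment_topics for topic in set(topics))
    for topic, count in sorted(counts.items()):
        share = count / total
        if share >= majority:
            buckets["majority"].append(topic)
        elif share >= core:
            buckets["core"].append(topic)
        else:
            buckets["outliers"].append(topic)
    return buckets


# Hypothetical segment with ten comments.
segment = [{"speech"}] * 5 + [{"speech", "attire"}] * 4 + [{"motorcade"}]
bands = bucket_topics(segment)
```

Here "speech" appears in 9 of 10 comments and lands in the majority band, "attire" (4 of 10) in the core band, and "motorcade" (1 of 10) among the outliers. Picking the best representative excerpts within each band is a separate problem that this sketch does not address.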

Sample past and future events:
•    Election night: November 4th, 2008
•    Inauguration day: January 20th, 2009
•    Oscar night: February 22nd, 2009
•    Macworld
•    CES
•    Graduation Day, high school and college
•    American Idol finale
•    Baseball World Series


Sample past breaking news:
•    Hudson River plane crash, January 15th, 2009
•    Australian bushfires: early February 2009
•    Major stock market fluctuations (past and future, starting in Sept 2008)

Feel free to correspond with the challenge authors via the comments form below.

For private correspondence, consult the About page for contact details.

Accenture Challenge: Analysis of Video Footage Captured in Uncontrolled Environments

The proliferation of cameras has led to an explosion of video content. Often, it is necessary to analyze this corpus after, or in search of, an event. An event might involve one or more objects (e.g. people, cars, etc.) and the objects’ interaction with each other. We might then want to search the corpus for similar events or objects that were part of the event. Often, we might not know the objects of interest until we see the events. Sample questions that may arise, then, include:

•    What are some categories of objects and a set of higher-level events that the objects could help identify?
•    Can the system identify key objects if given footage of an event?
•    Can we use the system or parts of it (recognition algorithms, event inference, etc.) to analyze real-time camera feeds?
•    In many cases, one needs to identify the ‘onset’ of an event, rather than the event itself.  How would a system find event onsets?

Application

We are looking for applications that can address the questions listed above. How would one build such an application or interface so that it is event- and concept-centric? What is the ideal interface to search and navigate the video corpus? Can we use the video of the event to allow users to select certain objects and then track/detect those objects?

Input – A video corpus such as data from surveillance cameras, and knowledge about the camera networks (if any). However, since this data might be hard to get, any available video datasets can be used.

Output – Categories of objects and events that we can identify based on these objects and their interactions, and a good representation for these objects or events. We would like to see what events we can reliably identify based on the objects, which may take domain- or task-specific knowledge into account. We would like to track higher-level semantic events (in the context of the dataset) as opposed to visual events. For example, in the context of a dataset of sports videos, we do not want to stop at tracking a soccer ball or a red shirt. Rather, we would define the event (at the least) as a soccer game between team X and team Y. In the context of surveillance, we would not stop at detecting faces and silhouettes. Rather, we would like to define the event (which is a result of the objects and their interactions) at a higher level, e.g. an assault. More importantly, we would like to see what events the community can define in the context of surveillance.

Evaluation

We will look for performance in the following areas and the ability to work in uncontrolled environments.
•    Categories of objects and events
•    Ability of the system to incorporate new objects
•    Precision and timing (retrieval of similar videos)
•    Application: Ease of use in identifying events and categorical assets
Systems can make assumptions regarding video quality (resolution, etc.), and comparison of these systems will be consistent with the assumptions they make.

Multimedia Grand Challenge: Kick-off!

Are you ready to help shape the future of multimedia?


We have just posted eight challenges from six companies: HP, Google, CeWe, Yahoo!, Radvision, and Nokia. We invite you to examine the challenges, think about the solutions, and take part in the fun. Submissions are due in June; you can find the gory details on this page.


Most important, make sure you grab the RSS feed or join the group / mailing list to get the latest updates (for example, prizes have not been announced yet, but they might be coming!).


Finally, if you represent a company and want to participate, there is still time! Simply contact the organizers via email (see here) or post a comment right here.


Yahoo! Challenge: Robust Automatic Segmentation of Video According to Narrative Themes

Video search today relies mostly on textual metadata that is associated with the video in terms of title, tags or surrounding page text. This approach falls severely short by ignoring the richness of information within the video medium; an engine should ideally use this information to help a user search and navigate content. As video content explodes and user attention spans shrink, a next-generation video search engine needs to provide users with the ability to search for sections within a video; allow users to consume the bits and pieces of a video that would be of interest to them; and let users kill time during lunch breaks in creative ways. In addition, instead of offering just one thumbnail as representation for a whole video, it would be great to be able to partition a video into its constituent narrative themes and allow users to navigate through a video on a more granular level with better video surrogates.

The challenge to researchers in the multi-media community is to develop methods, techniques, and algorithms to automatically generate narrative themes for a given video, as well as present the content in an easy-to-consume manner to end-users in a search engine experience. Naturally, the themes that emerge depend entirely on the video itself – so the methods / algorithms have to be generic. Still, there could be approaches developed for certain types and genres of videos. For instance, one approach could be employed for sitcoms, sports content could have another, educational content could have another, etc.

Input/output

To use a pop-culture example, imagine the input is an episode of Seinfeld (an NBC TV show popular during the 1990s). The output will ideally be 3 or 4 narrative themes around each character and the corresponding video start & end ranges. The themes could overlap, but they do not have to. A way to navigate (user interface) through these narrative themes should also be presented. This output, of course, should be searchable (in that it generates a better representation of content to the search engines, as well). For example, if the output was a character name, “Kramer”, then if a user entered just “Kramer” as a search term, this Seinfeld episode video should surface in the results with the corresponding narrative themes surrounding Kramer also being presented to the user to enable them to browse/click.
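
To picture what such an output might look like in practice, here is a minimal Python sketch of theme segments with start and end times, plus a naive keyword lookup over them. The class, the function, the video identifier and all of the time ranges are hypothetical illustrations, not a format prescribed by the challenge.

```python
from dataclasses import dataclass


@dataclass
class ThemeSegment:
    theme: str      # e.g. a character name
    start_s: float  # segment start, seconds from the beginning of the video
    end_s: float    # segment end, seconds from the beginning of the video


def search_themes(index, query):
    """Return (video_id, segment) pairs whose theme contains the query text."""
    needle = query.strip().lower()
    return [
        (video_id, segment)
        for video_id, segments in index.items()
        for segment in segments
        if needle in segment.theme.lower()
    ]


# Hypothetical output for one episode; themes may overlap in time.
index = {
    "seinfeld_episode": [
        ThemeSegment("Kramer", 30.0, 95.0),
        ThemeSegment("Elaine", 80.0, 200.0),
        ThemeSegment("Kramer", 400.0, 460.0),
    ]
}
hits = search_themes(index, "kramer")
```

With this made-up index, a search for "kramer" surfaces the episode together with its two Kramer segments, which a user interface could then offer as jump points.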

In other examples, if the input was a financial news video talking about the economy, a bailout package, etc., then the themes that emerge could be the company names, executives, government officials, etc. that are mentioned. If the input was a sports game, then the output themes could be the major points in the game; for baseball, these may be home runs, hits, walks, strikeouts, inning changes, etc.

Yahoo! can provide a few sample videos in various domains for which it holds copyright permissions for research purposes. More information will become available on this blog.

Metrics/Evaluation

There will be 3 criteria for evaluation:
-    Relevance of narrative themes
-    Innovative presentation & navigation of sub-themes for a video
-    Efficiency of the underlying algorithm

The key criterion for evaluation will be the relevance of the themes that are extracted from a particular video. When evaluating such services at Yahoo!, we would have human judges (usually editors or product managers) rate the relevance of the narrative themes to a given video.

The second important criterion is the creativity in presenting the sub-themes in a video, allowing for ease of browsing. We are looking for solutions that will increase both findability and engagement with content found via search engines or while browsing.

Lastly, the elegance of the solution will be evaluated by its ease of integration into a search engine’s pipeline and by the efficiency with which it can process a video and output the narrative themes; the latter refers to processing speed. If one technology takes a day to chew through a 20-minute video and spit out the narrative themes while another can process the video in real time or faster (that is, in 20 minutes or less), then the latter is clearly much more attractive. New approaches and algorithms that reduce or optimize computation may be required.
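
The speed comparison can be expressed as a simple ratio of processing time to video length, often called a real-time factor. The short Python sketch below works through the two cases mentioned above; the function name is made up, and using this ratio as the measure is our assumption rather than an official metric.

```python
def real_time_factor(processing_seconds, video_seconds):
    """Processing time divided by video length; 1.0 or less means real time or faster."""
    if video_seconds <= 0:
        raise ValueError("video length must be positive")
    return processing_seconds / video_seconds


video_length = 20 * 60  # a 20-minute video, in seconds

slow = real_time_factor(24 * 60 * 60, video_length)  # one day of processing
fast = real_time_factor(20 * 60, video_length)       # exactly real time
```

Under this measure, the system that needs a day scores 72.0 (72 times slower than real time), while the real-time system scores 1.0.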
