ACM MM 2009 Header Image

HP Challenge: Robust Identification of Informative Multimedia Content in Web Pages

Today’s web pages, particularly those from news and Web 2.0 sites (e.g. CNN, Yahoo, MySpace, Facebook, YouTube), are usually media-rich, containing both images and video. This trend is expected to continue as media-rich web pages become increasingly popular.

In addition to the main content, web pages typically contain advertisements and other content that is only peripherally related to the main content of the page. For the purposes of this application, multimedia content on a web page is classified as either “informative” or “auxiliary” content. Multimedia such as advertisements, navigation aids, decorative graphics, or any other content peripherally related to the informative portions of the page is considered auxiliary content. For the most part, users visit a web page for its informative content.

In most web data mining applications, the inclusion of auxiliary content can significantly degrade performance. In recent years, there has been research in web content analysis and extraction that attempts to tackle similar problems, but much of it emphasizes the textual information rather than the associated multimedia data. Thus, this Grand Challenge invites solutions to the robust identification and extraction of informative multimedia content for any arbitrary web page authored in any language, not just English. Ideally, we would like to have a Grand Challenge solution that is over 99% accurate for any web page in any language.

Input/Output

Example:

Input would be a web page, for example from CNN or the Amazon shopping website, together with its URL. The page will have one or more images and videos related to the main content (e.g. a news story) as well as other images and videos that show advertisements or act as navigation aids. This auxiliary content may be in various formats, e.g. GIF, PNG, JPEG, MPEG, or SWF/FLV played by Flash player. The informative images and video content can be in any of these formats as well.
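As an illustration of the input side only, here is a minimal sketch of how the multimedia candidates on a page could be enumerated before any classification takes place. It is not part of the challenge specification: the tag-to-attribute table MEDIA_ATTRS, the class MediaCollector, the function collect_media, and the example.com URLs are all hypothetical, and real pages (CSS backgrounds, script-inserted media) would need more than this.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: these tags and attributes cover the media we care about.
# The challenge text does not prescribe any particular set.
MEDIA_ATTRS = {
    "img": "src",
    "video": "src",
    "source": "src",
    "embed": "src",    # e.g. SWF content played by Flash player
    "object": "data",
}


class MediaCollector(HTMLParser):
    """Collects (tag, absolute URL) pairs for media elements in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.items = []

    def handle_starttag(self, tag, attrs):
        attr_name = MEDIA_ATTRS.get(tag)
        if attr_name is None:
            return
        value = dict(attrs).get(attr_name)
        if value:
            # Resolve relative references against the page URL.
            self.items.append((tag, urljoin(self.base_url, value)))


def collect_media(html, base_url):
    """Return every media candidate found in the HTML, in document order."""
    parser = MediaCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.items
```

The page URL is passed in alongside the HTML because relative media references can only be resolved against it, which is one reason the input includes the associated URL.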

In the above-mentioned case, output would be the set of all images and videos from the web page, with each multimedia item characterized as either “informative content” or “auxiliary content”. Further characterization of “informative content” into categories such as news or sports would be of additional interest but is not essential.
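One possible way to represent this output is sketched below. The challenge does not define an output format, so the names Label, MediaItem, and summarize, as well as the field layout, are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Label(Enum):
    INFORMATIVE = "informative content"
    AUXILIARY = "auxiliary content"


@dataclass(frozen=True)
class MediaItem:
    url: str
    kind: str                        # "image" or "video"
    label: Label
    category: Optional[str] = None   # optional refinement, e.g. "news"


def summarize(items):
    """Group item URLs by label, keeping the original order."""
    grouped = {label.value: [] for label in Label}
    for item in items:
        grouped[item.label.value].append(item.url)
    return grouped
```

Keeping the category field optional mirrors the text above: the informative/auxiliary label is required for every item, while the finer categorization is a bonus.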

Metrics/Evaluation

The following criteria would be used in judging submissions:

  • Accuracy of identification
  • Performance of algorithm in terms of computational time
  • Merits of approach taken in terms of ensuring robust detection

The most important criteria are accuracy and robustness. Both false positives (i.e. informative content incorrectly labeled as auxiliary) and false negatives (i.e. auxiliary content incorrectly labeled as informative) would be considered in determining accuracy.
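The challenge does not publish an exact scoring formula, so the following is only a sketch of how accuracy and the two error types could be tallied against hand-labeled ground truth. It follows the convention used on this page, where a false positive is an informative item labeled as auxiliary; the function name evaluate and the plain-string labels are assumptions.

```python
def evaluate(truth, predicted):
    """Compare predicted labels with ground-truth labels, item by item.

    Both arguments are equal-length sequences of "informative" or
    "auxiliary". Accuracy here is simply the fraction of matching labels.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    pairs = list(zip(truth, predicted))
    correct = sum(1 for t, p in pairs if t == p)
    # Page convention: informative labeled as auxiliary is a false positive.
    false_positives = sum(
        1 for t, p in pairs if t == "informative" and p == "auxiliary"
    )
    # Page convention: auxiliary labeled as informative is a false negative.
    false_negatives = sum(
        1 for t, p in pairs if t == "auxiliary" and p == "informative"
    )
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }
```

Reporting the two error counts separately, rather than accuracy alone, shows whether a method tends to discard informative media or to let advertisements through.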

Feel free to correspond with the challenge authors via the comments form below.

For private correspondence, consult the About page for contact details.

4 Comments on “HP Challenge: Robust Identification of Informative Multimedia Content in Web Pages”

  1. #1 Telecom ParisTech Multimedia Adaptation Team » Can we rise up the Multimedia Grand Challenge?
    on Feb 24th, 2009 at 5:19 am

    [...] HP Challenge: Robust Identification of Informative Multimedia Content in Web Pages [...]

  2. #2 Arunachalam
    on Jun 8th, 2009 at 3:13 pm

    Do we have a training dataset for initial model building?

  3. #3 PiRo
    on Sep 15th, 2009 at 12:03 am

    We may need more specific information about this challenge.

    I’d like to obtain data sets that contain specific inputs and outputs.

    The output, in particular, is nearly essential, at least for me.

    I expect this web page to be updated with a data set.

  4. #4 Winners of the Multimedia Grand Challenge 2009 – Multimedia Grand Challenge 2009
    on Oct 26th, 2009 at 9:27 pm

    [...] HP Honorable Mention: Tewson Seeoun, Choochart Haruechaiyasek, Toshiaki Kondo. Identifying Auxiliary Web Images Using Combination of Analyses (response to the HP challenge). [...]
