The Rutgers Distributed Laboratory for Digital Libraries
Director: Paul B. Kantor  Director of Graduate Program: Nicholas Belkin 

Unit 1. Kantor: Information Retrieval.

 

Outline:

 

Representation

Bean counting versus comprehension

Parts of speech and disambiguation

Words reflect what a document is about

Basic Boolean combinations

From words to ideas: stemming; stopwords; phrases

From relevance to frequency: the bag-of-words approach

From relevance to proximity: commercial search services

From frequency to binarization - the LAD/DIP approach

Understanding meaning: hand-built thesauri:

WordNet (http://www.cogsci.princeton.edu/~wn/)

Understanding meaning: statistical thesauri (Salton and Lesk); UIUC Concept Spaces
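
The representation steps above (stopword removal, the bag of words, binarization, Boolean combination) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list is a tiny hypothetical one, the tokenizer is a plain regular expression, stemming is left out, and "binarize" here means simple presence/absence, not the LAD/DIP method itself, which this outline does not spell out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real one is far longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Count each non-stopword; word order is discarded."""
    return Counter(t for t in tokenize(text) if t not in STOPWORDS)


def binarize(bag):
    """Reduce frequencies to presence/absence: the set of words that occur."""
    return set(bag)


def matches_all(terms, words):
    """Boolean AND: every requested word is present in the document's term set."""
    return set(words) <= terms


doc = "The retrieval of documents and the ranking of documents"
bag = bag_of_words(doc)
terms = binarize(bag)

print(bag["documents"])                            # frequency view
print(sorted(terms))                               # binary view
print(matches_all(terms, ["retrieval", "ranking"]))
```

The frequency view keeps how often a word occurs; the binary view keeps only whether it occurs, which is all that a basic Boolean query needs.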

Retrieval

What does the user want?

Satisfaction - "to be done"

Need expressed as a text of some length (the "query"). Keywords; a sentence; …

Definition of Sim(document, quest)

Ranking of documents
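
The outline leaves Sim(document, quest) undefined. As one common choice, and purely as an assumption for illustration, the sketch below uses the cosine of the angle between raw term-frequency vectors and then ranks documents by decreasing Sim. The function names and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Bag-of-words vector: term -> number of occurrences."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def sim(document, quest):
    """Cosine similarity between the term-frequency vectors of two texts."""
    d, q = term_counts(document), term_counts(quest)
    dot = sum(d[t] * q[t] for t in q)
    norm = (math.sqrt(sum(c * c for c in d.values()))
            * math.sqrt(sum(c * c for c in q.values())))
    return dot / norm if norm else 0.0


def rank(documents, quest):
    """Return the documents ordered from most to least similar to the quest."""
    return sorted(documents, key=lambda doc: sim(doc, quest), reverse=True)


docs = [
    "boolean retrieval of documents",
    "stemming and stopwords",
    "ranking documents by similarity",
]
ranked = rank(docs, "ranking documents")
for doc in ranked:
    print(round(sim(doc, "ranking documents"), 3), doc)
```

Any other definition of Sim can be dropped in without changing the ranking step, since ranking only needs a number per document.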

Evaluation

Sets of documents have value v(D)

Simplifying assumption: v(D) = v(d1) + … + v(dN)

Further simplifying: v(d) in {0,1}.

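Look down the ranked list to position n. g(n) is the number of documents with v(d) = 1 among the first n; G(Quest) is the number of such documents in the whole collection.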

Precision := g(n)/n

Recall := g(n)/G(Quest)

As n varies, this traces a "precision versus recall curve". The area under it is p_ave.

The average of p_ave across a family of Quests is the "average precision over recall".
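
The evaluation measures above can be sketched as follows, assuming binary values v(d) in {0,1} listed in ranked order. The area under the curve is approximated here by the usual non-interpolated average precision (the mean of the precision at each position where a valuable document appears, divided by the total number of valuable documents); that approximation is an assumption of this sketch, not something the outline specifies.

```python
def g(values, n):
    """Number of valuable documents (v = 1) among the first n in the ranking."""
    return sum(values[:n])


def precision(values, n):
    """Fraction of the first n documents that are valuable."""
    return g(values, n) / n


def recall(values, n, total_good):
    """Fraction of all valuable documents found in the first n."""
    return g(values, n) / total_good


def average_precision(values, total_good):
    """Non-interpolated approximation to the area under the curve."""
    hits = [precision(values, n)
            for n in range(1, len(values) + 1)
            if values[n - 1] == 1]
    return sum(hits) / total_good


# Hypothetical ranked list: documents 1 and 3 are valuable, and G(Quest) = 2.
values = [1, 0, 1, 0, 0]
G = 2

for n in range(1, len(values) + 1):
    print(n, precision(values, n), recall(values, n, G))
print(average_precision(values, G))
```

Averaging the last number over a family of Quests gives a single figure for comparing systems.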

The Future

Is now.

 

Recommended Materials to Read

 

Paper

The following materials are available only in paper, and will be distributed at the first meeting.

Luhn HP. (1958) The Automatic Creation of Literature Abstracts. IBM J. of Research and Development 2 (2) 159-165 and 317.

-- (1959) Auto-encoding documents for information retrieval systems. In M. Boaz, Ed. Modern Trends in Documentation. Pergamon, London, 45-58.

Schulz CK. H. P. Luhn the Man.

Kantor PB. (1994) Information Retrieval Techniques. Annual Review of Information Science and Technology. 53-90.

 

Maron ME, Kuhns JL. (1960) On relevance, probabilistic indexing and retrieval. JACM 7(3). (In Saracevic, Intro to Information Science)

Lesk M. (1969) Word-word association in document retrieval systems. American Documentation 20(1). (In Saracevic, Intro to Information Science)

There is an excellent book of readings:

Sparck Jones K, Willett P. (1997) Readings in Information Retrieval. Morgan Kaufmann. However, in some cases, such as the paper by Maron and Kuhns included here, the appendices, which are essential to the argument, are omitted. In other cases the paper included, while recognized in the field, contains more or less serious errors that are not pointed out by the editors. Hence it is recommended along with a caveat lector.

 

Some Web references.

For an overview of the TREC approach to evaluating information retrieval systems, see the TREC conferences.

http://trec.nist.gov/pubs.html

A good overview of the TREC process is given by the slides (note that these are bit hogs, so you won't want to view them from home) at:

http://trec.nist.gov/presentations/TREC7/index.htm

Follow the link for TREC5 to:

http://trec.nist.gov/pubs/trec5/t5_proceedings.html

Select

17. SPIDER Retrieval System at TREC-5, page 217

J.P. Ballerini, M. Buchel, R. Domenig, D. Knaus, B. Mateev, E. Mittendorf,

P. Schauble, P. Sheridan, M. Wechsler (Swiss Federal Institute of Technology (ETH))

This paper gives a mathematically clear discussion of a method that uses various tricks to estimate what is missing from a text when trying to retrieve it.

For a particularly successful system based on plain vector matching and the bag-of-words approach, see:

http://trec.nist.gov/pubs/trec6/t6_proceedings.html

and select

16. AT&T at TREC-6 , page 215

A. Singhal (AT&T Labs-Research)

You can find lots of useful information at the ACM SIGIR site:

http://www.acm.org/sigir/

….

Software resources

See especially the software resources at http://www.acm.org/sigir/filters.html, which include a list of stop words in machine-readable form at ftp://ftp.vt.edu/pub/reuse/IR.code/ir-code/stopper/

WordNet is at:

http://www.cogsci.princeton.edu/~wn/

You may be amused at the results of an AltaVista search for WordNet:

3. Web WordNet Interface (version 0.3)

Web WordNet Interface (version 0.3) The basic aim of this tool is to provide a flexible access to our

multilingual lexical knowledge bases. The tool...

URL: nipadio.lsi.upc.es/wwi.html

Last modified 15-Sep-98 - page size 8K - in English [ Translate ]

4. WordNet

Computer Science - University of Windsor. WordNet. WordNet has a home page at

http://www.cogsci.princeton.edu/~wn/. The installed version is 1.5. An...

URL: www.cs.uwindsor.ca/help/on-line-docs/wordnet/wordnet.html

Last modified 3-Dec-95 - page size 868 bytes - in English [ Translate ]

5. Educational Uses of WordNet

READER: A Lexical Aid. Cognitive Science Laboratory Princeton University. READER allows a student

to read a text displayed by a computer, and to have...

URL: www.cogsci.princeton.edu/~geo/reader.html

Last modified 4-Oct-96 - page size 2K - in English [ Translate ]

6. WordNet

WordNet. Digital Logging Recorder. WordNet combines ease of use with the latest technology to provide

a state-of-the-art, message-based digital recording..

URL: www.jtsinclair.co.uk/wordnet.html

Last modified 2-Dec-97 - page size 5K - in English [ Translate ]

7. WordNet, the online English dictionary

Modèles Informatiques du Langage et de la Cognition - MILC. Groupe Intelligence Artificielle

Département Informatique. English. L'équipe. Recherches. Home.

URL: www-inf.enst.fr/~milc/wordnet.html

Last modified 8-Sep-98 - page size 4K - in French [ Translate ]

8. WordNet

WordNet 1.5 A Lexical Database for the English Language. If you experience difficulty with this Java

interface, please try our HTML forms interface...

URL: www.cogsci.princeton.edu/~wn/online/java/

Last modified 5-Aug-97 - page size 1K - in English [ Translate ]

9. WordNet Publications

WordNet - a Lexical Database for English Cognitive Science Laboratory Princeton University 221 Nassau

St. Princeton, NJ 08542. Publications. Those...

URL: www.cogsci.princeton.edu/~wn/papers/

Last modified 10-Aug-98 - page size 2K - in English [ Translate ]

10. WordNet at Vancouver Webpages

WordNet 1.5 Search. See the WordNet Home Page for more information about WordNet. Recursive

WordNet Definition of terms (holonym, meronym, etc.) Search...

URL: vancouver-webpages.com/wordnet/original.shtml

Last modified 14-Oct-96 - page size 3K - in English [ Translate ]

A useful page is:

http://nipadio.lsi.upc.es/wwi.html



Contact us at lreba@scils.rutgers.edu
RDLDL; SCILS Building; Room 214; 4 Huntington Street; New Brunswick, NJ 08903 (732) 932-7705; fax: (732) 932-1504
Last updated on: Thu Apr 6 22:56:26 EDT 2000