The Rutgers Distributed Laboratory for Digital Libraries
Unit 1. Kantor: Information Retrieval.
Outline:
Representation
Bean counting versus comprehension
Parts of speech and disambiguation
Words reflect what a document is about
Basic Boolean combinations
From words to ideas: stemming; stopwords; phrases
From relevance to frequency: the bag of words approach
From relevance to proximity: commercial search services
From frequency to binarization - the LAD/DIP approach
Understanding meaning: hand-built thesauri:
WordNet (http://www.cogsci.princeton.edu/~wn/)
Understanding meaning: Statistical thesauri: (Salton and Lesk); UIUC Concept Spaces
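As a rough illustration of the representation steps listed above (stopwords, bag of words, binarization), here is a minimal Python sketch. The stop list, the function names, and the sample sentence are all hypothetical and chosen only for illustration. The binarization shown is plain presence/absence of a term, which is the simplest reading of the word and is not meant as a rendering of the LAD/DIP approach.

```python
import re
from collections import Counter

# Hypothetical, tiny stop list for illustration only; real lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text):
    """Count term frequencies with stopwords removed; word order is discarded."""
    return Counter(t for t in tokenize(text) if t not in STOPWORDS)

def binarize(bag):
    """Reduce term frequencies to presence/absence, i.e. a set of terms."""
    return set(bag)

doc = "The retrieval of documents and the ranking of documents"
bag = bag_of_words(doc)      # frequencies: documents=2, retrieval=1, ranking=1
terms = binarize(bag)        # only which terms occur, not how often
```

Stemming and phrase detection are left out on purpose; either would slot in between tokenize and the counting step.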
Retrieval
What does the user want?
Satisfaction - "to be done"
Need expressed as a text of some length (the "query"). Keywords; a sentence; …
Definition of Sim(document, quest)
Ranking of documents
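One possible definition of Sim(document, quest), sketched in Python under stated assumptions: both texts are reduced to term-count vectors and compared by cosine similarity, and documents are then ranked by decreasing score. Cosine similarity is only one common choice and the outline does not commit to it; the function names and sample texts are hypothetical.

```python
import math
from collections import Counter

def term_counts(text):
    """Very rough bag of words: lowercase and split on whitespace."""
    return Counter(text.lower().split())

def sim(document, quest):
    """Cosine similarity between the term-count vectors of two texts."""
    d, q = term_counts(document), term_counts(quest)
    dot = sum(d[t] * q[t] for t in q)
    norm_d = math.sqrt(sum(c * c for c in d.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    norm = norm_d * norm_q
    return dot / norm if norm else 0.0

def rank(documents, quest):
    """Return the documents sorted by decreasing similarity to the quest."""
    return sorted(documents, key=lambda doc: sim(doc, quest), reverse=True)

docs = [
    "stemming and stopwords",
    "precision and recall curves",
    "recall of documents",
]
ranked = rank(docs, "precision recall")
```

Here the document sharing both quest terms comes first and the one sharing none comes last.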
Evaluation
Sets of documents have value v(D)
Simplifying assumption: v(D) = v(d1) + … + v(dN)
Further simplifying: v(d) in {0,1}.
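Look down the ranked list to position n; g(n) is the number of valuable documents seen so far.
Precision := g(n)/n
Recall := g(n)/G(Quest), where G(Quest) is the total number of valuable documents for the quest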
As n varies, this traces a "precision versus recall curve". The area under it is p_ave.
Average of this across a family of Quests is "average precision over recall"
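The evaluation measures above can be sketched in Python as follows. The sketch assumes binary values v(d) in {0,1}, reads g(n) as the number of valuable documents among the first n, and reads G(Quest) as the total number of valuable documents for the quest. The area under the curve is approximated by summing precision at each rank where a valuable document appears and dividing by G(Quest); that is a common convention, not necessarily the one intended here. All names and the sample judgments are hypothetical.

```python
def precision_recall_points(values, total_valuable):
    """values: 0/1 judgments v(d) down the ranked list; total_valuable: G(Quest).
    Returns a list of (n, precision, recall) for n = 1..len(values)."""
    points = []
    g = 0
    for n, v in enumerate(values, start=1):
        g += v                                   # g(n): valuable documents so far
        points.append((n, g / n, g / total_valuable))
    return points

def average_precision(values, total_valuable):
    """Approximate the area under the curve: sum precision at each rank
    holding a valuable document, then divide by G(Quest)."""
    g = 0
    total = 0.0
    for n, v in enumerate(values, start=1):
        if v:
            g += 1
            total += g / n
    return total / total_valuable

def mean_over_quests(per_quest_scores):
    """Average a per-quest score across a family of quests."""
    return sum(per_quest_scores) / len(per_quest_scores)

judgments = [1, 0, 1, 0, 0]   # valuable documents at ranks 1 and 3
G = 2
points = precision_recall_points(judgments, G)
p_ave = average_precision(judgments, G)
```

With these judgments, precision at n = 3 is 2/3 and recall reaches 1.0, since both valuable documents have been seen by then.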
The Future
Is now.
Recommended Materials to Read
Paper
The following materials are available only in paper, and will be distributed at the first meeting.
Luhn HP. (1958) The Automatic Creation of Literature Abstracts. IBM J. of Research and Development 2 (2) 159-165 and 317.
-- (1959) Auto-encoding documents for information retrieval systems. in M. Boaz, Ed. Modern Trends in Documentation. Pergamon, London 45-58.
Schulz CK. H. P. Luhn the Man.
Kantor PB. (1994) Information Retrieval Techniques. Annual Rev of Information Science. 53-90.
Maron ME, Kuhns JL. (1960) On relevance, probabilistic indexing and retrieval. JACM 7(3) (In Saracevic, Intro to Information Science)
Lesk M. (1969) Word-word association in document retrieval systems. American Documentation 20(1). (In Saracevic, Intro to Information Science)
There is an excellent book of readings:
Sparck Jones K, Willett P. (1997) Readings in Information Retrieval. Morgan Kaufmann. However, in some cases, such as the paper by Maron and Kuhns included here, the appendices, which are essential to the argument, are omitted. In other cases the included paper, while recognized in the field, contains more or less serious errors that the editors do not point out. Hence it is recommended with a caveat lector.
Some Web references.
For an overview of the TREC approach to evaluating information retrieval systems, see the TREC conferences.
http://trec.nist.gov/pubs.html
A good overview of the TREC process is given by the slides (note that these are bit hogs, so you won't want to view them from home) at:
http://trec.nist.gov/presentations/TREC7/index.htm
Follow the link for TREC5 to:
http://trec.nist.gov/pubs/trec5/t5_proceedings.html
Select
17. SPIDER Retrieval System at TREC-5, page 217
J.P. Ballerini, M. Buchel, R. Domenig, D. Knaus, B. Mateev, E. Mittendorf,
P. Schauble, P. Sheridan, M. Wechsler (Swiss Federal Institute of Technology (ETH))
This paper gives a mathematically clear discussion of a method that uses various tricks to estimate what is missing from a text when trying to retrieve it.
For a particularly successful system based on plain vector matching and the bag-of-words approach, see:
http://trec.nist.gov/pubs/trec6/t6_proceedings.html
and select
16. AT&T at TREC-6 , page 215
A. Singhal (AT&T Labs-Research)
You can find lots of useful information at the ACM SIGIR site:
http://www.acm.org/sigir/
….
Software resources
See especially the software resources at http://www.acm.org/sigir/filters.html, including a list of stop words in machine-readable form at ftp://ftp.vt.edu/pub/reuse/IR.code/ir-code/stopper/
WORDNET is at:
http://www.cogsci.princeton.edu/~wn/
You may be amused at the results of an Altavista search for WordNet:
3. Web WordNet Interface (version 0.3)
Web WordNet Interface (version 0.3) The basic aim of this tool is to provide a flexible access to our
multilingual lexical knowledge bases. The tool...
URL: nipadio.lsi.upc.es/wwi.html
Last modified 15-Sep-98 - page size 8K - in English [ Translate ]
4. WordNet
Computer Science - University of Windsor. WordNet. WordNet has a home page at
http://www.cogsci.princeton.edu/~wn/. The installed version is 1.5. An...
URL: www.cs.uwindsor.ca/help/on-line-docs/wordnet/wordnet.html
Last modified 3-Dec-95 - page size 868 bytes - in English [ Translate ]
5. Educational Uses of WordNet
READER: A Lexical Aid. Cognitive Science Laboratory Princeton University. READER allows a student
to read a text displayed by a computer, and to have...
URL: www.cogsci.princeton.edu/~geo/reader.html
Last modified 4-Oct-96 - page size 2K - in English [ Translate ]
6. WordNet
WordNet. Digital Logging Recorder. WordNet combines ease of use with the latest technology to provide
a state-of-the-art, message-based digital recording..
URL: www.jtsinclair.co.uk/wordnet.html
Last modified 2-Dec-97 - page size 5K - in English [ Translate ]
7. WordNet, the online English dictionary
Modèles Informatiques du Langage et de la Cognition - MILC. Groupe Intelligence Artificielle
Département Informatique. English. L'équipe. Recherches. Home.
URL: www-inf.enst.fr/~milc/wordnet.html
Last modified 8-Sep-98 - page size 4K - in French [ Translate ]
8. WordNet
WordNet 1.5 A Lexical Database for the English Language. If you experience difficulty with this Java
interface, please try our HTML forms interface...
URL: www.cogsci.princeton.edu/~wn/online/java/
Last modified 5-Aug-97 - page size 1K - in English [ Translate ]
9. WordNet Publications
WordNet - a Lexical Database for English Cognitive Science Laboratory Princeton University 221 Nassau
St. Princeton, NJ 08542. Publications. Those...
URL: www.cogsci.princeton.edu/~wn/papers/
Last modified 10-Aug-98 - page size 2K - in English [ Translate ]
10. WordNet at Vancouver Webpages
WordNet 1.5 Search. See the WordNet Home Page for more information about WordNet. Recursive
WordNet Definition of terms (holonym, meronym, etc.) Search...
URL: vancouver-webpages.com/wordnet/original.shtml
Last modified 14-Oct-96 - page size 3K - in English [ Translate ]
A useful page is:
http://nipadio.lsi.upc.es/wwi.html