Information Retrieval
                                                                             17:610:551
                       Pre and Co-Requisites:
                       17:610:550 or at least one course in computer programming and permission of the
                       instructor instructor
                                                                             Credits: 3
                       Catalog Description
                       Theory, design, use, and evaluation of information retrieval (IR) systems. Design principles for
                       IR systems and their implementation, characteristics of operational and experimental retrieval
                       systems, and evaluation of information retrieval systems.
 
 
 

                       I.Course Objectives

                       To understand how modern information retrieval systems, such as those found on the Web
                       work, and to understand the important issues in the design, study and evaluation of such
                       systems.

                       II.Organization of the Course

                       The course will follow a lecture/discussion format, and every class will include some kind of a
                       "small group" breakout session either in the classroom or in the computer lab. The major
                       sections are:

                              A.Introduction and overview
                              B.Relation of IR to indexing and cataloging
                              C.Representation of texts by numbers and vectors; Basic notions of probability for
                                IR
                              D.Design for usability; principles of user-centered design
                              E.Vector space models; other linear classifiers: probabilistic models; inference
                                models
                              F.Evaluation of systems; precision, recall and utility
                              G.Bases for representation: terms; phrases; word pairs; n-grams; natural language
                              H.Collection based indexes; LSI; PIRCS; network structures;
                               I.Collection descriptions; collection fusion; meta- crawlers
                               J.Non-textual materials: images and sound; moving images
                              K.Evaluation in context: users, communities, goals and measures
                              L.Collaborative systems: passive and active models
                              M.Presentation of term projects
                              N.Presentation of term projects

                            III.Major Assignments

                                  A.Small tasks. One or two simple tasks will be assigned each week to ensure
                                     that neither students nor teachers are surprised by a backlog of poorly
                                     understood concepts at the end of the semester. These are referred to
                                     below as "knowledge crumbs" KC. Note: These problems are "easy" in
                                     the sense that when you get the point you can do them in less than 5
                                     minutes. The point, of course, is to help you get the point.
                                   B.Readings. Each week there will be one assigned reading from materials
                                     either available on the Web, or distributed in class the prior week. Note:
                                     We are trying to avoid an expensive "Professor Packet" by this process,
                                     but it is still experimental.
                              A.The following books will each be of some use to some of you, but because of the
                                diversity of interests and preparation, no book seems suitable for all the students in
                                any given year.

                       *Baeza-Yates, R. & Ribeiro-Neto, B. Modern information retrieval. New York: ACM Press.
                       This text takes a computer-science perspective. It is more current than Frakes & Baeza-Yates
                       (1992), and broader than Witten Moffat and Bell. If you select only one CS-type text, this would
                       be the better choice.

                       Frakes, W.B. & Baeza-Yates, R., eds. (1992) Information retrieval: data structures and
                       algorithms. Englewood Cliffs, NJ: Prentice Hall. This is a collection of chapters, which vary
                       widely. It is a good place to learn some basics of how IR systems are built. Much of the code is
                       available at ftp://sunsite.dcc.uchile.cl/pub/users/rbaeza/irbook/

                       *Hersh, W.R. (1995) Information retrieval: A health care perspective. New York: Springer
                       Verlag.

                       General introductory text. This is accessible to all students in this course. Bill Hersh (MD) has
                       concentrated his evaluation efforts on health care applications, but covers the general
                       principles, and pays particular attention to the problem of designing good evaluation
                       experiments.

                       van Rijsbergen, C.J. (1979) Information retrieval, 2nd ed. London: Butterworths. This is an
                       important early text on the subject, from an IR+CS perspective. Parts are quite mathematical.
                       This is the origin of the so-called "F-measure" which has become popular among computer
                       scientists. The great thing is that Keith van Rijsbergen has made it available for FREE, at
                       http://www.dcs.gla.ac.uk/Keith/Preface.html

                       Salton, G. (1989) Automatic text processing: The transformation, analysis and retrieval of
                       information by computer. Reading, MA: Addison- Wesley. Gerry Salton was the founder of
                       automated IR in the US, and his academic descendants have done much of the good work. This
                       book is quite different from the earlier book Salton and McGill, and perhaps less useful in
                       introducing fundamental ideas.

                       Salton, G. & McGill, M. (1983) Introduction to modern information retrieval. New York:
                       McGraw-Hill. Of course this is quite old now, but it introduces many of the ideas that are
                       otherwise only to be found in the papers and techincal reports of Salton?s group at Cornell.

                       *Sparck Jones, K. & Willett, P. eds. (1997) Readings in Information Retrieval. San Francisco:
                       Morgan Kaufmann. A collection of significant research papers in the evolution of information
                       retrieval from its roots in librarianship. The editors have written good historical introductions to
                       each section. The mathematical rigor and clarity of the papers varies widely, and in many cases
                       it is not easy to reconstruct the "commonly accepted meaning" of the terms used by the
                       author(s). This is a very good place to look for papers on which to report to the class, and any
                       other papers contained in this collection will be acceptable for that purpose.

                       Witten, I, Moffat, & Bell T. Managing Gigabytes. compressing and indexing documents and
                       images. New York : Van Nostrand Reinhold, c1994.xiv, 429 p. : ill. ; 26 cm.

                       Online resource: http://www.searchenginewatch.com/

                              A.Presentation of a paper. Each student is required to select one paper from an extended list
                                that will be made available on the web site, to study that paper thoroughly, and to present
                                the key results of that paper to the entire class on a date to be selected. The presentation
                                may include slides, handouts, or web pages. The presentation should make clear: (1) What
                                problem the paper addresses; (2) what relation it has to prior cited literature; (3) what idea it
                                proposes to solve or improve the problem; (4) what was done to implement that idea; (5)
                                what results were found, and (6) what suggestions were made for further work. Note: We
                                have adopted this approach to presenting readings because several years of experience has
                                established that when I explain the papers they tend to become less accessible, rather than
                                more. When you explain them to each other, they become clear. Of course, if the presentation
                                does not make the paper clear, ASK QUESTIONS.
                              B.Term project: System building OR system evaluation. Students are encouraged to work in
                                teams of 2 for this assignment, although a student may elect to work alone. More specifically,
                                to

                               1.BUIILD A SYSTEM.

                                       a.Select a body of texts. These may either be the body of texts that the class
                                         assembles in KC-1 below, or they may be assembled in some other way.
                                         There must be at least 80 texts in the collection, but they may be as short as
                                         abstracts if desired. They may contain tags only if that tagging is used to
                                         index or retrieve on the collection.
                                       b.Select a way to represent those texts ? this may be simple (such as term
                                         occurrence, or more complex, using term frequency, term position, or distance
                                         between terms).
                                        c.Select a way to match simple queries (sets of terms) to those
                                         representations to produce a ranked list of the 10 documents most likely to
                                         be relevant.
                                       d.Give the system some kind of interface (this may be web based, but it could
                                         be Unix command line if you like).
                                       e.Get at least one other person from this class to use the system in the
                                         intended way, to ensure that you are not totally fooling yourself.
                                        f.Write up a coherent explanation of what you have done, including the code.
                                       g.Present the system to the class in one of the last two class meetings of the
                                         semester.

                       EVALUATE A SYSTEM

                                       a.Select a system to evaluate.
                                       b.Describe the systems in terms of the functions that it supports for a user.
                                        c.Describe, as well as you can determine, how it must represent documents
                                         internally, in order to be able to support those functions
                                       d.Describe as well as you can how it must match queries to documents, in
                                         order to support these functions
                                       e.Prepare a "tasket" (a basket of tasks) that seem suitable for the system and
                                         submit them to the system. Process the results to produce all the following
                                         measures of effectiveness: precision at 10 documents for each task; average
                                         precision versus recall for each task; averages of each of those measures
                                         over the tasket; cumulated number of useful documents versus number
                                         presented for each task; cumulated cumulation for the tasket. (This is most
                                         easily done in a spreadsheet, of course, since it will draw the graphs for you
                                         from the data).
                                        f.Write up a coherent explanation of your evaluation including both "objective"
                                         measures derived form the tasket, and user- oriented measures based on
                                         your own use and observation of the system.
                                       g.Present your analysis to the class in one of the last two meetings of the
                                         semester.

                                Note: In previous years we have sometimes used class projects that involved studying
                                multiple subjects using a system of interest. Because of the increased vigilance that we
                                (Rutgers) are exercising to protect human subjects in experimental or research studies, this
                                will not be done this semester. However, if you are interested in evaluation of real systems
                                using real users, you should visit the relevant Web site at Rutgers to learn more about the
                                notion of Human Subjects Review.

                                .

                       IV.Methods of Assessment

                       Grades will be assigned for each of the activities, with the following weights:

                       Knowledge Crumbs 25%

                       Presentation of a Reading 20%

                       Term Project 45%

                       Participation 10% *

                       TOTAL 100%

                       Mathematically the mapping from letter grades to numbers is as follows

                       96-100 -> A+ -> 98

                       90-95 -> A -> 93

                       83-89 ->B+ -> 86

                       78-82 ->B -> 80

                       72-77 ->C+ -> 75

                       65-71 ->C -> 68

                       Work which does not achieve a grade of C will be assigned an honorary score of 50% if it is submitted on
                       time. If it is not submitted, it will be assigned a courtesy score of 10%.

                       Late Work: Work which is submitted late will experience a "decay of its grade" at the rate of 5% per day.
                       Thus an A paper received 3 days late receives a score of A -> 93%x95%x95%x95% = 79.7% -> B. And so
                       forth. I will not count weekend days, if you don?t count them against me in grading papers.

                       * Participation. We will have small group sessions each week. The first will be random. Thereafter, I will
                       ask each of you to give me ranked list of those in your group, with the person you would most like to meet
                       with again listed first, and so on. In succeeding weeks we will keep re-arranging groups to try to satisfy
                       those preferences. These rankings will also factor into half the grade for participation. This is part of an
                       ongoing experiment to determine whether peer assessment is consistent with instructor assessment in my
                       courses. You are free to not submit a list of preferences, in which case you will surely be assigned to a group
                       at random the following week.