I.Course Objectives
To understand how modern information retrieval systems, such as those found
on the Web
work, and to understand the important issues in the design, study and evaluation
of such
systems.
II.Organization of the Course
The course will follow a lecture/discussion format, and every class will
include some kind of a
"small group" breakout session either in the classroom or in the computer
lab. The major
sections are:
A.Introduction and overview
B.Relation of IR to indexing and cataloging
C.Representation of texts by numbers and vectors; Basic notions of probability
for
IR
D.Design for usability; principles of user-centered design
E.Vector space models; other linear classifiers: probabilistic models;
inference
models
F.Evaluation of systems; precision, recall and utility
G.Bases for representation: terms; phrases; word pairs; n-grams; natural
language
H.Collection based indexes; LSI; PIRCS; network structures;
I.Collection descriptions; collection fusion; meta- crawlers
J.Non-textual materials: images and sound; moving images
K.Evaluation in context: users, communities, goals and measures
L.Collaborative systems: passive and active models
M.Presentation of term projects
N.Presentation of term projects
III.Major Assignments
A.Small tasks. One or two simple tasks will be assigned each week to ensure
that neither students nor teachers are surprised by a backlog of poorly
understood concepts at the end of the semester. These are referred to
below as "knowledge crumbs" KC. Note: These problems are "easy" in
the sense that when you get the point you can do them in less than 5
minutes. The point, of course, is to help you get the point.
B.Readings. Each week there will be one assigned reading from materials
either available on the Web, or distributed in class the prior week. Note:
We are trying to avoid an expensive "Professor Packet" by this process,
but it is still experimental.
A.The following books will each be of some use to some of you, but because
of the
diversity of interests and preparation, no book seems suitable for all
the students in
any given year.
*Baeza-Yates, R. & Ribeiro-Neto, B. Modern information retrieval. New
York: ACM Press.
This text takes a computer-science perspective. It is more current than
Frakes & Baeza-Yates
(1992), and broader than Witten Moffat and Bell. If you select only one
CS-type text, this would
be the better choice.
Frakes, W.B. & Baeza-Yates, R., eds. (1992) Information retrieval:
data structures and
algorithms. Englewood Cliffs, NJ: Prentice Hall. This is a collection of
chapters, which vary
widely. It is a good place to learn some basics of how IR systems are built.
Much of the code is
available at ftp://sunsite.dcc.uchile.cl/pub/users/rbaeza/irbook/
*Hersh, W.R. (1995) Information retrieval: A health care perspective. New
York: Springer
Verlag.
General introductory text. This is accessible to all students in this course.
Bill Hersh (MD) has
concentrated his evaluation efforts on health care applications, but covers
the general
principles, and pays particular attention to the problem of designing good
evaluation
experiments.
van Rijsbergen, C.J. (1979) Information retrieval, 2nd ed. London: Butterworths.
This is an
important early text on the subject, from an IR+CS perspective. Parts are
quite mathematical.
This is the origin of the so-called "F-measure" which has become popular
among computer
scientists. The great thing is that Keith van Rijsbergen has made it available
for FREE, at
http://www.dcs.gla.ac.uk/Keith/Preface.html
Salton, G. (1989) Automatic text processing: The transformation, analysis
and retrieval of
information by computer. Reading, MA: Addison- Wesley. Gerry Salton was
the founder of
automated IR in the US, and his academic descendants have done much of
the good work. This
book is quite different from the earlier book Salton and McGill, and perhaps
less useful in
introducing fundamental ideas.
Salton, G. & McGill, M. (1983) Introduction to modern information retrieval.
New York:
McGraw-Hill. Of course this is quite old now, but it introduces many of
the ideas that are
otherwise only to be found in the papers and techincal reports of Salton?s
group at Cornell.
*Sparck Jones, K. & Willett, P. eds. (1997) Readings in Information
Retrieval. San Francisco:
Morgan Kaufmann. A collection of significant research papers in the evolution
of information
retrieval from its roots in librarianship. The editors have written good
historical introductions to
each section. The mathematical rigor and clarity of the papers varies widely,
and in many cases
it is not easy to reconstruct the "commonly accepted meaning" of the terms
used by the
author(s). This is a very good place to look for papers on which to report
to the class, and any
other papers contained in this collection will be acceptable for that purpose.
Witten, I, Moffat, & Bell T. Managing Gigabytes. compressing and indexing
documents and
images. New York : Van Nostrand Reinhold, c1994.xiv, 429 p. : ill. ; 26
cm.
Online resource: http://www.searchenginewatch.com/
A.Presentation of a paper. Each student is required to select one paper
from an extended list
that will be made available on the web site, to study that paper thoroughly,
and to present
the key results of that paper to the entire class on a date to be selected.
The presentation
may include slides, handouts, or web pages. The presentation should make
clear: (1) What
problem the paper addresses; (2) what relation it has to prior cited literature;
(3) what idea it
proposes to solve or improve the problem; (4) what was done to implement
that idea; (5)
what results were found, and (6) what suggestions were made for further
work. Note: We
have adopted this approach to presenting readings because several years
of experience has
established that when I explain the papers they tend to become less accessible,
rather than
more. When you explain them to each other, they become clear. Of course,
if the presentation
does not make the paper clear, ASK QUESTIONS.
B.Term project: System building OR system evaluation. Students are encouraged
to work in
teams of 2 for this assignment, although a student may elect to work alone.
More specifically,
to
1.BUIILD A SYSTEM.
a.Select a body of texts. These may either be the body of texts that the
class
assembles in KC-1 below, or they may be assembled in some other way.
There must be at least 80 texts in the collection, but they may be as short
as
abstracts if desired. They may contain tags only if that tagging is used
to
index or retrieve on the collection.
b.Select a way to represent those texts ? this may be simple (such as term
occurrence, or more complex, using term frequency, term position, or distance
between terms).
c.Select a way to match simple queries (sets of terms) to those
representations to produce a ranked list of the 10 documents most likely
to
be relevant.
d.Give the system some kind of interface (this may be web based, but it
could
be Unix command line if you like).
e.Get at least one other person from this class to use the system in the
intended way, to ensure that you are not totally fooling yourself.
f.Write up a coherent explanation of what you have done, including the
code.
g.Present the system to the class in one of the last two class meetings
of the
semester.
EVALUATE A SYSTEM
a.Select a system to evaluate.
b.Describe the systems in terms of the functions that it supports for a
user.
c.Describe, as well as you can determine, how it must represent documents
internally, in order to be able to support those functions
d.Describe as well as you can how it must match queries to documents, in
order to support these functions
e.Prepare a "tasket" (a basket of tasks) that seem suitable for the system
and
submit them to the system. Process the results to produce all the following
measures of effectiveness: precision at 10 documents for each task; average
precision versus recall for each task; averages of each of those measures
over the tasket; cumulated number of useful documents versus number
presented for each task; cumulated cumulation for the tasket. (This is
most
easily done in a spreadsheet, of course, since it will draw the graphs
for you
from the data).
f.Write up a coherent explanation of your evaluation including both "objective"
measures derived form the tasket, and user- oriented measures based on
your own use and observation of the system.
g.Present your analysis to the class in one of the last two meetings of
the
semester.
Note: In previous years we have sometimes used class projects that involved
studying
multiple subjects using a system of interest. Because of the increased
vigilance that we
(Rutgers) are exercising to protect human subjects in experimental or research
studies, this
will not be done this semester. However, if you are interested in evaluation
of real systems
using real users, you should visit the relevant Web site at Rutgers to
learn more about the
notion of Human Subjects Review.
.
IV.Methods of Assessment
Grades will be assigned for each of the activities, with the following weights:
Knowledge Crumbs 25%
Presentation of a Reading 20%
Term Project 45%
Participation 10% *
TOTAL 100%
Mathematically the mapping from letter grades to numbers is as follows
96-100 -> A+ -> 98
90-95 -> A -> 93
83-89 ->B+ -> 86
78-82 ->B -> 80
72-77 ->C+ -> 75
65-71 ->C -> 68
Work which does not achieve a grade of C will be assigned an honorary score
of 50% if it is submitted on
time. If it is not submitted, it will be assigned a courtesy score of 10%.
Late Work: Work which is submitted late will experience a "decay of its
grade" at the rate of 5% per day.
Thus an A paper received 3 days late receives a score of A -> 93%x95%x95%x95%
= 79.7% -> B. And so
forth. I will not count weekend days, if you don?t count them against me
in grading papers.
* Participation. We will have small group sessions each week. The first
will be random. Thereafter, I will
ask each of you to give me ranked list of those in your group, with the
person you would most like to meet
with again listed first, and so on. In succeeding weeks we will keep re-arranging
groups to try to satisfy
those preferences. These rankings will also factor into half the grade
for participation. This is part of an
ongoing experiment to determine whether peer assessment is consistent with
instructor assessment in my
courses. You are free to not submit a list of preferences, in which case
you will surely be assigned to a group
at random the following week.