Preface

One of the things you learn from students is jokes:
You probably have to have gone to school in Cambridge to really appreciate this joke. I never did, but I find it funny because it laughs at an important division between thoughtful people.

Two Cultures

According to C. P. Snow, the world seems eternally divided into two cultures, the literary intellectuals and the scientists [Snow, 1961, p. 4]. Snow himself provides very few clues as to just how we might identify someone at one pole or the other. He suggests that the literate have the unfortunate tendency of falling into the moral trap:
He sees scientists and engineers, on the other hand, as optimistic, impatient do-ers! This leads him to hypothesize that literature changes more slowly than science (p. 9). [Marginal note: Testable hypothesis] He also thought that, due to the forces of a fanatical belief in educational specialization and a tendency to let our social forms crystallize (p. 18), the gap between the two cultures was "...much less bridgeable among the young than it was even 30 years ago." He said this in 1959! Certainly these same forces have not helped matters in the intervening 40 years. But Snow's most important recommendation remains true:
The premise of this text is that Finding Out About (FOA), the process of actively seeking out information relevant to a topic of interest, absolutely demands a wide-ranging attack by both literary and scientific disciplines. The kind of fractionation that Snow describes has boxed investigators from various disciplines into corners from which they each attempt to address a broad range of fundamentally interdisciplinary questions of cognition. FOA is only one such question, but the tension between computational and linguistic sensibilities has been manifest in this domain for an especially long time.

For example, as part of an early meeting of cyberneticists exploring the way that communication and computation might interact, Benoit Mandelbrot, an eminent mathematician and physicist (now most famous for his fractal landscapes), presented hypothetical models of language use that would explain a phenomenon known as Zipf's law (a topic discussed in this text, cf. Section 3.2 and Section 5.1), claiming these models were analogous to physical systems with which he was familiar. In reaction, A. S. C. Ross, a famous linguist of the 1950s, offered the following commentary:
Mandelbrot's probabilistic models and statistics did not have much to say to at least this linguist.* An optimist, however, could see a basic complementarity between statistical methods and the linguist's syntactic methods. FOA's statistical methods are good at semantics: knowing gross things about an entire document's meaning, and what words mean in terms of how they relate to other documents in the corpus and to users' queries. These methods blithely throw away noise words like AND, OF, and THE, because they are assumed to say little about the document's content. Syntactic analysis captures the fine structure of individual sentences and depends critically on the same noise words to reliably anchor its parsing. [Marginal note: Corpus-based linguistics]

The title of this textbook also signals cognitive aspirations. Cognitive stems from the Latin cognitio, referring to structure, building. We typically imagine cognitive structures to be within an individual's head. But part of what is now known as the discipline of cognitive science is the realization that these representations can be built by many individuals as well as by one. The World Wide Web (WWW) as a representation of knowledge is a topic considered further in Section 6.9.

I am personally drawn to the FOA problem because of the way it intermixes verbal and numeric sensibilities. To say that literary intellectuals are interested in language is almost tautological. But one of the major arguments put forward by this text is that many linguistic phenomena also have interesting statistical and mathematical properties. Computations involving these numbers are not only central to the engineering of effective search engines, but they portend fundamental insights into the new forms of communication emerging on the WWW. Depending on your particular background, some of the techniques and perspectives discussed in this text will come naturally to you, and others will seem as if they are from a different planet.
But if you apply some effort at understanding these foreign objects, you may just find out you have lots of new friends in the rest of the solar system. Literate people can learn new mathematical names to apply to their literature, and mathematicians can appreciate new features of the language going on about them.

Typographic Conventions

Other authors who have attempted to discuss language, of course using language to do so, have recognized the confusion that can result as words are used in these two very different roles. Like many of them, I have chosen to use typography to help make this distinction. For example, many of the examples used throughout the text will be drawn from the area of ARTIFICIAL INTELLIGENCE, a subdiscipline of computer science. Terms like this, which are used as examples of lexical items rather than as part of the discourse between me (the author) and you (the reader), will appear as CAPITALIZED and in MONOSPACE FONT.

Second, boldface type will be used to flag especially important terms that help to define the FOA problem. For example, domain of discourse is the technical term used to describe ARTIFICIAL INTELLIGENCE, the subject matter of the documents we hope to find. These boldface terms are collected at the end of each chapter, for purposes of review.

Third, the fundamental relation between something in the world and what we think it means is a pivotal issue of this book. But about-ness is also a natural, ubiquitous part of much of our communication, so much so that we will adopt the typographic convention of underlining words such as about and meaning in order to highlight and better appreciate their use.

Finally, authors are always faced with decisions as to which thing they must say first. Making the right decision keeps the story moving forward, while interjecting a digression can make readers lose their way. The WWW is most people's first experience with the hypertext alternative to this linear flow.
Readers are given choice points and the opportunity to construct their own nonlinear path through a text simply by clicking on links. Obviously such jumps are more difficult to accomplish in a printed text. In this text marginal notes [Marginal note: Marginal notes] are used to point to tangential topics that a reader might choose to pursue. On the accompanying CD, clicking on the correlated anchor will lead to a brief discussion of this topic. Extra details or clarifications will be provided by footnotes, which are called out in text by asterisks.* Traditional numbered footnotes will be used to give URLs of Web sites discussed in the text.

Audiences

My interest in the topics discussed here goes back to my own dissertation. At that point I was primarily interested in machine learning techniques, and I learned just enough about free-text information retrieval to use it as a demonstration domain for the connectionist learning techniques I proposed (cf. Section 6.5.2). Since then, I have become increasingly interested in the issues surrounding FOA and have taught courses in Information Retrieval (IR) for many years, at the University of California in San Diego and the University of Wisconsin in Madison. This book began as a series of lecture notes for these classes.

In the first years, I used Keith van Rijsbergen's seminal text [van Rijsbergen, 1979]. (This book was already out of print when I first found it, but van Rijsbergen's text has now been placed in its entirety on the WWW.) This text so influenced my thinking on this subject that it occupies a special relationship with FOA: I quote from it especially often, and I use the special referential convention of van Rijsbergen, p. iii. With Keith's permission, I include a complete copy of his hypertext on the FOA CD, and every reference to that text will allow you to click and go directly to the cited page.

Several other texts deserve special mention.
The collection of chapters edited by Frakes and Baeza-Yates [Frakes and Baeza-Yates, 1992] provides an excellent introduction to many topics; Fox's chapter 7 in particular figures heavily in Chapter 2 of this text. Baeza-Yates and Ribeiro have recently edited another collection of very useful chapters [Baeza-Yates and Ribeiro, 1999]. As I was finishing work on this book, Manning and Schütze produced an excellent survey of corpus-based linguistic techniques [Manning and Schütze, 1999] that extends significantly beyond the basics provided in Section 6.3.2. Robert Korfhage has written a textbook that is especially useful from the perspective of library science [Korfhage, 1997]. I highly recommend Readings in Information Retrieval, edited by Karen Sparck Jones and Peter Willett [Sparck Jones and Willett, 1997], as a companion to this text. That collection pulls together many classic papers from IR's distant past, some of which are now hard to get. A supplement (available at the FOA Web site) links readings from that text as an adjunct to this textbook.

Because I teach primarily in a Computer Science department, the primary audience for this textbook is computer science students, both graduate and undergraduate, like those I have had the good fortune to meet in my classes. At the same time, I have tried to suppress technical details or explain them in ways that should make the most important themes accessible to audiences (e.g., linguists, library scientists) who are more comfortable with words than with equations. Search engine technologies are central to the FOA problem, but this text was designed to be accessible to those who write such computer programs as well as to those who do not. Executable versions of all basic routines are available on the attached CD-ROM; current versions are maintained at the FOA Web site.
Together with the test corpora and experimental data (queries, relevance assessments), these routines should let students and teachers explore many variations without changing any code. Source code for the routines is also provided for those programmers who want to modify or extend the basic functionalities.

Exercises are collected at the end of each chapter, but they are an admittedly uneven mix. They are intended as basic review exercises; some are more challenging than others. The primary assignments for my classes are a series of machine problems: extended programming assignments that cumulatively build all the parts of a basic search engine. The details of these assignments, as well as lecture slides, test questions, and so on, are available on the FOA Web site to instructors who might be interested.

The first chapter of the text is designed to give any audience a broad overview of the basic questions underlying FOA and how they interact. The next three chapters cover the core issues involved in building and evaluating a generic search engine at a level appropriate to undergraduates. Chapter 5 collects several important topics that require more mathematical sophistication, and Chapter 6 and Chapter 7 consider extensions of the basic core material at a graduate level. Chapter 6 considers extensions of basic search technologies that use features of documents beyond keywords to draw more artificially intelligent inferences about them. Chapter 7 focuses on how one particular branch of AI, machine learning, has been used to automatically learn more about both documents and the users searching through them. Chapter 8 concludes with a look at the most active developments in FOA and a reassessment of fundamental issues that will be with us for the foreseeable future.

Acknowledgments

I had the good fortune to have David Blair at the University of Michigan (in a single lecture!)
make it clear that FOA isn't just an engineering problem, but important to anyone deeply interested in language. Mike Gordon (energized by that same lecture), Manfred Kochen, Bob Lindsay, Gary and Judy Olson, Ken Winter, and Maurita Holland were all in Ann Arbor, and they taught me more than I would really appreciate until years later. Keith van Rijsbergen's unswerving confidence has made this book possible. His book is where I began and the standard I have tried to maintain. Gerry Salton and Karen Sparck Jones have been generous and patient with me as they have been to so many others in the IR community. I thank Nick Belkin, Bruce Croft, Doug Cutting, Sue Dumais, Norbert Führ, David Lewis, Jan Petersen, and Steve Robertson for uncountable interesting SIGIR dinners. I am happy to acknowledge the influence of the industrious groups around Carnegie Mellon University and Just Research, led by Tom Mitchell and Andrew McCallum, especially on Chapter 7. A summer of exciting conversation (1987) with Ed Hutchins and Don Norman of UCSD's Cognitive Science department helped me think more broadly about parallel distributed processing models of cognition, involving networks of people rather than neurons, as parts of social systems. I have benefited from a long, productive relationship with the editors and others working at Encyclopædia Britannica. I am grateful to have met Mortimer Adler (once!) and especially to have worked closely with Editor-in-Chief Bob McHenry and others at Encyclopædia Britannica in Chicago, Chris Needham (in London), and Bob Clarke, John Dimm, John McInerney, and Harold Kester in La Jolla. Over an even longer period, Jack Conrad, Dan Dabny, Andy Desmond, Peter Jackson, and Isabelle Moulinier of West Publishing have provided my second, extended experience with the highly edited WESTLAW corpus. I enjoyed a pleasant sabbatical at the University of Wisconsin in Madison, teaching with and learning from Jude Shavlik and Mark Craven.
Paul Kube is, more than anyone else I know, comfortable in both of Snow's cultures (and several others as well); he has helped me sober and balance many aspects of this manuscript. I thank Kim Itkonen for turning my words about words into a wonderful image for the cover.

Most of my own research has been done in collaboration with students. Many of my thoughts about what I had done right and wrong with AIR were shaped in conversations with Dan Rose, concerning his thesis. I am also grateful to both Dan and Susan Gruber for their help in shaping very early drafts of all chapters. Brian Bartell asked hard questions about FOA from the beginning, and I have appreciated the pleasure of his collaboration ever since. John Hatton, Amy Steier, and Fil Menczer have all helped me explore aspects of FOA as part of their own research; Thomas Kammeyer, Chris Vogt, and Bryan Tower have all helped push various aspects of the FOA code base forward. I am also grateful to Apple Computer, Encyclopædia Britannica, and the National Science Foundation for funding various portions of our work over the years. Chris Rosin and Terry Jones provided useful feedback on some chapters, and Marti Hearst (University of California, Berkeley) and Paul Thompson (University of Minnesota and St. Thomas University) used early drafts of FOA with their classes. I am grateful to David Tranah and Shari Chappell for their rescue of FOA at Cambridge University Press.

Will, Lee, Cori, and Julie are my nearest and dearest family. Simply completing this book (finally!) is the best apology I can offer them. Beyond that... Whereof one cannot speak, one must remain silent.

It is here where I must say that despite the best efforts of these many friends and colleagues, I know I haven't said it all, and that mistakes surely remain. I have written down those things I wish I'd known when I began my thesis, for use by students in the classes I teach.
If it helps you avoid any of the mistakes it has taken me a decade to learn, it will almost have been worth it.