
Foreword

This is a book that I would like to have written. Like all good scientific accounts it has a story to tell, and it tells it in a way that makes it possible for the reader and student to understand the more technical aspects of information retrieval (IR). The title Finding Out About is highly significant, signaling that the book is concerned with the “process of actively seeking out information relevant to a topic of interest.” Every keyword in this quotation is important, and each corresponding topic is treated within the book. This is not to say that formal and mathematical aspects are not discussed; they are, but not without considerable motivation first.

From the preceding you may infer that I like the book, which is correct. For many years there has been a dearth of books in our field, especially a lack of textbooks. To a certain extent books in IR have started to fill gaps; however, there are still very few. This book is an excellent addition to that small number. As a textbook it provides many examples and exercises supported by software tools and data for experimentation. The emphasis is on system building and search engine construction rather than user modeling; this is a deliberate choice by the author.

One of the wonderful things about this textbook is that it considers information retrieval in the context of the Web. The history of our subject stretches back to before World War II (for example, the work of Fairthorne), and if you include manual systems (e.g., traditional libraries) then it goes back thousands of years. It is only in the last ten years that a new technology – the World Wide Web – has started to have an impact on people’s information-seeking activities. It is not easy to introduce current technology meaningfully and successfully into scientific discussions about IR, but Rik Belew has done just that. Students will come to this book with quite a sophisticated knowledge of the Web and will not be disappointed. From the point of view of a teacher introducing the Web as a vehicle for experimentation, it is ideal. Another attractive feature from an experimental point of view is that email data are used as an example of data to retrieve from, again data with which students will be very familiar.

The intellectual thrust of this book is well rooted in the IR tradition; theory and experiments are developed and presented in tandem. There is one, perhaps unique, approach that is most welcome. Belew draws on the not inconsiderable intellectual tradition of artificial intelligence (AI). As he points out, AI and IR in many respects developed in parallel and are often concerned with similar problems, but there has been little communication between the two fields. Recently this has begun to change a little; in particular, the very strong experimental methodology and statistical approach to natural language processing in IR has been embraced by AI researchers. In the reverse direction, AI research in approximate reasoning and machine learning has begun to have an impact on IR. At each stage in the book, when possible, bridges are shown between AI and IR, which is very refreshing.

The book is primarily concerned with text retrieval; a document is a piece of text. Other forms of retrieval – image, speech and video – are not discussed to any extent. This is not a disadvantage! Many of the ideas and techniques developed for text retrieval readily transfer to other media, and in an introductory book concerned with presenting fundamentals there is no need to clutter the text with other media. Besides, there is not the same experimental backing for work done in other media.

The way the story of IR unfolds is fairly traditional, following the path of many a research paper and book in IR. After the overview (which every student should read), Belew introduces the nature of data and the tools to manipulate and represent them. He then quickly moves on to the weighting and matching schemes, using formal explanations when needed. I particularly like the way he refers to the earlier work of Zipf, Mandelbrot, and Swanson. We also get an elementary introduction to one of the key developments in IR: the vector space model pioneered by Salton and his coworkers.

We are now ready to think about retrieval performance, and Chapter 4, “Assessing the Retrieval,” is devoted to just that. The difficult notion of relevance, its definition or lack thereof, is not shirked but taken head on. For example, and I quote: “For now, we simply observe that it seems quite likely that an assessment of a document’s relevance depends greatly on the ‘basket’ of other documents we have seen.” The student is left in no doubt that we are simplifying but not forgetting that we are doing so. This chapter introduces the basic IR experimental methodology with much clarity. There is an added bonus of a description of RAVe, a didactic tool that can be used by IR experimenters to collect large numbers of relevance assessments for an arbitrary document corpus. Both students and teachers will find this extremely useful.

The next chapter, entitled “Mathematical Foundations,” signaling that it is likely to be more difficult, introduces a mathematical account of some well-known retrieval models: latent semantic indexing, clustering and multidimensional scaling, probabilistic retrieval, and Bayesian networks. The neat thing here is that common themes across models, such as dimensionality reduction, are highlighted. It is cheering to see the correct attribution being made to the early work of Maron and Cooper. Chapter 6 has a distinctive information science flavor about it. It begins with a discussion of bibliometric and citation analysis, but then explores the development of these ideas in the context of the WWW, showing examples of recent work such as that of Kleinberg. A particularly intriguing section is on discovering latent knowledge within a corpus. It illustrates how, as described in Swanson’s work, a possible causal connection between magnesium deficiency and migraine can be made by searching the published literature in particular ways (very reminiscent of Lorenzo’s Oil).

Finally, we come to one of the centerpieces of IR research: adaptive information retrieval. Relevance feedback has been one of the great success stories of IR, and in Chapter 7 Belew discusses it from a number of different points of view. Experienced IR researchers might read this chapter first! It has the flavor of a manifesto, bringing ideas from machine learning to bear on a uniquely IR success while concentrating at all times on the IR issues. This is an exciting chapter, justifying Belew’s claim that “FOA is an especially ripe area for AI and machine learning.” The chapter ends with “But as AI has moved from a concern with manually constructed knowledge representations to machine learning, and as IR has begun to consider how indexing structures can change with use, these two methodologies have increasingly overlapped.”

The last chapter, “Conclusions and Future Directions,” reaches out into the future and makes good bedside reading.

C. J. “Keith” van Rijsbergen
University of Glasgow