1

Overview

Finding Out About. Reproduced by permission of The New Yorker*

1.1 Finding Out About – A Cognitive Activity

We are all forced to make decisions regularly, sometimes on the spur of the moment. But the rest of the time we have enough warning that it is possible to collect our thoughts and do some research that makes our decision as sound as it can be. This book is a closer look at the process of finding out about (FOA), research activities that allow a decision-maker to draw on others’ knowledge. It is written from a technical perspective, in terms of computational tools that speed the FOA activity in the modern era of the distributed networks of knowledge collectively known as the World Wide Web (WWW). It shows you how to build many of the tools that are useful for searching collections of text and other media. The primary argument advanced is that progress requires that we appreciate the cognitive foundation we bring to this task as academics, as language users, and even as adaptive organisms.

As organisms, we have evolved a wide range of strategies for seeking useful information about our environment. We use the term “cognitive” to highlight the use of internal representations that help even the simplest organisms perceive and respond to their world; as the organisms get less simple, their cognitive structures increase in complexity. Whether done by simple or complex organisms, however, the process of finding out about is a very active one – making initial guesses about good paths, using complex sets of features to decide if we seem to be on the right path, and proceeding forward.

As humans, we are especially expert at searching through one of the most complex environments of all: language. Its system of linguistic features is not derived from the natural world, at least not directly. It is a constructed, cultural system that has worked well since (by definition!) prehistoric times. In part, languages remain useful because they are capable of change when necessary. New features and new objects are noticed, and it becomes necessary for us to express new things about them, to form our reactions to them, and to express these reactions to one another.

Our first experience of language, as children and as a species, was oral – we spoke and listened. As children we learn Sprachspiele (word or language games) [Wittgenstein, 1953] – how to use language to get what we want. A baby saying “Juice!” is using the exclamation as a tool to make adults move; that’s what a word means. Such a functional notion of language, in terms of the jobs it accomplishes, will prove central to our conception of what keywords in documents and queries mean as part of the FOA task.

Beyond the oral uses of language, as a species we have also learned the advantages of writing down important facts we might otherwise forget. Writing down a list of things to do, which we might forget tomorrow, extends our limited memory. Some of these advantages accrue to even a single individual: We use language personally, to organize our thoughts and to conceive strategies.

Even more important, we use writing to say things to others. Writing down important, memorable facts in a consistent, conventional manner, so that others can understand what we mean and vice versa, further amplifies the linguistic advantage. As a society, we value reading and writing skills because they let us interpret shared symbols and coordinate our actions. In advanced cultures’ scholarship, entire curricula can be defined in terms of what Robert McHenry (Editor-in-Chief of Encyclopædia Britannica) calls “Knowing How to Know.”

It is easiest to think of the organism’s or human’s search as being for a valuable object, sweet pieces of fruit in the jungle, or (in modern times) a grocer that sells them. But as language has played an increasingly important role in our society, searching for valuable written passages becomes an end unto itself. Especially as members of the academic community, we are likely to go to libraries seeking others’ writings as part of our search. Here we find rows upon rows of books, each full of facts the author thought important, and endorsed by a librarian who has selected it. The authors are typically people far from our own time and place, using language similar but not identical to our own.

Of course the library contains many such books on many, many topics. We must Find Out About a topic of special interest, looking only for those things that are relevant to our search. This basic skill is a fundamental part of an academic’s job:

  • We look for references in order to write a term paper.
  • We read a textbook, looking for help in answering an exercise.
  • We comb through scientific journals to see if a question has already been answered.

We know that if we find the right reference, the right paper, the right paragraph, our job will be made much easier. Language has become not only the means of our search, but its object as well.

Today we can also search the World Wide Web (WWW) for others’ opinions of music, movies, or software. Of course these examples are much less of an “academic exercise”; Finding Out About such information commodities, and doing it consistently and well, is a skill on which the modern information society places high value indeed. But while the infrastructure forming the modern WWW is quite recent, the promise offered by truly connecting all the world’s knowledge has been anticipated for some time, for example, by H. G. Wells [Wells, 1938].

Many of the FOA searching techniques we will discuss in this text have been designed to operate on vast collections of apparently “dead” linguistic objects: files full of old email messages, CD-ROMs full of manuals or literature, Web servers full of technical reports, and so on. But at their core, each of these collections is evidence of real, vital attempts to communicate. Typically an author (explicitly or implicitly) anticipates the interests of some imagined audience and produces text that is a balance between what the author wants to say and what he or she thinks the audience wants to hear. A textual corpus will contain many such documents, written by many different authors, in many styles and for many different purposes. A person searching through such a corpus comes with his or her own purposes and may well use language in a different way from any of the authors. But each individual linguistic expression – the authors’ attempts to write, the searchers’ attempts to express their questions and then read the authors’ documents – must be appreciated for the word games [Wittgenstein, 1953] that they are. FOA is centrally concerned with meaning: the semantics of the words, sentences, questions, and documents involved. We cannot tell if a document is about a topic unless we understand (at least something of) the semantics of the document and the topic. This is the notion of about-ness most typical within the tradition of library science [Hutchins, 1978].

This means that our attempts to engineer good technical solutions must be informed by, and can contribute to, a broader philosophy of language. For example, it will turn out that FOA’s concern with the semantics of entire documents is well complemented by techniques from computational linguistics, which have tended to focus on syntactic analysis of individual sentences. But even more exciting is the fact that the recent availability of new types of electronic artifacts – from email messages and WWW corpora to the browsing behaviors of millions of users all trying to FOA – brings an empirical grounding for new theories of language that may well be revolutionary.

At its core, the FOA process of browsing readers can be imagined to involve three phases:

  1. asking a question;
  2. constructing an answer; and
  3. assessing the answer.

This conversational loop is sketched in Figure 1.1.

Step 1. Asking a Question

The first step is initiated by people whom (anticipating our interest in building a search engine) we’ll call users, and by their questions. We don’t know a lot about these people, but we do know they are in a particular frame of mind, a special cognitive state; they may be aware of a specific gap in their knowledge (or they may be only vaguely puzzled), and they’re motivated to fill it. They want to FOA... some topic.

Supposing for a moment that we were there to ask, the users may not even be able to characterize the topic, that is, to articulate their knowledge gap. More precisely, they may not be able to fully define characteristics of the “answer” they seek. A paradoxical feature of the FOA problem is that if users knew their question, precisely, they might not even need the search engine we are designing: Forming a clearly posed question is often the hardest part of answering it! In any case, we’ll call this somewhat befuddled but not uncommon cognitive state the users’ information need.

While a bit confused about their particular question, the users are not without resources. First, they can typically take their ill-defined, internal cognitive state and turn it into an external expression of their question, in some language. We’ll call their expression the query, and the language in which it is constructed the query language.

Step 2. Constructing an Answer

So much for the source of the question; whence the answer? If the question is being asked of a person, we must worry about equally complex characteristics of the answerer’s cognitive state:

  • Can they translate the user’s ill-formed question into a better one?
  • Do they know the answer themselves?
  • Are they able to verbalize this answer?
  • Can they give the answer in terms the user will understand?
  • Can they provide the necessary background knowledge for the user to understand the answer itself?

We will refer to the question-answerer as the search engine, a computer program that algorithmically performs this task. Immediately each of the concerns (just listed) regarding the human answerer’s cognitive state translates into extremely ambitious demands we might make of our computer system.

Throughout most of this book, we will avoid such ambitious issues and instead consider a very restricted form of the FOA problem: We will assume that the search engine has available to it only a set of preexisting, “canned” passages of text and that its response is limited to identifying one or more of these passages and presenting them to the users; see Figure 1.2. We will call each of these passages a document and the entire set of documents the corpus. Especially when the corpus is very large (e.g., assume it contains millions or even billions of documents), selecting a very small set (say 10 to 20) of these as potentially good answers to be retrieved will prove sufficiently difficult (and practically important) that we will focus on it for the first few chapters of this book. In the final chapters however, we will consider how this basic functionality can be extended towards tools for “Searching for an education” (cf. Section 8.3.9).
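The restricted problem just described can be sketched in a few lines. This is only a minimal illustration, not the search engine developed in later chapters: the three-document corpus, the function name `retrieve`, and the scoring rule (count the words a document shares with the query) are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical corpus: document ids and texts are invented for illustration.
corpus = {
    "d1": "neural networks for speech recognition",
    "d2": "robotics and machine learning",
    "d3": "knowledge representation in robotics",
}

def retrieve(query, corpus, k=2):
    """Return the ids of up to k documents sharing the most words with the query."""
    query_words = set(query.lower().split())
    scores = Counter()
    for doc_id, text in corpus.items():
        overlap = query_words & set(text.lower().split())
        if overlap:  # documents with no shared words are never retrieved
            scores[doc_id] = len(overlap)
    # Highest score first; ties are broken by document id so the result is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:k]]

top_documents = retrieve("robotics learning", corpus)
```

The response is nothing more than a short list of identifiers of canned passages; the engine composes no new text of its own.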

Step 3. Assessing the Answer

Imagine a special instance of the FOA problem: You are the user, waiting in line to ask a question of a professor. You’re confused about a topic that is sure to be on the final exam. When you finally get your chance to ask your question, we’ll assume that the professor does nothing but select the three or four preformed pearls of wisdom he or she thinks come closest to your need, delivers these “documents,” and sends you on your way. “But wait!” you want to say. “That isn’t what I meant.” Or, “Let me ask it another way.” Or, “That helps, but I still have this problem.”

The third and equally important phase of the FOA process “closes the loop” between asker and answerer, whereby the user (asker) provides an assessment of how relevant they find the answer provided. If after your first question and the professor’s initial answer you are summarily ushered out of the office, you have a perfect right to be angry because the FOA process has been violated. FOA is a dialog between asker and answerer; it does not end with the search engine’s first delivery of an answer. This initial exchange is only the first iteration of an ongoing conversation by which asker and answerer mutually negotiate a satisfactory exchange. In the process, the asker may recognize elements of the answer he or she seeks and be able to reexpress the information need in terms of threads taken from previous answers.

Because the question-answerer has been restricted to a simple set of documents, the asker’s relevance feedback must be similarly constrained; for each of the documents retrieved by the search engine, the asker reacts by saying whether or not the document is relevant. Returning to the student/professor scenario, we can imagine this as the student saying “Thanks, that helps” after those pearls that do and remaining silent or saying, “Huh?” or “What does that have to do with anything?!” or “No, that’s not what I meant!” otherwise. More precisely, relevance feedback gives askers the opportunity to provide more information with their reaction to each retrieved document – whether it is relevant (⊕), irrelevant (⊖), or neutral (#). This is shown as a Venn diagram–like labeling of the set of retrieved documents in Figure 1.3. We’ll worry about just how to solicit and make use of relevance feedback judgments in Chapter 4.
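One way to picture this labeling is as a three-valued judgment attached to each retrieved document. The sketch below is an assumption-laden illustration: the names `Judgment` and `partition_feedback` are hypothetical, the ASCII strings merely stand in for the three reactions, and treating an unjudged document as neutral is our own choice for the example, not something the text prescribes.

```python
from enum import Enum

class Judgment(Enum):
    RELEVANT = "+"
    IRRELEVANT = "-"
    NEUTRAL = "#"

def partition_feedback(retrieved, feedback):
    """Group retrieved document ids by the asker's reaction.

    Assumption: a retrieved document with no recorded reaction counts as neutral.
    """
    groups = {judgment: [] for judgment in Judgment}
    for doc_id in retrieved:
        groups[feedback.get(doc_id, Judgment.NEUTRAL)].append(doc_id)
    return groups

# Hypothetical session: the asker liked d1, rejected d3, and said nothing about d2.
retrieved = ["d1", "d2", "d3"]
feedback = {"d1": Judgment.RELEVANT, "d3": Judgment.IRRELEVANT}
groups = partition_feedback(retrieved, feedback)
```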

1.1.1 Working within the IR Tradition

If it seems to you that the last section has sidestepped many of the most difficult issues underlying FOA, you’re right! Later chapters will return to redress some of these omissions, but the immediate goal of Chapter 2, Chapter 3 and Chapter 4 is to “operationalize” FOA to resemble a well-studied problem within computer science, typically referred to as information retrieval (IR). IR is a field that has existed since computers were first used to count words [Belkin and Croft, 1987]. Even earlier, the related discipline of library science had developed many automated techniques for efficiently storing, cataloging, and retrieving physical materials so that browsing patrons could find them; many of these methods can be applied to the digital documents held within computers. IR has also borrowed heavily from the field of linguistics, especially computational linguistics.

The primary journals in the field and the most important conferences in IR have continued to publish and meet since the 1960s, but the field has taken on new momentum within the last decade. Computers capable of searching and retrieving from the entire biomedical literature, across an entire nation’s judicial system, or from all of the major newspaper and magazine articles, have created new markets among doctors, lawyers, journalists, students, everyone! And of course, the Internet, within just a few years, has generated many, many other examples of textual collections and people interested in searching through them.

The long tradition of IR is therefore the primary perspective from which we will approach FOA. Of course, every tradition brings with it tacit assumptions and preconceived notions that can hinder progress. In some ways, an elementary school student using the Internet to FOA class materials is related to the original problem considered by library science and IR, but in many other ways it couldn’t be more different (cf. Section 8.1). In this text, “FOA” will be used to refer to the broadest characterization of the cognitive process and “IR” to this subdiscipline of computer science and its traditional techniques. When we talk of the “search engine,” this is not meant to refer to any particular implementation, but to an idealized system most typical of the many different generations and varieties of actual search engines now in use. If you are using this text as part of a course, you may build one simple example of a search engine.

Using Figure 1.4 as a guide, we’ll return to each of the three phases and be a bit more specific about each component of our search engine. Here, finally, the human question-answerer has been replaced by an algorithm, the search engine, that will attempt to accomplish the same purpose. This figure also makes clear that the fundamental operation performed by a search engine is a match, between descriptive features mentioned by users in their queries and documents sharing those features. By far the most important kind of features are keywords.

1.2 Keywords

Keywords are linguistic atoms – typically words, pieces of words, or phrases – used to characterize the subject or content of a document. They are pivotal because they must bridge the gap between the users’ characterization of information need (i.e., their queries) and the characterization of the documents’ topical focus against which these will be matched. We could therefore begin to describe them from either perspective: how they are used by users, or how they become associated with documents. We will begin with the former.

1.2.1 Elements of the Query Language

If the query comes from a student during office hours or from a patron at a reference librarian’s desk, the query language they’ll use to frame their question is entirely natural, that most expressive “mother tongue” familiar to both question-asker and -answerer. But for the software search engines we will consider, we must assume a much more constrained “artificial” query language. Like other languages, ours will have both a meaningful vocabulary – the set of important keywords any user is allowed to mention in any queries – and a syntax that allows us to construct more elaborate query structures.

1.2.2 Topical Scope

The first constraint we can apply to the set of keywords we will allow in our vocabulary is to define a domain of discourse – the subject area within which each and every user of our search engine is assumed to be searching. While we might imagine building a truly encyclopedic reference work, one capable of answering questions about any topic whatsoever, it is much more common to build a search engine with more limited goals, capable of answering questions about some particular subject. We will choose the simpler path (it will prove enough of a challenge!) and focus on a particular topic. To be concrete, throughout this text we will assume that the domain of discourse is ARTIFICIAL INTELLIGENCE (AI). Briefly, AI can be defined as a subdiscipline of computer science, especially concerned with algorithms that mimic inferences which, had they been made by a human, would be considered “intelligent.” It typically includes such topics as KNOWLEDGE REPRESENTATION, MACHINE LEARNING, and ROBOTICS.

Thus COMPUTER SCIENCE is a broader term than ARTIFICIAL INTELLIGENCE. This hypernym relationship between the two phrases is something we will return to later (cf. Section 6.3). For example, our task becomes more difficult if we assume that the corpus of documents contains material on the broader topic of COMPUTER SCIENCE, rather than just (!) ARTIFICIAL INTELLIGENCE. Conversely, the topics KNOWLEDGE REPRESENTATION, MACHINE LEARNING, and ROBOTICS are all narrower terms, and our task would, caeteris paribus,* be made easier if we only had to help users FOA one of them.

Constraining the vocabulary so that it is exhaustive enough that any imaginable and relevant topic is expressible within the language, while remaining specific enough that any particular subjects a user is likely to investigate can be distinguished from others, will become a central goal of our design. ROBOTICS, for example, would seem a descriptive keyword because it identifies a relatively small subarea of ARTIFICIAL INTELLIGENCE. COMPUTER SCIENCE would be silly as a keyword (for this corpus), because we are assuming it would apply to every document and hence does nothing to discriminate them – it is too exhaustive. At the other extreme, ROBOTIC VACUUM CLEANERS FOR 747 AIRLINERS is almost certainly too specific.
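A rough way to see why a keyword like COMPUTER SCIENCE fails to discriminate is to measure what fraction of the corpus each keyword is assigned to. The sketch below assumes a tiny hand-indexed corpus invented for the example; `keyword_coverage` is a hypothetical helper, and no particular threshold for "too exhaustive" is implied.

```python
def keyword_coverage(index):
    """Map each keyword to the fraction of documents it has been assigned to."""
    n_docs = len(index)
    counts = {}
    for keywords in index.values():
        for keyword in set(keywords):  # count each keyword once per document
            counts[keyword] = counts.get(keyword, 0) + 1
    return {keyword: count / n_docs for keyword, count in counts.items()}

# Hypothetical index: every document in an AI corpus is also about COMPUTER SCIENCE.
index = {
    "d1": ["COMPUTER SCIENCE", "ROBOTICS"],
    "d2": ["COMPUTER SCIENCE", "MACHINE LEARNING"],
    "d3": ["COMPUTER SCIENCE", "ROBOTICS"],
    "d4": ["COMPUTER SCIENCE", "KNOWLEDGE REPRESENTATION"],
}
coverage = keyword_coverage(index)
```

Here COMPUTER SCIENCE covers the whole corpus and so separates no document from any other, while ROBOTICS picks out half of it. A keyword as specific as ROBOTIC VACUUM CLEANERS FOR 747 AIRLINERS would sit at the opposite extreme, with coverage at or near zero.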

The vocabulary size – the total number of keywords – depends on many factors, including the scope of the domain of discourse. A typical language user has a reading vocabulary of approximately 50,000 words. Web search engines and large test corpora formed from the union of many document types may require vocabularies ten times this size. It is unlikely that such a large lexicon of keywords would be required for restricted corpora, but it is also true that even a narrow field can develop an extensive, specialized jargon or terms of art. In practice, search engines typically have difficulty reducing the number of usable keywords to much below 10,000.

1.2.3 Document Descriptors

We’ve introduced keywords as features mentioned by users as part of their queries, but the other face of keywords is as descriptive features of documents. That is, we might naturally say that a document is about ROBOTICS. Users mentioning ROBOTICS in their query should expect to get those documents that are about this topic. Keywords must therefore also function as the documents’ description language. The same vocabulary of words used in queries must be used to describe the topical content of each and every document. Keywords become our characterization of what each document is about. Indexing is the process of associating one or more keywords with each document.
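Indexing, in this sense, produces a table from documents to keywords. To answer queries we usually want the reverse table, from each keyword to the documents it describes. The sketch below shows one common way to build that reverse table; the function name `invert` and the sample assignments are assumptions made for illustration.

```python
from collections import defaultdict

def invert(doc_keywords):
    """Turn a document -> keywords table into a keyword -> set-of-documents table."""
    postings = defaultdict(set)
    for doc_id, keywords in doc_keywords.items():
        for keyword in keywords:
            postings[keyword].add(doc_id)
    return dict(postings)

# Hypothetical result of indexing three documents.
doc_keywords = {
    "d1": ["NEURAL NETWORKS", "SPEECH RECOGNITION"],
    "d2": ["ROBOTICS"],
    "d3": ["NEURAL NETWORKS", "ROBOTICS"],
}
postings = invert(doc_keywords)
```

A user who mentions ROBOTICS can then be handed `postings["ROBOTICS"]` directly, without scanning every document.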

The vocabulary used can either be controlled or uncontrolled (a.k.a. closed vocabularies or open vocabularies). Suppose we decide to have all the documents in our corpus manually indexed by their authors; this is quite common in many conference proceedings, for example. If we provide a list of potential keywords and tell authors they must restrict their choices to terms on this list, we are using a controlled indexing vocabulary. On the other hand, if we allow the authors to assign any terms they choose, the resulting index has an uncontrolled vocabulary [Svenonius, 1986].
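The difference between the two regimes can be made concrete with a small check applied to whatever keywords an author proposes. Everything named here is hypothetical: the four-term vocabulary, the function `index_document`, and the decision to compare keywords in upper case are assumptions of the sketch.

```python
# Hypothetical controlled vocabulary supplied to authors.
CONTROLLED_VOCABULARY = {
    "KNOWLEDGE REPRESENTATION",
    "MACHINE LEARNING",
    "NEURAL NETWORKS",
    "ROBOTICS",
}

def index_document(proposed, vocabulary=None):
    """Return (accepted, rejected) lists of keywords.

    With vocabulary=None the index is uncontrolled and every term is accepted.
    """
    normalized = [keyword.strip().upper() for keyword in proposed]
    if vocabulary is None:
        return normalized, []
    accepted = [keyword for keyword in normalized if keyword in vocabulary]
    rejected = [keyword for keyword in normalized if keyword not in vocabulary]
    return accepted, rejected

proposed = ["robotics", "stock market analysis"]
controlled = index_document(proposed, CONTROLLED_VOCABULARY)
uncontrolled = index_document(proposed)
```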

To get a feel for the indexing process, imagine that you are given a piece of text and must come up with a set of keywords that describe what the document is about. Let’s make the exercise more concrete. You are the author of a report entitled USING A NEURAL NETWORK FOR PREDICTION, and you are submitting it to a journal. One of the things this particular journal requires is that the author provide up to six keywords under which this article will be indexed. If you are sending it to the Communications of the ACM, you might pick a set of keywords that identify, to the audience of computer scientists you think read this publication, connections between this new work and prior work in related areas: NONLINEAR REGRESSION; TIME SERIES PREDICTION.

But now imagine that you’ve decided to submit the exact same paper to Byte magazine, and you must again pick keywords that have meaning to this audience. You might choose: NEURAL NETWORKS; STOCK MARKET ANALYSIS.

What is the context in which these keywords are going to be interpreted? Who’s the audience? Who’s going to understand what these keywords mean? Anticipating the FOA activity in which these keywords will function, we know that the real issue to be solved is not only to describe this one document, but to distinguish it from the millions of others in the same corpus. How are the keywords chosen going to be used to distinguish your document from the others?

It is often easiest to imagine keywords as independent features of each document. In fact, however, keywords are best viewed as a relation between a document and its prospective readers, sensitive to both characteristics of the users’ queries and other documents in the same corpus. In other words, the keywords you pick for Byte should be different from those you pick for Communications of the ACM, and for deeper reasons than what we might cynically consider “spin control.”

1.3 Query Syntax

Keywords therefore have a special status in IR and as part of the FOA process. Not only must they be exhaustive enough to capture the entire topical scope reflected by the corpora’s domain of discourse, but they must also be expressive enough to characterize any information needs the users might have.

Of course we need not restrict our users to only one of these keywords. It seems quite natural for queries to be composed of two or three, perhaps even dozens, of keywords. Recent empirical evidence suggests that many typical queries have only two or three keywords (cf. Section 8.1), but even this number provides a great combinatorial extension to the basic vocabulary of single keywords. Other applications, for example, using a document itself as a query (i.e., using it as an example: “Give me more like this”), can generate queries with hundreds of keywords. Regardless of size, queries defined only as sets of keywords will be called simple queries. Many Web search engines support only simple queries. Often, however, the search engines also provide more advanced interfaces, including operators in the query language. Perhaps, because you have previously been warped by an exposure to computer science :), you think that sets of keywords might be especially useful if joined by Boolean operators. For example, if we have one set of documents about NEURAL NETWORKS and another set of documents about SPEECH RECOGNITION, we can expect the query: NEURAL NETWORKS AND SPEECH RECOGNITION to correspond to the intersection of these two sets, while NEURAL NETWORKS OR SPEECH RECOGNITION would correspond to their union.

The Boolean NOT operator is a bit more of a problem. If users say they want things that are not about NEURAL NETWORKS, they are in fact referring to the vast majority of the corpus. That is, NOT is more appropriately considered a binary, subtraction operator. To make this distinction explicit we will call it BUT_NOT.
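These Boolean connectives map directly onto set operations. The sketch below uses Python's built-in sets; the document ids and the two keyword sets are invented for the example.

```python
# Hypothetical corpus of six documents and the sets indexed under two keywords.
corpus = {"d1", "d2", "d3", "d4", "d5", "d6"}
neural_networks = {"d1", "d2", "d3"}
speech_recognition = {"d2", "d3", "d4"}

# NEURAL NETWORKS AND SPEECH RECOGNITION: intersection of the two sets.
and_result = neural_networks & speech_recognition

# NEURAL NETWORKS OR SPEECH RECOGNITION: union of the two sets.
or_result = neural_networks | speech_recognition

# A unary NOT NEURAL NETWORKS must be taken against the whole corpus,
# which in a realistic collection is nearly everything.
not_result = corpus - neural_networks

# SPEECH RECOGNITION BUT_NOT NEURAL NETWORKS: binary set subtraction.
but_not_result = speech_recognition - neural_networks
```

Even in this toy corpus the unary NOT returns half the collection, while BUT_NOT stays inside the documents the user has already asked for.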

There are other syntactic operators that are often included in a search engine’s query language, but discussion of these will be put off until later. Even with these simple Boolean connectives and a keyword vocabulary of reasonable size, users can construct a vast number of potential queries when attempting to express their information need.

1.3.1 Query Sessions

As we consider the specific features of each query, it is important to remember the role these short expressions play in the larger FOA process. Queries are generated as an attempt by users to express their information need. As with any linguistic expression, conveying a thought you have can be difficult, and this is likely to be especially true of the muddled cognitive state of our FOA searcher. Users who are familiar with the special syntactic features of a query language may be able to express their need more easily, but others for whom this unnatural syntax is new or difficult will have additional difficulties.

As with many of the idealizing assumptions we are at least temporarily making, it is often simpler to think about only one iteration of the three-step query/retrieve/assess FOA process at a time. In most realistic situations we can expect that single queries will not occur in isolation but as part of an iteration of the FOA process. An initial query begins the dialog; the search engine’s response provides clues to the user about directions to pursue next; these are expressed as another query. An abstract view of this sequence is presented in Figure 1.5. Note especially the concatenation of a series of basic FOA three-step iterations. Data are produced by the user, then by the search engine, and then by the user; this constructs a very natural alternation of user–search engine exchanges. Users’ assessments can also function as their next query statement. This can be achieved simply if we have some method for automatically constructing a query from relevance feedback. For example, if users click on documents they like, the search engine can, by itself, form a new query that focuses on those keywords that are especially associated with these documents.
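To make the last idea concrete, here is one simple way a search engine might form a new query from the documents a user clicked on. It is only a sketch under stated assumptions: "especially associated" is interpreted here as "assigned to a larger share of the liked documents than of the corpus as a whole," and the function name `feedback_query` and the sample index are hypothetical. Chapter 4 treats the real techniques.

```python
def feedback_query(doc_keywords, liked, n=2):
    """Pick up to n keywords over-represented in the liked documents.

    Score = (share of liked documents with the keyword)
          - (share of all documents with the keyword).
    Only keywords with a positive score are returned.
    """
    if not liked:
        return []
    n_docs = len(doc_keywords)
    candidates = {kw for doc_id in liked for kw in doc_keywords[doc_id]}
    scores = {}
    for kw in candidates:
        liked_share = sum(kw in doc_keywords[d] for d in liked) / len(liked)
        corpus_share = sum(kw in kws for kws in doc_keywords.values()) / n_docs
        scores[kw] = liked_share - corpus_share
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [kw for kw, score in ranked[:n] if score > 0]

# Hypothetical index and a user who clicked on d1 and d2.
doc_keywords = {
    "d1": {"NEURAL NETWORKS", "SPEECH RECOGNITION"},
    "d2": {"NEURAL NETWORKS", "SPEECH RECOGNITION"},
    "d3": {"NEURAL NETWORKS", "ROBOTICS"},
    "d4": {"ROBOTICS", "MACHINE LEARNING"},
}
next_query = feedback_query(doc_keywords, liked=["d1", "d2"])
```

SPEECH RECOGNITION ranks ahead of NEURAL NETWORKS because it is rarer in the rest of the corpus, so it says more about what these two documents have in common.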

There are many such techniques for using relevance feedback from a single query/retrieval, and there are many more things we can learn from the entire query session. The full query session provides more complete evidence about the users’ information need than we can gain from any one query. In fact, as will be discussed extensively in Chapter 7, there exist algorithmic means by which the search engine itself might “learn” from such evidence. Learning methods might even be expected to make transitive leaps, from the users’ initial expressions of their information needs to the final documents that satisfied them. (Of course, this transitive leap is only warranted if we are certain that users ended the session satisfied and aren’t just quitting in frustration!) For all these reasons, we must try to identify a query session’s boundaries, that is, when one focused search session ends and the next session, involving the same user searching on a different topic, begins.
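The text does not say how session boundaries should be found. One crude proxy, offered here purely as an assumption, is to start a new session whenever the same user has been silent for longer than some threshold. The function name `split_sessions` and the 30-minute default are illustrative choices only; a long pause is not the same thing as a change of topic, which is what the definition above actually asks for.

```python
def split_sessions(timestamps, max_gap_seconds=1800):
    """Group one user's query timestamps (in seconds) into sessions.

    Assumption: a pause longer than max_gap_seconds marks a session boundary.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_gap_seconds:
            sessions[-1].append(t)  # close enough to the previous query
        else:
            sessions.append([t])    # first query, or a long pause: new session
    return sessions

# Hypothetical log: three quick queries, a long pause, then two more.
sessions = split_sessions([0, 60, 120, 5000, 5100])
```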

1.4 Documents

When “documents” were first introduced as part of the FOA process, it was as one of the set of potential, predefined answers to users’ queries. Here we will ground this abstract view in practical terms that can be readily applied, for example, to the searches that are now common on the Web. Our goal will be to balance this practical description of how search engines work today with the abstract FOA view that goes beyond current practices to other kinds of searches still to come.

A useful working definition is that a document is a passage of free text. It is composed of text, strings of characters from an alphabet. We’ll typically make the (English) assumption that the text uses the Roman alphabet, Arabic numerals, and standard punctuation. Complications such as font style (italics, bold), non-Roman marked alphabets that add characters like ä, Ç, Ñ, and æ, and the iconic characters of Asian languages require even more thought.

By “free” text we mean it is in natural language, the sort native readers and writers use easily. Good examples of free text might be a newspaper article, a journal paper, or a dictionary definition. Typically the text will be grammatically well-formed language, in part because this is written language, not oral. People are more careful when constructing written artifacts that last beyond the moment. Informal texts like email messages, on the other hand, help to point to ways that some texts can retain the spontaneity of oral communication, for better and worse [Ong, 1982].

Finally, we will be interested in passages of such text, of arbitrary size. The newspaper example makes us imagine documents of a few thousand words, but journal articles make us think of samples ten times larger, and email messages make us think of something only a tenth that size. We can even think of an entire book as a single document. All such passages satisfy our basic definition; they might be appropriate answers to a search about some topic.

The length of the documents will prove to be a critical issue in FOA search engine design, especially when some corpus contains documents of widely varying lengths. This is because longer documents can discuss more topics, so they are capable of being about more. Longer documents are more likely to be associated with more keywords, and hence they are more likely to be retrieved (cf. Section 3.4.2).

One possible response is to make a simple but very consequential assumption.

ASSUMPTION 1. All documents have equal about-ness.

In other words, if we ask the (a priori) probability of any document in the corpus being considered relevant, we will assume that all are equiprobable. This would lead us to normalize documents’ indices in some way to compensate for differing lengths. The normalization procedure is a matter of considerable debate; we will return to consider it in depth later (cf. Section 3.4.2).
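A toy example shows why some normalization is needed. The sketch compares a raw count of query-word occurrences with the same count divided by document length. Dividing by length is only one simple possibility, chosen here for illustration; it is not the procedure discussed in Section 3.4.2, and the two documents are invented.

```python
def raw_score(query_words, doc_words):
    """Count how many word occurrences in the document match the query."""
    return sum(1 for word in doc_words if word in query_words)

def normalized_score(query_words, doc_words):
    """Raw score divided by document length (one simple normalization, assumed here)."""
    if not doc_words:
        return 0.0
    return raw_score(query_words, doc_words) / len(doc_words)

query = {"robotics"}
short_doc = ["robotics", "survey"]             # 2 words, 1 match
long_doc = ["robotics"] * 3 + ["filler"] * 27  # 30 words, 3 matches

raw = (raw_score(query, short_doc), raw_score(query, long_doc))
normalized = (normalized_score(query, short_doc), normalized_score(query, long_doc))
```

By raw count the long document wins, simply because it has more room for matches. After dividing by length, the short document, half of which is about the query, scores higher.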

For now, we will take a different tack toward the issue of document length, as captured by an alternative pair of assumptions.

ASSUMPTION 2 The smallest unit of text with appreciable about-ness is the paragraph.
ASSUMPTION 3 All manner of longer documents are constructed out of basic paragraph atoms.

The first piece of this argument is that the smallest sample of text that can reasonably be expected to satisfy an FOA request is a paragraph. The claim is that a word, even a sentence, does not by itself provide enough context for any question to be answered or “found out about.” But if the paragraph has been well constructed, as defined by conventional rules of composition, it should answer many such questions. And unless the text comes from James Joyce, Proust, or Jorge Luis Borges, we can expect paragraphs to occupy about half an average screen page – nicely viewable chunks.
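If paragraphs are to be the atoms, a longer document first has to be cut into them. The sketch below assumes, purely for illustration, that paragraphs are separated by one or more blank lines; real corpora may mark paragraphs quite differently, and the name `split_paragraphs` is hypothetical.

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs, treating blank lines as separators.

    Assumption: a paragraph boundary is one or more blank lines.
    Line breaks inside a paragraph are collapsed to single spaces.
    """
    parts = re.split(r"\n\s*\n", text.strip())
    return [" ".join(part.split()) for part in parts if part.strip()]

# Hypothetical three-paragraph document.
document = "First paragraph,\nstill first.\n\nSecond paragraph.\n\n\nThird."
paragraphs = split_paragraphs(document)
print(paragraphs)
```

Each element of `paragraphs` could then be indexed as a document in its own right.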

Assumption 3 alludes to the range of structural relationships by which the atomic paragraphs can typically be strung together to form longer passages. First and foremost is simple sequential flow, the order in which an author expects the paragraphs to be read. The sequential nature of traditional printed media, from the first papyrus scrolls to modern books and periodicals, has meant that a sequential ordering over paragraphs has been dominant. It may even be that the modern human is especially capable of understanding rhetoric of this form (cf. Section 6.2.3).

In any case, a sequential ordering of paragraphs is just one possible way they might be related. Other common relationships include:

  • a hierarchical structure composing paragraphs into subsections, sections, and chapters;
  • footnotes, embellishing the primary theme;
  • bibliographic citations to other, previous publications;
  • references to other sections of the same document; and
  • pedagogical prerequisite relationships ensuring that conceptual foundations are established prior to subsequent discussion.

Of course each of these relationships has grown up within the tradition of printed publication. Special typographical conventions (boldface, italics, sub- and superscripting, margins, rules) have arisen to represent them and distinguish them from sequential flow.

But new, electronic media now available to readers (and becoming available to authors) need not follow the same strictly linear flow. The new capabilities and problems of traversing text in nonlinear ways – hypertext – have been discussed by some visionaries [Bush, 1945; Nelson, 1987] for decades. This new technology certainly permits us to make some traversals more easily (e.g., jumping to a cited reference with the click of a button rather than a trip to the library), but this same ease may make it more difficult for an author to present a cogent argument.

For now we will not worry about how arguments can be formed with nonlinear hypermedia. Assumptions 2 and 3 simply allow us to infer Assumption 1: If all the documents are paragraphs, we can expect them to have virtually uniform about-ness. These are also simplifying assumptions, however. In an important sense, a scientific paper’s abstract is about the same content as the rest of the paper, and a newspaper article’s first paragraph attempts to summarize the details of the following story. These issues of a text’s level of treatment will be discussed later.

1.4.1 Structured Aspects of Documents

In addition to their free text, many documents will carry meta-data that gives some facts about the document. We may have publication information, for example, that this document appeared in this journal, in this issue, on this page. We are likely to know the author(s) of the document. Queries will often refer to aspects of both free text and meta-data.

QUERY 1 I’m interested in documents about Fools’ Gold that have been published in children’s magazines in the last five years.

The first portion of this query depends on the same about-ness relation that is at the core of our FOA process. But the last two criteria, concerning publication type and date, seem to be just the sort of query against structured attributes that database systems perform very successfully. In most real-life applications a hybrid of database and IR technologies will be necessary. (We distinguish between these techniques in Section 1.6.)
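A hybrid of this kind might be sketched as follows: structured attributes are tested exactly, database-style, and the surviving documents are then ranked by a crude keyword overlap standing in for about-ness. The field names `pub_type` and `year`, the function `hybrid_search`, and the sample records are all hypothetical, and treating "children's magazine" as a clean attribute value is itself a strong assumption.

```python
from dataclasses import dataclass

@dataclass
class Document:
    text: str      # free text
    pub_type: str  # structured attribute (hypothetical field)
    year: int      # structured attribute (hypothetical field)

def hybrid_search(docs, keywords, pub_type, min_year):
    """Filter exactly on meta-data, then rank by keyword overlap."""
    wanted = {k.lower() for k in keywords}
    hits = []
    for doc in docs:
        # Database-style test: exact match on structured attributes.
        if doc.pub_type != pub_type or doc.year < min_year:
            continue
        # IR-style test: count how many query keywords occur in the text.
        overlap = len(wanted & set(doc.text.lower().split()))
        if overlap > 0:
            hits.append((overlap, doc))
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in hits]

# Hypothetical corpus.
docs = [
    Document("fools gold looks like real gold", "children's magazine", 2023),
    Document("fools gold and pyrite chemistry", "journal", 2023),
    Document("fools gold for young readers", "children's magazine", 2010),
    Document("gardening tips for kids", "children's magazine", 2024),
]
results = hybrid_search(docs, ["fools", "gold"], "children's magazine", 2020)
print(results)
```

Only the first record survives both the structured filter and the keyword test.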

The most interesting examples concern characteristics that do not clearly fall into either IR or database categories. For example, can you define precisely what you mean by a “children’s magazine” in terms of unambiguous attributes on which a database would depend? Consider another query.

QUERY 2 What sort of work has K. E. Smith done on metabolic proteins affecting neurogenesis?

Finding an exact match for the string K. E. Smith in the AUTHORS attribute is straightforward. But the conventions in much of medical and biological publication (as well as in some areas of physics) sometimes lead to dozens of authors on papers, from the director of the institute through all of the laboratory assistants. Although K. E. Smith might well fulfill the syntactic requirements of authorship on a particular paper, users searching for “the work of” this person might well have a more narrowly defined semantic relationship in mind.

1.4.2 Corpora

We have focused on individual documents, but of course the FOA problem would not interest us except that we are typically faced with a corpus of millions of such documents, and we are interested in finding only the handful that are of interest. The actual number of documents and their cumulative size will matter a great deal, as some of our IR methods have time or space complexities that make them viable only within certain parameters. To pick a simple example, if you are trying to find a newspaper article (you read it a few days ago) for a friend, exhaustively searching through all the pages is probably quite effective if you know it was in Friday’s paper, but not if you need to search through an entire month’s recycling pile! Similarly, a standard utility like the Unix grep command can be a practical alternative if the corpus is small and the queries simple.
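The exhaustive, grep-like strategy can be sketched in a few lines. This is a minimal illustration assuming the whole corpus fits in a dictionary in memory; the name `linear_scan` and the sample headlines are hypothetical.

```python
def linear_scan(corpus, query):
    """Return ids of documents whose text contains the query string.

    Every document is read on every query, so the cost grows in
    proportion to the total size of the corpus.
    """
    needle = query.lower()
    return [doc_id for doc_id, text in corpus.items() if needle in text.lower()]

# Hypothetical three-document corpus.
corpus = {
    "fri": "City council approves new speed limit",
    "sat": "Local team wins final",
    "sun": "Speed limit debate continues",
}
print(linear_scan(corpus, "speed limit"))
```

For a weekend's worth of headlines this is perfectly adequate; for millions of documents, rereading everything on each query is what an index is meant to avoid.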

1.4.3 Document Proxies

Do you remember the library’s original card catalogs, those wooden, beautifully constructed cabinets full of rows and rows of drawers, each full of carefully typed index cards? The card catalog contained proxies – abridged representations of documents, acting as their surrogate – for the books it indexed. No one expected the full text of the books to actually be found in the drawers.

Computerized card catalogs are only capable of supporting a similar function. They do allow more extensive indexing and efficient retrieval, from terminals that might be accessed far from the library building. At the heart of this system is a text search engine capable of matching features of a query against book titles. (Margin note: Card catalogs were the first search engines.) Just like with the original index cards, however, retrieval is limited to some proxy of the indexed work, a bibliographic citation, or perhaps even an abstract. The text of ultimate interest – in a book, magazine, or journal – remains physically quite distinct from the search engine used to find it.

As computer storage capacities and network communication rates have exploded, it has become increasingly common to find retrieval systems capable of presenting the full text of retrieved items. In the modern context, proxies extend beyond the bibliographic citation information and subject headings we associate with card catalogs and include a document’s title, an article’s abstract, a judicial opinion’s headnote, or a book’s table of contents.

The distinction between the search engine retrieving documents and retrieving proxies remains important, however, for at least two reasons. First, the radically changing technical capabilities of libraries (and computers and networks more generally) can create conceptual confusion about just what the search engine is doing. While it has been possible for a decade or more to get the full text of published journal articles through commercial systems such as DIALOG and Lexis/Nexis, free access to these through your public library would have been almost unheard of until quite recently. In fact, most libraries did not even try to index individual articles in their periodical collections. Changing technical capacities, changes in the application of intellectual property laws, changes in the library’s role, and resulting changes in the publishing industry are radically altering the traditional balance. Even when all new publications are easily available electronically, the issue of retrospectively capturing previously published books and journals remains unresolved.

Looking far into the future and assuming no technical, economic, or legal barriers to a complete rendering of any document in our corpus, there is still an important reason to consider document proxies. Recall that FOA is a process we are attempting to support and that retrieving sets of documents to show users is a step we expect to repeat many times. Proxies are abridged versions of the documents that are easier for browsing users to quickly scan and react to (i.e., provide relevance feedback) than if they had to read the entire document. If a document’s title is accurate (if its abstract is well written, if its bibliographic citation is complete), this proxy may provide enough information for users to decide if it seems relevant. (Margin note: A misleading title, or did the document teach you something?!)
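One very simple way to picture a proxy is as a small record holding a title and the opening words of the full text. The sketch below is only an assumption about what such a record might contain; the `Proxy` class, the `make_proxy` function, and the 30-word default are hypothetical choices.

```python
from dataclasses import dataclass

@dataclass
class Proxy:
    doc_id: str
    title: str
    snippet: str  # abridged stand-in for the full text

def make_proxy(doc_id, title, full_text, max_words=30):
    """Build a proxy from a title and the first max_words words of the text."""
    words = full_text.split()
    snippet = " ".join(words[:max_words])
    if len(words) > max_words:
        snippet += " ..."  # signal that the text was truncated
    return Proxy(doc_id, title, snippet)

# Hypothetical document.
proxy = make_proxy("d1", "Fools' Gold", "Pyrite is often mistaken for gold by novices", max_words=3)
print(proxy.snippet)
```

A browsing user scans the short `snippet` and reacts to it; the full text is fetched only if the proxy looks promising.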

1.4.4 Genre

A more subtle characteristic of documents that may need to concern us is their genre – the voice or style in which a document is written. You would, um, like, be pretty darn surprised to find stuff like this in a textbook, but not if it came to you over the phone. The genre of email seems to be settling somewhere between typical printed media and spoken conversation, with special markings of sarcasm:) and expletives #!?% common. Newspaper journalists are carefully trained to produce articles consistent with what newspaper readers expect, and their editors are paid to ensure that these stories maintain a consistent voice. Scientific journal articles are written to be understood by peers in the same field, according to standards that pertain to that community [Bayermann, 1988]. An important component of this audience focus is the vocabulary choice an author makes (cf. Section 8.2.1); stylistic variations and document structure may also differ. In a field like psychology, for example, it would be difficult to get a paper accepted in some journals if it is not subdivided into sections like Hypothesis, Methodology, and Subjects. Legal briefs are also written in highly conventionalized forms [Harvard Law Review Association, 1995], and legislation is drafted to satisfy political realities [Allen, 1980; Goodrich, 1987; Levi, 1982; Nerhot, 1991].

In part, these variations in genre are difficult to detect because they remain consistent within any single corpus. That is, the typical email message would jump out at you as out of place if it appeared in your newspaper, but probably not if it were on the Letters to the Editor page. These examples highlight how much context about the corpus we bring with us whenever we read a particular document. They also foreshadow problems Web searchers are just beginning to appreciate, as WWW search engines include every document to which they can crawl, intermixing their very different contexts and writing styles. Without the orienting features of the newspaper’s masthead, the “Letters to the Editor” rubric, or the purposeful selection of a tool that scans only Usenet news, the browsing users’ ability to understand an arbitrary document is diminished. Individual textual passages have been stripped of much of the context that made them sensible. As more and more of us generate content – in new hypermedia forms as well as traditional publications – that more and more of us retrieve, the range of genres we will experience can only increase, and our methods for FOA must help to represent not just the document but contextual information as well.

1.4.5 Beyond Text

Our definition of “documents” has hewn closely to the printed forms that still dominate the FOA retrievals most people now do. But print media are not the only form of answer we might reasonably seek, and we must ensure that our methods generalize to the other media that are increasingly part of the Net. Sound, images, movies, maps, and more are all appearing as part of the WWW, and they are typically intermixed with textual material. We need to be able to search all of these.

One reason for casting the central problem of this text as “finding out about” is that many aspects of multimedia retrieval remain the same from this perspective. We still have users, who have information needs. We can still reasonably use the term “document” to include any potential answer to users’ queries, but now we expand this term to include whatever media are available. Most centrally, we must still characterize what each document is about in order to match it to these queries, and users can still assess how well the search engine has done.

At the same time, many parts of the FOA problem change as we move away from textual documents to other media. Most important is the increased difficulty of algorithmically extracting clues related to the documents’ semantic content from their syntactic features. The primary source of semantic evidence used within text-based IR is the relative frequencies of keywords in document corpora, and a major portion of this text will show that this is a powerful set of clues indeed. We will also discuss the role other syntactic clues (e.g., bibliographic links) associated with texts can play in understanding what they are about. As we move to other media, the important question becomes what consistent features these new media have that we can also process to reliably infer semantic content. For example, what can we know about an image from the distribution of its pixel values? Do all SUNSETS share a brightness profile (dark below a horizontal line, symmetrically bright above it) that is reliable enough that this clue can be exploited to identify just these scenes? (Margin note: Signature of human culture?!) If so, can this mode of analysis be generalized sufficiently to allow retrieval of images based on more typical descriptors such as CHILDREN FEEDING ANIMALS?

Even if we imagine that certain obvious, superficial aspects of some images may be extracted, our hopes must not blind us to the rich vocabulary that many images use every day. Consider a query like FIDELITY AS A POLITICAL ISSUE and consider Figure 1.6. Would any reasonable person claim that they could provide an exhaustive list of all the things these pictures “say”? Did you include the set of Hillary’s jaw? The angle of Bill’s gaze? The attitudes about divorce prevalent when the Doles’ picture was taken and now? The tacit commentary by the editors of The New York Times produced by the juxtaposition of these two photos? Note also that this picture (and its selection for use in this text!) occurred years before anyone had even heard of Monica Lewinsky! (Margin note: MONICA the meme.)

Figure 1.7 gives a second example. This is a photograph of a locking display case, containing a concert performance schedule. Pasted over the glass of the case is a sign, saying: “IGNORE THIS CALENDAR: THESE DATES ARE 3 YEARS OLD.” But the photo also reveals a number of more subtle clues – that the key to the case has been lost (for three years!), that some frustrated teacher finally got tired of dealing with confused parents, that none of the school’s administrators can think of a more imaginative solution.

These examples may seem far-fetched. But those of you old enough to remember the Cold War may also remember that there was an entire job category known as “Kremlinologist”: someone expert at divining various power shifts among the Politburo based on evidence such as where various participants were placed within group photos! The conventional wisdom is that “a picture is worth a thousand words,” and although some images may not require much explanation, others speak volumes. As we move from still images to movies, entirely new channels for meaning – conveyed with the camera’s attentional focus, soundtrack, etc. – are available to a skilled director. Music itself has an equally rich but distinct vocabulary. The ability to easily record and transmit digital spoken documents (speech) makes this form of audio especially worthy of analysis [Sparck Jones et al., 1996].

As with text, music, film, and motion pictures all predate their representations on computers. The convenience and availability of all these electronic media make it more possible and even more important to analyze them.

Once again, text is an excellent place to begin. Semiotics is one label for the subfield of linguistics concerned with words as symbols, as conveyors of meaning. Words in a language represent a particularly coherent system of symbol use, but so do the symbols used by photo journalists, painters, and movie directors. The meaning of these symbols changes with time; recall the pictures of the Clintons and Doles, their interpretation at the time of publication, and their interpretation now. What these pictures meant in their original context of 1996 differs from what they mean now. And again, complex, shifting meanings are typical not only of images but of documents as well: Watson and Crick’s publication of the structure of DNA in Nature in 1953 [Watson and Crick, 1953] was important even then, but what that paper means now could not have been anticipated.

Yet the prospects for associating contentful descriptors with images and even richer media are not quite as bleak as they might seem. In many important cases (e.g., the archives of news photos maintained by magazines and newspapers), images are accompanied by captions, and video streams with transcripts. This additional manually constructed textual data means that techniques for inferring semantic content directly from images can piggyback on top of text-based IR techniques. In conjunction with the machine learning techniques we will discuss (cf. Chapter 7), statistically reliable associations found in captioned image and video corpora can be extrapolated to situations where we have images without captions and video without transcripts.

In the interim, we will return to the narrower, text-only notion of a document with which we began and consider FOA solutions for this simpler (!) case.

1.5 Indexing

Indexing is the process by which a vocabulary of keywords is assigned to all documents of a corpus. Mathematically, an index is a relation mapping each document to the set of keywords that it is about:

$$\mathrm{Index}: doc_i \mapsto \{kw_j\}$$

The inverse mapping captures, for each keyword, the documents it describes:

$$\mathrm{Index}^{-1}: kw_j \mapsto \{doc_i\}$$

This assignment can be done manually or automatically. Manual indexing means that people, skilled as natural language users and perhaps with expertise in the domain of discourse, have read each document (at least cursorily) and selected appropriate keywords for it. Automatic indexing refers to algorithmic procedures for accomplishing this same result. Because the index relation is the fundamental connection between the users’ expressions of information need and the documents that can satisfy them, this simply stated goal – “Build the Index relation” – is at the core of the IR problem and FOA generally.
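The index relation and its inverse can be sketched with ordinary dictionaries of sets. This is a minimal illustration that assumes every lowercased, whitespace-delimited token counts as a keyword; the names `build_index` and `invert` and the sample corpus are hypothetical.

```python
from collections import defaultdict

def build_index(corpus):
    """Map each document id to the set of tokens (keywords) it contains.

    Assumption: every lowercased, whitespace-delimited token is a keyword.
    """
    return {doc_id: set(text.lower().split()) for doc_id, text in corpus.items()}

def invert(index):
    """Map each keyword to the set of document ids it describes."""
    inverse = defaultdict(set)
    for doc_id, keywords in index.items():
        for keyword in keywords:
            inverse[keyword].add(doc_id)
    return dict(inverse)

# Hypothetical corpus.
corpus = {
    "d1": "fools gold is pyrite",
    "d2": "gold prices rise",
    "d3": "pyrite crystal structure",
}
index = build_index(corpus)
inverse = invert(index)
print(sorted(inverse["gold"]))
```

A query keyword is answered from `inverse` by a single lookup, without rereading any document.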

1.5.1 Automatically Selecting Keywords

We begin by considering the document at its most mechanical level, as a string of characters. Our first candidates for keywords will be tokens: strings of characters delimited by white space. That is, each token in the document could be considered one of its keywords.

How good is this simple solution? Suppose users ask for documents about CARS and the document we are currently indexing has the string CAR. It seems reasonable to assume that users are interested in this document, despite the fact that the query happens to contain the plural form CARS while the document contains the singular CAR. For many queries we might like to consider occurrences of the words CAR and CARS, or even RETRIEVAL and RETRIEVE, as roughly interchangeable with one another; the suffixes do not affect meaning dramatically. And of course our problem doesn’t end with plurals; we could make similar arguments concerning past-tense -ED endings and -ING participles.
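To see what treating such forms as interchangeable might involve, here is a deliberately naive suffix stripper. It is a sketch for illustration only, not an established stemming algorithm; the function name `crude_stem`, the suffix list, and the three-character minimum are all assumptions.

```python
def crude_stem(token):
    """Strip one of a few common English suffixes; a deliberately naive sketch."""
    word = token.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words like "red" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(crude_stem("CARS"), crude_stem("CAR"))
print(crude_stem("indexed"), crude_stem("indexing"))
print(crude_stem("RETRIEVAL"), crude_stem("RETRIEVE"))
```

CAR and CARS now collapse to the same keyword, as do INDEXED and INDEXING, but RETRIEVAL and RETRIEVE remain distinct, and a word like CLASS would be wrongly cut to CLAS. Rules this simple both under- and over-conflate.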

This simple solution also depends too much on where spaces occur. Consider the German noun GESCHWINDIGKEITSBESCHRÄNKUNG, corresponding to the English phrase SPEED LIMIT. In many ways, the fact that English happens to put a white space between the words while German does not is not semantically critical to the meaning of these descriptors or the documents in which they might occur. Such morphological features – used to mark relatively superficial, surface-structure features (such as tense or singular versus plural) – can be considered less important to the meaning. And differences between German and English are trivial when they are compared to Asian texts, where the relationship between characters and words is radically different.

What about hyphenation? Use of the word DATABASE, the phrase DATA BASE, and the hyphenated phrase DATA-BASE is highly variable, depending on author preference and current practice at the time and place of publication. Yet we would hope that all occurrences of any of these tokens would be treated as references to approximately the same semantic category. Similarly, we hope that the end-of-line hyphenation (breaking long words at syllable boundaries) would not create two keywords when we would expect only one. But simply adding “-” to the set of white space characters defining tokens would make CLINTON-DOLE and A-Z keywords, too!
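The trade-off described above can be made concrete with a small tokenizer sketch. It assumes end-of-line hyphenation appears as a hyphen immediately followed by a newline, and it offers splitting on hyphens as an option; the names `join_line_breaks` and `tokenize` are hypothetical.

```python
import re

def join_line_breaks(text):
    """Rejoin a word that was split by a hyphen at the end of a line.

    Assumption: a hyphen directly followed by a newline is typographic.
    """
    return re.sub(r"(\w)-\n\s*(\w)", r"\1\2", text)

def tokenize(text, split_hyphens=False):
    """Split on white space, optionally treating "-" as a separator too."""
    text = join_line_breaks(text)
    pattern = r"[\s-]+" if split_hyphens else r"\s+"
    return [token for token in re.split(pattern, text) if token]

print(tokenize("infor-\nmation retrieval"))
print(tokenize("DATA-BASE design"))
print(tokenize("DATA-BASE design", split_hyphens=True))
print(tokenize("CLINTON-DOLE A-Z", split_hyphens=True))
```

Splitting on hyphens turns DATA-BASE into the same two tokens as DATA BASE, but it also breaks CLINTON-DOLE and A-Z into fragments. The line-break rule has its own failure mode: a genuine compound that happens to be broken at a line end would be fused into one word.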

Hyphenation is concerned with the situation in which a potential keyword is broken up by punctuation; what about those situations where a space also breaks up a semantic unit? SPEED LIMIT seems semantically cohesive, but what algorithm could distinguish it from other bigrams (consecutive pairs of words) that happen to occur sequentially? The problem only becomes that much more complicated if we attempt to consider longer noun phrases like APPLES AND ORANGES or BACK PROPAGATION NEURAL NETWORK, let alone more complicated syntactic compounds such as verb phrases, clauses, or sentences. Identifying phrases is an important and active area of research from the perspectives of both IR and computational linguistics.
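A first, purely statistical step toward finding phrases is to count how often each bigram occurs. The sketch below does only that; using raw frequency as evidence of cohesion is an assumption made for illustration, and the name `bigram_counts` and the sample sentence are hypothetical.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count each consecutive pair of tokens."""
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical token stream.
tokens = "the speed limit sign shows the speed limit".split()
counts = bigram_counts(tokens)
print(counts[("speed", "limit")])
print(counts[("the", "speed")])
```

Here SPEED LIMIT and THE SPEED both occur twice, so frequency alone cannot tell the cohesive phrase from an accidental pairing with a common word. That is exactly the difficulty raised above.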

Summarizing, we will take a token to be our default keyword because this is straightforward. More sophisticated solutions will handle hyphenation, multiword phrases, subtoken stems, and so on (cf. Section 2.3.1).

1.5.2 Computer-Assisted Indexing

The field of library science has studied the manual process of constructing effective indices for a very long time. This manual standard becomes a useful benchmark against which our best automatic techniques can be compared, but it also demonstrates how difficult comparison will be. There are data, for example, that suggest that the capacity of one person (e.g., the indexer) to anticipate the words used by another person (e.g., a second indexer or the query of a subsequent user) is severely limited [Furnas et al., 1987]; we are all quite idiosyncratic in this regard. The lack of interindexer consistency among humans must make us humble in our expectations for automated techniques.

But manual and automatic indexing need not be viewed as competing alternatives. In economic terms, if we had sufficient resources, we could hire enough highly trained catalogers to carefully read every document in a corpus and index each of them. If we couldn’t afford this very expensive option, we would have to be satisfied with the best index our automatic system could construct. But if we have enough resources to hire one or two human indexers, what tools might we give them that would make the most effective use of their time?

We seek methods that leverage the editorial resource, in the sense that this manual effort does not grow as the corpus does. How might editors and librarians guide an automatic indexing process? What information should this computation provide that would allow intelligent human readers the assurance of a high-quality indexing function? Chapter 7 will discuss ways that editors can train machine learning systems, and a number of analyses that are of interest to editors will be mentioned, especially in Chapter 6.

1.6 FOA versus Database Retrieval

Within the field of computer science, the subfields of databases and IR are often closely aligned. Databases have well-developed theoretic underpinnings [Abiteboul et al., 1995] that have generated efficient algorithms [McFadden and Hoffer, 1994] and become the foundation for one of the most successful elements of the computer industry.

Both databases and search engines attempt to characterize a particular class of queries by which many users are expected to attempt to get information from computers. Historically, database systems and theory have been perceived as central to the discipline of computer science, probably more so than the IR techniques that are the core technologies for FOA. Things may be changing, however.

(See Exercise 1.)

The general public’s discovery of the Internet and subsequent interest in search engines like Alta Vista, InfoSeek, and Yahoo! suggest that many users find value in the lists of Web pages returned in response to searches. These search engines are clearly doing an important job for many people. It is also a quantitatively different job from organizing their address book (or record collection or baseball statistics) databases. How are IR and database technologies to be distinguished?

To make the distinctions more concrete, let’s imagine a particular information need and think about how both a database and a search engine might attempt to satisfy it. An example query might be as follows.

QUERY 3 What is the best SCSI disk drive to buy?

In the case of databases, strong assumptions must first be made about structure among attributes of individual records. Good database design demands that the fundamental elements of data, their format, and logical relations among them be carefully analyzed and anticipated in a logical data model long before any data are actually collected and maintained within a physical implementation. These assumptions allow specification of a syntax for the query language, strategies for optimizing the query’s use of computational resources, and efficient storage of the data on physical devices.

Now let’s assume that a logical data model has been constructed and that a large catalog of information from various hard drive manufacturers and vendors has been collated. We will also make the larger and problematic assumption that the users can translate the natural language of Query 3 into the somewhat baroque syntax of a query language such as SQL. The result of the database search might look something like Table 1.1. (Margin note: NLP for databases.)

Creating an example relation like this and populating it with a few instances is simple, but performing the necessary data modeling, collating the data from all of the manufacturers and vendors, and keeping it all up to date are much more daunting tasks. If the database catalog is out of date or missing data from important vendors, users might leave the database badly informed.
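To make the database side concrete, the sketch below builds a tiny relation with Python's standard `sqlite3` module and poses one possible SQL translation of Query 3. The table name `drives`, its columns, and every row are hypothetical illustrations and do not reproduce the catalog discussed here; reading "best" as "cheapest with a known price" is likewise only one assumed interpretation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE drives (model TEXT, interface TEXT, capacity_gb REAL, price REAL)"
)
# Hypothetical rows; the third drive's price is unknown and stored as NULL.
conn.executemany(
    "INSERT INTO drives VALUES (?, ?, ?, ?)",
    [
        ("Drive A", "SCSI", 2.0, 450.0),
        ("Drive B", "SCSI", 4.0, 700.0),
        ("Drive C", "SCSI", 9.0, None),
        ("Drive D", "IDE", 4.0, 300.0),
    ],
)

# One assumed reading of "best": cheapest SCSI drive with a known price.
rows = conn.execute(
    "SELECT model, price FROM drives "
    "WHERE interface = 'SCSI' AND price IS NOT NULL "
    "ORDER BY price ASC"
).fetchall()

# The database can also report exactly which prices it lacks.
unknown = conn.execute("SELECT model FROM drives WHERE price IS NULL").fetchall()

print(rows)
print(unknown)
conn.close()
```

The query is precise and its answer is exact, but only because the schema was fixed in advance and someone decided what "best" should mean.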

Now let’s imagine using a search engine on the same query. When run against a UseNet news search engine like DejaNews, this query results in the retrieval shown in Figure 1.8 with the most highly ranked posting shown in Figure 1.9.

Users of this search engine will read about many issues related to hard disks, some of which may be relevant to their particular situation. For example, does the “best” qualifier in Query 3 mean lowest cost, maximum capacity, minimum access time, or something else? Can users choose between IDE and SCSI, or are they restricted to SCSI? Depending on what kind of users they are, some of the information retrieved may be immediately applicable to the purchase being considered, while other parts of it are better considered collateral knowledge (D. E. Rose, personal communication) that simply leaves users better informed.

A very different set of assumptions from those we made about the database system are necessary to imagine the search engine working. For example, who wrote these postings? Are they a credible source of good information; what is their authority? Well-trained database users should ask equally skeptical questions about the data retrieved, but rarely are authority, data integrity, and the like considered part of database analysis.

But the key assumption for our IR users is that they can “listen in” on this previous “conversation” and interpret the text that has been left behind as containing potential answers to the current question. The search engine is charged with retrieving textual passages that are likely to answer the users’ questions. Once presented with these retrievals, FOA users have more humble expectations and are willing to do more interpretive work. Because FOA searches are often even less concrete than Query 3 and are issued by users simply trying to learn about a topic, semantic issues central to the interpretation of a textual passage and its context, validity, and so on are at the heart of the FOA enterprise.

Van Rijsbergen (p. 2, Table 1.1) has summarized these issues along a number of dimensions by which IR and database systems can be distinguished, and several of these are duplicated in Table 1.2. Database systems are almost always assumed to provide data items directly. Search engines provide a level of indirection, a pointer to textual passages that contain many facts, hopefully including some of interest. The information need of search engine users is quite vague when compared to that of database users: they are searching for information about a topic they don’t completely understand. Typical database users have a fairly specific question, like Query 3, in mind. It might even be that the database is missing some data; for example, the special null value in Table 1.1 shows that the price of the third disk drive is not known. Even in this case, however, the database system “knows that it doesn’t know” this information. FOA queries are rarely brought to such a sharp point; ambiguity is intrinsic to the users’ expectations.

Because the queries are so general, an FOA retrieval must be described in probabilistic terms. If a particular hard disk’s price is part of our database, we are certain, with probability = 1.0, of its value. Never would a database system reply with “This hard disk might cost about $300.” As discussed in depth in Section 5.5, a search engine can use sophisticated methods for reasoning probabilistically, and available evidence might even allow it to be quite confident that retrieved items will be perceived as relevant. But never will we be entirely certain that a document is what users want; we can only have high confidence that it may be.

Finally, one of the problems in evaluating search engines is just what success criteria are to be used. We typically assume that information we get back from a database system is correct. (Try to find an ad for a database system that boasts, “Our system retrieves only right answers”!) Instead, one database system claims to be more efficient, cheaper, easier to integrate into existing code, and more user-friendly than others.

This list of ways that search engines might be distinguished from databases is far from exhaustive; Blair has proposed a more extensive analysis [Blair, 1984]. More recently, as search engine technology and WWW-inspired applications have both burgeoned, hybrids of databases and search engines have blurred the historical differences further. Some bases of database/search engine interaction are mentioned in Chapter 6.

Chapter 4 discusses the evaluation of search engines in great detail, but typically the bottom line is: Does the system help you? If you are writing a research paper, did this search engine help you find material that was useful in your research? If you are a lawyer preparing a case and you want to find every relevant judicial opinion, does the search engine offer an advantage over an equivalent amount of time combing through books in a law library? Such squishy, qualitative judgments are notoriously difficult to measure, and especially to measure consistently across broad populations of users. The next section provides a quick preview of several precise measurements that have proven useful to the IR community but would not be found persuasive within the database community.

1.7 How Well Are We Doing?

Suppose you and I each build an FOA search tool; how might we decide which does the better job? How might a potential customer decide on their relative values? If we use a new search engine that seems to work much better, how can we determine which of its many features are critical to this success? If we are to make a science of FOA, or even if we only wish to build consistent, reliable tools, it is vital that we establish a methodology by which the performance of search engines can be rigorously evaluated.

Just as your evaluation of a human question-answerer (professor, reference librarian, etc.) might well depend on subjective factors (how well you “communicate”) and factors that go beyond the performance of the search engine (does any available document contain a satisfying answer?), evaluation of search engines is notoriously difficult. The field of IR has made great progress, however, by adopting a methodology for search engine evaluation that has allowed objective assessment of a task that is closely related to FOA. Here we will sketch this simplified notion of the FOA task.

The first step is to focus on a particular query. With respect to this query, we identify the set of documents Rel that are determined to be relevant to it. Then a good search engine is one that can retrieve all and only the documents in Rel. Figure 1.10 shows both Rel and Retr, the set of documents actually retrieved in response to the query, in terms of a Venn diagram. Clearly, the number of documents that were designated both relevant and retrieved, Retr ∩ Rel, will be a key measure of success.

But we must compare the size of the set | Retr ∩ Rel | to something, and several standards of comparison are possible. For example, if we are very concerned that the search engine retrieve every relevant document, then it is appropriate to compare the intersection to the number of documents marked as relevant, | Rel |. This measure of search engine performance is known as recall:

$$\mathrm{Recall} = \frac{|\mathit{Retr} \cap \mathit{Rel}|}{|\mathit{Rel}|} \tag{1.1}$$

However, we might instead be worried about how much of what the users see is relevant, so an equally reasonable standard of comparison is the number of documents retrieved, | Retr |: what fraction of them are in fact relevant? This measure is known as precision:

$$\mathrm{Precision} = \frac{|\mathit{Retr} \cap \mathit{Rel}|}{|\mathit{Retr}|} \tag{1.2}$$

Note that even in this simple measure of search engine performance, we have identified two legitimate criteria. In real applications, our users will often vary as to whether high precision or high recall is more important. For example, a lawyer looking for every prior ruling (i.e., judicial opinions, retrievable as separate documents) that is on point for his or her case will be more interested in high-recall behavior. The typical undergraduate, on the other hand, who is quickly searching the Web for a term paper due the next day, knows all too well that there may be many, many relevant documents somewhere out there. But the student cares much more that the first screen of hits be full of relevant leads. Examples of high-recall and high-precision retrievals are also shown in Figure 1.10.
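To make the two measures concrete, here is a minimal sketch in Python. It treats Rel and Retr as sets of document identifiers, takes recall to be the size of their intersection divided by | Rel |, and takes precision to be the same intersection divided by | Retr |. The function name recall_precision, the identifiers d1 through d8, and the convention of returning 0.0 when a denominator is empty are all assumptions made for illustration; they do not come from any particular search engine.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a single query.

    retrieved: iterable of document ids the engine returned (Retr)
    relevant:  iterable of document ids judged relevant (Rel)
    """
    retr = set(retrieved)
    rel = set(relevant)
    hits = retr & rel  # documents both retrieved and relevant

    # Assumed convention: report 0.0 when a denominator would be zero.
    recall = len(hits) / len(rel) if rel else 0.0
    precision = len(hits) / len(retr) if retr else 0.0
    return recall, precision


# Hypothetical relevance judgments for one query.
rel = {"d1", "d2", "d3", "d4"}

# A high-recall retrieval: every relevant document, plus much that is not.
high_recall = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}

# A high-precision retrieval: few documents, all of them relevant.
high_precision = {"d1", "d2"}

print(recall_precision(high_recall, rel))     # (1.0, 0.5)
print(recall_precision(high_precision, rel))  # (0.5, 1.0)
```

The lawyer in the example above would prefer the first retrieval, which misses nothing; the student would prefer the second, in which every hit is a useful lead.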

To be useful, this same analysis must be extended to consider the order in which documents are retrieved, and it must consider performance across a broad range of typical queries rather than just one. These and other issues of evaluation are taken up in Chapter 4.

1.8 Summary

This chapter has covered enormous ground and attempted to summarize topics that will be discussed in the rest of this text. Major points include:

  • We constantly and naturally Find Out About (FOA) many, many things. Computer search engines need to support this activity, just as naturally.
  • Language is central to our FOA activities. Our understanding of prior work in linguistics and the philosophy of language will inform our search engine development, and the increasing use of search engines will provide empirical evidence reflecting back to these same disciplines.
  • IR is the field of computer science that traditionally deals with retrieving free-text documents in response to queries. This is done by indexing all the documents in a corpus with keyword descriptors. There are a number of techniques for automatically recommending keywords, but indexing also involves a great deal of art.
  • Users’ interests must be shaped into queries constructed from these same keywords. Retrieval is accomplished by matching the query against the documents’ descriptions and returning a list of those that appear closest.
  • A central component of the FOA process is the users’ relevance feedback, assessing how closely the retrieved documents match what they had “in mind.”
  • Search engines accomplish a function related to database systems, but their natural language foundations create fundamental differences as well.
  • In order to know how to shop for a good search engine, as well as to allow the science of FOA to move forward, it is important to develop an evaluation methodology by which we can fairly compare alternatives.

In this overview we’ve made some simplifying assumptions and raised more questions than we’ve answered, but that is the goal! By now, I hope you have been convinced that there are many facets to the problem of FOA, ranging from a good characterization of what users seek, to what the documents mean, to methods for inferring semantic clues about each document, to the problem of evaluating whether our search engines are performing as we intend. The rest of this book will consider each of these facets – and others – in greater detail. But like all truly great problems, issues surrounding FOA will remain long after this text is dust.


TERMS INTRODUCED IN THIS CHAPTER

audience
author
authority
automatic indexing
bigrams
broader term
captions
closed vocabularies
collateral knowledge
context
controlled
conventional
corpus
database
describes
document
domain of discourse
electronic artifacts
exhaustive
finding out about
genre
high recall
hits
hypernym
hypertext
hyphenation
indexing
information need
information retrieval
IR
jargon
keywords
level of treatment
logical data model
manual indexing
marked alphabets
meta-data
morphological
narrower terms
natural language
natural
on point
open vocabularies
operators
passages
plural
precision
proxies
publication information
query
query language
recall
relevance feedback
relevant
retrieval method
retrieved
rhetoric
search engine
semiotics
simple queries
specific
spoken documents
success criteria
system provides
terms of art
tokens
train
transcripts
transitive
uncontrolled
user’s query
users
vocabulary
vocabulary choice
vocabulary size
white space
word games
World Wide Web