4

Back to TOC

Assessing the Retrieval

We’ve come a long way since Chapter 1, where we first sketched the full range of activities we might consider FOA. As Chapter 2 considered the various ways of breaking text into indexable features, and Chapter 3 explained the various ways of weighting combinations of these features to identify the best matches to a query, I hope you have been aware of how many alternatives have been mentioned! That is, rarely has there been a single method that can be proven to be better than all others. IR has traditionally been driven by empirical demonstrations, and the range of commercial competitors now trying to provide the “best” search of the WWW makes it likely this performance orientation will continue. But whether we are search engineers, scientists objectively assessing one particular technique, or consumers of WWW search engine technology interested in buying and using the best, a solid basis of performance assessment is critical.

Several perspectives on assessment are possible. In the first chapter FOA was viewed as a personal activity, adopting the users’ points of view. Section 4.1 will continue in this theme, considering how users assess the results of their retrievals and how they can express their opinions using relevance feedback (RelFbk). Oddy is credited with first identifying this important stream of data, naturally provided by users as a part of their FOA browsing [Oddy, 1977; Belkin et al., 1982].

But in this book we are also concerned with FOA from the IR system builder’s point of view. Ideally, we would like to construct a search engine that robustly finds the “right” documents for each query and for each user. The second section of this chapter discusses performance measures of statistical properties that are reliable across large numbers of users and their highly variable queries. The key to these measures is having some insight into which documents should have been retrieved, typically because some idealized omniscient expert has determined (within a specially constructed experimental situation) that certain documents “should” have been retrieved. Alternatively the RelFbk of many users can be combined to form a consensual opinion of relevance, as described in Section 4.4.

A concrete notion of relevance would seem a fundamental precondition for understanding either an individual’s RelFbk or how this can be used to assess a search engine. But in this respect, information retrieval generally, and RelFbk in particular, is like many other academic areas of study (including artificial intelligence and genetics) in that the lack of a fully satisfactory definition of the core concept (information, intelligence, genes, and so on) has not entirely stopped progress. That is, a great deal can be done by operationalizing RelFbk to be simply those relevance assessment behaviors produced as part of an FOA dialog. This operational simplification will hold us until fundamental issues of language and communication are again addressed in Section 8.2.1

4.1 Personal Assessment of Relevance*

4.1.1 Cognitive Assumptions

“Garbage in, garbage out” is one of the first insights every software developer learns, and FOA is no exception. The primary source of data considered by traditional IR methods, and the focus of Chapter 2 and Chapter 3 of this text, are the documents of the corpus, particularly the keywords they contain.(Chapter 6 will consider the use of other document attributes.) A fundamental feature of the broader FOA view is that browsing users provide an equally important source of data concerning what keywords mean and what documents are about. It is therefore appropriate to begin by characterizing who these users we will be watching are.

We begin with one important cognitive assumption we must make about our users: How thorough an FOA search do they wish to perform? Is this an important search to which the users are willing to dedicate a great deal of time and attention, or will a quick, cursory answer suffice? For example, Chapter 1 mentioned how much less thorough the typical undergraduate (doing some quick research before submitting a term paper) is than the Ph.D. candidate (who wants to ensure his or her proposed dissertation topic is new). The typical WWW searcher seems satisfied with only a few useful leads, but the professional searcher (a lawyer looking for any case that might help, a doctor looking for any science that might heal a patient) will search diligently if there is even a small chance of finding another relevant document. This kind of variability can be observed not only across different classes of users but even across the same user at different times. In for a fact, stay for a lesson

Prototypic Retrievals

From the perspective of cognitive psychology, the task facing users who are asked to produce relevance feedback (RelFbk) can best be described as one of object recognition, in the tradition of Rosch and others [Rosch and Mervis, 1975; Rosch, 1977]. The object to be recognized is an internally represented prototypic document satisfying the users’ information need. In this case, the prototype corresponds to the model the subjects maintain of ideally relevant documents. As users consider an actual retrieved document’s relevance, they evaluate how well it matches the prototype. Barry and others have suggested the many and varied features over which prototypes can be defined [Barry, 1994]. Only a small number of these may be revealed by any one of the users’ queries, of course.

Here we will simply assume that users are capable of grading the quality of this match. Users might be asked to score the quality of relevance match according to a five-point scale like that shown in Figure 4.1. Users can qualify the middle Relevant response either by weakening it (Possibly Relevant) or strengthening it (Critically Relevant). Such distinctions are often made in experimental settings (e.g., in the use of the STAIRS retrieval system by lawyers [Blair and Maron, 1985]) and relate to the different purposes for FOA that different users may have. To make these distinctions concrete, we might imagine “Critically Relevant” to apply only to those documents that must be read even for an undergrad term paper, while “Possibly Relevant” would be much more broadly applied to those a Ph.D. student needs as part of his or her literature review.

For now, however, we will simplify the types of RelFbk to allow users to reply with only a single grade of “Relevant” () or “Not Relevant” (). These two assessments require overt action on the part of subjects; “No response”(#) is the default RelFbk assessment for documents not receiving any other responses. Again, this frees users from the much more cognitively demanding task of exhaustively assessing every retrieved document. Those documents that “jump out” at users as particularly good – or especially bad – examples of the prototype they seek provide the most informative RelFbk. Figure 4.1 also introduces a color-code convention in the electronic version: will be used to indicate positive RelFbk and to indicate negative RelFbk.

RelFbk is Nonmetric

As we move from a cognitive understanding of the users’ tasks to statistical analyses of their behaviors, we must understand one important feature of the RelFbk data stream: RelFbk is nonmetric data. That is, while users find it easy and natural to critique retrieved documents with , , and #, they would find it much more difficult to reliably assign numeric quantities reflecting something like the relative applicability of each retrieval.

Think for a moment just why this is hard, by imagining your reactions to a typical retrieval. Is the first document to be rated 10 or 6743? If you rate the first document as 10, the second as 6, and the third as 2, then you must ensure that the third document is exactly as much less relevant than the second as the second is from the first. Trying to keep all Rel(di) assessments consistent in the metric sense, for many retrieved documents or any other set, makes people crazy.

see exercise 1 Exercise 1

This is not only a property of relevance assessments. A large literature on psychological assessment [Kruskal, 1977a; Kruskal, 1977b; Shepard et al., 1972] has demonstrated that human subjects can quite easily and reliably sort objects into “piles,” and that they like one pile better than another. Yet the same people find it more difficult to quantify how much they like each object, let alone make these quantitative assessments consistent with one another. Reliable estimates of both cognitive qualities would be necessary if we are to have a true preference metric.*

Rather than assuming that users can provide a separate score for each retrieved document, we will treat this as an ordered nonmetric scale of increasing preference:

(4.1)

Each of these assumptions about what “relevance” is and how it can be measured is a matter of considerable debate [Wilson, 1973; Froehlich, 1994] and is likely to be the topic of much future work. It is also interesting to note that in our later attempts at a comprehensive model of how and why humans use language, “relevance” again plays a central role (cf. Section 8.2.2).

4.2 Extending the Dialog with RelFbk

Figure 4.2 focuses on a single instance of RelFbk, shown as a labeling over the set of Retr documents. But beyond any one reaction to a single retrieved set product, a central premise of the FOA process is that users’ reactions to just-retrieved documents provide the pivotal linkage between assessments, to form the FOA search dialog. This is perhaps more clear in Figure 4.3, where RelFbk is used to link a series of reactions into a query session.

Attempts to support this searching process, and then attempts to rigorously evaluate how well software systems support browsing users as they FOA, is one of the most vexing issues within IR evaluation [Daniels et al., 1985; Saracevic et al., 1988; Larson, 1996; Oddy et al., 1992; O’Day and Jeffries, 1993; Cawsey et al., 1992; Russel et al., 1993; Koenemann and Belkin, 1996]. Part of the problem is the misconception that if a search engine works perfectly and the user issues the perfect magic bullet query, out will spill all and only relevant documents! Such simplistic definitions of optimality come naturally to computer scientists; library scientists, who are used to the naturalistic behaviors of real patrons in their libraries, know that a much more extended and nebulous form of support is required.

Marcia Bates’s famous berry picking metaphor [Bates, 1986; Bates, 1989] is useful here:

[A] query is not satisfied by a single final retrieved set, but by a series of selections of individiual references and bits of information at each stage of the ever-modifying search. A bit-at-a-time retrieval of this sort is here called berrypicking. This term is used by analogy to picking huckleberries or blueberries in the forest. The berries are scattered on the bushes; they do not come in bunches. One must pick them one at a time. One could do berrypicking of information without the search need itself changing (evolving) but...[we] consider searches that combine both of these characteristics. [Bates, 1989, p.410]

In addition to highlighting the same iterative browsing behavior central to FOA’s characterization of the dialog, the “evolving” character of the information need in Bates’s metaphor is also important. Imagine that you are in the forest on an idyllic day with only one purpose: to fill your bucket with the best blueberries you can find. Early in the day, with your whole afternoon in front of you, you are likely to be very choosy. At this juncture, you could bump into a bush full of blueberries that were not as ripe or as large as you imagine must exist somewhere else in the forest, and not drop a single one into your basket. But late in the afternoon, if you have had poor luck and little to show for your efforts, you could come across an even worse bush and grab every single berry, even the shriveled ones!

Applying this metaphor to FOA is provocative in many respects. For example, it suggests that maintaining an explicit representation of the retrieved document “basket” might be a useful addition to any search engine interface. It predicts a time course to the distribution of users’ RelFbk assessments. For now, we simply observe that it seems quite likely that an assessment of one document’s relevance will depend greatly on the “basket” of other documents we have already seen. The general idea of thinking of an “evolutionary ecology of information foraging” [Pirolli and Card, 1997] has become less metaphoric and more concrete as information search agents (like the InfoSpiders described in Section 7.6) explore the “environment” of the WWW.

4.2.1 Using RelFbk for Query Refinement

While Figure 4.2 showed the retrieved set of documents as a simple set, it is interesting to impose the RelFbk labeling on the retrieved documents viewed in vector space. Figure 4.4 shows a query vector and a number of retrieved documents,together with a plausible distribution of RelFbk over them. That is, we can expect that there is some localized region in vector space where relevant documents are most likely to occur. If we believe that these positively labeled retrieved documents are in fact clustered, it becomes appropriate to consider a hypothetical centroid (average) document d+, which is at the center of all those documents that users have identified as relevant.

It is less reasonable, however, to imagine that the negatively labeled documents are similarly clustered. Documents that were inappropriately retrieved failed to be relevant for one reason or another; there may be several such reasons. These are shown as discriminating planes in Figure 4.5.

The vector space view also lets us easily portray two very different uses to which RelFbk information might be applied. Most typically, RelFbk is used to refine the user’s query. Figure 4.6 represents this refinement in terms of two changes we can make to the initial query vector. The first is to “take a step toward” the centroid of the positive RelFbk cluster. The size of this stepNeural-net style learning reflects how confident we are that the positive cluster centroid is a better characterization of the user’s interests than the original query.

There is one important difference between the query and even a slight perturbation of it toward a cluster’s centroid: While the original query vector is often very sparse and result from just those few words used in the user’s original query, any movement toward the centroid will include (linear combinations of) all those keywords used in any of the positively labeled documents. The additional difficulty in implementing this more densely filled feature vector becomes a serious obstacle in many system implementations. The fact that RelFbk-refined queries involve many more nonzero keyword entries also means that query weighting and matching techniques may be sensitive to this difference.

Seminal work on the use of RelFbk was done by Salton, especially with students Rocchio, Brauen, and Ide in the late 1960s and early 1970s [Rocchio, 1966; Brauen, 1969; Salton, 1972]. More recent students have extended the theory of query refinement and related it to topics in machine learning [Buckley et al., 1994; Buckley and Salton, 1995; Allan, 1996].

Some of these experiments explored a second modification to the query vector. In addition to moving toward the d+ center of , it is also plausible to move away from the irrelevant retrieved documents . As noted earlier, however, it is much less likely that these irrelevant documents are as conveniently clustered. As Salton [Salton and McGill, 1983] reported:

...retrieval operation is most effective when the relevant documents as well as the non-relevant documents are tightly clustered and the difference between the two groups is as large as possible....The RelFbk operation is less favorable in the more realistic case where the set of non-relevant documents covers a wider area of the space. [p. 145]

One possible strategy is to take a single element d of the irrelevant retrieved documents (for example, the most highly ranked irrelevant retrieval) and define the direction of movement with respect to it alone.

As we have discussed in connection with Figure 4.2, RelFbk helps to link individual queries into a browsing sequence. And so, although we have focused here on the simplest form of query refinement, with respect to the users’ initial queries, RelFbk can be given again and again. An initial query vector is moved toward the centroid of documents identified as relevant (and perhaps away from an especially bad one); this modified query instigates a new retrieval, which is again refined. In practice, it appears that such adjustments result in diminishing returns after only a few iterations of query refinement [Salton and McGill, 1983; Stanfill and Kahle, 1986b].

However, Section 7.3 will discuss a type of FOA in which a document corpus is constantly changing and the user’s interest in a topic is long-lived. In this case, we can imagine the query as a filter against which a constant stream (e.g., of newly published Web documents) is applied. RelFbk has also been used in this setting, to make ongoing changes to the query/filter that continue to improve its effectiveness [Allan, 1996].

Using RelFbk for query refinement produces results that are immediately satisfying to the users. First, it automatically generates a new query with which they can continue their browsing process. Second, the statistical analysis of positively labeled retrieved documents can provide other forms of useful information to the users as well. For example, rather than simply retrieving a new set of documents, new keywords that are not in the users’ original queries but are present in positively labeled documents at statistically significantly levels can be suggested to the users as new vocabulary. Conversely, words that were in the original query but are negatively correlated with d+ (and/or positively correlated with d) can also be identified.

4.2.2 Using RelFbk to Adapt Documents’ Indices

An alternative use of RelFbk is to make changes to the documents rather than to the query. Previously, the argument was that it is sensible to make the query look more like those documents the users liked; now the argument is the converse: Documents found relevant to a query should be described more like the query used to identify them. Changes made to document vectors according to this heuristic are known as document modifications and are shown in Figure 4.7.

Note that unlike query modification, adaptive document modifications made in response to RelFbk are not expected to be of (immediate) use to the users who provide it. Instead, the hope is that these changed document representations will be available later, to others who might be searching. As Salton [Salton and McGill, 1983] described the goal:

Following a large number of such interactions documents which are wanted by the users will have been moved slowly into the active portion of the document space – that part in which large numbers of users’ queries are concentrated, while items which are normally rejected will be located on the periphery of the space. [p. 145]

This provocative proposal, allowing a search engine to learn from its users, is considered in much greater detail in Chapter 7.

4.2.3 Summary

We have been discussing RelFbk from the individual user’s point of view. We’ve focused on how this information might be collected and how it might be used, both in the short term to modify users’ retrievals and in a longer-term way to change the documents’ indices. Now we want to consider a third use for RelFbk information. When we have more than one system to use for retrieval and would like to evaluate which is doing the better job, users’ assessments of retrieved documents’ relevance can be used as “grades.” If one system can consistently, across a range of typical queries, more frequently retrieve documents that the users mark as relevant and fewer that they mark as irrelevant, then that system is doing a better job.

4.3 Aggregated Assessment: Search Engine Performance

The last section considered the assessment problem from the perspective of a single individual, the browsing user. Now we would like to generalize on this individual performance to attempt to obtain statistically significant observations.

4.3.1 Underlying Assumptions

As with all scientific models, our attempts to evaluate the performance of a search engine rests on a number of assumptions. Many of these involve the user and simplifying assumptions about how users assess the relevance of documents.

1. Real FOA versus Laboratory Retrieval. From the FOA perspective, users retrieve documents as part of an extended search process. They do this often, and because they need the information for something important to them. If we are to collect useful statistics about FOA, we must either capture large numbers of such users “in the act” (i.e., in the process of a real, information-seeking activity) or attempt to create an artificial laboratory setting. The former is much more desirable but makes strong requirements concerning a desirable corpus, a population of users, and access to their retrieval data. So, typically, we must work in a lab. The first big assumption, then, is that our lab setting is similar to real life; i.e., “guinea pig” users will have reactions that mirror real ones.“With Web search engines don’t we have access to enormous numbers of users searching the same corpus?” [SG]

2. Intersubject Reliability. Even if we assume we have a typical user and that this user is engaged in an activity that at least mirrors the natural FOA process, we have to believe that this user will assess relevance the same as everyone else! But clearly, the educational background of each user, the amount of time he or she has to devote to the FOA process relative to the rest of the task, and similar factors will make one user’s reaction differ from another’s. For example, there is some evidence that stylistic variations also impact perceived relevance [Karlgren, 1996]. The consensual relevance statistic (cf. Section 4.3.2) is one mechanism for aggregating across multiple users.

This becomes a concern with intersubject reliability. If we intend to make changes to document representations based on one user’s RelFbk opinions, we would like to believe that there is at least some consistency between this user’s opinion and those of others. This is a critical area for further research, but some encouraging, preliminary results are available. For example, users of a multilingual retrieval system that presents some documents in their first language (“mother tongue”) and others in foreign languages they read less well, seem to be able to provide consistent RelFbk data even for documents in their second, weaker language [Sheridan and Ballerini, 1996].

3. Independence of Interdocument Relevance Assessments. Finally, notice that the atomic element of data collection for relevance assessments is typically a (queryi, documentj) pair: documentj is relevant to queryi. Implicitly, this assumes that the relevance of a document can be assessed independently of assessments of other documents. Again, this is a very questionable assumption.

Recall also that often the proxy on which the user’s relevance assessment depends is distinct from the document itself. The user sees only the proxy, a small sample of the document in question, for example, its title, first paragraph, or bibliographic citation. While we must typically take user reaction to the proxy as an opinion about the whole document, this inference depends critically on how informative the proxy is. Cryptic titles and very condensed citation formats can make these judgments suspect. And of course the user’s ultimate assessment of whether a document is relevant, after having read it, remains a paradox.

4.3.2 Consensual Relevance

In most search engine evaluation, the assumption has been that a single expert can be trusted to provide reliable relevance assessments. Whether any one, “omniscient” individual is capable of providing reliable data about the appropriate set of documents to be retrieved remains a foundational issue within IR. For example, a number of papers in a recent special issue of the Journal of the American Society for Information Systems devoted to relevance advocated a move toward a more “user-centered, situational” view of relevance[Froehlich, 1994].

Our attention to the opinions of individual users suggests the possibility of combining evidence from multiple human judges. Rather than having relevance be a Boolean determination made by a single expert, we will consider “relevance” to be a consensual, central tendency of the searching users’ opinions. The relevance assessments of individual users and the resulting central tendency of relevance is suggested by Figure 4.8. Two features of this definition are significant. First, consensual relevance posits a “consumers’ ”perspective on what will count as IR system success. A document’s relevance to a query is not going to be determined by an expert in the topical area, but by the users who are doing the searching. If they find it relevant, it’s relevant, whether or not some domain expert thinks the document “should” have been retrieved.

Second, consensual relevance becomes a statistical, aggregate property of multiple users’ reactions rather than a discrete feature elicited from any one individual. By making relevance a statistical measure, our confidence in the relevance of a document (with respect to a query) increases as more relevance assessment data are collected. This reliance on statistical stability creates a strong link between IR and machine learning (cf. Chapter 7). Allen’s investigation into idiosyncratic cognitive styles of browsing users [Allen, 1992] and Wilbur’s assessment of the reliability of RelFbk across users [Wilbur, 1998] provide a more textured view of how multiple relevance assessments can be compared and combined.

It seems, however, that our move from omniscient to consensual relevance has only made the problem of evaluation that much more difficult. Test corpora must be large enough to provide robust tests for retrieval methods, and multiple queries are necessary to evaluate the overall performance of a search engine. Getting even a single person’s opinion about the relevance of a document to a particular query is hard, and we are now interested in getting many! However, software like RAVE (cf. Section 4.4) allows an IR experimenter to effectively collect large numbers of relevance assessments for an arbitrary document corpus.

4.3.3 Traditional Evaluation Methodologies

Before surveying all of the ways in which evaluation might be performed, it is worthwhile to sketch how it has typically been done in the past [Cleverdon and Mills, 1963]. In the beginning, computers were slow and had very limited disk space and even more limited memories; initial test corpora needed to be small, too. One benefit of these small corpora was that it allowed at least the possibility of having a set of test queries compared exhaustively against every document in the corpus.

The source of these test queries, and the assessment of their relevance, varied in early experiments. For example, in the Cranfield experiments [Lancaster, 1968], 1400 documents in metallurgy were searched according to 221 queries generated by some of the documents’ authors. In Salton’s experiments with the ADI corpus, 82 papers presented at a 1963 American Documentation Institute meeting were searched against 35 queries and evaluated by students and “staff experts” associated with Salton’s lab [Salton and Lesk, 1968]. Lancaster’s construction of the MEDLARS collection was similar [Lancaster, 1969].

As computers have increased in capacity, reasonable evaluation has required much larger corpora. The Text Retrieval Conference (TREC), begun in 1992 and still held annually, has set a new standard for search engine evaluation [Harman, 1995]. The TREC methodology is notable in several respects. First, it avoids exhaustive assessment of all documents by using the pooling method, a proposal for the construction of “ideal” test collections that predates TREC by decades [Sparck Jones and van Rijsbergen, 1976]. The basic idea is to use each search engine independently and then “pool” their results to form a set of those documents that have at least this recommendation of potential relevance. All search engines retrieve ranked lists of k potentially relevant documents, and the union of these retrieved sets is presented to human judges for relevance assessment.

In the case of TREC, k = 100 and the human assessors were retired security analysts, like those that work at the National Security Agency (NSA) watching the world’s communications. Because only documents retrieved by one of the systems being tested are evaluated there remains the possibility that relevant documents remain undiscovered, and we might worry that our evaluations will change as new systems retrieve new documents and these are evaluated. Recent analysis seems to suggest that, at least in the case of the TREC corpus, evaluations are in fact quite stable [Voorhees, 1998].

An important consequence of this methodological convenience is that unassessed documents are assumed to be irrelevant. This creates an unfortunate dependence on the retrieval methods used to nominate documents, which we can expect to be most pronounced when the methods are similar to one another. For example, if the alternative retrieved sets are the result of manipulating single parameters of the same basic retrieval procedure, the resulting assessments may have overlap with, and hence be useless for comparison of, methods producing significantly different retrieval sets. For the TREC collection, this problem was handled by drawing the top 200 documents from a wide range of 25 methods, which had little overlap [Harman, 1995]. Vogt [Vogt and Cottrell, 1998] has explored how similarities and differences between retrieval methods can be similarly exploited as part of combined, hybrid retrieval systems (cf. Section 7.5.4).

It is also possible to sample a small subset of a corpus, submit the entire sample to review by the human expert, and extrapolate from the number of relevant documents found to an expected number across the entire corpus. One famous example of this methodology is Blair and Maron’s assessment of IBM’s STAIRS retrieval system [Blair and Maron, 1985] of the early 1980s. This evaluation studied the real-world use of STAIRS by a legal firm as part of a litigation support task: 40,000 memos, design documents, etc. were to be searched with respect to 51 different queries. The lawyers themselves then agreed to evaluate the documents’ relevance. As they reported:

To find the unretrieved relevant documents we developed sample frames consisting of subsets of unretrieved databases that we believed to be rich in relevant documents.....Random samples were taken from these subsets, and the samples were examined by the lawyers in a blind evaluation; the lawyers were not aware they were evaluating sample sets rather than retrieved sets they had personally generated. The total number of relevant documents that existed in these subsets could then be estimated. We sampled from subsets of the database rather than the entire database because, for most queries, the percentage of relevant documents in the database was less than 2%, making it almost impossible to have both manageable sample sizes and a high level of confidence in the resulting Recall estimates. Of course, no extrapolation to the entire database could be made from these Recall calculations. Nonetheless, the estimation of the number of relevant unretrieved documents in the subsets did give us a maximum value for Recall for each request. [Blair and Maron, 1985, pp. 291–293]

This is a difficult methodology, but it allows some of the best estimates of Recall available. And the news was not good: On average, retrievals captured only 20 percent of relevant documents!Killing the messenger

In short, methodologies for valid search engine evaluations require much more sophistication and care than is generally appreciated. Careful experimental design [Tague-Sutcliffe, 1992], statistical analysis [Hull, 1993], and presentation [Keen, 1992] are all critical.

4.3.4 Basic Measures

Figure 4.9 shows the relationship between relevant (Rel) and retrieved (Retr)sets as a Venn diagram, against the backdrop of the universe U of the rest of the documents of the corpus. Obviously, our focus should be on those documents that are in the intersection of Rel and Retr and on making this intersection as large as possible. Informally, we will be most happy with a Rel set when it best overlaps with the Retr set, and therefore we seek evaluation measures that reflect this. The basic relations between the sizes of these sets can also be captured in the contingency table of Table 4.1.

We know we want the intersection of the Rel and Retr sets to be large, but large relative to what?! As mentioned in Chapter 1, if we are most focused on the Rel set and use it as our standard of comparison, we’d like to know what fraction of these we’ve retrieved. This ratio is called recall:

(4.2)

Anticipating the probabilistic analysis of Section 5.5, we can think of Recall as (an estimate of) the conditional probability that a document will be retrieved, given that it is relevant: Pr(Ret|Rel).

Conversely, if we instead focus on the Ret set, we are most interested in what fraction of these are relevant; this ratio is called precision:

(4.3)

Similarly, this is the probability that a document will be relevant, given that it is retrieved: Pr(Ret|Rel). A closely related but less common measure is called fallout, where we (perversely!) focus on the irrelevant documents and the fraction of them retrieved:

(4.4)

The close relationship between these three measures can be defined precisely, if the generality G of the query (cf. Section 4.3.7) is known:

(4.5)

This is Pr(Ret|). These two measures, Recall and Precision, have remained the bedrock of search engine evaluation since they were first introduced by Kent in 1955 [Kent et al., 1955; Saracevic, 1975]. By far the most common measures of search engine performance are just the pair of measures, Precision and Recall.

Ideally, of course, we’d like a system that has both high precision and high recall: only relevant documents and all of them. But real-world, practical systems must select documents based on features that are only statistically useful indicators of relevance; we can never be sure. In this case efforts made to improve recall must retrieve more documents, and it is likely that precision will suffer as a consequence. The best we can hope for is some balance.

In some applications it is nevertheless desirable to evaluate IR system performance according to a single measure rather than the two-dimensional Recall/Precision criteria.Single dimensions for simple minds We will return to this topic in Section 4.3.8.

4.3.5 Ordering the Retr Set

Do not worry about large numbers of results: the best ones come first! (www.AltaVista.com, 1998)

The next step is to move beyond thinking of Retr as simply a set. We will suppose that retrieved documents are returned in some order by the search engine, reflecting its assessment of how well each document matches the query. Following current Web vernacular, we will call this ordering of the Ret set a hitlist and a retrieved document’s position its hitlist rank Rank(di). This is a positive integer assigned to each document in the Retr set, in descending order of similarity with respect to the matching function Match (q,d):

(4.6)

Sparck Jones [Sparck Jones, 1972] and others have historically referred to a document’s rank in Ret as its “coordination level” (cf. Eq. 3.36). Strictly speaking, coordination level refers to the number of keywords shared by document and query. In Boolean retrieval systems, sensitive only to the presence or absence of keywords, ranking by coordination level may be the only available measure on document/query similarity.

For long queries, hitlist rank and coordination level are likely to be similar, because it is unlikely that different documents will match exactly the same number of words from the query. But for short queries, it is likely that coordination level will only partially order the Retr set. This is why van Rijsbergen, p. 161, speaking of the Boolean systems typical at that time, said, “Unfortunately, the ranking generated by a matching function is rarely a simple ordering, but more commonly a weak ordering.” Most modern search engines, however, exploit keyword weightings and can provide much more refined measures, thereby providing a total ordering of the hitlist.

According to the Probability Ranking Principle (cf. Section 5.5.1), a retrieval system is performing optimally if it retrieves documents in order of decreasing probability of relevance. For now we simply assume that there is a total ordering imposed over Retr. We will use the hitlist ranking to effectively define a series of retrievals. Setting a very high threshold on this ordering would mean retrieving a very small set, while setting a lower threshold will retrieve a much larger one.

Now consider a particular query q and the set Relq of relevant documents associated with it. Assuming that Retr is totally ordered makes it possible for us to define the fundamental analytic tool for search engine performance: the Recall/Precision curve (Re/Pre curve). The basic procedure is to consider each retrieved document in hitlist rank order and to ask for the precision and recall of the retrieval of all documents up to and including this one.

Consider the first of the two hypothetical retrievals shown in Table 4.2.

With respect to this query, we will assume there are exactly five relevant documents out of a total of 25 in the corpus. The very first one retrieved is deemed relevant; if we stopped retrieval at this point, our recall would be 0.2 (because we would have retrieved one of five relevant documents), and our precision is perfect (the one retrieved document is relevant). Our good luck continues as we consider the next document, which is also relevant; this generates a second Re/Pre data point of (0.4,1.0). We are not so lucky with the third document retrieved; precision drops to 0.67 and recall remains at 0.4. Proceeding down the retrieval in rank order, and plotting each point in this fashion gives the Re/Pre curve shown in Figure 4.10.

At this point we can already make several observations. Asymptotically, we know that the final recall must go to one; once we have retrieved every document we’ve also retrieved every relevant document. The precision will be the ratio of the number of relevant documents to the total corpus size. Ordinarily, unless we are interested in very general queries or very small sets of documents, this ratio will be very close to zero.

The other end of the curve, however, turns out to be much less stable. We would hope that a retrieval system’s very first candidate for retrieval, the document with hitlist rank = 1, will be relevant, but it may not be. Figure 4.11 shows a second pair of hypothetical data points (blue), corresponding to the case that a single irrelevant document is ranked higher than the relevant ones. This relatively small change in assessment creates a fairly dramatic effect on the curve, with real consequence once we need to juxtapose multiple queries’ curves (see Section 4.3.7). Such instability is an inevitable consequence of the definitions of Precision and Recall: If the first retrieved document happens to be relevant, its Re/Pre coordinates will be less than 1, and greater than 1; otherwise it will be < 0, 0 >.

Figure 4.12 puts this particular retrieval in the context of the best and worst retrievals we might imagine. The best possible retrieval (orange) would be to retrieve the five relevant documents first, and then all other documents. This would produce the upper, square Re/Pre curve. Alternatively, the worst possible retrieval (gray) would retrieve all but the relevant documents before returning these; this produces the lower line.

4.3.6 Normalized Recall and Precision

The best/worst “envelope” surrounding an actual Re/Pre curve is related to a similar comparison known as normalized recall [Rocchio, 1966]. Imagine plotting the fraction of relevant documents retrieved as a function of the fraction of the total number of documents retrieved. Such a function is plotted in Figure 4.13. Comparing the area between the actual retrieval and the worst case (yellow) to the total area between the best (orange) and worst cases (that above the blue region) is very much like the best/worst-case envelope of Figure 4.12.

We can derive expressions for this area. Let ri be the hitlist rank of the ith relevant document. Then the area under the curve corresponding to any actual query is (if we define r0=0):

(4.7)

In the best case ri = i:

(4.8)

when NRel . In the worst case ri = NDocNRel + i:

(4.9)

4.3.7 Multiple Retrievals across Varying Queries

It should come as no surprise, given the wide range of activities in which FOA is a crucial component, that there is enormous variability among the kinds of queries produced by users. The next step of our construction is therefore to go beyond a single query to consider the performance of a system across a set of queries.

One obvious dimension to this variability concerns the “breadth” of the query: How general is it? If the Rel set for a query is known, this can be quantified by generality, comparing the size of Rel to the total number of documents in the corpus:

(4.10)

There are many other ways in which queries can vary, and the fact that different retrieval techniques seem to be much more effective on some types of queries than others makes this a critical issue for further research. For now, however, we will treat all queries interchangeably but consider average performance across a set of them.

Figure 4.14 juxtaposes two Re/Pre curves corresponding to two queries. Query 1 is as before, and Query 2 is a more specific query, as evidenced by its lower asymptote. Even with these two queries, we can see that, in general, there is no guarantee that we will have Re/Pre data points at any particular recall level. This necessitates interpolation of data points at desired recall levels. The typical interpolation is done at prespecified recall levels, for example, 0, 0.25, 0.5, 0.75, and 1.0. As van Rijsbergen, p. 152 discusses, a number of interpolation techniques are available, each with its own biases. Because each new relevant document added to our retrieved set will produce an increase in precision (causing the saw-tooth pattern observed in the graph), simply using the next available data point above a desired recall level will produce an overestimate, while using the prior data point will produce an underestimate.

With preestablished recall levels, we can now juxtapose an arbitrary number of queries and average over them at these levels. For 30 years the most typical presentation of results within the IR community was the 11-point average curves, like those shown in Figure 4.15 [Salton and Lesk, 1971; Salton and Lesk, 1968]. (These data happen to show performance on the ADI corpus of Boolean versus weighted retrieval methods; they include only the last 10 data points.)

It is not uncommon to see research data reduced even further. If queries are averaged at fixed recall levels and then all of these recall levels are averaged together, we can produce a single number that measures retrieval system performance. Note the serious bias this last averaging across recall levels produces, however. It says that we are as interested in how well the system did at the 90 percent recall level as at 10 percent!? Virtually all users care more about the first screenfull, of hits they retrieve than the last.

This motivates another way to use the same basic recall/precision data. Rather than measuring at fixed recall levels, statistics are collected at the 10-, 25-, and 50-document retrieval levels. Precision within the first 10 or 15 documents is arguably a much closer measure of standard browser effectiveness than any other single number.

All such atempts to reduce the Re/Pre plot to a single number are bound to introduce artifacts of their own. In most cases the full Re/Pre curve picture is certainly worth a thousand words. Plotting the entire curve is straightforward and immediately interpretable, and it lets viewers draw more of their own conclusions.

We must guard against taking our intuitions based on this tiny example (with only 25 documents in the entire corpus) too seriously when considering results from standard corpora and queries. For example, our first query had fully 20 percent of the corpus as relevant; even our second query had 8 percent. In a corpus of a million documents, this would mean 80,000 of them were relevant!? Much more typical are queries with a tiny fraction, perhaps 0.001 percent, relevant. This will mean that the precision asymptote is very nearly zero. Also, we are likely to have many more relevant documents, resulting in a much smoother curve.

4.3.8 One-Parameter Criteria

This section began with recall and precision, the two most typical measures of search engine performance. From that beginning, richer, more elaborate characterizations of how well the system is performing have been considered. But even having the two measures of recall and precision, it is not a simple matter to decide whether one system is better or worse than another. What are we to think of a system that has good recall but poor precision, relative to another with the opposite feature?

For example, if we wish to optimize a search engine with respect to one or more design parameters (e.g., the exact form of the query/document matching function, cf. Section 5.3.1), effective optimization becomes much more difficult in multicriterial cases. Such thinking has generated composite measures based on the basic components of recall and precision.

For example, Jardine and van Rijsbergen [Jardine and van Rijsbergen, 1971; van Rijsbergen, 1973] originally proposed the F-measure for this purpose:

(4.11)

van Rijsbergen, p.174 * has since defined the closely related effectiveness of measure E, which uses to smoothly vary the emphasis given to precision versus recall:

(4.12)

The transform converts easily between the two formulations, with E = 1 – F. van Rijsbergen, p. 174 also presents an argument that a perfectly even-handed balance of precision against recall at = 0.5 is most appropriate. Setting = 0.5, = 1 also has the pleasing consequence that the F, statistic corresponds to the harmonic mean of Precision and Recall.

As discussed at some length in Section 7.4, it is possible to view retrieval as a type of classification task: Given a set of features for each document (e.g., the keywords it contains), classifiy it as either Rel or with respect to some query. Lewis and Gale [Lewis and Gale, 1994] used the F measure in the context of text classification tasks, and they recommend a focus on the same = 1.0 balance. Classification accuracy measures how often the classification is correct. If we associate the choice to retrieve a document with classifying it as Rel, we can use the variables defined in the contingency table of (Table 4.1):

(4.13)
Sliding Ratio

The fact that the Retr set is ordered makes it useful to compare two rank orderings directly. If the “correct,” idealized ranking is known (for example, one corresponding to perfectly decreasing probability of relevance), then an actual search engine’s hitlist ranking can be compared against this standard. More typically, the rankings of two retrieval systems are compared to one another.

Given two rankings, we will prefer the one that ranks relevant documents ahead of irrelevant ones. If our relevance assessments are binary, with each document simply marked as relevant or irrelevant,* the normalized recall measure considered in Section 4.3.6 (or the expected search length measure to be described in Section 4.3.10) is the best we can do in distinguishing the two rankings.

But if we assume instead that it is possible to impose a more refined measure Rel(di) than simply Rel/ (e.g.,recall the richer preference scale of Figure 4.1), more sophisticated measures are possible. In this case, we prefer a ranking that ranks di ahead of dj just in case Rel(di) > Rel(dj). One way to quantify this preference is to sum the Rel(di) for the NRet most highly ranked documents:

(4.14)

The ratio of this measure, computed for each of the two systems’ rankings, is called the sliding ratio score [Pollack, 1968]:

(4.15)

As NRet increases, this ratio comes closer to unity:

(4.16)

and so it is most useful for distinguishing between two rankings when only a small NRet is considered.

Point Alienation

The sliding ratio measure provides a more discriminating measure but depends entirely on the availability of metric Rel(di) measures for retrieved documents. As discussed in Section 4.1.1, it is much easier to derive nonmetric assessments directly from RelFbk data given naturally as part of users’ browsing:

(4.17)

In an effort to exploit the nonmetric preferences often provided by human subjects, Guttman [Guttman, 1978] defined a measure known as point alienation. Bartell has pioneered a variation of it for use with document rankings rated by RelFbk [Bartell et al., 1994a]. The basic idea is deceptively simple: Compare the difference in rank between two differentially preferred documents to the absolute difference of these ranks:

(4.18)

If d is really preferred over d' – (d d') – (e.g., if some user has marked d as Rel but said nothing about d'), we can hope that Rank(d) < Rank(d'),* and so the numerator (Rank(d) – Rank(d')) will be negative; if, on the other hand, the two documents are incorrectly ordered by the ranking, the numerator will be positive. Comparing this arithmetic difference to its absolute value, and then summing over the rankings for all pairs of documents (d, d') that are differentially preferred (d d') gives Equation 4.18

4.3.9 Test Corpora

By test corpora we refer to collections of documents that have associated with them a series of queries for which relevance assessments are available. One of the earliest such test sets was a collection of 1400 research papers on aerodynamics developed by C. Cleverdon in the mid-1960s, known as the Cranfield corpus [Cleverdon and Mills, 1963]. For most of the 1980s, a set of corpora known as CACM, CISI, INSPEC, MED, and NPL (sometimes referred to as the Cornell corpora) were developed, maintained, and distributed by Gerald Salton and his students at Cornell; it became the de facto standard for testing within the IR community. For some time, the most influential test corpora have been the TREC corpora associated with the Text Retrieval Evaluation Conference meetings [Harman, 1995].

Table 4.3 gives a sample of statistics for a number of the most widely used corpora. One obvious trend is the increasing size of these collections over time. The Reuters corpus classification labels are invaluable for training classifiers (cf. Section 7.4). With our AIT corpus, the OSHMED [Hersh, 1994] is one of the few to provide multiple relevance assessments of the same q, d pair.

Figure 4.16 shows a sample query from the TREC experiments. For this query, and hundreds like it, considerable manual effort has gone into assessing whether documents in the TREC corpus should be considered “relevant.” Note the way the “basic” query (Line 2) has been embellished with general and specific topical orientation (Lines 1 and 3), important terms and abbreviations have been explicated, etc. This is much more information than most users typically provide, but it also allows much more refined assessment of systems’ performance.

As the testing procedures of the TREC participants have developed over the years, multiple “tracks” have formed, corresponding to typical search engine usage patterns. The task on which we have focused throughout this section is termed ad hoc retrieval, in the sense that a constant corpus is repeatedly searched with respect to a series of ad hoc queries. This is distinguished from the routing task, which assumes a relatively constant standing set of queries (for example, corresponding to the interests of various employees of the same corporation). Then, an ongoing stream of documents is compared, with relevant documents routed to appropriate recipients.

More recently, a special type of routing termed filtering has also been considered. In the filtering task, the standing query is allowed to adapt to the stream of RelFbk generated by the users as they receive and evaluate routed documents (cf. Section 7.3).

4.3.10 Other Measures

The performance measures already listed are by far the most common ways in which search engines are evaluated in the literature. Several others, however, have been important in the past and may again prove useful in some situations.

Expected Search Length

The ordered list of relevance assessments described in Section 4.3.5 also recommends another, holistic evaluation of the entire retrieval’s behavior; this method is known as expected search length (ESL). ESL considers the length of a “path” as users walk down the ordered hitlist, measuring how many irrelevant documents were seen on this path before each relevant document; “expected” refers to the average length of each path ending in a relevant document. Cooper initially proposed this model to measure the work a search engine saves, in comparison to searching the entire collection at random [Cooper, 1968].

Given that a search engine retrieves documents in hitlist order, ESL also requires a criterion by which the users’ wandering paths are stopped. Van Rijsbergen, p. 160 discusses a number of predicates that might be used to terminate the search: some fixed number of relevant documents, some fraction of all relevant documents, etc. Because the generality of queries can vary considerably, it is desirable to terminate the ESL after some fixed fraction E of relevant documents has been retrieved.

For this same reason, it makes sense to normalize ESL with respect to the number of relevant documents we might expect to retrieve if we were retrieving at random. If we use NRet for the number of retrieved documents (i.e., those satisfying the predicate mentioned earlier), we can estimate the expected random search length RandSL as:

(4.19)

Then the expected search length reduction factor

(4.20)

captures the amount a real search method improves over the random case.

see exercise 2 Exercise 2

Operating Characteristic Curves

Swets [Swets, 1963] enumerated a number of abstract desiderata (quoted by van Rijsbergen, p. 155) that we might wish for any assessment measure. According to these, IR’s standard Re/Pre plot leaves much to be desired, in particular because this two-dimensional assessment makes direct comparison impossible. Swets therefore recommended an analysis from the perspective of signal detection, based on several key assumptions:

ASSUMPTION 1There is a “relevant” signal we wish to distinguish from background noise. We can consider the worst case to be comparison against an “irrelevant” signal, with both signals imposed over the data collection. We can imagine that this signal is generated by the presence or absence of some keywords.
ASSUMPTION 2These two signals are to be discriminated according to only a single dimension.
ASSUMPTION 3These signals are both distributed normally across the corpus.

In this idealized case, we get a picture similar to Figure 4.17. Then, because our corpus has been ordered by the ranking, the goal becomes to select a value Rank that best separates these two modal distributions.

Using a simple retrieval rule that retrieves a document if its value is above the threshold Rank, wherever we place this threshold we are bound to make two types of errors. There will be some Rel documents that fall below our threshold (blue in Figure 4.17) and some irrelevant documents that fall above it (red). Following signal detection theory we can call the first set “FALSE–” errors and the second “FALSE+” errors. (These are often called Type 1 and Type 2 errors, respectively.) Note that the ratio of the right tail of the Rel curve (the area not blue in Figure 4.17) to the total area under the Rel curve corresponds exactly to the Recall measure defined earlier (Equation 4.2) while the ratio of the right tail of the NRel curve (red in Figure 4.17) to the total area under the NRel curve corresponds exactly to Fallout Equation 4.4.

The parametric curve defined by the percentage of Rel versus documents retrieved as is varied is called the operating characteristic curve. Obviously, if these two distributions are identical, this curve will be exactly a diagonal line, from (0,0) to (1,1). If the mean value of the Rel distribution is greater than that of the , the operating characteristic curve is moved closer to the upper-left corner, as shown in Figure 4.18.

While Swets (and subsequently others [Robertson, 1969; Bookstein and Kraft, 1977]) then considered fairly elaborate tests to discriminate the relative performance of retrieval systems with respect to such curves, it is fair to say that the 1979 assessment of van Rijsbergen, p. 154 still stands:

...although the Swets model is theoretically attractive and links IR measurements to ready-made and well-developed statistical theory, it has not found general acceptance amongst workers in the field.

Optimal selection of Rank depends on specification of the costs (losses) of making FALSE+ or FALSE– errors. For example, if you are an overworked and underpaid law clerk and you read an irrelevant document (FALSE+), you’ve wasted precision attention, but that’s all; if you miss a reference you should have found (FALSE–) the cost might be huge. But if you’re a partying undergraduate with one more term paper between you and summer vacation, your assessments might be quite different. Section 5.5.6 gives an example of how explicit models of these various costs can be incorporated within a Bayesian decision-making framework.

4.4 RAVE: A Relevance Assessment VEhicle

Section 4.3.2 argued that the opinions of many users concerning the relevance of a document to a query provides a more robust characterization than any single expert. It seems, however, that our move from omniscient to consensual relevance has only made the problem of evaluation that much more difficult: Test corpora must be large enough to provide robust tests for retrieval methods, and multiple queries are necessary to evaluate the overall performance of an IR system. Getting even a single person’s opinion about the relevance of a document to a particular query is hard, and we are now interested in getting many!

This section describes RAVE, a relevance assessment vehicle that demonstrates that it is possible to operationally define relevance in the manner we suggest. RAVE is a suite of software routines that allow an IR experimenter to effectively collect large numbers of relevance assessments for an arbitrary document corpus. It has been used with a number of different classes of students to collect the relevance assessments used for evaluation with respect to the AIT corpus; your teacher may have you participate in a similar experiment. It can also be used to collect assessments for other document corpora and query sets.

4.4.1 RAVeUnion

It would be most useful if, for every query, the relevance of every document could be assessed. However, the collection of this many assessments, for a corpus large enough to provide a real retrieval test, quickly becomes much too expensive. But if the evaluation goal is relaxed to being the relative comparison of one retrieval system to one or more alternative systems, assessments can be constrained to only those documents retrieved by one of the systems.

We therefore follow the pooling procedure used by many other evaluators, viz., using the proposed retrieval methods themselves as procedures for identifying which documents are worth assessing.

The first step in constructing a RAVE experiment is to combine the ranked retrieval lists of the two or more retrieval methods, creating a single list of documents ordered according to how interested we are in having them assessed by a subject.

RAVeUnion produces the most straightforward “zipper” merge of the lists, beginning with the most highly ranked and alternating one.More elaborate ways of merging ranked lists The output of RAVeUnion is a file of (query, document) pairs along with a field that indicates if the pair was uniquely suggested by only one of the methods. This last information can be used to compare the average relevance scores of documents suggested by one method alone to those retrieved by more than one.

4.4.2 RAVePlan

A second challenge is to find the desired density or redundancy of sample points. That is, for each document that we believe may be relevant to a query, how many subjects should evaluate it? The answer will vary depending on such factors as the number of participants, their expertise, their motivation to produce quality judgments, how long each will spend rating documents, etc. A higher density means that fewer documents will be evaluated, but also that the intersubject cumulative assessment is likely to be more statistically stable. This can be especially important when RelFbk is to be used to train the system.

The trade-off between the most important of these factors is captured in the following formula:

(4.21)

where

x = number of documents to be evaluated for each query
N = number of subjects
R = expected subject efficiency (votes/user/time)
S = desired density (votes/document)
T = time spent by subjects
Q = number of queries to be evaluated

Note that this formula ignores the overlap between queries that occurs when users see a document that may be relevant to two or more of the queries in their list. Care must be taken, therefore, to minimize expected overlap between the topical areas of the queries. We have also found that the assessment densities constructed using this formula are unfortunately uneven. The main source of these is variability in R, the rate at which subjects are able to produce relevance assessments.

RAVePlan takes as input a list of Q query specifications, a list of N subject logins, the desired density S, and the number of documents R*T that should be allocated to each subject. The query specifications indicate which queries can go in which fields and which queries should not be shown together. This allows us to limit possible interactions between queries about similar topics.A better RAVePlan

RAVePlan outputs two files. The plan file, which is an input to the RAVE interactive application, lists each subject ID along with the queries and document numbers that have been selected for that subject. The assignments file is a list of document-query pairs, which tells us which query we expect the document to be relevant to. RAVeCompile uses this file after data collection is complete to generate true and false-positive measures.

4.4.3 Interactive RAVE

The interactive portion of RAVE is an HTML document, as shown in Figure 4.19. The top of RAVE’s window displays the three queries against which the subject is to judge each document. Two queries are short sentences or phrases, in this case REASONING ABOUT UNCERTAINTY and LEGAL APPLICATIONS, and the third is a scrolling pane containing the text of the long, RelFbk-defined document (in this example, the thesis is A PLANNING MODEL WITH PROBLEM ANALYSIS...). While the subject is asked to judge the documents shown to him or her as being about the two short queries, the task associated with the RelFbk-document query is to find “documents like this.”

Beneath each query the RAVE window contains four radio buttons labeled “Not” (relevant), “Possibly” (relevant), “Relevant,” and “Critically”(relevant). Because we asked our subjects to spend two hours each but could not assume their participation would be continuous, there is a “Quit” button that allows the subject to suspend the session; when the subject launches RAVE again, the session is resumed where he or she left off. Finally, the “Next” button is pressed after the subject has read and recorded his or her relevance assessments for a document.

4.4.4 RAVeCompile

RAVeCompile processes the votes recorded by interactive RAVE, creating a relevance assessment file and some statistics about each query and subject. The first step is to map the four-valued relevance assessments into the Boolean relevant/nonrelevant discriminations typically used by standard IR evaluation techniques. RAVeCompile lets the experimenter configure a simple predicate of the following form:

s = a · NPoss + b · NRel + c · NCrit

NVote = NPoss + NRel + NCrit

RelP = (NVote Quorum) (s · NVote MinAvg)

where

s = weighted aggregate score across relevance levels
a = weight assigned to votes of “possibly relevant”
b = weight assigned to votes of “relevant”
c = weight assigned to votes of “critically relevant”
NVote = total number of active votes collected for (q, d) pair
Quorum = minimum number of votes required for (q, d) to be considered relevant
MinAvg = minimum weighted sum required for (q, d) to be considered relevant

In one set of experiments [Belew and Hatton, 1996], the data used two predicates constructed in this fashion. These are:

  • permissive: if (two or more POSSIBLE votes) or (at least one RELEVANT vote)
  • stringent: if (two or more RELEVANT votes) or (at least one CRITICAL vote)

4.5 Summary

In this chapter we began by making some assumptions about users of a search engine in order to figure out how well the system is satisfying users’ information needs. We focused on two separate notions of assessment: first, assessing the relevance of documents retrieved by the system in response to a particular query, and second, assessing the search engine’s overall utility through aggregating relevance judgments provided by many users performing many queries.

Section 4.1 discussed both metric and nonmetric relevance feedback and the difficulties in getting users to provide relevance judgments for documents in the retrieved set. We saw, however, that relevance feedback could be used to either suggest query refinements to the users or to modify the underlying document representations to improve future system performance.

The concept of consensual relevance introduced in Section 4.2 addresses an issue raised in Chapter 1, where we asked what success criteria can be used in evaluating a search engine. Consensual relevance tells us that relevant documents are those documents that many users find to be useful. We can ask how useful a particular search engine is, or compare one search engine with another, by posing the question: How useful (relevant) do users find the documents retrieved in response to queries?

To answer that question we quantified several measures of system performance. The generality of a query is a measure of what fraction of documents in the corpus is relevant to the query. Fallout measures the fraction of irrelevant documents found in the retrieved set of a given query. The key notions of recall, the fraction of relevant documents in the retrieved set, and precision, the fraction of retrieved documents that are relevant, allow us to make direct comparisons between two search engines’ performances on any query. Other methods of comparison include sliding ratio, point alienation, expected search length, and operating characteristic curves.


TERMS INTRODUCED IN THIS CHAPTER

11-point average (131)
ad hoc retrieval (137)
berrypicking (110)
centroid (111)
classification accuracy (133)
composite (132)
consensual (106)
contingency table (122)
Cornell corpora (135)
costs (141)
document modifications (115)
effectiveness (133)
expected search length (137)
expected search length reduction factor (138)
F-measure (132)
fallout (123)
filter (114)
filtering (137)
generality (130)
harmonic mean (133)
hitlist (124)
hitlist rank (125)
intersubject reliability (117)
interpolation (130)
litigation support (121)
multicriterial (132)
nonmetric (108)
normalized recall (128)
object recognition (107)
omniscient expert (106)
operating characteristic curve (140)
operationalizing (106)
permissive (146)
point alienation (134)
pooling (120)
preference (109)
query session (110)
recall (123)
Recall/Precision curve (125)
relevance (106)
relevance assessment (106)
relevance feedback (105)
routing (137)
sliding ratio (134)
stringent (146)
test corpora (135)
total ordering (125)
Back to TOC