4 | ![]() | ||||||||||||||||||||||||||||||||||||||||||||
Assessing the RetrievalWeve come a long way since Chapter 1, where we first sketched the full range of activities we might consider FOA. As Chapter 2 considered the various ways of breaking text into indexable features, and Chapter 3 explained the various ways of weighting combinations of these features to identify the best matches to a query, I hope you have been aware of how many alternatives have been mentioned! That is, rarely has there been a single method that can be proven to be better than all others. IR has traditionally been driven by empirical demonstrations, and the range of commercial competitors now trying to provide the best search of the WWW makes it likely this performance orientation will continue. But whether we are search engineers, scientists objectively assessing one particular technique, or consumers of WWW search engine technology interested in buying and using the best, a solid basis of performance assessment is critical. Several perspectives on assessment are possible. In the first chapter FOA was viewed as a personal activity, adopting the users points of view. Section 4.1 will continue in this theme, considering how users assess the results of their retrievals and how they can express their opinions using relevance feedback (RelFbk). Oddy is credited with first identifying this important stream of data, naturally provided by users as a part of their FOA browsing [Oddy, 1977; Belkin et al., 1982]. But in this book we are also concerned with FOA from the IR system builders point of view. Ideally, we would like to construct a search engine that robustly finds the right documents for each query and for each user. The second section of this chapter discusses performance measures of statistical properties that are reliable across large numbers of users and their highly variable queries. The key to these measures is having some insight into which documents should have been retrieved, typically because some idealized omniscient expert has determined (within a specially constructed experimental situation) that certain documents should have been retrieved. Alternatively the RelFbk of many users can be combined to form a consensual opinion of relevance, as described in Section 4.4. A concrete notion of relevance would seem a fundamental precondition for understanding either an individuals RelFbk or how this can be used to assess a search engine. But in this respect, information retrieval generally, and RelFbk in particular, is like many other academic areas of study (including artificial intelligence and genetics) in that the lack of a fully satisfactory definition of the core concept (information, intelligence, genes, and so on) has not entirely stopped progress. That is, a great deal can be done by operationalizing RelFbk to be simply those relevance assessment behaviors produced as part of an FOA dialog. This operational simplification will hold us until fundamental issues of language and communication are again addressed in Section 8.2.1 4.1 Personal Assessment of Relevance*4.1.1 Cognitive AssumptionsGarbage in, garbage out is one of the first insights every software developer learns, and FOA is no exception. The primary source of data considered by traditional IR methods, and the focus of Chapter 2 and Chapter 3 of this text, are the documents of the corpus, particularly the keywords they contain.(Chapter 6 will consider the use of other document attributes.) A fundamental feature of the broader FOA view is that browsing users provide an equally important source of data concerning what keywords mean and what documents are about. It is therefore appropriate to begin by characterizing who these users we will be watching are. We begin with one important cognitive assumption we must make about our users: How thorough an FOA search do they wish to perform? Is this an important search to which the users are willing to dedicate a great deal of time and attention, or will a quick, cursory answer suffice? For example, Chapter 1 mentioned how much less thorough the typical undergraduate (doing some quick research before submitting a term paper) is than the Ph.D. candidate (who wants to ensure his or her proposed dissertation topic is new). The typical WWW searcher seems satisfied with only a few useful leads, but the professional searcher (a lawyer looking for any case that might help, a doctor looking for any science that might heal a patient) will search diligently if there is even a small chance of finding another relevant document. This kind of variability can be observed not only across different classes of users but even across the same user at different times. In for a fact, stay for a lesson Prototypic RetrievalsFrom the perspective of cognitive psychology, the task facing users who are asked to produce relevance feedback (RelFbk) can best be described as one of object recognition, in the tradition of Rosch and others [Rosch and Mervis, 1975; Rosch, 1977]. The object to be recognized is an internally represented prototypic document satisfying the users information need. In this case, the prototype corresponds to the model the subjects maintain of ideally relevant documents. As users consider an actual retrieved documents relevance, they evaluate how well it matches the prototype. Barry and others have suggested the many and varied features over which prototypes can be defined [Barry, 1994]. Only a small number of these may be revealed by any one of the users queries, of course. Here we will simply assume that users are capable of grading the quality of this match. Users might be asked to score the quality of relevance match according to a five-point scale like that shown in Figure 4.1. Users can qualify the middle Relevant response either by weakening it (Possibly Relevant) or strengthening it (Critically Relevant). Such distinctions are often made in experimental settings (e.g., in the use of the STAIRS retrieval system by lawyers [Blair and Maron, 1985]) and relate to the different purposes for FOA that different users may have. To make these distinctions concrete, we might imagine Critically Relevant to apply only to those documents that must be read even for an undergrad term paper, while Possibly Relevant would be much more broadly applied to those a Ph.D. student needs as part of his or her literature review. For now, however, we will simplify the types of RelFbk to allow users to reply with only a single grade of Relevant ( RelFbk is NonmetricAs we move from a cognitive understanding of the users tasks to statistical analyses of their behaviors, we must understand one important feature of the RelFbk data stream: RelFbk is nonmetric data. That is, while users find it easy and natural to critique retrieved documents with Think for a moment just why this is hard, by imagining your reactions to a typical retrieval. Is the first document to be rated 10 or 6743? If you rate the first document as 10, the second as 6, and the third as 2, then you must ensure that the third document is exactly as much less relevant than the second as the second is from the first. Trying to keep all Rel(di) assessments consistent in the metric sense, for many retrieved documents or any other set, makes people crazy. This is not only a property of relevance assessments. A large literature on psychological assessment [Kruskal, 1977a; Kruskal, 1977b; Shepard et al., 1972] has demonstrated that human subjects can quite easily and reliably sort objects into piles, and that they like one pile better than another. Yet the same people find it more difficult to quantify how much they like each object, let alone make these quantitative assessments consistent with one another. Reliable estimates of both cognitive qualities would be necessary if we are to have a true preference metric.* Rather than assuming that users can provide a separate score for each retrieved document, we will treat this as an ordered nonmetric scale of increasing preference:
Each of these assumptions about what relevance is and how it can be measured is a matter of considerable debate [Wilson, 1973; Froehlich, 1994] and is likely to be the topic of much future work. It is also interesting to note that in our later attempts at a comprehensive model of how and why humans use language, relevance again plays a central role (cf. Section 8.2.2). 4.2 Extending the Dialog with RelFbkFigure 4.2 focuses on a single instance of RelFbk, shown as a labeling over the set of Retr documents. But beyond any one reaction to a single retrieved set product, a central premise of the FOA process is that users reactions to just-retrieved documents provide the pivotal linkage between assessments, to form the FOA search dialog. This is perhaps more clear in Figure 4.3, where RelFbk is used to link a series of reactions into a query session. Attempts to support this searching process, and then attempts to rigorously evaluate how well software systems support browsing users as they FOA, is one of the most vexing issues within IR evaluation [Daniels et al., 1985; Saracevic et al., 1988; Larson, 1996; Oddy et al., 1992; ODay and Jeffries, 1993; Cawsey et al., 1992; Russel et al., 1993; Koenemann and Belkin, 1996]. Part of the problem is the misconception that if a search engine works perfectly and the user issues the perfect magic bullet query, out will spill all and only relevant documents! Such simplistic definitions of optimality come naturally to computer scientists; library scientists, who are used to the naturalistic behaviors of real patrons in their libraries, know that a much more extended and nebulous form of support is required. Marcia Batess famous berry picking metaphor [Bates, 1986; Bates, 1989] is useful here:
In addition to highlighting the same iterative browsing behavior central to FOAs characterization of the dialog, the evolving character of the information need in Batess metaphor is also important. Imagine that you are in the forest on an idyllic day with only one purpose: to fill your bucket with the best blueberries you can find. Early in the day, with your whole afternoon in front of you, you are likely to be very choosy. At this juncture, you could bump into a bush full of blueberries that were not as ripe or as large as you imagine must exist somewhere else in the forest, and not drop a single one into your basket. But late in the afternoon, if you have had poor luck and little to show for your efforts, you could come across an even worse bush and grab every single berry, even the shriveled ones! Applying this metaphor to FOA is provocative in many respects. For example, it suggests that maintaining an explicit representation of the retrieved document basket might be a useful addition to any search engine interface. It predicts a time course to the distribution of users RelFbk assessments. For now, we simply observe that it seems quite likely that an assessment of one documents relevance will depend greatly on the basket of other documents we have already seen. The general idea of thinking of an evolutionary ecology of information foraging [Pirolli and Card, 1997] has become less metaphoric and more concrete as information search agents (like the InfoSpiders described in Section 7.6) explore the environment of the WWW. 4.2.1 Using RelFbk for Query RefinementWhile Figure 4.2 showed the retrieved set of documents as a simple set, it is interesting to impose the RelFbk labeling on the retrieved documents viewed in vector space. Figure 4.4 shows a query vector and a number of retrieved documents,together with a plausible distribution of RelFbk over them. That is, we can expect that there is some localized region in vector space where It is less reasonable, however, to imagine that the negatively labeled The vector space view also lets us easily portray two very different uses to which RelFbk information might be applied. Most typically, RelFbk is used to refine the users query. Figure 4.6 represents this refinement in terms of two changes we can make to the initial query vector. The first is to take a step toward the centroid of the positive RelFbk cluster. The size of this stepNeural-net style learning reflects how confident we are that the positive cluster centroid is a better characterization of the users interests than the original query. There is one important difference between the query and even a slight perturbation of it toward a clusters centroid: While the original query vector is often very sparse and result from just those few words used in the users original query, any movement toward the centroid will include (linear combinations of) all those keywords used in any of the positively labeled documents. The additional difficulty in implementing this more densely filled feature vector becomes a serious obstacle in many system implementations. The fact that RelFbk-refined queries involve many more nonzero keyword entries also means that query weighting and matching techniques may be sensitive to this difference. Seminal work on the use of RelFbk was done by Salton, especially with students Rocchio, Brauen, and Ide in the late 1960s and early 1970s [Rocchio, 1966; Brauen, 1969; Salton, 1972]. More recent students have extended the theory of query refinement and related it to topics in machine learning [Buckley et al., 1994; Buckley and Salton, 1995; Allan, 1996]. Some of these experiments explored a second modification to the query vector. In addition to moving toward the d+ center of
One possible strategy is to take a single element d of the irrelevant retrieved documents (for example, the most highly ranked irrelevant retrieval) and define the direction of movement with respect to it alone. As we have discussed in connection with Figure 4.2, RelFbk helps to link individual queries into a browsing sequence. And so, although we have focused here on the simplest form of query refinement, with respect to the users initial queries, RelFbk can be given again and again. An initial query vector is moved toward the centroid of documents identified as relevant (and perhaps away from an especially bad one); this modified query instigates a new retrieval, which is again refined. In practice, it appears that such adjustments result in diminishing returns after only a few iterations of query refinement [Salton and McGill, 1983; Stanfill and Kahle, 1986b]. However, Section 7.3 will discuss a type of FOA in which a document corpus is constantly changing and the users interest in a topic is long-lived. In this case, we can imagine the query as a filter against which a constant stream (e.g., of newly published Web documents) is applied. RelFbk has also been used in this setting, to make ongoing changes to the query/filter that continue to improve its effectiveness [Allan, 1996]. Using RelFbk for query refinement produces results that are immediately satisfying to the users. First, it automatically generates a new query with which they can continue their browsing process. Second, the statistical analysis of positively labeled retrieved documents can provide other forms of useful information to the users as well. For example, rather than simply retrieving a new set of documents, new keywords that are not in the users original queries but are present in positively labeled documents at statistically significantly levels can be suggested to the users as new vocabulary. Conversely, words that were in the original query but are negatively correlated with d+ (and/or positively correlated with d) can also be identified. 4.2.2 Using RelFbk to Adapt Documents IndicesAn alternative use of RelFbk is to make changes to the documents rather than to the query. Previously, the argument was that it is sensible to make the query look more like those documents the users liked; now the argument is the converse: Documents found relevant to a query should be described more like the query used to identify them. Changes made to document vectors according to this heuristic are known as document modifications and are shown in Figure 4.7. Note that unlike query modification, adaptive document modifications made in response to RelFbk are not expected to be of (immediate) use to the users who provide it. Instead, the hope is that these changed document representations will be available later, to others who might be searching. As Salton [Salton and McGill, 1983] described the goal:
This provocative proposal, allowing a search engine to learn from its users, is considered in much greater detail in Chapter 7. 4.2.3 SummaryWe have been discussing RelFbk from the individual users point of view. Weve focused on how this information might be collected and how it might be used, both in the short term to modify users retrievals and in a longer-term way to change the documents indices. Now we want to consider a third use for RelFbk information. When we have more than one system to use for retrieval and would like to evaluate which is doing the better job, users assessments of retrieved documents relevance can be used as grades. If one system can consistently, across a range of typical queries, more frequently retrieve documents that the users mark as relevant and fewer that they mark as irrelevant, then that system is doing a better job. 4.3 Aggregated Assessment: Search Engine PerformanceThe last section considered the assessment problem from the perspective of a single individual, the browsing user. Now we would like to generalize on this individual performance to attempt to obtain statistically significant observations. 4.3.1 Underlying AssumptionsAs with all scientific models, our attempts to evaluate the performance of a search engine rests on a number of assumptions. Many of these involve the user and simplifying assumptions about how users assess the relevance of documents. 1. Real FOA versus Laboratory Retrieval. From the FOA perspective, users retrieve documents as part of an extended search process. They do this often, and because they need the information for something important to them. If we are to collect useful statistics about FOA, we must either capture large numbers of such users in the act (i.e., in the process of a real, information-seeking activity) or attempt to create an artificial laboratory setting. The former is much more desirable but makes strong requirements concerning a desirable corpus, a population of users, and access to their retrieval data. So, typically, we must work in a lab. The first big assumption, then, is that our lab setting is similar to real life; i.e., guinea pig users will have reactions that mirror real ones.With Web search engines dont we have access to enormous numbers of users searching the same corpus? [SG] 2. Intersubject Reliability. Even if we assume we have a typical user and that this user is engaged in an activity that at least mirrors the natural FOA process, we have to believe that this user will assess relevance the same as everyone else! But clearly, the educational background of each user, the amount of time he or she has to devote to the FOA process relative to the rest of the task, and similar factors will make one users reaction differ from anothers. For example, there is some evidence that stylistic variations also impact perceived relevance [Karlgren, 1996]. The consensual relevance statistic (cf. Section 4.3.2) is one mechanism for aggregating across multiple users. This becomes a concern with intersubject reliability. If we intend to make changes to document representations based on one users RelFbk opinions, we would like to believe that there is at least some consistency between this users opinion and those of others. This is a critical area for further research, but some encouraging, preliminary results are available. For example, users of a multilingual retrieval system that presents some documents in their first language (mother tongue) and others in foreign languages they read less well, seem to be able to provide consistent RelFbk data even for documents in their second, weaker language [Sheridan and Ballerini, 1996]. 3. Independence of Interdocument Relevance Assessments. Finally, notice that the atomic element of data collection for relevance assessments is typically a (queryi, documentj) pair: documentj is relevant to queryi. Implicitly, this assumes that the relevance of a document can be assessed independently of assessments of other documents. Again, this is a very questionable assumption. Recall also that often the proxy on which the users relevance assessment depends is distinct from the document itself. The user sees only the proxy, a small sample of the document in question, for example, its title, first paragraph, or bibliographic citation. While we must typically take user reaction to the proxy as an opinion about the whole document, this inference depends critically on how informative the proxy is. Cryptic titles and very condensed citation formats can make these judgments suspect. And of course the users ultimate assessment of whether a document is relevant, after having read it, remains a paradox. 4.3.2 Consensual RelevanceIn most search engine evaluation, the assumption has been that a single expert can be trusted to provide reliable relevance assessments. Whether any one, omniscient individual is capable of providing reliable data about the appropriate set of documents to be retrieved remains a foundational issue within IR. For example, a number of papers in a recent special issue of the Journal of the American Society for Information Systems devoted to relevance advocated a move toward a more user-centered, situational view of relevance[Froehlich, 1994]. Our attention to the opinions of individual users suggests the possibility of combining evidence from multiple human judges. Rather than having relevance be a Boolean determination made by a single expert, we will consider relevance to be a consensual, central tendency of the searching users opinions. The relevance assessments of individual users and the resulting central tendency of relevance is suggested by Figure 4.8. Two features of this definition are significant. First, consensual relevance posits a consumers perspective on what will count as IR system success. A documents relevance to a query is not going to be determined by an expert in the topical area, but by the users who are doing the searching. If they find it relevant, its relevant, whether or not some domain expert thinks the document should have been retrieved. Second, consensual relevance becomes a statistical, aggregate property of multiple users reactions rather than a discrete feature elicited from any one individual. By making relevance a statistical measure, our confidence in the relevance of a document (with respect to a query) increases as more relevance assessment data are collected. This reliance on statistical stability creates a strong link between IR and machine learning (cf. Chapter 7). Allens investigation into idiosyncratic cognitive styles of browsing users [Allen, 1992] and Wilburs assessment of the reliability of RelFbk across users [Wilbur, 1998] provide a more textured view of how multiple relevance assessments can be compared and combined. It seems, however, that our move from omniscient to consensual relevance has only made the problem of evaluation that much more difficult. Test corpora must be large enough to provide robust tests for retrieval methods, and multiple queries are necessary to evaluate the overall performance of a search engine. Getting even a single persons opinion about the relevance of a document to a particular query is hard, and we are now interested in getting many! However, software like RAVE (cf. Section 4.4) allows an IR experimenter to effectively collect large numbers of relevance assessments for an arbitrary document corpus. 4.3.3 Traditional Evaluation MethodologiesBefore surveying all of the ways in which evaluation might be performed, it is worthwhile to sketch how it has typically been done in the past [Cleverdon and Mills, 1963]. In the beginning, computers were slow and had very limited disk space and even more limited memories; initial test corpora needed to be small, too. One benefit of these small corpora was that it allowed at least the possibility of having a set of test queries compared exhaustively against every document in the corpus. The source of these test queries, and the assessment of their relevance, varied in early experiments. For example, in the Cranfield experiments [Lancaster, 1968], 1400 documents in metallurgy were searched according to 221 queries generated by some of the documents authors. In Saltons experiments with the ADI corpus, 82 papers presented at a 1963 American Documentation Institute meeting were searched against 35 queries and evaluated by students and staff experts associated with Saltons lab [Salton and Lesk, 1968]. Lancasters construction of the MEDLARS collection was similar [Lancaster, 1969]. As computers have increased in capacity, reasonable evaluation has required much larger corpora. The Text Retrieval Conference (TREC), begun in 1992 and still held annually, has set a new standard for search engine evaluation [Harman, 1995]. The TREC methodology is notable in several respects. First, it avoids exhaustive assessment of all documents by using the pooling method, a proposal for the construction of ideal test collections that predates TREC by decades [Sparck Jones and van Rijsbergen, 1976]. The basic idea is to use each search engine independently and then pool their results to form a set of those documents that have at least this recommendation of potential relevance. All search engines retrieve ranked lists of k potentially relevant documents, and the union of these retrieved sets is presented to human judges for relevance assessment. In the case of TREC, k = 100 and the human assessors were retired security analysts, like those that work at the National Security Agency (NSA) watching the worlds communications. Because only documents retrieved by one of the systems being tested are evaluated there remains the possibility that relevant documents remain undiscovered, and we might worry that our evaluations will change as new systems retrieve new documents and these are evaluated. Recent analysis seems to suggest that, at least in the case of the TREC corpus, evaluations are in fact quite stable [Voorhees, 1998]. An important consequence of this methodological convenience is that unassessed documents are assumed to be irrelevant. This creates an unfortunate dependence on the retrieval methods used to nominate documents, which we can expect to be most pronounced when the methods are similar to one another. For example, if the alternative retrieved sets are the result of manipulating single parameters of the same basic retrieval procedure, the resulting assessments may have overlap with, and hence be useless for comparison of, methods producing significantly different retrieval sets. For the TREC collection, this problem was handled by drawing the top 200 documents from a wide range of 25 methods, which had little overlap [Harman, 1995]. Vogt [Vogt and Cottrell, 1998] has explored how similarities and differences between retrieval methods can be similarly exploited as part of combined, hybrid retrieval systems (cf. Section 7.5.4). It is also possible to sample a small subset of a corpus, submit the entire sample to review by the human expert, and extrapolate from the number of relevant documents found to an expected number across the entire corpus. One famous example of this methodology is Blair and Marons assessment of IBMs STAIRS retrieval system [Blair and Maron, 1985] of the early 1980s. This evaluation studied the real-world use of STAIRS by a legal firm as part of a litigation support task: 40,000 memos, design documents, etc. were to be searched with respect to 51 different queries. The lawyers themselves then agreed to evaluate the documents relevance. As they reported:
This is a difficult methodology, but it allows some of the best estimates of Recall available. And the news was not good: On average, retrievals captured only 20 percent of relevant documents!Killing the messenger In short, methodologies for valid search engine evaluations require much more sophistication and care than is generally appreciated. Careful experimental design [Tague-Sutcliffe, 1992], statistical analysis [Hull, 1993], and presentation [Keen, 1992] are all critical. 4.3.4 Basic MeasuresFigure 4.9 shows the relationship between relevant (Rel) and retrieved (Retr)sets as a Venn diagram, against the backdrop of the universe U of the rest of the documents of the corpus. Obviously, our focus should be on those documents that are in the intersection of Rel and Retr and on making this intersection as large as possible. Informally, we will be most happy with a Rel set when it best overlaps with the Retr set, and therefore we seek evaluation measures that reflect this. The basic relations between the sizes of these sets can also be captured in the contingency table of Table 4.1. We know we want the intersection of the Rel and Retr sets to be large, but large relative to what?! As mentioned in Chapter 1, if we are most focused on the Rel set and use it as our standard of comparison, wed like to know what fraction of these weve retrieved. This ratio is called recall:
Anticipating the probabilistic analysis of Section 5.5, we can think of Recall as (an estimate of) the conditional probability that a document will be retrieved, given that it is relevant: Pr(Ret|Rel). Conversely, if we instead focus on the Ret set, we are most interested in what fraction of these are relevant; this ratio is called precision:
Similarly, this is the probability that a document will be relevant, given that it is retrieved: Pr(Ret|Rel). A closely related but less common measure is called fallout, where we (perversely!) focus on the irrelevant documents and the fraction of them retrieved:
The close relationship between these three measures can be defined precisely, if the generality G of the query (cf. Section 4.3.7) is known:
This is Pr(Ret| Ideally, of course, wed like a system that has both high precision and high recall: only relevant documents and all of them. But real-world, practical systems must select documents based on features that are only statistically useful indicators of relevance; we can never be sure. In this case efforts made to improve recall must retrieve more documents, and it is likely that precision will suffer as a consequence. The best we can hope for is some balance. In some applications it is nevertheless desirable to evaluate IR system performance according to a single measure rather than the two-dimensional Recall/Precision criteria.Single dimensions for simple minds We will return to this topic in Section 4.3.8. 4.3.5 Ordering the Retr Set
The next step is to move beyond thinking of Retr as simply a set. We will suppose that retrieved documents are returned in some order by the search engine, reflecting its assessment of how well each document matches the query. Following current Web vernacular, we will call this ordering of the Ret set a hitlist and a retrieved documents position its hitlist rank Rank(di). This is a positive integer assigned to each document in the Retr set, in descending order of similarity with respect to the matching function Match (q,d):
Sparck Jones [Sparck Jones, 1972] and others have historically referred to a documents rank in Ret as its coordination level (cf. Eq. 3.36). Strictly speaking, coordination level refers to the number of keywords shared by document and query. In Boolean retrieval systems, sensitive only to the presence or absence of keywords, ranking by coordination level may be the only available measure on document/query similarity. For long queries, hitlist rank and coordination level are likely to be similar, because it is unlikely that different documents will match exactly the same number of words from the query. But for short queries, it is likely that coordination level will only partially order the Retr set. This is why van Rijsbergen, p. 161, speaking of the Boolean systems typical at that time, said, Unfortunately, the ranking generated by a matching function is rarely a simple ordering, but more commonly a weak ordering. Most modern search engines, however, exploit keyword weightings and can provide much more refined measures, thereby providing a total ordering of the hitlist. According to the Probability Ranking Principle (cf. Section 5.5.1), a retrieval system is performing optimally if it retrieves documents in order of decreasing probability of relevance. For now we simply assume that there is a total ordering imposed over Retr. We will use the hitlist ranking to effectively define a series of retrievals. Setting a very high threshold on this ordering would mean retrieving a very small set, while setting a lower threshold will retrieve a much larger one. Now consider a particular query q and the set Relq of relevant documents associated with it. Assuming that Retr is totally ordered makes it possible for us to define the fundamental analytic tool for search engine performance: the Recall/Precision curve (Re/Pre curve). The basic procedure is to consider each retrieved document in hitlist rank order and to ask for the precision and recall of the retrieval of all documents up to and including this one. Consider the first of the two hypothetical retrievals shown in Table 4.2. With respect to this query, we will assume there are exactly five relevant documents out of a total of 25 in the corpus. The very first one retrieved is deemed relevant; if we stopped retrieval at this point, our recall would be 0.2 (because we would have retrieved one of five relevant documents), and our precision is perfect (the one retrieved document is relevant). Our good luck continues as we consider the next document, which is also relevant; this generates a second Re/Pre data point of (0.4,1.0). We are not so lucky with the third document retrieved; precision drops to 0.67 and recall remains at 0.4. Proceeding down the retrieval in rank order, and plotting each point in this fashion gives the Re/Pre curve shown in Figure 4.10. At this point we can already make several observations. Asymptotically, we know that the final recall must go to one; once we have retrieved every document weve also retrieved every relevant document. The precision will be the ratio of the number of relevant documents to the total corpus size. Ordinarily, unless we are interested in very general queries or very small sets of documents, this ratio will be very close to zero. The other end of the curve, however, turns out to be much less stable. We would hope that a retrieval systems very first candidate for retrieval, the document with hitlist rank = 1, will be relevant, but it may not be. Figure 4.11 shows a second pair of hypothetical data points (blue), corresponding to the case that a single irrelevant document is ranked higher than the relevant ones. This relatively small change in assessment creates a fairly dramatic effect on the curve, with real consequence once we need to juxtapose multiple queries curves (see Section 4.3.7). Such instability is an inevitable consequence of the definitions of Precision and Recall: If the first retrieved document happens to be relevant, its Re/Pre coordinates will be less than 1, and Figure 4.12 puts this particular retrieval in the context of the best and worst retrievals we might imagine. The best possible retrieval (orange) would be to retrieve the five relevant documents first, and then all other documents. This would produce the upper, square Re/Pre curve. Alternatively, the worst possible retrieval (gray) would retrieve all but the relevant documents before returning these; this produces the lower line. 4.3.6 Normalized Recall and PrecisionThe best/worst envelope surrounding an actual Re/Pre curve is related to a similar comparison known as normalized recall [Rocchio, 1966]. Imagine plotting the fraction of relevant documents retrieved as a function of the fraction of the total number of documents retrieved. Such a function is plotted in Figure 4.13. Comparing the area between the actual retrieval and the worst case (yellow) to the total area between the best (orange) and worst cases (that above the blue region) is very much like the best/worst-case envelope of Figure 4.12. We can derive expressions for this area. Let ri be the hitlist rank of the ith relevant document. Then the area under the curve corresponding to any actual query is (if we define r0=0):
In the best case ri = i:
when NRel
4.3.7 Multiple Retrievals across Varying QueriesIt should come as no surprise, given the wide range of activities in which FOA is a crucial component, that there is enormous variability among the kinds of queries produced by users. The next step of our construction is therefore to go beyond a single query to consider the performance of a system across a set of queries. One obvious dimension to this variability concerns the breadth of the query: How general is it? If the Rel set for a query is known, this can be quantified by generality, comparing the size of Rel to the total number of documents in the corpus:
There are many other ways in which queries can vary, and the fact that different retrieval techniques seem to be much more effective on some types of queries than others makes this a critical issue for further research. For now, however, we will treat all queries interchangeably but consider average performance across a set of them. Figure 4.14 juxtaposes two Re/Pre curves corresponding to two queries. Query 1 is as before, and Query 2 is a more specific query, as evidenced by its lower asymptote. Even with these two queries, we can see that, in general, there is no guarantee that we will have Re/Pre data points at any particular recall level. This necessitates interpolation of data points at desired recall levels. The typical interpolation is done at prespecified recall levels, for example, 0, 0.25, 0.5, 0.75, and 1.0. As van Rijsbergen, p. 152 discusses, a number of interpolation techniques are available, each with its own biases. Because each new relevant document added to our retrieved set will produce an increase in precision (causing the saw-tooth pattern observed in the graph), simply using the next available data point above a desired recall level will produce an overestimate, while using the prior data point will produce an underestimate. With preestablished recall levels, we can now juxtapose an arbitrary number of queries and average over them at these levels. For 30 years the most typical presentation of results within the IR community was the 11-point average curves, like those shown in Figure 4.15 [Salton and Lesk, 1971; Salton and Lesk, 1968]. (These data happen to show performance on the ADI corpus of Boolean versus weighted retrieval methods; they include only the last 10 data points.) It is not uncommon to see research data reduced even further. If queries are averaged at fixed recall levels and then all of these recall levels are averaged together, we can produce a single number that measures retrieval system performance. Note the serious bias this last averaging across recall levels produces, however. It says that we are as interested in how well the system did at the 90 percent recall level as at 10 percent!? Virtually all users care more about the first screenfull, of hits they retrieve than the last. This motivates another way to use the same basic recall/precision data. Rather than measuring at fixed recall levels, statistics are collected at the 10-, 25-, and 50-document retrieval levels. Precision within the first 10 or 15 documents is arguably a much closer measure of standard browser effectiveness than any other single number. All such atempts to reduce the Re/Pre plot to a single number are bound to introduce artifacts of their own. In most cases the full Re/Pre curve picture is certainly worth a thousand words. Plotting the entire curve is straightforward and immediately interpretable, and it lets viewers draw more of their own conclusions. We must guard against taking our intuitions based on this tiny example (with only 25 documents in the entire corpus) too seriously when considering results from standard corpora and queries. For example, our first query had fully 20 percent of the corpus as relevant; even our second query had 8 percent. In a corpus of a million documents, this would mean 80,000 of them were relevant!? Much more typical are queries with a tiny fraction, perhaps 0.001 percent, relevant. This will mean that the precision asymptote is very nearly zero. Also, we are likely to have many more relevant documents, resulting in a much smoother curve. 4.3.8 One-Parameter CriteriaThis section began with recall and precision, the two most typical measures of search engine performance. From that beginning, richer, more elaborate characterizations of how well the system is performing have been considered. But even having the two measures of recall and precision, it is not a simple matter to decide whether one system is better or worse than another. What are we to think of a system that has good recall but poor precision, relative to another with the opposite feature? For example, if we wish to optimize a search engine with respect to one or more design parameters (e.g., the exact form of the query/document matching function, cf. Section 5.3.1), effective optimization becomes much more difficult in multicriterial cases. Such thinking has generated composite measures based on the basic components of recall and precision. For example, Jardine and van Rijsbergen [Jardine and van Rijsbergen, 1971; van Rijsbergen, 1973] originally proposed the F-measure for this purpose:
van Rijsbergen, p.174 * has since defined the closely related effectiveness of measure E, which uses
The transform As discussed at some length in Section 7.4, it is possible to view retrieval as a type of classification task: Given a set of features for each document (e.g., the keywords it contains), classifiy it as either Rel or
Sliding RatioThe fact that the Retr set is ordered makes it useful to compare two rank orderings directly. If the correct, idealized ranking is known (for example, one corresponding to perfectly decreasing probability of relevance), then an actual search engines hitlist ranking can be compared against this standard. More typically, the rankings of two retrieval systems are compared to one another. Given two rankings, we will prefer the one that ranks relevant documents ahead of irrelevant ones. If our relevance assessments are binary, with each document simply marked as relevant or irrelevant,* the normalized recall measure considered in Section 4.3.6 (or the expected search length measure to be described in Section 4.3.10) is the best we can do in distinguishing the two rankings. But if we assume instead that it is possible to impose a more refined measure Rel(di) than simply Rel/
The ratio of this measure, computed for each of the two systems rankings, is called the sliding ratio score [Pollack, 1968]:
As NRet increases, this ratio comes closer to unity:
and so it is most useful for distinguishing between two rankings when only a small NRet is considered. Point AlienationThe sliding ratio measure provides a more discriminating measure but depends entirely on the availability of metric Rel(di) measures for retrieved documents. As discussed in Section 4.1.1, it is much easier to derive nonmetric assessments directly from RelFbk data given naturally as part of users browsing:
In an effort to exploit the nonmetric preferences often provided by human subjects, Guttman [Guttman, 1978] defined a measure known as point alienation. Bartell has pioneered a variation of it for use with document rankings rated by RelFbk [Bartell et al., 1994a]. The basic idea is deceptively simple: Compare the difference in rank between two differentially preferred documents to the absolute difference of these ranks:
If d is really preferred over d' (d 4.3.9 Test CorporaBy test corpora we refer to collections of documents that have associated with them a series of queries for which relevance assessments are available. One of the earliest such test sets was a collection of 1400 research papers on aerodynamics developed by C. Cleverdon in the mid-1960s, known as the Cranfield corpus [Cleverdon and Mills, 1963]. For most of the 1980s, a set of corpora known as CACM, CISI, INSPEC, MED, and NPL (sometimes referred to as the Cornell corpora) were developed, maintained, and distributed by Gerald Salton and his students at Cornell; it became the de facto standard for testing within the IR community. For some time, the most influential test corpora have been the TREC corpora associated with the Text Retrieval Evaluation Conference meetings [Harman, 1995]. Table 4.3 gives a sample of statistics for a number of the most widely used corpora. One obvious trend is the increasing size of these collections over time. The Reuters corpus classification labels are invaluable for training classifiers (cf. Section 7.4). With our AIT corpus, the OSHMED [Hersh, 1994] is one of the few to provide multiple relevance assessments of the same Figure 4.16 shows a sample query from the TREC experiments. For this query, and hundreds like it, considerable manual effort has gone into assessing whether documents in the TREC corpus should be considered relevant. Note the way the basic query (Line 2) has been embellished with general and specific topical orientation (Lines 1 and 3), important terms and abbreviations have been explicated, etc. This is much more information than most users typically provide, but it also allows much more refined assessment of systems performance. As the testing procedures of the TREC participants have developed over the years, multiple tracks have formed, corresponding to typical search engine usage patterns. The task on which we have focused throughout this section is termed ad hoc retrieval, in the sense that a constant corpus is repeatedly searched with respect to a series of ad hoc queries. This is distinguished from the routing task, which assumes a relatively constant standing set of queries (for example, corresponding to the interests of various employees of the same corporation). Then, an ongoing stream of documents is compared, with relevant documents routed to appropriate recipients. More recently, a special type of routing termed filtering has also been considered. In the filtering task, the standing query is allowed to adapt to the stream of RelFbk generated by the users as they receive and evaluate routed documents (cf. Section 7.3). 4.3.10 Other MeasuresThe performance measures already listed are by far the most common ways in which search engines are evaluated in the literature. Several others, however, have been important in the past and may again prove useful in some situations. Expected Search LengthThe ordered list of relevance assessments described in Section 4.3.5 also recommends another, holistic evaluation of the entire retrievals behavior; this method is known as expected search length (ESL). ESL considers the length of a path as users walk down the ordered hitlist, measuring how many irrelevant documents were seen on this path before each relevant document; expected refers to the average length of each path ending in a relevant document. Cooper initially proposed this model to measure the work a search engine saves, in comparison to searching the entire collection at random [Cooper, 1968]. Given that a search engine retrieves documents in hitlist order, ESL also requires a criterion by which the users wandering paths are stopped. Van Rijsbergen, p. 160 discusses a number of predicates that might be used to terminate the search: some fixed number of relevant documents, some fraction of all relevant documents, etc. Because the generality of queries can vary considerably, it is desirable to terminate the ESL after some fixed fraction E of relevant documents has been retrieved. For this same reason, it makes sense to normalize ESL with respect to the number of relevant documents we might expect to retrieve if we were retrieving at random. If we use NRet for the number of retrieved documents (i.e., those satisfying the predicate mentioned earlier), we can estimate the expected random search length RandSL as:
Then the expected search length reduction factor
captures the amount a real search method improves over the random case. Operating Characteristic CurvesSwets [Swets, 1963] enumerated a number of abstract desiderata (quoted by van Rijsbergen, p. 155) that we might wish for any assessment measure. According to these, IRs standard Re/Pre plot leaves much to be desired, in particular because this two-dimensional assessment makes direct comparison impossible. Swets therefore recommended an analysis from the perspective of signal detection, based on several key assumptions: In this idealized case, we get a picture similar to Figure 4.17. Then, because our corpus has been ordered by the ranking, the goal becomes to select a value Rank Using a simple retrieval rule that retrieves a document if its value is above the threshold Rank The parametric curve defined by the percentage of Rel versus While Swets (and subsequently others [Robertson, 1969; Bookstein and Kraft, 1977]) then considered fairly elaborate tests to discriminate the relative performance of retrieval systems with respect to such curves, it is fair to say that the 1979 assessment of van Rijsbergen, p. 154 still stands:
Optimal selection of Rank 4.4 RAVE: A Relevance Assessment VEhicleSection 4.3.2 argued that the opinions of many users concerning the relevance of a document to a query provides a more robust characterization than any single expert. It seems, however, that our move from omniscient to consensual relevance has only made the problem of evaluation that much more difficult: Test corpora must be large enough to provide robust tests for retrieval methods, and multiple queries are necessary to evaluate the overall performance of an IR system. Getting even a single persons opinion about the relevance of a document to a particular query is hard, and we are now interested in getting many! This section describes RAVE, a relevance assessment vehicle that demonstrates that it is possible to operationally define relevance in the manner we suggest. RAVE is a suite of software routines that allow an IR experimenter to effectively collect large numbers of relevance assessments for an arbitrary document corpus. It has been used with a number of different classes of students to collect the relevance assessments used for evaluation with respect to the AIT corpus; your teacher may have you participate in a similar experiment. It can also be used to collect assessments for other document corpora and query sets. 4.4.1 RAVeUnionIt would be most useful if, for every query, the relevance of every document could be assessed. However, the collection of this many assessments, for a corpus large enough to provide a real retrieval test, quickly becomes much too expensive. But if the evaluation goal is relaxed to being the relative comparison of one retrieval system to one or more alternative systems, assessments can be constrained to only those documents retrieved by one of the systems. We therefore follow the pooling procedure used by many other evaluators, viz., using the proposed retrieval methods themselves as procedures for identifying which documents are worth assessing. The first step in constructing a RAVE experiment is to combine the ranked retrieval lists of the two or more retrieval methods, creating a single list of documents ordered according to how interested we are in having them assessed by a subject. RAVeUnion produces the most straightforward zipper merge of the lists, beginning with the most highly ranked and alternating one.More elaborate ways of merging ranked lists The output of RAVeUnion is a file of (query, document) pairs along with a field that indicates if the pair was uniquely suggested by only one of the methods. This last information can be used to compare the average relevance scores of documents suggested by one method alone to those retrieved by more than one. 4.4.2 RAVePlanA second challenge is to find the desired density or redundancy of sample points. That is, for each document that we believe may be relevant to a query, how many subjects should evaluate it? The answer will vary depending on such factors as the number of participants, their expertise, their motivation to produce quality judgments, how long each will spend rating documents, etc. A higher density means that fewer documents will be evaluated, but also that the intersubject cumulative assessment is likely to be more statistically stable. This can be especially important when RelFbk is to be used to train the system. The trade-off between the most important of these factors is captured in the following formula:
where x = number of documents to be evaluated for each query Note that this formula ignores the overlap between queries that occurs when users see a document that may be relevant to two or more of the queries in their list. Care must be taken, therefore, to minimize expected overlap between the topical areas of the queries. We have also found that the assessment densities constructed using this formula are unfortunately uneven. The main source of these is variability in R, the rate at which subjects are able to produce relevance assessments. RAVePlan takes as input a list of Q query specifications, a list of N subject logins, the desired density S, and the number of documents R*T that should be allocated to each subject. The query specifications indicate which queries can go in which fields and which queries should not be shown together. This allows us to limit possible interactions between queries about similar topics.A better RAVePlan RAVePlan outputs two files. The plan file, which is an input to the RAVE interactive application, lists each subject ID along with the queries and document numbers that have been selected for that subject. The assignments file is a list of document-query pairs, which tells us which query we expect the document to be relevant to. RAVeCompile uses this file after data collection is complete to generate true and false-positive measures. 4.4.3 Interactive RAVEThe interactive portion of RAVE is an HTML document, as shown in Figure 4.19. The top of RAVEs window displays the three queries against which the subject is to judge each document. Two queries are short sentences or phrases, in this case REASONING ABOUT UNCERTAINTY and LEGAL APPLICATIONS, and the third is a scrolling pane containing the text of the long, RelFbk-defined document (in this example, the thesis is A PLANNING MODEL WITH PROBLEM ANALYSIS...). While the subject is asked to judge the documents shown to him or her as being about the two short queries, the task associated with the RelFbk-document query is to find documents like this. Beneath each query the RAVE window contains four radio buttons labeled Not (relevant), Possibly (relevant), Relevant, and Critically(relevant). Because we asked our subjects to spend two hours each but could not assume their participation would be continuous, there is a Quit button that allows the subject to suspend the session; when the subject launches RAVE again, the session is resumed where he or she left off. Finally, the Next button is pressed after the subject has read and recorded his or her relevance assessments for a document. 4.4.4 RAVeCompileRAVeCompile processes the votes recorded by interactive RAVE, creating a relevance assessment file and some statistics about each query and subject. The first step is to map the four-valued relevance assessments into the Boolean relevant/nonrelevant discriminations typically used by standard IR evaluation techniques. RAVeCompile lets the experimenter configure a simple predicate of the following form:
where
In one set of experiments [Belew and Hatton, 1996], the data used two predicates constructed in this fashion. These are:
4.5 SummaryIn this chapter we began by making some assumptions about users of a search engine in order to figure out how well the system is satisfying users information needs. We focused on two separate notions of assessment: first, assessing the relevance of documents retrieved by the system in response to a particular query, and second, assessing the search engines overall utility through aggregating relevance judgments provided by many users performing many queries. Section 4.1 discussed both metric and nonmetric relevance feedback and the difficulties in getting users to provide relevance judgments for documents in the retrieved set. We saw, however, that relevance feedback could be used to either suggest query refinements to the users or to modify the underlying document representations to improve future system performance. The concept of consensual relevance introduced in Section 4.2 addresses an issue raised in Chapter 1, where we asked what success criteria can be used in evaluating a search engine. Consensual relevance tells us that relevant documents are those documents that many users find to be useful. We can ask how useful a particular search engine is, or compare one search engine with another, by posing the question: How useful (relevant) do users find the documents retrieved in response to queries? To answer that question we quantified several measures of system performance. The generality of a query is a measure of what fraction of documents in the corpus is relevant to the query. Fallout measures the fraction of irrelevant documents found in the retrieved set of a given query. The key notions of recall, the fraction of relevant documents in the retrieved set, and precision, the fraction of retrieved documents that are relevant, allow us to make direct comparisons between two search engines performances on any query. Other methods of comparison include sliding ratio, point alienation, expected search length, and operating characteristic curves. | |||||||||||||||||||||||||||||||||||||||||||||
![]() | |||||||||||||||||||||||||||||||||||||||||||||