7 | ![]() | |||||||||||||||||||||||||||||||||||||||||||||||
Adaptive Information Retrieval7.1 BackgroundMost of the techniques described in the last chapter built on representational and inference methods originally developed within AI in the 1970s and 80s. Today, these methods are sometimes called Good Old-Fashioned AI (GOFAI), to distinguish them from more recent advances. There are many ways to characterize this change (see Russell and Norvigs text for an alternative account [Russell and Norvig, 1995, p. 827] and cf. Section 6.9 and Section 7.8), but the most important is: AI is now centrally concerned with learning the representations it uses rather than assuming that some smart knowledge engineer has entered it manually. To be concrete, imagine that you are to act as a librarian with respect to your own email. We have assumed at several points that you are collecting vast amounts of email, but perhaps are only now starting to think how it should be classified for subsequent retrieval. If we hire a librarian, we can reasonably expect him or her to bring certain useful skills to this new job and to continue to learn ways of doing it better. As their boss we must provide regular feedback that points out both good and bad aspects of their work. If this person was having their first annual review and they were no better at finding useful information than the day they were hired, we would have reason for concern. The preceding chapters have surveyed a number of techniques for supporting the FOA task, but their utility is immediately apparent and we do not expect it to improve. This chapter is concerned with adaptive techniques: those that improve their performance over time, in response to feedback they receive on prior performance. We can idealize our goal for the learning system in terms of a person a clever, resourceful, adaptive librarian. Figure 7.1 gives an overview of how machine learning fits into the space of existing IR techniques. The horizontal axis is meant to indicate the amount of manual effort expended to improve the corpus. These activities may include constructing a controlled vocabulary, forming good lexical index terms, including phrases, building thesauri relating the keywords to one another, etc. The vertical axis attempts to capture something like ease of use for FOA. Such ease-of-use metrics are notoriously difficult to quantify, but some indicators may include search time to find a known item. Prior to the widespread application of search engine technologies, brought on by efforts like B. Kahles Wide Area Information System (WAIS) and G. Saltons SMART system, to search text meant to grep across textual fields. Because grep and related search methods rely on regular expressions for queries, and because regular expressions cant be conveniently composed with Boolean operators, early search systems provided only these search techniques. But with the introduction of search engine technologies, the goal became to build an index, much like a librarian might construct for a collection of books or documents. These indices have been at the core of our FOA discussion. Figure 7.1 extends this progression further. While it is rare to have any textual corpus receive manual attention from a librarian or editor, and so there are very few manual indices, a very few corpora have received even more extensive editorial enhancement. The Encyclop This becomes the goal for our machine learning techniques. They will turn out to form a natural extension of the statistical techniques underlying automatic index construction. Peter Turney maintains a useful bibliography of Machine Learning Applied to Information Retrieval references generally, and of Text Classification Resources in particular. Finally, it is always a mistake to view the relationship between algorithmic (artificially intelligent) methods and the natural, human intelligent behaviors they mimic as an opposition. The most constructive systems we can build are ones that leverage editorial capabilities with new computational tools. The editors workbench is a good metaphor for such designs. 7.1.1 Training against Manual IndicesWe will be especially concerned with corpora that have benefited from extensive manual indexing. For example, the articles in the Encyclop First, the manual classification of documents to categories can be used as training data in the context of supervised learning (see Section 7.4). Second, manually constructed representations provide a kind of upper bound on what we can hope our automatic learning techniques can build. Ultimately, however, we can expect that the most successful applications will not oppose manual, editorial enhancement with automatic induction but rather will integrate learning into the editorial process. Machine learning can already do much of the job that has traditionally been done by human editors; and yet, many aspects of the editorial function will remain beyond our learning techniques for the foreseeable future. Harnessing machine learning as part of an editors workbench promises to leverage this scarce resource most effectively. Of course, corpora that have benefited from such careful manual attention are few and far between. Much more typical is the textual corpus without any manual indexing whatsoever. The third advantage, then, of those special corpora that do have attending editorial enhancements is that if our learning techniques can generate analogous structures on these special collections, we can realistically expect the same techniques to generate useful structure on other collections as well. 7.1.2 Alternative Tasks for LearningOne obvious task we might ask of a learning algorithm is that it learn the Rank( ) function. Following the Probability Ranking Principle (Section 5.5.1), we might attempt to learn the probability that a document is relevant, given a set of features it contains. Recall from Section 4.2.1 how RelFbk information can be used to modify a query vector so as to move it closer to those documents a user has marked as relevant. As discussed in Section 4.2.2., The same technique can be used to move documents toward the query! The radically different consequence of this change is that while query modification is only useful once to the user benefiting from it, document modification changes the documents representation for all subsequent users queries. We will also be concerned with another task. Rather than assigning a single real number to each document (Pr(Relevance), for example), we shall attempt (using the same set of document features) to classify a document into one of a small number of potential categories. Most simply, we may be interested in binary classification of documents into one of two categories. An obvious use of binary classification would be to classify into Relevant and Nonrelevant classes. More complex classifications into one of C different classes are also possible. For example, imagine that you would like to have your email client automatically sort your incoming email into various mail folders you have used historically. Having decided how many classes to use, we must also determine whether our assignment to these classes is binary or if we should provide the probability that it belongs to a specific class. Classification techniques are discussed in Section 7.4. 7.1.3 Sources of FeedbackTwo distinct classes of machine learning techniques can be applied to the FOA problem. These can be distinguished on the basis of the type of training feedback given to the learning system. The most powerful and best-understood are supervised learning methods, where every training instance given to the learning system comes with an explicit label as to what the learning system should do. Using an email example, suppose we want talk announcements to consistently go into one folder, mail from our family to go in another, and spam to be deleted. In terms of supervised learning, this regime requires that we first provide a training set (cf. Section 7.4). In our case, the training set would consist of email messages and the C mail categories we have used to classify them in the past. After training on this data set, we hope that our classifier generalizes to new, previously unseen messages and classifies them correctly as well. A second class of machine learning techniques makes weaker assumptions concerning the availability of training feedback. Reinforcement learning assumes only that a positive/negative signal tells the learning system when it is doing a good/bad job. In the FOA process, for example, RelFbk generates a reinforcement signal (saying whether it was a good or bad thing that a document was retrieved). Note that RelFbk does not count as supervised learning; in general, we do not know all of the documents that should have been retrieved with respect to a particular query. Supervised training provides more information in the sense that every aspect of the learners action (retrieval) can be contrasted with corresponding features of the correct action. Reinforcement information, on the other hand, aggregates all of these features into a single measure of performance. The difference between these two kinds of learning is especially stark in the FOA context. To provide reinforcement information, users need only react to each document and say whether they are happy or sad it was retrieved. In order to do supervised training, the user would need to identify the perfect retrieval, requiring the user to evaluate each and every document in the corpus! The distinction between the supervised retrieval and that shaped by RelFbk highlights the need to be explicit about which kinds of feedback are hard for the user and which are easier. The discussion of RAVE made some of our assumptions concerning cognitive overhead clear (Section 4.4), but this is another important area for further study. Here we will continue to assume that RelFbk is easy to acquire. Learning from the Entire SessionNote that each iteration of the FOA conversation is but a link in the larger dialog leading from the users initial information need to the end of their search; cf. Figure 7.2. First, the continuity of the same search pursuing the same information need from iteration of the FOA dialog to the next iteration helps to constrain interpretations of the users RelFbk; we know which documents they have seen previously, we may know something about how a document retrieved in iteration i + 1 is related to one from iteration i and why it was retrieved, and so on. But perhaps the most important reason to model multiple queries as part of a single session is that it is the total, aggregate difference between their cognitive states at the beginning of the session and at the end of the session that should most concern us. Users begin the search with their best characterization of what it is they want to FOA, and they leave for one of a number of potential reasons:
Our interpretation of what the system could or should learn from this termination will be radically different, and so determining which of these reasons pertains becomes particularly important to establish. Distributions over Learning InstancesAnyone familiar with basic statistics will appreciate how sensitive our estimate of an average is on the set of examples we happen to select for our sample. Learning methods are similarly sensitive to the amount and distribution of training data. For example, the ratio of the number of features to the number of instances can affect learning performance significantly [Moulinier et al., 1996; Dagan et al., 1997]. Define a positive training instance It is less obvious that the ratio of positive to negative training instances is also important. In binary classification tasks we can expect an even number of positive and negative training instances. But when classifying into a larger number of categories, it is common to treat each training instance as a positive instance of one class and a negative instance of all the others. Continuing our email classification example, we can expect that every classified message in our training set corresponds to a positive instance of one classification and simultaneously as a negative example of all the others. This makes efficient use of the training data, but also generates a 7.2 Building Hypotheses about DocumentsWe will talk about competing hypotheses, for example, rules that successfully separate our spam email from our familys email. If only very simple hypotheses are to be considered, a relatively small amount of data can be used to select between them. For example, if our hypothesis is that spam email always contains the phrase $$$$ BIG MONEY $$$$$, a small amount of training data is sufficient to confirm or disconfirm this rule [Sahami et al., 1998]. But if we wish to consider elaborate discrimination rules, for example, including many keywords and/or date information, it takes much more data to tease apart all the alternatives. The volume of training data available, then, provides a very real constraint on how complex the hypotheses we can consider can be and how statistically reliable we expect rules to perform on unseen test data. 7.2.1 Feature SelectionWhen you are thinking about how you classifiy your email, keywords contained in your email are almost certainly some of the features you think of first. Recall, however, that the keyword vocabulary can be very large. Using this feature space, then, individual document representations will be very sparse. In terms of the vector space model of Section 3.4, many of the vector elements will be zero. To use Littlestones lovely expression, Irrelevant attributes abound [Littlestone, 1988], and so it should come as no surprise that his learning techniques are especially appropriate in the FOA learning applications discussed in Section 7.5.3. Efforts to control the keyword vocabulary and make the lexical features as meaningful as possible are therefore important preconditions for good classification performance. For example, name-tagging techniques (cf. Section 6.6.1) that reliably identify proper names can provide valuable classification features. A proper name tagger would be one that was especially sophisticated about capitalization, name order, and abbreviation conventions. When both peoples proper names and institutional names (government agencies, universities, corporations, etc.) are capitalized, the recognition of complex, multitoken phrases becomes possible. In part because of the difficult issues lexical, keyword-based representations entail, it is worth thinking briefly about some of the alternatives. There are also less-obvious features we might use to classify documents. Meta-data associated with the document, for example, information about its date and place of publication, is one possibility. Geographic place information associated with a document can also be useful; cf. Section 6.6.1. Finally, recall the bibliographic citations that many documents contain (cf. Section 6.1). The set of references one document makes to others (representable as links in a graph) can be used as the basis of classification in much the same way as its keywords. In summary, while keywords provide the most obvious set of features on which classifications can be based, these result in very large and sparse learning problems. Other features are also available and may be more useful. It is important to note, however, that careful Bayesian reasoning about dependencies among keyword features is a very difficult problem, as discussed in Section 5.5.7. Attempting to extend this inference to include other heterogeneous types of features must be done carefully. Distribution-Based SelectionBecause our typical assumption has been that keywords occur independently, it should come as no surprise that when we try to reduce from the full set of all keywords in the Vocab to a smaller set, a good way to decide which features are most useful is to pick those that are most independent of any others. That is, we can hope that two keywords that are statistically dependent can be merged into a single one. As mentioned in Section 3.3.6, entropy captures the amount of randomness in (or uncertainty about) some random variable:
When the distribution of a random variable X is conditionally dependent on that of a second random variable Y, the conditional entropy of X on Y (a.k.a. post-Y entropy of X) can be similarly defined [Papoulis, 1991, pp. 54954]:
If knowledge of values of Y reduces our uncertainty about the distribution of X, it is natural to think that Y informs us about X. Mutual information I captures this notion:
This information is mutual in the sense that it is a symmetric relation between the two variables; Y tells us as much about X as X does about Y: I (X, Y) = I (Y,X). If the mutual information I(ki, kj) between all pairs of keywords in K is known, van Rijsbergen, p. 123 recommends selecting the maximum spanning tree, which maximizes the total information across all edges of the tree:
Note, however, that the mutual information statistic has an intrinsic bias toward keyword pairs ki, kj in which the individual keyword frequencies fi and fj are intermediate. The post-Y entropy of X can only reduce it [Papoulis, 1991]:
In terms of keyword frequencies, then, the mutual information of a pair of keywords is limited by their marginal entropies. This implies that very rare and very common keywords are penalized with respect to the mutual information measure. For this reason, it is worthwhile considering the relation between mutual information as a measure of keyword interdependencies and the eigenstructure analysis (cf. Section 5.2.3) of the keyword cross-correlation matrix J. Mutual information considers the full joint probability between the two keywords, while methods like singular value decomposition (SVD) and principal component analysis (PCA) consider only the cross-correlation matrix. When random variables happen to fit a normal distribution, these correlation statistics are sufficient, but that is unlikely in the case of our keywords. A second important difference is that correlation-based methods construct new feature variables, out of linear combinations of the initial keyword tokens. Mutual informationbased methods (or at least van Rijsbergen, p. 123s MST-based construction) select the best variables from a constant set. Selection Based on Fit to a Classification TaskRather than using distributional statistics among the keywords themselves as the basis for feature selection, it is also possible to look for those features that are most fit with respect to some classification task [Lewis and Hayes, 1994] (cf. Section 7.4). Both mutual information and correlation statistics can be used in either supervised or unsupervised learning situations, considering either the mutual information I(k,c) of each keyword k with respect to the class c in the former case or Fishers linear discriminant [Duda and Hart, 1973, p. 114] in the latter. Using the classification performance criterion, Yang and Pederson considered a wide range of potential measures and found that simply measuring fk, the document frequency of keyword k, provided an effective measure over potential keyword features [Yang and Pedersen, 1997]. In fact, using document frequency as a criterion, they were able to remove 98 percent of all keywords while retaining (even improving slightly!?) classification performance. Given the efficient way in which fk can be collected (relative to MI, 7.2.2 Hypothesis SpacesHowever they are selected, the features discussed in the previous section must now be composed into hypotheses concerning how we might describe documents. There is an extraordinary range of alternatives represented here. Some of these are shown in Figure 7.3. Decision trees are formed by asking a question about individual features and using the answers to these questions to navigate through a series of tests until documents are ultimately classified at the leaves. Weighted, linear combinations of the features can also be formed. Neural networks are best viewed as nonlinear compositions of weighted features [Crestani, 1993; Crestani, 1994; Gallant, 1991; Kwok, 1995; Wong et al., 1993]. Boolean formulas can be formed from sentences using simple conjunctive or disjunctive combinations. Our focus here will be on Bayesian networks, which attempt to represent probabilistic relationships among the features. In any of these cases, machine learning techniques must be sensitive to their inductive bias. That is, given a fixed amount of data, we must have some a priori preference for some kinds of hypotheses over others. For example, decision tree learning algorithms [Quinlan, 1993] prefer small trees and neural networks prefer smooth mappings [Mitchell, 1997]. A common feature of all of these learning algorithms is a general preference for parsimony, or simplicity. This preference is typically attributed first to William of Occam (ca. 1330). Occams Razor has been used since then to cleave simpler hypotheses from more complex ones. Another motivation for the parsimony bias has been realized more recently within machine learning: Simple hypotheses are more likely to accurately go beyond the data used to train them to classify other unseen data. That is, very complicated hypotheses have a tendency to over-fit to the training data given a learning algorithm, but the good fit they can accomplish on this set is not matched when the same classification is done on new data. The issues involved in evaluating a classifiers performance are an important topic within machine learning [Mitchell, 1997]. 7.3 Learning Which Documents to RouteArguably, the first use of RelFbk information in an adaptive IR context came out of the SMART group led by Gerald Salton. As was discussed in Section 4.2.2, Saltons vector space model lends itself to a representation of documents and queries that suggest ways of producing better matches, as shown in Figure 7.4. Rocchios characterization of the task is best described as routing: We imagine that a stream of all the worlds news is available to some institution (for example, a corporation), and many members of this institution (employees) are expected to build long-standing queries characterizing their interests. These queries are passed against the incoming stream of documents, and those matching a particular employees query are routed to him or her [Schutze et al., 1995a]. From this employees point of view, he or she will be receiving a stream of documents, all of which are of interest to him or her. If he or she identifies those documents that are not relevant, the resulting corpus can be viewed as a training set and used to adjust the filter. This is a concrete application of the binary classification technology. Brauen, in particular, considered document vector modifications, resulting in dynamic document spaces [Brauen, 1969]. Documents marked as relevant can be moved closer to the query that successfully retrieved them in several ways. First and most obviously, features shared by query and document can have their weight increased. Features in the document but not in the query can have their weight decreased, and terms in the query but not in the document can be added to the documents representation with small initial weights. From the perspective of modern machine learning, Brauens method would be considered a type of gradient descent learning algorithm: The disparity between document and query creates a gradient along which small changes are made. One difference, however, is that Brauen imagined a batch learning scenario, with the entire set of labeled documents available simultaneously. Online learning algorithms, on the other hand, make small, incremental adaptive changes in immediate response to every learning instance (document). Each individual step of learning (e.g., weight update in neural networks) must be very small, in order to guarantee convergence toward the globally optimal value. The size of the small change is controlled by Idealizing our learning task to produce a perfect match (i.e., the dot product of query and document is 1.0) on relevant documents and no match on irrelevant documents, we can treat this behavior as the target for our error correction learning. Let Rd stand for this Boolean relevant/not relevant classification of each document. Then (as discussed further in Section 7.4), we can hope to have available a training set T of documents that have been labeled a priori as Rd = 0 or Rd = 1, specifying whether they are considered relevant. With such evidence as to what documents we do and don't consider relevant, we can define precisely how well we have learned. A typical error measure or loss function can be defined as the squared difference between the actual vector match and the target RdThe q/d gradient runs in both directions:
7.3.1 Widrow-HoffThe Widrow-Hoff (a.k.a. least mean squared (LMS)) algorithm is the best-understood and principled approach to training a linear system to minimize this squared error loss [Widrow and Hoff, 1960]. It does this by making a small move (scaled by the parameter
It is also important to remember that changes made to a single document in response to a single query can make no guarantees about improved performance with respect to other documents and other queries. For example, two documents might both be moved closer to a query (as proposed by Brauen and Rocchio) but their relative rankings may not change at all! 7.3.2 User Drift and Event TrackingOne interesting feature of the training set generated by the routing task is the odd distribution of positive and negative examples it generates. Initially, we can imagine that this filter is very inaccurate; i.e., we are likely to see many negative examples. Later, when we hope it is better trained, the filter has nearly perfect performance and the system gets very few negative examples. Further, no users interests remain static. As discussed in the next chapter, one common purpose for the FOA activity is to become educated (cf. Section 8.3.4), and this is an elusive, ever-changing goal. The world changes, and what users read changes their opinions of what needs to be done and what the new questions are. In brief, documents they used to find relevant become irrelevant. This has been called concept drift [Klinkenberg and Renz, 1998]. When the world changes, this corresponds to documents, and the news they contain changes too. This side of the dynamic is called topic tracking [Allan et al., 1998; Baker et al., 1999]. Jaime Carbonells (of CMU) approach is to first identify that a concept change has occurred and then to adjust a time window on the stream of incoming training data over which a new invariant is then identified (personal communication). The distribution of RelFbk generated by the filtering task, where a standing query is allowed to adapt to a stream of RelFbk generated by users who receive and evaluate routed documents (cf. Section 4.3.9), provides an especially interesting form of learning task, because of its temporal dimension. Initially, the set of documents routed to users must depend on the same fundamental matching function shared by other search engine tasks. But as RelFbk, in response to the first retrievals, comes to affect the users characterizations of interest, only a skewed sample (relative to the initial distribution) of potential documents is shown to the users, and only these can be the basis of subsequent RelFbk. This tension between exploration of the universe of potentially relevant documents and exploitation of those documents that prior RelFbk makes it seem are most likely to be perceived as relevant by the users is familiar to other reinforcement learning situations [Sutton and Barto, 1998]. 7.4 ClassificationThe classification task is...classic:), and a wide range of technologies for accomplishing it have been developed [Duda and Hart, 1973; Carlin and Louis, 1996]. Even in the relatively recent context of text-based domains, many techniques have been applied [Lewis and Hayes, 1994]. Here we will focus primarily on extending the probabilistic approach of Section 5.5, but well consider other alternatives in Section 7.5. Andrew McCallum and colleagues have developed a software suite called RAINBOW that is very useful for experiments into text classification. We begin by assuming that we have been given a set of classes C = {c1, c2, ... cC} and have been asked to produce a function that classifies a new document into one of these classes. McCallum gives a good overview of the Bayesian approach applied to text:
The major features of the training regime are shown in Figure 7.5. During the first phase, a correspondence has been established manually between each example document di and some classification ci in the training set T:
(In Figure 7.5, the classifications are imagined to be part of a hierarchical classification system, as will be discussed in detail in Section 7.5.5.) These data are used somehow (for now its okay to think of it as magic) to tune the set of parameters We seek the posterior probability of a particular class, given the evidence provided by a new document, Pr(c | d). The second step is then, given these probabilities for all classes, to return the one that maximizes this likelihood. Bayes rule is typically invoked in such situations to invert this conditional dependency:
The likelihood Pr(d | c) can be more easily estimated if we assume classes are represented as mixtures of the documents features. We can think of hypothetical documents (that we imagine belong to the class) as generated by some model, the parameters
These measures allow us to precisely state how well a learned model captures regularities found in the training set: The model is doing a good job if it applies the same classifications as observed in the training data. But this criterion also highlights the dependence of any learning method on the training data T used to construct it, which cannot be overemphasized. Independent of the range of hypotheses considered, and of the learning methods used to build the best possible model, our ability to inductively generalize from the training data to new examples (e.g., unclassified documents) depends entirely on how typical the training data are. Within computational learning theory this dependence is captured by the assumption that training data and subsequent trials are drawn from the same distribution [Valiant, 1984; Kearns and Vazirani, 1994]. In practice, performance of a classifier on the training set is bound to be an optimistic overestimate of how well it can generalize to previously unseen data. The most common way to guard against such overestimates is to artificially hold out some of the training set as a separate test set. Training proceeds as before on all the training data but not on the test set, and then the system is evaluated on the test set, which provides a more reasonable estimate as to how the system will fare. More sophisticated cross-validation procedures begin by partitioning the available training data into k subsets. Iteratively, each partition is used as the holdout set while the remaining 7.4.1 Modeling DocumentsThe general framework of empirical Bayesian estimation is broad and powerful enough that it has been applied in many contexts. The hard work comes, however, in specifying just how the parametric model One critical, simplifying assumption shared by both models is that the features occur independently in the documents. As we have discussed a number of times, any such naïve Bayesian model will miss a great many of the interactions arising among real words in real documents. It is somewhat curious, then, that such naïve classifiers do as well as they do [Domingos and Pazzani, 1997]. Multivariate BernoulliArguably the simplest model captures only the presence or absence of words in the document. That is, the document is modeled as the composition of k keywords drawn from the Vocab as so many independent Bernoulli trials. We imagine that a document d is constructed by repeatedly selecting |d| words for each position in the document.Ordered sequence If we associate a biased coin with each keyword k, we can decompose the desired model
i.e, the prior probability of each class c and the probability that a keyword is present given that we know a document containing it is in class c. Then the naïive Bayesian assumption allows us to assume that the keywords occur at each positional location independently of one another:
MultinomialAn alternative model of document generation is as repeated draws from an urn of words containing all the keywords in the Vocab. This gives rise to a multinomial model that is sensitive to the frequency fdk of keywords occurrence in d, rather than just their presence or absence:*
7.4.2 Training a ClassifierThe parameters
we seek the parameters One piece of this is easy to estimate: The prior probability of the class Pr(c) is how frequently one classification is observed in T relative to the others. Using a twiddle hat to distinguish estimates
Estimating Within the Bayesian framework, Probabilists religious wars these statistical sensitivities are addressed by providing priors for the underlying word-events of document generation. 7.4.3 PriorsWhen we are attempting to estimate Recall that the multivariate Bernoulli model associates a single biased coin
Note that in this statistic the nonoccurrence of keywords not present in the document also affects our estimate. This somewhat odd Mark of Zero [Lewis et al., 1996] should make us feel less sanguine about what is captured by any Bernoulli model. The multinomial alternatives (assuming the same priors as before) for these estimates are:
Empirically, the multinomial model seems to support better classification performance, especially when larger vocabulary sizes are considered [McCallum and Nigam, 1998]. 7.5 Other Approaches to Classification7.5.1 Nearest-Neighbor MatchingOne of the most straightforward ways to classify documents is to make a rote memory of the training set T and retrieve those documents from T that are most similar to a new document to be classified [Cost and Salzberg, 1993; Larkey and Croft, 1996; Yang and Pedersen, 1997; Larkey, 1998a]. This corresponds to using the |d · d| similarity metric discussed in Section 5.2. A weighted sum of the k most similar documents classifications can be used to pick the best match. 7.5.2 Boolean PredicatesA different approach to the text classification task has been proposed by Cohen in his RIPPER system [Cohen, 1996a; Cohen, 1996b]. The space of hypothesis considered by RIPPER is a set of Boolean rules composed over a space of keywords. A simple example is shown in Figure 7.6, which extends the obvious definition of IRELAND to include documents that mention violent IRA activities. Like decision lists, RIPPERs rule sets are easier for human experts to interpret than a large system of Bayesian probabilities. RIPPER is an example of a covering learning algorithm; cf. Figure 7.7. This means that it iteratively forms conjunctions of Boolean predicates that cover some of the positive instances of a Boolean classification while excluding all the negative instances. In the next iteration, the positive instances, which were covered previously, are removed from the training set, and a new conjunctive clause is formed, which again covers some more positive instances while excluding all negative ones. Ultimately, then, the rule set will be in disjunctive normal form (DNF). RIPPER also includes optimizations to simplify rules by removing conditions that do not affect performance and by picking conditions that provide the most information gain [Quinlan, 1993]. Finally, Cohen adapted these rule learning techniques to the text domain by adding set-valued attributes. These special attributes collapse a documents representation to be simply the set of words it contains. RIPPERs rules can then include tests for sets of words, rather than having to test the presence or absence of each word individually. 7.5.3 When Irrelevant Attributes AboundIn documents where irrelevant attributes abound [Littlestone, 1988] (e.g., when any one document contains a small fraction of the full vocabulary but still more than are important in a classifier), the rapid winnowing of features is critical. One approach is to use boosting methods, which begin with many very weak hypotheses but focus in on the most successful [Schapire et al., 1998]. Kivinen and Warmuths exponentiated gradient EGs algorithm extends from Winnows binary features and output to real-valued quantities [Warmuth, 1997]. The result shares much with the Widrow-Hoff model (cf. Section 7.3.1), but rather than having weight changes making an additive change in the direction of the gradient, EG recommends making a multiplicative change to each element of the document vector. Using Rd as in Equation 7.6:
where d Another very recent approach is to apply Vapniks support vector machines (SVM) [Vapnik, 1995]. Rather than searching for dichotomizing planes within a representational space that has been predefined (e.g., the hyperplanes that gradient descent methods adjust), SVMs search in the dual space defined by the set of training instances for kernels (representations) wherein the classes can be conveniently separated! As Joachims [Joachims, 1998] recently emphasized, the way in which these techniques avoid searching the vast space of potential keyword features seems to make SVM a very appropriate technology for this application. 7.5.4 Combining ClassifiersThe preceding chapters have described a wide range of potentially useful retrieval techniques. A very reasonable response is to ask if perhaps the best possible retrieval system isn't some sort of mixture of these various techniques. For example, Bartell [Bartell et al., 1994a] has considered simple linear combinations of experts like those shown in Figure 7.8. His experiments considered two experts, one that used a set of simple words as features and a second that did more elaborate phrase extraction. Holding the sum of the two experts contributions constant, their relative contribution describes a circle of fixed radius. Figure 7.9 shows a set of 228 test queries and the optimal weighting of phrase and term experts for each. That is, for a particular query, the optimal contribution of ranking information from the phrase and term expert is determined. Also shown is a line corresponding to the optimal balance between phrase and term expert across all queries. In general, the phrase expert was not very useful. On some queries, however, it was able to improve performance significantly. The way in which individual queries make special demands of the retrieval system is perhaps the most striking feature of these results. As search engine technologies have developed, the composition of hybrid systems involving multiple systems has required a more black box composition. That is, rather than manipulating a single feature of the retrieval system (e.g., term versus phrase features), the combination has been of a systems net ranking [Thompson, 1990a; Thompson, 1990b]. Collection fusion refers to the problem of combining results coming from disjoint corpora [Towell et al., 1995; Yager and Rybalov, 1998]. (In the emerging environment of combined corpora and primarily publisher-driven search engines, corpora have become confounded with the search engines allowed to search them.) Diamond (personal communication) has hypothesized several effects we might imagine from fusing multiple search engines:
Note how the first of these focuses on differences in the systems-based treatment of documents and the second on queries (cf. Section 3.3.3).Dark Horse effect, too In some ways, this combination of classifiers is reminiscent of earlier work using multiple query representations [Belkin et al., 1993]. Vogt has recently performed an exhaustive analysis of linear combinations for all 61 search engines submitted to the TREC5 competition [Vogt and Cottrell, 1998]. His primary conclusion, following Lee [Lee, 1997], is that two systems are best combined linearly when there is a great deal of overlap in the set of relevant documents they identify, while their retrieval of nonrelevant documents is nearly disjoint. In terms of Diamonds qualitative expectations, linear combinations are best able to support the chorus effect. Schutze et al. and Larkey have considered combining various types of special-function classifiers [Schutze et al., 1995b; Larkey and Croft, 1996]. 7.5.5 Hierarchic ClassificationIn many areas of careful scholarship, classification labels are not merely members of a big set, but rather they are organized hierarchically into systems of BT/NT hypernymy (cf. Section 6.3). For example, the U.S. Patent Office has 400 top-level classifications with 135,000 subclasses [Larkey, 1998b]. These classes are part of a hierarchic tree going down 15 levels. A simple example suggested by Mitchell and others use of the UseNet newsgroup hierarchy [Mitchell, 1997, McCallum et al., 1998] is shown in Figure 7.10. Here the probablility of the keyword BELIEFS is conditionalized with respect to a series of increasingly specific newsgroups. Let ch be a hierarchic classification, meaning that it is part of a taxonomy rooted at c0 and connected via a path of ancestor classifications
This notation is meant to capture the relationship shown in Figure 7.11. McCallum et al. creatively applied the statistical technique known as shrinkage to the problem of text classification [McCallum et al., 1998]. Parameter estimates of child classes, which will have very few data instances, can be shrunk toward the data-rich ancestors, and the contributions of each ancestors classification are then linearly combined:
7.6 Information-Seeking Agents7.6.1 Exploiting Linkage for Context*All samples of language, including the documents indexed by Web search engines, depend heavily on shared context for comprehension. A documents author makes assumptions, often tacit, about the intended audience, and when this document appears in a traditional medium (conference proceedings, academic journal, etc.), it is likely that typical readers will understand it as intended. But one of the many things the Web changes is the huge new audience it brings to documents, many of whom will not share the authors intended context. But because most search engines attempt to index indiscriminately across the entire WWW, the global word frequency statistics they collect can only reflect gross averages. The utility of an index term, as a discriminator of relevant from irrelevant items, can become a muddy average of its application across multiple distinct subcorpora within which these words have more focused meaning [Steier and Belew, 1994a; Steier and Belew, 1994b]. Hypertext information environments like the Web contain additional structure information [Chakrabarti et al., 1998a].Querying for link topology This linkage information is typically exploited by browsing users. But linkage topology the spatial structure imposed over documents by their hypertext links to one another can be used to generate a concrete notion of context within which each document is understood: Two documents and the words they contain are imagined to be in the same context if they are close together in this space. Even in unstructured portions of the Web, authors tend to cluster documents about related topics by letting them point to each other via links. Such linkage topology is useful inasmuch as browsers have a better-than-random expectation that following links can provide them with guidance. If this were not the case, browsing would be a waste of time. This suggests that agents (infobots, spiders, etc.) that navigate over such structural links might be able to discover this context. For example, agents browsing through pages about ROCK CLIMBING and ROCK N ROLL should attribute different weights to the word ROCK, depending on whether the query they are trying to satisfy is about music or sports. Where an agent is situated in an environment (neighborhood of highly interlinked documents) provides it with the local context within which to analyze word meanings a structured, situated approach to polisemy. The words that surround links in a document provide an agent with valuable information to evaluate links and thus guide its path decisions a statistical approach to action selection. The idea of decentralizing the index-building process is not new. Dividing the task into localized indexing, performed by a set of gatherers, and centralized searching, performed by a set of brokers, has been suggested since the early days of the Web by the Harvest project [Bowman et al., 1994]. WebWatcher [Armstrong et al., 1995] and Letizia [Lieberman, 1997] are agents that learn to mimic the user by looking over his or her shoulder while browsing. Then they perform look-ahead searches and make real-time suggestions for pages that might interest the user. Fab [Balabanovic, 1997] and Amalthaea [Moukas and Zacharia, 1997] are multiagent adaptive filtering systems inspired by genetic algorithms, artificial life, and market models. Term weighting and relevance feedback are used to adapt a matching between a set of discovery agents (typically search engine parasites) and a set of user profiles (corresponding to single- or multiple-user interests). Here we focus on InfoSpiders, a multiagent system developed by Fillipo Menczer [Menczer et al., 1995; Menczer, 1997; Menczer and Belew, 1998; Menczer, 1998; Menczer and Belew, 2000]. In InfoSpiders an evolving population of many agents is maintained, with each agent browsing from document to document online, making autonomous decisions about which links to follow and adjusting its strategy. Populationwide dynamics bias the search toward more promising areas and control the total amount of computing resources devoted to the search activity. Basic features of the algorithm are discussed, and then an example of how these agents perform as searchers through a hypertext version of the Encyclop 7.6.2 The InfoSpiders AlgorithmInfoSpiders searches online for information relevant to the user by making autonomous decisions about what links to follow. Algorithm 7.1 is the InfoSpiders implementation of the local selection algorithm. A central part of the system is the use of optional relevance feedback. The user may assess the relevance of (some of) the documents visited by InfoSpiders up to a certain point. Such relevance assessments take place asynchronously with respect to the online search and alter the subsequent behaviors of agents online by changing the energy landscape of the environment. The process is akin to the replenishment of environmental resources; the user interacts with the environment to bias the search process. Let us first overview the algorithm at a high level; representation-dependent details will be given in the following subsections and experimental parameter values in the next section. The user initially provides a list of keywords and a list of starting points, in the form of a bookmark file.* The search is initialized by prefetching the starting documents. Each agent is positioned at one of these documents and given a random behavior (depending on the representation of agents) and an initial reservoir of energy.
Each agent senses its local neighborhood by analyzing the text of the document where it is currently situated. This way, the relevance of all neighboring documents those pointed to by the hyperlinks in the current document is estimated. Based on these link relevance estimates, the agent moves by choosing and following one of the links from the current document. Next, the agents energy is updated. Energy is needed to survive and move, i.e., to continue to visit documents on behalf of the user. Agents are rewarded with energy if the visited documents appear to be relevant. The e() function is used by an agent to evaluate the relevance of documents. If a document has previously been visited and assessed by the user, the users assessment is used; if the document has not been visited before, its relevance must be estimated. This mechanism is implemented via a cache, which also speeds up the process by minimizing duplicate transfers of documents. While in the current, client-based implementation of InfoSpiders this poses no problem; caching is a form of communication and thus is a bottleneck for the performance of distributed agents. In a distributed implementation, we imagine that agents would have local caches. When using the current implementation to simulate the performance of distributed InfoSpiders, we will simply set the cache size to zero. Agents are charged energy costs for the network load incurred by transferring documents. The cost function c() should depend on used resources, for example, transfer latency or document size. For simplicity we will assume a constant cost for accessing any new document and a (possibly smaller) constant cost for accessing the cache; this way stationary behaviors, such as going back and forth between a pair of documents, are discouraged. Just as for graph search, instantaneous changes of energy are used as reward and penalty signals. This way agents adapt during their lifetime by Q-learning. This adaptive process allows an agent to modify its behavior based on prior experience by learning to predict which are the best links to follow. Depending on its current energy level, an agent may be killed or selected for reproduction. In the latter case, offspring are recombined through the use of local crossover, whereby an agent can only recombine with agents residing on the same document, if there are any. Offspring are also mutated, providing the variation necessary for adapting agents by way of evolution. Finally, the user may provide the system with relevance feedback. It is important to stress that this process is entirely optional InfoSpiders can search in a completely unsupervised fashion once it is given a query and a set of starting points. Relevance feedback takes place without direct online interactions between user and agents. The user may assess any visited document D with feedback
The word feedback list is maintained to keep a global profile of which words are relevant to the user. The ultimate output of the algorithm is a flux of links to documents ranked according to some relevance estimate modulo relevance assessments by the user. The algorithm terminates when the population becomes extinct for lack of relevant information resources or when it is terminated by the user. 7.6.3 Adapting to Spatial ContextIn order to test InfoSpiders, agents were allowed to breed in a controlled test environment previously used by Steier [Steier and Belew, 1994a; Steier, 1994]: a portion of the Encyclop Now consider two agents, A and B, born at the same time and attempting to satisfy the same query, but in different places within this hypergraph of EB document pages. As Table 7.1 shows, the original query words were displaced from their top positions and replaced by new terms. For example, PRIVAT and ALLEVI had relatively low weights, while FOUNDAT appeared to have the highest correlation with relevance feedback at this time. As and Bs keyword vectors are shown in Table 7.2. In the course of the evolution leading to A and B through their ancestors, some query terms were lost from both genotypes. A was a third-generation agent; its parent lost ALLEVI through a mutation in favor of HULL. At As birth, PRIVAT was mutated into TH. B was a second-generation agent; at its birth, both ALLEVI and PRIVAT were replaced by HULL and ADDAM, respectively, via mutation and crossover. These keyword vectors demonstrate how environmental features correlated with relevance were internalized into the agents behaviors. The difference between A and B can be attributed to their evolutionary adaptation to spatially local context. A and B were born at documents DA and DB, respectively, whose word frequency distributions are partly shown in Table 7.3. TH represented well the place where A was born, being the second most frequent term there; and ADDAM represented well the place where B was born, being the third most frequent term there. By internalizing these words, the two situated agents are better suited to their respective spatial contexts. 7.7 Other Learning Applications and IssuesThis chapter has focused primarily on applications of classification techniques that are among the most successful with machine learning. There are, however, a wide range of other ways that adaptive mechanisms have been incorporated within the FOA context [Bono, 1972]. For example, Gordon has applied genetic algorithm optimization to the construction of effective queries [Gordon, 1988]. The success of social collaborative filtering, where adaptation in response to the activities of one user is extrapolated to other users [Shardaanand and Miaes, 1995], promises to become an important part of the WWW industry (as recently demonstrated by Microsofts purchase of Firefly, for example). Learning to extract structured, database-like relations from the vast WWW of unstructured documents is another important new direction [Howe and Dreilinger, 1997, Craven et al., 1998]. Here we mention several other ways that adaptation mechanisms can be exploited and some new problems arising from this dynamism. 7.7.1 Adaptive LensesA discussion of the routing or filtering task should make it clear that learning technology can be placed at many levels within the FOA system. A personal classification tool can be very useful to a single individual organizing his or her own email. These categories need not make any sense to or be consistent with those used by anyone else. But when multiple users all browse through shared corpora, it becomes possible for one persons browsing experience to benefit anothers. Of course, interuser consistency in RelFbk will help to determine just how statistically correlated these training signals are. But if users are clustered as part of socially cohesive groups (for example, a research lab full of students and faculty pursuing the same research questions) and they are searching documents of shared interests (for example, reprints they have all collected on topics of mutual interest, as part of a journal club perhaps), it is not unreasonable to believe that their assessments will be very similar indeed. Figure 7.12 shows a single individual with a classification system on his or her own machine. But their searches often go through a second classifier or adaptive lens that they share with other members of a computer science group. The figure also shows two different social groups of users, for example, computer scientists and cognitive scientists, each learning different connotations for the phrase NEURAL NETWORK, perhaps computationally and physiologically skewed, respectively. Finally, these smaller groups are merged into progressively larger groups of users over which RelFbk assessments are pooled. 7.7.2 Adapting to Fluid Language UseThe advantages of my benefiting from your experience training a system are obvious. Just as obvious, though, are ways in which you and I might differ in our notions of relevant documents. Even this comparison is only possible if we each evaluate exactly the same query, and how often will that happen?! If we are to change our representations of documents based on users opinions, should we value expert opinions over those of novices? Common sense would suggest that if we could ascertain that one user was indeed expert in a particular topical area, then their assessments of relevance should perhaps carry more weight. But based at least on the experience of lawyers searching case law [Cohen, 1991; Sprowl, 1976] experts are less likely to search in their own area of expertise, probably because they already know what they will find there. The typical user searches in areas they know less about. In that respect, perhaps the novices RelFbk is, in fact, a better characterization of what we should attempt to satisfy. Once the index representation is allowed to change in response to users RelFbk, we are faced with the question of how fast this change should occur. Some of these questions were mentioned in terms of conceptual drift by users and topic tracking in news sources (cf. Section 7.3.2). Adopting the longer-term archival perspective of a librarian perhaps, how quickly should we want our adaptive system to track current trendy terms? (cf. Wired magazines Tired/Wired Memes feature). When old terms of art have been rendered obsolete, how can we nevertheless maintain an archival record of this previous terminology?ETHOLOGY 7.8 Symbolic and Subsymbolic LearningMost of the learning applications we have discussed apply to the Index relation between keywords and documents. But there are many other syntactic clues associated with documents from which we can also learn. Chapter 6 discussed a number of these heterogeneous data sources. But as we attempt to learn with and across both structured attributes and statistical features (recall the distinction of Section 1.6), it is important to keep several key differences in mind. The distinction between symbolic features of each document (e.g., date and place of publication or author), which represent unambiguous features of the world that human experts can reliably program directly, and the much larger set of subsymbolic features from which we hope to compose our induced representation [Rose and Belew, 1989] becomes especially important as we attempt to combine both manually programmed and automatically learned knowledge as part of the same system [Belew and Forrest, 1988]. Even among these attributes, however, there is room for learning about their meaning. For example, while a scientific paper may have many nominal authors, it is often only one or two to whom most readers will attribute significant intellectual contribution. While papers often have extensive bibliographies, some of these also are more significant than others and can be considered supporting or antagonistic (see Section 6.1). For all these reasons, FOA is an especially ripe area for AI and machine learning. The fact that documents are composed of semantically meaningful tokens allows us to make especially strong hypotheses about how they should be classified. One fundamentally important feature of the FOA activity (unless the WWW alters our world entirely!) is that there will always be more instances of document readings than of document writings. That is, while we can imagine spending a huge effort analyzing any text, there are fundamental limits to how much we can learn about it from only the features it contains. But each time a document is retrieved and read by a reader, we can potentially learn something new about the meaning of this document from this new persons perspective. Machine learning techniques are mandatory if we are to exploit the information provided by this unending stream of queries. As discussed in Chapter 6, the histories of IR and AI have crossed many times in the past, generally in head-on collisions rather than constructively. But as AI has moved from a concern with manually constructed knowledge representations to machine learning, and as IR has begun to consider how indexing structures can change with use, these two methodologies can be expected to increasingly overlap. | ||||||||||||||||||||||||||||||||||||||||||||||||
![]() | ||||||||||||||||||||||||||||||||||||||||||||||||