5

Back to TOC

Mathematical Foundations

No inquiry can go very far without a rigorous notion of where it’s been. Mathematics continues to provide our most reliable representation for the construction of new hypotheses and their testing. This chapter pulls together a wide range of mathematical issues arising in the FOA context.

5.1 Derivation of Zipf’s Law for Random Texts

As before, we begin by defining a word to be any sequence of characters separated by spaces. Let us therefore consider an alphabet of M characters, interspersed with a specially designated space character . We will consider an especially simple model (similar to that used by many others [Li, 1992; Miller, 1957; Hill, 1970; Hill, 1974]) in which a random monkey generates words by hitting all keys – space and letters – with equal probability p:

(5.1)

We can use lexicographic trees to conveniently organize words of length k, say, by the order in which the k characters occur prior to the terminating space, as shown in Figure 5.1 This shows a set of M + 1 trees, each rooted in the words’ starting character. Leaf nodes at level k are all labeled with the probability of the sequence of k – 1 characters prior to the space occurring at level k.

One immediate observation is that Nk, the number of words wi of length k or less, is:

(5.2)

In an infinitely long sequence of characters generated according to Equation 5.1, we will expect to find a “word” wk terminating at level k (i.e., a string of k unbroken nonspace characters bracketed by two spaces) with probability defined in terms of the independent character probabilities p:

(5.3)

We can compute c, the constant of proportionality, by including all the Mk words of length k and summing these probabilities over all possible words (including unrealistic, infinitely long ones!):Double-counting spaces?!

(5.4)

Next consider the rank of these words. Because the probability of a word’s occurrence is an exponentially decreasing function of its length, we know that the M highest ranked words are the one-character words; next come the M2 two-letter words; and so on. Using Equation 5.2 we therefore know how the rank rk of all words wk terminating on level k must be bounded above and below:

(5.5)

where denotes a compromise “average” rank for all the Mk equiprobable words.*

Note that Equation 5.4 and Equation 5.5 define the words’ probability and rank, respectively, in terms of the common metric k. As Li [Li, 1992] notes, Zipf’s law is fundamentally about this transformation, from an exponential distribution onto a rank variable.

Solving both equations for k:

(5.6)

we can now set them equal and derive an expression for a word’s probability in terms of its rank:

(5.7)

This has the functional form required by Mandelbrot’s generalized Zipf’s law (cf. Equation 3.2):


where
(5.8)

5.1.1 Discussion

Obviously these formulas all depend on the alphabet size selected, and this will certainly not be fixed across the systems considered. For example, it is not unusual for an IR system to “fold case,” i.e., to treat upper- and lowercase letters interchangeably, but many also preserve this case information. The capitalization of proper names will sometimes provide critical clues for appropriate index terms. Similarly, our choice of which characters we use to break the stream into wordlike tokens has consequence.

see exercise 1 Exercise 1


see exercise 2 Exercise 2

Fortunately for the robustness of Zipf’s law, the alphabets typically considered in these analyses are generally large enough that differences between only uppercase or upper- and lowercase alphabets are inconsequential. Figure 5.2, showing how quickly becomes nearly unity as the size of the character set grows, also makes it clear why Zipf’s simpler hyperbolic form is an adequate approximation.

It is also interesting to note a potential connection between Zipf’s law and information theory. Mandelbrot [Mandelbrot, 1953; Mandelbrot, 1982] initially attempted to derive Zipf’s law as the solution minimizing the average cost per unit of information conveyed by a text. George Miller seems to have found this effort amusing:

...the random placement of spaces which leads to Zipf’s rule is actually the optimal solution. Our monkeys are doing the best possible job of encoding information word-by-word, subject to the constraints we impose on them. If we were as smart as the monkeys we, too, would generate all possible sequences of letters and so make better use of our alphabet. [Miller, 1957]

5.2 Dimensionality Reduction

As Section 3.5 has already suggested, our interest in matching queries and documents within a vector space can benefit greatly from other kinds of matching that have arisen in other kinds of vector spaces [Schutze, 1993]. We review several other basic features of the mathematical topic of linear algebra before applying them to the problem of FOA.

It is worth remembering that similarity information can come from many sources. For example, we will later (cf. Section 6.1) have much to say about how citation structure can be represented as a graph, with each document represented by a node in the graph and a directed edge going from di to dj, when dj appears in the bibliography of di. Note that the documents’ citation information is entirely independent of the words they contain and can be the basis for another characterization of the topical similarity between documents:

Two documents are about the same topic to the extent that they share the same documents in their bibliographies.

For now, such bibliometric data need only seem like a plausible, new way to analyze document content. Chapter 6 will discuss other features that we might also exploit.

Our goal in casting similarity matching of queries with documents in general mathematical terms is to make the resulting solutions sufficiently broad to handle any kind of features, keywords, or bibliographic citations.

5.2.1 A Simple Example

Imagine that we’ve collected data on the HEIGHT and WEIGHT of everyone in a classroom of N students. If these are plotted, the result would be something like Figure 5.3. Notice the correlation around an axis we might call something like SIZE. Students vary most along this dimension; it captures most of the information about their distribution. It is possible to capture a major source of variation across the HEIGHT/WEIGHT sample because, just as with our keywords, the two quantities are correlated.

In this section we analyze similar statistical correlations among the keywords and documents contained in the much larger vector space model first mentioned in Section 3.4. Recall that in the vector space model, the Index relation placing D NDoc vectors corresponding to the corpus documents within the space V, where V NKw (for vocabulary size), is defined by its keyword vocabulary.

Here we describe this in the terms of linear algebra,* where J = Index is a D V element matrix.Beyond the puny three dimensions of human existence

Attempts to reduce this large dimensional space into something smaller are called dimensionality reduction. There are two reasons we might be interested in reducing dimensions. The first is probably more obvious: It’s a very unwieldy representation of documents’ content. Individual documents will have many zeros, corresponding to the many words in the corpus V not present in an individual document; the vector space matrix is very sparse. Dimensionality reduction is a search for a representation that is denser, more compressed.

Another reason might be to exploit what has become known as latent semantic relationships among these keywords. When we make each term in our vocabulary a dimension, we are effectively assuming they are orthogonal to one another; we expect their effects to be independent. But many features of FOA suggest that index terms are highly dependent, highly correlated with one another. If that’s the case, we can exploit that correlation by capturing only those axes of maximal variation and throwing away the rest.

5.2.2 Formal Notions of Similarity

Two features of the FOA problem can help us to focus on what is known as the Minkowski metric [Luenberger, 1969; Jain and Dubes, 1988]. First, the result of our calculations will be a real-valued weight associating a keyword with a document or query, and we can assume that this is a continuous quantity. Further, we can make the somewhat more questionable assumption that these weights make “natural” use of zero, and so index weights also fall on what is known as a “ratio” scale. With these two assumptions, the Minkowski metrics are defined as:

(5.9)

where L 1. The most common version is the L = 2 norm, and we will use it here. The L = 1 (“Manhattan distance”) and L = (“sup” norm, where k is replaced with max) are also seen often.

A metric is a scalar function over pairs of points in the vector space. Minkowski metrics satisfy three critical properties:

(5.10)
(5.11)
(5.12)
(5.13)

The measure Sim(x,x) of a vector with itself is what we typically think of as the length of the vector, or more precisely, its norm, .

Two other important features of metric spaces follow from these axioms:

5.2.3 Singular Value Decomposition

Just as students’ HEIGHT and WEIGHT are correlated along the dimension SIZE, we can guess that (at least some small sets of) keywords are correlated and that the vector space representation of a document corpus (cf. Section 3.4) might be simplified in a similar way. The goal is to “reduce the dimensionality” of the documents’ representation in the same way we reduced that of our students’ sizes.

But although we can draw a simple picture revealing the correlational structure between two dimensions, the picture is much more complicated when we try to conceive of the full V 105 space of index terms. For the D vectors we seek a smaller k < V-dimensional solution. We still have as many documents as we had before, but we’re going to use a different set of descriptors.

One of the most effective ways to characterize the correlational structure among large sets of objects is via eigenfactor analysis. The technique we consider is called singular value decomposition. This factors any rectangular matrix into three terms:

(5.14)

where U and A are each orthonormal and L is a diagonal matrix “connecting” them.Mathematical details

As before (cf. Equation 3.39), using inner product to measure similarity, we can define X:

(5.15)

Because J is a D V matrix, X is a D D symmetric matrix capturing all interdocument similarities. Then U is the system of eigenvectors of X, and L has the square roots of the eignenvalues, , along its diagonal. By convention, we order these in decreasing order:

(5.16)

Large eigenvalues correspond to dominant correlations and so, just as looking for the SIZE dimension that captured the main interaction between HEIGHT and WEIGHT, we will rely on the first k dimensions to capture the dominant modes of interaction in J:

(5.17)

This operation is shown schematically in Figure 5.4.Dimensionality reduction using neural networks

As always, whenever we throw something away (viz., the small eigenvectors), the result must be an approximation. That is, there will be a difference between our reduced-dimension representation Jk and the original J.

One easy way to measure this discrepancy is by referring to the D2 interdocument similarities latent in the X matrix and considering how different they are in the approximate matrix Xk

(5.18)

using the L2 norm to measure deviation. Then:

(5.19)

In fact, approximating X with its reduced k-rank SVD decomposition turns out to be optimal, in the sense that it results in minimal Err over all other rank-k alternatives [Bartell et al., 1992].

5.2.4 How Many Dimensions k to Reduce To?

A central question remains: How many dimensions k should be used? To date, the only answers are empirical. Early experiments suggest that using too few dimensions dramatically degrades performance.Earlier attempts to reduce dimensionality The original MED corpus used in early LSI experiments [Deerwester et al, 1990; Dumais, 1991] happens to be distinguished by a particularly topically focused vocabulary. While a few hundred dimensions may suffice to discriminate this small set of documents, it might be necessary to use more dimensions when describing a broader domain of discourse. Empirically, however, reduction to around 500 dimensions seems to provide significant improvements to even very large corpora. For example, Figure 5.5 shows one experiment [Landauer and Dumais, 1997], where k is varied over four orders of magnitude. Some attempts to select k theoretically, borrowing a rank-plus-shift method from signal processing, have also been made [Zha and Zhang, 1998]. The most satisfying answer may come from probabilistic models that relate the raw frequency statistics to an underlying distribution [Papadimitriou et al., 1998; Hoffman, 1999].

5.2.5 Other Uses of Vector Space

The same interdocument similarity information captured in the X = JJT matrix can be used for other purposes, too. For example, Section 7.5.1 will discuss one approach to the problem of classifying documents known as nearest neighbor.

The Index captures patterns of keyword usage across a corpus of documents. The preceding sections have held the corpus constant and used this data to analyze transformations of the keyword dimensions, but the converse is also possible. For example, Section 6.3 will discuss the representation of interkeyword relationships known as thesauri. One simple baseline for keywords is their pairwise similarities, as captured by J:

(5.20)

This produces a V V symmetric, square matrix capturing all interkeyword similarities, exactly analogous to the interdocument similarities of Equation 5.15.

Littman has also considered an interesting application of LSI to the problem of searching across multilingual corpora [Littman et al., 1998].

5.2.6 Computational Considerations

see exercise 3 Exercise 3

All these techniques depend, of course, on being able to compute the SVD for the Index relation. Given the enormous size of this matrix – typically 104 keywords times 105 to 109 documents – this is a nontrivial concern. However, the fact that these matrices are sparse (i.e., any one document vector has a relatively small fraction of keywords associated with it) means that special techniques can be used. In particular, Berry has developed methods based on Lanczos and subspace iteration methods in his SVD-Pack implementation [Berry, 1992; Berry et al., 1994; Letsche, 1996].SVD is patentable?!

Once the document corpus has finally been compressed, another serious recurring computation must be done: Every time a user issues a query, it must be transformed from the original space of “raw” keywords into the reduced k-dimensional space. This means we must keep the matrix Uk available to transform each query into the reduced representation.

see exercise 4 Exercise 4

Another computational issue has to do with the effect of new documents being added to the collection, or older ones being removed. This is one particular typeTemporal drift of drift we might expect in FOA; Section 7.3.2 discusses several others. The addition or deletion of any one document will affect the SVD statistics of the entire corpus very slightly, especially if it is a very large corpus.

see exercise 5 Exercise 5

5.2.7 “Latent Semantic” Claims

Within the IR community, SVD was first applied to the Index matrix by Deerwester et al. [Deerwester et al., 1990; Dumais, 1991] and was called latent semantic indexing (LSI). The “latent semantic” claim derives from the authors’ belief that the reduced dimension representation of documents in fact reveals semantic correlations among index terms. Further, they argue that evidence collected across entire corpora transcend individually “fallible” document instances. That is, while one document’s author might use the word CAR and another the synonym AUTO, the correlation of both of these with other terms like HIGHWAY, GASOLINE, and DRIVING will result in an abstracted document feature/dimension on which queries using either keyword, CAR or AUTO, will project equivalently. “Synonymous” retrieval has been accomplished!

Landauer and Dumais have recently extended this algebraic manipulation of the Index relation into an ambitious model of human memory [Landauer and Dumais, 1997]. Much of psychology is concerned with the problem of how children, those most powerful learning agents, are able to learn so much from such a “poverty of the stimulus.” That is, by many forms of analysis the stimuli driving learning do not by themselves contain sufficient information to induce the elaborate conceptual structures children demonstrate.

Applying these ideas to textual corpora, Landauer and Dumais “trained” an LSI model with presentation of paragraph after paragraph drawn from more than 30,000 encyclopedia articles. Using retrieval on a standardized synonym test as their performance measure, the emerging eigenvector representation (compressed to 300 dimensions) showed a rate of improvement comparable to that of schoolchildren! Because this performance required “indirect inference” like that supported by LSI eigenvectors and beyond what could be accomplished on the basis of simple word cooccurrence alone, Landauer and Dumais suggested that LSI provides an important model of human memory.A new argument for nurture

[S]ome domains of knowledge contain vast numbers of weak correlations that ... can greatly amplify learning by a process of inference....[A] substantial portion of the information needed ... can be inferred from the contextual statistics of usage alone. [Landauer and Dumais, 1997]

At the very least, LSI demonstrates how traditional associative memory models [James, 1893; Hebb, 1949; Baddeley, 1976; Anderson and Kline, 1979] can be extended to exploit higher-order correlations.

The earlier work trying to connect a small number of factor-analytic, “semantically” meaningful dimensions [Koll, 1979; Borko and Bernick, 1963] is also interesting in this respect. Jones and Furnas have also investigated how well cosine/inner product reflects human similarity judgments [Jones and Furnas, 1987]. In any case, a cognitive interpretation of these issues promises to remain an active area of investigation.

We now consider how RelFbk assessments might also be used to provide document-similarity information, and how they can be used to reduce the dimensionality of documents’ representations.

5.3 Preference Relations

5.3.1 Multidimensional Scaling

In many kinds of retrieval interfaces, it is easy and natural for a user to indicate a preference for one retrieved document over another. The RelFbk discussed in Section 4.1 is our standard characterization of this information:IR has historically ignored preferences

(5.21)

The literature on multidimensional scaling (MDS) was developed to deal with assessments by human subjects of the similarity or difference among multiple stimulus patterns [Borg and Lingoes, 1987]. A robust and important result from this analysis is that people can much more easily and consistently provide nonmetric assessments of the similarity than if they are forced to specify metric quantities. Rather than an experimenter arbitrarily imposing dimensions they believe important,

We can ask the subjects to globally judge, without criteria provided by the experimenter, the overall similarities of different facial expressions. The proximities are then mapped into [MDS] distances x, and the configuration examined for systematic characteristics of the distribution of points. [Borg and Lingoes, 1987 p.72, emphasis added]

The “facial expression” example is a reference to Woodworth’s experiments in 1938 [Woodworth, 1938]. Imagine a hypothesis that facial expressions can be characterized in terms of two dimensions of variability. We could test this by showing human subjects pairs of pictures and asking them to judge how similar or dissimilar the two facial expressions are, in terms of “difference in emotional expression or content.” Then imagine they are given some proximity p scale – from 1,2,... k – along which they are to rank the picture pairs; each pair of pictures would therefore have a proximity score pij. If these pictures are to be characterized in a two-dimensional plane, we can also associate with each pair of pictures a distance dij according to their two-dimensional coordinates:Proximity can capture similarity or dissimilarity/distance

pij proximity (evaluation)
dij distance (in two-dimensional plane)

The MDS analysis described by Borg and Lingoes (they actually prefer the term “similarity structure analysis,” or SSA) is a key contribution. It is an algorithm for iteratively moving vectors corresponding to the objects of evaluation (pictures of facial expressions) within an arbitrary dimensional space so as to minimize as much as possible the stress they experience, relative to their pairwise proximities. We think that the pictures have been well placed if those with similar proximities are close together as measured by this distance.

Within any such space, we can replace the pairwise distances dij with di associated with each point (picture) measuring the distance of this point from the origin of the space. Proximities pi are defined in terms of the projections of these points onto the space’s principal component vectors.

To quantify this notion of correspondence between humans’ proximity assessments and their embedding in arbitrary spaces, Guttman has defined a measure of monotonic correspondence between two variables:

(5.22)

This measure captures the extent to which the placement of the items within the two-dimensional space is consistent with the proximities. This formula really is as simple as it looks: It is approximately a correlation of proximities with distances, “gated” by the comparison of differences in the numerator and absolute values of differences in the denominator. The value of µ2 is always greater than simple linear correlation, and the two quantities are exactly equal when a linear relationship holds between proximity and distance.

In the FOA retrieval situation, the obvious measure of interest is the matching function Match(q,d) (cf. Section 4.3.4) score with respect to a (henceforth implicit) query q:

(5.23)

The J measure provides a criterion for the retrieval function Match(·). In experimental situations the only preferences available are that , but in natural retrieval situations, users’ richer RelFbk preference data can be used.

A particularly interesting use of this criterion is as part of error correction learning (cf. Section 7.3). If we assume that the ranking function R has certain free variables , that we again have a training set of documents T, and that the J criterion is differentiable with respect to , a gradient search procedure can be used to adjust toward an optimal retrieval:

(5.24)

For example, Bartell et al. [Bartell et al., 1994b; Bartell et al., 1998] consider the problem of picking a document-query similarity measure R parameterized by to vary across a broad range of alternatives:

(5.25)
(5.26)

This characterization of similarity score includes standard inner product (2 = 0, 1 can be anything), cosine (1 = 2, 2 = 0.5), and pseudo-cosine (1 = 1, 2 = 1) as special cases.

Through empirical testing using the CISI experimental corpus (cf. Section 4.3.9) and its binary relevance assessments for preference relations (), Bartell et al. found optimal values of 1 = 2.5, 2 = 0.3, a curious and nonstandard type of similarity measure indeed, but functionally quite similar to standard cosine. In fact, changing the matching function to this “optimal” value improves performance (as measured by average precision) almost not at all.

5.3.2 Information in RelFbk

Section 5.2.7 raised several important cognitive questions arising from attempts to study “semantic” interpretation of LSI/SVD dimensions. The connection to the psychologically important analysis of MDS adds even more. If we interpret RelFbk information as constraints about which documents a user likes, how much can we reduce dimensionality without violating the RelFbk constraints they are giving us? That is, how much can we reduce the representational space before we’re changing somebody’s order? Second, how many RelFbk statements do we need to accurately determine a good compression? How many people have to tell us things before we have enough information to form this reduced representation? These questions also connect to ones related to learning these representations (cf. Chapter 7) and to our mechanisms, as part of the interface or as part of a special experimental system like RAVE, for efficiently collecting large volumes of RelFbk (cf. Section 4.4). Most of these questions remain unanswered, but a beginning is simply counting the number of preference constraints provided by each RelFbk labeling of a hitlist by a user.

First note that the total number of preference statements required to completely determine n elements grows very rapidly as , corresponding to the fact that preference order information is defined over pairs of pairs of evaluated documents. Against this backdrop, each RelFbk labeling produces:

(5.27)
(5.28)
(5.29)
(5.30)
(5.31)

When this information is used as part of error correction learning, another useful quantity is the number of documents over which the RelFbk preference relation and the ranked list disagree:

(5.32)

Because this relatively small number of data points will always be dwarfed by the number of total preferences across the entire corpus, the goal becomes to constrain the application of RelFbk data to those subsets of the corpus possibly appropriate to a particular query. And because we intend to learn from the browsing users, we can afford to be patient: The documents are only written once but browsing users will continue to read them and provide RelFbk for some time.

5.3.3 Connections between MDS and LSI

Bartell et al. [Bartell et al., 1994a] have shown that any time the basis of MDS is S, a positive semidefinite matrix of d observations of t variables (cf. Equation 5.15), the LSI decomposition together with inner product similarities provides an optimal MDS scaling. That is, under reasonable conditions MDS and LSI arrive at identical results.

This correspondence is more than a mathematical curiosity. MDS allows generalization of the scaling to any other source of document-document similarity information! There are many possible sources of information regarding ways to analyze the similarity of a number of documents; bibliometric information is one practical alternative already mentioned. MDS is potentially important because it allows us to go beyond the set of static features of the corpus, to consider the unending stream of RelFbk generated by browsing users. Chapter 4 discussed users’ RelFbk assessments in detail. Recall that one of the critical features of RelFbk information is that it is nonmetric. MDS allows us to define optimal scalings directly from nonmetric RelFbk data.

5.4 ClusteringVan Rijsbergen’s long shadow

5.4.1 The Cluster Hypothesis

We have been talking about a measure of association between query and document, but as discussed in Section 5.2.5, we can use X = JJT (cf. Equation 5.15) to capture similarities or differences among the documents themselves. Van Rijsbergen’s cluster hypothesis suggests how useful this data might be:

Closely associated documents tend to be relevant to the same request.[van Rijsbergen, p.45]

The cluster hypothesis suggests that if we first compute a measure of distance X among all the documents, we could cluster those documents that seem to be close to one another or seem to be about the same topic and retrieve them all in response to a query that matched any of them. If a query happens to match one document based on an overlap of features, then we will claim that other documents that are similar to it should also be retrieved.

As van Rijsbergen, p.38 notes, the fact that most measures of association are monotonic with respect to one another makes it sufficient that clustering methods only respect the same rank ordering of their results; once again the nonmetric quality of critical FOA quantities is striking.

5.4.2 Clustering Algorithms

Because of its widespread application, clustering is one of the most well-studied problems within statistics and computer science [Jain and Dubes, 1988; Griffiths et al., 1986]. It can be stated in generic terms, in terms of an arbitrary similarity measure between items to be clustered. Especially in iterative applications of clustering, the relative frequency of various clustering techniques matters. The art, therefore, in applying a clustering method in a particular application depends greatly on the particular features of items to be clustered and the number of partitions within which these items are to be clustered. W. Willett has provided an excellent survey of applications of clustering applied to various aspects of the FOA task [Willet, 1988]. More recently, Zamir has considered clusterings of document “snippets” (roughly the same as the paragraph units proposed in Section 2.3) rather than on complete documents [Zamir and Etzioni, 1998].

The most typical clustering method for application within the FOA context is single-link hierarchical clustering. This is considered an agglomerative technique, because it begins with each data point considered to be in its own cluster and then iteratively asks which two clusters should be combined.Divisive/partitional clustering

The most typical method for performing this task builds a minimum spanning tree (MST), iteratively merging the two documents that are closest together and then including the merged nodes. The result is a tree from the finally merged root to each of the documents as a leaf. Alternatively, the complete set of interdocument similarity measures can be progressively “filtered” by a gradually increasing threshold, which is used to define whether two documents are considered connected by an edge. We can then ask for certain properties of the emerging graph, for example, stopping when all components become connected. To keep things simple and focused on robust assumptions [van Rijsbergen, p.59], O(n2) algorithms are assumed.

One early motivation for clustering algorithms was efficient disk access: If documents were preclustered appropriately, it would be likely that a “page” of disk memory containing one document matching a query might also contain other elements of the same cluster. That is, an efficient physical disk allocation algorithm might try to ensure that clustered documents were written to the same block.

This example suggests how MST representations of the documents might also support efficient query/document matching algorithms. In brief, the MST can be interpreted as a “decision tree,” with a query beginning at the root and comparing itself to a candidate representation (e.g., centroid) of each cluster.

The cluster hypothesis suggests that if a document that is similar to the query is relevant, then other documents similar to it are likely to also be relevant. By clustering neighboring documents, nearest neighbor documents should also be retrieved. But as Cutting et al. note, for a partitioning cluster to be useful in this application k must be very near N [Cutting et al., 1992]. We do not want to retrieve more than a linear constant of other documents when we retrieve one.

ScatterGather is an intereseting example of how partitional clustering can be applied in a provocative way [Cutting et al., 1992; Hearst and Pedersen, 1992]. Beginning with a clustering of the entire corpus, initial queries are posed by selecting some of the clusters. Documents identified as members of these clusters form a new subcorpus; clustering is applied across these, and so on. Evaluating the utility of such a novel browsing pattern is difficult, but the ability of the ScatterGather interface (as well as a similar effort applied to the MacOS interface called Piles [Rose et al., 1993] to improve retrieval effectiveness is quite clear. A problematic feature of the ScatterGather procedure is its sensitivity to the selection of initial “buckshot” random seeds. But cutting doesn’t necessarily see this as a bug:

Indeed, the lack of determinism might be interpreted as a feature, since the user then had the option of discarding an unrevealing partition in favor of a fresh re-clustering. [Cutting et al., 1992, p.324]

5.5 Probabilistic Retrieval

It should be clear by now that there are few logical grounds on which we could irrefutably prove that any document is relevant to any query. Our best hope is to retrieve relevant documents with high probability; i.e., we can only know about uncertainly. Beginning in the early 1970s, the most seminal thinkers within IR have attempted to connect fundamental concepts like “relevance” from within IR to probability theory (see [Cooper, 1971; Cooper, 1973; Bookstein and Swanson, 1974; Cooper and Maron, 1978; Cooper, 1983] and other references throughout this section). As typical search engines moved past simpler Boolean retrieval to ranked retrievals [Cooper, 1988], and more recently as representational languages like Bayesian networks have become widespread within artificial intelligence (cf. Section 5.5.7), a probabilistic foundation for FOA has become an increasingly central concern. The derivation of the basic components of the Bayesian decision models presented here follows van Rijsbergen, p.115 quite closely; the lecture notes on probabilistic IR developed by Nobert Fuhr provide a similar treatment of this background.

5.5.1 Probability Ranking Principle

We begin with an important assumption called the Probability Ranking Principle (PRP):Who stated the PRP?

If a reference retrieval system’s response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance ... the overall effectiveness of the system to its user will be the best that is obtainable. [van Rijsbergen, p.113]

This assumption reduces the problem of building an optimal IR system to one of ordering documents in order of decreasing probability of relevance. Defining our retrieval task in these terms is optimal in the sense that it minimizes the amount of expected error in retrieval performance (cf. Section 5.5.6).The PRP hides another assumption

There are at least two possible interpretations of precisely what a probability of relevance, Pr(Rel), means, in terms of an underlying event space [Maron, 1977; Maron, 1982a; Maron and Kuhns, 1960]. The first is to imagine (again considering a particular query) that the “experiment” of showing the document to a user is repeated across multiple users. Alternatively, we can imagine that the same query/document relevance question is put repeatedly to the same user, who sometimes replies that it is relevant and sometimes that it isn’t. However we interpret Pr(Rel), we will focus on one particular query and compute Pr(Rel) conditionalized by any and all features we might associate with the document d.

Consistent with previous notation (cf. Section 4.3.4 and Section 5.3.1), we will define Rank(d) to be a positive integer assigned to each document in the Retrset, in descending order of similarity with respect to a (henceforth implicit) query q, and using the matching function Match(q,d):

(5.33)

According to the PRP, we can hope that our Match(q, d) function accurately reflects the probability of relevance:

(5.34)

(Henceforth, we again restrict our attention to a single query and drop the subscript q.) In fact we only need to require that it reliably ranks documents (again with respect to an implicit query q):

(5.35)

Finally, to emphasize the representation of the document into its constituent features, as well as the sensitivity of our notions of relevance on this representation (cf. [van Rijsbergen, 1992]), the remainder of this section will change notation slightly. Let x be a vector of features xi describing the document. The PRP is then restated:

(5.36)

5.5.2 Bayesian Inversion

If we had worked hard on a particular test corpus of documents to identify (always with respect to some particular query) which documents were Rel and which were not, it would be possible to carefully study which features xi were reliably found in relevant documents and which were not. Collecting such statistics for each feature would then allow us to estimate:

the probability of any particular set of features x, given that we know it is Rel. (Just which statistics we collect, and how, will be discussed in more detail in Section 7.4 as part of a more general classification task.) The retrieval question requires that we ask the converse: the probability that a document with features x should be considered relevant. This inversion is accomplished via the familiar Bayes rule:

(5.37)

5.5.3 Odds Calculation

In addition to evidence the features provide to show that a document is Rel, they may also provide evidence that it is not: .Rational world An odds calculation balances both probabilities as a ratio:

(5.38)

The Bayes rule can also be applied to this ratio:

(5.39)

The first term will be small; the odds of picking a relevant versus irrelevant document independent of any features of the document are not good. Still, Odds(Rel) can be expected to be a characteristic of an entire corpus or the generality of the current query but insensitive to any analysis we might perform on a particular document.

In order to calculate the second term, we need a more refined model of how documents are “constructed” from their features.

5.5.4 Binary Independence Model

Perhaps the simplest model proceeds by imagining binary, independent features; it is conventionally called (surprise!) the Binary Independence Model (BIM) [Robertson and Sparck Jones, 1976; van Rijsbergen, 1977]. First, the binary assumption is that all the features xi are binary. This is not a very restrictive assumption and is used only to simplify the derivation.

The much bigger assumption is that the documents’ features occur independently of one another. We have discussed the problems with such an assumption before. Van Rijsbergen, p. 120 quotes J. H. Williams’s expression of the paradox:

The assumption of independence of words in a document is usually made as a matter of mathematical convenience. Without the assumption, many of the subsequent mathematical relations could not be expressed. With it, many of the conclusions should be accepted with extreme caution. [Williams, 1965, emphasis in original]

The key advantage it allows is that the probability of a feature vector x becomes simply the product of the marginal probabilities of the individual features:

(5.40)

Very convenient – and very unrealistic.Maybe this assumption isn’t so bad?

Applying this decomposition to our odds calculation gives:

(5.41)

It will be convenient to introduce the variables pi and qi to capture the probabilities that feature xi is present, given that a document is or is not relevant:

(5.42)
(5.43)

The complementary probabilities concerning documents in which the feature is absent can also be defined easily:

(5.44)
(5.45)

These definitions break the product into two portions, the first having to do with those features that are present in a particular document and the second with those that are not:

(5.46)

Recall that both queries and documents live within the same vector space defined over the features xi. The two products of Equation 5.46 (defined in terms of presence or absence of a feature in a document) can be further broken into four subcases, depending on whether the features occur in the query. We next make another “background” assumption concerning all the features xi that are not in both the query and the document of current interest; we assume that the probability of these features being present in relevant and irrelevant documents is equal: pi = qi. In other words, for those terms we don’t care about (because they don’t affect this query/document comparison), we are happy to think that their occurrence is independent of their relevance.

Consider the sets D and Q shown in Figure 5.6 defined in terms of those features xi present and absent in the document and query, respectively.* Regrouping the two products of Equation 5.46 into four products created by the two sets D and Q, the terms cancel except in the intersection of the query and document (where the feature xi is present in both) and in Q\D, the set difference of Q less D:

(5.47)

In the retrieval situation we will exploit the sparseness that makes it much more efficient to keep track of where a feature does occur (xi = 1) than all the places it does not (xi = 0). Since the second product is defined over all the features of q except those in d, if we are careful to “pre-multiply” each feature in their intersection by a reciprocal, we can then safely multiply everything in the query by the same ratio:

(5.48)

The next section will show the utility of separating the last term, which depends on features of the document in question, from the first two, which do not, as part of an online retrieval calculation.

But first, it is worthwhile considering how we might attempt to estimate some of the required statistics [Robertson and Sparck Jones, 1976]. Fuhr [Fuhr, 1992], for example, considers the retrospective case when we have RelFbk from a user who has evaluated each of the top N documents in an initial retrieval and has found R of these to be relevant (as well as evaluating all the N – R remaining and found them to be irrelevant!). If a particular feature xi is present in n of the retrieved documents with r of these relevant, then this bit of RelFbk provides reasonable estimates for pi and qi:

(5.49)
(5.50)

5.5.5 Linear Discriminators

The retrieval problem is made simpler by noting that all we really need is a simple discrimination threshold, above which the evidence for Rel is sufficient that our retrieval system elects to show a user the document as part of a hitlist. We reflect this by making the Rank function proportional to the odds of relevance given the document:

(5.51)

Assuming the features xi are stochastically independent is very helpful, but manipulating the product of terms is still awkward. For example, we might reasonably want to regress (train) the feature variables against known Rel documents to compute weights for their relative contributions [Cooper et al., 1992; Gey, 1994]. While nonlinear regression algorithms (e.g., neural networks) do exist, regression with respect to a linear combination of features is most straightforward.

Logarithmic functions perform just this transformation (i.e., transforming products to sums) and are also guaranteed to be monotonic. Monotonicity guarantees that if a < b then log(a) < log(b), and the ranking scores we use to order our hitlist will not change the relative order of any two documents on our hitlist.The two steps of

  • first comparing two probabilities in an Odds ratio and then
  • taking logarithms to form linear combinations rather than products

arise so frequently that the two operators are often composed and considered a single LogOdds calculation.Weighting evidence

For our discrimination purposes, then, LogOdds will suffice:

(5.52)

LogOdds(Rel) reflects the prior probability that we are likely to find a relevant document in our corpus, independent of any features of the query/document. This could vary, for example, between very general and very specific queries, but again, it should not alter our ranking of documents pertaining to a particular query. Focusing exclusively on those factors that will be useful in discriminating one document from another, with respect to a particular query, Equation 5.52 simplifies considerably, and we come up with a Rank measure:

(5.53)

where the weight ci associated with each feature

(5.54)

is called the relevance weight (also known as retrieval status value (RSV)) of this feature, and is the query-invariant constant:

(5.55)

corresponding to the log of the first two document-insensitive terms of Equation 5.52.

5.5.6 Cost Analysis

The discussion of Section 4.3.4 suggests that, as with most real-world decisions, there is no perfect way to select relevant documents. Even if we can accomplish the PRP and order all the documents, there remains a retrieval threshold to set. If we set the threshold too high, we will not show users some documents they might wish to see, and if we set it too low we will show them too many.

But we can capture this tension by associating two costs (a.k.a. losses) with each of two possible sources of error. The first cost CRN is incurred when we retrieve an irrelevant document, the second CNR when we don’t retrieve a relevant document. To make these costs concrete, we might imagine that there is a limited resource (hitlist screen real estate, user search time), and the first cost is proportional to using up this precious resource. Similarly, the second cost might be proportional to the cost of losing a malpractice suit, when a legal case on point wasn’t found but should have been!

In terms of the LogOdds ranking function of Equation 5.54, the trade-off between these two costs can be realized by another term added to the constant of Equation 5.55:

(5.56)

We can easily imagine adding a knob to a browser reflecting the trade-off between these two costs [Russel et al., 1993].

5.5.7 Bayesian Networks

As will be discussed in great detail in Chapter 6, there are many types of information on which we might wish to base our retrieval. Most of this section has concerned probabilistic models for the crucial Index relation between keywords and documents, but we may wish to model other information probabilistically as well.

Bayesian networks (also known as inference or belief networks) are a modeling language within which many probabilistic relationships can be expressed as part of a common representation and used as part of a unified inference procedure [Pearl, 1988]. A Bayesian network is a graph in which nodes correspond to propositions and links correspond to conditional probabilistic dependencies between these propositions. A directed link from node p to node q is used to model the fact that p causes q, although other semantics (e.g., logical implication) are sometimes also used.

When representing interactions among n propositions, we must generally consider the possible dependency of each proposition on every other. To do this completely requires an exponential number of statistics, which is impractical for most situations and certainly if we attempt to model interactions between all the documents in our corpora and their keywords. Within Bayesian networks, this full set of statistics – the joint probability distribution – is replaced by a sparse representation only among those variables directly influencing one another. Interactions among indirectly related variables are then computed by propagating inference through a graph of these direct connections.

The key integration of probabilistic information across interacting variables is accomplished by specifying how each child node depends on the set of its parents’ values. A table of conditional dependency probabilities specifies, for each possible value of each parent node, the probability of each of the child variable’s value; see Figure 5.7. With these condiditional relationships specified for each node, querying a Bayesian network corresponds to placing prior probabilities on some elements of the network and then asking for the probability at other nodes.

One of the most comprehensive applications of Bayesian network representations to the propositions associated with FOA is due to Croft with Turtle and other students [Turtle, 1990; Turtle and Croft, 1990; Turtle and Croft, 1991; Turtle and Croft, 1992; Callan et al., 1995] and is shown in Figure 5.8. This graph shows four types of nodes: documents, keywords, queries, and a single information need node.Multiple representations of the same document

This graph is rooted on the document nodes. To use this representation, a single document is “instantiated,” meaning that we posit that only this one document is retrieved and we ask for the probability that the information need is satisfied. Given these estimates for each document, the hitlist is formed according to the Probability Ranking Principle. Estimates for the keywords’ dependencies on documents rely on the same weighting techniques discussed in Section 3.3.

Fuhr and Buckley have extended this formalism to include an additional level, which they call relevance description (·) [Fuhr and Buckley, 1991]. This is an arbitrary function over terms and documents (k,d). Using this descriptive layer, rather than computing

the probability of relevance given a particular keyword and a particular document, they propose to evaluate

where this is the “probability that a document will be judged relevant to an arbitrary query, given that one of the document’s index terms, which also occurs in the query, has the relevance description .” Fuhr argues that by separating the description function from the keyword and the document, a wider range of descriptions can be considered, and there is no longer a need to associate a probability of relevance with individual keywords or documents.

An alternative formulation proposed by Ribeiro and Muntz, following work by Wong and Yao, imagines the keyword vocabulary describing a universe of discourse [Ribeiro, 1995; Ribeiro and Muntz, 1996; Wong and Yao, 1995], as shown in Figure 5.9. Treating each keyword as a binary variable, any of the 2V possible subsets corresponds to a concept; any query and every document can be described as a concept within this space. The goal of retrieval becomes one of conceptual matching of a query against that of the documents. Either Pr(d|q) or Pr(q|d) can be used to reflect the strength of the concept matching relationship. In fact, these two quantities can generally be made equivalent with proper normalization [Wong and Yao, 1995].

Note that the Bayesian network generated from this perspective inverts the causal links from those proposed by Turtle and Croft! Riberiro and Muntz argue that their formulation is superior, because it treats queries and documents symmetrically (as we would expect for a concept matching retrieval). Further, the fact that conventional cosine-similarity matching requires normalization of the document vector over all index terms creates a dependence on terms not contained in the query. Any such dependence violates a fundamental condition for Bayesian network graphs [Pearl, 1988].

see exercise 6 Exercise 6

A good example of modeling within the Bayesian network formalism is to show how multiple query formulations, for example, Boolean and weighted vector space query strategies,can be modeled interchangeably. Figure 5.10 shows in detail two possible ways in which query nodes might be connected to particular keywords. The first shows a Boolean combination of keywords, and the second shows a network capturing dependencies between a phrase keyword and its constituent elements. Interaction among multiple queries all designed to satisfy the same information need, as discussed in terms of a query session in Section 4.2, can also be modeled.


TERMS INTRODUCED IN THIS CHAPTER

agglomerative (166)
associative memory (160)
Binary Independence Model (170)
citation (153)
classifying (158)
cluster (165)
cluster hypothesis (165)
conceptual matching (178)
dimensionality reduction (154)
discrimination threshold (173)
drift (159)
eigenfactor analysis (156)
features (169)
joint probability distribution (176)
latent semantic (154)
latent semantic indexing (159)
lexicographic trees (149)
logarithmic (174)
marginal probabilities (171)
minimum spanning tree (166)
monotonic (174)
monotonic correspondence (162)
multidimensional scaling (161)
nearest neighbor (167)
orthogonal (154)
orthonormal (156)
Probability Ranking Principle (168)
proper names (152)
rank-plus-shift (158)
regress (173)
relevance description (177)
relevance weight (174)
retrieval status value(RSV) (174)
singular value decomposition (156)
sparse (154)
stress (162)
thesauri (158)
Back to TOC