3

Back to TOC

Weighting and Matching against Indices

The Bible Code as Ouija Board: The fascination with the subliminal, the camouflaged, and the encrypted is ancient. Getting a computer to munch away at long strings of letters from the Old Testament is not that different from killing animals and interpreting the entrails, or pouring out tea and reading the leaves. It does add the modern impersonal touch – a computer found it, not a person, so it must be “ really there.” But computers find what people tell them to find. As the programmers like to say, “prophesy in, prophesy out.” [Menaud, 1996]

3.1 Microscopic Semantics and the Statistics of Communication

In the last chapter, we described the FOA process linguistically, in terms of words that occur in documents, morphological features of these words, structures organizing the sentences of documents, etc. We now want to treat all of these words – which have meaning to their authors and to us reading them – as a meaningless stream of data: word after word after word. (Imagine it coming from some SETI radio telescope, eavesdropping on the communication of some other planet!) We will now seek patterns and trends common to this data, using the same sorts of statistical tricks that physicists typically use on their data streams. What can we learn from looking at the statistics of our data stream, treating text as meaningless and attempting to infer a new notion of meaning from those statistics?

But now let’s narrow our focus, all the way down to the bits and characters used to represent the corpus as, for example, a file on a physical device, like a hard disk. Imagine that you are an archaeologist trying to study some civilization that left this evidence behind. How might you interpret this modern Rosetta Stone?

Let’s ignore those issues relating to basic ASCII encoding. That is, suppose we have special knowledge of a character set. Then the frequency of these characters’ occurrences would already give us a great deal of information. Anyone who has studied simple cipher techniques (or played Scrabble) knows that a table of most frequently used letters (cf. Table 3.1 [Welsh, 1988]) can be used to break simple codes.

In this chapter we will move another level above characters. We will consider morphological transformations we can perform on character sequences that help us to identify root words. We will briefly mention phrases by which multiple words can be joined into simple phrasal units.

At each level we will ask very similar questions: What is our unit of analysis; i.e., what are we counting? Then, what does the distribution of frequency occurrences across this level of features tell us about the pattern of their use? What can we tell about the meaning of these features, based on such statistics [Francis and Kucera, 1982]?

In fact, many influential thinkers have looked at such patterns among symbols. Going back to some of the most ancient writings suggests that statistical analyses of the original Hebrew characters and their positions within the two-dimensional array of the page reveals new “codes” [Witztum et al., 1994; Drosnin, 1997].

Donald Knuth, one of computer science’s most renowned theoreticians, has analyzed an apparently random verse (Chapter 3, verse 16) from 59 of the Bible’s books and used these verses as the basis of stratified sampling of the approximately 30,000 Biblical verses [Knuth, 1990]. He found, for example, that the 3:16 verses were particularly rich in occurrences of YHWH, the ancient Hebrew name for God. Personally, Knuth considered this analysis the source for “historical and spiritual insights,” as part of a Bible study class he led. But speaking scientifically, how can we find meaning in text, and how are such attempts to be distinguished from the kinds of “Ouija board” numerology criticized by Menaud in the quotation opening this chapter?

3.2 Remember Zipf

Looking at our corpus as a very long string of characters, something that even a monkey could generate, provides a useful baseline against which we can evaluate larger constructs.

Associate with each word w its frequency F(w), the number of times it occurs anywhere in the corpus. Now imagine that we’ve sorted the vocabulary according to frequency so that the most frequently occurring word will have rank r = 1, the next most frequently used word will have r = 2, and so on.

George Kingsley Zipf (1902–50) has become famous for noticing that the distribution we find true of our corpus is in fact very reliably true of any large sample of natural language we might consider. Zipf [Zipf, 1949] observed that the words’ rank-frequency distribution can be fit very closely by the relation:

(3.1)

This empirical rule is now known as Zipf’s law.

But why should this pattern of word usage, something we can reasonably expect to vary with author or type of publication, be so universal? What’s more, the notion of “word” used in this formula has varied radically: in tabulations of word frequencies by Yule and Thorndike, words were stemmed to their root form; Yule counted only nouns [Yule, 1924; Thorndike, 1937]. Dewey [Dewey, 1929] and Thorndike collected statistics from multiple sources; others were collected from a single work (for example, James Joyce’s Ulysses). The frequency distribution for a small subset of (nonnoise words in) our AIT corpus is shown in Figure 3.1. Note the nearly linear, negatively sloped relation when frequency is plotted as a function of rank, and both are plotted on log scales.

3.2.1 Looking for Meaning in All the Wrong Places (At the Character Level)

The ubiquity of data obeying Zipf’s law has made it a lightning rod, attracting a number of “explanations.” These explanations come from an extremely impressive set of original thinkers, in widely ranging disciplines:

  • Noam Chomsky, the most influential linguist of the past 30 years, together with George Miller, the mathematical psychologist famous for such insights as the “7 ± 2 chunks” of memory limitation [Miller, 1957];
  • Herbert Simon, the Nobel Prize–winning economist and one of the fathers of artificial intelligence [Simon, 1955]; and
  • Benoit Mandelbrot, the mathematician and physicist most famous for his work on fractals [Mandelbrot, 1953].

Herbert Simon, a keen observer of much cognitive activity, suggests that the ubiquity of Zipf’s law across heterogeneous collections should make us somewhat suspicious of its ability to address the “fine structure” of linguistics:

No one supposes that there is any connection between horse kicks suffered by soldiers in the German army and blood clots on a microscope slide other than that the same urn scheme provides a satisfactory abstract model of both phenomena. It is in the same direction that we shall look for an explanation of the observed close similarities among the five classes of distributions.... [Simon, 1955]

(With “urn,” Simon is referring to mathematical models, e.g., related to Poisson processes. See Section 3.3.2 for more on the “five classes” of Simon’s models.)

We therefore begin this section by reviewing a number of early attempts to explain the phenomena underlying Zipf’s law; its mathematical derivation is reserved for Chapter 5.

3.2.2 Zipf’s Own Explanation

To explain his empirical observations, Zipf, himself proposed a theoretical model that described the ultimate purpose of communication between authors and readers.

Zipf’s theory was extraordinarily broad, addressing not only (!) patterns in text but also patterns across all human activities. According to Zipf’s fundamental Principle of Least Effort all activities can be viewed as interactions between jobs needing to be done and tools developed to accomplish the jobs. In a mature society in which a variety of jobs and tools have existed for some time, a “reciprocal economy” forms. That is, there is a set of tools appropriate for doing certain jobs, and there is a set of jobs requiring certain tools. The Principle of Least Effort asserts that a person attempting to apply a tool to a job does so in order to minimize the probable effort in using that tool for that particular job.

In applying this principle to texts, Zipf makes an important correspondence – words work as tools, accomplishing jobs we need done. To simplify the situation greatly, imagine that the job an author is attempting to accomplish is simply to “point” to some referent, something in the world.Pointing Authors would find it most convenient to simply use one word all the time for all the jobs they are trying to accomplish. It makes their task much easier; picking the right word is effortless. The author has a pressure toward unification of the vocabulary.

From the reader’s point of view, it would be least ambiguous if a unique term were used for every possible function, every possible interpretation, every meaning. Readers therefore have a pressure toward diversification of the vocabulary. This leads to the vocabulary balance we observe in Zipf’s rule. Zipf hypothesized that interplay between the forces of diversification and unification results in the use of existing words, which does not extend the vocabulary in most situations, together with the inclusion of new words in those novel situations that demand them. The trick is to find an economy of language that best satisfies both writer and reader. Note, however, that the maintenance of the balance requires that authors receive feedback from their readers, confirming that they are both “pointing” to the same referent.

Blair has extended Zipf’s analysis, considering Zipf’s tool/job setting as it’s applied to our FOA task [Blair, 1990; Blair, 1992].He argues that one of the primary reasons FOA systems fail is that the vocabulary balance is upset. The system of descriptors indexing the authors’ works (for example, the library or the Web), standing between the authors who are writing the books and the searchers attempting to find them, breaks the feedback channel that keeps the shared vocabulary in balance when author and reader are in direct contact.

3.2.3 Benoit Mandelbrot’s Explanation

The early days of cybernetics were heady, and Zipf was not alone in seeking a grand, unifying theory to explain the phenomena of communication on computational grounds like those proving so successful in physics. Benoit Mandelbrot was equally ambitious.

Mandelbrot’s background as a physicist is clear when he considers the message decoder as a physical piece of apparatus, “... cutting a continuous incoming string of signs into groups, and recoding each group separately” [Mandelbrot, 1953]. This “differentiator” complements an “integrator,” which reconstitutes new messages from individual words. Within this model communication can be considered “fully analogous to the perfect gas of thermodynamics.” Minimizing the cost of transmission corresponds to minimization of free energy in thermodynamics.

Mandelbrot was also interested in how the critical parameter varied from one vocabulary to another. Extending the physical analogy of thermodynamic energy, the informational temperature or temperature of discourse is proportional to 1/, which Mandelbrot argues provides a much better measure of the richness of a vocabulary than simply counting the number of words it contains.

The value 1/ can also be used to relate our analysis of Zipf’s law to Mandelbrot’s fractals. If the letters of our alphabet are imagined to be digits of numbers base n + 1 and a leading decimal point is placed before each word, then each word corresponds to a number between 0 and 1.

The construction amounts in effect to cutting out of [0, 1] all the numbers that include the digit 0 otherwise than at the end. One finds that the remainder is a Cantor dust, the fractal dimension of which is precisely 1/. [Mandelbrot, 1982, p. 346]

Mandelbrot proposed a more general form of Zipf’s law:

(3.2)

which has proved important in analysis of the relationship between word frequencies and their rank (cf. Section 5.1).

Mandelbrot also suggested this as a potential model of cognition:

Whatever the detailed structure of the brain it recodes information many times. The public representation through phonemes is immediately replaced by a private one through a string of nerve impulses.... This recorded message presumably uses fewer signs than the incoming one; therefore when a given message reaches a higher level it will have been reduced to a choice between a few possibilities only without the extreme redundancy of the sounds. The last stages are “idea” stages, where not only the public representation has been lost, but also the public elements of information. [Mandelbrot, 1953, pp. 488–9]

Mandelbrot makes other provocative suggestions, for example, that schizophrenics provide the best test of his theory because these individuals impose the fewest “semantic constraints” on the random process (generating language) of interest?!

Although he was unsuccessful at his more ambitious goal of wedding a physical model of communcation to models of semantics such as Saussure’s, Mandelbrot was probably the first to characterize the real truth underlying Zipfian distributions. This derivation is put off until Chapter 5. We conclude here with one more historical perspective, due to Herbert Simon, and a couple of more modern rediscoveries of Zipfian phenomena.

3.2.4 Herbert Simon’s Explanation

Simon considered a very different model, focused on the author’s activity of constructing a text. Simon’s model is based on two assumptions. First, that new words – neologisms – introduced by the author and not previously used are introduced at some constant probability. Second, that the probability of a word having occurred exactly i times is proportional to the total number of occurrences of all words that have appeared exactly i times. These assumptions allow Simon to use the basic mathematics of “pure birth processes” to account for word frequency rankings.*

Simon was interested in models for the “associative” processes underlying authors’ cognition, “... sampling earlier segments of the word sequence” [Simon, 1955, p. 434] as they compose. He acknowledged that authors also use processes of “imitation: sampling segments of word sequences from other works of other authors, and, of course, from sequences he has heard.” Simon imagined this model as potentially applying to three different distributions:

  • the distribution of word frequencies in the whole historical sequence of words that constitute a language;
  • the distribution of word frequencies in a continuous piece of prose; and
  • the distribution of word frequencies in a sample of prose assembled from compositive sources.

He seems most engaged by the second of these alternatives, considering the activity of a particular author. (He uses James Joyce’s Ulysses as an example.)

Obviously this word-based model provides a very different explanation of Zipf’s law from Mandelbrot’s and Miller’s character-based ones. Simon was familiar with such models but argued [Simon, 1955, p. 435] that his own “averaging rather than [Miller’s] maximizing assumptions” are more desirable. But as Miller notes, “The assumption of maximization can be replaced by the assumption of random spacing” [Miller, 1957, p. 313]. Worse, in terms of the empirical bottom line, Simon’s equation does not fit available data as well.

3.2.5 More Recent Zipfian Sightings

The debate concerning these models dates back almost 40 years, but Zipfian distributions and attempts to explain them continue to arise. For example, many have been struck by language-like properties exhibited by the long sequences of genetic codes found in the DNA of all living species. That is, a simple “alphabet” of four nucleic acid base-pairs (BPs) (A, C, G, T in DNA) grouped into three-letter codons that mean one of twenty possible “words” corresponding to amino acids has led many to wonder what we might learn by viewing the genome as a linguistic object [Sereno, 1991].

Mantegna et al. [Mantegna et al., 1994] was led to consider the “word” frequency distributions of such words in the DNA “corpus.” They considered differences in the distributions across coding regions of the genome as well as noncoding regions that are never expressed. Their first result is that this sequence data does indeed contain “linguistic features,” especially in the noncoding regions. By analyzing various genetic corpora (e.g., approximately a million BPs taken from 14 mammalian sequences), they found that, in contrast to what we might expect of completely random sequences, the rank-frequency distribution of six-BP words could be well fit by a (log–log linear) Zipf exponent equal to –0.28. They conclude:

These results are consistent with the possible existence of one (or more than one) structured biological languages present in noncoding DNA sequences. [Mantegna et al., 1994, italic not in original]

Subsequent analysis, however, makes it quite clear that any such interpretations are ill-founded [Bonhoeffer et al., 1996]. Deviations from fully random sequence behavior can be attributed to two simple characteristics of biological sequence data. First, define H(n) to be the entropy of the distribution of n-length nucleotide sequences. Then the redundancy R(n) of length n words is:

(3.3)

so the nonrandom R(1) reflects a simple increase with the variance of the four base-pairs, a well-known biological fact. Further, very short range correlations between nucleic acids (easy to imagine given the basic three-letter genetic code) means that in DNA the most common words are simply combinations of the most probable letters, especially in regions of short repeats. There are still interesting questions (e.g., why coding and noncoding regions differ in their nucleic acid frequencies), but none that suggest any large-scale language-like properties within the DNA sequence.

A final recent example of how Zipf-like distributions arise is offered by analyses of WWW surfing behaviors [Huberman et al., 1998] and makes this same point (but cf. Section 8.1 for more recent, apparently contradictory data generated from massive Alta Vista logs). Consider each page click by a browsing user to be a “character” and the amount of time spent by the same user on a host to be the length of a “word.” Then (surprise!), empirical data capturing the rank-frequency distribution of each WWW surfing “ride” again show a (log–log linear) Zipfian relationship with slope equal to –1.5, as shown in Figure 3.2.

Huberman et al. also propose a model explaining this empirical data. Assume that the “value” (what we might think of as perceived relevance) V(L) of each page in a browsing sequence of length L goes up or down according to identical, independently distributed (iid) Gaussian random variables L:

(3.4)

Using economic reasoning, Huberman et al. then hypothesize:

...an individual will continue to surf, until the expected cost of continuing is perceived to be larger than the discounted expected value of the information to be found in the future .... Even if the value of the current page is negative, it may be worthwhile to proceed, because a collection of high value pages may still be found. If the value is sufficiently negative, however, then it is no longer worth the risk to continue. [Huberman et al., 1998]

If users’ browsing behaviors follow a random walk governed by these considerations, Huberman et al. show that the passage times to this cutoff threshold are given by the inverse Gaussian distribution:

(3.5)

where µ is the mean of the random walk length variable L, µ3/ is its variance, and is a scaling parameter.

3.2.6 Summary

I have provided this historical background because these eminent scientists’ stories remain compelling. The plausibility of the proposed theories, coupled with our retrospective knowledge of their incorrectness, also provides a sobering background as we attempt to infer semantic properties from the statistics arising in FOA. As we shall discuss in Chapter 5 (cf. Section 5.1), the real basis of Zipf’s law can be traced to much simpler mechanisms, relating only to patterns of characters rather than any underlying semantics or purposes. Benoit Mandelbrot, George Miller, and Noam Chomsky have shown that the underlying phenomena relating a word’s frequency to its rank order is obeyed as much by random text – generated by monkeys at typewriters, for example – as by samples of text (the Bible, James Joyce’s Ulysses, etc.) we seem to find more literate.

The fact that the simple, four-character sequence ZIPF should bring together such a rich combination of mathematical and semantic issues is ironic, to say the least. There is obviously a great deal we can predict about our language by assuming nothing more than we would about monkeys at keyboards. At the same time, the fact that we can change the meaning of a simple sequence of characters, for example, the title of this section REMEMBER ZIPF, dramatically by adding a single additional character to form either REMEMBER ZIPF! or REMEMBER ZIPF? should make it clear how much more there still is to say.

3.3 A Statistical Basis for Keyword Meaning

3.3.1 Lexical Consequences, Internal/External Perspectives

The plot in Figure 3.2 is based on word-frequency statistics like those shown in Table 3.2. Note that on the log-log plot in Figure 3.2, frequency is a nearly linear inverse function of rank.

One way to make the various lexical decisions considered in the last chapter is to consider the effects of various decisions in terms of statistics such as these. Table 3.3 shows the statistics for stemmed, nonnoise word tokens (shown in monospaced font, e.g., SYSTEM), together with noise words (shown in italics, e.g., the). As expected, the noise words occur very frequently. But it is interesting to contrast those very frequent words defined a priori in the negative dictionary with those that occur especially frequently in this particular corpus. In many ways these are excellent candidates for external keywords: characterizations of this corpus’s content, from the “external” perspective of general language use. That is, these are exactly the words (cf. NEURAL NETWORK, BASE, LEARN, WORLD, KNOWLEDGE) that could suggest to a WWW browser that the AIT corpus might be worth visiting. Once “inside” the topical domain of AI, however, these same words (cf. SYSTEM, MODEL, PROCESS, DESIGN) become as ineffective as other noise words, as internal keywords, discriminating the contents of one AIT dissertation from the next.

Table 3.3 also shows statistics both with and without stemming. For example, the token SYSTEM itself appeared only 8632 times; variations like SYSTEMS and SYSTEMATIC must account for the other 12,856. This simple example also demonstrates how issues of phrase recognition (cf. NEURAL NETWORK) and other messy issues (e.g., the presence of French noise words in some of the dissertation abstracts but not in our English negative dictionary) can arise in even the simplest, “cleanest” corpora.

3.3.2 Word Occurrence as a Poisson Process

When the words contained in a corpus are ranked and shown to be distributed according to a Zipfian distribution, an obvious but important observation can be made: The most frequently occurring words are not really about anything. Words like NOT, OF, THE, OR, TO, BUT, and BE obviously play an important functional role, as part of the syntactic structure of sentences, but it is hard to imagine users asking for documents about OF or about BUT. Define function words to be those that have only (!) a syntactic function, for example, OF, THE, BUT, and distinguish them from content words, which are descriptive in the sense that we’re interested in them for the indexing task. This is one of the first – but most certainly not the last – examples FOA makes using a priori determinations of a word’s semantic utility based on its statistical properties.

For example, we might hope that function words occur randomly throughout arbitrary text, while content words do not. One ubiquitous model of randomness is as a Poisson process, used in the past to model things like:

  • raisins’ distribution across slices of bread; or
  • misprints’ distribution across printed pages; or
  • the distribution of peoples’ birthdays across days of the year.

In the case of our documents, we’ll start with a slightly simpler Bernoulli model wherein we imagine an author making binary decisions, picking a keyword k with probability pk. Then in a document of length L the probability that a keyword was selected exactly n times in document d is:

(3.6)

In other words, we’d expect it to occur an average of pk · L times in a document of length L.

As L and p 0 (and the mean value p · L 1), the Poisson distribution:

(3.7)

converges to this same distribution. We will generally be interested in a large set of parameters k, each corresponding to a particular keyword k. If we imagine a Bernoulli-like experiment, where individual function words are placed with low probability and observed across the many “experiments” of words occurring in documents, we can expect that a particular word k will occur n times in a randomly selected document according to a Poisson distribution. (Because documents are of different lengths, we must also take care to normalize them all to the same number of experiments.)

As an example of how a Poisson model might be applied to good use, work pioneered by Bookstein and Swanson in the mid-1970s proposed that function words are distributed according to a relatively constant Poisson distribution, while content words are not [Bookstein and Swanson, 1974; Bookstein and Kraft, 1977, Croft and Harper, 1979]. That is, when a keyword is found in a document, it is for one of two possible reasons: Either it just happens (randomly) to be there, or it really means something. Robertson and Walker [Robertson and Walker, 1994] distinguish the latter elite occurrences of a keyword:

We hypothesize that occurrences of a term in a document have a random or stochastic element, which nevertheless reflects a real but hidden distinction between those ... “elite” documents which are about the concept represented by the term and those which are not. We may draw an inference about eliteness from the term frequency, but this inference will of course be probabilistic. Furthermore, relevance (to a query which may of course contain many concepts) is related to eliteness rather than directly to term frequency, which is assumed to depend only on eliteness. [Robertson and Walker, 1994, p. 233, underline not in original]

see exercise 1 Exercise 1


see exercise 2 Exercise 2

In addition to discriminating function from content words, the Poisson model has been used to measure the degree to which a content word is effective as a keyword for a document [Robertson and Walker, 1994]. If we assume that a potential keyword effectively describes some documents in a corpus but occurs at the level of chance throughout the rest of the corpus, the distribution of this keyword across the corpus can be described as the mixture of a Poisson process with some other distribution.

The so-called two-Poisson model models both distributions (i.e., one over the Rel documents that could accurately be characterized as about this keyword and a second over the rest of the documents, which are not) as Poisson, but with distinct means and , with the superscripts 1 and 2 referring to the Rel and distributions, respectively. One advantage of assuming that both distributions are Poisson and that we only need to discriminate between two classes (relevant versus nonrelevant) is that a single-parameter prel Pr(Relevance) controls the probability that the word w is relevant:

(3.8)

This probability can then be used as part of a decision theoretic model related to the costs of indexing too many or too few documents with a keyword w (cf. Section 5.5.6).

3.3.3 Resolving Power

Zipf observed that the frequency of words’ occurrence varies dramatically, and Poisson models explore deviations of these occurrence patterns from purely random processes. We now make the first important move toward a theory of why some words occur more frequently and how such statistics can be exploited when building an index automatically.

Luhn, as far back as 1957, said clearly:

It is hereby proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance. [Luhn, 1957]

That is, if a word occurs frequently, more frequently than we would expect it to occur within a corpus, then it is reflecting emphasis on the part of the author about that topic. But the raw frequency of occurrence in a document is only one of two critical statistics recommending good keywords.

Consider a document taken from our AIT corpus, and imagine using the keyword ARTIFICIAL INTELLIGENCE with it. By construction, virtually every document in the AIT is about ARTIFICIAL INTELLIGENCE!? Assigning the keyword ARTIFICIAL INTELLIGENCE to every document in AIT would be a mistake, not because this document isn’t about ARTIFICIAL INTELLIGENCE, but because this term cannot help us discriminate one subset of our corpus as relevant to any query. If we change our search task to looking not only in our AIT corpus but through a much larger collection (for example, all computer industry newsletters), then associating ARTIFICIAL INTELLIGENCE with those articles in our AIT subcorpus becomes a good idea. This term helps to distinguish AI documents from others.

The second critical characteristic of good indices now becomes clear: A good index term not only characterizes a document absolutely, as a feature of a document in isolation, but also allows us to discriminate it relative to other documents in the corpus. Hence keywords are not strictly properties of any single document, but they reflect a relationship between an individual document and the collection from which it might be selected.

These two countervailing considerations suggest that the best keywords will not be the most ubiquitous, frequently occurring terms, nor those that occur only once or twice, but rather those occurring a moderate number of times. Using Zipf’s rank ordering of words as a baseline, Luhn hypothesized a modal function of a word’s rank he called resolving power, centered exactly at the middle of this rank ordering. If resolving power is defined as a word’s ability to discriminate, content, Luhn assumed that this quantity is maximal at the middle and falls off at either very high or very low frequency extremes, as shown in Figure 3.3.* The next step is then to establish maximal and minimal occurrence thresholds defining useful, midfrequency index terms. Unfortunately, Luhn’s view does not provide theoretical grounds for selecting these bounds, so we are reduced to the engineering task of tuning them for optimal performance.

We’ll begin with the maximal-frequency threshold, which is used to exclude words that occur too frequently. For any particular corpus, it is interesting to contrast this set of most-common words with the negative dictionary of noise words, defined in Section 2.3.2. While there is often great overlap, the negative dictionary list has proven itself to be useful across many different corpora, while the most frequent tokens in a particular corpus may be quite specific to it.

see exercise 3 Exercise 3

Establishing the low-frequency threshold is less intuitive. Assuming that our index is to be of limited size, including a certain keyword means we must exclude some other. This suggests that a word that occurs in exactly one document can’t possibly be used to help discriminate that document from others regularly. For example, imagine a word – suppose it is DERIVATIVE – that occurs exactly once, in a single document. If we took out that word DERIVATIVE and put in any other word, for example, FOOBAR, in terms of the word frequency co-occurrence statistics that are the basis of all our indexing techniques, the relationship between that document and all the other documents in the collection will remain unchanged. In terms of overlap between what the word DERIVATIVE means, in the FOA sense of what this and other documents are about, a single word occurrence has no meaning!

The most useful words will be those that are not used so often as to be roughly common to all of the documents, and not so rarely as to be (nearly) unique to any one (or a small set of) document. We seek those keywords whose combinatorial properties, when used in concert with one another as part of queries, help to compare and contrast topical areas of interest against one another.

3.3.4 Language Distribution

We next move beyond characteristics of single keywords to an analysis of the distribution of the entire set of index terms. Any index, whether constructed manually or automatically based on word frequency patterns, is defined by a tension between exhaustivity on the one hand and specificity on the other. An index is exhaustive if it includes many topics. It is specific if users can precisely identify their information needs.

Unfortunately, these two intuitively reasonable desiderata are in some sense at odds with one another, as suggested by Figure 3.4. The best explanation of this trade-off is in terms of precision and recall (cf. Section 4.3.4): High recall is easiest when an index is exhaustive but is not very specific; high precision is best accomplished when the index is not very exhaustive but is highly specific. If we assume that the same index must serve many users, each with varying expectations regarding the precision and recall of their retrieval, the best index will be at some balance point between these goals.

If we index a document with many keywords, it will be retrieved more often; hence we can expect higher recall, but precision may suffer. Van Rijsbergen has talked about this extreme as a “document” orientation, or representation bias [van Rijsbergen, pp. 24, 29]. A document-oriented approach to index-building focuses the system builder’s attention on a careful representation of each document, based on an analysis of what it is about.

However, an index’s fundamental purpose is to reconcile a corpus’s many document descriptions with the many anticipated users’ queries. We could equally well analyze the problem from a query-oriented perspective – How well do the query terms discriminate one document from another?

From the users’ perspective, we’d like to have these queries match meaningfully onto the vocabulary of our index. From the perspective of the corpus, we’d like to be able to discriminate one document from another. These are very different perspectives on an index, and they reflect a fundamental vocabulary mismatch [Furnas et al., 1987] between the way users describe their interests and the way documents have been described.

If an indexing vocabulary is specific, then a user should expect that just the right keyword in a magic bullet query will elicit all and only relevant documents. The average number of documents assigned to specific keywords should be low. In an exhaustive indexing, the many aspects of a document will each be reflected by expressive keywords; on average many keywords will be assigned to a document:

(3.9)

The important observation is that these two averages must be taken across different distributions. We already know from Zipf’s law that the number of occurrences varies dramatically from one keyword to another. Once we make an assumption about how keywords occur within separate documents, we can derive the distribution of keywords across documents. However, the distribution of keywords assigned to documents can be expected to be much more uniform; documents are about a nearly uniform or constant number of topics. Figure 3.5 represents the index as a graph, where edges connect keyword nodes on the left with document nodes on the right.The Index graph is a bipartite graph, with its nodes divided into two subsets (keywords and documents) and nodes in one set having connections only with those in the other. If we assume that the total number of edges must remain constant, we can assume that the total area under both distributions is the same. The quantity capturing the exhaustivity/specificity trade-off is therefore the ratio of Vocab to corpus size NDoc.

Although this analysis is crude, it does highlight two important features of every index. First, in most applications NDoc is fixed and Vocab is a matter of discretion, a free variable that can be tuned to increase or decrease specificity and exhaustivity. Second, certainly in most modern applications (i.e., with the huge disk volumes now common), NDoc >> Vocab. This is one of the most important ways in which experimental collections (including AIT) differ from real corpora. A useful indexing vocabulary can be expected to be of a relatively constant size, Vocab 103 to 105, while corpora sizes are likely to vary dramatically, NDoc 104 to 109.

Along similar lines, it is always useful to think about what this means in the context of the WWW, where the notion of a closed corpus disappears. The WWW is an organic, constantly growing set of documents; our vocabulary for describing it is more constrained.

Several other basic features of an index are shown in Figure 3.5. The flipped histogram along the left side is meant to reflect the Zipfian distribution of keywords, with the most frequent keywords beginning at the top. Recall that this distribution captures the total number of word occurrences, regardless of how these occurrences are distributed across interdocument boundaries. A second distribution is also sketched, suggesting how the number of documents versus word occurrences might be distributed; we can expect these two quantities to be at least loosely correlated. The distinction between intra- and interdocument word frequencies is a topic we’ll return to in Section 3.3.7.

3.3.5 Weighting the Index Relation

The simplest notion of an index is binary – either a keyword is associated with a document or it is not. But it is natural to imagine degrees of about-ness. We will capture this strength of association with a single real number, a weight, capturing the strength of the relationship between keyword and document. This weight can be used in two different ways. The first is to reduce the number of links to only the most significant relationships, those with the highest weights. In this respect a weighted indexing system is a more general formulation than a binary formulation; we can always go to a binary relation from the weighted one. This might make weights useful even if our retrieval method is Boolean (as it often was in early information retrieval (IR) systems). But a second and today more common reason for using a weighted indexing relation is that the retrieval method can exploit these weights directly.

One way to describe what this number means is probabilistic: We seek a measure of a document’s relevance, conditionalized on the belief that a keyword is relevant:

(3.10)

Note that this is a directed relation; we may or may not believe that the symmetric relation

(3.11)

should be the same. Unless otherwise specified, when we refer to a weight w we will intend it to mean wkd.

In order to compute statistical estimates for such probabilities we define several important quantities:

(3.12)

We will make two demands on the weight reflecting the degree to which a document is about a particular keyword or topic. The first one goes back to Luhn’s central observation [Luhn, 1961]: Repetition is an indication of emphasis. If an author uses a word frequently, it is because he or she thinks it’s important. Define fkd to be the number of occurrences of keyword k in a document d.

Our second concern is that a keyword be a useful discriminator within the context of the corpus. Capturing this notion of corpus-context statistically proves much more difficult; for now, we simply give it the name discrimk.

Because we care about both, we will devise our weight to be the product of the two factors, corresponding to their conjunction:

(3.13)

We will now consider several different index weighting schemes that have been suggested over the years. These all share the same reliance on fkd as a measure of keyword importance within the document and the same product form as Equation 3.13. What they do not share is how best to quantify the discrimination power discrimk of the keyword.

3.3.6 Informative Signals versus Noise Words

We begin with a weighting algorithm derived from information theory. Information theory has proven itself to be an extraordinarily useful model of many different situations in which some message must be communicated across a noisy channel and our goal is to devise an encoding for messages that is most robust in the face of this noise.

In our case, we must imagine that the “messages” describe the content of documents in our corpus. On this account, the amount of information we get about this content from a word is inversely proportional to its probability of occurrence. In other words, the least informative word in our corpus is the one that occurs approximately uniformly across the corpus. For example, the word THE occurs at about the same frequency across every document in the collection; its probability of occurrence in any one document is almost uniform. We gain the least information about the document’s contents from observing it.What is “information”?

Salton and McGill [Salton and McGill, 1983], following Dennis [Dennis, 1967], use Shannon’s classic binary logarithm to measure the amount of information conveyed by each word’s occurrence in bits and noise to be the absence of information:

(3.14)
(3.15)
(3.16)

Note that our evidence about the probability of a keyword occurring comes from statistics of how frequently it occurs. We must compare how frequently a keyword occurs in a particular document, relative to how frequently it occurs throughout the entire collection. We can calculate the expected noise associated with a keyword across the corpus, and from this we can infer its remaining signal. Signal then becomes another measure we can use to weight the frequency of occurrence of the keyword document:

(3.17)
(3.18)
(3.19)

Two hypothetical distributions, for a noise word and a useful index term, are shown in Figure 3.6. A noise word is equally likely to occur anywhere; its distribution is nearly uniform. On the other hand, if all of the occurrences of a keyword are localized in a few documents (conveniently clustered together in the cartoon of Figure 3.6) and mostly zero everywhere else, this is an informative word. You’ve learned something about the document’s content when you see it.

3.3.7 Inverse Document Frequency

Up to this point, we’ve been concerned only with the total number of times a word occurs across the entire corpus. Karen Sparck Jones has observed that, from a discrimination point of view, what we’d really like to know is the number of documents containing a keyword. This thinking underlies the inverse document frequency (IDF) weighting:

The basis for IDF weighting is the observation that people tend to express their information needs using rather broadly defined, frequently occurring terms, whereas it is the more specific, i.e., low-frequency terms that are likely to be of particular importance in identifying relevant material. This is because the number of documents relevant to a query is generally small, and thus any frequently occurring terms must necssarily occur in many irrelevant documents; infrequently occurring terms have a greater probability of occurring in relevant documents – and should thus be considered as being of greater potential when searching a database. [Sparck Jones and Willett, 1997, p. 307]

Rather than looking at the raw occurrence frequencies, we will aggregate occurrences within any document and consider only the number of documents in which a keyword occurs. IDF proposes, again using a “statistical interpretation of term specificity” [Sparck Jones, 1972], that the value of a keyword varies inversely with the log of the number of documents in which it occurs:

(3.20)

where Dk is as defined in Equation 3.12.

The formula in Equation 3.20 is still not fully specified in that the count Dk must be normalized with respect to a constant Norm. We could normalize with respect to the total number of documents in the corpus [Sparck Jones, 1972; Croft and Harper, 1979]; another possibility is to normalize against the maximum document frequency (i.e., the most documents any keyword appears in) [Sparck Jones 1979a; Sparck Jones, 1979b]:

(3.21)

Today the most common form of IDF weighting is that used by Robertson and Sparck Jones [Robertson and Sparck Jones, 1976], which normalizes with respect to the number of documents not containing a keyword (NDoc – Dk) and adds a constant of 0.5 to both numerator and denominator to moderate extreme values:

(3.22)

3.4 Vector Space

One of life’s most satisfying pleasures is going to a good library and browsing in an area of interest. After negotiating the library’s organization and finding which floor and shelves are associated with the call numbers of your topic, you are physically surrounded by books and books, all of interest to you. Some are reassuring old friends, already known to you; others are new books by familiar authors, and (best of all!) some are brand-new titles by unknowns.

This system works because human catalogers have proven themselves able to reliably and consistently identify the (primary!) topic of a book according to conventional systems of subject headings like the Library of Congress Subject Headings or the Dewey Decimal system.

Our goal is to abstract away from this very friendly notion of physical space in the library to a similar but generalized notion of semantic space in which documents about the same topic remain close together. But rather than allowing ourselves to be restricted by the physical realities of three-dimensional space and the fact that books can only be shelved in a single place in a library, we will consider abstract spaces of thousands of dimensions.Virtual spaces

We can make concrete progress toward these lofty goals beginning with the Index matrix relating each document in a corpus to all of its keywords. A very natural and influential interpretation of this matrix (due to Gerry Salton [Salton et al., 1975; Salton and McGill, 1983]) is to imagine each and every keyword of the vocabulary as a separate dimension of a vector space. In other words, the dimensionality of the vector space is the size of our vocabulary. Each document can be represented as a vector within such a space. Figure 3.7 shows a very simplified (binary) Index matrix, and a cartoon of its corresponding vector representation.

Estimates of the vocabulary size of a native speaker of a language approach 50,000 words; if you are articulate, your speaking and reading vocabularies might be 100,000 or more words. Assuming that we have a modest 106 document corpus, this matrix is something like 106 105. That’s a big matrix, even by modern supercomputing standards.Sparse vector spaces

In addition to the vectors representing all documents, another vector corresponds to a query. Because documents and queries exist within a common vector space, we naturally characterize how we’d like our retrieval system to work – just as we go to a physical location in the library to be near books about a topic, we seek those documents that are close to our query vector. This is a useful characterization of what we’d like our retrieval system to accomplish, but it is still far from a specification of an algorithm for accomplishing it. For example, it seems to require that the query vector be compared against each and every document, something we hope to avoid.Implementation hack

An even more important issue to be resolved before the vector space model can be useful is being specific about just what it means for a document and query to be close to one another. As will be discussed in Section 5.2.2, there are many plausible measures of proximity within a vector space. For the time being, we will assume the use of the inner product of query and document vectors as our metric:

(3.23)

People have difficulty imagining spaces with more than the three physical dimensions of experience, so it is no wonder that abstract spaces of 105 dimensions are difficult to conceptualize. Sketches like Figure 3.7 do the best they can to convey ideas in the three dimensions we appreciate, but it is critically important that we not let intuitions based on such small-dimensional experiences bias our understanding of the large-dimensional spaces actually being represented and searched.

3.4.1 Keyword Discrimination

We can immediately use this vector space for something useful, as the source of yet another approach to the question of appropriate keyword weightings. Recall that in Figure 3.7 our initial assumption was that each and every keyword was to be used as a dimension of the vector space. Now we ask: What would happen if we removed one of these keywords?

The first step is to extend the measure Sim(q, d) of document-query similarity to measure interdocument similarities Sim(di, dj) as well. Then, for an arbitrary measure of document-document similarity (e.g., the inner product measure mentioned earlier), we consider all pairs of documents and then the average similarity across all of them.There’s a quicker way to compute average similarity

(3.24)
(3.25)
(3.26)

Recall that our goal is to devise a representation of documents that makes it easy for queries to discriminate among them. Because each keyword corresponds to a dimension, removing one results in a compression of the space into K – 1 dimensions, and we can expect that the representation of each document will change at least slightly. For example, removing a dimension along which the documents varied significantly means that vectors that were far apart in the K-dimensional space are now much closer together.

This observation can be used to ask how useful each potential keyword is. If it is discriminating, removing it will result in a significant compression of the documents’ vectors; if removing it changes very little, the keyword is less helpful. Using the average similarity as our measure of how close together the documents are, and asking this question for each and every keyword, we arrive at yet another measure of keyword discrimination:

(3.27)
(3.28)
(3.29)

3.4.2 Vector Length Normalization

One good example involves the length of document and query vectors. So far, we have placed no constraint on the number of keywords associated with a document. This means that long documents, which, caeteris paribus, can be expected to give rise to more keyword indices, can be expected to match (more precisely, have nonzero inner product with) more queries and be retrieved more often. Somehow (as discussed in Section 1.4) this doesn’t seem fair: The author of a very short document who worked hard to compress the meaning into a pithy few paragraphs is less likely to have his or her document retrieved, relative to a wordy writer who says everything six times in six different ways!

These possibilities have been captured by Robertson and Walker in a pair of hypotheses regarding a document’s scope versus its verbosity:

Some documents may simply cover more material than others ... (the “Scope hypothesis”). An opposite view would have long documents like short documents but longer: in other words, a long document covers a similar scope to a short document, but simply uses more words (the “Verbosity hypothesis”).[Robertson and Walker, 1994, p. 235]

Once we have decided that about-ness, is conserved across documents, all documents’ vectors will have constant length. If we make the same assumption about the query vector, then all of the vectors will lie on the surface of a sphere, as shown in Figure 3.8. Without loss of generality, we will assume that the radius of the sphere is unity.

Making Weights Sensitive to Document Length

Unfortunately, this very simple normalization is often inadequate, as can be shown in terms of the inverse document frequency (IDF) weights discussed in Section 3.3.7. IDF weighting highlights the distinction between inter- and intradocument keyword occurrences. Because its primary focus is on discrimination among documents, intradocument occurrences of the same keyword become insignificant.This makes IDF very sensitive to the definition of how document boundaries are defined (cf. Section 2.2), as suggested by Figure 3.9.

The IDF weight that results from encapsulating more text within the same “document” is, in a sense, the converse of normalizing the number of keywords assigned to every document. In either case, the advantage of using the paragraph as our canonical document (cf. Section 1.4), and/or relying on all documents in the corpus to be of nearly uniform size (as in the AIT dissertation abstracts) is apparent.

The OKAPI retrieval system of Robertson et al. [Robertson and Walker, 1994] has proven itself successful (in retrieval competitions like TREC; cf. Section 4.3.3) by combining IDF weightings with corpus-specific sensitivities to the lengths of the documents retrieved. They propose that the average length of all documents in a corpus, , provides a “natural” reference point against which other documents’ lengths can be compared.

Define Len(d) to be the number of keywords associated with the document. OKAPI then normalizes the first component of our weighting formula, keyword frequency, by a term that is sensitive to each document’s deviation from this corpuswide average:

(3.30)

Robertson and Walker report that k 1.0 – 2.0 |Q| seems to work best, where |Q| is the number of query terms.

Singhal et al. [Singhal et al., 1996] approach the problem of length normalization by doing a post hoc analysis of the distributions of retrieved versus relevant documents (in the TREC corpus) as a function of their length. A sketch of typical curves is shown in Figure 3.10.A. The fact that these two distributions cross suggests a corpus-specific length normalization pivot value, p, below which match scores are reduced and above which they are increased. The amount of this linear increase or decrease, shown as the length normalization slope m of the length normalization function in Figure 3.10.B, is the second corpus-specific parameter of Singhal et al.’s model. Returning to the “generic” form of the weighting function originally given in Equation 3.13, the pivot-based length normalization is:

(3.31)

where norm is whatever other normalization factor (e.g., cosine) is already in use; several possible values are given in the next section.

Both OKAPI and pivot-based document length normalizations rely on the specification of additional corpus-specific parameters (k1 and p, m, respectively). Although the addition of yet more “knobs to twiddle” is generally to be avoided in a retrieval system, recent experience with machine learning techniques suggests the possibility of training such parameters to best match each corpus. This approach is sometimes called a regression technique and is discussed more fully in Chapter 7.

3.4.3 Summary: SMART Weighting Specification

Although the variety of potential keyword weighting schemes (signal/noise, IDF, keyword discrimination, etc.) may seem large, there is in fact a systematicity to this variation.

SMART is an extremely influential and widely used software system for investigating IR techniques [Salton, 1971; Buckley, 1985]. One secret of its success is that SMART provides a simple parameter-driven mechanism for easily changing from one form of index weighting to another.

In SMART, the weight is decomposed into three factors:

(3.32)

Each of these components can then be specified independently:

(3.33)
(3.34)
(3.35)

3.5 Matching Queries against Documents

In Chapter 2 we first identified documents, then lexical features to be associated with each. Then we built an inverted keyword list to make going from keywords to documents about those keywords as easy as possible. Now we’ll become specific about how we measure the similarity between a document and a query.

The discussion of matching queries and documents is simplified if we adopt the vector space perspective of Section 3.4 and imagine both the query q and all documents d to be vectors in a space of dimensionality equal to NKw, the keyword vocabulary size.

In this space, the answer to the question of which documents are the best match for a query seems straightforward – those documents that are most similar, relative to some particular metric Sim measuring distance between points in the space. Students of algebra and abstract vector spaces know that many different choices are possible; see Section 5.2.2.

3.5.1 Measures of Association

Given a vocabulary of lexical indexing features, our central concern will be how to moderate the raw frequency counts of these features in documents (and, to a lesser extent, queries) based on distributional characteristics of that feature across the corpus. We will be led to use real-valued weights to capture these relations, but we begin with cruder methods: simply counting shared keywords.

The most direct way to say that a query and a document are similar is to measure the intersection of their respective feature sets. Let Kq be the set of keyword features used in a query and Kd be the analogous set found in a document. The coordination level is exactly how many terms in the query overlap with terms in the document:

(3.36)

If we have many, many features and our query and documents are highly variable, then the presence of any significant overlap may be enough to identify the set of documents of interest. On the other hand, if there is a great deal of overlap between our query and many of the documents, then simply counting how big this intersection is will look like a gross measure. This fine line between one or two documents matching a query and an avalanche of thousands occurs regularly as part of FOA.

see exercise 4 Exercise 4

One part of the problem is normalization – the coordination level of intersecting features shared by both query and document seems a good measure of their similarity, but compared to what? It’s easy to show that this normalization matters. Consider the case where CoordLevel(q, d) = 1, and imagine first that Kq = Kd = 1 also; i.e., a single-word query and a one-word document. In this case the two match on the one feature they each have. Intuitively, this is a perfect match; you couldn’t do any better. But now imagine that the query includes 10 keywords, the document contains 1000 words, but still CoordLevel(q, d) = 1. The same intuition suggests that this should be judged a much poorer match, but our measure does not reveal this. Unless we normalize by something that reflects how many other features we might have matched, we can’t have a useful measure of association.

One natural normalizer is to take the average of the number of terms in the query and in the document and compare the size of the intersection to it. This gives us the Dice coefficient:

(3.37)

The average number of features may or may not be appropriate, again depending on what a typical query and typical document are. Often a document has many, many keywords associated with it, and queries have only one or two. This average is not a very good characterization of either.

Another perspective on similarity says that missing features are as significant as shared ones. The simple matching coefficient gives equal weight to those features included in both query and document and those excluded from both and normalizes by the total number of keyword features NKw:

(3.38)

3.5.2 Cosine Similarity

We will focus here on the cosine measure of similarity. Not only does this respect the useful mathematical properties mentioned at the beginning of this section, but it is consistent with a geometric interpretation of the vector space that many find insightful:

(3.39)

3.6 Calculating TF-IDF Weighting

Following the careful empirical investigation of Salton and Buckley [Salton and Buckley, 1988d; Salton and Buckley, 1990] and many others since [Harman, 1992a], we will concentrate on the TF-IDF weighting, which multiplies the raw term frequency (TF) of a term in a document by the term’s inverse document frequency (IDF) weight:

(3.40)
(3.41)

where fkd is the frequency with which keyword k occurs in document d, NDoc is the total number of documents in the corpus, and Dk is the number of documents containing keyword k.

Due to the wide variety observed in users’ query patterns, methods for weighting queries vary more, primarily depending on the length of the query. We will consider two weighting methods, especially designed for “short” and “long” queries. Short queries (of as few as one or two terms; cf. [Silverstein et al., 1999]) seem typical of those issued by Web search engine users. For these, we can assume multiple occurrences of the same keyword will be rare, and we ignore length normalization. This leaves us with simply the term’s inverse frequency weight:

(3.42)

Long queries are often generated indirectly, as the result of RelFbk from the users in response to prior retrievals. The long query corresponds to a particular document that the users like; searching for others like a known target is called query-by-example. By symmetry, it makes sense to use the same weighting of terms in this query-cum-document as we used for documents (Equation 3.40):

(3.43)

Notice that once the lengths of the q and d vectors in the denominator of Equation 3.39 are known, the computation of Sim requires a simple sum of products over all terms shared by query and document. Because both these lengths can be computed independently, it makes sense to compute them first.Implementation: Storing document lengths

In fact, the document length Len(d) can be computed before any query activity takes place:

(3.44)

With these definitions in place, we can begin to design an algorithm for computing similarity.

3.7 Computing Partial Match Scores

With length normalization to the side, we can concentrate on the main calculation of matching, summing the weight products of terms shared by query q and document d:

(3.45)

This mathematical characterization hides a number of messy details associated with actually computing it. We will need to make efficient use of two data structures in particular. The first is the inverted index (recall Figure 3.5). The critical feature of this organization is that we can, for each term in the query, find the set of all document postings associated with it. In particular, the freq statistic maintained with each posting allows us to compute the weight wkd we need for each. Even with these efficiencies, however, the need to consider every document posting for every query term can create a serious processing demand, and one that must be satisfied immediately – the users are waiting!

Because the elements of the sum can be reordered arbitrarily, we will consider a partial ranking (a.k.a. filtering, or pruning) algorithm that attempts to include the dominant elements of the sum but ignore small, inconsequential ones [Salton, 1989; Buckley and Lewit, 1985; Harman and Candela, 1990].

The fact that wkq and wkd vary considerably suggests that, by ordering the inclusion of products into the sum, we may be able to truncate this process when they become smaller than we care to consider.

We therefore begin by sorting the terms in decreasing order of query weights wkq. Considering terms in this order means we can expect to accumulate the match score beginning with its largest terms. Then the fact that our list of postings was similarly (and not coincidentally) ordered by decreasing frequency means that:

(3.46)

Once these weights diminish below some threshold add, we can stop going down the postings list. (In fact, it may be that the weight associated with the very first posting is too small and we can ignore all weights associated with this term.)

The second important data structure is an accumulator queue in which each document’s match score is maintained. Because each query term may add an additional term to the match score for a particular document, these accumulators will keep a running total for each document. For moderate corpus sizes, it may not be unreasonable to allocate an accumulator for each document, but this can demand too much memory for very large corpora. Define NAccum to be the number of accumulators we are willing to allocate. Then one obvious way to restrict this set is to only allocate an accumulator when a document’s score becomes significant, again in comparison to some threshold insert. Because we will be processing query terms in decreasing wkq order and heuristically value the space associated with new accumulators more than the slightly longer time to run down posting lists a bit further, we can assume that insert > add.

Picking appropriate values for these two thresholds is something of a black art, but Persin [Persin, 1994] reports one especially careful experiment in which both are made proportional to the most highly matched document’s accumulator A* (i.e., A* is the maximum match score in any document’s accumulator):

(3.47)
(3.48)

Persin’s experiments suggest that values of insert = 0.07, add = 0.001 give retrieval effectiveness near that of full matches (i.e., considering all query-document term products) while minimizing NAccum.Partial matching isn't just more efficient; it works better too!

These two thresholds divide the range of possible query-document term products into three conditions:

wkd · wkq > insertAlways add; create new accumulator Ad if necessary
add < wkd · wkq < insertAdd only if accumulator Ad already exists
wkd · wkq < add   Ignore; move on to the next query term

We want to remain flexible with respect to both long and short queries, so we will assume that the query weights wkq are precomputed and passed to our ranking procedure.* Using our definition for wkd and focusing first on the insert threshold:

(3.49)
Algorithm 3.1 Partial Ranking
prank(qry [], A) {
// qry = vector of < keyword, weight > pairs
// A = queue of < docid, score > accumulators
// initially empty and returned with most highly ranked
     Sort(qry, & descending WgtCmp);
     for (q query) {
          insert = (insert · A*)/(q.wgt · idfq);
          add = (add · A*)/(q.wgt · idfq);
         for (fp fpost (q)) {
               if (fp.freq <= add)
                    break;
               newscore = fp·freq·idfq · q.wgt;
               for (dp dpost (fp)) {
                    fnd = hashfind(docTbl, docno);
                    if(fnd OR dwgt > insert) {
                         if (fnd)
                              fnd.score + = newscore;
                         else {
                              hashadd(docTbl, docno);
                              if (length(A) > N Accum)
                                   pop(A);
                              insertQ(newscore, docno, A);
                    }
                    A* = max(A*,fnd.score,newscore);
               }
          }}// eo-posting loops
     }// eo-qry loop
}

This finally becomes an operational question we can apply with respect to each posting’s frequency fkd. Note that this threshold must be updated every time we move to a new term of the query. Of course, the computation of add proceeds similarly.

All the basic features of a partial ranking algorithm are now in place, and a pseudo-code sketch is shown in Algorithm 3.1. It also includes a few minor complications. First, a hashtable is required to find accumulators associated with a particular docno. Second, the set of accumulators Ad is described as a queue, but it must be slightly trickier than most: It must maintain them in order so that only the top NAccum are maintained, and it must support length() queries and a pop() function when it is full. Nondeterministic skip lists [Pugh, 1990] are recommended for this purpose [Cutting and Pedersen, 1997].

see exercise 5 Exercise 5


see exercise 6 Exercise 6


see exercise 7 Exercise 7


see exercise 8 Exercise 8


see exercise 9 Exercise 9

3.8 Summary

We’ve covered enormous ground in this chapter. We began by wondering just what it might mean to try to understand communicative acts, like the publication of a document or the posing of a question or query. We looked in some detail at one of the most fundamental characteristics of texts, Zipf’s law, and found it to be, in fact, quite meaningless! But the next level of features, tokens like those produced by the machinery of Chapter 2, supported a rich analysis of the index, as a balance between the vocabularies of the documents’ authors and searchers’ subsequent queries. The vector space provides a concrete model of potential keyword vocabularies and what it might mean to match within this space. Finally, we considered an efficient implementation of nearly complete matching. In the next chapter we will consider the problem of evaluating how well all these techniques work in practice.

However, there are gaps in this story that are so obvious we don’t even need to measure them. Some of the gaps involve implementation issues that can be critical, especially when faced with very large corpora [Harper and Walker, 1992]. Parallel implementation techniques, for example, those pioneered on “massively parallel” SIMD (single instruction/multiple data) Connection Machines, become important in this respect [Salton and Buckley, 1988c; Stanfill and Kahle, 1986a; Stone, 1987]. In the modern age of multiple search engines each indexing (only partially overlapping versions [Lawrence and Giles, 1998]) of the WWW techniques for fusing multiple hitlists into a single list for the same user suggests another level of parallelism in the FOA task [Voorhees et al., 1995].

However, linguists, in particular, must have more serious, implementation-independent concerns. Imagine that you are someone who has studied the range of human languages and who appreciates both their wide variety and equally remarkable commonalities. You would be appalled at the violence we have done to the syntactic structure of language. For linguists, Finding Out About documents by counting their words must seem like trying to understand Beijing by listening to a single microphone poised high over the city: You can pick up on a few basic trends (like when most people are awake), but most of the texture is missing!

DOG BITES MAN and MAN BITES DOG clearly mean two different things. Word order obviously conveys meaning beyond that brought by the three words. And the problem doesn’t end with word order. Look how different the meaning of these phrases are:

  • NEUTRALIZATION OF THE PRESENT
  • REPRESENTING NEUTRONS
  • REPRESENTATIONS, NOT NEUTRONS

despite the fact that all of them (conceivably) reduce to the same set of indexable tokens! Note especially how critical the same "noise" words thrown away on statistical grounds (in Chapter 2) are when analyzing a sentence’s syntactic structure.

The attempt to understand the phenomena of meaning by looking for patterns in word frequency statistics alone is reminiscent of the tea leaves and entrails of this chapter’s opening quote. Still, the success of many WWW search engines that use very little beyond this kind of gross analysis suggests that there is much more information in the statistics than traditional, syntactically focused linguists might have believed.


TERMS INTRODUCED IN THIS CHAPTER

accumulator (98)
average similarity (88)
base-pairs (68)
Bernoulli (74)
call numbers (86)
codons (68)
content words (73)
coordination level (94)
cosine (95)
dimensionality (86)
diversification (65)
economy of language (65)
elite (75)
exhaustivity (78)
external keywords (72)
function words (73)
fusing (101)
inner product (87)
internal keywords (72)
inverse document frequency (84)
informational temperature (66)
length normalization pivot (92)
length normalization slope (92)
magic bullet query (79)
matrix (79)
neologisms (67)
noise (83)
normalization (94)
partial ranking (98)
poisson process (74)
principle of least effort (65)
query-by-example (97)
query-oriented (79)
referent (65)
regression (92)
representation bias (78)
resolving power (77)
scope (89)
semantic (86)
signal (83)
simple matching coefficient (95)
specificity (78)
stratified sampling (62)
temperature of discourse (66)
term frequency (96)
TF-IDF (96)
two-Poisson model (75)
vector space (86)
verbosity (89)
vocabulary balance (65)
vocabulary mismatch (79)
unification
weight (81)
WWW surfing (69)
Back to TOC