Testable hypothesis. As discussed in Section 6.1, this hypothesis is now testable in terms of the science of bibliometrics.
Corpus-based linguistics. The field of corpus-based linguistics, led by people like Ken Church and Eugene Charniak [Church and Hanks, 1989; Charniak, 1993], is beginning to change all of that. The recent textbook by Manning and Schütze [Manning and Schütze, 1999] provides an excellent introduction to this methodology. And for a long time Karen Sparck Jones has been exemplary in straddling these two approaches. Her work [Robertson and Sparck Jones, 1976; Sparck Jones, 1972; Sparck Jones, 1973; Sparck Jones and van Rijsbergen, 1976; Sparck Jones, 1979a; Sparck Jones, 1979b; Sparck Jones, 1986; Sparck Jones et al., 1996; Sparck Jones and Willett, 1997] has consistently sparked from one side of Snow’s gulf to the other, making fundamental contributions to each.
Marginal notes. This is just an example; nothing really to add here.
Meta-cognition about ignorance. “Meta-cognition” is a term often used to refer to thoughts we have about our thinking. One of the most important tenets of cognitive science is to remain very skeptical of our own “introspections” of how it seems to ourselves we are thinking, and more concrete sources of empirical data concerning meta-cognition are also hard to identify. This is especially true of ignorance: knowing what it is that you don’t know. Any use of the term “awareness” in this context, therefore, should be interpreted loosely.
What FOA data can we observe? An important methodological point is that we do not (and, at least until brain imaging techniques advance even further than they have already, will not) have access to the asker’s internal cognitive state, and therefore we cannot rely on such evidence to build a scientific theory of FOA. Rather, we must content ourselves with observing the asker’s overt actions – their query and relevance feedback, generated in an effort to satisfy this information need.
Other places to FOA IR. Information Processing & Management, the ACM’s Transactions on Information Systems, and the Journal of the American Society for Information Science (JASIS) are some of the central journals; meetings of the American Society for Information Science, the ACM’s Special Interest Group in Information Retrieval (SIGIR), and the Symposium on Document Analysis and Information Retrieval (DAIR) are the most important meetings, producing consistently valuable proceedings.
“Typical” users have changed. This gap, between skilled users of query languages and more general users, is related to the historical progression of IR technology. Many of the first systems were designed to be used by search intermediaries, typically library scientists. These skilled searchers would talk to, interpret the information need of, and translate for “end users.” More and more, however, the goal is to put the search tools directly into the end users’ hands.
Transitivity. A relation R(.,.) defined over pairs of objects is transitive if R(a,b) and R(b,c) together imply R(a,c). For example, the < (less than) relation is transitive, because a<b and b<c implies a<c.
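The definition can be checked mechanically over a small finite domain. The following is a minimal sketch, not part of the original text; the helper is_transitive and the contrasting relation differs_by_one are names introduced here purely for illustration.

```python
from itertools import product

def is_transitive(relation, domain):
    """Return True if R(a,b) and R(b,c) always imply R(a,c) over the domain."""
    for a, b, c in product(domain, repeat=3):
        if relation(a, b) and relation(b, c) and not relation(a, c):
            return False
    return True

domain = range(5)

def less_than(a, b):
    return a < b

def differs_by_one(a, b):
    # Hypothetical non-transitive relation, included only for contrast:
    # 0~1 and 1~2 hold, but 0~2 does not.
    return abs(a - b) == 1

print(is_transitive(less_than, domain))
print(is_transitive(differs_by_one, domain))
```

An exhaustive check like this only makes sense for small domains, since it examines every triple.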
Card catalogs were the first search engines. In fact, applications like the University of California’s MELVYL card catalog system were some of the very first, large-scale text search systems.
A misleading title, or did the document teach you something?! Of course, it is also possible that while a quick scan of a title might make a document seem relevant, users might decide after reading its full text that it is not. I will argue later that this test is too hard (like deciding whether you want to buy a bottle of wine after you’ve drunk it!), but here we will simply note that how well correlated relevance feedback for proxies is with that for full documents is an interesting, empirically testable question.
Signature of human culture?! Takeo Kanade of the Carnegie Mellon Vision Laboratory asserts that a very simple predicate can be used to distinguish purely natural scenes from those containing human artifacts: Natural scenes never contain more than a single horizontal line!
MONICA the meme. This phenomenal news event, and the enormous amount of electronic ink spent covering it, did produce an interesting data set. M. Best [Best, 2000] has used it to provide some of the first empirical testing of interesting hypotheses concerning cultural change often attributed to Richard Dawkins [Dawkins, 1976]: just as biological evolution sifts through the gene pool to find fit individuals, cultural evolution sifts through available memes (paradigms, theories, hypotheses, ideas, words, and so on) to find the most fit. But theories of biological evolution are notoriously subtle, and the data concerning them are much better! Although it is only a beginning, Best’s statistical analysis of phenomena like the rise and fall of the token MONICA within newspapers and UseNet newsgroups provides some of the first concrete data on some very interesting questions.
NLP for databases. We leave to one side difficult but tangential issues such as how the database system might correctly interpret “2G” and “SCSI” as potential attribute values.
Omniscient relevance. Philosophical problems with assuming that any such “omniscient” assessment of relevance is possible, as well as the methodological problems of determining the set Rel, will be considered later.
What is dirent? The dirent interface began with a Berkeley Software Distribution (BSD) specification written by Kirk McKusick in the mid-1980s. It has evolved to be a part of the POSIX standard. Ports to various platforms (e.g., Linux, MS-DOS, MacOS) are available [Gwyn, 1994].
What is RFC822? RFC822 was a “request for comment” quasi-specification of what mailer software should expect to receive.
But sometimes we care about noise words! It is not hard to imagine, however, a query like TO BE OR NOT TO BE. PAT (Patricia) trees have been recommended especially for such situations [Gonnet et al., 1992].

More about AIT origins. The dissertation citations and abstracts contained here are published with permission of UMI® Dissertation Publishing. Further reproduction is prohibited without permission.

University Microfilms Inc. (UMI) is now part of the Bell & Howell Corporation. In part to advertise recent additions to its corpus, UMI often “broadcasts” abstracts of recent dissertations. Considerable intellectual effort is added by selecting which dissertations go to which distribution lists. I first became aware of the AI thesis portion on AI-List, a long-running and influential mailing list moderated by Ken Laws. Laws’s valuable service is now available for a fee as the Computists online newsletter (www.Computists.com). The AI theses were selected by Suzanne Humphries.

Copies of these very important documents can be obtained at nominal cost by addressing your request to UMI® Dissertation Services, run by Bell & Howell Information and Learning Company (formerly UMI), 300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA. Telephone: (734) 761-7400; E-mail: info@umi.com; Web page: http://www.umi.com/.
How do I index my email? Most modern email clients can now be configured so that these files are maintained in relatively straightforward ASCII encoding, but you may want to confirm this for your system.
Implementation details. InstallTerm() needs to check whether the term is present in the splay tree or needs to be added. It should also efficiently check if the docno is the same as the last time and simply increment the appropriate counter in this case.

Splay trees are appropriate technology because we can count on the many frequency-based queries to cause the resulting tree to become well-balanced with use.
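The two checks that InstallTerm() performs can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: a Python dict stands in for the splay tree, and the names install_term and TermEntry, along with the [docno, count] posting layout, are hypothetical.

```python
class TermEntry:
    """Per-term record: postings kept as [docno, count] pairs in arrival order."""
    def __init__(self):
        self.postings = []

def install_term(index, term, docno):
    # Check 1: is the term already present? (A dict stands in for the
    # splay tree here; that substitution is an assumption made for brevity.)
    entry = index.get(term)
    if entry is None:
        entry = TermEntry()
        index[term] = entry
    # Check 2: same document as the term's last posting? Then just
    # increment that counter instead of creating a new posting.
    if entry.postings and entry.postings[-1][0] == docno:
        entry.postings[-1][1] += 1
    else:
        entry.postings.append([docno, 1])

index = {}
for docno, text in [(1, "to be or not to be"), (2, "to do")]:
    for term in text.split():
        install_term(index, term, docno)
```

Because documents are processed one at a time, comparing against only the most recent posting is enough to detect a repeated term within the current document.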
Implementation details. If the optimized fpost representation of postings of Figure 2.5 is used, however, incremental addition of document postings is slightly more awkward.
More about STAIRS. STAIRS was developed by IBM in the 1970s. It also happens to be the IR system used as part of an extensive performance evaluation performed by Blair and Maron [Blair and Maron, 1985].
Proximity searching with low-resolution posting information. Of course, even without high-resolution posting information it is still possible to support proximity query operators by simply retrieving documents in which both terms occur and then performing a subsequent search of the document text itself to see if these occurrences are sufficiently close.
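This two-stage strategy can be sketched as follows, assuming document-level postings held as sets of document numbers and whitespace tokenization. The function names, the window parameter, and the toy corpus are all hypothetical.

```python
def within_window(text, term_a, term_b, window):
    """True if term_a and term_b occur within `window` token positions of each other."""
    tokens = text.lower().split()
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

def proximity_search(postings, docs, term_a, term_b, window):
    # Stage 1: low-resolution postings yield documents containing both terms.
    candidates = postings.get(term_a, set()) & postings.get(term_b, set())
    # Stage 2: scan each candidate's text to test whether the terms are close.
    return sorted(d for d in candidates
                  if within_window(docs[d], term_a, term_b, window))

docs = {
    1: "information retrieval systems rank documents",
    2: "information about wine and later some notes on retrieval",
    3: "retrieval only",
}
postings = {}
for docno, text in docs.items():
    for tok in set(text.lower().split()):
        postings.setdefault(tok, set()).add(docno)

hits = proximity_search(postings, docs, "information", "retrieval", window=2)
```

The cost of the second stage grows with the number and length of candidate documents, which is the price paid for not storing word positions in the postings.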
Pointing. To be concrete, you can imagine this all occurring under a microscope, and the authors choosing language to refer to some particular phenomena, e.g., “those green things with the tails that swish against the blue boxes” or something equally silly. But the act of pointing itself is far from silly. For example, it seems to be one of the fundamental, prelinguistic activities that distinguishes humans from other primates.
What is “information”? Entire courses are given on information theory so we cannot do it justice here. But its basic features are so simple–and so important–that it is tempting to try.

My favorite definition of information is due to Gregory Bateson [Bateson, 1973]: “Information is a difference that makes a difference.” Information is about surprise: ways in which an expectation has been violated. If I tell you that your grade is based on (1) a final and (2) a midterm, you wouldn’t be very surprised. But if I tell you that your grade will also depend on (3) how long you can stand on one foot without moving, you probably would be surprised. There’s more information in that part of my message.

We can demonstrate this in terms of a conversation you might have after a class with someone who missed class that day. “What did you learn in class today?” they will ask. “Oh, not much really,” you’d say in the first case, because your–and your friend’s–expectations about grading (not to mention your friend’s expectation that you can be relied on to convey the information; cf. Section 8.2.1) have been confirmed. But in the second case you’d have to reply, “You won’t believe this; part of our grade is based on how long we can stand on one foot!” You’ve learned something; you’ve gained information.
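The link between surprise and information can be made quantitative with Shannon's standard measure, in which an event of probability p carries -log2(p) bits. The sketch below is offered only as an illustration; the probabilities assigned to the two grading announcements are invented for the example and are not taken from the text.

```python
import math

def surprisal_bits(p):
    """Information, in bits, conveyed by observing an event of probability p."""
    return -math.log2(p)

# Hypothetical probabilities a student might assign beforehand:
p_exam_based = 0.5          # "grade depends on a final and a midterm"
p_one_foot = 1 / 1024       # "grade depends on standing on one foot"

expected_news = surprisal_bits(p_exam_based)
surprising_news = surprisal_bits(p_one_foot)
print(expected_news, surprising_news)
```

The less probable message carries more bits, matching the intuition that the one-foot announcement is the part of the message worth repeating to a friend.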
Virtual spaces. One fundamental difference between organizing books within the physical space of the library and placing vectors within an abstract semantic space is that while a book can occupy only one physical location, it is trivial to maintain multiple representations of the same document. A physical object has to be in one place or another, meaning that someone (typically a cataloger at the Library of Congress) has to make a very difficult decision to determine the primary topic of each book. Think how consequential this decision is as it affects a patron’s ability to go to a semantically defined place in the stacks! This is in stark contrast to the multiple facets we can represent in electronic indices.
Sparse vector spaces. The fact that the Index matrix is sparse (i.e., that only a small number of a document’s keyword entries are nonzero) recommends special sparse-matrix techniques for vector space computations [Letsche, 1996].
Implementation hack. It is also common to use a sparser representation to first identify those documents with nonzero match with a query, rather than exhaustively checking every document.
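A minimal sketch of both ideas, sparse document vectors and using an inverted index to touch only documents with a nonzero match, is given below. The dict-of-dicts layout, the inner-product score, and all names are assumptions chosen for illustration, not the representation used elsewhere in the text.

```python
from collections import defaultdict

# Each document is a sparse vector: only nonzero keyword weights are stored.
docs = {
    "d1": {"retrieval": 2.0, "index": 1.0},
    "d2": {"wine": 3.0},
    "d3": {"index": 1.0, "sparse": 2.0},
}

# Inverted index: keyword -> {doc_id: weight}.
inverted = defaultdict(dict)
for doc_id, vec in docs.items():
    for kw, w in vec.items():
        inverted[kw][doc_id] = w

def match(query):
    """Accumulate inner products only for documents sharing a keyword with the query."""
    scores = defaultdict(float)
    for kw, qw in query.items():
        for doc_id, dw in inverted.get(kw, {}).items():
            scores[doc_id] += qw * dw
    return dict(scores)

scores = match({"index": 1.0, "sparse": 1.0})
```

Document d2 never appears in the result because none of its keywords occur in the query, so it is never visited at all.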
There’s a quicker way to compute average similarity. Fortunately, it turns out that there is a more efficient way to compute average similarity than actually comparing all document pairs. First, define the centroid document to be the average document, i.e., the result of adding all NDoc vectors and dividing by NDoc. Then the average similarity can also be defined in terms of the distance of each document from this center:
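One way to see why the centroid shortcut works is to check it numerically. The sketch below assumes that similarity is the plain inner product and that the average runs over all ordered pairs, self-pairs included; under other similarity measures the relationship may hold only approximately. The toy vectors are invented for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

docs = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
n = len(docs)

# Brute force: average inner product over all n*n ordered pairs.
pairwise = sum(dot(di, dj) for di in docs for dj in docs) / (n * n)

# Shortcut: build the centroid once, then average each document's
# inner product with it (n comparisons instead of n*n).
centroid = [sum(col) / n for col in zip(*docs)]
via_centroid = sum(dot(d, centroid) for d in docs) / n

print(pairwise, via_centroid)
```

The two quantities agree because the inner product is linear, so summing over one member of each pair can be done before taking the product.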

Implementation: Storing document lengths. The fact that the keyword’s total document frequency Dk cannot be known until the entire corpus has been processed suggests a “second pass.” In some cases, lengths are computed for each document and stored in a doclen file, separate from the main kw-inv inverted postings file. (It would also be possible to normalize the frequencies fkd by document lengths and store this quotient in the postings file, but that would require the storage of floats rather than small integers.) For the small corpora we consider here, it proves easier to simply compute these values as the inverted index is read in, prior to the first matching against queries.
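Computing lengths while the inverted index is read in might look like the following sketch. It assumes postings are available as (docno, f_kd) pairs per keyword and that each weight is f_kd times log(NDoc / Dk); that weighting, the in-memory layout, and the function name are assumptions for illustration only.

```python
import math
from collections import defaultdict

# Hypothetical in-memory form of the inverted file: keyword -> [(docno, f_kd), ...].
postings = {
    "retrieval": [(1, 2), (2, 1)],
    "wine": [(2, 3)],
    "index": [(1, 1), (2, 1), (3, 4)],
}
n_docs = 3

def document_lengths(postings, n_docs):
    """Accumulate squared keyword weights per document, then take square roots."""
    squared = defaultdict(float)
    for kw, plist in postings.items():
        d_k = len(plist)               # document frequency: known once the list is complete
        idf = math.log(n_docs / d_k)   # assumed weighting, not specified in the text
        for docno, f_kd in plist:
            squared[docno] += (f_kd * idf) ** 2
    return {docno: math.sqrt(total) for docno, total in squared.items()}

doclen = document_lengths(postings, n_docs)
```

Because Dk is simply the length of each keyword's posting list, it is available at read-in time without a separate pass over the corpus.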
Partial matching isn’t just more efficient; it works better too! In fact, Persin claims superior performance (29.1 versus 28.9 percent, 11-point average precision on the TREC/Tipster corpus) using partial ranking than if all postings are considered. He attributes this to the “. . . pruning of common terms that are encountered in almost every document and that create informational noise rather than help discriminate between documents.” [Persin, 1994, p. 345]
In for a fact, stay for a lesson. The editors of Encyclopædia Britannica use a nice phrase to characterize users of their system:

In for a fact, stay for a lesson.


The idea is that a very pragmatic need might initially bring users to the Encyclopædia Britannica, but they often continue to read as they learn that their initial query does not have a simple answer. Imagine that you want the answer to a simple factual query, for example, the height of Mt. Everest. The first couple of paragraphs of the article on Mt. Everest would meet a simple version of this information need quite admirably:
Tibetan CHOMOLUNGMA, Chinese (Wade-Giles) CHU-MU-LANG-MA FENG, (Pinyin) QOMOLANGMA FENG, Nepali SAGARMATHA, peak on the crest of the Great Himalaya Range in Asia, the highest point on Earth. It lies on the border between Nepal and China (Tibet), at 27° 59' N, 86° 56' E.

Three barren ridges–the Southeast, Northeast, and West–culminate in two summits at 29,028 feet (8,848 m; Everest) and 28,700 feet (8,748 m; South Peak). The mountain can be seen directly from its northeastern side, where it rises about 12,000 feet (3,600 m) above the Plateau of Tibet. The lesser peaks of Changtse (north; 24,803 feet [7,560 m]), Khumbutse (northwest; 21,867 feet [6,665 m]), Nuptse (southwest; 25,791 feet [7,861 m]), and Lhotse (south; 27,890 feet [8,501 m]), which rise around its base, hide the summit from Nepal.
<WWW.EB.COM:180/CGI-BIN/G?DOCF=MICRO/199/84.html>
But in fact there are at least three numbers that could legitimately be given as this answer, each associated with a separate expedition at a different point in history! The online version of the Encyclopædia Britannica makes this additional “Research Note”:
The generally accepted figure of 29,028 feet (8,848 m) for the height of Mount Everest was established by the Indian government’s Survey of India in 1952–54. This datum is used by, among others, the (U.S.) National Geographic Society.

A Chinese survey in 1975 obtained the figure of 29,029 feet, and an Italian survey, using satellite surveying techniques, obtained a value of 29,108 feet (8,872 m) in 1987, but, owing to questions about the methods used, neither of these results is widely accepted. In 1986 a measurement of K2, regarded as the second highest mountain, seemed to indicate that it was higher than Everest, but this was subsequently shown to be an error. In 1992 another Italian survey, using a global satellite positioning system and laser measurement technology, yielded the figure 29,023 feet (8,846 m) by subtracting from the measured height the 6.5 feet (2 m) of ice and snow on the summit; this value has not found general acceptance.
<WWW.EB.COM:180/CGI-BIN/G?DOCF=BUP/630004.HTML>
These sagas make for very interesting reading, but only if you have the additional time and energy available to benefit from such education. Section 8.3.4 will explore this connection between FOA and educational objectives in further detail.
Neural-net style learning. This form of adaptation is very similar to techniques of gradient descent, for example, error correction learning in neural networks. If the positive cluster centroid d+ is treated as the correct answer, the parameter becomes analogous to the neural net’s learning rate.
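The analogy can be made concrete with a small sketch. Treating d+ as the target, one step of gradient descent on the squared error between the query and d+ moves the query a fixed fraction of the way toward d+. The function name, the parameter name rate, and the toy vectors are hypothetical.

```python
def step_toward(query, d_plus, rate):
    """Move the query a fraction `rate` of the way toward the centroid d_plus.

    This is a gradient-descent step on 0.5 * |d_plus - query|^2, whose
    gradient with respect to the query is -(d_plus - query).
    """
    return [q + rate * (d - q) for q, d in zip(query, d_plus)]

query = [1.0, 0.0, 0.0]
d_plus = [0.0, 1.0, 1.0]   # centroid of documents judged relevant
for _ in range(3):
    query = step_toward(query, d_plus, rate=0.5)

print(query)
```

A small rate makes the query adapt slowly but smoothly; a rate of 1.0 would replace the query with the centroid in a single step.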
“With Web search engines don’t we have access to enormous numbers of users searching the same corpus?” [SG]

Yes and no. The Web generally–and Web engines in particular–obviously generates huge traffic and potentially much interesting data about how real people (versus experiment subjects) FOA. But access to these statistics, most conveniently collected by Web search servers, is an increasingly valuable commodity! Many people would like to know what sorts of things people are searching for and how they search for them.

It’s also important not to think of the Web that everyone is searching as “the same corpus.” One of the Web’s most salient features is its dynamism. New documents are added and others (or at least the links to them!) are removed all the time. This makes comparing search retrieval results at two different times difficult.
Killing the messenger. The litigation support was part of an extended trial concerning the construction of the Bay Area Rapid Transit system, BART. Another interesting feature of this study was the repercussions felt by Blair and Maron: While they are careful to say that their very negative conclusions (concerning the Recall levels achieved) “. . . would be problems with any large-scale full-text retrieval system, and in this sense our study should not be seen as a critique of STAIRS alone,” IBM took the criticism very personally. “Big Blue” was very much the Microsoft of this era, and both Blair and Maron were made to feel quite uncomfortable.
Single dimensions for simple minds. Of course many people prefer unidimensional IQ scores over more sophisticated multidimensional analyses of human intelligence. In both cases, it seems likely that the phenomenon measured is sufficiently complex that more than one number is warranted.
More elaborate ways of merging ranked lists. First, the assessment of a document whose ranked order is highly correlated across retrieval methods provides little information about differences between the methods. Said another way, we can potentially learn the most from those documents whose rank order is most different, and hence a measure of the difference in ranked orders of a particular document might be used to favor “controversial” documents. This factor has the unfortunate consequence, however, of being sensitive to what we would expect to be the least germane documents, those ranked low by any of the methods under consideration. A second factor that could be considered is a “sanity check,” including a random sample near the top of our list. While we might learn a great deal from these samples if users agree that these randomly selected documents are in fact relevant, we expect that in general the retrieval performance of the systems should not depend on random documents.
A better RAVePlan. Ideally, RAVePlan could be folded into the interactive RAVE facility (described later), so that documents are allocated dynamically and incrementally, maintaining a nearly consistent density on all queries. The experimenters could then push the experiment alternately in the directions of higher density or larger candidate-document sets. In its current implementation, however, RAVePlan requires that we predetermine the list of documents that each subject will see.
Double-counting spaces?! There may be a bug in this derivation. As it stands, (M + 1)^{k+2} seems to potentially double-count the interword spaces and perhaps miss all-space stretches. W. Li (personal correspondence) claims that the normalization of Equation 5.4 makes this issue disappear, but I am not fully convinced.

I just went back to read my paper. i think counting the space once or twice leads to the same result. in the paper, I wrote the probability... is “proportional” to.... then I added up all to get the normalization factor. the same (1/27) factor will be canceled.

No, nobody asked this question before! I suspect people only look at the abstract/conclusion and never bother with the derivation itself (it can be bad for the authors because no feedback!).
Beyond the puny three dimensions of human existence. The space where documents exist is very large and high-dimensional. I encourage you to suspend your intuitions from the two and three dimensions in which we puny humans live (cf. [Abbott, 1952]), because many of the intuitions that apply in a small number of dimensions do not generalize to the high-dimensional spaces that we discuss here. Some of these involve the curse of dimensionality, which makes the computational expense of many important questions grow exponentially with the number of dimensions.
Mathematical details. The T superscript on A simply means that rows and columns of this matrix have been transposed: Every row in the original A has become a column in A^T and vice versa. An orthonormal matrix contains mutually orthogonal vectors of unit length. The most common example of an orthonormal set of vectors is the column vectors of the identity matrix I, corresponding to the coordinate vectors, but any rotation of this system is also orthonormal.
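These definitions can be verified on small matrices. The sketch below uses plain Python lists; the helper names and the 30-degree rotation are chosen only for illustration. It relies on the fact that the columns of a matrix M are orthonormal exactly when M^T M equals the identity.

```python
import math

def transpose(m):
    """Rows become columns and vice versa."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def is_orthonormal(m, tol=1e-12):
    """Columns are orthonormal exactly when M^T M is the identity matrix."""
    prod = matmul(transpose(m), m)
    n = len(prod)
    return all(abs(prod[i][j] - (1.0 if i == j else 0.0)) <= tol
               for i in range(n) for j in range(n))

identity = [[1.0, 0.0], [0.0, 1.0]]
theta = math.pi / 6   # an arbitrary rotation angle
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]
shear = [[1.0, 1.0], [0.0, 1.0]]   # columns are neither orthogonal nor all unit length

print(is_orthonormal(identity), is_orthonormal(rotation), is_orthonormal(shear))
```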
Dimensionality reduction using neural networks. Neural networks can be shown to perform a very similar sort of dimensionality reduction, if they are forced to “auto-associate” an input vector with an equivalent output vector, compressing the information through a hidden layer of k units [DeMers and Cottrell, 1993]. One important difference, however, is that neural networks spread the variance evenly across all hidden units, rather than concentrating most of it along the first eigenvector, etc.
Earlier attempts to reduce dimensionality. Deerwester et al. [Deerwester et al., 1990] suggest that earlier attempts to apply factor analytic techniques to IR [Koll, 1979; Borko and Bernick, 1963] may have foundered because they considered too few dimensions (7 and 21, respectively).
SVD is patentable?! One complication surrounding this software and LSI generally is that Bellcore was granted a patent (U.S. Patent No. 4,839,853, June 13, 1989) on this technique! Caveat emptor.
Temporal drift. It can arise naturally in many settings, for example, when a newspaper wants to allow retrieval of articles from the last 30 days.
A new argument for nurture. Landauer and Dumais offer a more radical interpretation as well. These claims must also be appreciated as part of the much larger debate between “nature” and “nurture” as the determinant of linguistic development [Elman et al., 1996; Pinker, 1994; Piattelli-Palmarini, 1980].
IR has historically ignored preferences. Despite the availability of such data, the idea that retrieved documents either were or were not relevant has been so deeply ingrained in mathematical characterizations of the IR problem that it was not until 1990 that Wong and Yao first exploited the fact that users provide partial-order information relating to retrieved documents [Wang et al., 1992].
Proximity can capture similarity or dissimilarity/distance. Although “similarity” seems a simple idea, it admits a number of interpretations. For example, what is the opposite of similar? When we talk about spaces, does dissimilarity mean that two things are far apart? Rather than considering dissimilarity to be a negative quantity, it is more conventional to think of it as the inverse of similarity:

van Rijsbergen’s long shadow. In every section of FOA, van Rijsbergen’s shadow has fallen. But there simply isn’t a way to cover the topic of clustering any better than he does in his chapter “Automatic Classification.” Thanks to the happy circumstance of his full text being on the FOA CD-ROM, I don’t have to say it again! This section is a superficial copy of his.
Divisive/partitional clustering. Divisive or partitional clustering algorithms begin with the entire set of data points considered to be in one cluster and then attempt to partition this set [Jain and Dubes, 1988, p.57]. They sometimes are recommended because if they begin with exactly k seeds the complexity of comparison is reduced to O(kN).
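The O(kN) comparison count can be seen in a sketch of a single assignment pass, in which each of N points is compared against each of k seeds. Squared Euclidean distance, the function name, and the toy data are assumptions; a full partitional algorithm would typically repeat such passes after updating the seeds.

```python
def assign_to_seeds(points, seeds):
    """Assign each point to its nearest seed; return labels and the comparison count."""
    labels, comparisons = [], 0
    for p in points:
        best, best_dist = None, float("inf")
        for idx, s in enumerate(seeds):
            comparisons += 1
            dist = sum((a - b) ** 2 for a, b in zip(p, s))
            if dist < best_dist:
                best, best_dist = idx, dist
        labels.append(best)
    return labels, comparisons

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (0.3, 0.1)]
seeds = [(0.0, 0.0), (5.0, 5.0)]
labels, comparisons = assign_to_seeds(points, seeds)
print(labels, comparisons)
```

With N = 5 points and k = 2 seeds the pass makes exactly kN = 10 comparisons, rather than the N(N-1)/2 needed to compare every pair of points.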
Who stated the PRP? The PRP has been variously attributed (cf. van Rijsbergen, p. 113) to William Maron, William Cooper, and Steve Robertson. Here we use van Rijsbergen’s statement.

Dr. Maron recently offered the following updated opinion concerning the PRP:

You did not ask for my thoughts on issues surrounding the so-called “Probability Ranking Principle”, but in the event that you are interested here, very briefly, are three that come to mind immediately.
  1. Why rank the output of a retrieval system according to computed values of probability of relevance?
  2. Since “probability of relevance” can be interpreted in several different ways, which interpretation is to be preferred?
  3. How accurately can those individual probabilities, which are used to compute probability of relevance, be estimated?
If we accept (as I do) a frequency interpretation for probability, then it is tautologically true that an event with the higher probability of occurrence will happen (occur) more often than one with a lower value of probability of occurrence. Hence if we are computing probability of relevance for the output in a document retrieval system, the best strategy is to rank those output documents in descending order according to their probabilities of relevance, because by so doing we would be providing the user with an optimal (output) search value. (Assuming here, of course, that all relevant documents are of an equal value.) Looking first at those documents with the highest probability of relevance means that the user will be most successful in finding relevant documents–in the long run.

A probability ranking retrieval system is only as good (accurate) as are the estimates of the individual probabilities that are being used to compute the output probability of relevance and upon which the final ranking is based. Which kinds of individual probabilities can be estimated most accurately? How might more accurate estimates be obtained?

Since probability of relevance is a relationship among classes of events (individual documents, individual users, documents of a certain type, users of a certain type, etc.), it is important to be clear about what we mean, in any given discussion, by “probability of relevance”. In the model proposed in 1960 by Kuhns and myself, probability of relevance is a relationship between a single document and a class of users of a given type (viz., all of those submitting a query of a certain type). In the model proposed in 1976 by Robertson and Sparck-Jones, probability of relevance is a relationship between a given individual and a class of documents of a given type. These are two quite different meanings of “probability of relevance”.
The PRP hides another assumption. This statement also has latent in it the assumption that we can assess the relevance of every document independently of the relevance of any other document; cf. Section 4.1.1.
Rational world. One way to know that we are talking about a “rational” world is to say that:

Maybe this assumption isn’t so bad? Cooper has noted that we don’t need such a restrictive notion of independence [Cooper, 1994]. All we need to know is that the ratio of “linked” relevant to irrelevant feature probabilities



is independent. Although slightly less restrictive, this assumption seems no more realistic.
Weighting evidence. I.J. Good has argued that when attempting to reason about the probability that a hypothesis is true (e.g., that a particular document is relevant to a query), given evidence for or against it (e.g., the presence or absence of keyword features), the LogOdds of these probabilities provides conditions necessary to decide.
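Good's idea is commonly formalized as the "weight of evidence": the logarithm of the ratio between the probability of the evidence under the hypothesis and its probability otherwise. The sketch below assumes, as a simplification not stated in the text, that keyword features are conditionally independent, so their weights simply add to the prior log odds. All numbers and names are hypothetical.

```python
import math

def log_odds(p):
    return math.log(p / (1.0 - p))

def posterior_probability(prior, evidence):
    """Combine a prior with evidence given as
    (P(feature | relevant), P(feature | not relevant)) pairs."""
    total = log_odds(prior)
    for p_rel, p_irr in evidence:
        total += math.log(p_rel / p_irr)   # weight of evidence for this feature
    return 1.0 / (1.0 + math.exp(-total))  # convert log odds back to a probability

# One feature favoring relevance (0.8 vs 0.2) and one against it (0.3 vs 0.6).
p = posterior_probability(prior=0.1, evidence=[(0.8, 0.2), (0.3, 0.6)])
print(p)
```

Working in log odds turns the multiplication of likelihood ratios into addition, so each piece of evidence contributes a signed amount for or against the hypothesis.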
Multiple representations of the same document. Turtle and Croft discuss the association of multiple representations of the same document within the same network. For example, one set of keywords might be assigned automatically, while a second set may be manually assigned. But interactions among these representations, capturing, for example, the probability that a human editor would assign the keyword INFORMATION RETRIEVAL to a document given that it is assigned the keywords INFORMATION and RETRIEVAL by an automatic indexing algorithm, must also be included.
History: AI xor IR?! In the early 1970s, at a formative stage in the development of the disciplines of artificial intelligence and information retrieval, funding agencies of the government, particularly DARPA, seemed forced to choose between AI and IR as methodologies for writing elaborate programs. Leading scientists of both groups made predictions about what their technologies could do in the foreseeable future. The claims of AI seem to have won the day, because after this period AI benefited from a great infusion of defense dollars relative to IR.

Whatever other consequences these early days had, they did not breed good feelings between the IR and AI communities. For example, when I was thinking about graduate school I had a meeting with Gerry Salton at Cornell. Entirely ignorant (then) of who he was and his position on AI, I began our interview by mentioning that my primary research interest was artificial intelligence! Suffice it to say, I was not accepted into Cornell’s graduate program :).
What field’s literature is this? Figure 6.3 happens to be an especially interesting set of documents to study: This is the entire set of documents on N-rays, a hypothetical and (it was ultimately determined) imaginary phenomenon investigated in the early 1900s. N-rays were a form of radiation first hypothesized to exist in 1904. After an extended period of investigation, the community of physicists investigating the question determined that in fact there were no such things as N-rays! This means the corpus of documents has a convenient, cleanly defined time period.

The example also provides insight into the larger scientific process: This is what science looks like when this engine is entirely divorced from any underlying phenomena. In general we can, with Plato, imagine that there is indeed an underlying reality, as well as a social process of science attempting to describe that reality. We can hope that in most cases any particular scientist’s activities, or those of the community in which he participates, are governed by both influences: that of the physical reality and that of the social process.
U.S. courts are latitudinarian! In English courts, this doctrine is taken strictly, while U.S. courts have adopted a more lax, “latitudinarian” version that allows the court to depart whenever “. . . the evil resulting from a continuation of the accepted rule [would produce] a greater mischief” [Fox v. Snow, 6 NJ 12, 25, 76 A.2d 877, 844 (1950)].
Page numbers are worth real money. In this context it is interesting to note just how valuable, in real economic terms, the localization of citation pointers can be. One example is provided by the addition of page numbers to legal citations. While a citation like West Pub. Co. v. Mead Data Cent., Inc., 616 F. Supp 1571 uniquely identifies a particular judicial opinion, the page number within the opinion, as reported in a particular court reporter publication, makes it easier to find the relevant legal issue. Goldberg summarizes the legal issues:
Given today’s technology, access to court opinions should be easier than ever. The law of the states and the federal law is in the public domain and is frequently reported electronically by the courts . . . . Small legal publishers are attempting to take advantage of this access and package less expensive computer assisted legal research tools than Westlaw and Lexis services; for example, many want to package state law on CD-ROM. Given the raw material is available these publishers should have a green light. The problem is that West Publishing, who publishes many of the State and Regional reporters and has control over the Federal Reporters, claims that their citation form, specifically the pagination of the reporters, is protected by copyright.

As a result, those who wish to become legal publishers must either receive a license from the official reporters to use their pagination and citation form or petition the court for recognition of their citation form. Publishers are free to request copies of opinions from the courts or, for those that are available, download them from electronic bulletin boards and then print them in whatever form they choose. These materials are useless, however, to many potential customers as there is no recognized way to cite to them. In turn, the fee these publishers can fetch is a fraction of that of those using West’s pagination.

In 1991, the Supreme Court held in Feist Publication v. Rural Telephone Service that in order to deserve a copyright a work must have a “creative spark” since a copyright is to reward originality. This decision has left some to doubt whether West is deserving of copyright protection for its pagination. In the 1980s, West sued Mead Data’s Lexis over the use of its pagination but the two settled out of court. Under the terms of the license agreement Lexis received, Lexis cannot relitigate so the issue is not yet settled. Efforts to get legislation on the issue passed have failed allegedly in part due to West’s strong legislative ties. [Goldberg, 1995]
The American Bar Association has recently issued a Special Report on Citation Issues, including page numbering. As with most of electronic publishing, the underlying issues remain in flux.
cf. for conferre. The abbreviation cf. stands for conferre, Latin for TO COMPARE, BRING TOGETHER, CONTRIBUTE, CONSULT [Gove, 1993].
What the Talmud says about Web sites. (As reported by Ron Fein.) The Talmudists among you may find this amusing. It comes from Tractate Kombutra.

Rabbi Tarfon of Bet She’an said of Rabbi Shlomo ben Yechezkel of Tiverya: “It is said that in those days Rabbi Shlomo ben Yechezkel of Tiverya designed a Web site for the mother of his father, Sarah the daughter of Pinchas, who begat Yechezkel, who begat Rabbi Shlomo ben Yechezkel of Tiverya. Thus Rabbi Shlomo ben Yechezkel of Tiverya performed the mitzvah of Web site design.”

Rabbi Michal ben Elkanah, who only had one eye, said: “But is it not also said that in those days there was no Web, only gopher?”

Rabbi Shmaryahu of Hevron said: It is true, but as it is written: “A Web browser may also use the gopher protocol, in addition to the HTTP protocol.”

Rabbi Eliezer asked: “Why does it specifically mention that the Web browser may also use the gopher protocol, when it is written elsewhere that a Web browser may use any protocol? Because the gopher protocol is especially meritorious, since it enables support of legacy systems.”

One time a poor man came into the home of Rabbi Shmaryahu of Hevron and asked for two megabytes of disk space on the Web site of Rabbi Shmaryahu of Hevron. Rabbi Shmaryahu of Hevron refused the man, but instead gave him a personal Web server for his own use. At this point Rabbi Yehudah ben Yerachemiel asked Rabbi Shmaryahu of Hevron. “Why did you refuse this man’s request, but instead give him a personal Web server for his own use?” Rabbi Shmaryahu of Hevron replied: “It [the Mishnah] teaches: ‘When a poor man comes into your home and asks for disk space on your Web site, first ascertain whether he is going to use it for his own purpose or for the purpose of idol worship. If he is going to use it for his own purpose, grant him the space he asks, unless it exceeds twenty ephraot [one ephrah = 213 kilobytes], in which case you may refer him to a local Internet service provider,’ for as it is written: ‘It is not upon you to complete the task, but neither are you free to desist from it. If he is going to use it for the purpose of idol worship, then do not give him the space, but instead rebuke him, that he might see the error of his ways and refrain from idol worship.’”

Rabbi Gideon of Sh’chem disagreed, saying: “It [the Mishnah] also teaches: ‘When a poor man requests space on an FTP server, you must grant it without asking why he is going to use it.’” Why would the Mishnah impose a requirement on a Web server but not on an FTP server? Rabbi Shmaryahu of Hevron said that Rabbi Eliezer said: “Why does it specifically mention that the Web browser may also use the gopher protocol, when it is written elsewhere that a Web browser may use any protocol?” Because the gopher protocol is especially meritorious, since it enables support of legacy systems. Similarly, the FTP protocol is especially meritorious. Therefore, it is unfair to deny a poor man access to FTP, whereas it is sometimes permitted to refrain from giving a poor man access to HTTP, because without HTTP he can still serve files using FTP, but without FTP he will be unable to put his files on the server, because the means for saving files over HTTP are unreliable.
A wide-ranging psychologist. This is the same George Miller whose analysis of Zipf’s law was mentioned in Section 3.2. He is perhaps most famous for his “human information processing” analyses of cognition, such as the limit of 7 ± 2 on the number of “chunks” that can be retained in short-term memory [Miller, 1956]. The same information-theoretic motivation underlies all these wide-ranging efforts.
Alternate histories. Oliver Selfridge, a participant in these early meetings, believes that in fact the first meeting on AI topics was the Western Joint Computer Conference of 1955.
Late Winograd. This logical model follows after natural language processing algorithms such as Terry Winograd’s SHRDLU. It is also worth noting that, like Wittgenstein (cf. Section 8.2.1), “late Winograd” has completely recanted the strong, logically positive views of language underlying any such theorem-proving models of language. With Flores, he has instead connected semantics to the practical uses of language, for example, as documents are passed around a business organization [Winograd and Flores, 1986].
The q/d gradient runs in both directions. The distinction between query and document vector modification runs deep through the literature of Salton’s students. Rocchio did the most to analyze vector modification strategies using RelFbk, but his intent was query modification rather than document modification. Rocchio’s algorithm remains in use, as these algorithms have been “recast as linear classification by treating the query as a classifier” [Lewis et al., 1996].
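A minimal sketch may make the query-modification direction concrete. It uses the standard textbook form of Rocchio's update, in which the new query is a weighted sum of the old query and the centroid of the documents judged relevant, minus the weighted centroid of those judged nonrelevant. The function names (`rocchio_update`, `centroid`, `score`), the weights `alpha`, `beta`, and `gamma`, and the toy vectors are all assumptions made for illustration; none of them comes from the text.

```python
from typing import List, Sequence

Vector = List[float]


def centroid(vectors: Sequence[Sequence[float]], dim: int) -> Vector:
    """Component-wise mean of the vectors; the zero vector if there are none."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio_update(
    query: Sequence[float],
    relevant: Sequence[Sequence[float]],
    nonrelevant: Sequence[Sequence[float]],
    alpha: float = 1.0,   # weight on the original query (illustrative default)
    beta: float = 0.75,   # weight on the relevant centroid (illustrative default)
    gamma: float = 0.25,  # weight on the nonrelevant centroid (illustrative default)
) -> Vector:
    """Move the query toward relevant documents and away from nonrelevant ones."""
    dim = len(query)
    rel_c = centroid(relevant, dim)
    non_c = centroid(nonrelevant, dim)
    return [alpha * query[i] + beta * rel_c[i] - gamma * non_c[i] for i in range(dim)]


def score(query: Sequence[float], doc: Sequence[float]) -> float:
    """Inner product: the modified query acts as a linear scoring function."""
    return sum(q * d for q, d in zip(query, doc))


# Toy three-keyword vocabulary with hypothetical relevance feedback.
query = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
nonrelevant = [[0.0, 0.0, 1.0]]

new_query = rocchio_update(query, relevant, nonrelevant)
```

Scoring documents by their inner product with `new_query` is what allows the modified query to be read as a linear classifier: documents scoring above some threshold are treated as relevant. Modifying document vectors instead would apply the same kind of adjustment in the opposite direction, moving relevant documents toward the query.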
Ordered sequence. An order over this multiset, corresponding to word order in the document, can also be imposed.

A reasonable simplification is to assume that the word’s position within the document does not affect its conditional probability:



When we become interested in realistic document structures and writing conventions (e.g., abstract paragraphs, introductions and conclusions, or the spiral exposition of news stories; cf. Section 6.2), this assumption must be reconsidered.
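One way to write the position-independence assumption is sketched here. The notation is assumed for illustration and is not taken from the text: $w_j$ denotes the word occurring at position $j$ of a document, $w$ is a particular vocabulary word, and $c$ stands for whatever event is being conditioned on (for example, relevance).

```latex
\Pr(w_j = w \mid c) \;=\; \Pr(w \mid c) \qquad \text{for every position } j .
```

Under this reading, a word contributes the same conditional probability whether it appears in the first sentence or the last, which is exactly what structured documents call into question.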
Probabilists’ religious wars. At their core, these issues go to the heart of a controversy between frequentist and Bayesian characterizations of valid inference [Shafer and Pearl, 1990].
Dark Horse effect, too. Diamond also mentions a “Dark Horse effect . . . [which] refers to the situation in which, for the query at hand, a retrieval approach may produce unusually accurate (or inaccurate) estimates of relevance for at least some items, relative to the other retrieval approaches” (personal communication).
Querying for link topology. Several search engines now allow such queries as “Give me all documents k = 1 links away from this one.”
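A query of this kind can be answered by a breadth-first search over the link graph. The sketch below makes several assumptions not stated in the text: the link structure is available as an in-memory dictionary of outgoing links, “k links away” means a shortest path of exactly k links, and the function name `documents_k_links_away` and the toy page names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def documents_k_links_away(links: Dict[str, List[str]], start: str, k: int) -> Set[str]:
    """Return the documents whose shortest link distance from `start` is exactly k."""
    distance = {start: 0}
    frontier = deque([start])
    while frontier:
        doc = frontier.popleft()
        if distance[doc] == k:
            continue  # documents beyond k links are not needed
        for target in links.get(doc, []):
            if target not in distance:  # first visit gives the shortest distance
                distance[target] = distance[doc] + 1
                frontier.append(target)
    return {doc for doc, d in distance.items() if d == k}


# Toy link graph: each key links to the pages in its list.
toy_links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["d.html"],
    "c.html": ["d.html", "a.html"],
    "d.html": [],
}

one_hop = documents_k_links_away(toy_links, "a.html", 1)
two_hops = documents_k_links_away(toy_links, "a.html", 2)
```

A real engine would answer such a query from its stored link index rather than from a dictionary held in memory, and might follow incoming as well as outgoing links.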
ETHOLOGY. ETHOLOGY is a term used to describe a branch of biology particularly concerned with the interactions among evolution, environment, and behavior. The seminal work in this area was done at the Max Planck Institute in the 1950s. But the word ETHOLOGY was not used to describe this area until some time later. It is interesting to wonder how an IR system might nevertheless be able to retrieve these old documents in response to a query for ETHOLOGY.
We mostly ask about SEX. Evidence for wide query novelty is especially striking given that, at least at this juvenile stage of Internet usage, by far the most dominant query topic is SEX. Not only was SEX the most common token, but sex-related terms dominated 17 of the top 25 most frequent query terms. MP3 and CHAT were the most popular non-sex-related tokens, but their frequency was approximately a third of that of SEX.
Just how do Alta Vista, HotBot, . . . work?! Here is what some of the major engines have to say about how they work:

Alta Vista What information does Alta Vista actually index off of a Web page? Basically, we index all the HTML information on a page: all text, ALT text for images, links (hrefs and images), anchors, title, description and keyword meta-tags, applet and ActiveX object names, the page’s URL, its host name www.foo.com, and its domain name (com). The treatment of UseNet postings is similar but with different keywords. We do not index HTML comments.

How do I control what is indexed? You can control what is indexed by effectively using meta-tags for keywords and descriptions. We strongly recommend that you check out the Advanced Help page to learn more details for effective design techniques with title, dates, and contents, among other methods.

When should you use Advanced search? Advanced search is for very specific searches, not for general searching. Almost everything you need to search for can be found quickly and with better results using the standard search box, where the Alta Vista search service sorts the results by placing the most relevant content first. However, if you need to find documents within a certain range of dates or if you have to do some complex Boolean searches, there isn’t a more powerful tool on the Web than Alta Vista Advanced search.

What’s the difference between Search and Advanced search? Words and phrases work the same in Search and Advanced search. You can also choose to search UseNet or refine your search using either tool. The include (+) and exclude (–) features are not available in Advanced search; instead, you can use the more powerful Boolean commands to customize your search. Another difference is that you can choose to see results without having our system rank the material for you.

What do you mean by the term “ranking”? Usually, the search service sorts, or “ranks,” the contents of your search according to relevance. The higher the ranking, the more relevant the content. However, in Advanced search you can view “unranked” results by just using the Boolean Operations section rather than the regular Search box.

Why did my site become lower in the results? When surfers search for broad topic areas using one or two keywords, the search engine sometimes finds multiple pages containing equal or similar amounts of relevant content. The ranking of these pages can change over time.

Note: We do not sell result rankings to individuals or companies. You can contact our advertisers if you wish to purchase advertising space in the ad banner above specific results pages.

How can I ensure a good ranking for my Web site? The best way to improve the ranking of your Web site is to be more specific about the content by using synonyms or locations in meta-tags within your HTML document. It’s not a good idea to use duplicate words multiple times, use keywords excessively, or include keywords that do not relate to the content of your document.

InfoSeek We are often asked, “Could you share the secret formula for improving my site’s relevancy?” Our reply is the same to our users as well as our partners: Use a highly descriptive title, include a meta-tag description, and create meta-tag keywords that contain comma-separated phrases. Use an assortment of synonyms that accurately describe your site, but don’t try to boost the site’s relevance by repeating keywords. The overuse and repetition of keywords may result in a lower relevancy score and possible omission from InfoSeek’s index.

Excite Search results are listed in decreasing order of relevance. The percentage sign to the left of each result is the relevance rating. The closer the rating is to 100, the more confident Excite is that the document will fit your needs. The relevance ratings are automatically generated by our search engine, which compares the information in the site against the information in your query.

What is a relevance rating? Excite lists search results using a scored relevance rating–the higher the percentage, the more confident we are that the site listed matches your search query. The rating is generated by an algorithmic equation, which measures the site against the concept described in your query.

The search results page lists the title, URL, and a brief summary of each site. To the left of each title is the relevance rating (a percentage), which will help guide you to the information most closely matching your query.

Improving your site’s ranking. Suppose you want users searching for Hawaiian Bed and Breakfasts to find your site among the first 20 sites retrieved. What’s the trick? How can you do it? Can Excite help? Simply adding, removing, or changing a few sentences may alter the way our spider indexes you.

When designing or redesigning your site, think about the search queries you want people to use to find it. Then create a site that will be responsive to those queries. Our design tip is simple: Relegate unrelated topics to subsidiary pages. If you’re advertising your Hawaiian bed and breakfast, don’t use the home page to emphasize how the ocean looks from a bedroom window. Instead, emphasize bed, breakfast, Hawaii, and weeklong getaways.

We appreciate creative design, and we’re not telling you how to design your site, but you may want to keep in mind that if you include a few lines of poetry on that Hawaiian bed and breakfast home page, the Excite spider will consider them as noteworthy as every other line on the page. They become part of your concept, and they might even dilute the main topic of “Hawaiian Bed and Breakfasts.” For the same reason, don’t put price lists on the home page. The spider may read the prices as important bits of text and your page may not appear as high on the list of results as you would like.

Does Excite use meta-tags? In general, our spider doesn’t honor meta-tags. The only exception to this is site summaries, which appear in the search results. For these summaries we do look for the “meta-description” tag. Even though we can use this information for summary purposes, we DO NOT index this information, so it will not influence the site’s ranking in the search results. We believe our decision protects our users from unreliable information. A couple of examples:

A site included this in meta-tags: META HTTP-EQUIV="keywords" CONTENT=“This site offers high quality information about how to buy residential real estate. Our experts can help beginner home buyers save money.” But it wasn’t aimed at educating home buyers at all. It was instead an advertisement for a large real estate firm that simply wanted to lure potential home buyers to its site.

Another site sold children’s clothing. Yet one of the first sentences in the meta-tags declared: META HTTP-EQUIV="keywords" CONTENT=“This site can help parents concerned about child care.” The author figured that queries about child care were more frequent than queries about children’s clothing. By dishonestly using meta-tags, the author hoped to increase the number of potential customers visiting the site.

Our spider is programmed to grab as much information as it can from your site by taking the exact words on the page. If the user can’t see or use it, we don’t bother to index it or search on it.

HotBot How do I improve my site’s ranking? HotBot’s search results are based solely on comparing the user’s search query to the content of millions of Web pages. There is no list matching certain search terms or keywords with special results.

Basic factors affecting a page’s ranking are: the words in the title, keyword meta-tags, word frequency in the document, and document length.

Although there seems to be significant variation among these methods, all of these characterizations are too vague for us to know just what any of the engines really does! Danny Sullivan’s Search Engine Watch is a good place to look for additional information about meta-tags.
Early versus late Wittgenstein. Everyone changes their mind, but when you’re a philosopher with the depth of Ludwig Josef Johann Wittgenstein, you can express two positions and have others take great interest in both! Wittgenstein’s Tractatus Logico-Philosophicus [Wittgenstein, 1922] was published in 1922 and is a primary reference for what we now think of as “early Wittgenstein”; Philosophical Investigations [Wittgenstein, 1953] was published in 1953 and characterizes the “late Wittgenstein.” In the interim, Wittgenstein taught elementary school, played music, and quit philosophy more than once. But even more striking than the passage of time between these two great works is how diametrically opposed the arguments put forward in Tractatus and Investigations are. N. Malcolm [Malcolm, 1967] expresses just how unusual a state of affairs this is:
A considerable part of the Investigations is an attack, either implicit or explicit, on the earlier work. This development is probably unique in the history of philosophy–a thinker producing, at different periods of his life, two highly original systems of thought, each system the result of many years of intensive labors, each expressed in an elegant and powerful style, each greatly influencing contemporary philosophy, and the second being a criticism and rejection of the first. [Malcolm, 1967, p. 334]
(Terry Winograd’s turnaround concerning appropriate applications of natural language processing (NLP) technology, between his dissertation and 1983 work [Winograd, 1983] and his 1986 book with Flores [Winograd and Flores, 1986], almost qualifies as early Winograd versus late Winograd, however!)

You can imagine my reluctance to attempt to characterize just what it was that Wittgenstein changed his mind about, in a short sidebar! Quite roughly then, early Wittgenstein thought that language was the perfect philosopher’s tool. He aspired to a universal language, shared by all careful users, that could positively and uniquely allow careful naming of things. Just as numbers point to essential categories and mathematics builds these into theorems about how numbers are related, simple words name simple categories of objects (events, states, ...), and more complicated linguistic expressions name more complicated categories. An utterance of language means the same thing wherever and whenever it is said, just as 2 does, and just as a² + b² = c² remains true everywhere.

By the time of his Investigations, Wittgenstein had given up hope that the convenient naming system of mathematics was possible elsewhere. Language for the late Wittgenstein depended critically on the context of the utterance. Naming things was only one possible language game; there are many others people play all the time. Without understanding the purposes to which an utterance is being applied, we can’t really understand its meaning. The meaning of a sentence is its use (Gebrauch), its employment (Verwendung), its application (Anwendung) [Malcolm, 1967, p. 336].

This book proceeds on the assumption that Wittgenstein got it right the second time; I focus especially on language serving the FOA language game. Your mileage may vary.
Star Trek script generator. The consequences of violating these tacit rules of natural language are often used to characterize what is wrong when a Pinocchio-like robot attempts to communicate. Consider how Data, of Star Trek: The Next Generation fame, makes many of these mistakes as he attempts to communicate using natural language.
Math proofs are more informal than you may think! It is a well-known cliché, however, that most mathematical papers omit details (or even contain errors!) to a degree that would violate any algorithmic notion of theorem proving.
The editor’s dual representation of a document. The ancient InterLisp editors, which represented a file/document in terms of the set of operations used to create it, may be relevant here. Think of the studies of Mozart’s composition process. Supposedly this analysis proceeded by carefully peeling off layers of paper (in reverse chronological order) that had been overlaid as Mozart edited his manuscript. The startling conclusion: On more than one occasion, the composer’s final work was identical to his original conception!
Computists is a good example. Ken Laws’ Computists online newsletter is an excellent example of how an editor, his readers, and various subeditors and stringers can synergistically help one another. The controversy concerning AOL’s moderators, some of whom think they should benefit from the same economic windfall enjoyed by the AOL corporation, is another interesting example.