6

Back to TOC

Inference beyond the Index

The Index, that critical mapping between documents and descriptive keywords, has dominated our approach to FOA in all the preceding chapters. But there is of course a larger context of available information: FOA can be accomplished by showing a user relations among keywords, by acquainting him or her with important authors, by pointing to important journals where relevant documents are often published, and so on. Retrieval of all these information resources, especially when structured in meaningful interfaces, can tell a user much more than a simple list of relevant documents.

This chapter is concerned with exploiting a variety of other clues we might have about documents (and keywords, authors, etc.), above and beyond the statistical, word-frequency information that has been at the heart of the Index relation. In all cases, these techniques identify some new source of data, represent it efficiently, and then perform some kind of inference over the representation.

AI is a subdiscipline of computer science that is centrally concerned with questions of knowledge representation and inference over those representations, especially when these algorithms arguably lead to “intelligent” behaviors. (In many ways the best characterization of the AI domain is the extensional one provided by the AIT corpus of Ph.D. dissertations.) We could expect, therefore, that there would be a great deal of cross-fertilization between AI methods and IR methods, both having grown up within computer science during the same period. But for complicated reasons, until recently there has been very little interaction.History: AI xor IR?

By and large, AI has defined its notions of inference in logical terms, originally based on automatic theorem-proving results. Chapter 5 discussed IR’s probabilistic foundations, and one immediate axis of difference between AI and IR is the distinction between primarily logical and primarily probabilistic modes of inference. Nevertheless, some in IR perceived early on the advantages offered by AI’s knowledge representations [Smith, 1981] and expert systems techniques [Fox and France, 1987; McCune et al., 1985; Fidel, 1986].

Today, the fields of AI and IR align much more closely. For example, both machine learning and natural language processing have always been considered central issues within AI. The next chapter will discuss at length machine learning techniques as they have been applied to document corpora. Section 8.2 can only sketch another large intersection, corpus-based linguistics, where natural language issues and IR techniques also merge.

The advantages of applying AI knowledge representation techniques become especially obvious when additional structured attributes are associated with documents, keywords, and authors. Early on, Kochen [Kochen, 1975] considered a broad range of these forms of information as shown in Figure 6.1. Even more inclusive lists have since been proposed [Katzer et al., 1982; Hanson, 1988].

This shows the primary Index relation in the larger context of other information we might have available. What all of these additional forms of information have in common is their ability to shed new light on the semantic questions of what the documents are about. Information on the publication details of documents, for example, the journal date or page numbers of documents, can help provide a context within which individual documents can be better understood. Much of this data-about-data document is now referred to as meta-data. This additional modeling of document structure, in languages like XML and codified in standards like the Dublin Core, is one of the most important ways in which database and IR technologies now interact. This constructively blurs many of the database/search engine differences mentioned earlier (cf. Section 1.6). Techniques for performing fact extraction – building database relations from analysis of textual WWW pages [Craven et al., 1998] – suggest a broad range of new ways that structured attributes may enter into the retrieval task.

Section 6.1 discusses one of the most important ways in which documents can be understood independent of their keywords. In science, in the common law tradition, more recently in email newsgroups, and now with HTML hyperlinks, the ability to link one document to another can provide vital information about how the arguments of one document relate to those contained in another.

Section 6.3 will discuss some of the special representation techniques that have been used to organize keywords in the vocabulary. It is also possible to reason about authors of documents. Section 6.4.1 discusses Ph.D. “genealogies” in which dissertation authors are related to one another by shared advisors. Coauthorship and membership in the same research institution have also been proposed as ways to provide context on a particular author’s words. In some cases, characterizations of expertise of the authors, independent of the documents themselves, are available.

The chapter concludes with several suggestions of how these varied information sources can become integrated as part of next-generation FOA tools. Section 6.5 considers several “modes of inference” by which new conclusions about keywords and documents can be reached from elementary facts. Section 6.6 suggests a few of the new interface techniques that become available as richer data streams are provided by and presented back to a user. Section 6.7 and Section 6.8 look at two domains of discourse in particular – the law and science surrounding molecular genetics – as examples of how such techniques can be marshaled toward particular FOA purposes. After considering all these ways that the methods of AI can be used to help with FOA, Section 6.9 concludes by speculating about how the problem of intelligence itself might be changed as we take seriously the prospect of basing it on textual representations.

6.1 Citation: Interdocument Links

The bibliography at the end of a scientific publication, links from one World Wide Web page to another, references in a legal brief to prior judicial opinions, and conversational threads connecting postings to one another within a common UseNet group may seem completely unrelated. In each case, however, the author of one document has found it useful to cite another document. Perhaps it is because the author wishes to extend a prior scientific or legal argument, or perhaps it is to attack it. It may be to pull together disjointed Web pages into a single “home page” theme. Or the citation may be designed to quiet a bunch of “newby” newsgroup discussion participants by alerting them to a FAQ (frequently asked question) answer. In all cases, a new piece of text is being woven into the larger fabric of other texts, uniting one author’s contribution into the legacy of many others. The value citations can offer in supporting the FOA activity has been recognized by many [Pao and Worthen, 1989; Salton and Bergmark, 1979] and leads to methods that allow users to capture and organize their own bibliographic materials [Belew and Holland, 1988]. As more and more scientific publishing moves to open electronic repositories, efforts such as the Open Citation Project are leading the way toward new standards for the exchange of this important information.

At its core, a citation is a pointer from a document to a document. We typically think of a citation pointing from one document to another document of the same type: Scientific papers cite other journal articles, email messages refer to prior messages, HTML pages point to one another. But in today’s quickly changing scene, it is not uncommon to find heterogeneous forms of citation, from one document type to another, as shown in Figure 6.2.

For many publications, citations are collected at the very end of a document, in its bibliography. Often the real locus of a citation, however, is someplace earlier in the document, and many compositional styles insert an expicit bibliographic citation there. We will be interested in the citation resolution of both ends of the pointer: How accurately do we know the location of the citation in the citing paper, and how precisely is its pointer into the cited paper? Does it point to a particular paragraph, page, section, or the entire document?

see exercise 1 Exercise 1

The application of very similar citation mechanisms has been exploited in different ways in different contexts. Table 6.1 summarizes a number of dimensions across several contexts. Here we consider citations as exemplified in two particular classes of documents, generated by science and by law, which have supported social activities for a very long time. Section 6.1.5 will report on new analyses of citation patterns observed on the WWW.

6.1.1 Bibliometric Analysis of Science

Most extensive analysis of citation has been in science. Long before Newton [Newton, 1776] appreciated “standing on the shoulders of the giants that came before,” scientists realized that they need one another to advance. In some cases the reference is to arguments on which a new author builds; in other cases there is disagreement about hypotheses, data, or other facts.

The field of bibliometrics has found a great deal of interesting structure in graphs created by bibliographic citation links. That is, imagine each document in a corpus is represented by a node in a graph, and a directed edge is drawn from document dj to di, just in case dj refers to di in its bibliography.

Figure 6.3 shows this citation structure when the references are ordered by a natural, temporal feature. In any subject area, papers can be indexed chronologically, and a dot is placed at location l, j just in case document di cites document dj. Because citations can only run backward in time, this graph is upper triangular.What field’s literature is this?

As with many fields, this one begins with a small number of highly cross-linked papers in the upper-left corner. Strong horizontal and vertical stripes can also be seen against a more uncorrelated background. Horizontal lines correspond to citations of classic papers: chestnuts that everyone includes in their bibliography. Vertical stripes are papers that have much more extensive bibliographies and that stretch much farther back in time than typical; these are often referred to as review articles. Note how these semantic determinations can be derived from patterns in the syntactic facts of citation. Other inferences are also possible.

Perhaps the most common use of citation graphs is impact analysis. In terms of the bibliographic graph, a document’s importance, its effect on a field, is proportional to its in-degree: the number of citation links pointing to a document node. Price provides motivation for this measure:

Flagrant violations there may be, but on the whole there is, whether we like it or not, a reasonably good correlation between the eminence of a scientist and his productivity of papers. It takes persistence and perseverance to be a good scientist, and these are frequently reflected in a sustained production of scholarly writing. [Price, 1986, pp. 35–36]

This suggests a simple heuristic, widely used by university deans who must quickly evaluate faculty members who are up for promotion: Important authors are those with higher impact than their peers! The Institute for Scientific Information (ISI) has made an entire industry of collating bibliographic citations and inverting them. Its Web of Science product now makes hypertext navigation of this valuable information straightforward. Similar arguments can be extended to identify important academic departments, universities, even countries. This mode of analysis, used to evaluate individuals (and, by extension, departments, colleges, universities, even countries) consistently makes news when data and politics cross paths [May, 1997].

see exercise 2 Exercise 2

Finally, as mentioned in Section 5.2.5, cocitation can be used as a basis for interdocument similarity: Two documents are similar to the extent that their bibliographies overlap. Bar-Hillel has been credited with the first suggestion of using cocitation as a similarity metric between documents [Bar-Hillel, 1957; Swanson, 1988]; Henry Small, Eugene Garfield, and others have provided some of the first empirical support for this hypothesis [Small, 1973; Garfield, 1979; Garfield, 1982; Garfield, 1986].

So-called invisible colleges [Merton, 1973], connecting cliques of self-referential colleagues who are relatively independent of the rest of science, also have been identified. Beyond fully isolated cliques, higher-order structure over sets of documents can also be analyzed. We can imagine that the documents of one discipline have much higher connectivity among themselves than they do with papers in other disciplines. A new paper, whose bibliography cites papers coming from more than one discipline, can therefore be imagined to be a new cutting-edge synthesis.

Bibliometrics has also made clear many dangers in using citation data. What we might call the norm of scholarship, the average number of citations in a document, seems to be about 10 to 20 [Price, 1986, p. 161]. Some scientific disciplines rely on much longer bibliographies than others; within a discipline, idiosyncratic author variations in bibliography length are also common.

6.1.2 Time Scale

Obviously bibliographic references can only go back in time, but just how far back in time a citing author reaches also tells us something. For typical journal publications, the typical time scale of references is on the order of years. A large fraction of this time has been, to date, sensitive to the production schedules of scientific journals. As the time between publication of an author’s words and a reader’s browsing has shrunk (e.g., in fields like physics, where the Los Alamos preprints Server has become a dominant mechanism of dissemination), much of the time difference is now due exclusively to delays associated with peer review, revision, etc. This delay between the time an author is finished with a document and when it reaches its public obviously provides a lower bound for citation lag, the time between a document’s publication date and its reference by another paper’s bibliography.

But independent of production schedules, there are wide varieties in how far back citations typically reach. Price sees a deep connection between the social processes underlying various scientific disciplines. First, he distinguishes between two gross types of citation, normal aging and the immediacy effect: “a special hyperactivity of the rather recent literature” [Price, 1986, p. 164]. He also defined Price’s Index to be the fraction of documents published that are cited within five years.

Price then uses the width of the interval to distinguish between “hard” and “soft” sciences, even “nonsciences,” all based on the width of the research front within which most citations are typically made. He sees his Price’s index as “corresponding very well with what we intuit as hard science, soft science and non-science as we descend the scale” (p. 168). Using biological metaphors, with different disciplines compared to different kinds of organisms:

...pathological cases apart, it would seem that the [Price] index provides a good diagnostic for the extent to which a subject is attempting, so to speak, to grow from the skin rather than from the body. With a low index one has a humanistic type of metabolism which the scholar has to digest all that has gone before, let it mature gently in the cellar of his wisdom, and then distill forth new words of wisdom about the same sorts of questions. In hard science the positiveness of the knowledge and its short term permanence enable one to move through the packed down past while still a student and then to emerge at the research front where interaction with one’s peers is as important as the store-house of conventional wisdom. The thinner the skin of science the more orderly and crystalline the growth and the more rapid the process. (pp. 177–8)

Price also infers prescriptions from these statistics for editors of journals who are in a controlling position to influence a field:

I regard the value of this work as being not only diagnostic, but also prescriptive, so let us look in closing at what suggestions it makes for the technology of scientific information. At the personal level of the scientific writer and the editor and publisher of journals, it tells that one might well be as careful about references as we are about titles, authorships, and proper presentation of data.... For a research paper it should be exceptional to allow an author to eschew interaction with colleagues by citing too few references, and if he cited too many, perhaps he is writing a review paper rather than a research contribution. Similarly, if you want to make the field firm and tight and hard and crystalline you have to play with your peers and keep on the ball by citing their recent work. (p. 178)

6.1.3 Legal Citation

The use of citation in legal documents is interesting for a number of reasons [Rose, 1994, section 5.4.]. First, the common law tradition adjudicating legal behavior is based on arguments of stare decisis: Stand by previous decision.U.S. courts are latitudinarian! The ability to reference prior judicial opinions provides the core of many forms of documents, including the judicial opinions themselves, briefs, even legislation. It is no wonder, then, that legal prose has developed an extensive system of conventions for representing how judicial opinions relate to one another.

As with scientific papers, the fact of citation–reference by one judge to the opinion of another – is never in doubt, and so some analyses like impact analysis transfer quite directly from scientific corpora. (The fact that the publishing of our courts’ opinions is dependent on commercial interests means that precedent inflation [Brenner, 1992] can occur, however.Precedent inflation See Section 6.7.) But when referring to prior cases, the relevance of prior decisions often depends on the judge’s interpretation of the relation holding between the two opinions.

In fact, an entire industry exists within legal publishing to do nothing but elaborate the syntactic fact of reference to a prior ruling with an interpretation of the purpose for which the citation is made. This process is performed especially well by Shepard/Lexis. So critical are the arguments captured by these citations that the process of checking a prior ruling that a lawyer wishes to reference, to be sure that it hasn’t been overruled or otherwise rendered obsolete, is known as Shephardizing a case.

Table 6.2 shows the entire range of Shepard citation labels. These are broken into two categories; the first deals with the history of a case, the second with its treatment.

To make sense of this distinction, a brief digression into the purpose of legal citation is necessary. Cases, as they proceed from lower to higher courts, have basically a binary outcome: They are won or lost. There is never ambiguity as to whether a higher court agrees with or overrules the opinion of a lower court. These unambiguous statements are captured as the history of a case.

But in a common law tradition, a much larger number of citations refer to cases and decisions in those cases by other judges. The relationship between these cases and the one before the author-judge is less clear. But as Figure 6.4 makes explicit, the two cases (citing and cited) have at least two important dimensions along which they may be similar or dissimilar. First, the set of facts associated with one case may be very close to those in the other, or they may be very different. Second, the rules of law that are to be applied may be consistent between one judge and the other, or they may be contrary. Figure 6.4 shows these two dimensions and orders Shepard’s treatment labels roughly along dimensions of this two-dimensional similarity space.

6.1.4 Citations and Arguments

Expository documents are arguments: attempts to convince an audience that some set of propositions are true. Depending on the type of document and the type of audience, the argument may be more or less formally structured.

Perhaps the most structured kinds of documents are mathematical papers. Mathematical arguments typically depend on, indeed are defined by, theorem proving, which connects new propositions to previously established results. Even in the case of mathematical papers, however, it is interesting to observe the crucial role natural language text continues to play in providing a context for formal theoretical results. Readers can be persuaded that assumptions are reasonable, that mathematical definitions capture intuitive relationships, etc. Citations in a mathematical paper typically point to papers containing theorems that were used as part of the current proof.

In a scientific paper, the form of the argument often has to do with the discipline. There is often data to present, methodology to describe, and so on. Many of the citations are typically gathered in a section relating the current piece of work to prior literature. This prior literature may be referenced for at least two, very different, purposes. Most typically the prior work is used much as theorems in mathematical papers: to establish a result from which the current work proceeds. The current author points to the work of a prior author as providing validation for a proposition they both believe to be true. Another potential reason for citation is the opposite. In this case, the current author is interested in debating a position put forward by the cited work.

This polarity, reflecting whether the cited work and its conclusions are positively or negatively correlated with those in the citing work, will be termed the polarity of the citation. Here legal documents provide a good example for what might be possible in scientific writing. Every law student learns proper legal syntax for codifying references as they learn to write legal briefs, the structured memos provided to judges that are often incorporated into the ultimate opinion [Harvard Law Review Association, 1995]. One feature of legal briefs is the special syntax used to refer to statutory law (e.g., 17 U.S.C 101 refers to a section of the U.S. Code) or case law (e.g., West Pub. Co. v. Mead Data Cent., Inc., 616 F.Supp 1571) in a conventional fashion. A more interesting aspect of legal brief style is the explicit syntactic marking of the relation of the cited case to the argument being made in the brief: Reference to a supporting legal precedent is marked by cf., while potentially antagonistic arguments are marked with but cf.!

Note the importance of the syntactic localization of both the source and destination of a citation pointer to the two documents’ semantic purposes. Localization helps to anchor the author’s purpose in using the citation to the document’s larger argument. Most papers make many points and attempt to establish a number of propositions. The ability to point to particular conclusions within the cited paper is therefore important. The citing paper’s argument structure is simultaneously being extended, and so reference to a prior argument at a particular location in the current argument can be more persuasive than if all cites aren’t localized.

see exercise 3 Exercise 3


see exercise 4 Exercise 4


see exercise 5 Exercise 5

One reason citation has been less exploited in FOA applications than it might otherwise be is due to the expense of obtaining this data. In the context of the WWW, however, it turns out that the indices built by Web crawlers can be quite easily extended to capture information on cited pages, which can be easily inverted to maintain information on citing pages as well (cf. Section 8.1).

6.1.5 Analyzing WWW Adjacency

To a computer scientist, the WWW looks much like the directed graphs (digraph) we have studied in data structure classes for decades. It’s big, it’s dynamic, we have special questions about it, etc., but many of the same analyses we would apply to any digraph are good starting points for the WWW. A useful first step is to define the adjacency matrix A connecting all the documents di of the WWW:

(6.1)
(6.2)

For example, the Google search engine imagines the WWW graph as the basis of a Markov process, where the probability of jumping from one page is uniform across all of its anchor/citations.

If is defined to be this (small) probability, the stationary distribution of this Markov process provides some insight into how likely a browsing user would be to find him or herself on a particular page. NEC’s CiteSeer is another example of how useful citation information can be as part of a tool for searching computer science literature.

A more extensive analysis [Chakrabarti et al., 1998b; Kleinberg, 1998] has analyzed the WWW, looking especially for methods that extract authoritative pages from the vast numbers of other pages the WWW also contains. The first component of this method corresponds approximately to the notion of impact discussed in Section 6.1.1.

...the fundamental difficulty lies in what could be called the Abundance Problem: The number of pages that could reasonably be returned as “relevant” is far too large for a human user to digest. Thus, to provide effective methods for automated search under these constraints, one does not necessarily need stronger versions of classical information retrieval notions such as relevance; rather one needs a method of providing a user, from a large set of relevant pages, a small collection of the most “authoritative” or “definitive” ones....

Unfortunately, “authority” is perhaps an even more nebulous concept than “relevance,” again highly subject to human judgment....

We claim that an environment such as the WWW is explicitly annotated with precisely the type of human judgment that we need in order to formulate a notion of authority. Specifically, the creation of a link in the WWW represents a concrete indication of the following type of judgment: the creator of page p, by including a link to page q, has in some measure conferred authority on q. [Chakrabarti et al., 1998b, p. 2]

Second, they define hub documents to be ones that are particularly exhaustive in their reference to other pages. This is roughly analogous to review papers also mentioned in Section 6.1.1. To first approximation, authoritative pages are those with high in-degree while hubs are those with high out-degree. But Kleinberg imposes an important additional constraint: A community of hubs and authority pages must be mutually self-referential. The thinking underlying Kleinberg’s method is provocative:

Authoritative pages relevant to the initial query should not only have large in-degree; since they are all authorities on a common topic, there should also be considerable overlap in the sets of pages that point to them. Thus, in addition to highly authoritative pages, we expect to find what could be called hub pages: these are pages that have links to multiple relevant authoritative pages. It is these hub pages that “pull together” authorities on a common topic, and allow us to throw out unrelated pages of large in-degree. Hubs and authorities exhibit what could be called a mutually reinforcing relationship: a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs. [Kleinberg, 1998, p. 4]

Two quantities, x and y, are associated with each document, corresponding to how good an authority or hub, respectively, the document is, based on the adjacency matrix A:

(6.3)
(6.4)
(6.5)
(6.6)
x and y values are iteratively updated by premultiplication with the adjacency matrix or its transpose:
(6.7)
(6.8)

It is also important to renormalize these vectors to unit length after each update.

Under reasonable assumptions, this update procedure is guaranteed to converge on values with x* being the principal eigenvector of ATA and y* the principal eigenvector of AAT:

(6.9)
(6.10)

Using this notation, the similarity of documents di and dj can be conveniently measured in terms of cocitation as the i, j entry of AAT.*

While we can conceive of applying these techniques to the graph corresponding to the entire WWW, the computational time and space required still makes such an analysis intractable. Kleinberg et al. typically recommend applying this adjacency analysis to a subset of pages pulled together by a query against some search engine. In their experiments, they augment this initial hitlist with documents that either point to or are pointed to from documents in the hitlist itself, as shown in Figure 6.5.

Note that the values for authority and hub on which this analysis converges correspond to the first, primary eigenvector. Using these values to identify high hub and anchor nodes gives rise to the first, primary community of documents in the graph. Another interesting application of adjacency analysis is to consider communities other than the one corresponding to the first, largest eigenvalue. A particularly striking application of this analysis concerns bimodal queries: Consider results arising from a query ABORTION, shown in Table 6.3. After first identifying a community of pages extensively citing both pro-choice and pro-life documents, the second eigenvector 2 clearly separates pages associated with pro-choice organizations (with relatively high positive values) and pro-life (with negative values).*

Another important feature of this analysis is that it depends on only first-order adjacency information. That is, while it is always easy to find all of the documents pointed to by a target document simply by inspecting the document for its anchors, the in-neighborhood of a document can be identified through direct inspection of other documents. This means, for example, that search engine crawlers that look at every single document can, as part of their normal search, simultaneously collect this adjacency data.

6.2 Hypertext, Intradocument Links

Since the very first papyrus scrolls were used to capture written language, it has become natural to conceive of text as a single, linear, and continuous thread, authored and then read as a single stream. But as books became longer, tables of contents were prepended, indices were appended, and the opportunities for traversing the text in fundamentally nonlinear ways became more common. We are becoming interested in other kinds of documents, many of which bring their own special structures and writing conventions, for example, the abstract paragraph, introductions, and conclusions of longer papers and the “methods” sections in scientific papers. In news reporting, spiral exposition is often used; a news item is summarized in the first paragraph, then treated in more detail in the paragraphs that fit on page 1 or “above the fold” of the newspaper, and in still more detail in the body of the article.

The attempt to analyze, support, and create such nonlinear hypertext relations among documents began long before the WWW made hypertext links commonplace; Vannevar Bush’s “As We May Think” article (published in 1945!) [Bush, 1945] and Ted Nelson’s revolutionary Xanadu project [Nelson, 1987] are often mentioned as seminal works. Mice and graphical interfaces made clicking on one textual passage, to jump to another, second nature. Hypertext conferences focusing on these new issues began in the mid-1980s and have taken on new energy as the HTTP protocols and HTML authoring languages made it easy to support many kinds of intra- and interdocument relations [Conklin, 1987; Conklin and Begman, 1988; Agosti et al., 1992; Agosti and Marchetti, 1992; Bruza and Weide, 1992; Egan et al., 1991]. In the process, many types of linkages between documents have been proposed. The following sections mention some of the most common and useful, sometimes using this FOA text (self-referentially!) as examples.

6.2.1 Footnotes, Hyperfootnotes, and cf.

Footnote text embellishes a primary text. It provide a more detailed treatment of terms or concepts used in the primary text. A lexical token, typically smaller in size,* creates a correspondence between the primary text and the annotation on it.

Perhaps the most direct claim of hypertext authors is that the standard linear presentation of text does violence to the more networked way in which we naturally conceive of the concepts being discussed. At the same time, the conventional sequential flow through text mandated by printed media does a great deal to support a rhetorical argument. For this reason, this FOA text was written first as a traditional book, assuming a basically linear flow. On this linear spine, two hypertext extensions have been added (cf. Section 8.2.1).

For example, the primary purpose of the last paragraph was to move (the primary, linear course of the textbook) from a discussion of the syntactic conventions of footnoting to a consideration of communication’s ultimate purposes. Because these issues are covered in more depth in Section 8.2.1, a parenthetical cf. reference to this other section was made. The cf. relation is best viewed as an opportunity (offered by the author to a reader) to compare the two passages.cf. for confer

In this FOA text I have also chosen to distinguish between standard footnotes and hyperfootnotes, which are used to capture more extended digressions. The last paragraph allows readers with access to the CD-ROM version of FOA who are interested in the Latin etymology of the cf. token to poke into that. Hyperfootnotes include a caption, providing a clue to what additional information lies on the other end of the hyperjump; standard footnotes are referenced by a simple asterisk.

6.2.2 Hierarchic Containment

One of the most common features of all document types is that as larger and larger portions are aggregated, an explicit hierarchic structure is used to organize the text. For example, this textbook is broken into chapters, which are broken into sections and subsections, and so on. This is basically a containment relationship, with shorter passages aggregated to form larger ones, but with short textual rubrics providing useful summaries of the themes of the smaller units. Typical atomic units of text can be a few sentences [Salton et al., 1993; Salton et al., 1994], but typically they are paragraphs [Hearst and Plaunt, 1993].

A provocative picture of how the containment relation can be exploited is shown from Salton et al.’s analysis of encyclopedia text in Figure 6.6. This shows the encyclopedia text ordered sequentially around a ring. In the top figure, individual pages of the encyclopedia are treated as separate documents, and in the bottom figure larger sections are aggregated. In both cases, links between documents are created when their similarity (as measured according to their vector space representations; cf Section 3.4) exceeds a threshold. As Chapter 3 explained in detail, a vector space representation of document content is very sensitive to length. Salton investigated the use of inclusion to aggregate text into larger units (in the bottom of Figure 6.6) and various “global” versus “local” weighting schemes to manipulate the effects of length normalization. Note also how co-reference to similar topical themes can be seen at different scales within the containment hierarchy.

When considering constituent passages of the same document, it becomes possible to ask how much the topic of the prose changes as it goes from one passage to the next. Hearst [Hearst and Plaunt, 1993] analyzed how similarity (using TF-IDF weighting and cosine similarity) varies across passages. The result is the wave of Figure 6.7. Also shown in this figure are vertical bars where a human judge has determined that a topical shift has occurred.

Having isolated individual passages as part of retrieval, it becomes important to show the user this additional level of analysis as part of the retrieval of the “documents.” Figure 6.8 shows topical tiling, another visualization technique Hearst developed to highlight shifts in topical focus from one passage to the next. By increasing the resolution of document analysis, users can see which passages match particular keywords or combinations of keywords in their query.

Containment relations among documents may also be useful in supporting queries of widely varying generality (cf. Section 4.3.4). If a user issues a very broad query, it may mean he or she seeks documents with an equally broad overview or level of treatment. One reasonable hypothesis would be that broader queries should correspond to the retrieval of large sections and narrower queries to smaller sections, but only if the assumption that documents in general obey something like a uniform level of treatment. Such an assumption would not be unreasonable for the encyclopedia text considered by Salton et al., because we expect (and encyclopedia editors attempt to ensure) that there is some document about every topic.

Figure 6.9 contrasts the encyclopedia’s top-down coherent topical tiling with another (hypothetical) document distribution, generated by bottom-up organic case law. People don’t go to court unless they disagree, and so what judges must write about are contentious legal issues defining the borders of legal disputes. As courts work out a legal issue (for example, the intellectual property status of software or whether genes can be patented), we can expect many opinions to cover very similar topical ground.

6.2.3 Argument Relations

A central purpose of many documents is to persuade. Especially interesting, then, are relations among documents that reveal the arguments relating among the documents [Sitter and Maier, 1992]. The law provides several interesting examples, as shown in Figure 6.10 (taken from Rose, 1994, Figure 7.4). Rose enumerates four relations often found in legal documents, from simple reference to more elaborate logical relations such as exception.

While less explicit than within legal documents, educational materials also have implicit or explicit prerequisite relationships assumed by a text’s author. Figure 6.11 shows dependencies among this book’s chapters. In interdisciplinary papers, it is common to have several introductory sections, providing background for experts in one field who may not have the requisite background in another. When taken across many texts, such a prerequisite structure imposes a lattice, which can be exploited to help a reader/browsing user find just those components of the document that are most important to them.

6.2.4 Intra- versus Interdocument Relations

It is interesting to see how similar many conventions for relating textual passages within a single document are to those used to traditionally refer across documents. In part this reflects circumstantial features of a document’s production – whether it was printed on newsprint, archival paper, in a journal or book, or a WWW page – that will disappear as a common technology underlies them all. At the same time, integrating new technologies into the publication process makes it clear how certain features will remain true only within a single document.

This can be described in terms of a “membrane” that functionally defines what we will continue to consider the document. The defining membrane in this case has to do with a notion of authorship. The words expressed are the creation/opinion of a single author or multiple authors, each of whom has signed his or her name. For example, when one paragraph of a legal document* refers to another passage within the same document:

Except as provided in the case of certain unauthorized signatures (§8-205), lack of genuineness of a certificated security or an initial transaction statement is a complete defense....

we can be assured that the document’s authors were thinking about both passages and commit to a particular relation between them. Interdocument citations, on the other hand, reflect a new opinion about some preexisting document.

The Talmud, arguably one of the world’s first legal documents, is most commonly published as a single volume. But extensive typographical conventions (as shown in Figure 6.12) are used to isolate individual voices and relations among commentaries; a more WWW-conventional approximation is shown in Figure 6.13, taken from an HTML Version of the Talmud. In the common law tradition, the connection between implicit and explicit meanings becomes similarly fused. The “primary” text, secondary commentary on this, commentary on the commentary, and so on become fused into a single unified work written by multiple authors who may have never met.What the Talmud says about Web Sites

6.2.5 Beyond Unary About (k) Predicates

Our discussion of FOA has generally assumed that we are attempting to find the topic or topics that a document is about. In the language of logic, we can talk about this as applying a unary predicate About(k). That is, the determination that a document is about keyword k depends on only one argument, a single keyword. But no piece of writing deals with only a single topic, at least for more than a moment. Good writing connects topics. It discusses a relation among those topics.

Within the paragraph-size textual passages that have been our focus, this relational aspect of language has not been too troubling. But it should make us especially interested in the transitions in topical focus as we move from one paragraph to the next within the same document. Passage-level analysis like this was mentioned earlier (cf. Section 6.2.2).

As we move to larger documents, an additional source of information is citations from one to another. Assuming that we have characterized the topical area of each document independently, an analysis of the citations, placed by one author relating another document to the author’s topic, is an explicit indication of the relationships between these topics.

All three levels of analysis – within-paragraph relations among topics, paragraph-to-paragraph topical progression, and interdocument relations reflected by citations – are some of the clues we can exploit when suggesting potential browsing directions for users trying to FOA.

6.3 Keyword Structures

Most of what we have said about the Index relation (e.g., as part of the vector space model) assumes that keywords are simply a set of features. But beyond simply providing access to the retrieval of documents, the fact that keywords are meaning-ful objects in their own right means that we can analyze relationships among keywords directly.

Thesauri are structured representations of relations among keywords. Common relations represented in thesauri include:

  • broader term/narrower term (BT/NT); these capture hierarchic relations, generally between a kind of semantics, sometimes a part-whole relation.
  • related term (RT), capturing synonym or quasi-synonym relationships.
  • use for (UF), capturing a preferred, conventional, or authoritative term over possible alternatives.

One of the most extensive examples of such a representation is the MeSH (Medical Subject Headings) thesaurus, part of the National Library of Medicine’s extensive PubMed system.
Figure 6.14 shows the term LYMPHOMA within the MeSH thesaurus.* Example hierarchic BT/NT relations are shown as indentation. Because this thesaurus allows a single keyword to fit in multiple places in the hierarchy (e.g., treating LYMPHOMA as a kind of NEOPLASM as well as a kind of IMMUNOLOGIC DISEASE), this browser shows the term as part of three separate paths; note that the children (narrower terms) of LYMPHOMA are repeated in each location.

6.3.1 Automatic Thesaurus Construction

Before considering WordNet, another elaborate thesaurus, it is useful to relate these manually constructed representations with automatic, statistically derived analogs. Such comparisons have been part of IR research since its beginnings [Joyce and Needham, 1958; Dennis, 1964; Soergel, 1974]. Section 5.2.5 has discussed how the same information used to cluster documents can be used with keywords. The semantics of relations based strictly on cooccurrence frequencies are not obvious [van Rijsbergen, 1977] but seem to provide evidence for the RT (related term, or synonymy relation) discussed earlier.

Using this information to construct hierarchic relations among keywords corresponds to (hierarchic) clustering techniques. Thesaurus-specific techniques generally exploit a heuristic that high-frequency keywords correspond to broad, general terms while low-frequency keywords correspond to narrow, specific ones [Srinivasan, 1992]. This heuristic can be used to organize keywords into levels of a taxonomy, with the hierarchic parent/child relation formed between those keywords with similar document distributions. RelFbk can also be used to provide thesaurus structure [Guntzer et al., 1989]. Whether constructed manually or automatically, thesuarus structures support many new forms of navigation [Rada and Bicknell, 1989; McMath et al., 1989].

6.3.2 Corpus-Based Linguistics and WordNet

Linguistics has traditionally focused on the phenomena of spoken language, and since Chomsky [Chomsky, 1965; Chomsky, 1972] it has further focused on syntactic rules describing the generation and understanding of individual sentences. But as more large samples of written text have become available, corpus-based linguistics has become an increasingly active area of research. D. D. Lewis and E. D. Liddy have collected a useful bibliography and resource list on NLP for IR, and R. Futrelle and X. Zhang have collected large-scale persistent object systems for Corpus Linguistics and Information Retrieval. Stuart Shieber maintains the The Computation and Language Archive as part of the LANL reprint server.

The sophistication of syntactic analysis of computational linguistics provides a striking contrast to IR’s typical bag-of-words approach, which aggressively ignores any ordering effects. Conversely, IR’s central concerns with semantic issues of meaning and the ultimate pragmatics of using language to find relevant documents go beyond the myopic concern with isolated sentences that is typical of linguistics. The range of potential interactions between these perspectives is only beginning to be explored, but it includes the introduction of parsing techniques with IR retrieval systems [Smeaton, 1992; Strzalkowski, 1994], as well as using statistical methods to identify phrases that are a first step from a simple bag-of-words to syntactically well-formed sentences [Lewis, 1992; Krovetz, 1993; Church and Hanks, 1989; Steier, 1994; Steier and Belew, 1994a]. Another important direction of interaction is the use of IR methods across multilingual corpora, for example, arising from the integration of the European Community [Hull and Grefenstette, 1996; Sheridan and Ballerini, 1996].

From a syntactic perspective, the only way to get issues of real meaning into language is via the lexicon: a dictionary of all words and their meanings. Our present concern, interkeyword structures, becomes an issue of lexical semantics [Cruse, 1986], and it is no surprise that linguists have also developed representational systems for interword relationships. An influential and widely used example of a keyword thesaurus is the WordNet system developed by George MillerA wide-ranging psychologist and colleagues [Fellbaum, 1998].

One obvious distinction of WordNet is simply the size of its vocabulary: It contains almost 100,000 distinct word forms, divided into lexical categories as shown in Table 6.4. Central to the lexical approach to semantics is distinguishing between lexical items and the “concepts” they are meant to invoke:

“Word form” will be used here to refer to the physical utterance or inscription and “word meaning” to refer to the lexicalized concept that a form can be used to express. Then the starting point for lexical semantics can be said to be the mapping between forms and meanings.[Fellbaum, 1998, p. 4]

The relations connecting words in WordNet are similar – but not identical – to those used within thesauri. The first and most important relation is synonymy. This has a special role in WordNet, pulling multiple word forms together into a synonym set, which, by definition, all have the same meaning:

According to one definition (usually attributed to Leibniz) two expressions are synonymous if the substitution of one for the other never changes the truth value of a sentence in which the substitution is made.... A weakened version of this definition would make synonymy relative to a context: two expressions are synonymous in a linguistic context C if the substitution of one for the other in C does not alter the truth value. [Fellbaum, 1998, p.6]

The BT/NT relation in standard thesauri is refined in WordNet into two types of relations, hypernymy and meronymy. The former relation plays a dominant role, allowing inheritance of various properties of parent words by their children:

Much attention has been devoted to hyponymy/hypernymy (variously called subordination/superordination, subset/superset, or the ISA relation).... A hyponym inherits all the features of the more generic concept and adds at least one feature that distinguishes it from its superordinate and from any other hyponyms of that superordinate. This convention provides the central organizing principle for the nouns in WordNet. [Fellbaum, 1998 p. 8]

This hypernymy relation connects virtually all the words into a forest of trees rooted on a very restricted set of “unique beginners.” In the case of nouns, the top-level categories are those shown in Table 6.5; those for verbs are Table 6.6.

The final category of stative verbs is used to capture the distinction between the majority of active verbs and those (e.g., SUFFICE, BELONG, RESEMBLE) reflecting state characteristics.

WordNet also represents roughly the opposite of the synonym relation with the antonymy relation. Defining this logically proves more difficult, and Miller is forced to simply equate it with human subjects’ typical responses:

Antonymy is a lexical relation between word forms, not a semantic relation between word meanings.... The strongest psycholinguistic indication that two words are antonyms is that each is given on a word association test as the most common response to the other. For example, if people are asked for the first word they think of (other than the probe word itself) when they hear VICTORY, most will respond DEFEAT; when they hear DEFEAT most will respond VICTORY. [Fellbaum, 1998 pp. 7, 24]

The use of the antonymy relation in WordNet is particularly interesting when applied to adjectives:

The semantic organization of descriptive adjectives is entirely different from that of nouns. Nothing like the hyponymic relation that generates nominal hierarchies is available for adjectives.... The semantic organization of adjectives is more naturally thought of as an abstract hyperspace of N dimensions rather than as a hierarchical tree. [Fellbaum, 1998, p. 27]

First, WordNet distinguishes the bulk of adjectives, which are called descriptive adjectives (such as BIG, INTERESTING, POSSIBLE), from relational adjectives (PRESIDENTIAL, NUCLEAR) and reference-modifying adjectives (FORMER, ALLEGED). They then find:

All descriptive adjectives have antonyms; those lacking direct antonyms have indirect antonyms, i.e., are synonyms of adjectives that have direct antonyms. [Fellbaum, 1998, p. 28]

An example of the resulting dumbbell-shaped bipolar organization is shown in Figure 6.15.

Voorhees was one of the first to explore how WordNet data can be harnessed as part of a search engine [Voorhees, 1993].

6.3.3 Taxonomies

As their central role in WordNet suggests, the hierarchic BT/NT, or hypernymy, relations are especially important. Because of the constant analytic pressure of the Western intellectual tradition, topics continue to be refined into smaller topics, which the next generation of scholars immediately refines further. This has led to a wide range of classification taxonomies associated with various social groups of scholars and scientists.

Figure 6.16 shows several different sources from which indexing information might be obtained. The single most important source of subject indexing is the Library of Congress (LoC). Its indexing system is the basis of most large libraries because it covers all disciplines. For exactly the same reason, however, the indexers at the LoC have too much to do, and the resulting indices are admittedly crude. Partly as a response to the lack of adequate indexing structures, various professional groups have developed their own taxonomies to help organize the information within their particular technical specialty. For example, the Association of Computing Machinery has developed the ACM Computing Reviews Classification for use by its Computing Reviews publication [ACM, 1986]. This taxonomy is much more specific, and therefore more widely used by computer specialists. However, it lacks many of the advantages of the LoC system. In general, the keywords are assigned by the authors rather than by trained librarians. The system is rarely, if ever, integrated into the operations of libraries. And although the indexing structure is much more refined than that of the LoC, it is still too crude for most research currently going on in any subspecialty. This has caused some practitioners in various subspecialties to develop their own extensions. For example, David Waltz was commissioned by Scientific Data-Link to extend the ACM’s Computing Reviews taxonomy for the sub-specialty of artificial intelligence [Waltz, 1985]. Waltz’s extension is extremely refined and helpful to AI practitioners. At the same time, it is even more ad hoc, its “sponsoring institution” has less impact, and consequently it is even less well accepted within libraries.

All three of these indexing systems are examples of top-down knowledge structures. That is, they are developed by various social institutions as prescriptive languages used to represent the consensus opinion as to how information is to be organized. Such “consensual” knowledge structures are critical if individuals are to share information. Each indexing system represents a compromise between increased scope and diminished resolution. Increased scope brings along with it broader acceptance and adherence. These advantages occur at the expense of acceptance by users actively involved in technical specialties.

The central role of hierarchic BT/NT relations in organizing keyword vocabularies should make us especially concerned with a precise semantics for this relationship. Most would agree that if B is a broader term than A, then A “is a” B. But as knowledge engineers within AI have known for a long time, the ubiquitous IS_A relation admits a number of interpretations, which can support different types of inference [Brachman, 1979; Brachman, 1983]. In general, the BT/NT relation seems to correspond most closely to the “a kind of” implication relating predicates A and B:

(6.11)

Earlier generations of Internet users participated in the construction of the extensive UseNet hierarchy of discussion boards, on topics from ALT.SEX.FETISH.ROBOTS to COMP.SYS.MAC.OOP.TCL; see Section 7.5.5 for an example of the use of this hierarchy in text classification tasks. These days the most widely known taxonomies are WWW directories such as Yahoo!, whose employees have constructed a hierarchy, primarily of places to spend money. One of the most exciting recent developments is the development of collaborative classification efforts such as the Open Directory Project (DMOZ), which involves large communities of experts, each working to make sense within his or her own area of expertise.

6.4 Social Relations among Authors

Another potential source of information about documents that we can use to augment statistial keywords is “cultural” information, capturing some features of the social relationships among authors in a field. Most of our discussion of documents has treated them as if they were completely dead artifacts. But documents are written by people, authors who write from a particular perspective. When their writing is in science or the law, or any other tradition within which they participate, we can make reasonable guesses about some aspects of what it is they are trying to say.

An author’s education, in particular, offers clues as to how we can interpret their words. Work and writing in many fields requires extensive graduate education. Students are soon moved from common “core” curricula to more advanced material. Kuhn and others have analyzed the central role textbooks play as part of the social process of codifying a discipline [Kuhn, 1970]. As students move beyond common textbooks to specialized training, their approach to the problem often resembles that of their teachers (at least as long as they are around the teacher). By knowing something about the author’s education, and especially about his or her dissertation advisor, we may have a basis for interpreting the writing. The importance of dissertations as an academic resource was recognized as early as 1940 by University Microfilms Inc. (UMI), as copies of virtually every dissertation published by many universities were microfilmed. UMI (now the Information and Learning division of Bell & Howell) makes its Dissertation Abstracts corpus available for WWW searching.

6.4.1 AI Genealogy

The AI Genealogy project (AIG) is an attempt to collect information relating authors in artificial intelligence to one another through shared advisors. In analogy to genealogical family trees, we can treat the advisor as a parent to the advisee. Students of students become grandchildren, and so on. A subset of this data is shown in Figure 6.17.

As an example of how this additional information about authors can be exploited to understand more about the words used in documents, Steier has analyzed word-pair cooccurrence data as a means of finding potential index phrases [Steier and Belew, 1994b]. Take, for example, the phrase CASE-BASED REASONING. This phrase has a high degree of phrasal information content (using mutual information statistics to identify statistically dependent word pairs) when these statistics are collected across the entire AIT document corpus.

But a statistically significant different distribution is observed within dissertations coming from a particular set of universities: Yale, Georgia Tech, and the University of Massachusetts. Within these particular university contexts, the constituent concepts of CASE-BASED and REASONING are examined in detail and independently. Not only do these words occur together as CASE-BASED REASONING, but they often occur separately (e.g., CASE-BASED PLANNING, CASE-BASED ARGUMENT, CASE-BASED PROBLEM, SCHEMA-BASED REASONING, ANALOGICAL REASONING, COMMONSENSE REASONING). Within the limited context of these subcorpora, the keywords occur more independently and hence are considered less phraselike. As this phrase is “exported” into the general AI vocabulary, the semantic nuances are left behind, and the dominant use of the constituent words is as part of the phrase.

What is it that makes language use so different at Yale, Georgia Tech, and the University of Massachusetts? Perhaps it is merely coincidence, but our hypothesis is that it is the common lineage traced to Roger Schank! Work across these geographically distant research institutions is pulled together by an intellectual tradition captured by the AIG data.

Part of what is interesting about the AIT corpus is its demonstration within a single corpus of many of the representations discussed in this chapter. The AI genealogy captures some of the intellectual linkage due to the Ph.D. advisor/advisee relationship. David Waltz’s taxonomy of AI provides an excellent initial thesaurus over keywords [Waltz, 1985].

6.4.2 An Empirical Foundation for a Philosophy of Science

One advantage of studying a focused corpus like the AIT is that we have an especially good chance of understanding some of these social relations. A history of AI often begins with a seminal conference that took place at Dartmouth in 1956 [McCorduck, 1985; Russell and Norvig, 1995].Alternate histories If we treat the attendees at that meeting as “founding fathers” (in a population genetic sense!), we can attempt to track their “genetic” impact on the current field of AI.

see exercise 6 Exercise 6

In an attempt to capture other significant intellectual influences beyond the advisor, the AIG questionnaire asks for committee members other than the chairman. The role of committee members varies a great deal from department to department and from campus to campus. But all of these numbers can be expected to exert some intellectual influence. Asking for committee members is a step toward other nonadvisor influencers, and it is also a matter of record. Research institutions are another way to capture intellectual interactions among collaborators.

Even if and when the AI-Ph.D. family tree is completed, it will certainly not have captured all of what we mean by “artificial intelligence research.” For example:

  • The Dartmouth “founding fathers” probably provide (direct) lineage for a minority of AI Ph.D.s currently working in the field.
  • Ph.D. theses themselves are probably some of the worst examples of research, in AI and elsewhere. By definition, we are talking about students who are doing some of their first work. They had better improve with time!
  • Ph.D.s account for only a fraction of AI research.

Nevertheless, science is primarily concerned with accumulating knowledge, at least as much as it is about its initial discovery. A primary argument for interest in the AIG is that traditional academic relationships, as embodied by Ph.D. genealogies, form a sort of “skeleton” around which the rest of scientific knowledge coalesces. Certainly, individuals can pursue their own research program, and corporations can fund extended investigations into areas that are not represented in academia whatsoever. But it is hard to imagine extended research programs (like that “fathered” by Roger Schank, for example) that do not involve multiple “generations” of investigators; academia is almost certainly the most successful system for such propagation. Further, clear demonstrations of intellectual lineages like this promise much more concrete evidence for memes – the hypothetical analogs of biological genes in cultural systems [Dawkins, 1976] – than analyses of text alone [Best, 2000] can ever provide.

Kuhnian [Kuhn, 1970] and post-Kuhnian [Hull, 1988; Latour, 1987] analyses highlight the importance of the social aspects of science. “Paradigms” are extremely appealing constructs, but they’re also amorphous. For all of its faults, the AI-Ph.D. tree represents incontrovertible facts, just as word frequencies do.

6.5 Modes of Inference

If deduction is the process of proceeding from a set of rules to the implications of those rules, and if induction is the process of forming rules based on patterns across many examples, and if abduction is the process of forming hypotheses worthy of asking, FOA can also be viewed as an inference process: the process of forming questions that elicit desirable answers.

6.5.1 Theorem-Proving Models for Relevance

Imagine the current state of knowledge possessed by a browsing user as a set of axioms . Then we can model their information need as a question: Can we infer that a theorem is true or false from our current knowledge?

(6.12)

The fact that the user has an information need can be taken as an assumption that there must be additional knowledge, contained in some documents, that together with his of her current knowledge base does allow (or ) to be proven. Relevant documents are exactly those that support such an inference:

(6.13)

Practically speaking, of course, this model is impossible. It would demand a complete and consistent logical description of the user’s cognitive state. It also requires that the full set of all possible logical facts contained in each and every document be similarly encoded. (And then, of course, there is the minor problem of searching for the minimal set of documents that satisfy this inference!)Late Winograd

Still, this basic conception of existing knowledge being extended by new facts and new inferences becoming possible as new facts become known is a provocative metaphor at the least. Sperber and Wilson’s notion of “connected” information corresponds exactly to such new inferences (cf. Section 8.2.2).

Van Rijsbergen has talked about this strong notion of relevance as objective relevance [van Rijsbergen, p. 147]. In more recent work, he has extended this basic model to a full relevance logic, based on a four-valued semantics over 2{T,F} [van Rijsbergen, 1986].

6.5.2 Spreading Activation Search

The facts that keywords can be associated with documents or with one another, that documents become associated by citations, and so on led early IR pioneers such as Doyle and Stiles to adopt simple association as a unifying relation connecting many objects of the FOA inquiry [Doyle, 1960; Doyle, 1961; Doyle, 1962; Stiles, 1961; Maron, 1982b; Giuliano, 1963; Findler, 1979]. Integrating information across semantic networks of such associative relations has been an important example of knowledge representation within AI since the memory models of Collins and Qullian [Collins and Quillian, 1972]. In its simplest form, a simple quantity known as activity is allowed to propagate through a network like that shown in Figure 6.18 from several sources. Spreading activation search is the name for a broad range of techniques that find solutions (for example, a path from Start to Goal in Figure 6.18) by controlling the propagation of activity through associative networks like this [Preece, 1981; Salton and Buckley, 1988a; Cohen and Kjeldsen, 1987].

The Adaptive Information Retrieval (AIR) system was a prototype search engine built as part of my dissertation at the University of Michigan in the mid-1980s [Belew, 1986; Belew, 1989]. This research was one of several systems applying connectionist (neural network) learning methods to the IR search engine problem [Mozer, 1988; Bein and Smolensky, 1988; Doszkocs et al., 1990; Belew, 1987b].

Basic Representation

In AIR, each new document first causes a corresponding document node to be generated. An author node is then generated (if it doesn’t already exist) for each author of the document. Two links are then created between the document and each of its keywords (one in each direction), and two more between the document and each of its authors. Weights are assigned to these links according to an inverse frequency weighting scheme: The sum of the weights on all links going out of a node is forced to be a constant; in our system that constant is one. Figure 6.19 shows the subnet corresponding to the book Parallel Models of Associative Memory, by G. E. Hinton and J. A. Anderson [Hinton and Anderson, 1984].

The initial network is constructed from the superposition of many such documents’ representations. Most of the experiments described in this report used a network constructed from 1600 documents, forming a network of approximately 5000 nodes. This is a trivial corpus that uses relatively crude lexical analysis and keyword weighting ideas. However, AIR requires only that the initial automatic indexing assign some weighted set of tentative keywords to each document.

There is one property of the inverse weighting scheme on which AIR does depend, however. A network built using this keyword weighting scheme, together with similar constraints on the weights assigned to author links, has the satisfying property of conserving activity. That is, if a unit of activity is put into a node and the total outgoing associativity from that node is one, the amount of activity in the system will neither increase nor diminish. This is helpful in controlling the spreading activation dynamics of the network during querying.

Querying and Retrieval

Users begin a session with AIR by describing their information need, using a very simple query language. An initial query is composed of one or more clauses. Each clause can refer to one of the three types of “features” represented in AIR’s network: keywords, documents, or authors, and all but the first clause can be negated. This query causes activity to be placed on nodes in AIR’s network corresponding to the features named in the query. This activity is allowed to propagate throughout the network, and the system’s response is the set of nodes that become most active during this propagation.

The traditional result of a query is only documents. AIR also provides keywords and authors. Keywords retrieved in this manner are considered related terms that users may use to pursue their searches. Retrieved authors are considered to be closely linked to the subject of interest. There are many ways in which a user might find related terms and centrally involved authors, a valuable information product in their own right.

Figure 6.20 shows AIR’s response to a typical query:

((:TERM “ASSOCIATIVE”)(:AUTH “ANDERSON,J.A.”))

This is the network of keywords, documents, and authors considered relevant to this query. The nodes are drawn as a tripartite graph with keywords on the top level, documents in the middle, and authors on the bottom. Associative links that helped to cause a node to become retrieved (and only those links) are also displayed. Heavier lines imply stronger associative weights. AIR uses directed links, and this directionality is represented by the concavity of the arcs; a clockwise convention is used. For example, a link from a document node (in the middle level) to a keyword node (in the top level) goes clockwise, around to the left.

Actually, this is only a picture of the final state of the system’s retrieval. The network is actually drawn incrementally, with the first nodes to become significantly active being drawn first and in the middle of the pane. As additional nodes become active at significant levels, they are drawn farther out along the three horizontal axes and the links through which they became active are drawn as well. This dynamic display has at least two real advantages. First, the fact that AIR provides the first part of its retrieval almost immediately means that the user is not impatiently waiting for the retrieval to complete (typically 5 to 10 seconds in this implementation). Second, displaying the query’s dynamics helps to give the user a tangible feeling of “direct manipulation” [Rose and Belew, 1990]; the user “prods” the network in a certain place and then watches as waves of activity flow outward from that place.

RelFbk in AIR

Queries subsequent to the first are performed differently. After AIR has retrieved the network of features, the user responds with RelFbk, indicating which features are considered (by that user) relevant to the query and which are not. Using a mouse, the user marks features with the symbols: , , and , indicating that the feature was Very Relevant, Relevant, Irrelevant, or Very Irrelevant, respectively. Not all features need be commented on (cf. Section 4.1.1).

The system constructs a new query directly from this feedback. First, terms from the previous query are retained. Positively marked features are added to this query, as are the negated versions of features marked negatively. Equal weight is placed on each of these features, except that features marked Very Relevant or Very Irrelevant are counted double.

From the perspective of retrieval, this RelFbk becomes a form of browsing: Positively marked features are directions the user wants to pursue, and negatively marked features are directions that should be pruned from the search. From the perspective of learning, this RelFbk is exactly the training signal AIR needs to modify its representations through learning. This unification of learning (i.e., changing representations) and doing (i.e., browsing) was a central component of AIR’s design. It means that the collection of feedback is not an additional onerous task for the user, but rather is a natural part of the retrieval process.

Learning in AIR

Nodes marked by the user with positive or negative feedback act as sources of a signal that then propagates backwards along the weighted links. A local learning rule then modifies the weight on links directly or indirectly involved in the query process. Several learning rules were investigated; the experiments reported here used a learning rule that correlated the activity of the presynaptic nodei with the feedback signal experienced by the postsynaptic nodej:

AIR makes a most direct correspondence between the connectionist notion of activity and the IR notion of Pr(Rel) 5.5):

The activity level of nodes at the end of the propagation phase is considered to be a prediction of the probability that this node will be judged relevant to the query presented by the user.

This interpretation constrained the AIR system design in several ways (e.g., activity is a real number bounded between zero and one, query nodes are activated fully). AIR also allows negative activity, which is interpreted as the probability that a node is not relevant. The next step of the argument is to consider a link weight wAB to be the conditional probability that NodeB is relevant given that NodeA is relevant. Next, this definition must be extended inductively to include indirect transitive paths that AIR uses extensively for its retrievals.

The system’s interactions with users are then considered experiments. Given a query, AIR predicts which nodes will be considered relevant and the user confirms or disconfirms this prediction. These results update the system’s weights (conditional probabilities) to reflect the system’s updated estimates. Thus, AIR’s representation results from the combination of two completely different sources of evidence: the word frequency statistics underlying its initial indexing and the opinions of its users.

A straightforward mechanism exists to incrementally introduce new documents into AIR’s database. Links are established from the new document to all of its initial keywords and to its authors; new keyword and author nodes are created as necessary. The weights on these links are distributed evenly so that they sum to a constant. Because the sum of the (outgoing) weights for all nodes is to remain constant, any associative weight to the new document must come from existing link weights. A new parameter (*CONSERVATIVE*) is introduced to control the weight given these new links at the expense of existing ones. If the network is untrained by users, this parameter can be set to zero to make the effect of an incremental addition exactly the same as if the new document had been part of the initial collection. In a trained network, setting *CONSERVATIVE* near unity ensures that the system’s experience incorporated in existing link weights is not sacrificed to make the new connections. Also, note that the computation required to place the new document is strictly local: Only the links directly adjacent to the new document’s immediate neighbors must be changed. The major observation about the inclusion of new documents, however, is that there is an immediate “place” for new documents in AIR’s existing representation.

A second source of new information to the AIR system comes from users’ queries. If a query contains a term unknown to AIR, this term is held in abeyance and AIR executes the query based on the remaining terms. Then, after the user has designated which of AIR’s responses are relevant to this query, a new node corresponding to the new query term is created and becomes subjected to the same learning rule used for all other nodes.

Although easily incorporating new documents and new query terms is a valuable property for any IR system, from the perspective of machine learning these are examples of simple rote learning, and they are necessarily dependent on the specifics of the IR task domain. The main focus of the AIR system is the use of general-purpose connectionist learning techniques that, once the initial document network is constructed, are quite independent of the IR task.

Generalized Representation

The standard, interkeyword and interdocument associations typically evaluated as part of keyword and document clustering are part of the broader context of the associative representation used by AIR. The system extends these pairwise clustering relations and the fundamental keyword-document Index relation to include higher-order transitive relations as well. That is, if node A is associated with node B and node B is associated with node C, then node A is also considered to be associated with node C, but to a lesser extent.

Obviously, this transitive assumption is not always valid, and this may be why most IR research does not consider this extension. But AIR is an adaptive system, and one of the critical problems facing any learning system is the generation of plausible hypotheses, i.e., theories that stand a better than average chance of being correct. Transitivity is considered a default assumption, the consequences of which will be subjected to adaptations that favor appropriate transitivities and cull out inappropriate ones.

It is interesting to contrast the adaptive changes made by AIR in response to RelFbk with the document modification strategies of G. Salton and his students [Friedman et al., 1967; Brauen, 1969] mentioned in Section 4.2.7 (and returned to in Section 7.3). The query-document matches used as the basis of their changes consider only direct, keyword-to-document associations while AIR makes use of a much wider web of indirect associations. To a first approximation the changes made by AIR to direct keyword-to-document associations are not unlike those proposed by Salton and Brauen (if I’d only known!), but AIR also makes other changes, to more indirect associations.

Salton and Buckley have analyzed the spreading activation search used in some of these systems and concluded that it is inferior to more traditional retrieval methods [Salton and Buckley, 1988a; Salton and Buckley, 1988b]. They point out:

... the relationships between terms or documents are specified by labeled links between the nodes .... the effectiveness of the procedure is crucially dependent on the availability of a representative node association map. (pp. 4, 5, emphasis added)

In a weighted, associative representation of the semantics of indexing, interdocument and interkeyword clustering links are dropped in favor of a single, homogeneous associative relation. AIR treats all three types of weighted links equally. For example, if interdocument citation data had been available, this information could naturally be included as well; again the semantics of these relations would have been dropped in favor of a simple associative weight. The contrast between the use of spreading activation search in connectionist networks with its use in semantic networks is admittedly a subtle one, but it is also critically important [Rose and Belew, 1989]. One clear difference is that semantic networks typically make logical, deterministic use of labeled links, while connectionist networks like AIR rely on weighted links for probabilistic computations.

Figure 6.21 shows how many of the features discussed here can interact as part of a single retrieval system. This figure comes from Dan Rose’s SCALIR (Symbolic and Connectionist Approach to Legal Information Retrieval) system, which was built to investigate the use of both logical, “symbolic” modes of inference and probabalistic, “subsymbolic” ones. This figure shows containment relations between document elements (like those shown in more detail in Figure 6.10), topical connections between keywords, and interdocument citations, all mixed and used as part of spreading activation-based inference.

6.5.3 Discovering Latent Knowledge within a Corpus

Nothing is more frustrating than spending many hours on a technical problem, unless it is finding out later that someone else had previously solved the same problem! One of the motivations shared by many people working on search engine technology is the hope that we can reduce the number of times the same wheel is reinvented.

D. R. Swanson has concentrated on the scientific and medical literature and viewed it as “a potential source of new knowledge” [Swanson, 1990; Swanson and Smalheiser, 1997]. Each new medical report contains new knowledge about some particular disease or treatment. But Swanson has taken the next step: to imagine what modes of inference might be most appropriate across a network of such papers, each of which describes potential causal relations:

Each scientific article contributes to a web of logical connections that interlace the literature of science. Some of these connections are made explicit through references from one article to another, citations that reflect, among other things, authors’ perceptions of how their own work is related to that of others and how it fits into the scheme of existing knowledge. However, there may exist many implicit logical interarticle connections that are unintended and not marked by citations; such implicit links are the focus of this paper. (The word “logical” here is used informally; a “logical connection” is formed by statements that are related by any process of scientific reasoning or argument.)

Scientific articles can be seen as clustering into more or less independent sets or “literatures.” Within each set, common problems are addressed, common arguments are advanced, and articles “interact” by citing one another. Distinct literatures that are essentially unrelated are in general “noninteractive” in that they do not cite or refer directly to each other, have no articles in common, and are not cited together by other articles. On the other hand, if two literatures are linked by arguments that they respectively put forward – that is, are “logically” related or connected – one would expect them to cite each other. If they do not, then the logical connections between them would be of great interest, for such connections may be unintended, unnoticed, and unknown – therefore potential sources of new knowledge....

The number of possible pairs of literatures (that might be connected) increases very nearly as the square of the number of literatures; the number of possible connections increases at even a greater rate if triple or higher combinations are taken into account rather than just pairs. From this perspective, the significance of the “information explosion” may lie not in an explosion of quantity per se, but in an incalculably greater combinatorial explosion of unnoticed logical connections. (pp. 29, 35)

Swanson’s first and most well-known example of new knowledge discovered in this fashion identified fish oil as a treatment for Raynaud’s syndrome, a circulatory problem resulting in poor blood supply to the extremities. Figure 6.22 shows a second example of Swanson’s method. Beginning with a syndrome like migraine headaches, Swanson searches the literature for features mentioned in the context of migraines that have also been mentioned in a second, disjoint, “mutually oblivious” literature. Working backwards from the syndrome to be explained, a query is formed against a medical corpus like Medline. The resulting set of (in this case 63) documents is then analyzed and clustered into related sets; it is in this clustering that Swanson’s manual, intellectual effort is most obvious. The clusters are used to suggest new additional queries. Finally, the single common “cause” of magnesium is identified. Figure 6.22 also shows an attempt to classify the relationship between common phrases in the retrieved literature and the relation to the syndrome. Further, the relationship between these same phrases (or lexical variations on them!) and the common cause magnesium is also forced into structured relationships.

It is hard to imagine a more exciting prospect for the analysis of all literature. The identification of such “undiscovered public knowledge” is almost certainly possible in many other situations. The question becomes how we might algorithmically search for all of them. Note especially the liberties taken in interpreting the literature and phrases used consistently within it as they have been transformed into the structured relations of Figure 6.22. These relations are meant to suggest more formal and powerful modes of inference between causes and effects mentioned within each paper. Some of the arrowheads show suggested causal relationships; some relationships (e.g., associated_with) are neutral with respect to correlation versus causation; others (type_of) suggest hierarchic relations between classes.

Swanson is aware of these difficulties:

... [the] form and structure of logically connected arguments are in general recognizable by scientists irrespective of their specialty, a point that may have implications for research on futuristic, more fully automated systems. However, the simple structure of the syllogistic model does not in many respects reflect the depth or range of actual problems that would be encountered if one tried to build a database of logical connections....

The objective, moreover, is not simply to draw mechanistic logical inferences, but rather to determine whether certain plausible connections or hypotheses appear to be worth testing. Most articles harbor, either explicitly or implicitly, an enormous number of logical connections. Which connections are relevant and important can be determined only in the light of a specific context, hypothesis, problem, or question; but such contexts and questions are always changing. The degree to which one can hope to encode logical connections in any form suitable for all future uses may therefore be quite limited. [Swanson, 1990, p. 35]

Recognizing that even if fully automatic discovery of new facts is currently too hard, Gordon and Lindsey have investigated forms of “discovery support” [Gordon and Lindsay, 1996]. Gordon and Dumais have also explored the use of LSI techniques (cf. Section 5.2.3) as part of the literature-based discovery process [Gordon and Dumais, 1998]. The formation of an SVD, eigenstructure analysis of the relationship between a query and documents (step 2), and then the analysis of these intermediate literatures and ultimate causes (step 3) is an important extension beyond the manual investigations originally proposed by Swanson.

6.6 Deep Interfaces

Probably because interface technology did not support rich graphical presentations and pointing devices until late in the development of computer science, the human-computer interface (HCI) is often an afterthought in software engineering. This bias remains today in search engine designs, which assume the indexing and match algorithms can be separated from the presentation of results [Harman, 1992b; Rose and Belew, 1990].

Sometimes interface design is approached as a data visualization task [Veerasamy and Belkin, 1996]. Korfhage’s VIBE presentation [Korfhage, 1995] highlights a user’s and author’s perspectives on topical areas. Rather than assuming that there is one absolute preferred perspective on keywords, Korfhage considers what the words look like from the perspective of users and authors, respectively.

In terms of the vector space model, we can think of these as projections. The huge dimensional space of keywords, or even the still large reduced dimension representation, is still far more than we can visualize on a two-dimensional computer screen. We can try to impose other dimensions (e.g., color or size), but we still must pick some projected subspace of the larger data set. Norman and Schneiderman have written extensively on the design of interfaces that are deeply connected to the user’s underlying task [Norman, 1988; Schneiderman, 1992]; Marchianoni has focused particularly on interfaces for the “information seeking” task [Marchionini, 1995]. As part of the Xerox PARC group, Hearst has explored a number of visualization techniques (cf. Figure 6.8); she has also recently provided an extensive survey of interface technologies [Hearst, 1999].

6.6.1 Geographical Hitlists

Singular tokens are those proper nouns – people, places, events – that have a unique reference in the world. They are distinguished from general terms, which refer to categories of objects in the world. Distinctions like this have been a part of linguistic analysis since the beginning [Searle, 1980], and many with a background in AI will recall Ron Brachman’s CLYDE_THE_ELEPHANT example [Brachman and Levesque, 1985; Brachman and McGuinness, 1988].

In FOA the distinction initially arose out of practical considerations. The basic morphological process of folding case – Porter’s stemmer and similar tools – is designed primarily to deal with what we could, in the current context, call general terms only. Conversely, proper names (e.g., family and place names) rarely observe morphological transformation names. The capitalization that often flags singular proper nouns is thrown away rather than actually helping to ease the task of automatically identifying and parsing them.

see exercise 7 Exercise 7

It is no wonder, then, that name-tagging techniques that deal intelligently with singular tokens were an early area of search engine development [Rau, 1988; Jacobs, 1987]. Identifying the subclass of people singulars is an especially active area. Relatively small dictionaries of the “movers and shakers” of the modern world – politicians, captains of industry, artists, etc. – can provide an especially informative and commercially valuable set of additional indexing tokens in applications such as financial news services.

Chris Needham has proposed an interesting strategy for progressively applying stronger models of representation based on various classes of singulars (personal communication). Working on a representation for Encyclopædia Britannica’s editors, the procedure Needham and his group hit upon was to

  1. first describe places in the world;
  2. then describe people who live (are born, travel through, and then die) in these places; and
  3. finally to describe events involving people at locations.

Specification of one layer of terminology provided a concrete frame of reference for the next: Events involve people, who are associated with places. This suggests one argument for focusing on place-related singulars first. However, modeling even this “simplest” class of proper names quickly required even tighter focus onto physical places about which it was quite easy to give very concrete references and distinguished from political places whose names and extents can vary dramatically. As editors of the Encyclopædia Britannica, these designers were especially aware of how historically and culturally sensitive resolving political place names could be.

At least for physical locations, the emergence of global positioning system (GPS) technologies that allow users to know their position within a single, reconciled geographic frame has helped to drive a growing market for geographical information systems (GIS) software, and the development of worldwide authority lists of place names (e.g., The U. S. Board on Geographic Names (BGS) and the earlier Federal Information Processing Standards (FIPS) “Countries, Dependencies, Areas of Special Sovereignty, and Their Principal Administrative Divisions” list). Like people’s names, place information is important.

Further, human cognition has evolved to live in a three-dimensional world. We each have deep psychological commitments to basic features of our physical space and orientation with respect to a spatial frame of reference [Shepard, 1981; Kosslyn, 1980]. In contrast to all the other abstract, disembodied dimensions along which information often barrages a user’s screen, place information is special. Our experience of time is the other important experiential dimension, as demonstrated by representations like the time line. The orientation provided by such concrete frames can be critical.

Consider, for example, the query CIVIL WAR BATTLE and its conventional retrieval, as shown in Figure 6.23. We should be able to see these retrieved items in the geographical frame they naturally suggest, as shown in Figure 6.24.

Note the steps this required: First the textual hitlist was parsed for geographical tokens. Next, the map coordinates for each of these WiW entries are collected, and a convex hull (bounding polygon) for at least a majority of them is computed. Finally the map that best contains this region is identified, zoomed into, and shifted to best fit.

Within this same frame, a user also immediately knows how to draw queries, for example, restricting a search to only those battles near the East Coast or along a particular river. With modern graphical techniques, animation of these battles as a timeline slider is slid back and forth is almost trivial. But the additional power to browsing users of visualization and direct manipulation interface techniques [Rose and Belew, 1990] such as these is enormous. The important thing is that this additional functionality does not come at the expense of a much more complex interface of commands or menu items. People already know what space means, how to interpret it, and how to work within it.

6.7 FOA (The Law)

The careful reader will have noticed that there have been an unusually large number of examples drawn from the legal domain. There are a number of things it can teach us about general principles of FOA [Belew, 1987a]. The common law tradition is very old and has been connected to an organized corpus of documents in free text for a very long time. As the Doonesbury cartoon (part of Gary Trudeau’s bicentennial series) in Figure 6.25 suggests, legal documents help to demonstrate just how long prose can live beyond its drafting. Hafner was one of the first to recognize this, manually representing a wide range of attributes for a small number of documents in the legal domain [Hafner, 1978]. This work has now become a part of a larger effort within AI to model the legal reasoning process (e.g., reasoning by analogy [Ashley, 1990]). Here we are most concerned with what we can learn from lawyers who have FOA the law, which might generalize to other corpora and searchers.

First, the fact that judicial opinions have been written for so long and in such a particular “voice” has meant that it is also possible to consider special linguistic characteristics of this legal genre [Goodrich, 1987; Levi, 1982].

Second, notions of citation are especially well used within this corpus. Simple concepts like impact have been described [Shapiro, 1985], and theories of legal citation proposed [Merryman, 1977; Tapper, 1980; Ogden, 1993]. Obviously, manipulation of access to the legal printed record (for example, by controlling which judges’ opinions are made available!) has enormous political ramifications [Brenner, 1992]. This became even more true because recent consolidation of the media industry means that one or two corporations effectively control the entire process of legal publication.

Third, the backbone of the legal process is an adversarial argument. This dialectic is often explicit, for example, as marked by the cf. and but cf. citation conventions (cf. Section 6.1). The presence of such syntactic markers makes it conceivable to analyze polarization across an entire legal literature. Of course, the arguments contained in briefs and opinions have a great deal more structure than simple opposition. The analysis of legal argument structures, in conjunction with the textual foundations of common law, is perhaps the most important feature of the legal domain to FOA. On the one hand, it is possible to model individual documents and their logical features so as to reason about them [Hafner, 1978; McCarty, 1980]. Special deontic logics have been developed especially to deal with the concepts of “rights” and “obligations” that are at the heart of many legal relations [Lowe, 1985].

Statistical analyses of large document corpora may seem contrary to logical analyses of the arguments contained in each of them, and in fact this chasm runs very deep. Not only does it suggest different technology bases from the arsenal of (roughly inductive versus deductive) AI techniques, but it also reflects a tension within the law itself. Rose [Rose, 1994, p. 95] refers to a spectrum of legal philosophies ranging from “formalism” to “realism.” On the one hand, many legal documents certainly seem to function logically, with careful definitions and reasoning that Langdell [Langdell, 1887] has idealized as “mechanical jurisprudence.” At the same time, analyses such as the Critical Legal Studies [Unger, 1983] have helped to demonstrate that the law is just another social process.

It is exactly this dual nature of the law, and hence of legal texts, that makes it especially interesting as an example of FOA [Nerhot, 1991]. Individual applications include litigation support systems, which allow lawyers to search through the truckloads of documents involved in extended trials. On a much larger scale, systems like West Group’s WestLaw and Reed Elsevier’s LEXIS systems provide access to the bulk of statutory and case law to all practicing lawyers [Cohen, 1991; Bing, 1987].

6.8 FOA (Evolution)

For some time the study of evolution has demanded an especially interdisciplinary approach, so it is no wonder that profound difficulties in communication arise as scientists trained in paradigms as varied as biology, psychology, and computer science attempt to communicate with one another [Belew and Mitchell, 1996]. Now, of course, theories of evolution are increasingly informed by huge volumes of concrete data, generated by the Human Genome Project and related efforts. Serendipitously, the field of molecular biology is also one of the first (but quite certainly not the last) disciplines to undergo a qualitative change because of the WWW. The nearly simultaneous growth of the WWW and genomic databases has meant that computational biology as a science has grown up with a very advanced notion of publication. Beyond formal publication channels, even beyond informal email and discussion groups, the genomic databases at the heart of molecular biology today may point to forms of communication among scientists that are arguably, like image-based WWW traffic, post-verbal.

The flood of biological sequence data – nucleic acid, proteins, and now gene expression networks and metabolic pathways – into sequence databases, with the related flood of molecular biology literature, represents an unprecedented opportunity to investigate how concepts learned automatically from various data sets relate to the words and phrases used by scientists to describe them. Learning this linkage – between molecular biology concepts and the genomic data relating to them – can be described as annotating the data. It is now possible to learn many of these correspondences automatically, guided by the RelFbk of practicing scientists, as a natural by-product of their browsing through genome data and publications related to them. RelFbk provides a key additional piece of information to learning algorithms, beyond the statistical correlations that may exist within the genome data or textual corpora treated independently: It captures the fact that a scientist who understands both the sequence data and the journal articles deeply does (or does not) believe that a particular sequence and particular keyword/concept share a common referent. Sequences are posted, annotations are often automatically constructed based on homologous relations to other sequences found in the databases. A different variety of “sequence search engines,” specially developed to look for similarities among sequences rather than among documents, has become the basis for retrievals. These retrievals can – and often do – connect the work of one scientist to that of another without a single verbal expression passing between them.

Figure 6.26 sketches the basic relations. On the bottom are the most fundamental classes of molecular data, namely, gene and protein sequences. On the top is a set of scientific documents, such as those found in MEDLINE. The primary relation connecting the raw genetic data and textual corpora are annotation links that scientists have (manually) established between articles and sequences that are both significant and useful. They are significant because they help to establish the construction of the genome as a piece of the scientific enterprise, linking it to the traditions of academic publication. They are also useful to many scientists who, for example, are interested in a particular gene or protein and want to find out all that others might know about it. But annotation is not done consistently by all participating scientists, nor has a precise semantics for what exactly an annotation should mean been established. The Entrez interface to MEDLINE makes it convenient for a user with a particular sequence in mind to find its corresponding publication, and vice versa. Together with the MESH thesaurus of medical terms (cf. Section 6.3), these features make the National Library of Medicine’s resource one of the most advanced on the WWW.

In addition to expediting the searches of scientists and doctors, the identification of significant patterns in one modality (i.e., in text or in sequence data) can be used to suggest hypotheses in the other (similar to suggestions made by Swanson (cf. Section 6.5.3). Also shown in Figure 6.26 are Sim arcs relating “similar” data. In the case of genetic or protein sequence data, these similarity measures are typically based on a notion of “edit distance” generated by string-matching tools such as BLAST and FastA, but the investigation of new methods for this problem is one of the most active areas within machine learning (cf. [Glasgow et al., 1996]. The investigation of interdocument similarities has been an important problem within the field of information retrieval (IR) for many decades. Most document similarity measures are based on correlations between “keywords” contained by pairs of documents, but other methods (e.g., those based on a bibliometric analysis of shared entries in the documents’ bibliographies) have also received considerable attention.

6.9 Text-Based Intelligence

Knowledge representation has always been a central issue for AI, and as a subdiscipline within computer science its primary contribution is probably the beginnings of a computational theory of knowledge. Although it is still too early to speak of such a theory, some key aspects of good knowledge representation are becoming clear [Belew and Forrest, 1988].

The text captured in document corpora was not entered with the intention of being part of a knowledge base. These are documents written by someone as part of a natural communication process, and any search engine technology simply gives this document added life. Alternatively, we can say that the document was intended to become part of a “knowledge base,” but one that predates (at least the AI) use of that term: People publish their documents with the explicit hope that their ideas can become part of our collective wisdom and be used by others.

Note the ease with which an author–as–knowledge engineer can express his or her knowledge. Hypertext knowledge bases are accessible to every writer. In this view, hypertext solves the key AI problem of the knowledge acquisition bottleneck, providing a knowledge representation language with the ease, flexibility, and expressiveness of natural language – by actually using natural language! The cost paid is the weakness of the inferences that can be made from a textual foundation: Contrast the strong theorem-proving notions of inference of Section 6.5.1 with the many confounded associations that arise in Swanson’s analysis of latent knowledge in Section 6.5.3.

Grounding Symbols in Texts

According to Harnad’s grounding hypothesis, if computers are ever to understand natural language as fully as humans, they must have an equally vast corpus of experience from which to draw [Harnad, 1987]. We propose that the huge volumes of natural language text managed by hypertext systems provide exactly the corpus of “experience” needed for such understanding. Each word in every document in a hypertext system constitutes a separate experiential “data point” about what that word means. The exciting prospect of using search engines as a basis for natural language–understanding systems is that their understanding of words, and concepts built from these words, will reflect the richness of this huge base of textual “experience.” There are, of course, differences between the text-based “experience” and first-person, human experience, and these imply fundamental limits on language understanding derived from this source.

In this view, the computer’s experience of the world is secondhand, from documents written by people about the world and subsequently through users’ queries of the system. The “trick” is to learn what words mean by interacting with users who already know what the words mean, with the documents of the textual corpus forming the common referential base of experience.

The hypertext itself is in fact only the first source of information, viz., how authors use and juxtapose words. The second, ongoing source of experience is the subsequent interactions with users, a new population of people who use these same words and then react positively or negatively to the system’s interpretation of them. Both the original authors and the browsing users function as the text-based intelligent system’s “eyes” into the real world and how it looks to humans. That insight is something no video camera will ever give any robot.


TERMS INTRODUCED IN THIS CHAPTER

active verbs (214)
activity (224)
adjacency matrix (195)
annotation (245)
antonymy (215)
argument structures (242)
associative relation (232)
authoritative (196)
authority lists (238)
authorship (207)
bag-of-words (212)
bibliometrics (187)
bipolar
categories (237)
citation lag (190)
citation resolution (186)
classic papers (188)
cocitation
collaborative classification (219)
community (197)
connectionist (225)
containment (201)
conversational threads (185)
convex hull (239)
corpus-based linguistics (212)
deontic logics (242)
descriptive adjectives (216)
direct manipulation (239)
draw queries (239)
extensional (182)
fact extraction (183)
general (237)
general terms (237)
geographical information systems (238)
global positioning system (238)
homologous (244)
hub (196)
hyperfootnotes (201)
hypernymy (214)
immediacy effect (190)
impact (196)
impact analysis (188)
in-degree (188)
inheritance (214)
invisible colleges (189)
knowledge acquisition bottleneck (246)
knowledge representations (183)
legal briefs (195)
lexical semantics (213)
lexicon (213)
magnesium (234)
Markov process (196)
memes (196)
meronymy (214)
name-tagging (237)
norm of scholarship (189)
normal aging (190)
objective relevance (224)
phrases (213)
physical places (238)
polarity (194)
political places (238)
post-verbal (243)
postsynaptic (229)
prerequisite (205)
presynaptic (229)
Price’s Index (190)
reference-modifying adjectives (216)
related terms (227)
relational adjectives (216)
relevance logic (224)
research front (190)
review articles (188)
rhetorical argument (201)
rubrics (201)
semantic networks (224)
shephardizing (192)
spiral exposition (200)
spreading activation search (224)
stationary distribution (196)
stative verbs (214)
synonym set (214)
synonymy (214)
time line (239)
top-down knowledge structures (218)
topic (203)
topical tiling (204)
WWW directories (220)
Back to TOC