2 | ![]() | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Extracting Lexical Features2.1 Building Useful ToolsThe promise offered by Chapter 1 is that many real-world problems can be viewed as instances of the FOA problem. The proof is to be found in concrete code a relatively small technology base that will prove useful in a wide array of applications. In this chapter we will present a suite of software tools that together build a search engine for a wide variety of situations. Source code is provided so that these tools can be easily modified for applications of your own. We will work through two different examples of IR systems, in order to demonstrate how slight variations of the same basic code can handle both. Compared to the broad generalities of Chapter 1, the technical details of this chapter will sound a very different tone. Describing a complex algorithm requires the specification of many, sometimes tedious, details. To make the software executable on machines that are likely to be available to you, the details are provided for several operating environments. But the processor speeds, internal memory, and hard disk sizes available on computers are changing dramatically each year, so many of the assumptions on which these routines are based will require constant reevaluation. We will develop the software tools in three phases. The first phase will convert an arbitrary pile of textual objects into a well-defined of documents, each containing a string of terms to be indexed. The second phase involves building efficient data structures to invert the Index relation so that, rather than seeing all the words contained in a particular document, we can find all documents containing particular keywords. All of these efforts are in anticipation of the third and final phase, which matches queries against indices to retrieve those that are most similar. These three major phases are central to building any search engine. This chapter will be most concerned with the first two phases, which together extract lexical features. Our goal will be the extraction of a set of features worthy of subsequent analysis. As in any cognitive science, the specification of an appropriate level of analysis whether it is the resolution and depth of an image, the subphonemes of continuous speech, the speech acts of language, or something else the specification of this atomic feature set is the first important step. This will involve a great deal of work, much of it unpleasant except to those who enjoy designing efficient algorithms and data structures (some of us actually do enjoy this!:). The promise is that we will, as a consequence of good software design, develop useful tools that allow us to spend the rest of our time exploring interesting features of language. 2.2 Interdocument ParsingThe first step is to break the corpus an arbitrary pile of text into individually retrievable documents. This demands that we be specific about the format of the corpus and that we decide how it is to be divided into individual documents. For all operating systems we will consider, this problem can be defined more precisely in terms of paths, directories, files, and positions within files. For any application in which the corpus can be described by the path to its root, these tools will translate directories, files, and documents-within-files into a homogeneous corpus. Of course, there are some situations (e.g., when documents are maintained within a database) that cannot be captured in these terms, but these primitives do allow a wide range of corpora to be specified. Our model will assume that many documents may be contained within a single file and that each document occupies a contiguous region within the file. Issues concerning structure within a single document are closely related to assumptions we may or may not be able to make about the length of the documents in question. Our assumptions about how long a typical document is will recur throughout this book. It is obvious, for example, that different document browsers are necessary if we need to browse through an entire book rather than look at a single paragraph. Less obvious is that the fundamental weighting algorithms used by our indexing techniques will depend very sensitively on the number of tokens contained in each document. In this textbook we will focus primarily on two particular test corpora, AI theses (AIT) and email; these are discussed in more detail in Section 2.4. Each of these has natural notions of the individual document: In the case of the AIT it is the thesiss abstract, and for email it is the entire message. In both cases, more refined notions of document (the individual paragraphs within the abstract or within the email message) are possible. With these assumptions, we can define our corpus simply with two files: one specifying full path information for each file, and a second specifying where within these files each message resides. A large portion of the task of navigating a directory full of files and visiting each of them can be accomplished using the dirent.What is dirent? This utility allows the recursive descent through all directories from a specified root, visiting every file contained therein. In many cases, the files we will be indexing have a great deal of syntactic structural information above and beyond the meaningful text itself. For example, our email will often contain a great deal of mail header information, as (loosely:) specified in RFC822.What is RFC822? Many text-formatting languages, for example, The basic data elements to be parsed from our two examples, email and AIT, are shown in Figure 2.1. 2.3 Intradocument ParsingHaving now focused our attention on a particular file and on the beginning and ending locations within that file associated with a particular document, we can consider this file segment simply a stream of characters. Reading each and every character of each and every document, deciding whether it is part of a meaningful token, and deciding whether these tokens are worth indexing will be the most computationally intensive aspect of the indexing chore; this is our inner loop. For that reason, we will devote some real attention to making this lexical analysis as efficient as possible. Several general criteria will shape our design. First, because we are assuming that our textual corpus is very large, we will do our best to avoid duplicating this primary text. That is, we will attempt to deal with the text in situ and not make a second copy for use by the indexing and retrieval system. Thus, we will be creating a system of pointers into locations within the corpora directories and files. A wide range of alternative designs are possible even at this early stage, and so we desire as much flexibility as possible in the specification of the lexical analyzer. A lexical analyzer generator, such as the lex tool in Unix, allows the specification of very complicated lexical analyzers for very elaborate languages. The fundamental representation used by all such algorithms is a finite state machine, like that shown in Figure 2.2. This simple representation breaks the set of possible characters coming from a text stream into classes (drawn as circular states), with transitions from one state to the next on the occurrence of particular characters. By careful construction of the sets of characters (e.g., white space characters corresponding to state 0 in Figure 2.2), arbitrary text sequences can be handled very efficiently. For our two example corpora and many other situations, the stream of characters, a straightforward analysis in terms of a simple finite state machine, will suffice. We will depend on a utility written by Christopher Fox [Fox, 1992]. This utility simultaneously achieves two critical goals. First, the lexical analyzer tokenizes the stream of characters into a sequence of wordlike elements. At first blush this seems straightforward: A token is anything separated by white space, where the standard definition of white space is used. But what about hyphens? Should the hyphenated phrase DATA-BASE be treated as two separate tokens or as a single one? Should a file name, like WINDOWS.EXE be treated as a single token? Which host, directory, and file elements in a full URL like www.cs.ucsd.edu/~rik are to be kept intact as individual tokens? More elaborate elements such as these can quickly demand the sophistication of a tool like lex. The presence of digits among the alphabetic characters presents another problem. Are numbers to be allowed as tokens? Perhaps we only want to allow special numbers (e.g., 1776, 1984, 2001, 3.14159). Perhaps we want to use rules similar to those for programming language identifiers and require that a token begin with an alphabetic character, which may then be followed by numbers or letters. We must also worry about the case of the characters at this earliest lexical analysis stage. Are we to treat capitalization as significant in distinguishing tokens from one another? An enormous reduction in vocabulary size is possible if we fold case so as to treat upper- and lowercase characters interchangeably. But of course then we have also precluded the possibility of many proper name analyses that may be useful for identifying singular people, places, or events (see Chapter 6). In some cases the semantics of the documents make decisions about case automatic. For example, if the documents are program source files, the language in question may or may not treat differences in case as significant. 2.3.1 Stemming and Other Morphological ProcessingFrom the perspective of linguistics, many of the early design issues we address are considered morphological transformations of language, i.e., an analysis of what we can infer about language based on structural features. As discussed briefly in Chapter 1, the arbitrary way in which white space may or may not separate tokens whose meanings are interdependant (e.g., recall the German word GESCHWINDIGKEITS-BESCHRANKUNG and English phrase SPEED LIMIT example) will make us interested in phrasal units of indexing as well. In many Asian texts, the relationship between characters and words is quite radically altered. The Kanji alphabet and Unicode standards help to define the problem but bring biases of their own [Fujii and Croft, 1993]. For now we will focus on one of the most common morphological tricks, stemming. Stemming is a direct attempt to remove certain surface markings from words to reveal root form. Beyond deciding which characters are to be combined into tokens, Chapter 1 discussed how important it can be to use a tokens root form as an index term: We can hope that our retrieval is robust even when the query contains the plural form CARS while the document contains the singular CAR. Linguists would say that the number feature (whether a noun is singular or plural) is morphologically marked. Linguists also distinguish between inflectional morphology like plurals, and derivational morphology, which can change a words syntactic category (e.g., changing the noun PRODUCT to the verb PRODUCTIZE) and meaning more radically. In stemming, suffixes are dropped. Even in the simple case of plural endings, it isnt as simple as removing ss. Consider: WOMAN/WOMEN Conversely, we cannot assume that every time there is an ending s we can remove it; stemming CRISIS and CHESS to CRISI and CHES would damage their meaning. The most common approach to this problem [Fox, 1992] is to identify more elaborate patterns over character sequences that reliably pare tokens down to their root forms. A broad range of such patterns can be defined in terms of a context-sensitive transformation grammar. For example: Rule 2.1 (.*)SSES Rule 2.1 says that strings ending in -SSES should be transformed by taking the stem (i.e., characters prior to these four) and adding only the two characters SS. Rule 2.2 says that stems containing a vowel and ending in -ED should be transformed to leave only the stem; Rule 2.3 says that stems containing a vowel and ending in -Y should be transformed to the stem with an I replacing the Y.* A complete algorithm for stemming involves the specification of many such rules and a regime for handling conflicts when multiple rules match the same token. An early and influential algorithm due to Lovins [Lovins, 1968] specified 260 suffix patterns and used an iterative longest match heuristic. This means that first preference is given to the pattern (left-hand side of the grammar rule) that matches the most characters in a target token (because this prefers more specific matches over shorter, more generally applicable ones); then rules are iteratively reapplied until no other rules match. The Porter stemmer [Porter, 1980] (included as part of the FOA software) is a simplified version of Lovins technique that uses a reduced set of about 60 rules and organizes them into sets, with conflicts within one subset of rules resolved before going on to the next. In fact, if only the first set of rules in Porters stemmer (focusing exclusively on plurals and the most straightforward suffixes like -ED and -ING) is used, the result has been called weak stemming [Walker, 1989]. A key advantage of all such rule-based grammatical representations of the stemming process (and of efficient implementations of them, such as that provided by Fox) is that modifications to the rules and to ordering among the rules can be accomplished by changing the grammar rather than by endless ad hoc hacking (ad hacking?:) in response to particular character sequences. The use of any stemmer obviously reduces the size of the keyword vocabulary and consequently results in a compression of the index files. Such compression can vary from 10 to 50 percent, depending on the total size of the keyword vocabulary and how aggressive (e.g., how many suffix rules are used) the stemmer is. The primary effect of stemming, however, is that two keywords that were once treated independently are considered interchangeable. Stemming is therefore an example of a recall-increasing operation because it will cause a keyword used in the query to match more documents. The fundamental problem with any stemming technique, of course, is that the morphological features being stripped away may obscure differences in the words meanings. For example, the token GRAVITY has two word senses, one describing an attractive force between any two masses and the other having to do with a serious mood. But once the word GRAVITATION has been stemmed, we have lost the information that might constrain us to the first interpretation. Krovetz [Krovetz, 1993] considers several more sophisticated approaches to keyword morphology, including augmenting Porters stemmer with a dictionary that is checked after each phase of stemming rules has been applied. 2.3.2 Noise WordsFrom the earliest days of IR (e.g., Luhns seminal work [Luhn, 1957]), two related facts have been obvious: First, a relatively small number of words account for a very significant fraction of all texts bulk. Words like IT, AND, and TO can be found in virtually every sentence. Second, these noise words make very poor index terms. Users are unlikely to ask for documents about TO, and it is hard to imagine a document about BE.But sometimes we care about noise words! Due then to both their frequency and their lack of indexing consequence, we will build the capability of ignoring noise words into our lexical analyzer. As will be discussed extensively in Chapter 3, noise words are often imagined to be the most frequently occurring words in a corpus. One problem with defining noise words in this way is that it requires a frequency analysis of the corpus prior to lexical analysis. It is possible to use frequency analyses from other corpora, assuming that the distribution of noise words is relatively constant across corpora, but such an extrapolation is not always warranted. Worse, the most frequently used words often include those that might make very good keywords. Fox notes that the words TIME, WAR, HOME, LIFE, WATER, and WORLD are among the 200 most frequently used words in general English literature [Fox, 1992, p. 113]. Instead, we will define noise words extensionally, in terms of a finite list or negative dictionary. The list we use, STOP.WRD, was derived by Fox from an analysis of the Brown corpus [Fox, 1990]. The relationship between these noise words and those words most critical to syntactic analysis of natural language sentences is striking. Note that the same tokens that are thrown away as noise because they have no meaning are precisely those function words that are most important to the syntactic analysis of well-formed sentences. This is the first, but not the last, suggestion of a fundamental complementarity between FOAs concern with semantics and computational linguistics concern with syntax. 2.3.3 SummaryWe have described the lexical analyzer in terms of the job it must do processing every document in the corpus, because this task confronts us first. But because our central task will be to match these documents against subsequent users queries, it is critical that the identical lexical analysis be performed on the queries. This creates several implementation constraints (e.g., that the same code libraries are available to the indexer and to the query processing interface), but these are minor. If the query language is designed to support any special operators (e.g., Boolean combinators, proximity operators), the querys lexical analyzer may accept a superset of the tokens accepted by the documents analyzer. In any case, it is imperative if queries and documents are to be matched correctly that the same lexical analysis be applied to both streams. Using an identical code library is the easiest way to ensure this. It may seem nonsensical to worry so much about processing each character efficiently, when we assume that some other previous process has already identified each interdocument break doesnt such processing require the same computational effort, and, if so, doesnt this make our current efficiency worries moot? Perhaps. A conclusive answer depends on many architecture and operating system specifics. There are two reasons we have made such assumptions. The first is that the practicalities of delivering the FOA corpora and code currently make this convenient. But the more serious reason is that the most theoretically and intellectually interesting questions involve analysis of operations downstream from the first stages of interdocument parsing: how to identify tokens, how to count them, etc. If these latter operations are made especially efficient, it means we can afford to do more experimentation, more playfully. For a text, that is the primary concern. 2.4 Example CorporaIn these experiments, and the rest that follow, we will consistently use two example corpora. The first of these we will call the Artificial Intelligence Thesis (AIT) corpus. This is approximately 5000 Ph.D. and Masters dissertations and abstracts. Virtually every dissertation published within the last 30 years has been microfilmed by University Microfilms, Inc. (UMI).More about AIT origins The corpus is a fairly exhaustive set of theses classified as AI by UMI from the years 1987 to 1997. A histogram of the theses distribution by year is shown in Figure 2.3. We will focus on a handful of characteristics of each thesis:
For now, we will lump these attributes into two categories: textual fields and structured attributes. Structured attributes are ones for which we can reason more formally, using database and artificial intelligence techniques. For now, we will concentrate on only the textual fields. The abstract will be the primary textual element associated with each thesis, while its title (also a textual field) will be used as its proxy a synopsis of the thesis that conveys much of its meaning in a highly abbreviated form. Proxies will prove very important surrogates for the documents (for example, when users are presented with hitlists of retrieved documents). The second corpus we will study could not be provided on the CD because you must provide it yourself; it is all of your email. Email is now a fundamental form of transient, nearly immediate communication for many, but the resulting permanent record of these conversations can also be treated as a static, long-lived type of literature [Belew and Rentzepis, 1990]. We will assume that with disk storage as cheaply available as it is today, you at some point began to collect email.How do I index my email? Typically, some of this will be email others have sent you, but you may have also kept a copy of all of your outgoing email. Many email clients support on-the-fly segregation of email into separate folders. Minimally, this means that our procedures for indexing email must be capable of traversing elaborate directory structures. Later, we will also consider the use of this user-generated structure as a source for learning; cf. Section 7.4. The directory in which you have filed an email message is one feature we may associate with each message; whether it is an incoming or outgoing message is another. But of course email also has many structured attributes associated with it, in its header. These include:
In general, we will put off all consideration of structured attributes associated with documents until later. For now, simply note the many parallels between our two example corpora: Both have well-defined authors, well-defined time-stamps, and excellent and obvious candidates for proxy text. 2.5 ImplementationThe range of potential implementations of the basic techniques discussed in this chapter and subsequent ones is quite remarkable. Each depends on features of the specific application, available hardware, and so on, such as:
Design decisions depend on features such as corpus size, available memory, and query response time. Two implementations have been developed to accompany this textbook, an earlier one in C and a more recent one in Java; see the FOA Web site for details. 2.5.1 Basic AlgorithmWe now assume that:
Then the basic flow of what we will call the postdoc function operates as follows (see Algorithm 2.1).
For every document in the corpus we will iterate through a loop until weve exhausted every token in that document. So lets call getNonNoiseToken a routine that repeatedly builds tokens from the documents stream, does whatever character assessments are required, checks it against a negative dictionary, and returns a token. If stemming is to be applied, well stem the word at this point. Then we will save a posting for that tokens occurrence in that document. A posting is simply a correspondence between a particular word and a particular document, representing the occurrence of that word in that document.* That is, we have a document in front of us and it contains a set of tokens. We are now going to build a representation for each token that tells all of the documents in which ones it occurs. For each keyword we will maintain the token itself as the key used for subsequent access and the head of a linked list of all postings, each containing the document number and the number of occurrences of the keyword in that document. A sketch of these data structures is shown in Figure 2.4. After going through every document in the corpus in this fashion, we have a large collection of postings. Here we recommend splay trees as an appropriate data structure for these keywords and their postings. In the C implementation shown in Algorithm 2.2, the InstallTerm() function inserts a new posting into the Terms tree.Implementation details
During the processing of each document, it will prove important to know how many keywords are extracted from it. This will be known as the documents length, denoted lengthd; this quantity is important when normalizing documents of different lengths. One way to implement this computation is to maintain a small separate file doclend.d containing only this one number for each document. When the set of documents has been exhausted, we need to write out this inverted representation to a file for subsequent processing. For every token in the splay tree (typically the traversal will be in lexicographic order), we will organize all its postings. First, we count the number of occurrences of the keyword across all the documents in the corpus; we will call this variable totfreqk. A second, less obvious statistic we will maintain is how many documents contain this keyword; this variable will be called docfreqk. If there is exactly one occurrence of a keyword in each document, then these two numbers will be the same. But typically there are multiple occurrences of the same keyword in a single document and totfreqk > docfreqk. Both variables will be important to us in determining appropriate weights for the Index relation (cf. Chapter 3). After going through all of the documents and accumulating for each these two statistics, we must sort the postings in decreasing frequency order. The reason for this wont be apparent until we discuss the matching algorithms (cf. Section 3.5), but it turns out to be important that documents that use a keyword most often are at the beginning of the list. Once the documents postings have been sorted into descending order of frequency, it is likely that several of the documents in this list will have the same frequency, and we can exploit this fact to compress their representation. Figure 2.5 shows the POSTING list broken into a list of FPOST sublists, one for unique frequency count. 2.5.2 Fine PointsChanging Indices for Dynamic CorporaOne reason to keep raw frequency counts in the inverted keyword file used by our experimental implementation is that this provides maximum flexibility as we consider various keyword weighting schemes. But there is another reason these raw statistics are useful in real applications. It is often important to be able to update a corpuss index as documents are added to or deleted from it. Retention of raw keyword frequency information allows these statistics to be updated as our corpus changes. Adding a new document simply requires that it be analyzed (as outlined earlier), simply incrementing existing counters for each keyword.Implementation details Similarly, deletion of documents from an index exploits the full text of the document itself to identify all keywords it contains. For each keyword then, posting counts are simply decremented.* Posting ResolutionTypically we need only keep track of which document contains the posting. But an important element of many query languages is proximity operators, which allow users to specify how close two keywords must be (adjacent words, within the same sentence, within the same paragraph, within a k-word window of one another, etc.). To support such queries, we may also be concerned with recording higher resolution posting information than which document it is in. For example, many systems retain the exact character position of the (beginning of the) keyword. Figure 2.6 shows the elaborate data structure used by the STAIRS IR system.More about STAIRS In addition to very high-resolution postings, this representation supports other query attributes (e.g., security).Proximity searching with low-resolution posting information Emphasizing Proxy TextThe fact that keyword tokens occur in both the proxy text and the main text of the document gives us the opportunity to treat them differently. For example, we can emphasize the importance of words used in the proxy over those occurring in the raw text. This would be sensible if we believed that those occurrences in, for example, the subject of a message or the title of a dissertation, are better characterizations of a document than words picked from the text of the abstract or the text of the email message. In our code, this emphasis will be controlled by an integer variable EmphProxy, which notes occurrences of keywords in the proxy by doubling (EmphProxy = 2) or tripling (EmphProxy = 3) the keyword counters for proxy text. Document NumberBecause we have made the first stage of our processing flexible with respect to how a corpus extends across multiple files in general, two numbers will uniquely identify each of your documents: its file number and the document number within that file. For that reason, and because each posting must retain a unique identifier for each document, it becomes important to construct a single number that folds them together. Maintaining a single integer, instead of two integers, therefore becomes a worthwhile space-saver. One simple way to accomplish this is to multiply the documents file number by some number larger than the maximum number of documents within any file, and then add its document number. Just how large a number this must be and whether your machine/compiler efficiently supports integers this large (or whether you are better off keeping the two numbers separate) will vary considerably. For this reason it makes good sense to isolate these issues in a separate routine. Dependencies on Document TypeThe process of indexing has been idealized as having a first stage, where we worry about what kind of document it is (e.g., whether its a thesis or an email message), and then assuming that subsequent processing is completely independent of document type. Like all software designs, this idealization breaks down in the face of real data. Consider email messages. One common element of these documents is quoted text from another email message. Often this is marked by a > prefix, as shown in Figure 2.7. The role of interdocument citations like this is considered in depth in Section 6.1, but for now, a reasonable design decision is that all text should be indexed only once. This is especially appropriate if we have both the original email message and the quoted version of it; we might want to elide (ignore) quoted lines. Other software designs are possible, but the easiest way to implement this is to check for quoted lines within the postdoc routine; if the first character of a line is a greater-than symbol, dont do any of the subsequent processing. Dont check it against noise words, dont stem, dont index, and dont install it in the term tree. Unfortunately, this creates precisely the kind of email-specific processing that should be avoided by well-engineered software. 2.5.3 Software LibrariesAs much as possible, we will depend on standard libraries for some of our basic utilities. In particular, it is recommended that you use:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
![]() | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||