Text Mining

Mark Sharp

msharp@scils.rutgers.edu

Rutgers University, School of Communication, Information and Library Studies

final term paper for 16:194:610

Seminar in Information Studies, Prof. Tefko Saracevic

11 December 2001

 

Abstract

The general idea of text mining – getting small "nuggets" of desired information out of "mountains" of textual data without having to read it all – is nearly as old as information retrieval (IR) itself. Currently text mining is enjoying a surge of interest fueled by the popularity of the Internet, the success of bioinformatics, and a rebirth of computational linguistics. It can be viewed as one of a class of nontraditional IR strategies which attempt to treat entire text collections holistically, avoid the bias of human queries, objectify the IR process with principled algorithms, and "let the data speak for itself." These strategies share many techniques such as semantic parsing and statistical clustering, and the boundaries between them are fuzzy. Therefore in this paper several related concepts are briefly reviewed in addition to text mining proper, including data mining, machine learning, natural language processing, text summarization, template mining, theme finding, text categorization, clustering, filtering, text visualization, and text compression. Current text mining systems per se appear to be fairly primitive, but to have the following goals which may serve as a useful definition to distinguish text mining from other IR concepts: (1) to operate on large, natural language text collections; (2) to use principled algorithms more than heuristics and manual filtering; (3) to extract phenomenological units of information (e.g., patterns) rather than or in addition to documents; (4) to discover new knowledge. Interest in text mining for biomedical research purposes is especially pervasive and can be viewed as a major new frontier in bioinformatics. Text mining systems designed for use with science and technology text databases such as MEDLINE currently seem to have an undue emphasis on expert human filtering which contradicts goal (2). Whether this represents premature surrender to difficulty or a necessary temporary expedient remains to be seen.

 

Why Text Mining?

It has become a cliché to describe information space and the challenge of navigating it in dramatic, even histrionic terms ("explosion," "avalanche," "flood," and the like), especially with regard to scientific, technical, and scholarly literature. We moderns may like to think we are the first to face this problem, but scientists have always complained about keeping up with their literature (Saracevic, 2001). The promise of better science through better information technology has been a major theme in information science since Vannevar Bush (1945) proposed his famous Memex machine to deal with the "growing mountain of research."

Text mining is data mining applied to textual data. Text is "unstructured, amorphous, and difficult to deal with" but also "the most common vehicle for formal exchange of information." Therefore, the "motivation for trying to extract information from it is compelling – even if success is only partial …. Whereas data mining belongs in the corporate world because that's where most databases are, text mining promises to move machine learning technology out of the companies and into the home" as an increasingly necessary Internet adjunct (Witten & Frank, 2000) – i.e., as "web data mining" (Hearst, 1997). Laender, Ribeiro-Neto, da Silva, and Teixeira (2001) provide a current review of web data extraction tools.

Text mining is one of a class of what I will call "nontraditional information retrieval (IR) strategies." The goal of these strategies is to reduce the effort required of users to obtain useful information from large computerized text data sources. Traditional IR often simultaneously retrieves both "too little" information and "too much" text (Humphreys, Demetriou, & Gaizauskas, 2000). The nontraditional strategies represent a "broader definition of IR" and the view that "a truly useful system must go beyond simple retrieval" (Liddy, 2000). I see them as treating the entire database or collection more holistically, recognizing that the selectivity of anthropogenic queries has a downside or bias which can be counterproductive to obtaining the best information, and attempting to "objectify" the IR process with principled algorithms. I like to think that they try to "let the data speak for itself."

When I started to research this paper I made a list of all the IR concepts (traditional and non-) that were explicitly related to text mining by the first wave of authorities I identified. It was a daunting list (Table 1), but I thought it would be possible to rule them all either "in" or "out" and thus define their boundaries and hierarchical relationships to text mining. However, it soon became clear that the boundaries were fuzzy, the hierarchy was a mass of convoluted loops, and even seemingly outlandish claims to text mining relevance had, on closer inspection, a grain of truth. Therefore I decided to try to cover them all instead of focusing on text mining proper, whatever that turned out to be. Fortunately, time and literature resource limitations intervened to significantly curtail this plan. Hopefully the result will serve as a sensible compromise.

History of Text Mining

H. P. Luhn (1958), in a seminal paper on automatic abstracting, noted "the resolving power of significant words" in primary text. Lauren B. Doyle (1961) also captured the spirit of text mining and related methods when he said that "natural characterization and organization of information can come from analysis of frequencies and distributions of words in libraries" ("libraries" representing what we would now more generally call collections or corpora). Text mining per se may be new, but the dream of training a computer to extract information from "mountains" of textual data is nearly as old as IR itself.

Don R. Swanson (1988) articulated the idea that the scientific literature should be regarded as a natural phenomenon worthy of "exploration, correlation, and synthesis." He contrasted scientists' attitudes toward information usage with those of intelligence analysts.

'To the working scientist or engineer, time spent gathering information or writing reports is often regarded as a wasteful encroachment on time that would otherwise be spent producing results that he believes to be new' [Weinberg et al, 1963] …. The intelligence analyst, by contrast, is much more intimate with the available base of recorded information. New knowledge, or finished intelligence, is seen as emerging from large numbers of individually unimportant but carefully hoarded fragments that were not necessarily recognized as related to one another at the time they were acquired. Use of stored data is intensively interactive; "information retrieval" is an inadequate and even misleading metaphor. The analyst is continually interacting with units of stored data as though they were pieces selected from a thousand scrambled jigsaw puzzles. Relevant patterns, not relevant documents, are sought.

Swanson called upon scientists to be more like intelligence analysts; to "take seriously the idea that new knowledge is to be gained from the library as well as the laboratory [and] to develop attitudes toward information indistinguishable from attitudes toward research itself."

Not content to lecture scientists from a theoretical pedestal, by the time these words were published Swanson had already put the idea into practice by developing a system to discover meaningful new knowledge in the biomedical literature (see references in Swanson & Smalheiser, 1999). Software now called ARROWSMITH and freely available on the web (http://kiwi.uchicago.edu) helps by finding common keywords and phrases in "complementary and noninteractive" sets of articles or "literatures" and juxtaposing representative citations likely to reveal interesting co-occurrences. Two literatures are "complementary if together they can reveal useful information not apparent in the two sets considered separately" – e.g., one may reveal a natural relationship between A and B, and the other a relationship between B and C, so that together they suggest a relationship between A and C. The two literatures are "noninteractive" if their articles do not cross-cite and are not co-cited elsewhere in the literature. Swanson has discovered at least three biomedically important relationships using this system: between fish oil and Raynaud's syndrome, magnesium and migraines and epilepsy, and arginine and somatomedin C (Lindsay & Gordon, 1999). Most recently he has used it to identify several dozen viruses as potential bioweapons (Swanson, Smalheiser, & Bookstein, 2001).

Swanson's system remains far from fully automated, it is highly medical domain-specific, and to my knowledge Swanson has never referred to it as text mining. But I believe it meets the criteria at least partially (see below), and Swanson has been recognized as an early pioneer by self-described text mining practitioners Marti Hearst (1999) and Ronald Kostoff (1999). I would like to go further and propose that, because of the ideas he expressed in his 1988 JASIS paper, Swanson is the father of modern text mining.

What is Text Mining?

Text mining per se is new and is still defining itself. It "has the peculiar distinction of having a name and a fair amount of hype but as yet almost no practitioners" (Hearst, 1999), and most of the information about it on the web is "misleading" (Perrin, 2001). The mining metaphor "implies extracting precious nuggets of ore from otherwise worthless rock" (Hearst, 1999), "gold hidden in … mountains of textual data" (Dorre, Gerstl, & Seiffert, 1999), or the idea that "the computer rediscovers information that was encoded in the text by its author" (IBM, 1998b).

Hearst (1997, 1999) has argued for a narrow definition of text mining which distinguishes it from "information access" (traditional IR). Traditional IR is concerned primarily with the retrieval of documents (perhaps it should be called "DR"!) relevant to a user's information need, but getting the desired information out of the documents is left entirely up to the user. According to Hearst, data mining (of which text mining is a subtype, see below) not only deals directly with the information, it tries to discover or derive new information from the data (text) which was previously unknown even to the author(s) of the data (text[s]). She says "data mining is opportunistic, whereas information access is goal-driven" and that IR tricks such as clustering, finding terms for query expansion, and co-citation analysis are not text mining, although they can aid it by improving the target dataset. Thus, IR can be viewed as a complementary technique supporting text mining, rather than its broader term.

Text mining always involves (a) getting some texts relevant to the domain of interest (traditional IR); (b) representing the content of the text in some medium useful for processing (natural language processing, statistical modeling, etc.); and (c) doing something with the representation (finding associations, dominant themes, etc.) (Perrin, 2001).

IBM is marketing a product named "Intelligent Miner for Text" (IBM, 1998a,b; Dorre et al, 1999). It is a set of tools which "can be seen as information extractors which enrich documents with information about their contents" in the form of structured metadata. "Features" are classes of data which can be extracted, such as the language of the text, proper names, dates, currency amounts, abbreviations, and "multiword terms" (significant phrases). The feature extraction component is "fully automatic – the vocabulary is not predefined." It may operate on single documents or on collections of documents. Word counts are based on normalization to canonical forms (e.g., surgeries, surgical, and surgically might all be normalized to surgery). The phrase extractor "uses a set of simple heuristics… based on a dictionary containing part-of-speech information for English words [and] simple pattern matching to find expressions having the noun phrase structures characteristic of technical terms. This process is much faster than alternative approaches." There is also a clustering tool, a classification tool, and a search engine/web crawler. The clustering similarity measure is based on "lexical affinities" – correlated groups of words which appear frequently within a short distance of each other and which can be used to label the clusters.
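The "lexical affinities" idea can be illustrated with a minimal sketch. It is not IBM's implementation, whose details are not given in the sources cited here. It simply counts unordered word pairs that fall within a short window of each other. The function name `lexical_affinities`, the window size, the frequency cutoff, and the tiny stopword list are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption of this sketch).
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def lexical_affinities(text, window=5, min_count=2):
    """Return (pair, count) for word pairs co-occurring within `window` tokens.

    Pairs are unordered (stored alphabetically) and only those seen at least
    `min_count` times are kept, most frequent first. Stopwords are dropped
    before the window is applied.
    """
    tokens = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Look only forward so each co-occurrence is counted once.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]


if __name__ == "__main__":
    sample = "heart attack risk. heart attack patients."
    print(lexical_affinities(sample, window=1, min_count=2))
```

The most frequent pairs returned by such a routine could serve as cluster labels in the way the IBM documentation describes, though a real system would presumably normalize words to canonical forms first.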

Lindsay and Gordon (1999) and Kostoff (1999) have extended Swanson's approach without calling it text mining, but Kostoff's other work explicitly uses that label and so he serves as a kind of bridge. Swanson's system is essentially as follows: MEDLINE searches are done on two subjects (say, magnesium and migraines) and the results (titles or abstracts) are dumped into ARROWSMITH, which generates a list of all significant words and phrases common to the two result sets, and uses this information to "juxtapose pairs of text passages for the user to consider as possibly complementary" (Swanson & Smalheiser, 1999). Lindsay and Gordon (1999) added lexical frequency statistics (tf*idf) to rank the common words and phrases by probable discriminatory value, but their system, like Swanson's, still requires "human filters" at several points.
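The core of this procedure, intersecting the vocabularies of two literatures and ranking the shared terms, can be sketched as follows. This is an illustration only, not ARROWSMITH or Lindsay and Gordon's code. The function names, the stopword list, the sample titles, and the particular tf*idf formula (collection term frequency times the log of inverse document frequency) are assumptions, since the exact weighting is not specified in this paper.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list (an assumption of this sketch).
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def tokenize(text):
    """Lowercase a title or abstract and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def shared_terms_ranked(literature_a, literature_c):
    """Rank terms common to two literatures by an assumed tf*idf weight.

    Each literature is a list of strings (e.g., titles). A term qualifies
    only if it occurs in both literatures; it is then a candidate "B" term
    linking subject A to subject C.
    """
    docs = [tokenize(t) for t in literature_a + literature_c]
    terms_a = {w for t in literature_a for w in tokenize(t)}
    terms_c = {w for t in literature_c for w in tokenize(t)}
    shared = terms_a & terms_c

    tf = Counter(w for d in docs for w in d)        # collection frequency
    df = Counter(w for d in docs for w in set(d))   # document frequency
    n_docs = len(docs)
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in shared}
    # Highest score first; ties broken alphabetically for reproducibility.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    magnesium = ["magnesium deficiency and vascular tone",
                 "magnesium blocks calcium channels"]
    migraine = ["migraine and vascular tone changes",
                "calcium channels in migraine",
                "migraine and serotonin"]
    for term, score in shared_terms_ranked(magnesium, migraine):
        print(f"{term}\t{score:.3f}")
```

Even in this toy form the output is only a list of candidates. Deciding which shared terms point to a plausible A-to-C relationship is exactly where the "human filters" come in.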

Kostoff and co-workers have published several papers on the Web describing various text mining systems and applications. Losiewicz, Oard, and Kostoff (2000) describe a "TDM [text data mining] architecture that unifies information retrieval from text collections, information extraction from individual texts, knowledge discovery in databases, knowledge management in organizations, and visualization of data and information." What they mean by "unifies" is unclear, but this statement clearly betokens a broad view of text mining, almost as a synonym for the entire family of nontraditional IR strategies. The "TDM architecture" they describe includes subsystems for data collection (source selection and text retrieval), data warehousing (information extraction and data storage), and data exploitation (data mining and presentation). It thus appears to be a system for extracting and analyzing metadata. The authors discuss linguistic analysis and numerous exotic pattern-finding techniques, but these appear to be long-range goals. Current work focuses on the more pedestrian challenges of relevance feedback ("simulated nucleation"), bibliometrics, and phrase extraction and statistics. The system is "time and labor intensive" by the authors' own admission, "requires the close involvement of technical domain expert(s)" at every level of processing, and aims for a "main output [consisting of] technical experts who have had their horizon and perspectives broadened substantially through participation in the data mining process. The data mining tools, techniques and tangible products are of secondary importance…"

Kostoff, Toothman, Eberhart, and Humenik (2000) connect text mining to "database tomography," a system for phrase extraction and proximity analysis. The authors capture the spirit of text mining when they say "techniques that identify, select, gather, cull, and interpret large amounts of technological information semi-autonomously can expand greatly the capabilities of human beings…" The idea of "tomography" also evokes text visualization, an important nontraditional IR strategy related to text mining (see below). The authors cite unpublished studies showing that in "real-world text mining applications" there is a "strong de-coupling of the text mining research performer from the text mining user. The performer tended to focus on exotic automated techniques, to the relative exclusion of the components of judgment necessary for user credibility and acceptance." Users tended to favor simpler techniques, even if it meant "reading copious numbers of articles." Database tomography aims to couple text mining research and technology more closely with the user through "heavy involvement of topical domain experts (either users or their proxies)" in the development of "strategic database maps" on the "front end." "The authors believe that this is the proper use of automated techniques for text mining: to augment and amplify the capabilities of the expert by providing insights to the database structure and contents, not to replace the experts by a combination of machines and non-experts."

Kostoff and DeMarco (2001) define science and technology text mining as "the extraction of information from technical literature." It has three components: information retrieval (gathering relevant documents), information processing, and information integration. "Information processing is the extraction of patterns from the retrieved records" by bibliometrics, computational linguistics, and clustering. "Information integration is the synergistic combination of the information processing computer output with the [human] reading of the retrieved relevant records. The information processing output serves as a framework for the analysis, and the insights from reading the records enhance the skeleton structure to provide a logical integrated product." Again, "substantial manual labor" is noted, and technical details are not given, leaving doubt as to what kind of and how much "computational linguistics" and "clustering" were actually implemented. This work was also published under the title "Citation mining: Integrating text mining and bibliometrics for research user profiling" by Kostoff, del Rio, Humenik, Garcia, and Ramirez (2001).

In all of Kostoff's articles, there is a disturbingly high ratio of shifting, florid, technical jargon and speculation to actual accomplishment. He seems to be re-inventing several well established techniques such as relevance feedback, co-citation analysis, and phrase extraction, giving them flashy new names, and failing to cite prior work by others. It is often unclear where the boundary is between the computer and human filtering, particularly in Kostoff's phrase extraction process. Given the authors' constant emphasis on the importance of human judgment it seems likely that they have not automated the phrase selection process at all, and therefore have not added anything to classical word proximity analysis for phrase identification. Unrestricted human filtering or intervention in what are supposed to be algorithmic processes is, in some sense, a form of "fudging" or "cheating." It is antithetical to the goals of standardizing and objectifying the IR process, and it is hard to see how it contributes anything progressive to text mining research. This is not to disagree with Kostoff about the importance of domain expertise and user credibility and acceptance, only to caution against using such concerns as a figleaf for excessively primitive IR technology.

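Based on the foregoing, I propose the following criteria for a true text mining system: (1) it operates on large, natural language text collections; (2) it uses principled algorithms more than heuristics and manual filtering; (3) it extracts phenomenological units of information (e.g., patterns) rather than or in addition to documents; (4) it discovers new knowledge.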

It is to be expected that different systems will meet these criteria to different extents. Currently Swanson's and Kostoff's systems are on shaky ground on at least the first two, possibly three. Perhaps text mining, by these criteria, is still more dream than reality. So let's look at some related concepts.

Data Mining

It seems fairly noncontroversial that text mining is a subdiscipline of the broader and slightly older field of data mining, the subdiscipline which deals with textual data. An intermediate evolutionary lexical form, in fact, is "text data mining" (Hearst, 1999; Losiewicz et al, 2000). The mining metaphor implying "extracting precious nuggets of ore from otherwise worthless rock" is actually more appropriate for text mining than for data mining, which tends to deal with trends and patterns across whole databases (Hearst, 1999).

Data mining is considered a synonym for "knowledge discovery in databases" (KDD) by some writers (e.g. Hearst, 1999) and as a narrower term by others (e.g. Liddy, 2000). The most cited definition of KDD is that given by Fayyad, Piatetsky-Shapiro, and Smyth (1996, cited by Qin, 2000, and Hearst, 1997): the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. "Information archaeology" is a synonym for both data mining and KDD, according to Hearst (1999). Two unusually practical, down-to-earth books on data mining are Witten and Frank (2000) and Han and Kamber (2001) (Perrin, 2001).

Data mining usually deals with structured data, but text is usually fairly unstructured. The crux of the text mining problem, then, can be viewed as imposing structure on text to make it amenable to the analytic techniques of data mining. This is often conceptualized as extracting metadata from text (Losiewicz et al, 2000).

Machine Learning

Data mining is based on a variety of computational techniques, some of which fall under the rubric of machine learning. Examples are decision trees, neural networks, and association rules (clustering). In this context, machine learning involves "the acquisition of structural descriptions from examples [which] can be used for prediction, explanation, and understanding." When the description can be used to classify the examples, all three are enabled, unlike purely statistical modeling which only supports prediction. By some views, however, machine learning is little more than practical statistics as it evolved in the field of computer science; i.e., with an emphasis on searching "through a space of possible concept descriptions for one that fits the data" (Witten & Frank, 2000).

From a broader artificial intelligence (AI) perspective, machine learning is one of the four capabilities needed for an AI system such as a robot to pass the "Turing test" – that is, to appear logical, rational, and intelligent to an intelligent human interrogator. In this context machine learning involves the ability "to adapt to new circumstances and to detect and extrapolate patterns" (Russell & Norvig, 1995).

From a biomedical research perspective, Mjolsness and DeCoste (2001) define machine learning as "the study of computer algorithms capable of learning to improve their performance of a task on the basis of their own previous experience" primarily through pattern recognition and statistical inference. They see a legitimate future role for it in "every element of scientific method, from hypothesis generation to model construction to decisive experimentation." Text mining could help with the "high data volumes" involved in literature searching. However, most work to date has focused on experimental data reduction such as visualization of high-dimensional vector data resulting from gene expression microarray studies (see footnote 6, p. 25).

Natural Language Processing

Natural language processing (NLP) or understanding (NLU) is the branch of linguistics which deals with computational models of language. A brief history is given by Bates (1995). Its motivations are both scientific (to better understand language) and practical (to build intelligent computer systems). NLP has several levels of analysis: phonological (speech), morphological (word structure), syntactic (grammar), semantic (meaning of multiword structures, especially sentences), pragmatic (sentence interpretation), discourse (meaning of multi-sentence structures), and world (how general knowledge affects language usage) (Allen, 1995). When applied to IR, NLP could in principle combine the computational (Boolean, vector space, and probabilistic) models' practicality with the cognitive model's willingness to wrestle with meaning. NLP can differentiate how words are used, for example by sentence parsing and part-of-speech tagging, and might thereby add discriminatory power to statistical text analysis. Clearly, NLP could be a powerful tool for text mining. Interest in it for that purpose is widespread but the jury remains out.

Rau (1988) described an early NLP system named SCISOR which was developed by General Electric. Limited applicability to "constrained domains" was emphasized; SCISOR was programmed to deal only with information on corporate mergers. Input (news stories, etc.) was described as being converted to "conceptual format" permitting natural language interrogation (i.e., question answering) and summarization. SCISOR employed a parallel strategy of top-down (expectation-driven conceptual analysis) and bottom-up (partial linguistic analysis) parsing. Parsing is the identification of subjects, verbs, objects, phrases, modifiers, etc., within sentences. Computerized parsing of free text "is an extremely difficult and challenging problem," according to Rau. The two parsers in SCISOR interacted with a domain-specific knowledge base containing grammatical and lexical information. The double parsing strategy of SCISOR allowed flexibility to perform in-depth analysis when complete grammatical and lexical knowledge is available, and superficial analysis when unknown words and syntax are encountered, giving the system robustness. The top-down parser could also be used for text skimming (looking for particular pieces of information).

However, semantic analysis "is very expensive and furthermore depends on a lot of domain-dependent knowledge that has to be constructed manually or obtained from other sources" (IBM, 1998a). Early NLP's image also suffered from the poor performance of phrase-based indexing in comparison with stemmed single words in the Cranfield and SMART tests (Salton, 1992). Interest in NLP revived when request-oriented (as opposed to document-oriented) IR came of age and it was realized that the limitations of the linguistic techniques did not prevent them from being effective within restricted subject domains (Ingwersen & Willett, 1995). Unlike its more successful sibling field of speech recognition, NLP has the severe disadvantages of diffuse goals and lack of robust machine learning algorithms (Bates, 1995). There seems to be wide consensus that NLP is still not competitive with statistical approaches to traditional IR, but that it may be practical and even critical for applications such as phrase extraction and text summarization. Even Salton, the godfather of statistical IR, said, "In the absence of deep linguistic analysis methods that are applicable to unrestricted subject areas, it is not possible to build intellectually satisfactory text summaries" (Salton, Allan, Buckley, & Singhal, 1994).

Liz Liddy (2000, 2001) has become a prominent advocate for NLP in text mining. Her definition of the goal of text mining, in fact, is "capturing semantic information" as tabular metadata amenable to statistical data mining techniques. In her work, NLP includes stemming (morphological level), part-of-speech tagging (syntactic level), phrase and proper name extraction (semantic level), and disambiguation (discourse level). Goals include automating text mark-up for hypertext linkages in digital libraries, and machine learning algorithms for text classification (see below).

A "reverse flow" of purely statistical methods to NLP has been going on since about 1990 and has made "substantial contributions" (Kantor, 2001), increasing interest in hybrid approaches (Marcus, 1995; Losee, 2001a; Perrin, 2001). Statistical enrichment has been shown to significantly improve the accuracy of proper name classification, part-of-speech tagging, word sense disambiguation, and parsing under certain conditions (Marcus, 1995), and tagging and disambiguation improve probabilistic document retrieval ranking discrimination by some parts of speech (Losee, 2001a). Ultimately, lexical statistics are a reflection of term dependencies which in turn reflect natural languages' relation to "naturally occurring dependencies in the physical world" (Losee, 2001b). However, higher-level NLP proved far inferior to "shallow" tricks like stemming and query expansion in improving the performance of an advanced IR system under rigorous test conditions (Perez-Carballo & Strzalkowski, 2000).

Computational linguistics is used as a synonym for NLP by some writers and as a narrower term by others. According to Hearst (1999), it is the branch of NLP which deals with finding statistical patterns in large text collections to inform algorithms for NLP techniques such as part-of-speech tagging, word sense disambiguation, and bilingual dictionary creation; i.e., computational linguistics is a form of text mining. Thus, to Hearst and Liddy, text mining subserves NLP, rather than the reverse. Both Hearst and Liddy refer often to metadata as being the bridge between NLP and statistics. They both envision text mining as a component of a full-featured information access system which also includes source detection, content retrieval, and analytical aids such as text visualization (see below).

A major problem in text analysis is "dangling anaphors" – pronouns and demonstratives (this, that, the latter, etc.) which refer back to other sentences (Johnson, Paice, Black, & Neal, 1993). Therefore a good job for NLP would be to detect anaphors and search backwards to resolve their referent. In the language of logic, this might be called identifying the point in the text where each significant new proposition begins. In 1993, that was beyond available text processing capabilities, so the authors had to exclude anaphoric sentences from further analysis regardless of their information content.

In summary, all this activity and interest raise hopes, but NLP still "has not delivered the goods" (Saracevic, 2001) and so the jury remains out.

Text Summarization

An obvious example of text mining would be to find previously unknown natural correlations by looking at co-occurrences of themes in a corpus of texts. Before one can do that, of course, one must identify the themes. A theme being a form of summary, automated theme-finding is a form of automatic text summarization (or automatic abstracting), a proud old IR tradition.

Johnson, Paice, Black, and Neal (1993) trace the history of automatic abstract generation from Luhn (1958), who proposed extracting sentences based on their computed word content weights, and Baxendale (1958, cited by Johnson et al, 1993), who drew attention to the importance of the first and last sentences of paragraphs. Edmundson (1969, cited by Johnson et al, 1993) found that both of these methods were inferior to extraction on the basis of cues (bonus words and stigma words). Paice (1981, cited by Johnson et al, 1993) sharpened Edmundson's idea of cues to "indicator constructs" such as In this paper we show that…
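The two extraction strategies just described, word-content weights and cue phrases, can be combined in a small sketch. It is a simplification and not a reconstruction of Luhn's or Edmundson's actual procedures. Each sentence is scored by the mean collection frequency of its content words, then adjusted up or down for cue phrases. The cue lists, the weight of 2.0, the stopword list, and all function names are assumptions chosen for illustration.

```python
import re
from collections import Counter

# All three lists are hypothetical examples, not taken from the cited papers.
STOPWORDS = {"a", "an", "the", "in", "of", "and", "that", "this", "we", "to", "is", "are"}
BONUS_CUES = ("in this paper", "we show", "we conclude")
STIGMA_CUES = ("for example", "perhaps")


def content_words(text):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentences(text, cue_weight=2.0):
    """Return (score, sentence) pairs in document order."""
    freq = Counter(content_words(text))
    scored = []
    for sentence in split_sentences(text):
        words = content_words(sentence)
        # Word-content weight: mean document frequency of the content words.
        content = sum(freq[w] for w in words) / len(words) if words else 0.0
        lowered = sentence.lower()
        bonus = sum(cue_weight for cue in BONUS_CUES if cue in lowered)
        stigma = sum(cue_weight for cue in STIGMA_CUES if cue in lowered)
        scored.append((content + bonus - stigma, sentence))
    return scored


def extract(text, n=1):
    """Return the n highest-scoring sentences as a crude extract."""
    ranked = sorted(score_sentences(text), key=lambda pair: -pair[0])
    return [sentence for _, sentence in ranked[:n]]


if __name__ == "__main__":
    sample = ("Cats sleep a lot. In this paper we show that cats sleep more "
              "in winter. Perhaps dogs differ.")
    print(extract(sample, n=1))
```

In the sample, the second sentence wins mainly on its cue phrases rather than its word weights, which is in keeping with Edmundson's finding that cues outperform frequency alone.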

Johnson et al (1993) built an NLP-based auto-abstracting system which selected non-anaphoric, indicator-containing sentences and ran them through a bottom-up parser, dictionary-based part-of-speech tagger (noun, verb, etc.) and morphology-based tagger (-ly = adverb, etc.). Each word was then indexed by its sentence number, position within the sentence, part of speech, verb tense if applicable, and whether it was plural or singular. The result was then "cleaned up" by a set of corrective heuristics and a grammar-based tag disambiguator. A global parser then identified noun phrases based on definitive cues such as being separated by a preposition (e.g., the primary factor in public health), and then parsed the sentence. The resulting sample abstract was "far from perfect" as the authors admitted, but it was a plausible condensation down to 22% of the original text size. Since 22% is an inadequate degree of data reduction for most text summarization needs, the next step might be to take a page from statistical IR and develop ways of ranking the selected sentences.

Template Mining

SCISOR's (Rau, 1988) text summarization capabilities were based on filling in values specified by domain-dependent, manually formulated "scripts" – e.g., company A offered B dollars per share in a takeover bid for company C on date D. The values were extracted from raw text by parsing and stored in relational data tables. Then summaries of the parsed data values could be written by a natural language generator. This seems to be a form of template mining, where the script or metadata table field structure constitutes the template.
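The takeover script above can be imitated with a simple pattern-matching template. Note the substitution: SCISOR filled its scripts by parsing, whereas this sketch uses a plain regular expression, which is far more brittle. The sentence wording, the field names (`bidder`, `price`, `target`, `date`), the date format, and the sample company names are all assumptions made for illustration.

```python
import re

# A capitalized name is one or more capitalized words, e.g. "Acme Corp".
NAME = r"[A-Z][\w&]*(?: [A-Z][\w&]*)*"

# Hypothetical template for: "<A> offered $<B> per share in a takeover bid
# for <C> on <D>", with the date written as day, month name, year.
TAKEOVER_TEMPLATE = re.compile(
    rf"(?P<bidder>{NAME}) offered \$(?P<price>\d+(?:\.\d+)?) per share "
    rf"in a takeover bid for (?P<target>{NAME}) on (?P<date>\d{{1,2}} \w+ \d{{4}})"
)


def mine_takeovers(text):
    """Return one record (a dict) per sentence fragment matching the template."""
    records = []
    for match in TAKEOVER_TEMPLATE.finditer(text):
        record = match.groupdict()
        record["price"] = float(record["price"])  # store the bid as a number
        records.append(record)
    return records


if __name__ == "__main__":
    story = ("Analysts were surprised. Acme Corp offered $42.50 per share "
             "in a takeover bid for Widget Inc on 3 March 1988.")
    for row in mine_takeovers(story):
        print(row)
```

Each returned record corresponds to one row of a relational table whose columns are the template's slots. A sentence that reports the same facts in different words would be missed, which is the weakness that parsing-based systems try to overcome.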

Chowdhury (1999) describes template mining as a form of information extraction using NLP "to extract data directly from the text if either the data and/or text surrounding the data form recognizable patterns. When text matches a template, the system extracts data according to the instructions associated with that template." Chowdhury traces its history from the mid-1960s Linguistic String Project at New York University, where "fact retrieval" was conducted against template data mined from natural language text, up to its current (1999) use in the AltaVista and Ask Jeeves web search engines. He cites some of the same work I reviewed under NLP and below (the Rau, Paice, and Gaizauskas groups), perhaps implying that template mining is a general term for NLP-based metadata approaches to text mining. He also cites Croft (1995) in reference to the U.S. Advanced Research Projects Agency (ARPA) initiative in this area, the Message Understanding Conferences (MUCs).

To facilitate template mining, Chowdhury recommends "standardization in the presentation and layout of information within digital documents" through the use of templates for document creation. But this is contrary to the spirit of text mining, which is to liberate both the creators and the users of text from as much tedium and artificiality as possible. Like Kostoff's unrestricted reliance on human filters, it represents a form of surrender in the face of difficulty – hopefully premature!

Theme Finding

Salton, Allan, Buckley, and Singhal (1994) looked at how traditional IR models can be applied to theme generation and text summarization. The authors derived the notion of passage retrieval from the problem of ranking vector matches when the vectors are of different lengths, e.g. very short queries against long documents, or clustering documents of different sizes. One solution is to decompose the documents into subunits of roughly equal size, called "passages." A common passage unit is a paragraph.

The passages may be converted to normalized vectors and compared. Those with similarities above a certain threshold (which may be chosen to deliver a desired degree of abstraction) are considered connected. If the documents are plotted as arcs on the circumference of a circle and their component passages connected by straight lines in accordance with their vector similarities, the resulting starburst pattern can convey themes within and between documents. These themes can be focused by expressing each triangle of passage similarities as a centroid and doing similarity calculations on the centroids.
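The passage-linking step can be sketched in a few lines. This is a simplified illustration under my own assumptions, not Salton et al's implementation: it uses raw term frequencies rather than their term weighting scheme, and the function names are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations

def passage_vector(text):
    """Unit-length term-frequency vector for one passage (no idf weighting; a simplification)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}

def cosine(u, v):
    """Dot product of two unit-length vectors, which equals their cosine similarity."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())

def link_passages(passages, threshold):
    """Return index pairs of passages whose similarity meets the threshold."""
    vectors = [passage_vector(p) for p in passages]
    return [(i, j) for i, j in combinations(range(len(passages)), 2)
            if cosine(vectors[i], vectors[j]) >= threshold]

def centroid(vectors):
    """Average a group of passage vectors, e.g. the three corners of a similarity triangle."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}
```

Raising the threshold thins out the links and yields a more abstract map; lowering it shows more of the fine structure.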

One may want to compute an estimate of the "most important" passages for the purpose of selective text traversal ("skimming") or text summarization. Such passages might be identified as (a) having a large number of above-threshold similarity connections, (b) strategic position (e.g., the first paragraph in each section), or (c) high similarity to some reference node. The last criterion (c) is called "depth first" selection. In practice, all three of these criteria can be combined; e.g., start with some desired passage (as in "more like this"), go to the most similar sectional heading passage, then go to its strongest link, then select the other densely connected nodes in that cluster in chronological order. For text summarization, repetition can be edited out on the basis of similarities between sentences or other subunits which are "too high."
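Criteria (a) and (b) lend themselves to a short sketch. The code below assumes the similarity links have already been computed as pairs of passage indices; the scoring rule (link count plus a bonus of one for a section-initial passage) is my own assumption for illustration, not the authors' formula.

```python
from collections import Counter

def link_counts(links, n_passages):
    """Criterion (a): number of above-threshold links touching each passage."""
    degree = Counter()
    for i, j in links:
        degree[i] += 1
        degree[j] += 1
    return [degree[p] for p in range(n_passages)]

def select_passages(links, n_passages, k, section_starts=()):
    """Pick the k highest-scoring passages and return them in text order."""
    degree = link_counts(links, n_passages)
    # Criterion (b): hypothetical bonus of 1 for the first passage of a section
    scores = [degree[p] + (1 if p in section_starts else 0) for p in range(n_passages)]
    top = sorted(range(n_passages), key=lambda p: scores[p], reverse=True)[:k]
    return sorted(top)

# Toy link structure over five passages
links = [(0, 1), (0, 2), (1, 2), (2, 3)]
summary_order = select_passages(links, n_passages=5, k=2, section_starts={0})
```

Returning the selected passages in their original order keeps the extract readable, which corresponds to the "chronological order" of the traversal described above.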

Text Categorization

Text categorization should not be considered a form of text mining because it is a "boiling down" of document content to "pre-defined labels" which "does not lead to discovery of new information" since "presumably the person who wrote the document knew what it was about," according to Hearst (1999). Presumably she would also rule out text summarization and auto-indexing for the same reason. She makes exceptions, however, for cases where the goal of categorization is to find "unexpected patterns" or "new events" because these "tell us something about the world, outside of the text collection itself" and therefore qualify as new information.

I would argue, however, that it is not so easy to predict where "new information" will come from, that novelty is in the eye of the beholder, and that any form of text data reduction is a form of separating "precious nuggets" from "worthless rock" according to the human idiosyncrasies of whoever is doing the separating, be it a traditional library cataloguer/indexer or a vector space modeler. This is not to say that cataloguing, indexing, and other IR tools are all text mining, but just to highlight the fuzziness of the boundaries between them.

Clustering

Clustering can be used to classify texts or passages in natural categories that arise from statistical, lexical, and semantic analysis rather than the arbitrarily pre-determined categories of traditional manual indexing systems. In the context of text mining, it is the derivation of the categories which is of interest, since this is a form of theme finding and therefore text summarization. Once the texts are clustered on the basis of common themes, it may also be useful to correlate their divergent themes, a la Swanson. Texts may also be clustered on the basis of length, cost, date, etc. (IBM, 1998b), or bibliographic data such as author, institution, or country of origin (Kostoff, 1999). Computational aspects of clustering are reviewed by Witten and Frank (2000, Section 6.6).
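As a rough sketch of how categories can arise from the texts themselves, the code below groups texts by single-link clustering on term overlap and then derives a label for each cluster from its most frequent terms. The Jaccard measure, the threshold, and the labeling rule are assumptions chosen for brevity; they are not taken from any of the systems cited.

```python
import re
from collections import Counter

def terms(text):
    """Set of lower-case alphabetic tokens in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Shared terms divided by all terms."""
    return len(a & b) / len(a | b) if a | b else 0.0

def single_link_clusters(texts, threshold):
    """Merge any two texts whose term overlap meets the threshold (union-find)."""
    term_sets = [terms(t) for t in texts]
    parent = list(range(len(texts)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if jaccard(term_sets[i], term_sets[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(texts)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

def cluster_label(texts, members, n=2):
    """Derive a label from the n terms shared by the most members (ties broken alphabetically)."""
    counts = Counter(t for m in members for t in terms(texts[m]))
    return [t for t, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
```

The labels returned by cluster_label() are the derived categories, which is the part of clustering that matters for theme finding.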

Filtering

E-mail filtering is often mentioned as an example of text mining (e.g., Witten and Frank, 2000). The relevance of related techniques such as name recognition, theme finding, and text categorization is obvious, and it is even possible to imagine software which modifies its own filtering criteria by discovering new patterns in the whole e-mail stream. However, I was unable to find reports of any actual work on such a system.

Belkin and Croft (1992) built a model of information filtering (IF) based on Belkin's famous anomalous states of knowledge (ASK) model of IR. In a side-by-side comparison, the two (IF and IR) appear strikingly similar, the biggest difference being the "stable, long-term…regular information interests" of IF compared to the "periodic… information need or ASK" of IR. Extending the side-by-side modeling to Bayesian inference networks, the authors arrive at another striking comparison: the IF network looks exactly like an upside-down IR network! That is, in IR multiple documents are percolating down to a single user, while in IF each single incoming document is percolating down to multiple users. However, the authors reject this analogy for reasons not entirely clear to me.
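The "upside-down" relationship can be illustrated with a toy sketch in which the same matching function serves both tasks and only the direction of the comparison changes. This is my own illustration, not Belkin and Croft's inference network model, and the term-overlap score is a deliberately crude assumption.

```python
def overlap(text_a, text_b):
    """Number of distinct words two texts share (a crude stand-in for a real match score)."""
    return len(set(text_a.lower().split()) & set(text_b.lower().split()))

def retrieve(query, documents, min_score=1):
    """IR: one transient query is matched against many stored documents."""
    return [doc for doc in documents if overlap(query, doc) >= min_score]

def route(document, profiles, min_score=1):
    """IF: one incoming document is matched against many standing user profiles."""
    return [user for user, profile in profiles.items()
            if overlap(document, profile) >= min_score]

documents = ["gene expression in yeast", "library cataloguing practice"]
profiles = {"ann": "yeast genetics", "bob": "cataloguing rules"}

hits = retrieve("yeast gene", documents)
recipients = route("gene expression in yeast", profiles)
```

In retrieve() the collection is stable and the query comes and goes; in route() the profiles are stable and the documents come and go. The sketch captures only that symmetry and says nothing about the reasons the authors give for rejecting the analogy.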

Text Visualization

Text visualization shares text mining's goals of using computational transformations to reduce the cognitive effort of dealing with large text corpora, highlight patterns across documents, and help discover new knowledge. Text mining implies homing in on "precious nuggets" whereas text visualization seems to be concerned with the "big picture," but in practice both may be regarded as elements of a holistic approach to multi-text corpora. The text mining systems of Hearst, Kostoff, and Liddy all have explicit text visualization components.

Wise (1999) developed a text visualization paradigm for intelligence analysis named Spatial Paradigm for Information Retrieval and Exploration (SPIRE) "to find a means of ‘visualizing text’ in order to reduce information processing load and to improve productivity" by representing large numbers of documents to permit "rapid retrieval, categorization, abstraction, and comparison, without the requirement to read them all." The theory behind SPIRE was that humans’ most highly evolved perceptual abilities are those involved in interpreting "visual features of the natural world." Therefore the goal was to represent text as natural, ecological images from our early hominid past which require no "prolonged training to appreciate and use" such as star fields or landscapes (Figure 1). This transformation was accomplished using standard vector space algorithms and involves clustering and text summarization. SPIRE is an excellent example of how a cognitive theory can be helpful in inspiring IR innovation and guiding system development, despite its apparent lack of commercial success.

Text Compression

As mentioned at the beginning, I started this paper by trying to narrow the definition and scope of text mining by differentiating it from other nontraditional IR strategies (Table 1). One by one, however, the other strategies refused to be cleanly differentiated, and the foregoing polyglot review is the result. The only concept I thought I had succeeded in banishing from the scope of text mining was data compression, which showed up in the title of a single citation in a literature search performed for me by Melissa Yonteck. Data compression, a la PKZIP, was surely not related in any meaningful way to text mining, Yonteck and I agreed. Here at last was something I could confidently rule out.

But on page 334, Witten and Frank (2000), in discussing statistical character-based models for token classification (names, dates, money amounts, etc.), note that "there is a close connection with prediction and compression: the number of bits required to compress an item with respect to a model can be interpreted as the negative logarithm of the probability with which that item is produced by the model." That is, text compression algorithms might function as token classifiers in reverse! So I give up. Text mining appears to be related to just about everything on my original list.
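The connection can be made concrete with a small sketch. If each token class has its own character model, the cost in bits of encoding a token under a model is the negative base-2 logarithm of the token's probability under that model, and the class whose model encodes the token most cheaply is the best guess. The order-0 model with add-one smoothing below is my own simplification, not the character-based models Witten and Frank describe, and the training examples are invented.

```python
import math
from collections import Counter

class CharModel:
    """Order-0 character model with add-one smoothing (a deliberately simple stand-in)."""

    def __init__(self, examples, alphabet_size=256):
        self.counts = Counter("".join(examples))
        self.total = sum(self.counts.values())
        self.alphabet_size = alphabet_size  # assumed size of the character set

    def bits(self, token):
        """Code length of the token: the sum of -log2 P(char) over its characters."""
        return sum(
            -math.log2((self.counts[ch] + 1) / (self.total + self.alphabet_size))
            for ch in token
        )

def classify(token, models):
    """Assign the token to the class whose model would compress it best."""
    return min(models, key=lambda name: models[name].bits(token))

# Invented training examples for two token classes
models = {
    "date": CharModel(["12/11/2001", "03/05/1999"]),
    "name": CharModel(["Swanson", "Hearst", "Kostoff"]),
}
```

Calling classify("07/04/1988", models) picks the date model because digits and slashes are cheap to encode under it and expensive under the name model. A practical system would condition each character on its predecessors rather than treating characters independently.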

Biomedical Applications

My interest in text mining is motivated primarily by the belief that it can be fruitfully applied to biomedical literature, specifically the MEDLINE database, to discover new knowledge. I see text analysis as a major new frontier in bioinformatics, whose smashing success in the area of gene sequence analysis is based, after all, on nothing more than algorithms for finding and comparing patterns in the four-letter language of DNA. Swanson's work has focused on MEDLINE, and Hearst (1999) has also declared a research interest in "automating the discovery of the function of newly sequenced genes" by determining which novel genes are "co-expressed with already understood genes which are known to be involved in disease."

Humphreys, Demetriou, and Gaizauskas (2000) used information extraction, defined as "extracting information about predefined classes of entities and relationships from natural language texts and placing this information into a structured representation called a template" [is it therefore template mining?], to build a database of information about enzymes, metabolic pathways, and protein structure from full text biomedical research articles. The LaSIE (Large Scale Information Extraction) system includes modules for datatype recognition (names, dates, etc.), co-reference resolution (pronouns, anaphors, metonyms, etc.), and different types of template filling. It does linguistic analysis at all levels up to discourse using lexical knowledge, morphology, and grammars to identify significant words. The enzyme and metabolic pathway variant of LaSIE is called (of course) EMPathIE and fills the following template fields: enzyme name, EC (Enzyme Commission) number, organism, pathway, compounds involved and their roles (substrate, product, cofactor, etc.), and, interestingly, compounds not involved. Optional fields include concentration and temperature. The PASTA variant deals with protein structure information such as which amino acid residues occupy given positions, active and binding sites, secondary structure, subunits, interactions with other molecules, source organism, and SCOP category. The prototype has been tested on only six journal papers, so it is far from satisfying the large text corpus requirement for true text mining, but the authors make no such claim.

The U.S. National Institutes of Health (NIH) have also gotten involved. Tanabe, Scherf, Smith, Lee, Hunter, and Weinstein (1999) developed a system named MedMiner to help them sort out the thousands of gene expression correlations resulting from microarray experiments to separate "interesting biological stories" from mere epiphenomena and statistical coincidences. The first module gathers the relevant texts by querying PubMed (MEDLINE) and GeneCards (an Israeli gene information database) on the expressed genes. [Gene names generally make good search words because they are different from normal English words, e.g. "JAK3".] The second module filters the retrieved texts by user-specifiable relevance criteria based on classical proximity or term frequency scores (NLP criteria being regarded as too computationally expensive). The third module is a "carefully designed user interface" to facilitate access to the most likely-to-be-interesting documents.
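A proximity criterion of the kind used in the second module might look like the sketch below. It is an illustration under my own assumptions, not MedMiner's code: the window size, the function names, and the toy sentences are all hypothetical.

```python
import re

def min_distance(tokens, term_a, term_b):
    """Smallest gap, in word positions, between any occurrence of the two terms."""
    positions_a = [i for i, tok in enumerate(tokens) if tok == term_a]
    positions_b = [i for i, tok in enumerate(tokens) if tok == term_b]
    if not positions_a or not positions_b:
        return None
    return min(abs(i - j) for i in positions_a for j in positions_b)

def filter_by_proximity(texts, gene, keyword, window=5):
    """Keep texts in which the gene name and the keyword occur within `window` words."""
    kept = []
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        distance = min_distance(tokens, gene.lower(), keyword.lower())
        if distance is not None and distance <= window:
            kept.append(text)
    return kept

# Invented sentences, for illustration only
texts = [
    "JAK3 expression correlated with drug response.",
    "Response was measured in many unrelated cell lines over several years before JAK3 was examined.",
    "No gene is mentioned here.",
]
relevant = filter_by_proximity(texts, "JAK3", "response", window=5)
```

Word-position arithmetic of this kind is far cheaper than parsing, which is consistent with the authors' decision to avoid NLP criteria on grounds of computational expense.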

Despite the name, then, MedMiner is not a true text mining system, but rather a search and display enhancement to PubMed (which offers only flat Boolean search logic, unranked retrieval, and no integration with GeneCards, although it is integrated with other gene and protein databases). Like Kostoff's system, it is designed to deal with highly technical information by assisting expert users in their traditional IR tasks rather than attempting to automate them completely. MedMiner is freely available online at http://discover.nci.nih.gov.

Another NIH group, Rindflesch, Hunter, and Aronson (1999), developed a true NLP system named ARBITER for mining molecular binding terms from MEDLINE. ARBITER attempts to identify noun phrases representing molecular entities such as drugs, receptors, enzymes, toxins, genes, messenger molecules, etc., and their structural features (box, chain, sequence, subunit, etc.) likely to be involved in binding. ARBITER makes use of MeSH indexing, the lexical and semantic knowledge bases of the Unified Medical Language System (UMLS) and GenBank, co-word adjacency to forms of bind, and a variety of linguistic strategies to deal with acronyms, anaphors, modifiers, coordinated phrases, and nested phrases (e.g., "…a previously unrecognized coiled-coil domain within the C terminus of the PKD1 gene product, polycystin, and demonstrate…"). A test on a small sample (116 abstracts containing a form of bind, one month's worth from MEDLINE) yielded 72% recall and 79% precision of manually marked binding terms. While terminology extraction might be considered a fairly trivial form of text mining, it is obviously a logical step toward the mining of binding relationships (A binds B) which would have enormous potential for knowledge discovery.
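For readers unfamiliar with the two evaluation measures, the sketch below shows how recall and precision are computed when extracted terms are compared against manually marked ones. The term sets are invented for illustration and are not data from the ARBITER evaluation.

```python
def precision_recall(extracted, marked):
    """Compare system-extracted terms with manually marked terms."""
    extracted, marked = set(extracted), set(marked)
    correct = len(extracted & marked)
    # Precision: share of extracted terms that were also marked by hand
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: share of hand-marked terms that the system found
    recall = correct / len(marked) if marked else 0.0
    return precision, recall

# Hypothetical example sets, not taken from the paper
extracted = {"polycystin", "PKD1 gene product", "kinase"}
marked = {"polycystin", "PKD1 gene product", "coiled-coil domain", "C terminus"}
precision, recall = precision_recall(extracted, marked)
```

In this toy case two of the three extracted terms are correct and two of the four marked terms are found.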

Stapley and Benoit (2000) developed a system named "BioBiblioMetrics" (Stapley, 2000) which uses text visualization to suggest functional clusters of genes from the yeast Saccharomyces cerevisiae. The system uses a subset of MEDLINE records containing the yeast's name, a lexical knowledge base of all the known, nontrivial yeast genes and their aliases from the SGD (Saccharomyces Gene Database), and a matrix of gene name pair co-occurrence statistics. When one does a search on a gene name or function (e.g. "DNA replication"), the co-occurring genes are displayed in a graph with "nodes" representing genes and edge lengths between the nodes representing biological proximity (Figure 2). Nodes are hypertext-linked to sequence databases, and edges to those MEDLINE documents that generated them, creating a biomedical information "landscape" and inference network. BioBiblioMetrics is freely available online at http://www.bmm.icnet.uk/~stapleyb/biobib/.
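The core data structure, a matrix of gene name pair co-occurrence counts, can be sketched as follows. This is a minimal illustration under my own assumptions, not the BioBiblioMetrics code: alias handling is omitted, gene names are matched as whole tokens, and the toy abstracts are invented.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence(abstracts, gene_names):
    """Count, for each pair of gene names, the abstracts in which both appear."""
    lookup = {name.lower(): name for name in gene_names}
    pair_counts = Counter()
    for text in abstracts:
        tokens = set(re.findall(r"\w+", text.lower()))
        present = sorted(lookup[tok] for tok in tokens if tok in lookup)
        pair_counts.update(combinations(present, 2))
    return pair_counts

def neighbours(pair_counts, gene):
    """Genes co-occurring with `gene`, strongest association first."""
    found = [(b if a == gene else a, n)
             for (a, b), n in pair_counts.items() if gene in (a, b)]
    return sorted(found, key=lambda item: (-item[1], item[0]))

# Invented abstracts, for illustration only
abstracts = [
    "RAD51 and RAD52 appear together.",
    "RAD52 and CDC28 appear together.",
    "RAD51 and RAD52 appear again.",
]
counts = cooccurrence(abstracts, ["RAD51", "RAD52", "CDC28"])
```

A display layer could then draw each gene as a node and make the edge between two genes shorter as their count rises. Keeping the list of abstracts behind each pair, rather than only the count, would support the hypertext links from edges back to MEDLINE documents.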

Other MEDLINE text mining papers which I did not have a chance to review in full involve dictionary-controlled natural language processing for extraction of drug-gene relationships (Rindflesch, Tanabe, Weinstein, & Hunter, 2000); statistical term strength analysis (Wilbur & Yang, 1996); statistical text classification and a relational machine-learning method (Craven & Kumlien, 1999); statistical identification of key phrases against an evolutionary protein family background (Andrade & Valencia, 1997 & 1998); pre-specified protein names and a limited set of action verbs (Blaschke, Andrade, Ouzounis, & Valencia, 1999); and a proprietary information extraction system (Thomas, Milward, Ouzounis, Pulman, & Carroll, 2000). Futrelle (2001a) provides online full-text access to many biomedical text mining papers, including those from the hard-to-get 2000 and 2001 Pacific Symposia on Biocomputing.

Bob Futrelle (2001a,b) has organized a large "bio-NLP" information network and enunciated a radical vision which includes several of the themes of this paper, such as the analogy between text and genome analysis, and the long history of information extraction in its many guises. He sees the challenge as "understanding the nature of biological text, whatever that turns out to be, linguistic theories not withstanding." He seems to feel that the traditional rules and grammars of Chomskian linguistics are more hindrance than help.

Frankly, a fresh new approach is needed, fueled by the conviction that language is a biological phenomenon, not a logical phenomenon. By this we mean that the nature of language is as messy as the genome. The data and observed phenomena in all their richness and variety are dominant and cannot subsumed by any elegant theories. This means that in many ways, biologists have far better hopes of cracking the NLP problem than the computational linguists, who are focused on mathematics and logic. Even when they look at data, it is primarily as grist for their math mills.

Futrelle recommends, for example, building visualization tools such as a protein noun phrase highlighter which could be used to "assemble a large collection of the standard textual expression forms [and] map these onto the query forms for which they are the answers."

But Futrelle also goes beyond immediate practical needs. Like Wise (1999), he has a coherent theory based on the biological nature of language.

By this I mean that language is a communicative capability of living organisms that has evolved from deep biological roots and from social interactions over millions, and ultimately, billions of years. I claim that language is not logical and mathematical, because that's not the nature of the organism (us) that exhibits the language capability.

An example of this is found in our vocabularies. A technically skilled adult will have a vocabulary of over 100,000 words, basically all memorized. The meaning of "bear" or "ship" does not follow from the characters that make them up. We simply commit them to memory. Linguists would like us to believe that our natural ability to "parse" is radically different and can be explained as a rule-based system.

My radical view is that we understand language not by generalization to abstract rules as much as by retaining examples and generalizing from them as needed. This is quite within our capacity, given our 100,000 word vocabularies. We also do reason. I would claim, again in the biological view, that this is done more by "imagined life" than by logic. Humans have superb abilities to remember events and to build detailed mental plans for future activities …. So we need to build this type of reasoning into our systems.

The analogy to genomics is clear. The coding of a particular protein by a particular sequence of DNA bases is just an accident of evolution. Whatever rules now appear to prevail (such as "zinc fingers" for DNA-binding proteins) can only be derived empirically, by looking for patterns within the data. Purely logical approaches must wait for a richer knowledge base. Only now, after the massive effort of half a century of molecular genetic research, sequencing whole genomes, and building databases and tools such as GenBank, GeneCards, and Proteome, can we begin to think about prediction of protein structure and function from sequence data alone. Biological linguistics now stands at the beginning of a comparably arduous journey.

These considerations put Swanson's, Kostoff's, Tanabe's, and Chowdhury's reliance on human expertise and manual filtering in a better light. Perhaps they do not represent premature surrender to difficulty so much as a necessary but hopefully temporary expedient. Perhaps they are keeping "the human in the loop" (Kantor) only long enough to "study the human to learn what to put in the machine" (Saracevic, 2001). This surprising interface between biomedical text mining and the cognitive tradition in IR would make a worthy topic for another paper.

 

References

Allen, J. (1995). Natural Language Understanding, Second Edition. Redwood City, CA: Benjamin/Cummings.

Andrade, M. A., & Valencia, A. (1997). Automatic annotation for biological sequences by extraction of keywords from MEDLINE abstracts. Development of a prototype system. Proceedings of the international conference on intelligent systems for molecular biology 5:25-32.

Andrade, M. A., & Valencia, A. (1998). Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics 14(7):600-607.

Bates, M. (1995). Models of natural language understanding. Proceedings of the National Academy of Sciences, 92, 9977-9982.

Belkin, N. J., & Croft, W. B. (1992). Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35, 29-38.

Blaschke, C., Andrade, M. A., Ouzounis, C., & Valencia, A. (1999). Automatic extraction of biological information from scientific text: protein-protein interactions. Proceedings of the international conference on intelligent systems for molecular biology, pp.60-67.

Bush, V. (1945). As We May Think. Atlantic Monthly, 176 (11), 101-108.

Cartia, Inc. (2000). ThemeScape product suite. Formerly online: http://www.cartia.com/products/index.html [no longer accessible].

Chowdhury, G. G. (1999). Template mining for information extraction from digital documents. Library Trends, 48, 182-208.

Craven, M., & Kumlien, J. (1999). Constructing biological knowledge bases by extracting information from text sources. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, pp.77-86.

Dorre, J., Gerstl, P., & Seiffert, R. (1999). Text mining: Finding nuggets in mountains of textual data. KDD-99, Association for Computing Machinery.

Doyle, L. (1961). Semantic road maps for literature searchers. Journal of the Association for Computing Machinery, 8, 223-239.

Fan, W. (2001). Text mining, web mining, information retrieval and extraction from the WWW references. Online: http://www-personal.umich.edu/~wfan/text_mining.html

Futrelle, R. P. (2001a). Natural language processing of biology texts. Online: http://www.ccs.neu.edu/home/futrelle/bionlp/

Futrelle, R. P. (2001b). The past, present and future of biology text understanding. Presented at the Conference on Biological Research with Information Extraction (BRIE), Tivoli Gardens, Copenhagen, Denmark, July 26. Online: http://www.ccs.neu.edu/home/futrelle/brie2001/index.html

Gifford, D. K. (2001). Blazing pathways through genetic mountains. Science, 293, 2049-2051.

Greenfield, L. (2001). Text mining. Online: http://www.dwinfocenter.org/docum.html

Hearst, M. (1997). Distinguishing between web data mining and information access. Presentation for the Panel on Web Data Mining, KDD 97, August 16, Newport Beach, CA. Online: http://www.sims.berkeley.edu/~hearst/talks/data-mining-panel/index.htm

Hearst, M. (1999). Untangling text data mining. In Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 20-26, 1999 (invited paper). Online: http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html

Hearst, M. (2001). About TextTiling. Online: http://www.sims.berkeley.edu/~hearst/tiling-about.html

Humphreys, K., Demetriou, G., & Gaizauskas, R. (2000). Bioinformatics applications of information extraction for scientific journal articles. Journal of Information Science, 26, 75-85.

IBM (1998a). Text analysis tools. Slide #8 of Intelligent Miner for Text Overview. Online: http://www-4.ibm.com/software/data/iminer/fortext/presentations/im4t23over/im4t23over8.htm

IBM (1998b). Text mining technology: Turning information into knowledge: A white paper from IBM. Daniel Tkach (Ed.). Online: http://www-4.ibm.com/software/data/iminer/fortext/download/whiteweb.pdf

Ingwersen, P., & Willett, P. (1995). An introduction to algorithmic and cognitive approaches for information retrieval. Libri, 45, 160-177.

Johnson, F. C., Paice, C. D., Black, W. J., & Neal, A. P. (1993). The application of linguistic processing to automatic abstract generation. Journal of Document and Text Management, 1, 215-241.

Kantor, P. B. (2001). Lecture K: Natural language concepts. Information Retrieval class, Rutgers University, School of Communication, Information, and Library Studies, New Brunswick, NJ.

Kostoff, R. N. (1999). Science and technology innovation. Technovation, 19. Online: http://www.dtic.mil/dtic/kostoff/Swanson2.txt

Kostoff, R. N., & DeMarco, R. A. (2001). Information extraction from scientific literature with text mining. Analytical Chemistry (in press). Online: http://www.onr.navy.mil/sci_tech/special/technowatch/kdocs/anchem2/txt

Kostoff, R. N., del Rio, J. A., Humenik, J. A., Garcia, E. O., & Ramirez, A. M. (2001). Citation mining: Integrating text mining and bibliometrics for research user profiling. Journal of the American Society for Information Science, 52, 1148-1156.

Kostoff, R. N., Toothman, D. R., Eberhart, H. J., & Humenik, J. A. (2000). Text mining using database tomography and bibliometrics: A review. Online: http://www.onr.navy.mil/sci_tech/special/technowatch/textmine.htm

KRDL (2001). Text mining: transforming raw text into actionable knowledge (white paper). Kent Ridge Digital Labs. Online: http://textmining.krdl.org.sg/

Laender, A. H. F., Ribeiro-Neto, B., da Silva, A. S., & Teixeira, J. S. (2001). A brief survey of web data extraction tools. In press.

Liddy, E. D. (2000). Text mining. Bulletin of the American Society for Information Science, 27. Online: http://www.asis.org/Bulletin/Oct-00/liddy.html

Liddy, E. D. (2001). Data mining, meta-data, and digital libraries. DIMACS Workshop on Data Analysis and Digital Libraries, May 17, Center for Discrete Mathematics and Theoretical Computer Science, Rutgers University, New Brunswick, NJ.

Lindsay, R. K., & Gordon, M. D. (1999). Literature-based discovery by lexical statistics. Journal of the American Society for Information Science, 50, 574-587.

Losee, R. M. (2001a). Natural language processing in support of decision-making: phrases and part-of-speech tagging. Information Processing and Management, 37, 769-787.

Losee, R. M. (2001b). Term dependence: A basis for Luhn and Zipf models. Journal of the American Society for Information Science, 52, 1019-1025.

Losiewicz, P., Oard, D. W., & Kostoff, R. N. (2000). Textual data mining to support science and technology management. Online: http://www.onr.navy.mil/sci_tech/special/technowatch/textmine.htm

Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2, 159-165.

Marcus, M. (1995). New trends in natural language processing: Statistical natural language processing. Proceedings of the National Academy of Sciences, 92, 10052-10059.

Mjolsness, E., & DeCoste, D. (2001). Machine learning for science: State of the art and future prospects. Science, 293, 2051-2055.

Perez-Carballo, J., & Strzalkowski, T. (2000). Natural language information retrieval: Progress report. Information Processing and Management, 37, 155-178.

Perrin, P. (2001). Personal communication, Molecular Systems research group, Merck & Co., Inc., Rahway, NJ.

Qin, J. (2000). Working with data: Discovering knowledge through mining and analysis. Bulletin of the American Society for Information Science, 27. Online: http://www.asis.org/Bulletin/Oct-00/qin.html

Rau, L. F. (1988). Conceptual information extraction and retrieval from natural language input. In RIAO 88, pp. 424-437. Paris: Centre des Hautes Etudes Internationales d'Informatique Documentaire, 1997, General Electric, USA.

Rindflesch, T. C., Hunter, L., & Aronson, A. R. (1999). Mining molecular binding terminology from biomedical text. Proceedings of the American Medical Informatics Association Symposium, 1999, 127-131. Online: http://www.amia.org/pubs/symposia/D005564.PDF

Rindflesch, T. C., Tanabe, L., Weinstein, J. N., & Hunter, L. (2000). EDGAR: extraction of drugs, genes and relations from the biomedical literature. Pacific Symposium on Biocomputing, 2000, 517-528.

Russell, S., & Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall.

Salton, G. (1992). The state of retrieval systems evaluation. Information Processing and Management, 28, 441-449.

Salton, G., Allan, J., Buckley, C., & Singhal, A. (1994). Automatic analysis, theme generation, and summarization of machine-readable texts. Science, 264, 1421-1426.

Saracevic, T. (2001). Personal communication and class discussions, Seminar in Information Studies, Rutgers University, School of Communication, Information and Library Studies, New Brunswick, NJ.

SDM (2001). Text mining 2002 [workshop prospectus]. Second SIAM International Conference on Data Mining, Arlington, VA, April 13, 2002. Online: http://www.cs.utk.edu/tmw02/

Sneiderman, C. A., Rindflesch, T. C., & Aronson, A. R. (1996). Finding the findings: identification of findings in medical literature using restricted natural language processing. Proceedings of the American Medical Informatics Association Annual Fall Symposium, 1996, 239-243.

Stapley, B. J. (2000). BioBiblioMetrics [On-line]. Available: http://www.bmm.icnet.uk/~stapleyb/biobib/

Stapley, B. J., & Benoit, G. (2000). Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts. Pacific Symposium on Biocomputing, 2000, 529-540.

Swanson, D. R. (1988). Historical note: Information retrieval and the future of an illusion. Journal of the American Society for Information Science, 39, 92-98.

Swanson, D. R., & Smalheiser, N. R. (1997). An interactive system for finding complementary literatures: A stimulus to scientific discovery. Artificial Intelligence, 91, 183-203.

Swanson, D. R., & Smalheiser, N. R. (1999). Implicit text linkages between Medline records: Using Arrowsmith as an aid to scientific discovery. Library Trends, 48, 48-51.

Swanson, D. R., Smalheiser, N. R., & Bookstein, A. (2001). Information discovery from complementary literatures: Categorizing viruses as potential weapons. Journal of the American Society for Information Science and Technology, 52, 797-812.

Tanabe, L., Scherf, U., Smith, L. H., Lee, J. K., Hunter, L., & Weinstein, J. N. (1999). MedMiner: An Internet text-mining tool for biomedical information, with application to gene expression profiling. BioTechniques, 27, 1210-1217.

Thomas, J., Milward, D., Ouzounis, C., Pulman, S., & Carroll, M. (2000). Automatic extraction of protein interactions from scientific abstracts. Pacific Symposium on Biocomputing, 2000, 541-552.

Wilbur, W. J., & Yang, Y. (1996). An analysis of statistical term strength and its use in the indexing and retrieval of molecular biology texts. Computers in Biology and Medicine, 26(3):209-222.

Wise, J. A. (1999). The ecological approach to text visualization. Journal of the American Society for Information Science, 50(13):1224-1233.

Witten, I. H., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann (Academic Press).

 

 

 

Table 1. Initial List of Information Retrieval (IR) Concepts Related to Text Mining.

IR concept                            Authority (see References)
Artificial intelligence               Fan; Perrin
Bioinformatics                        Futrelle; Perrin
Citation mining                       Kostoff
Computational Linguistics             Fan; Hearst
Conceptual Graphs                     KRDL
Data Abstraction                      Fan
Data Mining                           Fan; Perrin; SDM
Database Tomography                   Kostoff
Document Mining                       Fan
Domain Knowledge                      KRDL
Electronic Commerce                   Fan
Factor Analysis                       SDM
Information Access                    Hearst
Information Extraction                Chowdhury; Fan; Futrelle; Kostoff; Perrin
Information filtering                 Fan
Information Integration               Fan
Information Retrieval                 Fan; Perrin
Information Visualization/Mapping     Futrelle; Fan; SDM
Intelligent Agents ("bots")           Fan
Knowledge Discovery                   Fan
Knowledge Extraction                  Perrin
Knowledge Representation              Perrin
Language Identification               IBM
Machine Learning                      Fan; Futrelle; Perrin
Metadata Generation                   SDM
Natural language processing           Fan; Futrelle; Perrin; Rindflesch; Saracevic
Ontologies/Vocabularies/Lexicons      Futrelle
Phrase Extraction                     Fan
Question Answering                    Futrelle
Resource Discovery                    Fan
Resource Indexing                     Fan
Semantic Modeling                     Perrin; SDM
Semantic Processing                   Rindflesch
Statistical Language Modeling         Fan
Stemming                              SDM
Syntactic Processing                  Saracevic
Template Mining                       Chowdhury; KRDL
Text Analysis                         Futrelle; IBM
Text Classification/Categorization    Fan; Hearst (distinct); IBM; SDM
Text Clustering                       Fan; IBM
Text Data Mining                      Hearst; Kostoff
Text Parsing                          SDM
Text Purification                     SDM
Text Segmentation/"TextTiling"        Hearst; SDM
Text Summarization                    Futrelle; IBM; Saracevic; SDM
Text Understanding                    Futrelle; Fan
Web Data Mining                       Hearst
Web Mining                            Fan
Web Utilization Mining                Fan

 

 

Figure 1. ThemeScape™ visualization of a collection of 4,314 Y2K debate forum documents (Cartia, 2000, expired website).

 

 

 

Figure 2. BioBiblioMetrics retrieval from a search on "DNA repair" and "recombination" (Stapley, 2000).