Information Retrieval Design, a book by James D. Anderson and Jose Perez-Carballo



Glossary

Terms with their own definitions are printed in boldface in other definitions.

abstract entity. Abstract entities are abstractions — constructs abstracted from (drawn from, based on) experience or thought — that can’t be seen or touched but whose existence is made known indirectly through various recognized symptoms or indicators. Examples include the American Medical Association, Rutgers University, communism, Islam, and the theory of relativity. In the case of organizations and corporations, many are incorporated (embodied, from Latin “in” plus “corpus” = “body”) through law. See also concrete entity; entity; and chapter 2.

ad hoc string syntax. Ad hoc (“to this”) string syntax provides a means for coding natural language statements so that a computer program can convert them into meaningful index headings. The statements can be already existing, such as document titles, or they can be created as part of the indexing process. The coding tells the computer how to break up the statement into individual terms, how to arrange the terms within headings, and which terms to place in lead position as access points in a displayed index. NEPHIS (Nested Phrase Indexing System) is the ad hoc string syntax used for the index of this book. See also section 12.2.2.3.

ad hoc syntax. Ad hoc (“to this”) syntax refers to syntax that is developed “on the spot” for a one-time indexing project, as opposed to an ongoing indexing operation, such as a regularly updated IR database or indexing and abstracting service. The most common example of ad hoc syntax is that used in most back-of-the-book indexes. Indexers who create the detailed indexes to individual books rarely use a pre-established system of syntax, but rather put terms together as they see fit for the particular situation. Because such a book index is a one-time operation, there is no need to record practice for the sake of long-term consistency, although good indexers attempt to maintain consistency throughout a single index. See also section 12.2.7.

analysis base. See indexable matter.

automatic indexing. Automatic indexing refers to indexing by machine, or the analysis of text by means of computer algorithms. The focus is on automatic methods used behind the scenes with little or no input from individual searchers, with the exception of relevance feedback. Thus automatic indexing does not include searching options and techniques used by human searchers, such as methods for creating effective search statements, adding weights to terms, specifying proximity requirements, using truncation or wild cards, or combining terms with boolean or role operators. See also section 8.3.

best match syntax. Best match syntax refers to a growing variety of electronic term-matching methods that apply techniques for predicting potential relevance of documentary units in response to a search statement and then ranking documentary units according to predicted “relevance” scores. Because the most common approaches are based on some method of assigning weights to terms (search terms, index terms or both), this type of syntax can be called “weighted term syntax.” Particular models of weighted term syntax include the vector space model, the probabilistic model, and the language model. See also sections 8.3.5 and 12.3.2.
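
A minimal sketch of the weighted-term idea follows. It is not the vector space, probabilistic, or language model named above; it only sums assumed query-term weights for each documentary unit and ranks by the total. The function name, the weights, and the sample documents are hypothetical.

```python
def rank_by_weighted_terms(query_weights, doc_terms):
    """Rank documentary units by the sum of the weights of matching query terms.

    query_weights: dict mapping search term -> weight (illustrative values)
    doc_terms: dict mapping document id -> set of index terms
    """
    scores = {}
    for doc_id, terms in doc_terms.items():
        # Add up the weights of the query terms that this document contains.
        scores[doc_id] = sum(w for t, w in query_weights.items() if t in terms)
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


query = {"retrieval": 2.0, "index": 1.0}
docs = {
    "d1": {"retrieval", "index"},
    "d2": {"index"},
    "d3": {"music"},
}
ranking = rank_by_weighted_terms(query, docs)
```

Unlike exact match syntax, every documentary unit receives a score, so partially matching units are ranked rather than excluded.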

bibliographic coupling. Bibliographic coupling is a special form of clustering based on reference citations. The underlying idea is that two or more documents are related (“bibliographically coupled”) if they share the same reference citations. The more reference citations they share (the higher the threshold), the more closely related they may be. See also co-citation; and section 8.3.12.1.
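
As a rough illustration, coupling strength can be computed as the number of reference citations two documents share, keeping only pairs that meet a threshold. The function name and sample data below are hypothetical.

```python
from itertools import combinations


def coupling_strengths(references, threshold=1):
    """Return {(doc_a, doc_b): shared_count} for pairs sharing at least
    `threshold` reference citations.

    references: dict mapping document id -> list of cited reference ids
    """
    strengths = {}
    for a, b in combinations(sorted(references), 2):
        shared = len(set(references[a]) & set(references[b]))
        if shared >= threshold:
            strengths[(a, b)] = shared
    return strengths


refs = {
    "doc1": ["r1", "r2", "r3"],
    "doc2": ["r2", "r3", "r4"],
    "doc3": ["r5"],
}
coupled = coupling_strengths(refs)
```

Raising `threshold` keeps only the more strongly coupled pairs.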

bibliography. In this book, we have subsumed the term “bibliography” under the broader, newer term “IR database,” but “bibliography” and “bibliographies” are fine old terms that mean writing (graphy) about books (biblio), thus they have come to mean lists and descriptions of books. There is no reason to limit their meaning to “books,” because the “biblio” part of the word comes from the Greek for papyrus leaves! So by extension, bibliographies can deal with messages, texts, and documents in any format and medium, just as IR databases do.

boolean syntax. See exact match syntax.

bound term. A bound term is a compound term consisting of two or more words, sometimes representing two or more concepts, which almost always occur together and have come to be considered a single concept. “Information science,” for example, could be factored into “science” and “information,” but the bound term “information science” is the name of a discipline and to decompose it would be misleading. Proper names of multiple parts are also bound terms, e.g., “United States.” “Birth control” is a bound term whose meaning differs from that of its component parts, because the bound term means control or prevention of conception, as opposed to the control of birth. See also complex term; term.

catalog, cataloging. A catalog is an index for a particular collection of messages (plus texts and documentary units) or of objects (such as a mail order catalog of clothing or a museum’s exhibition catalog). A union catalog is an index for several collections. Cataloging is the process of creating a catalog, so it is a type of indexing.

cataloging. See catalog, cataloging.

chain syntax. Chain syntax or chain indexing is a syntax technique to create alphanumeric indexes to classification headings that have been arranged in a non-alphabetical relational classified order. The terms of a classification heading or caption are arranged as a chain of terms in the order opposite their use in the classification. Each of these terms may become a lead term in the alphanumeric index, followed by subsequent terms for context. See also section 12.2.4.1.
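
The reversal described above can be sketched as follows, assuming a classification heading is given as a list of terms from broadest to narrowest. The function name and the sample heading are hypothetical, and real chain indexing procedures may also drop or reword some links.

```python
def chain_entries(classification_chain):
    """Build chain-index headings from a classification heading.

    classification_chain: terms in classified order, broadest first.
    Returns one heading per term: that term leads, followed by the
    broader terms in reverse classified order for context.
    """
    reversed_chain = list(reversed(classification_chain))
    return [reversed_chain[i:] for i in range(len(reversed_chain))]


entries = chain_entries(["Literature", "English", "Poetry"])
headings = [": ".join(entry) for entry in entries]
# headings can then be sorted alphanumerically to form the index
```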

citation index. Reference citations have always been a useful basis for indicating (indexing!) possibly useful relationships. Almost every writer of a term paper, to say nothing of more serious researchers, has pursued reference citations in good documents as a way to find other documents of interest. In effect, a string cluster is created, with each link through a reference citation leading to an older document that was cited in the newer document. This kind of citation indexing can only lead backwards in time, because it is impossible to cite a document that has not yet been created! Creating indexes that could trace reference citations forward in time was extremely laborious before the advent of the computer. Such citation indexing was limited largely to the legal literature until the Institute for Scientific Information (1961, 1969, 1976) introduced the Science citation index in 1961, followed by the Social science citation index in 1969 and the Arts and humanities citation index in 1976. These indexes have now become standard tools, available in both print and electronic forms. They permit the user to begin with a given document and to trace its citation in subsequent documents forward in time. To the extent that a reference citation indicates a link between messages related with respect to topic, purpose, meaning or significance, these links can be quite useful for IR searching. See also  bibliographic coupling; co-citation; and section 8.3.12.

class. A grouping of items sharing some similarity. See also classification.

classification. Classification literally means to place items in classes, resulting in groupings of items sharing some similarity. By extension, it can refer to the creation and/or naming of these classes. By further extension, it often includes the arrangement of classes in a logical, relational, non-alphanumeric order. At the fundamental level, indexing and classification are the same process, because in both operations, messages must be analyzed, and based on this analysis, grouped into categories or classes. Finally, these groupings must be named and arranged to provide access. At the more superficial level, but reflecting its most common usage, classification refers to the logical, relational (non-alphanumeric) arrangement of classes, in contrast to alphanumeric indexes in which classes are simply arranged in alphanumeric order on the basis of their names.

classing. The act of creating classes and assigning items to classes or placing items into classes. See also classification.

clustering. “Clustering” means to create or identify groupings or clusters of items. At one level, “classing,” “classification,” and clustering all mean the same thing — the assembling of items into groups or categories. However, the term “clustering” is used more often when the classing, or gathering together, is done through automatic, or algorithmic, means. The term “classing” or “classification” usually implies human judgment.

co-citation. Like bibliographic coupling, co-citation is a form of clustering based on reference citations. In co-citation clustering, however, clusters are not based on reference citations shared by documents (as in bibliographic coupling) but on two or more documents being cited together in a subsequent document. If new papers, hot off the press, frequently cite both documents A and B, then, the reasoning goes, documents A and B must be related, and the more often documents are co-cited (cited together in later documents), then the closer the relationship is. Because new documents keep coming out, with different sets of reference citations, the co-citation clusters keep changing over time, showing new patterns of emerging relationships among documents, authors, and the topics they address. This constant change, incorporating new citation patterns, is the basis of the claim (or hope) of its proponents that these co-citation clusters can identify hot topics and emerging research fronts. See also section 8.3.12.2.
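
A minimal sketch of co-citation counting, assuming each later document is represented by the list of documents it cites. The names and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_docs):
    """Count how often each pair of documents is cited together.

    citing_docs: dict mapping a later (citing) document id -> list of
    document ids it cites.
    """
    counts = Counter()
    for cited in citing_docs.values():
        # Every pair of documents in one reference list is co-cited once.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts


citing = {
    "new1": ["A", "B"],
    "new2": ["A", "B", "C"],
    "new3": ["C"],
}
cocited = cocitation_counts(citing)
```

Recomputing the counts as new citing documents are added is what makes the clusters shift over time.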

co-extensive headings. See statement/heading specificity.

complex term. Sometimes “complex term” is used for a single phrase denoting more than two distinct concepts. The Library of Congress introduced the complex term “telephone assistance programs for the poor” in 1990. This single term could be broken up into separate terms for “telephones,” “assistance programs” and “poor people,” so it could qualify as an example of a complex term. See also bound term; compound term.

compound term. “Compound term” can refer to a term consisting of more than one word, but more often it refers to a term consisting of more than one concept, such as “juvenile delinquency,” which includes the concepts of both “young person” and “delinquency.” See also bound term; complex term.

concrete entity. Concrete entities are those one can see and touch (at least in theory), like persons, tables, chairs. Imaginary concrete entities such as unicorns, Paul Bunyan, angels and faeries (if they are imaginary), can play the same types of roles in messages as do concrete real entities, so for facet analysis, they can be considered a type of concrete entity. See also abstract entity; entity; and chapter 2.

concrete entity and event database. Databases can also be characterized by the nature of the objects or phenomena that they are designed to describe. Concrete entity and event databases organize data about real concrete entities and events. Examples include airline databases that contain data about airplanes and all their parts, their maintenance, their crews, particular flights, passengers, fares, supplies, including which passengers get special meals, etc.; or bank databases that contain data about all customers, all their accounts, their balances and every banking transaction. The focus of these databases is on concrete entities and concrete events. In contrast, IR databases are designed to describe messages. These messages, of course, may be about concrete entities and events, but just as often, they can be about abstract entities or ephemeral phenomena, such as theories, feelings, emotions, and aesthetics. See also section 1.6.

data. See datum, data.

database. “Database” is a relatively new word for a collection of data that is organized for retrieval. It is sometimes restricted to organized collections of data in electronic media, but in this book, the term “database” is used for any collection of data organized for retrieval, regardless of medium, so that printed indexes, catalogs, encyclopedias, and similar reference works constitute examples of databases as well as electronic retrieval tools on CD-ROM or available online or via the world-wide web. (There is a brief note on the origin of this term in section 1.1 and a recap on IR databases versus other types of databases in section 1.6.)

Databases (along with the systems for access that accompany those in electronic form) can be categorized in many ways: by mission or purpose (such as MIS: management information systems), by subject areas (such as GIS — geographical information systems), by models of organization (such as relational, hypertext, object-oriented, flat-file), or by phenomena represented by data (such as real, concrete entities (things, objects!) and events versus messages about entities and events, including abstract entities, imaginary entities and fictitious events). This book focuses on databases designed for the purpose of facilitating discovery and retrieval of messages of all types, so our databases are called “information retrieval databases” or, for short, IR databases. Their purpose is information retrieval. The primary data in such databases describe messages rather than concrete entities and events. See also sections 1.5 and 1.6.

datum, data. A “datum” (singular of data) may be considered to be a single fact or item of evidence. To be informative, a datum needs one or more additional data of different sorts to provide context. Thus it can be said that a message (potential information) needs at least two data. A set of numerical data, such as “70, 90, 28, 64,” is meaningless unless some explanation is provided. Do these data refer to temperatures? sport scores? or what? Similarly, a simple datum regarding color, such as “red,” carries much more meaning when it is combined with at least one more datum, such as “chair.” Data are often presented in tables, along with explanations, e.g., average temperatures by month and place or the scores of yesterday’s football games. Because IR databases focus on messages, they rarely deal with raw data except in the context of messages, where data are placed in context. See also knowledge.

descriptive cataloging, descriptive indexing. “Descriptive cataloging” is an old and honorable term that refers to the description and indexing of texts and documents with respect to features other than the content, purpose, or meaning of the text’s message. Such features include the authors and other creators of texts (editors, composers, illustrators, translators, artists, etc.); the names or titles of texts (including subtitles, parallel titles, alternate titles, running titles, etc.); the publishers or manufacturers and distributors of documents; the size and medium of documents; and the symbol set and code used to encode the text. Codes and symbols used to encode texts include natural languages and their writing systems (French, German, Chinese), but also codes and symbols for music, dance, chemistry, mathematics, etc., and, at another level, codes for the representation of messages in digital media. Names and index terms are established for the most important of these features. Descriptive cataloging (along with subject cataloging) is part of the process for making a catalog. “Descriptive indexing” is a rarely used term for the same process outside of the context of catalogs for particular collections of documents.

descriptive indexing. See descriptive cataloging.

descriptor. The term “descriptor” is usually reserved for a term that is part of a controlled indexing language. Such indexing languages are often listed in a thesaurus. For each concept included in the indexing language, one descriptor will be chosen to represent the concept, and all other terms that can be used for the same concept are linked to the descriptor by means of cross references. Thus, if a thesaurus uses the descriptor “lawyer,” then it might not use the terms “attorney,” “barrister,” “solicitor,” or “counselor-at-law.” Each of these alternative terms would be linked to the preferred descriptor “lawyer” and would be given the status of un-used synonymous or equivalent terms. (Equivalent terms are terms that are not truly synonymous, but are close enough so that they can be considered equivalent in the context of an IR database. Anyone who knows the English legal system knows that “barrister” and “solicitor” are not exactly the same as U.S. “lawyers,” but in many databases, the distinction would not be important enough to make, so that “barrister” and “solicitor” could be considered equivalent to “lawyer.”) See also vocabulary control/vocabulary management.

digital library. Digital libraries are full-text databases that replicate, in digital media, many of the functions of traditional libraries. They tend to contain a purposefully selected collection of texts plus various means of access to these texts. ODLIS: Online Dictionary of Library and Information Science (Reitz 2000) defines “digital library” simply as: “A library in which a significant proportion of the resources are available in digital (machine-readable) format, as opposed to print or microform. The process of digitization began with indexes and abstracting services, then moved to periodicals and reference books.”

direct file. A direct file is generally the original file of documentary unit records or surrogates. From it, an inverted file is created by extracting search terms and rearranging them for quick access and processing by computer algorithm. See also non-displayed index.
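
The relationship between the two files can be sketched as below, assuming a direct file that maps record numbers to short surrogate texts and a deliberately naive tokenizer (lowercasing and splitting on whitespace). The function name and sample records are hypothetical.

```python
from collections import defaultdict


def build_inverted_file(direct_file):
    """Extract search terms from a direct file and rearrange them by term.

    direct_file: dict mapping record id -> text of the record or surrogate
    Returns a dict mapping each term -> sorted list of record ids.
    """
    inverted = defaultdict(set)
    for record_id, text in direct_file.items():
        for term in text.lower().split():
            inverted[term].add(record_id)
    # Sort terms and postings so lookups and display are predictable.
    return {term: sorted(ids) for term, ids in sorted(inverted.items())}


direct = {
    1: "Information retrieval design",
    2: "Database design",
}
inverted = build_inverted_file(direct)
```

A search for a term then becomes a single lookup in the inverted file instead of a scan of every record in the direct file.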

displayed index. Displayed indexes are indexes that are displayed for direct examination, browsing, or scanning by users, as opposed to non-displayed indexes that are meant for computer manipulation and are not displayed for human examination. See also chapter 11.

document. A document is a combination of text and medium. Texts cannot exist without embodiment in some medium, whether ephemeral, like airwaves, or longer lasting, like paper, film, or electronic media for digital data. Usually we use ÒdocumentÓ to refer only to texts recorded in the longer-lasting media, and it is these documents that are susceptible to indexing and later retrieval.

documentary domain. Documentary domain is the territory (domain) from which documents are gathered for an IR database. Two IR databases can have identical subject scopes and identical documentary scopes, yet provide very different coverage because of different documentary domains. An IR database that obtains documents only from the holdings of one library, for example, will have very different coverage compared to an IR database that combs the entire world for documents that fall within its subject and documentary scopes.

documentary scope. Documentary scope defines and describes the kinds of messages, texts, and documents a user can retrieve via an IR database in terms of non-topical features, such as authorship or kinds of authors; media; codes and symbols systems used to encode messages as texts, including the various human languages used for creating language texts; the forms, formats, and genres of texts; the complexity or technical level of messages, including the kinds of audiences for whom they are intended (children, professionals, general public, etc.); points of view, biases, and methodological approaches characterizing the treatment of topics in messages; time and place of creation, manufacture or publication, etc. See also chapter 3.

documentary unit. A documentary unit is the portion of a document that can be directly retrieved by an IR database. Documentary units may be complete documents, such as complete books, or complete periodical articles. Or they may be parts of complete documents — chapters in books, or paragraphs or charts or diagrams or illustrations in periodical articles. This same variety in the size of documentary units applies to all media. An IR database for videotapes, for example, might retrieve only complete videotapes (so that the documentary unit is the complete tape), or it might be able to retrieve individual frames or short sequences of frames, in which cases, either the individual frames, or the short sequences of frames, constitute the documentary units. In all cases, the documentary unit is the unit that is analyzed for indexing (either by machine algorithm or by human inspection). Consequently, the documentary unit is also called the “unit-of-analysis.” “Bibliographic unit” has also been used for this concept, indicating the unit described and retrievable via a bibliography. Small documentary units have also been called “information units,” but one should hope that all documentary units will be informative! See also indexable matter; and chapter 6.

domain. See documentary domain; subject domain.

end-user thesaurus. Traditionally, thesauri were designed to guide indexers, who were compelled to use preferred terms from indexing thesauri for every concept. There are no preferred terms in an end-user thesaurus. Instead, for every concept included, all variant, synonymous, and equivalent terms are displayed, along with narrower, broader, and other related terms. The purpose is to help searchers find as many relevant terms as possible for their searches.

entity. Entities, or things, are one of the fundamental facets in facet analysis for indexing and classification. Ranganathan referred to the entity facet as the “personality” facet, because he focused on the defining characteristics of entities (their personalities!). Entities include concrete entities (living beings and inanimate objects, naturally occurring or human-made, whether real or imaginary) and abstract entities (institutions, theories, ideologies, etc.).

entry. In displayed indexes, an entry represents and points to a documentary unit. An entry consists of a heading (of one or more terms) and a single locator, such as “United States  23” or “United States. history. civil war. bibliography  44.” The locator leads to the documentary unit. In this example the locators 23 and 44 might refer to particular paragraphs or pages or to entries in a list of document citations or to documents on shelves or in a filing cabinet. See also entry array.

entry array. When two or more entries have identical headings or subheadings, these duplicate headings are usually merged for display, resulting in entry arrays that might look something like this:

United States
     Armed Forces
          Afro-Americans. Bibliography  25
                          History  24, 30, 339
          California. History. 20th century  54
          China. History. 20th century  332
                 Military life. History  442
          Gays  74-80, 445-450
                Government policy  76
                History. 20th century  78-80
                Legal status, laws, etc.  76-78

In this example, there are three separate entries for “United States. Armed Forces. Afro-Americans. History.” They have been merged to save space and make the display of the index more convenient for the user. Each separate locator indicates a separate entry in an index.
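
The merging step can be sketched as follows, assuming each entry is a (heading, locator) pair. The function name is hypothetical, and the sketch merges only identical full headings; it does not reproduce the indented display of shared subheadings.

```python
from collections import defaultdict


def merge_entries(entries):
    """Merge entries with identical headings into entry arrays.

    entries: list of (heading, locator) pairs, one pair per entry
    Returns a dict mapping each heading -> list of its locators.
    """
    merged = defaultdict(list)
    for heading, locator in entries:
        merged[heading].append(locator)
    return dict(merged)


entries = [
    ("United States. Armed Forces. Afro-Americans. Bibliography", "25"),
    ("United States. Armed Forces. Afro-Americans. History", "24"),
    ("United States. Armed Forces. Afro-Americans. History", "30"),
    ("United States. Armed Forces. Afro-Americans. History", "339"),
]
arrays = merge_entries(entries)
```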

equivalent term. Equivalent terms are synonymous terms but also terms that are not truly synonymous, but are close enough so that they can be considered equivalent in the context of an IR database. Anyone who knows the English legal system knows that “barrister” and “solicitor” are not exactly the same as U.S. lawyers, but in many databases, the distinction would not be important enough to make, so that “barrister” and “solicitor” could be considered equivalent to “lawyer.”

exact match syntax. Exact match syntax for electronic matching of terms in non-displayed indexes requires that terms associated with documentary units match exactly the requirements of the search statement. Most often exact match syntax is implemented using boolean operators; it is often called “boolean syntax.” Note however that the requirements of the boolean “or” operator permit terms linked with “or” to be either present or absent. Only documentary units whose terms match search statements exactly (within the parameters provided by search syntax options such as truncation, proximity ranges, and stemming) are retrieved. See also best match syntax; and section 12.3.1.
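
A minimal sketch of boolean matching over an inverted file, using set operations. The helper names and the sample index are hypothetical, and options such as truncation, proximity, and stemming are left out.

```python
def boolean_and(index, *terms):
    """Units that contain every one of the given terms."""
    sets = [set(index.get(term, ())) for term in terms]
    return set.intersection(*sets) if sets else set()


def boolean_or(index, *terms):
    """Units that contain at least one of the given terms."""
    return set().union(*(index.get(term, ()) for term in terms))


def boolean_not(candidates, index, term):
    """Candidate units that do not contain the given term."""
    return set(candidates) - set(index.get(term, ()))


# Hypothetical inverted file: term -> ids of documentary units
index = {"cats": [1, 2], "dogs": [2, 3], "birds": [3]}

both = boolean_and(index, "cats", "dogs")
either = boolean_or(index, "cats", "birds")
no_birds = boolean_not(boolean_or(index, "cats", "dogs"), index, "birds")
```

A unit is either retrieved or not; nothing is ranked, which is the contrast with best match syntax.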

exhaustivity. Exhaustivity of indexing refers to the detail with which the topics and features of messages, texts, and documentary units are described. How many different descriptors or terms are used to describe the content or features of a typical documentary unit? This number of terms or descriptors is a measure of exhaustivity. See also chapter 9.

facet. Facets are fundamental categories, aspects, or “faces” of phenomena not unlike the journalist’s “who, what, where, when, why.” Facets represent fundamental characteristics by which any message can be analyzed and described. Facets also represent the important aspects of a subject area that form the basis for creating and arranging a relational classification. Thus a classification of literature might be arranged by the facets of language, nationality, genre, period, writer, theme, etc. The term comes from the French diminutive for “face,” “facette” (Webster’s 1966). Examples of generic facets are:

● entities or things (persons, artifacts, natural objects, animals and plants, institutions, and other abstract entities, etc.)

● attributes or constituent materials

● actions (operations, processes, and events)

● places

● times

Specialized IR databases will make use of much more specialized facets. See also chapter 2.

facet analysis. The analysis of topics with respect to their basic aspects or facets. See also chapter 2.

faceted syntax. Faceted syntax is used when there is a need or desire to have the individual terms or descriptors in an index heading arranged in some meaningful order. Terms are assigned to facet categories and these categories are used to determine the order of terms in the heading. See also section 12.2.2.2.
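
The ordering rule can be sketched as follows. The facet order used here simply follows the generic facets listed under "facet" (entities, attributes, actions, places, times) and is an assumption, as are the function name and the sample terms; an actual indexing system would define its own citation order.

```python
# Assumed citation order of facets, for illustration only.
FACET_ORDER = ["entity", "attribute", "action", "place", "time"]


def faceted_heading(terms):
    """Arrange terms into a heading according to their facet categories.

    terms: list of (term, facet) pairs in any order
    """
    rank = {facet: position for position, facet in enumerate(FACET_ORDER)}
    # sorted() is stable, so terms in the same facet keep their input order.
    ordered = sorted(terms, key=lambda pair: rank[pair[1]])
    return ". ".join(term for term, _ in ordered)


heading = faceted_heading([
    ("20th century", "time"),
    ("Bridges", "entity"),
    ("France", "place"),
    ("Construction", "action"),
])
```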

flat-file database. A flat-file database is an IR database based on the flat-file data model. In contrast to the rather sophisticated and highly structured relational and object-oriented models, a simple flat-file data model calls for nothing more than “a single file containing many records, each of which contains the same set of fields” (FOLDOC 2002, “database”). This simple model, sometimes called a “flat file” design, is quite common for IR databases.

format. Texts come in many shapes and styles, influenced by the medium on or in which a message is encoded, by the meaning and purpose of the message, and by the intended recipients of the message. Shape and style contribute to something we usually call format. It covers a wide spectrum of attributes, such as literary genre (poetry, narrative, drama, essay, speech, fiction, etc.), type of presentation (chart, diagram, picture, cartoon, list, etc.), and type of publication (such as book versus pamphlet versus broadside or poster on paper media, slide versus motion picture in film media). D. W. Langridge (1992, p. 28) suggests six types of attributes related to “the method of selection, arrangement or display” of message content under the heading of format: 1. order, how material is arranged, e.g., alphanumerically, chronologically, or in some classified order according to mutual relations of message content or features; 2. literary forms or genres (poetry, drama, essay, narrative fiction, short story); 3. reductions (abstracts, excerpts, quotations, summaries); 4. collections (encyclopedias, compendia, handbooks, readers); 5. keys to other documents (indexes, bibliographies, catalogs); and 6. rules (standards, codes, recipes). The distinction between “format” and “medium” is sometimes fuzzy. For example, the meaning of “book” usually includes its medium (paper) as well as its shape (leaves bound together along one edge). When the content of a book is moved to electronic media, is it still a book? Probably the meaning of “book” will continue to shift as the media on or in which book-like messages are conveyed change over time. After all, our word “book” came from the Anglo-Saxon word for beech tree, because ancient runes were once written on beech bark! The connection in German is quite close: “Buche” for “beech tree” and “Buch” for “book.”

free-text term. Often shortened to “free text,” “free-text term” usually refers to the use of uncontrolled words or terms from natural language text for indexing or searching. When one searches the actual text of a document, one is searching the free-text terms that are found in the document. The difference between “free-text terms” and just “terms” is that sometimes terms may be standardized, at least a little, with respect to format, and they may also have links with the most common synonyms or equivalent terms, even if they are not controlled to the extent of descriptors. In this paragraph, every term or phrase is a free-text term. Some of the smaller words (such as “of,” “the,” “to,” etc.) may be listed on a stop list of unsearchable terms — terms that cannot be searched for by themselves, but they are still free-text terms! “Keyword” is often used to indicate the more important free-text terms.

full-text database. Full-text databases are IR databases that contain the full text of the documents that they describe and organize for retrieval. Such texts may be based on a variety of representation codes, such as linguistic, pictorial, musical, mathematical, etc. We have long had full-text databases in print media. Examples include handbooks and encyclopedias. In addition, monographs with their own back-of-the-book indexes also qualify as full-text databases, because the text of the monograph is presented together with an index, and this index describes and reorganizes the content and other features of the text for retrieval. But usually, “full-text database” refers to electronic databases.

generic posting. Generic posting is the use of a broader, more generic term in addition to a specific term to represent a topic or feature of a message, text or documentary unit. Some experts make a distinction between “generic posting” and “up-posting,” limiting “generic posting” to the use of a broader, more generic term in place of a specific term (e.g., using “furniture” in place of “sofas”) while using “up-posting” for the use of both a specific term and a broader, more generic term (using both “sofas” and “furniture”).

heading. In displayed indexes (indexes that are designed for visual inspection by humans as opposed to non-displayed indexes that are searched by computer algorithm), index terms are combined into headings consisting of multiple terms. It is possible to have index headings with only single terms, but headings of two or more terms are more meaningful, because the lead term is modified or amplified or described by the subsequent term or terms. The subsequent term or terms create a context for the first, or lead, term. Compare, for example, the meaning of the simple heading “United States” versus the more detailed meaning of “United States — history — civil war — bibliography.” In the second heading, “United States” has been modified or defined by aspect or approach (history), event or period (civil war), and format (bibliography). An index heading is an essential part of an index entry. When displayed indexes are displayed in classified rather than alphanumeric order, the headings are often called “captions.”

hierarchical specificity. This type of specificity has nothing to do with the specificity relationship between the meaning of a term and the message, text, or documentary unit to which it refers. Instead, it relates to the relative narrowness or breadth of the meaning of a term in a hierarchy. Weinberg and Cunningham (1984, 1985) used this definition in comparisons with operational specificity. Thus this hierarchical term-term relationship is entirely different from the term-document relationship that forms the basis of the semantic term-document relational definition of “specificity” as used in this book. See also section 10.1.

hierarchy. From the Greek for “hierarch” or “high priest,” “hierarchy” is now used to indicate an array of terms or descriptors or categories arranged from broader to narrower. There is a strong theoretical proposition that broader-narrower relationships exist only within facets (Kwasnik 1999).

HTML (HyperText Markup Language). See text encoding schemas.

human indexing. Human indexing is indexing done by humans based on analysis using the human intellect. See also section 8.2.

hypermedia. Hypermedia is really hypertext. The medium is electronic and digital. It is the format that is hypertextual.

hypertext. Hypertext is text displayed in an interactive format so that a user (a reader or viewer or listener) has the capability of skipping around from place to place rather freely. The various parts of the text are linked via hyperlinks. Hypertext can be contrasted with traditional static linear text. See also section 21.1.

hypertext database. With the advent of the world-wide web, hypertext databases have become more and more common. According to FOLDOC (2002, “hypertext”), “hypertext” refers to a “collection of documents (or ‘nodes’) containing cross-references or ‘links’ which, with the aid of an interactive browser program, allow the reader to move easily from one document to another.” In a hypertext IR database, some of these documents may be summary records or surrogates, which can lead the user to documents containing the full text of messages. Also, in this context, “documents” may be documentary units of any size, e.g., paragraphs or individual images.

index, indexing. An index is any device that is (or can be) used to indicate or point to something of interest. Indexing is the creation of such indexes. Indexes are used in many fields in addition to library and information science, such as the consumer price index in economics, where the index points to the rise and fall of prices. In information retrieval, an index is used to indicate the content and features of messages, their texts and documentary units, and their location and/or the location of particular content or features within these messages, texts, and documentary units. There are many types and varieties of indexes, corresponding to types of IR databases listed in section 1.5. Indexes are produced in many different ways, both by human analysis and computer algorithmic processing.

indexable matter. Indexable matter is the actual portion of a documentary unit on which indexing or classification is based: the portion on which index terms or headings are based or from which terms are extracted. Not all indexes need to be based on the entire text of a message. Sometimes a message can be adequately summarized by a part of its text. Thus, if an index does not need to be very detailed, a good title might be sufficient to represent the message of a periodical article for purposes of indexing or classification. In that case, the title could be the indexable matter for the documentary unit (the periodical article). Abstracts of scholarly articles are a common example of indexable matter. Many indexing and abstracting services base their indexing and classification only on the abstracts of the messages that they cover. For important messages, the entire text of the message may need to be consulted, thereby making the entire text the indexable matter. Sometimes, whole categories of messages may be excluded from indexable matter (and also from documentary domain). An index for a scholarly journal, for example, may index only substantive research articles and exclude from indexable matter all advertisements, letters to the editor (unless they comment on articles that are indexed), announcements, calls for papers, etc. (“Indexable matter” is also called “analysis base,” because it constitutes the base, or basis, of analysis: the text on which analysis is based.) See also chapter 7.

indexer thesaurus. Traditionally, thesauri were designed to guide indexers, who were compelled to use the preferred term in a thesaurus for every concept. Each preferred term is generally accompanied by synonymous, equivalent, variant, narrower, broader, and other related terms. Sometimes source and scope notes are included. Although designed for indexers, indexer thesauri can be very helpful for searchers as well. See also end-user thesauri.

information. This is a slippery concept that is best avoided, except in terms like “information science” (the established name of a discipline) and “information retrieval” (the name of a primary focus in information science). The problem with “information” is that it has come to have too many meanings, and it is therefore often vague and unclear. On the one hand, it is used to refer to the process of informing or becoming informed. But more frequently, it is used to stand for data, messages, texts, and documents, whether or not these are actually informative for a person confronting them. Does it really make sense to call ancient Greek manuscripts “information” if one can’t even read them?

IR database (information retrieval database). Also called “bibliographic databases,” “document databases,” “textual databases,” “textbases.” The basic definition for the term “IR database” as used in this book is any database in any medium used for discovering and retrieving messages, texts, and documents. Thus, it includes the whole gamut of IR databases presented to users via online connections, the world-wide web, CD-ROMs, or in print on paper: indexing and abstracting services (regardless of medium), library catalogs (including OPACs, online public access catalogs, and older card catalogs), bibliographies, and indexes, including back-of-the-book indexes (which can now be presented electronically with electronic books!).

Thus IR databases have as their primary purpose the organization of data about messages, texts, and documents to facilitate their retrieval. For the most part, IR databases are not directly concerned with concrete entities or events, except as they are represented as topics of messages or features of texts. In contrast to typical concrete entity and event databases, the number and variety of concrete entities and events, and other topics, that can be represented in texts are enormous and their mutual relationships multitudinous, so that usually there is no attempt to structure their relationships in advance. Hence, IR databases are often called “unstructured,” or their data is called “unstructured information.” This is why IR databases in electronic media most often use the flat-file model.

Despite this general focus on the content and features of messages and texts, many IR databases must also deal with concrete entities and events related to the creation and transmission of messages and texts. They must describe the concrete documents in which messages and texts reside and the persons and organizations that create, manufacture, publish and send these documents.

In contrast to concrete entity and event databases, however, IR databases are just as likely to focus on abstract, fictitious or imaginary entities, attributes and events as on real concrete entities and events. Examples of abstract, fictitious or imaginary phenomena include hypotheses, theories, opinions, beliefs, aesthetics, feelings, emotions, and mythical or fictional figures, characters and events.

IR system (information retrieval system or information storage and retrieval system). IR systems are the systems that make it possible to search IR databases. They provide the search interfaces that permit users to compose searches and match them against database indexes (non-displayed indexes) or to browse indexes that are displayed for visual inspection (displayed indexes). Often the search system is so integrated into the database itself that it is inseparable. This is especially true for print-on-paper databases, such as printed indexes, catalogs, bibliographies, handbooks and encyclopedias. In electronic retrieval, however, the information retrieval system may be completely separate, so that the same IR database can be vended or made available by different vendors or agencies, each of which provides an entirely different information retrieval system, with entirely different search interfaces, different search engines, different search commands, and different display options.

inverse document frequency. Inverse document frequency (IDF) is a measure of how infrequently a term occurs in documents in a collection or IR database, hence the term “inverse” document frequency. Sometimes term frequency (TF) within documents does not help much in distinguishing one text from another within a single collection or IR database. Take librarianship, for example. The word “library” will probably occur in most if not all texts in a collection or IR database on librarianship, so the mere fact that it occurs frequently in a text doesn’t tell us very much. But comparing frequency counts in single texts with the overall occurrence for the same words in an entire collection or IR database often helps to pinpoint the more important terms. We can identify words that are unusually frequent in particular texts, that is, words that occur frequently in some texts but do not occur frequently across the entire collection. This relative frequency can be more useful in finding useful documents than simple word frequency within documents. The fewer the documents that have a term (or the lower its frequency in most texts), the higher the IDF score. The IDF score can be combined with term frequency (TF) within particular documents to help identify useful documents. See also section 8.3.5.
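
The relationship between TF and IDF described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the logarithmic form of IDF is one common variant (the glossary itself gives no formula), and the function names and the tiny three-text collection are hypothetical.

```python
import math

def inverse_document_frequency(term, documents):
    """IDF of a term over a collection given as a list of word lists.

    Assumption: IDF = log(N / n), where N is the number of documents
    and n is the number of documents containing the term.
    """
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        return 0.0  # the term never occurs, so there is nothing to score
    return math.log(len(documents) / containing)

def tf_idf(term, doc, documents):
    """Combine the term frequency within one document with IDF."""
    return doc.count(term) * inverse_document_frequency(term, documents)

# Hypothetical collection on librarianship: "library" occurs everywhere.
docs = [
    ["library", "catalog", "library"],
    ["library", "indexing", "thesaurus"],
    ["library", "budget"],
]

print(inverse_document_frequency("library", docs))    # occurs in every text
print(inverse_document_frequency("thesaurus", docs))  # occurs in one text
```

Because “library” occurs in every text of this small collection, its IDF is zero and it contributes nothing to a combined TF-IDF score, however often it appears in a single text.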

inverted file. An inverted file or inverted index is “a sequence of (key, pointer) pairs where each pointer points to a record in a database which contains the key value in some particular field. The index is sorted on the key values to allow rapid searching for a particular key value .... The index is ‘inverted’ in the sense that the key value is used to find the record rather than the other way round. For databases in which the records may be searched based on more than one field, multiple indices may be created that are sorted on those keys” (FOLDOC 2002). The “key” in this FOLDOC definition may be any descriptor, term, or keyword, including names of authors or other features, such as languages, formats, media, etc. In contrast, a direct file is generally the original file of documentary unit surrogates (records). Search terms (keys) are extracted from the records in such a direct file and rearranged for quick access and processing by computer algorithm. See also non-displayed index.
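
As a minimal sketch of the contrast between a direct file and an inverted file, the following Python fragment extracts keys from records and sorts them for lookup. The record layout (a dictionary with a "terms" field) and the function name are assumptions made for illustration only.

```python
from collections import defaultdict

def build_inverted_file(direct_file):
    """Turn a direct file (record id -> record) into key -> record ids.

    Assumption: each record is a dict whose "terms" field lists its keys.
    """
    index = defaultdict(set)
    for record_id, record in direct_file.items():
        for key in record["terms"]:
            index[key].add(record_id)
    # Sort on the key values, as in the FOLDOC definition quoted above.
    return {key: sorted(ids) for key, ids in sorted(index.items())}

# Hypothetical direct file of two surrogate records.
direct_file = {
    1: {"title": "Thesaurus construction", "terms": ["indexing", "thesauri"]},
    2: {"title": "Linking documents", "terms": ["indexing", "hypertext"]},
}

inverted = build_inverted_file(direct_file)
print(inverted["indexing"])  # pointers to every record that has this key
```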

keyword indexing. Keyword indexing is based on words (keywords) in natural language text. It is commonly the basis for electronic searching of non-displayed indexes or full texts, but it is also the basis for some popular natural language syntax for displayed indexes. KWIC (KeyWord In Context) syntax creates a heading for every keyword in a text segment (title or other statement), with the rest of the text segment used for context preceding and following the keyword in the original word order. In KWIC, the keyword (and sorting word) is in the middle of the heading. KWOC (KeyWord Out of Context) pulls the keyword out of its context to place it in its traditional place at the left of the heading. KWAC (KeyWord Alongside Context) attempts to restore some context by keeping the original word order, but placing words that appear preceding the keyword at the end of the heading. See also sections 8.3.2 and 12.2.5. Examples of KWIC, KWAC and KWOC headings are in sections following 12.2.5.
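
The three keyword syntaxes can be illustrated with a short Python sketch. The stop list, the separators (“:” for KWOC, “/” for KWAC), and the function name are assumptions chosen for this example; actual displays, such as those in the sections following 12.2.5, may be laid out differently.

```python
# Illustrative stop list; real systems use much longer ones.
STOP_LIST = {"a", "an", "and", "for", "in", "of", "the", "to"}

def keyword_headings(statement):
    """Build KWIC, KWOC, and KWAC headings for every keyword."""
    words = statement.split()
    headings = []
    for i, word in enumerate(words):
        if word.lower() in STOP_LIST:
            continue  # stop words never become access points
        before, after = words[:i], words[i + 1:]
        # KWAC: keyword first, following words next, preceding words last.
        kwac = [word] + after + (["/"] + before if before else [])
        headings.append({
            "keyword": word,
            # KWIC: keyword in the middle, context on both sides.
            "kwic": (" ".join(before), word, " ".join(after)),
            # KWOC: keyword pulled out to the left of the full statement.
            "kwoc": f"{word}: {statement}",
            "kwac": " ".join(kwac),
        })
    return headings

for heading in keyword_headings("Indexing of medical journals"):
    print(heading["kwic"], "|", heading["kwoc"], "|", heading["kwac"])
```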

knowledge. “Knowledge” refers to what someone knows. It resides in the mind and the brain, but it can be reflected in messages. “Wisdom” refers to the wise use of knowledge. It is not technically correct to say that knowledge resides in messages. According to the definitions for “text” and “document,” messages are embodied by organized sets of symbols (texts) recorded on media (documents). But these messages and texts can refer to what persons know, think, believe, feel, and understand, so it is entirely proper to say that messages, texts, and documents record, reflect, and convey the knowledge of a person, a group, or an entire culture.

KWAC index. See keyword indexing.

KWIC index. See keyword indexing.

KWOC index. See keyword indexing.

language model syntax. Throughout the history of automatic indexing based on term weighting and relevance prediction, two major theoretical models have emerged: the vector-space model and the probabilistic model. Recently, a “language model” for IR has been proposed, as a modification or simplification of the probabilistic model. The differences are subtle. Instead of attempting to predict the probability of document relevance for an IR query statement, the language model is used to predict probable query search terms. Retrieval is effected when predicted (highly probable) search terms match the actual search terms of users. Like all basic models, language model probabilities are based, for the most part, on term frequencies within documents and the inverse document frequency of terms across collections or IR databases (Ponte & Croft 1998).

latent semantic indexing. Latent semantic indexing (LSI) is one of the most sophisticated modern attempts at high quality automatic indexing. It is based on co-occurrence clustering of terms and the identification of documents associated with these term clusters. By relying on co-occurrence data, LSI is also able to deal with the problem of the variety of terms that can be used to express similar concepts. For example, both “lawyers” and “attorneys” are likely to belong to the same cluster with related terms such as “courts,” “trials,” “judges,” “sentencing,” etc. See also section 8.3.11.1.

literary warrant. Literary warrant simply means that the vocabulary of indexed or cataloged documents should be accepted as terminology for index headings, descriptors, or preferred terms in thesauri, because it is warranted (authorized) through actual usage in documents. Literary warrant is complemented by user warrant.

locator. The locator is the part of an index entry that leads the user to the documentary unit to which the index entry refers. It indicates the location of the documentary unit or the location of a representation (surrogate) of the documentary unit (such as a citation, abstract, description, or thumbnail image). The locator can be as brief as a number, representing a page or paragraph in a back-of-the-book index, or it can be long enough to include a full citation that can be used to locate a documentary unit, perhaps in a library or on the internet. See also chapter 15.

manual indexing. Indexing has never been done by the hands! Humans use their intellect, their minds, to index. See human indexing!

medium. Media (the plural of medium) are the physical substances on which or in which a text is recorded and conveyed. Ephemeral media include the airwaves over which sound (including speech) is sent and received. Information retrieval generally deals with longer lasting media, such as stone, clay, metal, paper, film, and the newer media for the recording of electronic digital data: disks, tapes and chips made from various forms of plastic, silicon and metal. An important responsibility of our profession is to make sure that the media on which messages are recorded can actually be preserved for as long as the messages have value. This is a big challenge, especially for untested newer magnetic and optical media for digital data. One hopes that these modern media, including silicon (ceramic!) chips, will be as long-lasting as their ancient relatives, the clay tablets of the Middle East and elsewhere. Even more important, so far, is that the technology required to read various media not be discarded before the texts on older media are transferred to media that can still be read.

message. IR databases are used to find and retrieve messages. A message is the content of a meaningful communication. In order to be communicated (to be sent and to be received), messages must be encoded into texts, using symbols or representations that can convey meaning to recipients of the message. And the text of a message must be recorded on a medium to create a document. A message is potential information. If a message is actually received by someone who pays some level of attention to it, that person can be said to have been informed by the message, and the message itself can qualify as information.

metadata. Metadata is data about data, or more specifically within our context, data about messages, texts, and documents. The “meta” of “metadata” comes from the Greek for “along with” or “over,” so literally, “metadata” is “along with” data or “over” data. Some experts have suggested that the term “metadata” should be reserved for data about messages, texts and documents that are embedded within the documents themselves. (Other data about messages, texts, and documents could be called bibliographic records, or something similar.) Certainly the inclusion of metadata within the document itself is the expectation for metadata that describes a digital document. But any kind of document (digital, print on paper, or of any other medium) can certainly contain its “bibliographic record” within itself, as has long been the practice with CIP (Cataloging in Publication) in printed books. This distinction, between records separate from documents or included in documents, does not seem to be very useful. Thus, metadata may not be very different in meaning or purpose from surrogates or records.

natural language syntax. Natural language syntax is syntax for displayed indexes that is applied to statements or segments of text that already exist (i.e., in natural language). Most commonly, it is applied to titles of documents. The most common natural language syntaxes are KWIC, KWAC, and KWOC, which are described in the entry for keyword indexing. Permuted syntax can also be used on natural language terms, as well as assigned terms. Ad hoc string syntax, such as NEPHIS, can also be applied to natural language text or titles. (Many non-displayed indexes consist of natural language terms, but the term “natural language syntax” usually refers to syntax used to create displayed indexes.) See also section 12.2.5.

NEPHIS. “NEPHIS” stands for “Nested Phrase Indexing System,” developed by Timothy Craven (1986). Natural language statements are coded with special symbols to identify phrases that should be lead terms (main headings) in an index and to arrange remaining terms into meaningful subheadings. See also section 12.2.2.3.

non-displayed index. A non-displayed index is one that is not displayed for direct human use. Instead it is designed to be searched by machine, mechanically in the early days and electronically in more recent decades. Only in the past century have we begun creating indexes that are used for machine matching rather than for visual inspection by the human eye. The earliest such indexes predated the computer, but they relied on early examples of the same kind of matching techniques (exact match syntax) that became nearly universal with the advent of computer-based IR systems. An example of a pre-computer non-displayed index is the set of cards used in the optical coincidence, or peek-a-boo, retrieval system that is described in section 5.1.3. Now non-displayed indexes are almost always used by computer programs. Such indexes may not even exist until a search is performed. They may be created “ad hoc” or “on the fly” for each search, or inverted files of terms may be created in advance of searches in order to speed up the machine matching process. Inverted files are created by taking all, or selected, terms from message, text, or document descriptions or from full text, and sorting them in ways that speed up machine processing. See also chapter 11.

object-oriented database. A more recent data model for databases is called “object-oriented,” as in “object-oriented databases,” related to object-oriented programming (FOLDOC 2002, “object-oriented database”). In these databases, algorithms for processing data are integrated with the data, so that data related to each object of importance have their own associated object-oriented programs.

ontology. In philosophy, ontology refers to the study of existence, or the creation of “a systematic account of existence.” From this, the field of artificial intelligence (AI) took the term “ontology” to refer to “an explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships that hold among them” (FOLDOC 2002). From the AI field, “ontology” has become a new “in” word referring to thesauri and classifications: “the hierarchical structuring of knowledge about things by subcategorising them according to their essential (or at least relevant and/or cognitive) qualities. ... This is an extension of the previous senses of ‘ontology’ (above) ...” (FOLDOC 2002). Ontologies are generally created for machine manipulation, whereas thesauri are generally designed for human use. See also section 13.3.5.

operational specificity. Elaine Svenonius (1971) redefined specificity in terms of the number of postings associated with a term or descriptor and called it “operational specificity.” This has also been the definition used in most subsequent information science research on specificity. The fewer the postings, the higher the level of operational specificity. Terms linked to few documentary units are considered to be highly specific. Terms linked with many documentary units are considered to lack specificity. Karen Sparck Jones (1972) called this “statistical specificity.” See also section 10.1.

optical coincidence IR system. See peek-a-boo IR system.

paradigmatic relationship. Also called “semantic relationship.” This is a relationship that always exists based on the definitions of terms and permanent relationships among concepts, such as the taxonomic relationships among “dogs,” “canines,” “mammals,” “vertebrates,” and “animals.” Contrast with syntagmatic relationship. See also section 12.2.3.1.

peek-a-boo IR system. “Peek-a-boo” is the nickname and the more common name for the “optical coincidence” IR system, because light peeking through pin-holes indicates the presence of a hit (one or more documents matching search criteria). This was one of the most prominent pre-computer systems for exact match syntax (boolean) searching. Cards approximately one foot square were used to represent index terms or descriptors for topics or features, including names of authors. After a document was indexed, the cards for each term assigned to the document were pulled from an alphabetical file and the document was recorded on each card by drilling a small hole to represent the document number. On each card was a grid with 100 positions along the horizontal and vertical axes, so that 10,000 unique positions were available to represent 10,000 documents. Each document was given a two part number, corresponding to the horizontal and vertical axes, so that document number 59-23 would get a hole drilled exactly 59 spaces to the right of the left margin and 23 spaces down from the top. A highly calibrated drill press was used to make these holes. See figure 5.1 for photos of a “peek-a-boo” optical coincidence IR system, with equipment manufactured by the Jonker Corporation, circa 1967. See also section 5.1.3.
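
The logic of the peek-a-boo cards can be modeled in Python as sets of hole positions: stacking cards and looking for light is a set intersection, that is, a boolean AND. This is only a sketch; the function names are hypothetical, and numbering positions from 1 to 100 on each axis is an assumption, since the entry does not say where counting starts.

```python
from collections import defaultdict

GRID_SIZE = 100  # 100 x 100 positions, so 10,000 documents per card

def hole_position(document_number):
    """Convert a two-part number such as "59-23" to (column, row)."""
    column, row = (int(part) for part in document_number.split("-"))
    # Assumption: positions are numbered 1 through 100 on each axis.
    if not (1 <= column <= GRID_SIZE and 1 <= row <= GRID_SIZE):
        raise ValueError(f"position outside the grid: {document_number}")
    return (column, row)

def drill(cards, document_number, terms):
    """Record a document on the card of every term assigned to it."""
    for term in terms:
        cards[term].add(hole_position(document_number))

def peek(cards, terms):
    """Stack the cards for the search terms; light passes only where
    every card has a hole, so the result is the intersection."""
    stacked = [cards.get(term, set()) for term in terms]
    return set.intersection(*stacked) if stacked else set()

cards = defaultdict(set)  # one "card" (a set of holes) per index term
drill(cards, "59-23", ["indexing", "thesauri"])
drill(cards, "10-07", ["indexing"])

print(peek(cards, ["indexing", "thesauri"]))  # hits for both terms together
```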

permuted syntax. Permuted syntax was developed to provide direct access to every two-word pair in an original indexing statement (a document title or a statement prepared by an indexer) or a set of index terms, regardless of whether or not these two words appear next to each other in the original statement or set of index terms. Direct access is provided to every keyword (every word not on a stop list), and each access word is linked to every other non-stop-list word that occurs in the same index statement or set of terms. Compare permuted syntax with the common syntaxes of keyword indexing: KWIC, KWOC, and KWAC. See also section 12.2.6.
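
A minimal Python sketch of this pairing follows. The stop list and the function name are assumptions for illustration; the standard library function itertools.permutations produces every ordered two-word pair, so each keyword serves as an access word linked to every other keyword in the statement.

```python
from itertools import permutations

# Illustrative stop list; real systems use much longer ones.
STOP_LIST = {"a", "an", "and", "for", "in", "of", "the", "to"}

def permuted_pairs(statement):
    """Return (access word, linked word) for every pair of keywords."""
    keywords = [w for w in statement.lower().split() if w not in STOP_LIST]
    return list(permutations(keywords, 2))

for access_word, linked_word in permuted_pairs("Thesaurus design for indexers"):
    print(access_word, "-", linked_word)
```

Note that “thesaurus” and “indexers” are paired even though they are not adjacent in the original statement.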

postcoordinate, precoordinate syntax. The terms “postcoordinate syntax” and “precoordinate syntax” are used to indicate when terms are put together to represent documentary units, either before (pre) or after (post) a search begins. All index headings that are constructed for displayed indexes, which users may browse during the searching process, must of necessity be created before the search, so they are called “precoordinate headings” based on precoordinate syntax. Postcoordinate syntax is used almost exclusively for machine matching, where searchers create search statements, putting terms together at the time of the search, then make use of computer algorithms to find matching records or texts. See also section 12.1.

postings. The term “postings” refers to the assignment or “posting” of a term to a record for a documentary unit. It is used for term-document associations regardless of whether a term was assigned by a human or by machine algorithm. The number of postings is valuable information for searchers because it indicates how many documents in an IR database will be retrieved by a particular term.

postings specificity. See operational specificity.

precision. Precision and recall are traditional measures of retrieval effectiveness. While true recall is difficult, if not impossible, to determine, precision is easy to calculate, or at least it would be easy if making relevance judgments were easy. It is based only on retrieved documents. It is the ratio of the number of relevant documents retrieved to the number of all documents retrieved, both the relevant documents and the junk:

             number of relevant documents retrieved
precision = ----------------------------------------
             number of all documents retrieved

See also section 9.1.2.

precoordinate syntax. See postcoordinate, precoordinate syntax.

probabilistic model syntax. Throughout the history of automatic indexing based on term weighting and relevance prediction, two major theoretical models have emerged: the vector-space model and the probabilistic model. Probabilistic syntax is based on the probabilistic model, in which statistical data for term frequency and distribution is used to predict the probability of relevance. See also sections 8.3 and 12.3.2.

pseudo relevance feedback. Some researchers are experimenting with relevance feedback systems that don’t require user input or evaluation. The initial search is simply modified on the basis of the most highly-ranked documents in the initial retrieval set. This technique is called “pseudo relevance feedback.”

recall. Recall refers to the extent to which an IR system, including the indexing provided, is able to retrieve everything useful within its reach in response to a search. Its formal definition is the ratio of the number of relevant documents retrieved to the number of all the relevant documents in an IR database or collection:

          number of relevant documents retrieved
recall = --------------------------------------------------------
          number of relevant documents in database or collection

The denominator of this formula, the total number of relevant documents in an IR database or collection, is impossible to determine. If it were possible, then all relevant documents would be retrieved! Consequently, researchers attempt to estimate the total number of relevant documents, with varying levels of success. See also precision; and section 9.1.1.
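
Both ratios can be worked through on a small example. The Python sketch below assumes, purely for illustration, that the full set of relevant documents is known, which (as noted above) is rarely true of a real collection; the document identifiers and function names are hypothetical.

```python
def precision(retrieved, relevant):
    """Relevant documents retrieved, divided by all documents retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0  # nothing was retrieved, so the ratio is undefined
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Relevant documents retrieved, divided by all relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0  # no relevant documents exist, so the ratio is undefined
    return len(retrieved & relevant) / len(relevant)

retrieved = {"d1", "d2", "d3", "d4"}  # what the search returned
relevant = {"d1", "d2", "d5"}         # assumed known for this example

print(precision(retrieved, relevant))  # 2 of 4 retrieved are relevant: 0.5
print(recall(retrieved, relevant))     # 2 of the 3 relevant were retrieved
```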

record. A record (or database record) contains the description of a message, the text in which it is encoded, and the documentary unit that contains the text. In some contexts, such a record is now called “metadata.” All the information or data in an IR database about a particular message, text and documentary unit goes into its record. Examples of such data include: a citation to the text and its documentary unit, including creator, title, publisher or manufacturer, format and medium; an abstract or some other description of the message content and features of the message, text, and documentary unit, sometimes including a small picture (thumbnail) of an image document or a short segment of sound; and all the content and feature terms, descriptors or headings associated with the documentary unit. The database record is usually structured or formatted according to some regular pattern. For example, many library catalogs use the MARC (Machine-Readable Cataloging) record format, developed initially by the Library of Congress and now a world-wide standard. Many IR databases create their own record format. In some database models, especially relational databases, the record is not a single unit, but is a node that contains links to all the data related to a particular message, text and documentary unit. For example, the name of a publisher may be recorded in a table of publishers and the name of an author may be in a table of authors. The particular publishers or authors linked to a particular message, text and documentary unit are called into a record or surrogate display when that display is requested. See also chapter 20.

record format. The record format defines the way that data is tagged or labeled and stored in electronic computer-readable media. Such electronic records are not generally displayed directly to end users, unless specifically requested. Rather they are used to generate the various displays that are especially designed for the end user (such as those described in chapter 16 on surrogate displays). The general principle to follow in setting up a record format is that every element of data that will be important for the implementation of any database display or search option should be separately identified. Each of these elements will have a separate field (or subfield) in the record, and each of these fields will have a name or caption, which is abbreviated or represented by some type of tag, label or notation. See also chapter 20.

reference database. Reference databases are IR databases that point to (refer to!), but do not include, the full text of the documents that they describe. Documents are represented by surrogates, such as citations, abstracts, excerpts, notes, and pictures.

relational database. One of the most common database models is called “relational,” resulting in relational databases. According to FOLDOC: The Free On-line Dictionary of Computing (2002), a relational database is one in which “the data and relations between them are organized in tables.” The name reflects a special way for organizing data and for indicating relations among data or categories of data. The name can be misleading, however, because all databases, regardless of data model, describe and display relationships to some degree, in one way or another. See also concrete entity and event database.

relevance. Judgments of relevance are used in information retrieval as an indication of the usefulness of retrieved documentary units in response to a request or a search. The common measures of retrieval effectiveness, recall and precision, are both based on a determination of relevance (see also section 9.1). Sometimes researchers try to make distinctions between “relevance,” “utility,” “pertinence,” and similar concepts, or to distinguish types of relevance, such as “topical relevance” as opposed to “user relevance” (the idea being that a document might be on the topic, and therefore topically relevant, but the user can’t use it or doesn’t want it; perhaps he or she can’t read the language or already has the document or the writing is too complex, etc.).

relevance feedback. Relevance feedback refers to methods for adjusting a search statement based on preliminary relevance judgments by the user. The usual approach is for a preliminary search to proceed using terms (and modifications such as term weights, truncation, proximity limits, etc.) provided by the user. The results of this initial search are presented to the user, along with an evaluative questionnaire in which the user can indicate preliminary relevance judgments concerning the value of the retrieved documents. These judgments are then used by the system to modify the initial search statement (e.g., adding weights to the more successful terms, decreasing weights for the less successful terms or eliminating them altogether), and a second search is performed. This interaction can continue as long as the user wishes. See also pseudo relevance feedback; and section 8.3.13.
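
One round of this adjustment might be sketched as below. The sketch is a deliberately simple assumption, not the method of any particular system: each query term gains a fixed amount of weight for every relevant unit that contains it, loses the same amount for every non-relevant unit that contains it, and is eliminated when its weight falls to zero or below. The function name, the step size, and the sample data are hypothetical.

```python
def adjust_query(weights, judged_units, step=0.5):
    """Return new term weights after one round of relevance feedback.

    weights:      dict mapping each query term to its current weight.
    judged_units: list of (set_of_terms, is_relevant) pairs from the user.
    step:         hypothetical amount added or subtracted per judgment.
    """
    new_weights = {}
    for term, weight in weights.items():
        for unit_terms, is_relevant in judged_units:
            if term in unit_terms:
                weight += step if is_relevant else -step
        if weight > 0:  # terms whose weight drops to zero are eliminated
            new_weights[term] = weight
    return new_weights


initial = {"dogs": 1.0, "fleas": 1.0, "cats": 0.5}
judgments = [
    ({"dogs", "fleas"}, True),
    ({"dogs", "cats"}, False),
    ({"fleas"}, True),
]
second_search = adjust_query(initial, judgments)
```

With these judgments "fleas" rises to 2.0, "dogs" stays at 1.0, and "cats" is dropped; the resulting weights would drive the second search.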

rotated term syntax. This is the simplest of all string syntax patterns. All terms or descriptors assigned to a documentary unit are arranged in alphanumeric order within the string, and then each term is rotated out to the lead position, one at a time, for access purposes. See also section 12.2.2.1.
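
A minimal sketch of the rotation follows. It assumes that the lead term is simply pulled to the front while the remaining terms keep their alphanumeric order; a system could instead rotate the string cyclically. The function name and the sample terms are hypothetical.

```python
def rotated_headings(terms):
    """Return one heading (a list of terms) per assigned term.

    Terms are first sorted alphanumerically, ignoring case; each term is
    then moved to the lead position in turn.
    """
    ordered = sorted(terms, key=str.lower)
    headings = []
    for i, lead in enumerate(ordered):
        rest = ordered[:i] + ordered[i + 1:]  # remaining terms, still in order
        headings.append([lead] + rest)
    return headings


for heading in rotated_headings(["fleas", "dogs", "New Jersey"]):
    print(". ".join(heading))
```

Three terms yield three headings, one with each term in lead position, so the documentary unit can be found under any of them.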

search interface. The most important aspect of all IR databases is the way in which their content and access options are presented to the user. Database presentation has earned the relatively new name "search interface." We use it in the broadest sense, not only for electronic interfaces for electronic searches, but also for browsing in both electronic and print media. All IR databases, regardless of medium, must present their content and their access options to users, so all have search interfaces. See also chapter 19.

SGML (Standard Generalized Markup Language). See text encoding schemas.

specificity. Specificity has been a rather slippery term with respect to its meaning and applications in library and information science. In this book, "specificity" refers to the degree or closeness of fit or correspondence between the meaning of an index term or descriptor and the topic or feature to which it refers in a message, text, and documentary unit. This is the "semantic term-document relational definition." See also chapter 10, especially section 10.1, where several definitions of specificity are discussed. And see also these alternative definitions: operational specificity, hierarchical specificity, and statement/heading specificity.

standards. Standards are codes of practice on which participants in an operational domain agree in order to promote interoperability, efficiency, and improved service. Since the beginning of librarianship, millennia ago, improvements in practice have come about mainly through the development of new and better standards or codes of practice. Scientific research, as a means to study and understand phenomena and thereby improve practice, is a relatively recent innovation that came into librarianship, for the most part, with the advent and popularity of information science, mostly after World War II. Whereas scientific research is based on empirical testing of hypotheses, standards and codes of practice are based on expert opinion. See also section 1.4.

statement/heading specificity. This refers to the closeness or accuracy with which a search statement or a complete index heading describes the overall content of a message, text and documentary unit. Highly specific search statements or index headings are often referred to as "co-extensive" with the scope of the message, text, and documentary unit. Specificity in the sense used in this book refers only to individual index terms or descriptors, not to strings of terms or multi-term headings or to search statements consisting of multiple terms. However, the specificity of individual terms will contribute to the overall specificity of multi-term headings and of multi-term search statements. The specificity of index headings and search statements can also be increased by adding terms. Thus "dogs -- New Jersey" can be a more specific heading than "dogs" by itself, or "New Jersey" by itself. And a search for "dogs" and "fleas" can be a more specific search statement than "dogs" by itself or "fleas" by itself, even though the term specificity of "dogs," "New Jersey," or "fleas" has not changed. See also section 10.7. The methods for combining terms in index headings and search statements are governed by rules and patterns of syntax, not by term specificity. Syntax is the topic of chapter 12.

statistical specificity. See operational specificity.

stemming. "Stemming" refers to procedures for automatically removing certain common suffixes, or word endings (and sometimes prefixes, like "re" or "re-" as in "re-indexing"), in order to increase the frequency count for important words, and also in order to find word occurrences when the word form in the text does not match the word form in the search statement. There are often sets of related words that are derived from a common root and appear in a variety of forms, depending on particular functions in a sentence or variations in meaning. Thus we have "index," "indexes," "indexer," "indexing," "indexable." We also have variants, such as "indices" as another form for the word "indexes." See also section 8.3.6 for discussion of stemming algorithms.
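
The idea can be illustrated with a toy suffix stripper. This is only a sketch under stated assumptions: the suffix list and the minimum stem length are hypothetical, and the code is not one of the stemming algorithms discussed in section 8.3.6.

```python
# Hypothetical suffix list, checked in this order (longest first).
SUFFIXES = ("able", "ing", "es", "er", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


forms = ["index", "indexes", "indexer", "indexing", "indexable", "indices"]
stems = {form: naive_stem(form) for form in forms}
```

The first five forms all reduce to "index," so their occurrences can be counted and matched together. The variant "indices" reduces to "indic" and is not conflated with them, which shows why simple suffix removal cannot handle every related form.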

stop list. A stop list is a list of insignificant words, designed to eliminate indexing of and retrieval by words like "an" and "the." Eliminating stop words can reduce the size of the index significantly, and speed up processing. Francis, Kučera and Mackie (1982) suggest that the ten most frequently used words in English can account for twenty to thirty percent of the words in a text. See also section 8.3.3.
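
Applying a stop list can be sketched in a few lines. The stop words below are a small hypothetical sample rather than a recommended list, and the function name is our own.

```python
# A tiny, hypothetical stop list; real lists are considerably longer.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def index_words(text):
    """Return the words of a text that survive the stop list, in order."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


kept = index_words("The design of an index")  # ['design', 'index']
```

In this example three of the five words are stop words, so the index needs entries for only two.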

string indexing. See string syntax.

string syntax. String syntax is the modern version of subject heading syntax, inspired by the desire to take advantage of computer technology for the creation of headings. Because instructions for the combination of terms into headings are programmed for the computer, string syntax tends to be much more regular than the idiosyncratic variety exhibited by subject heading syntax. The name "string syntax" or "string indexing" comes from the custom of displaying headings as strings of terms, that is, terms strung together in various configurations. The variety of string syntax approaches is mostly related to how terms are arranged in these strings. See also section 12.2.2.

subject cataloging, subject indexing. Whereas descriptive cataloging and descriptive indexing focus on the surface features of texts and documents, subject cataloging and subject indexing focus on analysis, description and indexing of the content, purpose or meaning of messages, in other words, the topics or subjects of messages and texts. The description of certain non-topical features of messages, texts and documents is frequently included in subject cataloging and indexing as well. Examples include special audiences (books for children), special formats (poetry, fiction, dictionaries, periodicals, statistics), special aspects or approaches (history, case studies), special media (film, video recordings, audio recordings, world-wide web), etc. The goal is to identify and provide access to all important topics and features. The challenge, of course, is figuring out what is, or will be, important for future users!

subject domain. The subject domain of an IR database sets the subject scope into the context of the work or life situation (the domain) in which users will be operating and seeking messages. Typical domains include the various scholarly disciplines, the professions, industries, business, occupations and trades, but also every other sphere of human life and activity, such as sports and recreation, hobbies, religion, entertainment, travel, relationships, child rearing, and home management. Subject domains also include cultural domains, often characterized by such human attributes as economic level, living environment, religious and ethnic heritage, gender, sexual orientation, and age. Subject domain analysis will differentiate between interests and needs in the same subject area for users operating in different subject domains, such as persons seeking novels for entertainment versus literary scholars; week-end soccer players versus sociologists of sport or students of sports medicine; urban high-income African American gay men seeking health information versus low-income, rural, white migrant worker pregnant women. In medicine, for example, researchers, health care practitioners, patients, the general adult public, and children all occupy different subject domains. See also chapter 2.

subject heading syntax. Subject headings are the most widely used type of pre-coordinate syntax headings in indexes and catalogs. They were developed in the 19th century to provide predictable, uniform and direct alphabetical access to topics in library catalogs, indexes, and bibliographies. Subject heading syntax consists of main headings modified by subheadings or subdivisions representing related topics, places, times, or formats and forms of treatment. There are no over-arching syntactic rules. Instead, every heading and every subheading tends to have its own rules. In the United States, the two most widely used subject heading systems are Sears list of subject headings (Sears 1997), for smaller libraries, and Library of Congress subject headings (Library of Congress 2003), for larger libraries. Specialized lists of subject headings have been developed for many subject areas, such as MeSH: Medical subject headings (National Library of Medicine 1999a). See also section 12.2.1.

subject indexing. See subject cataloging.

subject scope. The subject scope of an IR database describes the kinds of questions or desires that an IR database can respond to. Generally this can be done by specifying anywhere from ten to thirty categories of topics that the IR database addresses. When IR databases are presented to users electronically, an ideal number of key subject scope categories is between ten and fifteen, because this is the number of topics that can be clearly displayed on an opening electronic search interface, where an overall view of the IR database should be presented to potential users. The analysis and definition of a subject scope can often begin with generic categories or facets of topics that pertain to all subject fields. These are categories like:

● entities or things (persons, artifacts, natural objects, animals and plants, institutions, and other abstract entities, etc.)

● attributes or constituent materials

● actions (operations, processes, and events)

● places

● times

Specialized IR databases will have much more specific or narrower categories or facets.

surrogate/surrogation. A surrogate is a representative. A message/text/document surrogate stands in the place of the original full text containing the message of interest. A message/text/document surrogate represents only certain key aspects of a message, text, and documentary unit. Its nature and content will depend on the nature and content of the full documentary unit and the needs of users. Typical components are citations; index terms, headings, or descriptors; abstracts; and thumbnail illustrations. See also record; and chapter 14.

surrogate display. Surrogates are almost always displayed in stages in both print and electronic IR databases. The purpose is to provide to users only what is useful at a particular stage of a search. Surrogate display deals with what portions of surrogates to display and how best to order the elements of a surrogate display in various situations. See also chapter 16.

syndetic structure. Syndetic structure consists of cross-reference links between descriptors or headings in an indexing system. "Syndetic" comes from the Greek words "syn" for "together" and "dein" for "to bind or tie." Thus, the syndetic structure ties or binds the individual descriptors or headings into a complete and connected access system. Syndetic structure results from vocabulary control and management, whereby cross references are created linking synonymous, equivalent, broader, narrower, or other related descriptors or headings. See also chapter 13.

syntactic cross-references. In some types of index heading syntax, such as subject headings, a heading is entered only under a main heading so that there is no direct access through the subheadings in displayed indexes. A syntactic cross reference provides indirect access under such subheadings. For example, the subject heading "United States -- History -- Civil War, 1861-1865 -- Bibliography" is not also entered under "civil war," "bibliography," or "history." Therefore, cross references are needed from these terms to the established heading:

     History
             =====================================================
             | See also names of countries, regions, cities,     |
             | other places and topics followed by the           |
             | subdivision "-- history," e.g., United States --  |
             | History; Piano -- History.                        |
             =====================================================

     Civil war
             =====================================================
             | See also names of countries followed by the       |
             | subdivisions "History -- Civil war," e.g.,        |
             | Spain -- History -- Civil war; United States      |
             | -- History -- Civil war                           |
             =====================================================

See also section 12.2.8.

syntagmatic relationships. Syntagmatic relationships are non-permanent relationships that exist in life situations and in messages and texts that describe them, such as a relationship between "dogs" and "breeding" or between "cats" and "care" or "feeding" in the "United States" or "Brazil." These are the kinds of relationships that index headings should express through the use of syntax. Contrast with permanent paradigmatic relationships.

syntax. "Syntax" is a linguistic term meaning (1) "orderly or systematic arrangement," or more precisely, (2) "the arrangement of words as elements in a sentence to show their relationship; sentence structure" (Webster's 1966, p. 1480). It comes from the Greek for putting or arranging together. The first meaning is labeled "obsolete," but it is closer to the meaning intended here in borrowing "syntax" from linguistics and applying it to index headings and search statements. "Syntax" is used in this book to mean rules or patterns for the combination of terms to form meaningful index headings or effective search statements. Index headings consist of terms arranged in a certain order, and they may display a certain structure as well, so the application of the idea of "syntax" seems appropriate. In modern search statements for electronic IR databases, the order or particular arrangement of terms is often immaterial, but by extension, the idea of "syntax" is used to refer to the rules or patterns for the combination (as opposed to the arrangement) of terms (for example the use of Boolean operators OR, AND, or NOT between terms), and also for the application of techniques for indicating term weights, proximity limits, and truncation, and for stemming and similar refinements to influence the results of a search. Here the analogy corresponds to the grammatical use of inflections (word endings or changes in form) to indicate the role of words in a sentence with respect to number (singular or plural), case (subject, object, possessive), gender (male or female) or tense (past, present, future). In short, indexing or searching syntax is used to refer to the rules or patterns for creating index headings or search statements! See also chapter 12.

taxonomy. The term "taxonomy" comes from the Greek for arrangement or division ("taxis") and law ("nomos"). Thus it refers to rules of division and arrangement. Such rules can be much more uniform and strict within a facet, so the traditional use of this term for classification within a single facet is still appropriate. Thus, the rules for zoological taxonomic classification are often rather narrowly focused on physical characteristics (e.g., backbones or not, resulting in vertebrates versus invertebrates) and ancestry/evolution.

TEI (Text Encoding Initiative). See text encoding schemas.

term. A term is a word or a phrase representing a single concept or multiple concepts that are tightly bound together in the context of a particular IR database. An "index term" is such a word or phrase associated with a documentary unit for the purposes of retrieval. Some concepts need more than one word to express them, for example, "information science" or "venetian blind." Some terms could be divided into two separate terms, but they are used so commonly together in a consistent order that they are considered a single bound term or compound term. Terms subjected to vocabulary control and management are often called descriptors. Terms are combined to form index headings or search statements.

text. Messages are encoded in texts. Texts are meaningful collections of symbols assembled to convey a message. The word "text" is related to the word "textile," and just as textiles consist of organized fibers or threads, a text consists of an organized set of symbols. Spoken language (speech) texts consist of meaningful sequences of sounds (phonemes). Writing is the representation of speech, and it uses visual symbols to represent the sounds of speech (phonemes). Examples of writing symbols include Chinese characters (which represent not only sounds but also meanings), Japanese kana syllabaries (in which each symbol represents a syllable, that is, a combination of a vowel and a consonant), and Roman, Cyrillic, Greek and other alphabets (in which each symbol represents one or more phonemes, with separate symbols for vowels and consonants). In addition to language texts, there are many other kinds of texts that convey messages: musical texts, image texts (such as those embodied in still or moving pictures), three-dimensional texts (such as those created through architecture, sculpture, and industrial design), dance and other performance texts, and mathematical and chemical texts. Music has a well developed symbol system that is used to represent sound, including pitch and length, in visual media such as print on paper. We call these texts "scores." Dance choreography can also be represented symbolically, and scientific disciplines like mathematics and chemistry have well developed sets of symbols and codes for representing mathematical and chemical concepts. The symbol systems of art, as in painting and sculpture, are less formal, but most people would agree that paintings and sculpture do convey messages, even if it is not always easy to discern them or to agree on what they are. In painting, the field of iconography is devoted to the study and identification of artistic messages.

text encoding schema. Many schemas have been developed for encoding texts of various formats for electronic digital presentation, analysis and retrieval. These include SGML (Standard Generalized Markup Language), HTML (HyperText Markup Language), XML (eXtensible Markup Language), and TEI (Text Encoding Initiative). See also section 21.2.

textual database, textbase. See IR database.

thesaurus. The term "thesaurus" is based on the Greek word for "treasure." The term was adopted by Peter Mark Roget (1779-1869), the compiler of the first modern classified "treasury" of words designed to bring together terms with similar meanings as an aid for writers. It is somewhat ironic that the main objective of Roget's thesaurus (and its modern successors) is almost exactly the opposite of that of the modern information retrieval thesaurus. While Roget's thesaurus helps writers identify the best term for their particular purpose (an objective that both types of thesauri share!), its main purpose is often seen as encouraging and facilitating variety in expression, something prized in many contexts. The information retrieval thesaurus aims to control or compensate for such variety, that is, to bring together the many terms that might be used to describe essentially the same, or a closely related, topic, to facilitate searching. The typical thesaurus consists of records for terms representing concepts, with links for synonymous, equivalent, narrower, broader, and other related terms. See also end-user thesaurus; indexing thesaurus; and chapter 13.

unit of analysis. See documentary unit.

up-posting. Up-posting is using both a specific term and a broader, more generic term to represent a topic or feature in a message, text or documentary unit (e.g., using both "sofas" and "furniture" for a message about sofas). See also generic posting; specificity; and section 10.1.
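
As a small illustration, up-posting can be sketched as a lookup of the broader term. The mapping below is a hypothetical fragment standing in for the broader-term links of a thesaurus, and the sketch posts only one level up.

```python
# Hypothetical broader-term links, as might be taken from a thesaurus.
BROADER_TERM = {"sofas": "furniture", "chairs": "furniture"}


def up_post(term, broader=BROADER_TERM):
    """Return the specific term plus its broader term, if one is recorded."""
    postings = [term]
    if term in broader:
        postings.append(broader[term])
    return postings


postings = up_post("sofas")  # ['sofas', 'furniture']
```

A message about sofas is then retrievable under both "sofas" and "furniture," while a term with no recorded broader term is posted only under itself.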

user warrant. User warrant means that the vocabulary of users or potential users should be accepted as terminology for index headings, descriptors, or preferred terms in thesauri, because it is warranted (authorized) through actual usage by users. User warrant is complemented by literary warrant.

vector space model syntax. Throughout the history of automatic indexing based on term weighting and relevance prediction, two major theoretical models have emerged: the vector-space model and the probabilistic model. Vector space syntax is based on the former, in which statistical data for term frequency and distribution is used to create vectors (like arrows) in multi-dimensional space, the length of the vector representing the importance of the term. Similar vectors are created for search statements, and the combined vectors for documentary units that most closely match the combined vectors for a search statement indicate the most promising documentary units, which are retrieved in rank order, based on the degree of vector similarity. See also sections 8.3 and 12.3.2.
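
The ranking step can be sketched as follows. The sketch makes several assumptions that are not taken from this book: raw term frequencies serve as the term weights, the cosine of the angle between vectors serves as the similarity measure, and units with no matching terms are not retrieved. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-weight vectors (dicts)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(search_statement, units):
    """Return unit identifiers in rank order of similarity to the search."""
    query = Counter(search_statement.lower().split())
    scored = []
    for unit_id, text in units.items():
        score = cosine(query, Counter(text.lower().split()))
        if score > 0:  # units sharing no term with the search are dropped
            scored.append((score, unit_id))
    # Highest similarity first; ties are broken by identifier.
    return [unit_id for score, unit_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


units = {"d1": "dogs fleas dogs", "d2": "cats care", "d3": "fleas"}
ranking = rank("dogs fleas", units)  # ['d1', 'd3']
```

Unit d1 matches both search terms and ranks first, d3 matches one term and ranks second, and d2 matches none and is not retrieved.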

vocabulary control/vocabulary management. Vocabulary control and management are the efforts to deal with the enormous variability of human language and the unpredictability of how a particular concept might be named in a particular IR database. Solutions currently in use, or suggested, to assist searchers with vocabulary problems include:

1. Syndetic structure (cross references) for equivalent, narrower, broader, and other related terms integrated into browsable alphanumeric displayed indexes.

2. Indexing thesauri designed to guide the assignment of terms by indexers. Such thesauri can guide searchers as well.

3. End-user thesauri, designed for searchers rather than indexers. Instead of aiming to control the terminology used by indexers, the purpose of an end-user or searching thesaurus is to help searchers find useful terminology for searches, often for searches across multiple IR databases.

4. Co-occurrence term clustering. Here computer programs are used to compile lists of terms that occur together most frequently in various contexts. The most frequently co-occurring terms are likely to include terms closely related to the term with which a searcher begins, from which the searcher can select likely terms to improve a search.

5. Ontologies. Ontologies for IR attempt (or claim) to raise the level of more traditional thesauri to the realms of virtual reality (ontology is the study of being or existence and came into IR from artificial intelligence!). Ontologies are generally created for machine manipulation, whereas thesauri are generally designed for human use.

See also chapter 13.
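
The co-occurrence approach in item 4 can be sketched as a simple count. The sketch assumes that "occurring together" means being assigned to the same documentary unit; the function name and the sample term sets are hypothetical.

```python
from collections import Counter


def cooccurring_terms(start, units, top=3):
    """Return up to `top` terms that most often share a unit with `start`.

    units: iterable of term collections, one per documentary unit.
    """
    counts = Counter()
    for terms in units:
        terms = set(terms)
        if start in terms:
            counts.update(terms - {start})
    # Most frequent first; ties are broken alphabetically.
    ordered = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, count in ordered[:top]]


units = [
    {"dogs", "fleas", "care"},
    {"dogs", "fleas"},
    {"dogs", "breeding"},
    {"cats", "care"},
]
suggestions = cooccurring_terms("dogs", units)  # ['fleas', 'breeding', 'care']
```

A searcher who begins with "dogs" would be shown "fleas" first, because it shares two units with the starting term, and could select from the list to improve the search.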

weighted term syntax. See best match syntax.

XML (eXtensible Markup Language). See text encoding schemas.



perez-carballo@acm.org Last modified: Tue Jun 6 18:02:09 CDT 2006
