Information Retrieval Design, a book by James D. Anderson and Jose Perez-Carballo



Part I: Chapter 1. Introduction and Background Issues

Contents of Chapter 1

1.1. Purpose.
1.2. Assumptions.
1.3. Terminology.
1.4. Standards and Codes of Practice.
1.5. Types of IR Databases.
1.5.1. Kinds of Objects Represented in Index Terms, Headings, and Entries.
1.5.2. Kinds of Index Terms Used.
1.5.3. Kinds of Indexable Matter Used.
1.5.4. Presentation and Methods for Searching.
1.5.5. Arrangement of Index Entries.
1.5.6. Methods for Analysis.
1.5.7. Methods for Term Selection in Indexing.
1.5.8. Methods for Term Combination in Searching.
1.5.9. Kinds of Documents Being Indexed.
1.5.10. Media of IR Databases.
1.5.11. Proximity of Documents Being Indexed.
1.5.12. Size of Documentary Units.
1.5.13. Periodicity of IR Databases.
1.5.14. Authorship of IR Databases.
1.5.15. Continuing Examples.
1.6. IR Databases Versus Other Types of Databases: A Recap.
1.6.1. Two Types of Databases.
1.6.2. IR Databases.

1.1. Purpose.

history of indexing, cataloging, librarianship, IR databases : 1

Ever since humankind learned how to record messages on portable long-lasting media (clay tablets, papyrus, much later paper, and more recently various electronic media), we have devised ways to describe and organize these messages so that they could be found, used, and enjoyed later on. This practice has evolved over the millennia into the ancient and honorable profession of librarianship and related specializations such as cataloging and indexing. In the twentieth century, this basic human need to analyze and organize messages for later retrieval has become the main preoccupation of information science, under the rubric of "information retrieval."

Readers interested in the history of indexing and information retrieval databases may want to consult the following book and articles:

Katz, Bill. Cuneiform to computers: a history of reference sources. Lanham, MD: Scarecrow Press; 1998. xvi, 417 p. (History of the book series; no. 4). ISBN: 0-8108-3290-9.

Metcalfe, John Wallace. Information retrieval, British & American, 1876-1976. Metuchen, N.J.: Scarecrow Press; 1976. v, 243 p. ISBN: 0-8108-0875-7.

Taylor, Arlene G. "Development of the organization of recorded information in western civilization." In: The organization of information. Englewood, CO: Libraries Unlimited; 1999: 37-55. xx, 280 p. ISBN 1-56308-498-9.

Wellisch, Hans H. "Early multilingual and multiscript indexes in herbals." The indexer. 11: 81-102; 1978 Oct.

Wellisch, Hans H. "How to make an index, 16th century style: Conrad Gessner on indexes and catalogs." International classification. 8: 10-15; 1981.

Wellisch, Hans H. "The oldest printed indexes." The indexer. 15: 73-82; 1986 Oct.

For more historical references, see:

Wellisch, Hans H. Indexing and abstracting: an international bibliography. Santa Barbara, CA: ABC-Clio, c1980. xxi, 308 p. ISBN 0-87436-300-4.

Wellisch, Hans H. Indexing and abstracting, 1977-1981: an international bibliography. Santa Barbara, CA: ABC-Clio Information Services, 1984. xix, 276 p. ISBN 0-87436-398-5.

role of information retrieval in support of civilization : 2

It can certainly be said that civilization is based on the cumulated knowledge that current and former generations have developed, organized, and stored for use by current and future generations. We are long past the time when any educated person could learn and know all of the world's knowledge — if there ever was such a time. Thus, the knowledge base or foundation of civilization must be organized, described, and made accessible through libraries and other systems of information retrieval. It is no exaggeration to say that the preservation and advancement of civilization is absolutely dependent on effective information retrieval.

origin of "database" as term : 3

The term "database" emerged in the early 1960s among workers involved with military information systems. The term referred to collections of data available to users of computer systems (Oxford English dictionary 1989). The idea implied is that these collections were bases (plural of basis) of data on which decisions might be made, that is, on which decisions could be based. As databases have developed since then, they have been organized so that data can be accessed in a wide variety of ways.

libraries compared to databases : 4

Of course libraries have been collecting data, in the form of documents, for millennia and have been describing these documents and organizing them for access. This vital function has been shared with a growing number of indexing and abstracting services and a multitude of separately produced indexes, bibliographies, catalogs, and compendia of data of various types, especially in the 19th and 20th centuries. Recently, digital libraries have been added to this rich mix. All of these systems for information retrieval can now be encompassed by the broad modern term "information retrieval database."

data versus information, knowledge; varieties of messages : 5

Information retrieval (IR) databases focus on the retrieval of messages more than simple or raw data. Messages are recorded in documents of many varieties in many media and formats. They make use of visual written language and pictorial images, as well as visual texts based on other representation codes (such as musical, choreographic, chemical, and mathematical notation), aural spoken language and musical performances, and even tactile texts for the visually impaired. These messages reflect the knowledge and understandings of the persons, across generations and cultures, that created them. The lines between "data," "information," and "knowledge" are fuzzy. These terms will be discussed in section 1.3 on terminology. Data, information, and knowledge can all be encompassed, represented, or reflected in the messages that humans create. Information retrieval databases are designed to describe and organize such messages so that anyone can find messages they need or desire whenever they want them.

design of IR databases : 6

Thus this book is about the design of databases that will help retrieve messages. Its purpose is to help the designer consider all the relevant factors and to choose the best available options. In most cases, there are no single correct or right answers, only better and worse choices for given purposes and persons. What this book opposes is simply accepting designs and procedures without considering alternative possibilities and matching them with the needs, desires, preferences, and resources of the persons who will use IR databases, or be served by them.

1.2. Assumptions.

components of problems in information retrieval : 7

This book assumes that the IR database designer is confronted with an information retrieval problem and that the designer knows quite a bit about this problem. The components of this problem usually consist of:

a. a large enough set of messages so that it is impossible to easily examine all of them, when certain ones or certain types are desired.
b. a group of actual or potential users or clientele who need access to these messages for purposes of business, life-enhancement, entertainment, or similar impelling reasons.

characteristics of messages, of users : 8

The IR database designer needs to know quite a bit about these messages: what they are about, what is significant about them, and their format, media, and location. More important, the designer needs to know even more about the potential users or clientele of the IR database: their interests, needs, and information-seeking experience, skills, habits, and preferences.

user studies and needs assessment not within scope of this book : 9

This book will not discuss and is not meant to guide the conduct of user studies or user needs assessment. For this, the reader might consult Information tasks: toward a user-centered approach to information systems, by Bryce L. Allen (1996), and also some of the many publications he cites in his book. The last chapter in this book, chapter 22, discusses literature on methods for testing and evaluating IR databases, including their users. Many of the methodologies discussed there may also be used for initial user needs assessment.

expert users versus novice users; new users versus frequent users : 10

The discussion of design options throughout this book will assume a wide and varied potential audience for even the most technical or esoteric of IR databases. The reason for this assumption is that we know from decades of information-seeking behavior research that experts rarely consult IR databases for messages falling within their own areas of expertise. For their own literature, they usually have much more direct sources of information, such as colleagues with whom they interact in their research or professional activities and presentations or discussions at conferences and other gatherings. When such experts do consult IR databases, it is usually for information about messages that fall outside their immediate area of expertise, so now they may be playing the role of a new or even a novice user, and perhaps an exploring user. So in all designs, we must welcome new, novice and exploring users of all types. We can add options for frequent and expert users, such as librarians, to enable them to bypass features meant for new and exploring users, but only as special options. Our goal will be to make everyone welcome, and make our IR databases as clear and as easy to use as possible for everyone.

1.3. Terminology.

terminology of IR database design : 11

It is hard to enter a new field without first learning some of its basic terminology — the special vocabulary used by practitioners and researchers, and the special meanings often given to terms in the context of the field. Here are definitions and discussions of the most important terms related to the design of IR databases. The ways these terms are used also help to lay out some of the assumptions underlying this book. For additional help with terminology, see Hans Wellisch (2000), Glossary of terminology in abstracting, classification, indexing, and thesaurus construction.

12

Here the major terms are arranged in a logical order, so that whenever possible, definitions that depend on the meaning of other key terms will come later in the list. In some cases, however, you may want to skip ahead in this list of terms. For example, as you read about IR databases, you may want to check the discussions of "message," "text," and "document"!

13

All of these terms are also defined in the glossary at the end of the book, where they are arranged in alphabetical order. So later on, if you are looking for a particular definition, it may be easier for you to look there than here.

definition of database : 14

database. "Database" is a relatively new word for a collection of data that is organized for retrieval. It is sometimes restricted to organized collections of data in electronic media, but in this book, the term "database" is used for any collection of data organized for retrieval, regardless of medium, so that printed indexes, catalogs, encyclopedias, and similar reference works constitute examples of databases as well as electronic retrieval tools on CD-ROM or available online or via the world-wide web. (There is a brief note on the origin of this term in section 1.1 of the Introduction to this book and a recap on IR databases versus other types of databases in section 1.6.)

types of databases : 15

Databases (along with the systems for access that accompany those in electronic form) can be categorized in many ways: by mission or purpose (such as MIS: management information systems), by subject areas (such as GIS: geographical information systems), by models of organization (such as relational, hypertext, object-oriented, flat-file), or by phenomena represented by data (such as real, concrete entities (things, objects!) and events versus messages about entities and events, including abstract and imaginary entities and events). This book focuses on databases designed for the purpose of facilitating discovery and retrieval of messages of all types, so our databases are called "information retrieval databases" or, for short, "IR databases." Their purpose is information retrieval. The primary data in such databases describe messages rather than concrete entities and events.

relational database model : 16

relational database. One of the most common database models is called "relational," resulting in "relational databases." According to FOLDOC: The Free On-line Dictionary of Computing (1997), a relational database is one in which "the data and relations between them are organized in tables." The name reflects a special way for organizing data and for indicating relations among data or categories of data. The name can be misleading, however, because all databases, regardless of data model, record or display relationships to some degree, in one way or another.
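
The idea of organizing data and their relations in tables can be illustrated with a minimal sketch. It assumes Python's built-in sqlite3 module; the table names, column names, and key values are hypothetical, and the titles are borrowed from the reading list in section 1.1. The relation between an author and a document is recorded as a shared key, and a join brings the two tables together.

```python
import sqlite3

# Hypothetical schema: two tables, with the relation between them
# recorded as a shared key (author_id).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE documents (
        doc_id INTEGER PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES authors(author_id)
    );
""")
conn.executemany(
    "INSERT INTO authors VALUES (?, ?)",
    [(1, "Wellisch, Hans H."), (2, "Katz, Bill")],
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        (10, "The oldest printed indexes", 1),
        (11, "Cuneiform to computers", 2),
        (12, "Early multilingual and multiscript indexes in herbals", 1),
    ],
)


def titles_by(author_name):
    """Return the titles for one author by joining the two tables."""
    rows = conn.execute(
        "SELECT d.title FROM documents d "
        "JOIN authors a ON a.author_id = d.author_id "
        "WHERE a.name = ? ORDER BY d.doc_id",
        (author_name,),
    )
    return [title for (title,) in rows]
```

Calling titles_by("Wellisch, Hans H.") returns the two Wellisch titles. Nothing in the documents table names the author directly; the relation is expressed only through the shared key.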

object-oriented database model : 17

object-oriented database. A more recent data model is called "object-oriented," as in "object-oriented databases," related to object-oriented programming (FOLDOC 1997, "object-oriented database"). In these databases, algorithms for processing data are integrated with the data, so that data related to each object of importance have their own associated object-oriented programs.

flat file database model : 18

flat-file database. In contrast to the rather sophisticated and highly structured relational and object-oriented models, a simple data model calls for nothing more than "a single file containing many records, each of which contains the same set of fields" (FOLDOC 1997, "database"). This simple model, sometimes called a "flat file" design, is quite common for IR databases.
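
As a rough illustration of the flat-file model, the sketch below treats a small comma-separated file as a database: every record has the same three fields, and searching is a simple scan of all records. The field names and the helper functions load_records and search are assumptions made for this example; the records are taken from the reading list in section 1.1.

```python
import csv
import io

# Hypothetical flat file: one record per line, the same fields in every record.
FLAT_FILE = """author,title,year
"Katz, Bill",Cuneiform to computers,1998
"Metcalfe, John Wallace","Information retrieval, British & American, 1876-1976",1976
"Wellisch, Hans H.",The oldest printed indexes,1986
"""


def load_records(text):
    """Parse the file into a list of dicts, one dict per record."""
    return list(csv.DictReader(io.StringIO(text)))


def search(records, field, word):
    """Linear scan: keep records whose field contains the word (case-insensitive)."""
    return [r for r in records if word.lower() in r[field].lower()]


records = load_records(FLAT_FILE)
```

Here search(records, "title", "index") finds only the Wellisch record. There are no separate tables and no links between records, which is what makes the design "flat."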

hypertext database model : 19

hypertext database. With the advent of the world-wide web, hypertext databases have become more and more common. According to FOLDOC (1997, "hypertext"), "hypertext" refers to a "collection of documents (or 'nodes') containing cross-references or 'links' which, with the aid of an interactive browser program, allow the reader to move easily from one document to another." In a hypertext IR database, some of these documents may be summary records or surrogates, which can lead the user to documents containing the full text of messages.
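
The relation between a surrogate and a full-text document in a hypertext IR database might be sketched as follows. The node identifiers, the dictionary layout, and the follow function are all hypothetical; a real hypertext system would use a browser and stored documents rather than an in-memory dictionary.

```python
# Hypothetical nodes: a surrogate (summary record) that links to a full text.
nodes = {
    "surrogate:herbals": {
        "text": "Wellisch, Hans H. Early multilingual and multiscript indexes in herbals.",
        "links": ["fulltext:herbals"],
    },
    "fulltext:herbals": {
        "text": "(the full text of the article would be stored here)",
        "links": [],
    },
}


def follow(node_id, link_number=0):
    """Move from one node to the node that its chosen link points to."""
    target = nodes[node_id]["links"][link_number]
    return target, nodes[target]["text"]
```

Starting from the surrogate, follow("surrogate:herbals") leads the user to the node holding the full text.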

definition of concrete entity and event database : 20

concrete entity and event database. Databases can also be characterized by the nature of the objects or phenomena that they are designed to describe. Concrete entity and event databases organize data about real concrete entities and events. Examples include airline databases that contain data about airplanes and all their parts, their maintenance, their crews, particular flights, fares, supplies, passengers, including which passengers get special meals, etc.; or bank databases that contain data about all customers, all their accounts, their balances and every banking transaction. The focus of these databases is on concrete entities and concrete events. In contrast, IR databases are designed to describe messages. These messages, of course, may be about concrete entities and events, but just as often, they can be about rather abstract or ephemeral phenomena, such as theories, feelings, emotions, and aesthetics.

21

To give one more example of the difference between a concrete entity and event database and an IR database, consider an online or print catalog for a large retail chain like JCPenney or a mail-order house like L.L.Bean. These catalogs are databases that directly describe and organize descriptions of the concrete entities or objects that JCPenney or L.L.Bean would like to sell. On the other hand, an IR database dealing with these products would not directly describe and organize the descriptions of these products, but would describe messages about these products (perhaps the kind of messages one might find in Consumer Reports magazine). Thus the focus of concrete entity and event databases is directly on the entities and events that they describe and organize. The focus of IR databases is on messages, their various features and attributes, and what they have to say about concrete, as well as abstract, entities and events.

definition of IR database : 22

IR database. Thus, IR databases have as their primary purpose the organization of data about potential information contained in messages, texts and documents in order to facilitate the retrieval of these information containers. For the most part, IR databases are not directly concerned with concrete entities or events, except as concrete entities and events are represented as topics of messages or features of messages.

indexing of concrete entities and events compared to indexing of messages : 23

Making a clear distinction between concrete entity and event databases versus IR databases (message databases!) is important because the description and indexing of concrete entities and events are so different from the description and indexing of messages. When one has a carburetor in one's hand, there is little dispute what the object is. There may be questions about what to call it, what type it is, what kind of engine it is designed for, but there is usually a clear consensus on what it is.

indexing of messages : 24

Messages are entirely different. What a message is, what it is about, what it means, what it is for (or good for) is entirely in the mind of the beholder. Two or more people may see entirely different messages in the very same text. Arguments about what famous texts mean (the Bible for example) have gone on for centuries! This is precisely why the general level of agreement among indexers describing the same text ranges between only 20 and 25 per cent. Searchers, who in effect index information needs or queries, have about the same level of agreement as to appropriate search terms.
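
To make the notion of agreement concrete, here is a minimal sketch of one simple overlap measure: the number of terms two indexers share, divided by the number of distinct terms either of them used. This particular measure is an assumption for illustration (the text above does not say how its figures were computed), and both term lists are invented.

```python
def consistency(terms_a, terms_b):
    """Shared terms divided by all distinct terms used by either indexer."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # two empty term sets agree trivially
    return len(a & b) / len(a | b)


# Invented term assignments by two indexers describing the same text.
indexer_1 = ["indexes", "herbals", "history", "botany", "multilingual texts"]
indexer_2 = ["herbals", "botany", "printing", "scripts", "book indexes"]
```

The two indexers share two terms out of eight distinct terms, so consistency(indexer_1, indexer_2) is 0.25, even though each indexer chose five perfectly reasonable terms.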

structured data : 25

In the typical concrete entity and event database, the discrete entities and events to be considered are generally known in advance. Their relations are often described and structured in advance (using, for example, a relational database model). Because of the careful structuring of such data, these types of data are sometimes referred to as "structured data."

unstructured data : 26

In contrast, a typical IR database generally will deal with a defined set of messages, but the topics in these messages will be too numerous and varied to predict in advance. These topics may well relate to concrete entities and events, but they may just as frequently deal with abstract or imaginary entities, and abstract or imaginary processes, operations, or conditions. In addition, the number of possible relations among the topics covered in messages is also enormous and practically impossible to predict in advance. Hence, IR databases are often called "unstructured," or their data can be called "unstructured information." Indeed, in many IR databases, certain fields consist of unprocessed text, such as an abstract, or even the full text of a document. Such full-text fields contain the ultimate in unstructured data! Of course such full-text fields do retain the structure of the original text, but the data represented are not restructured according to any rules of database structure or data models.
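
The practical difference between structured and unstructured fields can be sketched with a single record. Everything in this record is hypothetical, as are the two helper functions. The year is a typed value that can be compared directly, while the abstract is free text that can only be scanned for words.

```python
# Hypothetical record: two structured fields and one unstructured text field.
record = {
    "year": 1999,                      # structured: a typed value with a known meaning
    "journal": "Journal of Examples",  # structured: drawn from a predictable set
    "abstract": "Surveys early printed indexes and the ways they were compiled",
}


def published_before(rec, year):
    """Structured access: compare a typed field directly."""
    return rec["year"] < year


def mentions(rec, word):
    """Unstructured access: scan the words of the text for a match."""
    return word.lower() in rec["abstract"].lower().split()
```

The topics that might turn up in the abstract cannot be listed in advance, so mentions has to accept any word at all, whereas published_before relies on a field whose meaning was fixed when the database was designed.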

concrete entities in IR databases; concrete events in IR databases : 27

Despite this general focus on the content and features of messages and texts (and the documents in which they occur), IR databases must also deal with concrete entities and events related to the creation and transmission of messages, texts and documents. They must describe the concrete documents in which messages and texts reside and the persons and organizations that create these messages and texts, and manufacture, publish and disseminate documents.

abstract entities, fictitious entities, imaginary entities in IR databases : 28

Examples of the kinds of abstract, fictitious or imaginary entities that IR databases typically deal with include hypotheses, theories, opinions, beliefs. IR databases also deal with rather abstract or hypothetical processes, such as evolution, growth and development, cognition, consciousness, and with rather abstract attributes such as aesthetics, feelings, emotions. Mythical or fictional figures, characters and events are also fair game for IR databases. In fact, any topic of interest to human beings will be found in IR databases, because all these topics will be found in human messages, and it is human messages that form the focus of most IR databases.

alternative names for IR databases : 29

IR databases have also been called "bibliographic databases," "document databases," "textual databases," "textbases," and more recently "digital libraries." The modern term "IR database" has come to replace (and to include) a wide spectrum of tools used to organize and provide access to documents: bibliographies, indexes, indexing and abstracting services, information resource guides and handbooks, and reading lists.

definition of full-text database : 30

full-text database. Full-text databases are IR databases that contain the full text of the documents that they describe and organize for retrieval. Such texts may be based on a variety of representation codes, such as linguistic, pictorial, musical, mathematical, etc. We have long had full-text databases in print media. Examples include handbooks and encyclopedias. In addition, monographs with their own back-of-the-book indexes also qualify as full-text databases, because the text of the monograph is presented together with an index, and this index describes and reorganizes the content and other features of the text for retrieval. But usually, "full-text database" refers to electronic databases.

definition of digital library : 31

digital library. Digital libraries are full-text IR databases that replicate, in digital media, many of the functions of traditional libraries. They tend to contain a purposefully selected collection of texts plus various means of access to these texts. ODLIS: Online Dictionary of Library and Information Science (Reitz 2000) defines "digital library" simply as: "A library in which a significant proportion of the resources are available in digital (machine-readable) format, as opposed to print or microform. The process of digitization began with indexes and abstracting services, then moved to periodicals and reference books."

definition of reference database : 32

reference database. Reference databases are IR databases that point to (refer to!), but do not include, the full text of the documents that they describe. Documents are represented by surrogates, such as citations, abstracts, excerpts, notes, and pictures.

definition of information retrieval system : 33

information retrieval system (or information storage and retrieval system). Information retrieval systems are the systems that make it possible to search IR databases. They provide the interfaces that permit users to compose searches and match them against database indexes or to browse indexes that are displayed for visual inspection. Often the search system is so integrated into the database itself that it is inseparable. This is especially true for print-on-paper databases, such as printed indexes, catalogs, bibliographies, handbooks and encyclopedias. In electronic retrieval, however, the information retrieval system may be completely separate, so that the same IR database can be vended or made available by different vendors or agencies, each of which provides an entirely different information retrieval system, with entirely different interfaces, different search engines, different search commands, and different display options.

definition of index : 34

index. An index is any device that is (or can be) used to indicate or point to something of interest. Indexes are used in many fields in addition to library and information science, such as the consumer price index in economics, where the index points to the rise and fall of prices. In information retrieval, an index is used to indicate the content and features of messages and the locations of these messages and/or the location of particular content or features within these messages. There are many types and varieties of indexes, corresponding to types of IR databases listed in section 1.5. Indexes are produced in many different ways, both by human analysis and computer manipulation.
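
One way to picture an index that points both to messages and to locations within them is the following minimal sketch of a word index produced by computer manipulation. The two tiny documents, their locators, and the build_index function are hypothetical, and a real index would involve far more analysis than splitting text on spaces.

```python
from collections import defaultdict

# Hypothetical documents, keyed by a made-up locator.
documents = {
    "doc1": "clay tablets and papyrus",
    "doc2": "indexes for clay tablets",
}


def build_index(docs):
    """Map each word to (document locator, word position) pairs."""
    index = defaultdict(list)
    for locator, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((locator, position))
    return dict(index)


index = build_index(documents)
```

Looking up index["clay"] gives [("doc1", 0), ("doc2", 2)]: the entry indicates which documents contain the word and where in each document it occurs.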

definition of message : 35

message. IR databases are used in order to find and retrieve messages. A message is the content of a meaningful communication. In order to be communicated (to be sent and to be received), messages must be encoded into texts, using symbols or representations that can convey meaning to recipients of the message. A message is potential information. If a message is actually received by someone who pays some level of attention to it, that person can be said to have been informed by the message, and the message itself can qualify as information.

messages versus works : 36

In library cataloging, the term "work" is used in much the same way as "message" is used in this book. In Patrick Wilson's classic treatise, Two kinds of power: an essay on bibliographical control (1968, p. 7-14), he makes a considerable effort to make as clear a distinction as possible between a work (a message) and the various texts in which that work (message) may be encoded. He points out, for example, that a particular poem can be translated into other languages, and that a translation of a poem need not itself be a poem. These translations, some in non-poetic forms, are certainly not identical texts, but they each reflect or encode the same work. In another example, he points out that translations of the works of the German philosopher Schopenhauer are still considered texts representing his works (his messages), even though they are in languages different from the one in which he wrote. So the distinction between messages or works and the texts in which they are or can be encoded is fundamental.

texts versus exemplars : 37

Wilson also makes a distinction between texts and their exemplars, or particular copies or manifestations, pointing out (p. 6-7) that the exemplars of the same text can vary considerably, yet still be said to consist of the same text. We are all aware of this, that the same painting (a text) can be reproduced in copies (reproductions) of different sizes, just as the same language text can be reproduced in different printings with different type faces and sizes and different paginations.

definition of information : 38

information. This is a slippery concept that is best avoided, except in terms like "information science" (the established name of a discipline) and "information retrieval" (the name of a primary focus in information science). The problem with "information" is that it has come to have too many meanings, and it is therefore often vague and unclear. On the one hand, it is used to refer to the process of informing or becoming informed. But more frequently, it is used to stand for messages, texts, and documents, whether or not these are actually informative for a person confronting them. Does it really make sense to call ancient Greek manuscripts "information" if one can't even read them?

definition of datum, data : 39

datum, data. A "datum" (singular of "data") may be considered to be a single fact or item of evidence. To be informative, a datum needs one or more additional data of different sorts to provide context. Thus it can be said that a message (potential information) needs at least two data. A set of numerical data, such as "70, 90, 28, 64," is meaningless unless some explanation is provided. Do these data refer to temperatures? sport scores? or what? Similarly, a simple datum regarding color, such as "red," carries much more meaning when it is combined with at least one more datum, such as "chair." Data are often presented in tables, along with explanations, e.g., average temperatures by month and place in major cities of the United States, or the latest scores of today's football games. Because IR databases focus on messages, they rarely deal with raw data except in the context of messages, where data are placed in context.

definition of knowledge, wisdom : 40

knowledge. "Knowledge" refers to what someone knows. It resides in the mind and the brain, but it is reflected in messages. "Wisdom" refers to the wise use of knowledge. It is not technically correct to say that knowledge resides in messages. According to the following definitions for "text" and "document," messages are no more than organized sets of symbols. But these symbols refer to what persons know, think, believe, feel, and understand, so it is entirely proper to say that messages, texts, and documents record, reflect, and convey the knowledge of a person, a group, or an entire culture.

views of Korfhage (Robert R.) on data, information, knowledge, wisdom : 41

Robert Korfhage (1997, p. 8-10) presents a slightly different but useful view of the relations among "data," "information," and "knowledge." He also adds "signal" at the beginning of this continuum and "wisdom" at the conclusion, citing as well Manfred Kochen's treatment of "the chain that begins with signal and ends with wisdom" (Kochen 1974, p. 62). Kochen inserts "understanding" between "knowledge" and "wisdom." But can there be knowledge without understanding? Here are Korfhage's definitions:

definition of signal : 42

Signal: "At one end [of this hierarchy of increasing complexity], less complex than data, is the signal that must be transmitted from one place to another during information processing. This signal may be a bit stream, an electromagnetic wave form, or some other form" (p. 9).

definition of data : 43

Data: "Data are impersonal; they are equally available to any users of the system" (p. 8).

definition of information : 44

Information: "Information, in contrast, is a set of data that have been matched to a particular information need. That is, the concept of information has both personal and time-dependent components that are not present in the concept of data." Information requires "the active intervention of a user" (p. 8).

definition of knowledge : 45

Knowledge: "Knowledge builds upon information, integrating any new information with that previously known to form a large, coherent view of a portion of reality" (p. 9).

definition of wisdom : 46

Wisdom: "Finally, wisdom adds to this knowledge a broader view still, encompassing all of known reality, and governing the use of the information that has been obtained and the knowledge that has been developed. It involves the capacity to make balanced judgments in the light of certain value criteria" (p. 9).

definition of text : 47

text. Messages are encoded or recorded in texts. Texts are meaningful collections of symbols assembled to convey a message. The word "text" is related to the word "textile," and just as textiles consist of organized fibers or threads, a text consists of an organized set of symbols. Spoken language (speech) texts consist of meaningful sequences of sounds (phonemes). Writing is the representation of speech, and it uses visual symbols to represent the sounds of speech (phonemes) and sometimes to distinguish among meanings (night vs. knight). Examples of writing symbols include Chinese characters (which represent not only sounds but also meanings), Japanese "kana" syllabaries (in which each symbol represents a syllable — a combination of a consonant and a vowel), and Roman, Cyrillic, Greek and other alphabets (in which each symbol represents one or more phonemes, with separate symbols for vowels and consonants).

types of texts : 48

In addition to language texts, there are many other kinds of texts that convey messages — musical texts, image texts (such as those embodied in still or moving pictures), three-dimensional texts (such as those created through architecture, sculpture, and industrial design), dance and other performance texts, and mathematical and chemical texts. Music has a well developed symbol system that is used to represent sound, including pitch and length, in visual media such as print on paper. We call these texts "scores." Dance choreography can also be represented symbolically, and scientific disciplines like mathematics and chemistry have well developed sets of symbols and codes for representing mathematical and chemical concepts. The symbol systems of art, as in painting and sculpture, are less formal, but most people would agree that paintings and sculpture convey messages, even if it is not always easy to discern them or to agree on what they are. In painting, the field of iconography is devoted to the study and identification of artistic messages.

definition of medium, media : 49

medium. Media (the plural of medium) are the physical substances on which or in which a text is conveyed or recorded. Ephemeral media include the airwaves over which sound (including speech) is sent and received. Information retrieval generally deals with longer lasting media, such as stone, clay, metal, paper, film, and the newer media for the recording of electronic digital data: disks, tapes and chips made from various forms of plastic, silicon and metal. An important responsibility of our profession is to make sure that the media on which messages are recorded can actually be preserved for as long as the message has value. This is a big challenge, especially for untested newer magnetic and optical media for digital data. One hopes that these modern media, including silicon (ceramic!) chips, will be as long-lasting as their ancient relatives, the clay tablets of the Middle East and elsewhere.

definition of format : 50

format. Texts come in many shapes and styles, influenced by the medium on or in which a message is encoded, by the meaning and purpose of the message, and by the intended recipients of the message. Shape and style contribute to something we usually call format. It covers a wide spectrum of attributes, such as literary genre (poetry, narrative, drama, essay, speech, fiction, etc.), type of presentation (chart, diagram, picture, cartoon, list, etc.), and type of publication (such as book versus pamphlet versus broadside or poster on paper media, slide versus motion picture in film media). D. W. Langridge (1992, p. 28) suggests six types of attributes related to "the method of selection, arrangement or display" of message content under the heading of format: 1. order, how material is arranged, e.g., alphabetically, chronologically, or in some classified order according to mutual relations of message content or features; 2. literary forms or genres (poetry, drama, essay, narrative fiction, short story); 3. reductions (abstracts, excerpts, quotations, summaries); 4. collections (encyclopedias, compendia, handbooks, readers); 5. keys to other documents (indexes, bibliographies, catalogs); and 6. rules (standards, codes, recipes). The distinction between "format" and "medium" is sometimes cloudy or fuzzy. For example, the meaning of "book" usually includes its medium (paper) as well as its shape (leaves bound together along one edge). When the content of a book is moved to electronic media, is it still a book? Probably the meaning of "book" will continue to shift as the media on or in which book-like messages are conveyed change over time. After all, our word "book" came from the Anglo-Saxon word for beech tree, because ancient runes were once written on beech bark! The connection in German is quite close: "Buche" for "beech tree" and "Buch" for "book."

definition of document : 51

document. A document is a combination of text and medium. Texts cannot exist without embodiment in some medium, whether ephemeral, like airwaves, or longer lasting, like paper, film, or electronic media for digital data. Usually we use "document" to refer only to texts recorded in the longer-lasting media, and it is these documents that are susceptible to indexing and later retrieval.

definition of catalog, cataloging : 52

catalog, cataloging. A catalog is an index for a particular collection of messages. A union catalog is an index for several collections. "Cataloging" is the process of creating a catalog, so it is a type of indexing.

definition of descriptive cataloging, descriptive indexing : 53

descriptive cataloging, descriptive indexing. "Descriptive cataloging" is an old and honorable term that refers to the description and indexing of texts and documents with respect to features other than the content, purpose, or meaning of the text's message. Such features include the authors and other creators of texts (editors, composers, illustrators, translators, artists, etc.); the names or titles of texts (including subtitles, parallel titles, alternate titles, running titles, etc.); the publishers or manufacturers and distributors of documents containing texts; the size and medium of documents; and the symbol set and code used to encode the text. Codes and symbols used to encode texts include natural languages and their writing systems (French, German, Chinese), but also codes and symbols for music, dance, chemistry, mathematics, etc., and, at another level, codes for the representation of messages in digital media. Names and index terms are established for the most important features. Descriptive cataloging (along with subject cataloging) is part of the process for making a catalog. "Descriptive indexing" is a rarely used term for the same process outside of the context of catalogs for particular collections of documents.

definition of subject cataloging, subject indexing : 54

subject cataloging, subject indexing. Whereas descriptive cataloging and descriptive indexing focus on the surface features of texts and documents, subject cataloging and subject indexing focus on analysis, description and indexing of the content, purpose or meaning of messages, in other words, the topics or subjects of messages and texts. The description of certain non-topical features of messages, texts and documents is frequently included in subject cataloging and indexing. Examples include special audiences (books for children), special formats (poetry, fiction, dictionaries, periodicals, statistics), special aspects or approaches (history, case studies), special media (film, video recordings, audio recordings, world-wide web), etc. The goal is to identify and provide access to all important topics and features. The trick, of course, is figuring out what is, or will be, important for future users!

definition of classification : 55

classification. "Classification" literally means to place items in classes, resulting in groupings of items sharing some similarity. By extension, it can refer to the creation and/or naming of these classes. By further extension, it often includes the arrangement of classes in a logical, relational, non-alphabetical or non-alphanumeric order. At the fundamental level, indexing and classification are the same process, because in both operations, messages must be analyzed, and based on this analysis, grouped into categories or classes. Finally, these groupings must be named and arranged to provide access. At the more superficial level, but reflecting its most common usage, classification refers to the logical, relational (non-alphabetical) arrangement of classes, in contrast to alphabetical indexes in which classes are simply arranged in alphabetical or alphanumeric order on the basis of their names.

definition of documentary unit : 56

documentary unit. A "documentary unit" is the portion of a document that can be directly retrieved by an IR database. Documentary units may be complete documents, such as complete books or complete periodical articles. Or they may be parts of complete documents — chapters in books, or paragraphs or charts or diagrams or illustrations in periodical articles. This same variety in the size of documentary units applies to all media. An IR database for videotapes, for example, might retrieve only complete videotapes (so that the documentary unit is the complete tape), or it might be able to retrieve individual frames or short sequences of frames, in which case either the individual frames or the short sequences of frames constitute the documentary units. In all cases, the documentary unit is the unit that is analyzed for indexing (either by machine algorithm or by human inspection). Consequently, the "documentary unit" is also called the "unit of analysis." "Bibliographic unit" has also been used for this concept, indicating the unit described and retrievable via a bibliography. Small documentary units have also been called "information units," but one should hope that all documentary units will be informative! Chapter 6 deals with documentary units.

definition of indexable matter : 57

indexable matter. "Indexable matter" is the actual portion of a documentary unit on which indexing or classification is based — on which index terms or headings are based or from which terms are extracted. Not all indexes need to be based on the entire text of a message. Sometimes a message can be adequately summarized by a part of its text. Thus, if an index does not need to be very detailed, a good title might be sufficient to represent the message of a periodical article for purposes of indexing or classification. In that case, the title could be the indexable matter for the documentary unit — the periodical article. Abstracts of scholarly articles are a common example of indexable matter. Many indexing and abstracting services base their indexing and classification on the abstracts of the messages that they cover. For important messages, the entire text of the message may need to be consulted, thereby making the entire text the indexable matter. Sometimes, whole categories of messages may be excluded from indexable matter. An index for a scholarly journal, for example, may index only substantive research articles and exclude from indexable matter all advertisements, letters to the editor (unless they comment on articles that are indexed), announcements, calls for papers, etc. (Indexable matter is also called "analysis base," because it constitutes the base (or basis) of analysis — the text on which analysis is based.) Chapter 7 deals with indexable matter.

definition of term, index term, bound term : 58

term. A "term" is a word or a phrase representing a single concept or multiple concepts that are tightly bound together in the context of a particular IR database. An "index term" is such a word or phrase associated with a documentary unit for the purposes of retrieval. Some concepts need more than one word to express them, for example, "information science" or "venetian blind." Some terms could be divided into two separate terms, but they are used so commonly together, in a consistent order, that they are considered a single "bound term" or "compound term." Examples of such bound terms are "information science" (which could be the "science" of "information"), "library schools," "school libraries," "birth control," and "juvenile delinquency." If "information science" were split into separate terms for "information" and "science," we could get all sorts of "false drops" (unwanted documents) dealing with information problems in science when doing a search for "information" AND "science." We could divide "birth control" into "birth" and "control," but birth control is more about the control of conception than the control of birth, so the bound term "birth control" is more useful. Generally speaking, when two or more terms are almost always used in the same way, in the same order, for a particular concept or set of concepts, they should be kept together as a single bound term. One hardly ever hears anyone refer to "the science of information," the "control of conception or birth," or the "delinquency of juveniles." The National Information Standards Organization guidelines for thesauri have a whole section on compound terms (National Information Standards Organization 1993, section 4).

definition of complex term : 59

Sometimes "complex term" is used for a single phrase denoting more than two distinct concepts. The Library of Congress introduced the complex term "telephone assistance programs for the poor" in 1990. This single term could be broken up into separate terms for "telephones," "assistance programs" and "poor people," so it could qualify as an example of a complex term.

definition of descriptor, equivalent term : 60

descriptor. The term "descriptor" is usually reserved for a term that is part of a controlled indexing language. Such indexing languages are often listed in a thesaurus. For each concept included in the indexing language, one descriptor will be chosen to represent the concept, and all other terms that can be used for the same concept are linked to the descriptor by means of cross-references. Thus, if a thesaurus uses the descriptor "lawyer," then it might not use the terms "attorney," "barrister," "solicitor," or "counselor-at-law." Each of these alternative terms can be linked to the preferred descriptor "lawyer" and would be given the status of un-used synonymous or equivalent terms. (Equivalent terms are terms that are not truly synonymous, but are close enough so that they can be considered equivalent in the context of a database. Anyone who knows the English legal system knows that "barrister" and "solicitor" are not exactly the same as U.S. "lawyers," but in many databases, the distinction would not be important enough to make, so that "barrister" and "solicitor" could be considered equivalent to "lawyer.")
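
The cross-reference idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table USE_REFERENCES and the function to_descriptor are hypothetical names, and the table holds just the "lawyer" example from the paragraph above.

```python
# Hypothetical thesaurus fragment: each unused equivalent term points
# to the preferred descriptor by means of a cross-reference.
USE_REFERENCES = {
    "attorney": "lawyer",
    "barrister": "lawyer",
    "solicitor": "lawyer",
    "counselor-at-law": "lawyer",
}

def to_descriptor(term):
    """Return the preferred descriptor for a term.

    Terms without a cross-reference are returned unchanged
    (apart from trimming and lowercasing).
    """
    normalized = term.strip().lower()
    return USE_REFERENCES.get(normalized, normalized)

print(to_descriptor("Barrister"))  # lawyer
print(to_descriptor("lawyer"))     # lawyer
```

A searcher who enters any of the equivalent terms is led to the single descriptor under which the documentary units have been indexed.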

definition of free-text term, keyword : 61

free-text term. Often shortened to "free text," "free-text term" usually refers to the use of uncontrolled words or terms from natural language text for indexing or searching. When one searches the actual text of a document, one is searching the free-text terms that are found in the document. The difference between "free-text terms" and just "terms" is that sometimes terms may be standardized, at least a little, with respect to format, and they may also have links with the most common synonyms or equivalent terms, even if they are not controlled to the extent of formal descriptors. In this paragraph, every term or phrase is a free-text term. Some of the smaller words (such as "to," "the," "of," etc.) may be listed on a "stop list" of unsearchable terms — terms that cannot be searched for by themselves, but they are still free-text terms! "Keyword" is often used to indicate the more important free-text terms.
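
As a rough illustration of the relation between free-text terms and a stop list, the following Python sketch splits a sentence into words and then sets aside the words on the stop list. The stop list shown here and the function names are assumptions made for the example; actual stop lists differ from one database to another.

```python
import re

# Hypothetical stop list; real stop lists vary from database to database.
STOP_LIST = {"to", "the", "of", "a", "an", "and", "in"}

def free_text_terms(text):
    """Split text into lowercase words; every word is a free-text term.

    Hyphenated words such as "free-text" are kept together.
    """
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def searchable_terms(text):
    """Free-text terms that are not on the stop list."""
    return [term for term in free_text_terms(text) if term not in STOP_LIST]

sentence = "The history of the indexing of free-text terms"
print(free_text_terms(sentence))
print(searchable_terms(sentence))
```

The first list contains every word of the sentence, including "the" and "of"; the second contains only the words that could be searched for by themselves.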

definition of heading, classification caption : 62

heading. In displayed indexes (indexes that are designed for visual inspection by humans as opposed to non-displayed indexes that are searched by computer algorithm), index terms are combined into headings consisting of multiple terms. It is possible to have index headings with only single terms, but headings of two or more terms are more meaningful, because the lead term is modified or amplified or described by the subsequent term or terms. The subsequent term or terms create a context for the first, or lead, term. Compare, for example, the meaning of the simple heading "United States" versus the more detailed meaning of "United States — history — civil war — bibliography." In the second heading, "United States" has been modified or defined by aspect or approach (history), event or period (civil war), and format (bibliography). An index heading is an essential part of an index entry. When displayed indexes are displayed in classified rather than alphabetical order, the headings are often called "captions."

definition of syntax for index headings, syntax for search statements : 63

syntax. "Syntax" is a linguistic term meaning (1) "orderly or systematic arrangement," or more precisely, (2) "the arrangement of words as elements in a sentence to show their relationship; sentence structure" (Webster's 1966, p. 1480). It comes from the Greek for putting or arranging together. The first meaning is labeled "obsolete," but it is closer to the meaning intended here in borrowing "syntax" from linguistics and applying it to index headings and search statements. "Syntax" is used in this book to mean rules or patterns for the combination of terms to form meaningful index headings or effective search statements. Index headings consist of terms arranged in a certain order, and they may display a certain structure as well, so the application of the idea of syntax seems appropriate. In modern search statements for electronic IR databases, the order or particular arrangement of terms is often immaterial, but by extension, the idea of syntax is used to refer to the rules or patterns for the combination (as opposed to the arrangement) of terms (for example the use of Boolean operators OR, AND, or NOT between terms), and also for the application of techniques for indicating term weights, proximity limits, truncation and wildcards, and for stemming and similar refinements to influence the results of a search. Here the analogy corresponds to the grammatical use of inflections (word endings or changes in form) to indicate the role of words in a sentence with respect to number (singular or plural), case (subject, object, possessive), gender (masculine or feminine) or tense (past, present, future).

64

In short, indexing or searching "syntax" is used to refer to the rules or patterns for creating index headings or search statements! Chapter 12 deals with syntax.
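
To make the idea of search syntax concrete, here is a minimal Python sketch of a search statement that combines terms with Boolean operators and truncation. Everything in it is assumed for illustration: the small inverted index INDEX, the function lookup, and the convention that a trailing asterisk marks truncation. Set intersection stands in for AND, set union for OR, and set difference for NOT.

```python
# Hypothetical inverted index: term -> set of documentary unit numbers.
INDEX = {
    "index": {1, 2},
    "indexing": {2, 3},
    "indexes": {4},
    "catalog": {3, 5},
    "history": {1, 3, 4},
}

def lookup(term):
    """Documentary units for one term; a trailing * means truncation."""
    if term.endswith("*"):
        stem = term[:-1]
        hits = set()
        for indexed_term, units in INDEX.items():
            if indexed_term.startswith(stem):
                hits |= units  # OR together every term sharing the stem
        return hits
    return set(INDEX.get(term, set()))

# Search statement: index* AND history NOT catalog
result = (lookup("index*") & lookup("history")) - lookup("catalog")
print(sorted(result))  # [1, 4]
```

The order of the two terms joined by AND makes no difference to the result, which is the sense in which the arrangement of terms in such statements is often immaterial.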

definition of postcoordinate syntax, precoordinate syntax : 65

postcoordinate, precoordinate syntax. The terms "postcoordinate syntax" and "precoordinate syntax" are used to indicate when terms are put together to represent documentary units, either before (pre) or after (post) a search begins. All index headings that are constructed for displayed indexes, which users may browse during the searching process, must of necessity be created before the search, so they are called "precoordinate" headings based on precoordinate syntax. Postcoordinate syntax is used almost exclusively for machine matching, where searchers create search statements, putting terms together at the time of the search, then make use of computer algorithms to find matching records or texts.

precoordinate index headings versus postcoordinate search statements : 66

One big difference between precoordinate index headings and postcoordinate search statements is that the precoordinate headings generally refer to actual existing documents, whereas postcoordinate search statements refer to hoped-for documents.

role of precoordinate index terms in search statements : 67

Even in postcoordinate searches, searchers may take advantage of precoordinate terms or headings that have been attached to documentary units. Such precoordinate terms or headings can prevent false drops — the retrieval of documents based on the presence of two or more terms, when these terms are not actually related in the manner intended. Thus the precoordinated combination of nationality and medium in "French painting" can prevent the retrieval of a document that deals with "French sculpture" and "Dutch painting" when it is "French painting" that is sought. If "French," "Dutch," "painting," and "sculpture" were all separate terms for later combination in postcoordinate searching, then "French" and "painting" would retrieve (in error) the document on "Dutch painting" and "French sculpture."
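
The example can be restated as a short Python sketch. The one documentary unit is the document on Dutch painting and French sculpture; the function name matches_all and the two sets of index terms are assumptions made for illustration.

```python
# One documentary unit dealing with Dutch painting and French sculpture.
# Separate single-concept terms, left for combination at search time:
separate_terms = {"french", "dutch", "painting", "sculpture"}
# Precoordinate terms, combined before any search takes place:
precoordinated_terms = {"dutch painting", "french sculpture"}

def matches_all(query_terms, index_terms):
    """Boolean AND: every query term must be among the index terms."""
    return set(query_terms) <= set(index_terms)

# A search for French painting against each style of indexing.
print(matches_all({"french", "painting"}, separate_terms))     # True: a false drop
print(matches_all({"french painting"}, precoordinated_terms))  # False: correctly not retrieved
```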

definition of index entry, locator : 68

entry. In displayed indexes, an entry represents and points to a documentary unit. An entry consists of a heading (of one or more terms) and a single locator, such as:

United States 23

or

United States. history. civil war. bibliography 44

The locator leads to the documentary unit. In this example the numbers 23 and 44 might refer to particular paragraphs or pages or to entries in a list of document citations or to documents on shelves or in a filing cabinet.

definition of index entry array : 69

When two or more entries have identical headings or subheadings, these duplicate headings are usually merged for display, resulting in "entry arrays" that might look something like this:


         United States
            Armed Forces
               Afro-Americans. Bibliography 25
                               History 24-30, 339
               California. History. 20th century 54
               China. History. 20th century 332
                      Military life. History 442
               Gays 74-80, 445-450
                    Government policy 76
                    History. 20th century 78-80
                    Legal status, laws, etc. 76-78
               History. Civil War, 1861-1865 61
                        Revolution, 1775-1783 55
                        World War, 1939-1945 93-97
               Officers. Death  333, 634
                         Directories 335
                         Education 330-331
               Women. Bibliography 99
                      History. Archival resources 98
                      Periodicals 97

presentation of entry arrays : 70

Such entry arrays are more compact and often clearer to the user than repeating each term or heading and subheading, as in the following example, which consists of the very same entries, but without merged headings. Punctuation between terms will vary. In the preceding example, dots (or periods or full stops) were used between distinct terms. In the following example, terms are separated by a space-dash-space, as used in Library of Congress subject headings.



      United States — Armed Forces — Afro-Americans — Bibliography 25
      United States — Armed Forces — Afro-Americans — History 24-30
      United States — Armed Forces — Afro-Americans — History 339
      United States — Armed Forces — California — History — 20th century 54
      United States — Armed Forces — China — History — 20th century 332
      United States — Armed Forces — China — Military life — History 442
      United States — Armed Forces — Gays 74-80
      United States — Armed Forces — Gays 445-450
      United States — Armed Forces — Gays — Government policy 76
      United States — Armed Forces — Gays — History — 20th century 78-80
      United States — Armed Forces — Gays — Legal status, laws, etc. 76-78
      United States — Armed Forces — History — Civil War, 1861-1865 61
      United States — Armed Forces — History — Revolution, 1775-1783 55
      United States — Armed Forces — History — World War, 1939-1945 93-97
      United States — Armed Forces — Officers — Death 333
      United States — Armed Forces — Officers — Death 634
      United States — Armed Forces — Officers — Directories 335
      United States — Armed Forces — Officers — Education 330-331
      United States — Armed Forces — Women — Bibliography 99
      United States — Armed Forces — Women — History — Archival resources 98
      United States — Armed Forces — Women — Periodicals 97


determination of number of index entries : 71

When identical portions of headings are merged, one cannot count headings to determine the number of entries in an index. Instead, it is the locators that must be counted. Every entry has a separate locator. It may or may not have a separate heading. Thus, the number of entries (i.e., locators) in an index is not the same as the number of headings, because the same heading can refer to a number of documentary units, and each referral constitutes an entry.
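
The following Python sketch merges a few of the entries shown above into an entry array and then counts entries and headings separately. It is a simplified illustration, not a production index formatter: the names ENTRIES and entry_array are hypothetical, each term is placed on its own indented line (a plainer layout than the one displayed above), and headings are put in simple string order rather than filed by any library filing rules.

```python
# A few entries from the example above: (heading terms, locator).
ENTRIES = [
    (("United States", "Armed Forces", "Gays"), "74-80"),
    (("United States", "Armed Forces", "Gays"), "445-450"),
    (("United States", "Armed Forces", "Gays", "Government policy"), "76"),
    (("United States", "Armed Forces", "Officers", "Death"), "333"),
    (("United States", "Armed Forces", "Officers", "Death"), "634"),
    (("United States", "Armed Forces", "Officers", "Directories"), "335"),
]

def entry_array(entries):
    """Merge identical leading terms and return the display lines."""
    # Collect the locators that share exactly the same heading.
    grouped = {}
    for heading, locator in entries:
        grouped.setdefault(heading, []).append(locator)
    lines = []
    previous = ()
    for heading in sorted(grouped):
        # Count the leading terms already displayed for the previous heading.
        common = 0
        while (common < len(previous) and common < len(heading)
               and previous[common] == heading[common]):
            common += 1
        # Display only the terms that are new, indented by their depth.
        for depth in range(common, len(heading)):
            lines.append("   " * depth + heading[depth])
        # Attach all locators for this heading to its last term.
        lines[-1] += " " + ", ".join(grouped[heading])
        previous = heading
    return lines

print("\n".join(entry_array(ENTRIES)))
print("entries (locators):", len(ENTRIES))                 # 6
print("distinct headings:", len({h for h, _ in ENTRIES}))  # 4
```

Six locators are attached to only four distinct headings, so counting headings would understate the number of entries.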

sequences of locators : 72

One area of debate in the indexing community is whether a sequence of locators, such as "76-78," constitutes one locator or three: 76, 77, and 78. The answer should probably depend on the nature of the documentary units to which these locators refer. If they are paragraphs or pages in a continuous text, they could be considered a single locator referring to a three-paragraph or three-page documentary unit. But if the documentary units are independent documents, such as three separate periodical articles, then they are clearly three separate locators.
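
One way to express this distinction when counting locators is sketched below in Python. The function count_locators and its continuous_text flag are assumptions introduced for illustration; the sketch handles only single numbers and simple ranges such as "76-78".

```python
def count_locators(locator, continuous_text):
    """Count the locators represented by a string such as "76-78".

    continuous_text=True:  the range is one locator for one multi-page
                           or multi-paragraph documentary unit.
    continuous_text=False: each number stands for a separate
                           documentary unit, so each is counted.
    """
    if "-" not in locator:
        return 1
    first, last = (int(part) for part in locator.split("-"))
    return 1 if continuous_text else last - first + 1

print(count_locators("76-78", continuous_text=True))   # 1
print(count_locators("76-78", continuous_text=False))  # 3
```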

numbers of locators under headings : 73

One sign of a bad index is too many locators (entries) under individual headings. The National Information Standards Organization (NISO) technical report on indexes (Anderson 1997a, p. 22) recommends that no index heading and no main heading-subheading combination should have more than five attached locators, unless these locators themselves convey additional information, as is the case when document citations, with document titles, are used as locators. The rationale for this guideline is that most users do not want to examine too many documentary units in hopes of finding a relevant message. The technical report suggests that users should not have to consult more than five documentary units when they search for a relevant message related to any given index heading. To achieve this goal, indexers can use more specific headings, or they can add more information to headings (by adding additional terms) in order to characterize the documentary unit in more detail and to differentiate among the various messages that might fall together under a more generic heading.
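
A check of this kind is easy to automate. The Python sketch below flags headings that carry more than five locators; the sample index, its locator numbers, and the function name overloaded_headings are invented for illustration, and the sketch ignores the exception for locators that convey additional information.

```python
MAX_LOCATORS = 5  # limit taken from the guideline described above

def overloaded_headings(index):
    """Return the headings that have more than MAX_LOCATORS locators."""
    return [heading for heading, locators in index.items()
            if len(locators) > MAX_LOCATORS]

# Hypothetical index fragment: heading -> list of locators.
sample_index = {
    "Armed Forces": [12, 25, 31, 44, 58, 61, 70],
    "Armed Forces. Officers": [333, 634],
}
print(overloaded_headings(sample_index))  # ['Armed Forces']
```

A heading flagged in this way is a candidate for subdivision by more specific subheadings.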

criteria for index entries : 74

According to Timothy Craven (1986, p. 7), a good index entry will provide enough information so that an index user can safely ignore the documentary units to which it refers. This is the principle of eliminability — the need to provide enough information so that the user can eliminate the entry without having to follow up its locators and examine their documentary units. Craven suggests four other criteria for good index entries: predictability, collocation (similar entries falling together in an index), clarity, and succinctness. These criteria will be addressed in chapter 12 on syntax, in section 12.2.

definition of locator : 75

locator. The "locator" is the part of an index entry that leads the user to the documentary unit to which the index entry refers. It indicates the location of the documentary unit or the location of a representation of the documentary unit (such as a citation, abstract, description, or thumb-nail image). The locator can be as brief as a number, representing a page or paragraph in a back-of-the-book index, or it can be long enough to include a full citation that can be used to locate a documentary unit, perhaps in a library or on the internet. Chapter 15 deals with locators. See also entry.

definition of database record : 76

record. A record (or database record) contains the description of a message, the text in which it is encoded, and the documentary unit that contains the text. All the information or data in a database about a particular message, text and documentary unit goes into its record. Examples of such data include: a citation to the text and its documentary unit, including creator, title, publisher or manufacturer, format and medium; an abstract or some other description of the message content and features of the message, text, and documentary unit, sometimes including a small picture (thumbnail) of an image document or a short segment of sound; and all the content and feature terms, descriptors or headings associated with the documentary unit. The database record is usually structured or formatted according to some regular pattern. For example, many library catalogs use the MARC (Machine-Readable Cataloging) record format, developed initially by the Library of Congress and now a world-wide standard. Many databases create their own record format. In some database models, especially relational databases, the record is not a single unit, but is a node that contains links to all the data related to a particular message, text and documentary unit. For example, the name of a publisher may be recorded in a table of publishers and the name of an author may be in a table of authors. The particular publishers or authors linked to a particular message, text and documentary unit are called into a record display when that display is requested. Chapter 20 deals with record formats.
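
The relational arrangement can be sketched with Python's built-in sqlite3 module. The table names, column names, and sample values below are all hypothetical, and the sketch assumes one author and one publisher per record, which is simpler than most real databases. The point is only that the publisher and author names live in their own tables and are called into the record display when it is requested.

```python
import sqlite3

# In-memory database with hypothetical tables and sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE publishers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE records (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES authors(id),
        publisher_id INTEGER REFERENCES publishers(id)
    );
    INSERT INTO publishers VALUES (1, 'Example Press');
    INSERT INTO authors VALUES (1, 'Doe, Jane');
    INSERT INTO records VALUES (1, 'An example title', 1, 1);
""")

def record_display(record_id):
    """Assemble a record display by following the links to other tables."""
    row = conn.execute(
        """SELECT r.title, a.name, p.name
           FROM records r
           JOIN authors a ON a.id = r.author_id
           JOIN publishers p ON p.id = r.publisher_id
           WHERE r.id = ?""",
        (record_id,),
    ).fetchone()
    return {"title": row[0], "author": row[1], "publisher": row[2]}

print(record_display(1))
```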

definition of relevance : 77

relevance. Judgments of relevance are used in information retrieval as an indication of the usefulness of retrieved documentary units in response to a request or a search. The common measures of retrieval effectiveness, recall and precision, are both based on a determination of relevance (see section 9.1). Sometimes, researchers try to make distinctions between relevance, utility, pertinence, and similar terms, or to distinguish types of relevance, such as topical relevance as opposed to user relevance (the idea being that a document might be on the topic (and therefore topically relevant), but the user can't use it or doesn't want it — perhaps he or she can't read the language or already has the document or the writing is too complex, etc.).
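
As a brief illustration of how both measures rest on relevance judgments, the Python sketch below uses the customary definitions: recall is the proportion of relevant documentary units that were retrieved, and precision is the proportion of retrieved documentary units that are relevant. The function name and the sample numbers are assumptions made for the example; section 9.1 gives the fuller treatment.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one search.

    retrieved: documentary units returned by the search
    relevant:  documentary units judged relevant to the request
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # retrieved units that were judged relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# Four units retrieved, five judged relevant, two in common.
print(recall_and_precision({1, 2, 3, 4}, {2, 4, 6, 8, 10}))  # (0.4, 0.5)
```

Change the set of units judged relevant and both figures change, which is why the question of who makes the judgments matters.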

judges of relevance : 78

An associated controversy is who is qualified to judge the relevance of documents. In earlier (and some current) information science research, so-called expert judges made relevance judgments, but now in most information retrieval circles, these judgments are suspect. There is a growing consensus that to assess the effectiveness of IR databases and information retrieval systems for the actual users or clientele of these systems, the only legitimate judges of relevance, whatever its definition, are the actual users or clientele who have the actual information needs and make the information requests or conduct the searches. If this is the case, then relevance simply means that a user judges a documentary unit to be a useful response to her or his request or query.

1.4. Standards and Codes of Practice.

standards versus scientific research : 79

Since the beginning of librarianship, millennia ago, improvements in practice have come about mainly through the development of new and better standards or codes of practice. Scientific research, as a means to study and understand phenomena and thereby improve practice, is a relatively recent innovation that came into librarianship, for the most part, with the advent and popularity of information science, mostly after World War II. Whereas scientific research is based on empirical testing of hypotheses, standards and codes of practice are based on expert opinion.

standards for cataloging, classification : 80

In the world of indexing, cataloging, and classification, professional bodies have created a wide variety of codes of practice.

Current codes for cataloging and classification include:

Anglo-American cataloguing rules (2002);

Library of Congress subject headings and associated manuals and guides for their application: Subject cataloging manual: subject headings and Free-floating subdivisions, all issued by the Library of Congress (1996, 1999, 2003);

many specialized lists of subject headings and thesauri such as Medical subject headings (National Library of Medicine 1999), the ERIC thesaurus (Educational Resources Information Center), and the Art and architecture thesaurus (1994);

several library classification schemes, including the Dewey decimal classification (Dewey 1996), the Library of Congress classification (Library of Congress 2004), the Universal decimal classification (British Standards Institution 1961), and the Bliss classification (1997); and

codes for the arrangement of alphabetical catalogs and indexes: A.L.A. filing rules (American Library Association 1980) and Library of Congress filing rules (Library of Congress 1980).

standards for indexing : 81

In the realm of back-of-the-book indexing, the venerable Chicago manual of style (1993) has the status of a standard, even though it was never formally adopted by any standard-setting body. But then, neither were Library of Congress subject headings or most classification schemes. Many codes of practice become de facto standards through widespread adoption by practitioners. Formal standards are created by standard-setting bodies such as the National Information Standards Organization (NISO) or the International Organization for Standardization (ISO).

standards for alphanumeric arrangement : 82

The arrangement of alphabetical catalogs and indexes is an interesting example of the impact and use of standards or codes of practice versus research. In the late 1970s, as computers became more and more important in cataloging operations, librarians decided that the older codes for arranging entries in alphabetical catalogs were no longer adequate. There were too many exceptions that required complicated algorithms or human intervention for computer implementation. Examples included the arrangement of abbreviations as if the full term were spelled out, the arrangement of numerals as if the number were written out in the language of the text, and the consideration of heading elements in an order different from their order in certain headings — "Edward II, King of England," for example, was arranged as if it were "Edward, King of England, 2."

Library of Congress filing rules as standard for alphanumeric arrangement : 83

So both the American Library Association (ALA) and the Library of Congress (LC) set up committees of experts to create new codes for the arrangement of alphabetical catalogs. They produced very different and conflicting rules, reflecting deep disagreements on the best way to arrange catalog entries. The Library of Congress continued an old practice of grouping headings on the basis of implicit criteria that are unknown to most users. If headings begin with the same word, the type of heading takes precedence over the content of the heading (the actual words). Names of persons come before names of places. Personal forenames come before family names. Names of places come before names of things (first corporate bodies, then topical subject headings). Names of things and topical subjects come before titles of documents. For example:



   George III, King of Great Britain, 1738-1820      [forename]
   George, Saint, d. 303                             [forename]
   George, Alan                                      [family name]
   George, William C.                                [family name]
   George (Ariz.)                                    [place name]
   George (Wyo.)                                     [place name]
   George (Motor boat)                               [thing: corporate body]
   George, Lake, Battle of, 1755                     [subject heading]
   George [motion picture]                           [document title]
   George and the dragon                             [document title]


(Examples taken from Library of Congress filing rules, 1980, p. 24, with two examples and some explanatory modifications added.)

arrangement of subheadings : 84

Also, according to Library of Congress filing rules, subheadings or subdivisions under initial subject headings are not arranged alphabetically, but first grouped by the type of subdivision, such as chronological periods, general forms and topics, place names, limiting adjectives (preceded by a comma), qualifications (enclosed within parentheses), and phrases. This results in many non-alphabetical arrays, such as the following example under "missions," in the 1995 edition of Library of Congress subject headings:


      Missions — African influences
      Missions — Theory
      Missions — Asia
      Missions — United States
      Missions, American
      Missions, Tamil
      Missions (canon law)
      Missions and Christian union
      Missions to Buddhists
      Missions to Mormons
      Missions around the world                      [document title]

(Actual headings as arranged in the 1995 edition of Library of Congress subject headings, with the addition of one document title.)

A.L.A. filing rules as standard for alphanumeric arrangement : 85

The ALA rules rejected these non-alphanumeric distinctions, preferring to arrange headings only on the basis of the actual alphabetic letters or numerals of each heading. The ALA experts claimed that users would miss desired headings, because they are unaware of the special non-alphanumeric criteria imposed by the Library of Congress. Take, for example, a library catalog with hundreds of entries under "missions," with a variety of subheadings as well as document titles beginning with the word "missions." When a user comes to "missions" in the catalog, how could he or she be expected to know that "missions, American" comes after "Missions — United States," or that "missions around the world" comes at the very end of the sequence, after "missions to Mormons"?
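The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an implementation of either published code: the type labels and their ranks (TYPE_RANK) are hypothetical stand-ins for the LC grouping described above, and the "plain" key simply ignores punctuation and compares the remaining letters, numerals, and spaces.

```python
import re

# Hypothetical ranks for heading types, in the order the LC grouping is described.
TYPE_RANK = {
    "forename": 0,
    "family name": 1,
    "place name": 2,
    "corporate body": 3,
    "subject heading": 4,
    "document title": 5,
}

# A few of the "George" headings, each tagged with an assumed type label.
HEADINGS = [
    ("George and the dragon", "document title"),
    ("George (Ariz.)", "place name"),
    ("George, Alan", "family name"),
    ("George III, King of Great Britain, 1738-1820", "forename"),
    ("George, Lake, Battle of, 1755", "subject heading"),
    ("George (Motor boat)", "corporate body"),
    ("George, Saint, d. 303", "forename"),
]


def normalize(text):
    """Lowercase the heading and keep only letters, numerals, and single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def grouped_key(entry):
    """Same first word: heading type takes precedence over the remaining words."""
    text, kind = entry
    first_word = normalize(text).split()[0]
    return (first_word, TYPE_RANK[kind], normalize(text))


def plain_key(entry):
    """Arrange only by the characters of the heading, ignoring its type."""
    text, _kind = entry
    return normalize(text)


grouped = [text for text, _ in sorted(HEADINGS, key=grouped_key)]
plain = [text for text, _ in sorted(HEADINGS, key=plain_key)]

for label, array in (("Grouped by type:", grouped), ("Plain alphanumeric:", plain)):
    print(label)
    for heading in array:
        print("   " + heading)
```

In the grouped array, "George and the dragon" lands at the very end; in the plain array it follows "George, Alan" immediately, which is where a user unaware of the type distinctions would look for it.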

NISO standards for alphanumeric arrangement : 86

The National Information Standards Organization (NISO) began working on a new standard for the "alphabetical arrangement of letters and the sorting of numerals and other symbols" early in the 1990s (National Information Standards Organization 1996a), but this proposed standard failed to achieve the required consensus among NISO members. The proposed standard was much closer to the ALA rules than to the LC rules, but it differed from both of these de facto standards in significant ways. Most notably, initial articles ("a," "an," and "the" in English) were to be considered for arrangement, whereas most initial articles are ignored in arrangements based on ALA and LC rules. Another departure concerns the arrangement of decimal numbers by numerical value rather than by numerical digits. The problem of fractions was not addressed. NISO later published these recommendations as a technical report (Wellisch 1999).
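The effect of considering or ignoring initial articles can be seen in a small sketch. The titles and function names below are invented for illustration, and the article list covers English only; neither key is meant to reproduce the full text of any of the codes discussed.

```python
ENGLISH_ARTICLES = ("a ", "an ", "the ")

# Invented titles used only to show the difference between the two keys.
TITLES = ["The index", "Abstracts", "An outline", "Thesauri"]


def key_ignoring_articles(title):
    """Skip one initial English article before comparing."""
    lowered = title.lower()
    for article in ENGLISH_ARTICLES:
        if lowered.startswith(article):
            return lowered[len(article):]
    return lowered


def key_with_articles(title):
    """Compare the title exactly as written, initial article included."""
    return title.lower()


articles_ignored = sorted(TITLES, key=key_ignoring_articles)
articles_considered = sorted(TITLES, key=key_with_articles)

print(articles_ignored)
print(articles_considered)
```

With articles ignored, "The index" sits between "Abstracts" and "An outline" (arranged under "index" and "outline"); with articles considered, "An outline" moves up under A and "The index" moves down under T.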

lack of research on alphanumeric arrangement : 87

In this whole process, there was almost no research as to how users perceived alphabetical or alphanumeric order and which arrangement alternative would be easier for them to use. One exception consisted of several experiments conducted with small groups of students in the United Kingdom (Hartley, Davies, & Burnill 1981). Students were asked to arrange sets of headings as they would expect to find them in a back-of-the-book index. As with the experts, however, these students exhibited no consensus, suggesting a wide variety of possible arrangements. In any case, experiments such as this do not necessarily indicate the impact of different arrangements on searching or browsing effectiveness. These researchers did attempt to assess the speed of access to particular headings in sample alphabetical indexes arranged in different ways, but the differences were of no significance.

88

To this day, we do not have any significant body of research on which to base our arguments for particular alphabetical arrangements, and experts are fiercely divided on such issues as whether spaces between words should be considered or ignored (letter-by-letter versus word-by-word arrangement), the appropriate arrangement of subheadings, the arrangement of fractions and decimal numbers, and many similar issues that result in very different arrays of entries in alphabetical or alphanumeric indexes.
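The difference between word-by-word and letter-by-letter arrangement is easy to demonstrate. The following is a minimal sketch with invented headings; both keys ignore punctuation, which is a simplifying assumption rather than a rule taken from any particular code.

```python
import re

# Invented headings chosen so that the two arrangements disagree.
HEADINGS = ["Newark", "New York", "New Zealand", "Newspapers", "New Haven"]


def word_by_word_key(heading):
    """Spaces count: compare the first words, then the second words, and so on."""
    return re.findall(r"[a-z0-9]+", heading.lower())


def letter_by_letter_key(heading):
    """Spaces are ignored: compare the heading as one unbroken string."""
    return "".join(re.findall(r"[a-z0-9]+", heading.lower()))


word_by_word = sorted(HEADINGS, key=word_by_word_key)
letter_by_letter = sorted(HEADINGS, key=letter_by_letter_key)

print(word_by_word)
print(letter_by_letter)
```

Word-by-word arrangement keeps all the "New ..." headings together ahead of "Newark"; letter-by-letter arrangement scatters them among "Newark" and "Newspapers."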

views of Saracevic (Tefko) on research versus standards : 89

To be fair, one leading expert in information science (Tefko Saracevic) has declared that there are issues, such as this one, that are simply not amenable to research and must be subject to standards. Indeed, it is not easy to design appropriate, meaningful research to gather empirical evidence on arrangement questions. But what does one do when there is simply no, or insufficient, agreement among experts?

display of subject headings in online public access catalogs : 90

In 1992, a Subcommittee on the Display of Subject Headings in Subject Indexes in Online Public Access Catalogs (a subcommittee of the Subject Analysis Committee of the Cataloging and Classification Section of the Association for Library Collections and Technical Services, a part of the American Library Association) brought out a small book entitled Headings for tomorrow: public access display of subject headings (American Library Association 1992). In this book, this subcommittee laid out the options and the arguments in favor of various arrangement alternatives, but it made no attempt to reach a consensus on the major controversies or to make recommendations regarding the best kind of arrangement, except in non-controversial areas such as the arrangement of numbers in ascending numerical order (p. 23).
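Even the non-controversial case, numerals in ascending numerical order, takes a little care in machine sorting, because a plain character-by-character comparison places "10" before "2." A minimal sketch follows, with invented headings and a hypothetical key function; it handles whole numbers only and says nothing about decimals or fractions.

```python
import re

# Invented headings containing whole numbers.
HEADINGS = ["Chapter 10", "Chapter 2", "Chapter 1"]


def numeric_aware_key(heading):
    """Split the heading into text and numeral runs; compare numerals by value."""
    parts = re.split(r"([0-9]+)", heading.lower())
    # re.split with a capturing group alternates text, numeral, text, ...
    # so odd positions always hold the numeral runs.
    return [int(part) if position % 2 else part
            for position, part in enumerate(parts)]


character_order = sorted(HEADINGS, key=str.lower)
numerical_order = sorted(HEADINGS, key=numeric_aware_key)

print(character_order)
print(numerical_order)
```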

lack of consensus among standards on alphanumeric arrangement : 91

Drusilla Calvert (1996) provides a good summary of the status of standards for alphanumeric arrangement. In an article comparing the latest British and international standards for indexes, she, in effect, throws up her hands in dismay and declares, "Filing, or sorting, is a hornet's nest. All standards seem to disagree with all others" (p. 75). This is indeed the case, resulting in chaos for users, who are mostly unaware that there are major differences in possible alphanumeric arrangements. When they don't find something, they just assume it's not there, not suspecting that it has been placed in a completely unexpected location!

92

We shall return to questions concerning the arrangement of headings and entries to facilitate searching and browsing in chapter 17, Arrangement of displayed indexes.

controversies in information retrieval : 93

This story of alphabetical arrangement can be extended to many other controversies in the world of IR, some very central, such as the role and method of vocabulary control, automatic indexing versus human intellectual analysis, and boolean logic versus ranked weighted retrieval in machine searching.

standards for information retrieval : 94

For IR in the United States, the most important standard-setting bodies are the International Organization for Standardization (ISO) and the National Information Standards Organization (NISO). NISO is a United States body, similar to national bodies in most other developed countries, such as the British Standards Institution in the U.K. These bodies are responsible for a wide range of standards affecting the design and performance of IR databases, on such topics as information interchange formats, international standard numbering for documents in various formats and media, indexes, abstracts, technical reports, thesauri, holding statements, computer character sets, paper permanence, information retrieval protocols (Z39.50), romanization and transliteration of non-Roman writing systems, common command language, interlibrary loan, East Asian character codes, bookbinding, computer software description, library shelving, country codes, CD-ROMs, electronic manuscripts, price indexes, bibliographic references, patron records, circulation transactions, alphanumeric arrangement, preservation, environmental conditions, microforms, library codes, and many more (National Information Standards Organization 1997a, 1997c).

NISO Committee YY and new standard for indexes : 95

In 1991, NISO created a new committee, labeled YY, to revise the 1984 standard for indexes: Z39.4-1984 Basic criteria for indexes (National Information Standards Organization 1984). This committee spent five years studying the issues, soliciting and receiving input from NISO member organizations and interested information professionals, and suggesting standards that would encompass all types of indexing (automatic and human) and all types of indexes (print and electronic, displayed indexes for visual inspection and non-displayed indexes for machine searching). Because indexes are so central to IR databases, the committee addressed all aspects of IR database design.

opposition to standard for indexes : 96

Two NISO members, the American Society of Indexers and the American Society for Information Science, objected to recommendations of Committee YY regarding automatic indexing and non-displayed indexes (indexes that are searched by computer algorithm as opposed to being displayed for searching by human visual inspection). Ironically, these two organizations were also the ones that were most closely involved in the work of the committee. Most members of Committee YY were members of both of these organizations, and both of these organizations sponsored meetings and consultations regarding the development of the standard. An article about the development process for this proposed standard appeared in the Journal of the American Society for Information Science (Anderson 1994).

objections of American Society of Indexers to standard for indexes : 97

The American Society of Indexers (ASI) is primarily an organization of human indexers who earn their living by creating indexes based on their human intellectual analysis of messages and texts. From the very beginning, ASI's official representatives consistently and strenuously objected to any suggestion that finding or searching tools based on simple computer algorithms could be considered indexes. The kinds of tools that ASI objected to included KWIC, KWOC, KWAC (key-word-in-context, key-word-out-of-context, key-word-along-side-context) and permuted indexes, all of which have been widely used and are universally called indexes. The ASI objection also extended to non-displayed indexes that are widely and routinely used in simple full-text searching. ASI argued forcefully that only retrieval tools that actually contributed additional intellectual value (as opposed to simply rearranging or retrieving words as in a concordance) should be called indexes. They appeared to be willing to accept the products of more sophisticated computer algorithms based on term weighting, the identification of term phrases, and clustering as worthy to be called "indexes."
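To show how little machinery such a "simple computer algorithm" needs, here is a minimal key-word-in-context (KWIC) sketch. The stop list, the sample titles, and the function names are all assumptions made for illustration; real KWIC products differ in layout and in how they choose keywords.

```python
# A tiny, hypothetical stop list of words that never serve as keywords.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

# Invented titles; the position in the list serves as the document number.
TITLES = ["Design of indexes", "Indexes and abstracts"]


def kwic_entries(titles):
    """Return one entry per non-stop word, arranged by keyword."""
    entries = []
    for doc_id, title in enumerate(titles, start=1):
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in STOPWORDS:
                continue
            left = " ".join(words[:i])
            right = " ".join(words[i + 1:])
            entries.append((word.lower(), left, word, right, doc_id))
    entries.sort(key=lambda entry: (entry[0], entry[4]))
    return entries


def format_kwic(entries, width=28):
    """Line up the keywords in a column with their context on either side."""
    lines = []
    for _key, left, keyword, right, doc_id in entries:
        left_context = left[-width:].rjust(width)
        lines.append(f"{left_context}  {keyword}  {right[:width]}  [{doc_id}]")
    return lines


for line in format_kwic(kwic_entries(TITLES)):
    print(line)
```

Nothing here analyzes what the titles mean; the words are only rearranged, which is exactly the point of the ASI objection.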

98

Committee YY understood and appreciated ASI's concerns, but it felt it could not eliminate any tool or device that pointed to informative messages, in line with its basic definition of an index as any "indicating tool," especially when such tools are universally referred to as "indexes" in the information community. Instead, the committee chose to apply standards for vocabulary management to these simple indexes, something that such indexes almost universally lack.

endorsement of standard for indexes by American Society for Information Science : 99

The American Society for Information Science (ASIS) gave the first official draft (1993) of the proposed new standard a strong endorsement, saying:

"The people who reviewed NISO Z39.4-199X Guidelines for Indexes and Related Information Retrieval Devices for ASIS feel that it is a very good document. They particularly note that the inclusion of computer indexing is a good enhancement and expansion for the standard" (ballot response from the Standards Committee, American Society for Information Science, 18 February 1994).

opposition from American Society for Information Science to standard for indexes : 100

But by the time the second official draft went out for a vote in 1995, the membership of the ASIS standards committee had changed, and so did its attitude toward the proposed standard:

"The attempt to extend the standard to electronic information retrieval has resulted in a standard that is overly complex, confusing, and diluted from its primary focus. The standard contains weak, incomplete coverage of online information retrieval concepts and diluted focus on the raison d'etre for the standard, which is the design of indexes such as back-of-the-book indexes. We recommend that the standard be refocussed on traditional index design ...." (ballot response from the Standards Committee, American Society for Information Science, 26 July 1995).

101

The new ASIS objections had little to do with the relatively minor changes to the draft standard since the first official draft. Rather they related to the very heart of the draft standard — the attempt to create a standard that would apply to all types of indexes, regardless of medium, type of indexing, or type of searching.

opposition from American Society for Information Science to terminology for non-displayed indexes : 102

The new ASIS Standards Committee did not like the terminology that NISO's Committee YY had adopted, after wide consultation, for indexes that were not displayed for human inspection, but rather were designed for machine searching. When Committee YY began its work, there simply wasn't a common vocabulary for such indexes. One visiting member of ASIS declared at a 1992 open meeting of Committee YY that such electronic indexes were not indexes at all, and should not be considered by the committee (Anderson, Record of November 6, 1992 meeting of Committee YY, November 7, 1992). This disagreement, like that of ASI, is fundamentally one of definition. Can the systems that permit computers to search algorithmically legitimately be said to include indexes? Are indexes an essential component of such computer search systems? It is indeed a matter of definition. Definitions can be important when they reflect conflicting models of basic IR processes. And definitions are precursors to standards.

terminology for non-displayed indexes : 103

Several members of Committee YY represented the IR database industry, and these members, along with a majority of the committee, believed strongly that non-displayed indexes (designed for machine searching) met all the criteria for indexes and were therefore within the purview of the committee. In the end, the committee settled on the term "non-displayed indexes" for these indexes that are not displayed for human inspection. Commonly used "inverted files" are an example of such non-displayed indexes.
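For readers who have not met one, an inverted file can be sketched as a mapping from each term to the set of documents in which it occurs. The sample documents and the function name below are invented, and the tokenization (lowercased runs of letters and numerals) is a simplifying assumption.

```python
import re
from collections import defaultdict

# Invented documents, keyed by document number.
DOCUMENTS = {
    1: "Indexing and abstracting",
    2: "Automatic indexing of texts",
}


def build_inverted_file(documents):
    """Map every term to the set of document numbers that contain it."""
    inverted = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            inverted[term].add(doc_id)
    return inverted


inverted_file = build_inverted_file(DOCUMENTS)
print(sorted(inverted_file["indexing"]))
```

The user never sees this structure; a search statement is matched against it by the computer, which is what makes it a non-displayed index in the committee's sense.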

role of search interfaces in non-displayed indexes : 104

A related controversy was whether the search interface for machine searching systems was part of a non-displayed index. The new ASIS Standards Committee said it was not. In contrast, the NISO Committee YY held that the search interface was an essential part of an electronic non-displayed index, because it was the interface that provided the capability for creating search statements that can be matched against the non-displayed index. These search statements, said Committee YY, are closely analogous to index headings in displayed indexes. In fact, according to Committee YY, a non-displayed index can only be considered an index in combination with a search interface. Without a search interface, a non-displayed index is unsearchable.

lack of consensus on standard for indexes : 105

These views turned out to be irreconcilable. A standard requires a certain level of consensus. There was no consensus, so NISO published the recommendations of its Committee YY as a technical report (Anderson 1997a).

impossibility of standards for indexes : 106

These stories regarding alphanumeric arrangement and IR indexes serve to illustrate the sometimes contentious atmosphere in which standards of professional practice are developed, especially in the absence of solid, widely accepted research. Standards are based, fundamentally, on expert opinion, and such opinions can be as strongly held and as staunchly defended as the most fundamental religious or cultural beliefs. The key areas of disagreement on the standard for indexes were definitional. Are finding tools created by computer algorithm truly indexes? Do machine searching systems rely on indexes? Are computer search interfaces an essential component of such indexes? Conflicting views and definitions reflect conflicting models of reality, and they may also be perceived to impact or even to threaten future professional roles.

chaos and creativity versus stability in IR database design : 107

The world of indexing and more broadly the world of IR database design and implementation have left behind a period of relative stability (from roughly 1870 to 1970) in which there was a wide consensus on practice. With the advent of computer and information technologies, we have entered a period of chaos and creativity. During the period of stability, almost every library, every indexing and abstracting service, every back-of-the-book index (in short, every IR database!) was pretty much the same with respect to how indexing was done and how indexes were presented to users (with the exception of variations in alphanumeric arrangement!). Now, online public access catalogs in libraries exhibit extreme variety — almost every one is different. The variety of indexing available, especially automatic indexing, for databases in various electronic media, including the world-wide web, digital libraries, and similar resources, increases on a daily basis.

impossibility of standards in periods of instability : 108

In this period of extreme chaos and, one hopes, creativity to deal with and respond to new needs and new opportunities, it may just be impossible to reach the kind of consensus that an official standard requires.

responsibility of information professionals in absence of standards : 109

If that is the case, individual information professionals will have to make their own judgments as to the most appropriate approach to any particular clientele, situation or problem. This book is meant to help them do just that.

1.5. Types of IR Databases.

types of indexes : 110

The NISO technical report (Anderson 1997a) identifies more than 30 types of indexes used for information retrieval. Because indexes are so central to IR databases, influencing as they do the methods for the representation of messages, texts and documents on the one hand and the methods for searching and retrieval on the other, these types of indexes correspond to types of IR databases. They are listed here, with examples. The intent of the NISO technical report, and of this book, is to address design principles that apply to every kind of index and IR database that is intended to describe messages, texts and documents and to provide access to them for subsequent retrieval.

attributes of IR databases, of indexes : 111

Like any complex entity, IR databases and their indexes can be categorized by many different attributes. The major ones are:

the kinds of objects represented in index terms, headings, and entries;
the kinds of index terms used;
the kinds of indexable matter used for indexing;
the methods for presenting the index to the user and the concomitant method for searching made available to the user;
the arrangement of entries;
the methods for analysis of message content;
the methods for term selection for indexing;
the methods for term combination in index headings;
the methods for term combination in searching;
the kinds of documents being indexed;
the medium of the IR database;
the proximity of the documents being indexed to the IR database itself;
the size of documentary units;
the periodicity of the IR database;

and finally,

the authorship of the database.

112

WARNING! The types of IR databases and indexes listed below will mention many complexities that haven't been explained yet. After all, most of the book is yet to come. So don't worry. The purpose of this list is to emphasize the wide scope of IR database and index possibilities. It can also be used for reference, later on, simply to review some of the choices available in IR database and index design. So the first time through, just scan it, and don't worry about the details.

113

Here is this complex list laid out, one criterion at a time, with some explanation and with some examples of real, existing IR databases.

1.5.1. Kinds of Objects Represented in Index Terms, Headings, and Entries.

indexes to authors, topics, features : 114

The major categories of objects represented in the terms, headings, and entries of indexes are the persons and organizations responsible for the creation of messages, texts, and documents, and the topics and features of these messages, texts, and documents.

indexes to authors, illustrators, editors, translators, publishers : 115

a. indexes to persons and organizations responsible for messages, texts, and documents:
i. author indexes.
ii. illustrator indexes.
iii. editor indexes.
iv. translator indexes.
v. publisher indexes.

indexes to composers, choreographers, lexicographers, painters, sculptors : 116

Depending on the nature of messages, authors can be writers, composers (of music), choreographers (of dance), lexicographers (of dictionaries), painters, sculptors, etc.

indexes to subjects, places, institutions, documents, laws, quotations, Bible verses : 117

b. indexes to topics addressed in messages and texts.
i. general subject indexes.
ii. specialized indexes to types of subjects, such as places, persons, institutions, operations, and documents (e.g., laws, quotations, Bible verses), etc.

indexes to features : 118

c. indexes to features of messages, texts, and documents.

indexes to titles : 119

i. title indexes.

indexes to genres, science fiction, novels, fiction, short stories, poems : 120

ii. genre indexes, e.g., an index to science fiction novels or short stories or poems.

indexes to document numbers, international standard numbers : 121

iii. document number indexes, e.g., an index to ISBNs (international standard book numbers).

Note: The author of a message and its text is perhaps its most important feature, so category 1.5.1.a could have been subsumed under this more general category — but persons and institutions responsible for documents get their own category because they are so important.

1.5.2. Kinds of Terms Used.

122

Index terms usually consist of words, but they can also consist of numbers of various types and also other types of specialized symbols.

role of words in index terms : 123

a. word indexes.

Word indexes can be further categorized by the types of words, e.g.,

role of proper nouns, common words in index terms : 124

i. proper nouns — names of persons, corporate bodies, places.
ii. common words

role of numbers in index terms : 125

b. numerical indexes.

role of symbols in index terms : 126

c. indexes using specialized symbols

role of mathematical symbols in index terms : 127

i. mathematical symbols.

role of chemical symbols in index terms : 128

ii. chemical symbols.

role of musical symbols in index terms : 129

iii. symbols representing music.

1.5.3. Kinds of Indexable Matter Used.

full text as basis for indexing : 130

a. indexes based on the full text of documentary units.
b. indexes based on summaries of documentary units, e.g.,

titles as basis for indexing, title indexes : 131

i. indexes based on titles only.

abstracts as basis for indexing : 132

ii. indexes based on titles and abstracts.
c. indexes based on portions of documentary units, e.g.,

lead paragraphs as basis for indexing : 133

i. lead paragraph only.

tables of contents as basis for indexing : 134

ii. tables of contents only.

introductory matter as basis for indexing : 135

iii. introductory matter.

reference citations as basis for indexing, citation indexes : 136

iv. reference citations (for citation indexes).

first lines as basis for indexing : 137

v. first lines (as in poems).

1.5.4. Presentation and Methods for Searching.

138

There are two fundamentally different ways that IR database indexes can be searched: (1) visual scanning and examination of index headings, and (2) mechanical or electronic symbol comparison and matching. (It is also possible to create Braille indexes that are scanned by touch and audible indexes that are listened to, but the first two approaches are the major ones.) The first method is performed by humans. The second method is now performed by computer algorithms. (Prior to the computer, various mechanical means were devised for comparison and matching.) For the first method, the index must be displayed for human visual inspection. For the second method, the user does not necessarily see the index. Some of the best IR designs will combine these two approaches, so that users can take advantage of sophisticated electronic machine matching algorithms but can also see displays of index headings when they wish to browse or make some preliminary judgments about documents or the direction of a search. (Here the focus is on methods of searching. An IR database that provides only for electronic machine matching, with no display of indexes, will still display the results of a search for human examination and consideration!) So we have IR databases that provide:

displayed indexes : 139

a. displayed indexes for visual searching.

non-displayed indexes : 140

b. non-displayed indexes for searching by means of computer matching algorithms.

c. both types of indexes, for both types of searching.

1.5.5. Arrangement of Entries.

presentation of IR databases; internal computer representation not addressed : 141

Non-displayed indexes may have internal arrangements to facilitate computer comparison and matching, but this book does not address these internal computer issues. The methods and techniques for internal electronic representation and manipulation are constantly changing, and their mastery requires expertise and experience separate from that required for high quality design of IR databases from the point of view of their presentation to and use by human users. The focus of this book is on the presentation of IR databases and their indexes to users. Many different computer methods can be used for the same type of presentation.

arrangement of displayed indexes : 142

So here, we focus on the arrangement of displayed indexes — those indexes designed for human visual scanning and inspection.

143

Such indexes must have an order that facilitates the location of particular entries. Here are the choices:

alphanumeric arrangement of displayed indexes : 144

a. alphabetic or alphanumeric indexes. At first glance, this is a simple category, and a very popular one for indexes, but as discussed above in section 1.4 on standards, there is no agreement on what constitutes proper alphabetic or alphanumeric order. Consequently, there are many different approaches and versions. These shall be taken up in detail later in section 17.1 on alphanumeric displays.

relational arrangement of displayed indexes : 145

b. logical, relational or classified indexes. Here, headings are arranged according to various types of relationships among the concepts represented. Criteria for such arrangements can be increasing or decreasing importance, chronology, class inclusion (creating hierarchies from broad topics to narrow ones), or a whole and its parts. These arrangements are often called "classified," but this term tells you nothing about the basis of the arrangement, especially because the classes represented by index headings can also be arranged alphabetically. Relational arrangements will be discussed in some detail later on in section 17.3.

alphabetical-relational arrangement of displayed indexes : 146

c. combined alphabetical-relational indexes. Some arrangements combine aspects of alphabetical and relational criteria. They are sometimes called "alphabetico-classed." One approach is to arrange broad classes in alphabetical order, with subordinate classes arranged under broad classes on the basis of various relational criteria. The opposite approach is also used. Broad classes are arranged on the basis of relational criteria, but narrower, subordinate classes may be arranged in alphabetical order. The Library of Congress classification uses this latter approach quite frequently.

1.5.6. Methods for Analysis.

147

As with the arrangement of entries, there are two fundamentally different approaches to the analysis of messages for indexing, with a third approach combining elements of the two basic approaches. Thus we have:

human intellectual analysis of texts for indexing, human indexing : 148

a. indexes based on human intellectual analysis of messages and texts.

computer algorithmic analysis of texts for indexing, automatic indexing : 149

b. indexes based on various computer algorithms for the analysis of machine-readable texts. This is often called "automatic indexing."

combination of automatic indexing and human indexing : 150

c. indexes based on combinations of computer and human analysis.

1.5.7. Methods for Term Selection.

151

Index terms can be extracted from texts (if the texts consist of words) or they can be assigned to texts. Extractive indexes are most often associated with automatic computer-based indexing, but human indexers can also limit their selection of terms to those appearing in language texts. Assignment indexing is done most often by human indexers, but computer algorithms also have been developed to assign terms not found in texts. Thus we have:

extraction of index terms : 152

a. indexes based on extracted terms.

assignment of index terms : 153

b. indexes based on assigned terms.

combination of extraction and assignment of index terms : 154

c. indexes based on both the extraction and the assignment of terms.
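
The difference between the three options can be sketched in a few lines. This is a deliberately simple illustration, not a description of any real indexing system: the stop list, the trigger words, and the assigned terms below are hypothetical assumptions chosen only to show that assigned terms need not appear in the text.

```python
import re

# Hypothetical stop list of words too common to serve as index terms.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "is", "are"}

# Hypothetical assignment rules: a word found in the text triggers a term
# from a controlled vocabulary that does not itself occur in the text.
ASSIGNMENT_RULES = {
    "cataloging": "Bibliographic control",
    "catalogs": "Bibliographic control",
    "indexing": "Subject analysis",
}


def words_in(text):
    """Return the set of lowercase alphabetic words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def extract_terms(text):
    """Extraction: index terms are words taken from the text itself."""
    return sorted(words_in(text) - STOPWORDS)


def assign_terms(text):
    """Assignment: index terms come from a vocabulary outside the text."""
    found = words_in(text)
    return sorted({term for trigger, term in ASSIGNMENT_RULES.items() if trigger in found})


def combined_terms(text):
    """Combination: both extracted and assigned terms are used."""
    return sorted(set(extract_terms(text)) | set(assign_terms(text)))


sample = "Indexing and cataloging for the library"
print(extract_terms(sample))  # ['cataloging', 'indexing', 'library']
print(assign_terms(sample))   # ['Bibliographic control', 'Subject analysis']
```

Here the rules are applied by an algorithm, but the same distinction holds when a human indexer chooses between the author's own words and terms from an authorized list.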

1.5.8. Methods for Term Combination.

necessity for combination of index terms : 155

Indexes must provide the capability to search for multiple topics or features at the same time. If indexes provided access to only one topic or feature at a time, they would be pretty worthless. Can you imagine searching a large database for everything related to "United States," with no capability of combining that term with anything else that you want?

methods for combination of index terms : 156

There are two basic types of methods for the combination of terms, and these are correlated with whether the index is displayed or non-displayed. Thus we have:

precoordinate combination of index terms : 157

a. precoordinate term combination for indexes that are displayed — terms are combined (or coordinated) before the index is presented to the user for searching.

postcoordinate combination of index terms : 158

b. postcoordinate term combination for indexes that are non-displayed — terms are combined (or coordinated) after access to the index is presented (via a search interface) to the user, at the time of the search.

precoordinate and postcoordinate combination of index terms; information science as example of bound term : 159

c. indexes based on both precoordinate and postcoordinate terms. Precoordinate terms are often used in non-displayed indexes to represent complex concepts and to prevent the inaccurate or inappropriate combination of discrete terms. (For examples, see the discussions of pre- and postcoordinate syntax in section 1.3 on terminology.)
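
A small sketch may make the contrast concrete, using "information science" as the bound term. The four documents, their numbers, and the function name are hypothetical; the sketch assumes the simplest possible postcoordinate method, a Boolean AND over sets of document numbers.

```python
# Hypothetical documents, identified by number, and what each is about:
#   1: information retrieval
#   2: information science
#   3: information about science policy
#   4: science education

# Postcoordinate index: single terms, to be combined at the time of the search.
postcoordinate_index = {
    "information": {1, 2, 3},
    "retrieval": {1},
    "science": {2, 3, 4},
}

# Precoordinate index: terms were combined before the index reached the user.
precoordinate_index = {
    "information retrieval": {1},
    "information science": {2},
}


def search_all(index, *terms):
    """Return the documents indexed under every one of the given terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


loose = search_all(postcoordinate_index, "information", "science")
bound = search_all(precoordinate_index, "information science")
print(loose)  # {2, 3}
print(bound)  # {2}
```

The postcoordinate search retrieves document 3 even though it is not about information science, because both words happen to be index terms for it. The precoordinate (bound) term prevents that inappropriate combination, which is why such terms are often kept even in non-displayed indexes.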

1.5.9. Kinds of Documents Being Indexed.

160

Here, IR databases are characterized not on the basis of their own features, but on the basis of the types of documents that are included or represented and indexed for the database. These are as various as all the existing types of documents, and new types are being developed or invented all the time. Only some representative examples are listed here:

IR databases for periodicals : 161

a. periodicals: articles in periodicals or whole periodicals (complete sets); also specialized forms of periodicals or serials, such as newspapers, newsletters, etc.

IR databases for books, monographs : 162

b. books and monographs, including "back-of-the-book" indexes for single books.

IR databases for poetry : 163

c. poetry.

IR databases for fiction : 164

d. fiction; also specialized types of fiction, such as science fiction, romance, historical novels, mysteries, fantasy, short stories.

IR databases for film media, motion pictures, slides, photographic media : 165

e. film: motion pictures and other types of film or photographic media (such as slides, filmstrips, photographs).

IR databases for videotapes : 166

f. video; video recordings.

IR databases for pictures : 167

g. pictures: reproductions, paintings, drawings, photographs, etc.

IR databases for maps, geographical information systems : 168

h. maps of all types, two-dimensional, three-dimensional; flat maps and charts; globes; geographical information systems.

IR databases for music, sound recordings : 169

i. music and sound documents, including all sorts of sound recordings — spoken, music, and other types of sounds, such as bird songs, animal sounds, weather sounds, etc. — on various media. Also musical scores.

IR databases for machine-readable texts : 170

j. machine-readable texts.

IR databases for computer software : 171

k. computer software.

IR databases for internet resources : 172

l. internet; including world-wide web resources.

1.5.10. Media of IR Databases.

173

The media of IR databases are as varied as the media of documents in general — after all, IR databases are documents too. The major media used for IR databases are:

paper as medium for IR databases : 174

a. paper. Before the development of paper, IR databases were recorded on its precursors, such as stone and clay tablets, parchment and other animal skins, papyrus, tree bark and other vegetable matter. Paper media include card stock, which was the most popular medium for library catalogs for about a century, until electronic media became viable and popular.

microforms as media for IR databases : 175

b. microforms. IR databases have appeared in various styles of microfilm and microfiche.

electronic media for IR databases : 176

c. electronic media. This broad category includes an ever increasing variety of formats, such as compact discs (CDs, CD-ROMs), larger optical disks, magnetic disks and tape, as well as online databases maintained in accessible computer media and of course websites.

sound media for IR databases : 177

d. sound media. Spoken indexes sometimes accompany sound collections and archives. These are similar to those ever more pervasive voice mail menus that confront you when you call many offices and agencies. Sound indexes can be especially useful for persons with visual impairments.

braille media for IR databases : 178

e. braille media. Braille is usually recorded on paper, but because it is a specialized combination of symbols for persons with visual impairments, it gets a separate listing.

1.5.11. Proximity of Documents Being Indexed.

full-text databases : 179

a. full-text databases. A full-text database contains the full text of the documents to which it points. This category includes books published with traditional back-of-the-book indexes, as well as the increasingly popular full-text electronic IR databases, ranging from newspaper and periodical databases to encyclopedias and other reference works of various sorts and digital libraries. If you are surprised to find the printed book with index in this category, just remember that here too, the index is combined with the full text of the document being indexed, so it qualifies!

reference databases : 180

b. reference databases. Reference databases provide access to documents that are not included in the database. Instead, the IR database provides some sort of locator, such as a bibliographic citation and possibly a call number or notation that can be used to obtain the full document from a library collection, publisher, the internet, or other distributor or document delivery service.

library catalogs : 181

A library catalog may be seen as a reference database that refers to items in the library's collection. On the other hand, the library as a whole, including its catalog, may be considered a full-text database, because the documents to which the catalog refers are within its collections (unless they are checked out!).

1.5.12. Types and Sizes of Documentary Units.

182

Here we categorize IR databases and their indexes with respect to the kind of documentary units (parts of documents, complete documents, collections of documents) that are analyzed for retrieval. These units depend, of course, on the type of document. We give examples mostly from language documents, but analogous examples could be given from visual image documents (photographs, paintings), moving image documents (films, videos), sound documents, etc. In the past these units were often called "bibliographic units," because they were described in bibliographies.

definition of bibliography : 183

In this book, we have subsumed the term "bibliography" in the broader, newer term "IR database," but "bibliography" and "bibliographies" are fine old words that mean writing (graphy) about books (biblio), thus they have come to mean lists and descriptions of books. There is no reason to limit their meaning to "books," because the "biblio" part of the word comes from the Greek for papyrus leaves! So by extension, bibliographies can deal with messages and texts in any format and medium, just as IR databases can and do.

IR databases for small documentary units : 184

a. IR databases for small documentary units (parts of complete documents), such as lines, sentences, paragraphs, and pages, or frames in a videotape, or segments of pictures or maps. These indexes lead the user inside the full document. Sometimes such small units are referred to as "information units" because they are more likely to lead directly to a precise message that may answer the searcher's query.

IR databases for complete documents : 185

b. IR databases for complete documents, e.g., periodical articles, chapters in collections, papers in conference proceedings, stories and poems in anthologies, and monographs.

IR databases for collections of documents : 186

c. IR databases for collections of documents, e.g., anthologies; complete sets of periodicals, serials and series; archives; libraries, etc.

1.5.13. Periodicity of IR Databases.

monographic databases : 187

a. monographic databases. Like any document, an IR database can be a monograph — a one-time publication, sometimes called a "closed-end" database or index.

serial databases : 188

b. serial databases. Or an IR database can be designed for updating on a regular or irregular basis. These databases are sometimes called "continuing" or "open-end" databases or indexes.

1.5.14. Authorship of IR Databases.

189

Finally, IR databases can be categorized by authorship, whether an IR database has been created by one or a small number of individuals who can be named and credited with its creation or by a large organization, with the participation of many persons, so that the personal influence of individual authors is not apparent. IR databases relying on automatic indexing are created, in part, by machine algorithms, but human beings "authored" the algorithms that are used.

1.5.15. Continuing Examples.

examples of IR database design : 190

Throughout this book, design principles related to the topic of each chapter will be applied to three prominent types of IR databases — (1) a book or monograph with its own index (often called a back-of-the-book index); (2) an indexing and abstracting service for a scholarly discipline; and (3) a full-text encyclopedia, which can be seen as a digital library of messages and texts.

monographs as examples of IR databases : 191

For the example of a single book as an IR database, indexes will be designed for both electronic and print media. The index at the end of this book illustrates the implementation of the design for the print-medium index.

indexing and abstracting services as examples of IR databases : 192

The example of a scholarly indexing and abstracting service will be an indexing and abstracting service for the literature of library and information science. Every reader of this book likely has some familiarity with, or at least interest in, these disciplines.

full-text encyclopedias and digital libraries as examples of IR databases : 193

The example of a full-text encyclopedia (or digital library) will be an IR database consisting of digital texts on library and information science.

1.6. IR Databases Versus Other Types of Databases: A Recap.

definition of database : 194

Throughout this book, you will find frequent use of the term "database." As discussed in section 1.3 on terminology, the definition of "database" as used in this book is simple: an organized collection of data designed for retrieval. Although the term "database" (data base, data-base), and its companion term (in earlier days) "databank," grew out of a computer environment, it need not imply any particular medium for the database. In this book, "database" will refer both to print databases and to electronic digital computer-based databases.

varieties of IR databases : 195

"Database" is a convenient word for the enormous variety of IR tools that librarians, indexers, abstracters, and information specialists of various sorts have developed over the years — indexes, indexing and abstracting services, bibliographies, catalogs, gazetteers, dictionaries, concordances, directories, encyclopedias, handbooks. All of these are organized collections of data designed for retrieval, so all can be legitimately called IR databases.

1.6.1. Two Types of Databases.

databases for concrete entities and events versus IR databases : 196

As discussed in section 1.3 on terminology, databases can be categorized in many ways: by data models, by purpose, by subject area, and by the kinds of phenomena represented. It is this last categorization that is central to this book. With respect to the primary phenomena represented, databases can be divided into two types: (1) concrete entity and event databases, and (2) IR databases. By far the most common in everyday life are type-1 databases, the concrete entity and event databases. These databases are designed to provide information about concrete entities (things, objects) and concrete events (transactions, operations, processes). Bank databases and airline databases were cited as examples in section 1.3. University databases are another example, containing as they do information about every student, course, course offering, instructor, classroom, grade, and tuition payment, all of them concrete entities and real events. In contrast, IR databases focus on messages, and these messages frequently relate to phenomena that are abstract, vague, emotional, and imaginary: anything but concrete!

197

Concrete entity and event databases are designed around the attributes and relationships among concrete entities and events. For example, students take courses, get grades, and pay tuition. Instructors teach courses and get paid a certain amount every so often. In contrast, most IR databases do not attempt to define possible relations in advance. There are just too many potential relationships among concepts represented in messages and texts, and some of these relationships are only discovered later through subsequent use and analysis.

● exclusion of databases for concrete entities and events from scope of this book:

management information systems, database management systems : 198

This book does not concern itself with the very important category of concrete entity and event databases. If these are the databases you want to read about, get a good book on database management systems (DBMS). Most books using that term are talking about concrete entity and event databases. Management information systems (MIS) also consist largely of concrete entity and event databases, although MIS people are paying more and more attention to IR databases.

1.6.2. IR Databases.

199

To close this introductory section of the book, let's return to the basic topic and purpose of the book: the design of IR databases.

messages as key entities for IR databases : 200

IR databases are databases that focus on messages rather than directly on concrete entities and events. Messages, of course, can and frequently do deal with concrete entities and events, but just as often, they deal with abstract entities, theories, hypotheses, feelings, opinions, ideologies, dreams, emotions, properties, attributes, operations, processes, and even imaginary characters, places, events and times, which are completely foreign to most concrete entity and event databases.

IR databases as hybrid databases : 201

In fact, IR databases must be hybrid databases. They must deal with certain types of concrete entities and events as well as messages. The primary concrete entities in IR databases are the documents in which texts and messages are embedded and the persons and institutions that create messages, texts and documents. IR databases also generally record data about the production and publication of these documents, which are of course concrete events.

databases for concrete entities and events versus IR databases : 202

But a database that does no more than record the existence of documents is no different, in theory, than a database of car parts, or the products for sale by JCPenney or L.L.Bean. What makes an IR database special is its focus on the content, meaning, purpose and features of messages.

scope of this book : 203

This book is about the design of IR databases for all types of messages, texts and documents. A special focus will be the enormous variety of indexes that can be used in IR databases, because indexes are essential components of such databases. It is the index that organizes the data for retrieval. Without an index, an IR database loses the element that gives it organization. It is no longer a "collection of data organized for retrieval," and therefore is no longer a database in the usual sense of the term.

indexes versus IR databases; components of IR databases : 204

The line between an index and an IR database is a very fine and fuzzy one. They are closely related and intertwined. Actually, as we explore the design of IR databases, we will find that an IR database generally consists of three major components: (1) the index; (2) the collection of documents (in full-text databases) and surrogates (representations or descriptions of documents); and (3) the collection of terms that is used for the description and retrieval of documents. In some IR databases, this third component, the collection of index terms, is expanded to include synonymous, equivalent and variant terms and relations among them. Such an elaborated vocabulary component is called a thesaurus.
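
As a rough sketch of how the three components fit together, consider the following data structure. It is an illustration under simplifying assumptions, not a design recommended by this book: the class names, field names, and sample citations are hypothetical, the index is a plain mapping from terms to document numbers, and the thesaurus is reduced to a mapping from variant terms to preferred terms.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Surrogate:
    """A description of a document; full_text is present only in full-text databases."""
    doc_id: int
    citation: str
    full_text: Optional[str] = None


@dataclass
class IRDatabase:
    documents: dict = field(default_factory=dict)  # documents and surrogates
    index: dict = field(default_factory=dict)      # preferred term -> document numbers
    thesaurus: dict = field(default_factory=dict)  # variant term -> preferred term

    def preferred(self, term):
        """Map a term to its preferred form, if the thesaurus lists one."""
        key = term.casefold()
        return self.thesaurus.get(key, key)

    def add(self, surrogate, terms):
        """Store a surrogate and index it under the preferred form of each term."""
        self.documents[surrogate.doc_id] = surrogate
        for term in terms:
            self.index.setdefault(self.preferred(term), set()).add(surrogate.doc_id)

    def lookup(self, term):
        """Return citations for documents indexed under a term or its variants."""
        doc_ids = self.index.get(self.preferred(term), set())
        return [self.documents[i].citation for i in sorted(doc_ids)]


db = IRDatabase(thesaurus={"cataloguing": "cataloging"})
db.add(Surrogate(1, "Document A (hypothetical)"), ["Cataloguing"])
db.add(Surrogate(2, "Document B (hypothetical)", full_text="..."), ["cataloging", "indexing"])
print(db.lookup("cataloguing"))
```

Removing the index mapping from this sketch would leave a pile of surrogates with no organized way in, which is the point made above about the index being the component that organizes the data for retrieval.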

software and hardware for IR databases : 205

The focus of this book is on the design of IR databases for the effective presentation of message data to users. The emphasis is on the description and preparation of data for presentation. The book does not concern itself with particular software or hardware for the implementation of electronic databases, nor, analogously, with the technical publishing specifications for print databases.

design specifications for IR databases : 206

The end product from the study of this book should be the ability to create design specifications for an effective IR database. These design specs can then be used in three ways: with experts in database management systems (DBMS), to select an appropriate database model; with experts in interface design, to implement presentation specifications in an effective interface; and with experts in software and hardware, to select appropriate algorithms, programs, and computers. The situation is similar for print IR databases. A book designer will translate specifications for indexing syntax, arrangement, and display into a book design that incorporates the features of the presentation specifications.

role of IR database designers, information architects : 207

Thus, the IR database designer is not expected to be an expert in every aspect of IR database implementation. But the IR database designer must have a good overview of possibilities. And more than anything else, the IR database designer is the advocate for the user, to ensure that the needs and preferences of users will be faithfully represented in the final IR database product. A new term for this role is "information architect." Just as the architect of a building must rely on a whole range of experts to actually construct and maintain the building, so the information architect must rely on a similar range of experts. But the quality of the finished product and users' experience with that product is due, in largest measure, to the architect!


Last modified: Tue Jun 6 18:02:09 CDT 2006
