Kwong Bor Ng, Soyeon Park
Rutgers University, New Brunswick, New Jersey
Florida State University, Florida
In the digital library context, the role of metadata has become more important than ever before, because the effective organization of networked information clearly depends on the effective management and organization of metadata. The issue of metadata has been approached variously by different intellectual communities. The two main approaches may be characterized as: (1) the library science oriented bibliographic control approach; and (2) the computer science oriented data management approach. This paper examines the different concepts and orientations of the two major approaches contributing to the metadata discussion, and proposes an integrated concept of metadata to facilitate the merging of these two approaches. This paper also discusses the on-going efforts to establish metadata standards, and compares different metadata formats.
One of the important requirements of most intellectual activity is the ability to locate, identify, retrieve, and manipulate information. These tasks can be accomplished in a variety of ways, and have been approached variously by different intellectual communities. In the digital environment, the two main approaches may be characterized as: (1) the library science oriented bibliographic control approach; and (2) the computer science oriented data management approach.
Although these two approaches have different theories and practices, both make use of metadata schemes to facilitate the accomplishment of the above-mentioned tasks. Their underlying philosophies have different emphases and features which reflect their own research histories and contexts. In the age of digitalization of information objects -- for storage as well as dissemination and end-use -- people working in these two areas are modifying their schemes to adjust to the new environment. During the process of adjusting and exploring, the two approaches are moving closer to each other. In this paper, we propose an integrated concept of metadata to facilitate the merging of these two approaches. This paper also discusses the on-going efforts to establish metadata standards, and compares the core elements of different metadata schemes.
Library Science Oriented Bibliographic Control Approach
Libraries have a long tradition in the development of information systems. This tradition has focused on organizing the physical containers of information (e.g., books and printed materials), by means of bibliographical description, subject analysis, and classification notation construction, so that the container can be efficiently described, identified, located and retrieved. Through the establishment of cooperatively adopted rules and standards (e.g., Anglo American Cataloging Rules and International Standard Bibliographic Description for descriptive cataloging; Library of Congress Subject Headings and Library of Congress Subject Cataloging Manual for subject analysis; Dewey Decimal Classification and Library of Congress Classification for classification), libraries have met with great success in implementing highly complex, surrogate-based information organization systems. Although much attention has been paid to "reader" or "user" needs (e.g., in determining proper terminologies and controlled vocabularies for subject cataloging and authority control; in providing access points for record searching; and in the prescribed citation order of facets or aspects in classification construction), the organizational principles of the systems as implemented have not focused on information per se or its use, but on describing, locating, and retrieving the containers of information. For example, the three basic "objects" of the catalog proposed by Cutter (1904, p.12) almost a century ago are still the major concerns of this tradition:
The basic unit of a library cataloging system is the surrogate which represents the information object. The usefulness of a surrogate is determined by the degree to which it:
The need to share information among institutions, nationally and internationally, has driven the development of standards, rules and procedures by the library community to promote interoperability and usability of information systems. While the applicability and limitation of the surrogate-based systems approach to the electronic environment may need further investigation, the rules, standards and procedures developed in that environment will have significant impact.
Computer Science Oriented Data Management Approach
The computer science community also has a long tradition in data management and organization. There is a variety of computerized information storage and retrieval systems for textual and relational data. They aim not only to store, access and utilize data effectively, but also to provide data security, data sharing and data integrity functions. Various types of data storage and retrieval systems are used for different purposes by a wide range of organizational settings (e.g., commercial, scientific, technical, and research organizations). When the data archives become large, distributed, and diversified, they present representational and mapping problems which result in complex data structures and data interrogation mechanisms. Different data models and architectures have been proposed in order to solve this problem in several areas (e.g., Baltex data management system for scientific disciplines). Here, how the data will be used is of fundamental importance to the design and implementation of these systems.
Electronic Data in Cyberspace
The Internet is a flourishing channel for information dissemination which combines many of the functions traditionally fulfilled by libraries and data archives. The architecture of the Internet, however, is unlike that of either libraries or data archives. Libraries and data archives are primarily information storage and retrieval systems and only secondarily communications media; conversely, the Internet is primarily a communications medium and only secondarily an information storage and retrieval system. However, the changing environment of digital technology has gradually blurred this distinction. The concerns of libraries and data archives are moving closer to one another as each comes to rely increasingly on the use of the Internet as an information delivery system.
Unlike most traditional libraries and data archives, the Internet employs a client/server model (Freeman & York, 1991) which provides for much greater end-user control. The server delegates control of some functions to the client, but must do so without a corollary abdication of responsibility. It is incumbent on system designers to collaborate in the development of interface standards which ensure that greater end-user control does not result in less efficient and/or less thorough data attribute searching and record retrieval. This group of standards is referred to as network interoperability standards.
The Internet also differs from libraries and data archives in that information is not stored centrally, but distributed across inter-operating computer networks. Internet information resources are not:
System designers have attempted to address the issues of retrieving Internet resources through the development of Internet search engines. While these search engines can be helpful if users understand their underlying mechanisms, the exponential growth in the sheer number of online electronic resources has made it clear that without some level of meta-control, their effectiveness and efficiency will deteriorate (Taylor & Clemson, 1996). Internet search engines are limited by their inability to respond adequately to such questions as:
The successful resolution of these kinds of questions requires more than implementation of sophisticated search algorithms. The establishment of meta-information standards for Internet resources should provide logical solutions at the meta-level to many of these questions.
TWO MAJOR APPROACHES TO METADATA FORMATS
Standards for network interoperability are the province of system designers and require little input from information professionals. These standards ensure that the client/server architecture operates effectively and efficiently, that is, that communication takes place without impediment. The development of standards for information retrieval at the document level, on the other hand, will require significant cooperation among systems designers, data providers, bibliographic information specialists and electronic text encoding specialists. Through collaborative efforts, the identification, location, retrieval, manipulation, and use of the digital information (as well as electronic services) stored in (or located at, or linked through) cyberspace will be facilitated. This group of standards is referred to as "metadata standards."
Smith (1996) enumerates the characteristics of metadata as it operates in traditional library contexts:
According to Smith, digital technology can change these characteristics in the following ways:
Smith's articulation clearly reflects the main concern of current studies of metadata: data modeling. Efforts are underway from both approaches to develop metadata standards for data (or information object) modeling, each following its own path:
The focus of electronic data archives system designers in this regard is on data use, which is reflected in Gritton's email notes to metadata enthusiasts (Gritton, 1994):
... metadata represents information which supports the effective use of data from creation through long term use ...
In this approach, any additional information (e.g., content description, access restriction, and administrative data) that can improve the use of stored data is considered good metadata.
The second group is attempting to apply well-established library-oriented bibliographic control mechanisms and tools to cataloging Internet resources. Such mechanisms and tools include:
MIT Library (Xu, 1996) and OCLC's Cataloging Internet Resources Project "InterCat" (Jul, 1995) are representative examples of this practice. Given that the primary concern of library cataloging is locating and collocating, it is not surprising that this group's focus is on the resource discovery function, as reflected in the scope of the Dublin Core (Weibel, Godby, Miller & Daniel, 1995):
only those elements necessary for the discovery of the resource were considered. It was believed that resource discovery is the most pressing need ...
In the following sections, this article examines the different concepts and orientations of the two major approaches contributing to the metadata discussion, and proposes an integrated concept of metadata to facilitate the merging of these two approaches. This paper also discusses the on-going efforts to establish metadata standards within each of these traditions, and compares different metadata formats.
METADATA AND METADATA STANDARDS
What is Metadata
Metadata is a heavily loaded term (Gritton, 1994) for which many definitions have been offered. Generally speaking, metadata may be defined as:
data about data.
Agreement on the middle term of this definition -- "about" -- is crucial to a common understanding of metadata. From the bibliographic control perspective, the focus of the "aboutness" is on the characterization of the source data (Smith, 1996; Weibel et al., 1995) for identifying the location of information objects and facilitating the collocation of subject content. Metadata, therefore, is any information which records the characterization and relationships of the source data, or the set of data elements that can be used to describe and represent information objects.
On the other hand, from the computer science oriented data management perspective, the focus of "aboutness" is to enhance the use of the source data (Gritton, 1994; Strebel et al., 1994; Strawman & Bretherton, 1994). Metadata from this perspective is any information which supports the effective use of data, including information which can facilitate data management (e.g., data authentication, data sequence, data type, key field indicators), data access (e.g., range, report parameter) and data analysis (e.g., format for data mining, visualization) (Strebel, Meeson & Frithesen, 1994; Rao et al., 1995).
Although the focuses of the above two approaches to metadata are different, they are neither incompatible nor mutually exclusive. In the Internet environment, the user community is so diversified that it is hardly possible to identify a single predominant use among many different uses. Therefore, metadata schemes should be flexible enough to satisfy as many users as possible. Given these issues, we propose an operational definition of metadata encompassing both perspectives:
metadata is data which characterizes source data, describes its relationships, and supports its discovery and effective use.
The above definition is more like an expedient compromise than a new theoretical articulation. This accurately reflects the current status of metadata studies: multi-disciplinary research with different emphases from different intellectual communities.
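The operational definition can be read as a record with three parts: characterization, relationships, and support for use, with discovery operating over the characterization. The following is only a minimal sketch under that reading; the class name, field names, and sample values are hypothetical and are not drawn from any of the metadata schemes discussed in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """Illustrative record shaped after the operational definition (hypothetical)."""
    source_id: str
    # Characterization of the source data (bibliographic control emphasis).
    characterization: dict = field(default_factory=dict)
    # Relationships to other information objects, as (relation, target) pairs.
    relationships: list = field(default_factory=list)
    # Information supporting effective use (data management emphasis).
    use_support: dict = field(default_factory=dict)

    def supports_discovery(self, term: str) -> bool:
        """Return True if the term occurs in any characterization value."""
        term = term.lower()
        return any(term in str(v).lower() for v in self.characterization.values())


# Hypothetical sample record; every value is invented for illustration.
record = MetadataRecord(
    source_id="example-001",
    characterization={"title": "Rainfall observations", "subject": "hydrology"},
    relationships=[("is-part-of", "example-000")],
    use_support={"data_type": "float", "access": "authorized users only"},
)
print(record.supports_discovery("hydrology"))
```

Keeping the characterization and the use-support data in one record, rather than in separate systems, is the point of the combined definition: the same record can serve a discovery query and a data management query.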
The Functions of Metadata
Based on the above definition, it is possible to derive categories of metadata standards or schemes such as provenance, form, functionality, usage statistics, terms and conditions of use, administrative data, content ratings, linkage or relationship data, structural data, and so on. The decision of which metadata categories to include depends on the designer's understanding of the primary function of a metadata scheme.
One of the main functions of metadata is resource discovery. Most of the research in the library and information science communities has focused on this function, which supports searching, retrieval, discovery and access to resources. According to Rao et al. (1995), an important use of metadata is to support selecting, understanding, utilizing, and remembering sources and their contents. In principle, metadata provides an effective mechanism for identifying and locating data which is relevant to a particular user. Metadata should make it possible for users to determine:
· the availability of information (do the information objects exist? where are they? how many of them are available? are they all the same?)
· the usefulness of information (is it authentic? is it good? how can I determine whether it is useful or not?).
Whereas the library and information science community focuses on the resource discovery and search and retrieval functions of metadata, the computer science oriented data management communities focus on aspects of data use. Data archiving requires a schema to describe the conceptual or logical data structures of all the objects or entities with which the archive is concerned (Strawman and Bretherton, 1994), as well as the relationships between them (e.g., Sheth and Larson, 1990). Strawman and Bretherton argue that within such a well defined and structured context, the difference between metadata and data disappears -- metadata is simply data. Thus, according to this perspective, the distinction between metadata and data is merely one of use, and the focus shifts to another formidable task, that of defining context. This context includes various functional requirements such as an administrative function (e.g., authentication of users and charging mechanisms), a content designation function (e.g., data analysis to support understanding of the meaning of the data), a syntactic semantic function (e.g., record structure development), and data re-organization for presentation and visualization. Different contexts may emphasize different functions. In general, Hunter and Springmeyer (1994) claim that the basic function of metadata is to assist data management and storage systems in providing more efficient access to large data sets. Similarly, Strebel et al. (1994) propose the three main functions of metadata as:
1. data management,
2. data access, and
3. data analysis.
The functions of metadata can also be discussed at the system level and the end-user level. At the system level, metadata can be used to facilitate interoperability and shareability among resource discovery tools. Data sharing can speed completion of projects, improve the utility of research and decision-making, and reduce costs by minimizing duplication of effort. It can also support integration of the Internet resources and printed material already represented in machine readable format. At the end-user level, metadata can facilitate the ability to determine:
1. what data is available,
2. whether it meets specific needs,
3. how to acquire it, and
4. how to transfer it to a local system.
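The four end-user determinations listed above can be sketched as four small queries over a collection of metadata records. This is an illustration only: the catalog, its field names (`subject`, `access`, `format`, `size_kb`), and the matching rule are assumptions made for the example, not features of any particular metadata format.

```python
# Hypothetical catalog of metadata records; every field name is illustrative.
CATALOG = [
    {"id": "ds-1", "subject": "precipitation", "format": "ascii",
     "access": "ftp", "size_kb": 120},
    {"id": "ds-2", "subject": "soil moisture", "format": "binary",
     "access": "http", "size_kb": 4800},
]


def available(catalog):
    """1. What data is available: list the record identifiers."""
    return [rec["id"] for rec in catalog]


def meets_need(rec, subject, max_size_kb):
    """2. Whether a record meets a specific need (subject and size limit)."""
    return rec["subject"] == subject and rec["size_kb"] <= max_size_kb


def how_to_acquire(rec):
    """3. How to acquire it: the recorded mode of access."""
    return rec["access"]


def transfer_plan(rec):
    """4. How to transfer it locally: format and size inform the choice."""
    return {"format": rec["format"], "size_kb": rec["size_kb"]}


matches = [r for r in CATALOG if meets_need(r, "precipitation", 500)]
for rec in matches:
    print(rec["id"], how_to_acquire(rec), transfer_plan(rec))
```

Only the second query depends on descriptive (subject) data; the third and fourth depend on access and format data, which is why a scheme that records only descriptive elements cannot answer all four.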
As metadata has developed in different contexts and for different uses, various formats have emerged. Formats such as the Internet Anonymous FTP Archive (IAFA) templates and Text Encoding Initiative (TEI) header emphasize different aspects of data use. These metadata schemes do not always reflect the boundaries of the traditions we have discussed; rather, they often illustrate a tendency toward cooperative efforts to merge the two.
Current Metadata Formats for the Internet Resources
Currently, there is no single international standard for metadata. Recently, several metadata schemes for digital information objects have been proposed, with different levels of complexity and richness (Dempsey & Heery, 1997), from relatively simple formats such as Dublin Core, to more complicated and richer formats such as the Text Encoding Initiative (TEI) header. Of these metadata schemes, we have chosen six standards for comparison, based on their scope and their impact on other metadata schemes. The first criterion for inclusion was the scope of the metadata scheme: we have focused on generic metadata formats for "Internet resources," rather than domain-specific formats. Secondly, we chose those schemes that have been widely implemented and experimented with, rather than schemes developed for a particular community. Based on these criteria, we have chosen the following six metadata standards for comparison: Dublin Core, IAFA templates, WWW Semantic Header, URC (Uniform Resource Characteristics or Uniform Resource Citation), the OCLC InterCat project, which employs the USMARC format, and the TEI (Text Encoding Initiative) independent header. Although there is no agreement as to whether IAFA constitutes a standard, it is included in this section because it is a well-developed metadata format specifically designed for Internet use. Table 1 compares the different metadata standards. Our focus is not on the constituency, the syntax, or the record structure, but on the core elements of these standards. Since the formats discussed here are still undergoing revision, and some of them do not yet have a final draft (e.g., URC), the comparison is based on the latest information and documentation available. There may be some minor variations among different versions and drafts.
Whereas some elements such as identifier and title are common to all the schemes, other elements are particular to one scheme (e.g., system requirement of semantic header, and encoding description of TEI header). Again, the decision of which metadata categories to choose clearly reflects the primary concern of a metadata scheme.
Table 1. Attributes of Current Metadata Formats
Formats compared: Dublin Core; URC; Semantic Header; USMARC; IAFA templates; TEI Header
Key: + Mandatory; * Optional

INTRINSIC
Subject: + + + +
Title: + + + + + +
Author: + + + + + +
Publisher: + + * + + +
Publication place: + * + +
Other agent: + + + *
Date: + + + + +
Object type: + * +
Form: + + +
Identifier (URN, ISBN...): + + + + + +
Relation: + + * + *
Source: + + + +
Language: + * + + *
Coverage: + * +
Abstract: + *
Version (edition): + * + + +
Notes (annotation): + * *
Signature: + +
Classification: + *
Classification (security level): *
Keyword: + + *

EXTRINSIC
System requirement: * +
Mode of access: + +
Availability: + *
Cost: * + *
Control: * +
Extent (size): * + *
Encoding description: + *
Revision description: * + *
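As an illustration of how Table 1 can be read programmatically, the sketch below encodes only the four rows that carry a mark for every format (Title, Author, Publisher, Identifier), since only for those rows is the assignment of marks to formats unambiguous. The column order is assumed to follow the table header, and the variable and function names are hypothetical.

```python
# Column order assumed from the header of Table 1.
SCHEMES = ["Dublin Core", "URC", "Semantic Header", "USMARC",
           "IAFA templates", "TEI Header"]

# Only rows with a mark in every column are encoded; "+" is mandatory
# and "*" is optional, following the table's key.
FULL_ROWS = {
    "Title": "++++++",
    "Author": "++++++",
    "Publisher": "++*+++",
    "Identifier": "++++++",
}

LEGEND = {"+": "mandatory", "*": "optional"}


def status_by_scheme(marks):
    """Map each scheme name to the status of one element in that scheme."""
    return {scheme: LEGEND[mark] for scheme, mark in zip(SCHEMES, marks)}


# Elements that are mandatory in all six formats.
mandatory_everywhere = [
    element for element, marks in FULL_ROWS.items() if set(marks) == {"+"}
]
print(mandatory_everywhere)
print(status_by_scheme(FULL_ROWS["Publisher"])["Semantic Header"])
```

Rows with fewer than six marks are left out deliberately: without the original column alignment, the format to which each mark belongs cannot be determined from the mark sequence alone.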
Examining these metadata formats in terms of granularity of information provided and complexity of the syntax and structure, we can see a spectrum of richness. Some scholars have developed typologies based on this richness to classify metadata formats. For example, in Dempsey and Heery's (1997) three-band classification scheme, Dublin Core belongs to band 2 and TEI belongs to band 3 of the spectrum. However, some classifications may be controversial because of the flexibility of the requirements involved. Just as we have different levels of cataloging in library science, we also have different levels of completeness in metadata schemes. Due to the repeatability and extensibility of different fields of the metadata elements (attributes), the degree of richness can vary dramatically within the same format. In addition, the dichotomy of "mandatory" and "optional" may be too simplistic (e.g., TEI employs a three-level specification: required, recommended, and optional).
Sometimes it may also be controversial to classify metadata formats in terms of origination (i.e., whether the scheme originated from library science or computer science), because there are always ambiguous cases and current efforts demonstrate collaboration between the two groups. For example, some may consider Dublin Core to have originated from library science because of the principal involvement of one of the largest library consortiums - OCLC - in the effort, and the fact that the first workshop followed joint meetings with the American Library Association. However, Dublin Core workshops have also been organized by the National Center for Supercomputing Applications. On the other hand, TEI may look like a computer science oriented approach because of the structure of its document type definition (DTD) and its system declarations, but the content of the TEI independent header is strongly influenced by library cataloging practices.
We understand that there may be more than one origin and more than one source of influence in the formulation of these metadata formats. A genealogy of knowledge may be required to depict the intellectual history of the merging and diverging of the ideas behind metadata elements. We also understand that the "richness" of a scheme may be affected in practice by repeatable and expandable elements. Therefore, the comparison here is only intended to illustrate the different emphases of the different formats at this moment in time, not to "anchor" or "fix" the origin, development or application of any scheme.
We begin our comparison with USMARC (U.S. Machine Readable Cataloging Format), which we consider a representative of the library science oriented approach. MARC originated in the 1960s as a communication format for exchanging bibliographic data. As the computerization of library systems proliferated through the seventies, eighties, and into the nineties, MARC became the de facto standard for the communication of library data of all kinds. Recently, experiments that apply the USMARC format to cataloging Internet resources have been conducted in the US and discussed internationally. Dempsey and Heery (1997) place USMARC in band 3, the richest band of the spectrum, but comparison of it with the other metadata schemes in Table 1 may lead to confusion. There are many elements not found in USMARC. Even though USMARC has many more fields and sub-fields (i.e., is "richer") than Semantic Header and IAFA templates (e.g., for the metadata element "title" USMARC provides fields 130, 210, 222, 240, 245, 246, 247, 630, 730, and 740 for different forms, variations and types of title of the information object), the latter appear to be able to cover more data functions than USMARC. This apparent inconsistency between Dempsey and Heery's classification and Table 1 actually points to a definitional discrepancy: USMARC is "rich", but not in terms of support for "data access" or "data use" functions. Metadata elements may be roughly divided into two categories: intrinsic (i.e., those that are related to resource identification and discovery) and extrinsic (i.e., those that are related to administration and other non-bibliographic data). There is no prescribed place in USMARC for most of the extrinsic elements (e.g., access restriction, system requirement). Inclusion of such extrinsic elements is limited to the repeatable notes fields in the MARC record (5XX fields, most of which are also fields for bibliographic data).
Notes fields are non-searchable in most current catalog search engines, which further limits their utility for users. USMARC is, however, very rich in intrinsic elements (e.g., different fields for different variations of title and author). The intrinsic elements provide effective means for resource description and identification. When the primary purpose of manipulating metadata of information objects is to locate and collocate those information objects, USMARC is quite appropriate; but when the primary purpose of manipulating metadata is to make the information objects usable for authorized users, one may prefer IAFA templates (designed by the IAFA working group of the Internet Engineering Task Force), or other metadata formats that have explicit and well defined metadata element fields for data "use". It is not surprising that USMARC underplays the usage of the information objects, as this is typical of the library science oriented approach as a whole. Association of the format with the communication protocol clarifies this even further: the notes fields in bibliographic files in USMARC format are not searchable through the Z39.50 protocol, which is specifically designed for bibliographic data transmission.
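The effect of placing extrinsic elements in notes fields can be illustrated with a small sketch. It assumes a catalog that indexes only the title fields named above and leaves 5XX notes fields unindexed; the sample record, its contents, and the function name `fielded_search` are hypothetical, and real catalog systems differ in which fields they index.

```python
# Title-bearing USMARC fields as listed in the discussion above.
TITLE_FIELDS = {"130", "210", "222", "240", "245",
                "246", "247", "630", "730", "740"}

# Hypothetical record, tag -> list of field values. Field 538 is a 5XX
# notes field, used here to hold a system requirement (an extrinsic element).
record = {
    "245": ["Rainfall observations"],
    "538": ["System requirements: graphical web browser"],
}


def fielded_search(rec, term, indexed_tags):
    """Return the tags whose values contain the term, restricted to indexed tags."""
    term = term.lower()
    return sorted(
        tag for tag, values in rec.items()
        if tag in indexed_tags and any(term in value.lower() for value in values)
    )


# The intrinsic element is found through the title index.
print(fielded_search(record, "rainfall", TITLE_FIELDS))
# The extrinsic element is present in the record but not reachable.
print(fielded_search(record, "browser", TITLE_FIELDS))
# It becomes reachable only if the notes field is indexed as well.
print(fielded_search(record, "browser", TITLE_FIELDS | {"538"}))
```

The record holds both kinds of element, but only the one with a dedicated, indexed field is retrievable, which is the sense in which a format can be "rich" without supporting data use.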
All six metadata formats have specific fields for intrinsic elements, and because these fields may be repeated and expanded, intrinsic elements can be comprehensively treated in any of the metadata formats. The consequences of treating intrinsic data in this way, however, are similar to those of treating extrinsic data as it must be treated in USMARC. The reliance on repeatability and extensibility of metadata elements, whether intrinsic or extrinsic, makes those elements difficult to search, recognize and use. Referring to Table 1, one might conclude that some basic elements are common to all formats, while the remaining elements, found in some formats but not in others, increase the "richness" of the format; but, as the discussion above indicates, this interpretation does not necessarily hold. Instead, it is fairer to say that some elements are more suitable for resource description and identification, while others are more suitable for resource access and navigation.
As mentioned above, the primary concern of metadata is to discover appropriate methods for modeling various classes of information objects in the networked and distributed information environment. This emphasis on modeling can be considered a representative trend of current studies on metadata schemes. The orientation of these studies is either technical, theoretical (e.g., Smith et al., 1996; Smith, 1996; Smith et al., 1995) or empirical (e.g., Jul, 1995; Xu, 1996). The justifications for selecting particular metadata categories are articulated at the levels of data type and expected user needs. Data type can be technically specified, and expected user needs can be empirically observed, analyzed and categorized. However, the specification of data type and the categorization of expected user needs are not straightforward and unmediated. They must be articulated through a framework of pre-judgments of the needs of the user community, and built on a set of assumed values hidden in the intellectual tradition of the data-type specifier and categorizer. A re-examination of our pre-judgments of user needs and our intellectual tradition, coupled with an understanding of the pre-judgments of user needs and intellectual tradition of the other disciplines involved, is required in order to promote the fusion of horizons of different approaches to metadata construction.
REFERENCES
Almond, J. (1994). Ideas for information types and metadata attributes. http://www.llnl.gov/liv_comp/metadata/papers/type-almond.ps.
Caplan, P. (1995). You call it corn, we call it syntax-independent metadata for document-like object. The Public-Access Computer Systems Review, 6(4), 19-23. http://www.nlc-bnc.ca/documents/libraries/cataloging/caplan3.txt.
Cutter, C.A. (1904). Rules for a dictionary catalog (4th Ed.). Washington, DC: Government Printing Office.
Dempsey, L., & Heery, R. (1997). A review of metadata: A survey of content resource description formats. http:www.uklon.ac.kr/metadata/DESIRE/overview/rev_ti.html
Dempsey, L., & Weibel, S. L. (1996). The Warwick metadata workshop: A framework for the deployment of resource description. D-Lib Magazine, July/August 1996. http://www.dlib.org/dlib/July96/07Weibel.html.
Desai, B. C. (1995). The semantic header and indexing and searching on the Internet.
Desai, B. C. (1995). Indexing and searching virtual libraries. Paper prepared for CIC forum: America in the age of information. http://www.cs.concordia.ca/~faculty/bcdesai/forum95/forum95-bcd-indexing.html.
Deutsch, P., Emtage, A., Koster, M., & Stumpf, M. (1995). Publishing information on the Internet with anonymous FTP. http://info.webcrawler.com/mak/projects/iafa/iafa.txt
Freeman, G. & York, J. (1991). Client/server architecture promises radical change. CAUSE/EFFECT, 14(1). http://cause-www.colorado.edu/information-resources/ir-library/text/cem9114.txt
Gordano, R. (1994). The documentation of electronic texts using text encoding initiative headers: An introduction. Library Resources and Technical Services, 38(4), 389-401.
Gritton, B. (1994). Metadata comments (email, March 3, 1994). http://www.llnl.gov/liv_comp/metadata/papers/comments-gritton.html.
Hunter, C., & Springmeyer, R. (1994). Using metadata to create library guides for scientific analyses: A position paper on metadata by the intelligent archive project at LLNL. ftp://ftp.clearlake.ibm.com/pub/IEEE-
Jul, E. (1995). Internet cataloging project call for participation: building a catalog for Internet-accessible
Madsen, M. S., Fogg, I., & Ruggles, C. (1994). Metadata systems: Integrative information technologies. Libri, 44 (3), 237-257.
Rao, R., Pedersen, J. O., Hearst, M. A., Mackinlay, J. D., Card, S. K., Masinter, L., Halvorsen, P., & Robertson, G. G. (1995). Rich interaction in the digital library. Communications of the ACM, 38(4), 29-39.
Shelley, E. P. & Johnson, B. D. (1995). Metadata: concept and models. Proceedings of the third national conference on the management of geo-science information and data, organized by the Australian Mineral Foundation, Adelaide, Australia, 18-20 July 1995, pp.1-5.
Sheth, A. P. & Larson, J. A. (1990). Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys, 22, 183-236.
Smith, T. R. (1996). The meta-information environment of digital library. D-Lib Magazine, July/August 1996. http://www.dlib.org/dlib/july96/new/07smith.html.
Smith, T. R., Greffner, S., & Gottsegen, J. (1996). A general framework for the meta-information and catalogs in digital libraries. Alexandria digital library public documents. http://alexandria.sdc.ucsb.edu/public-documents/iee.
Strawman, A. & Bretherton, F. (1994). A reference model for metadata. http://www.llnl.gov/liv_comp/metadata/papers/whitepaper-bretherton.
Strebel, D., Meeson, B. & Frithesen, J. (1994). Metadata standards and concepts for interdisciplinary scientific system - II. Position papers from IEEE Metadata Workshop (May 1994, Washington D.C.). ftp://ftp.clearlake.ibm.com/pub/IEEE-Metadata/Archives-Workshop/Pos_Papers/Donald Strebel.
Taylor, A. G. & Clemson, P. (1996). Access to networked documents: Catalogs? Search Engines? Both? OCLC Internet Cataloging Project Colloquium, position paper. http://www.oclc.org/oclc/man/colloq/taylor.html
Weibel, S., Godby, J., Miller, E. & Daniel, R. (1995). The essential elements of network object description: OCLC/NCSA metadata workshop. http://www.oclc.org:5046/oclc/research/
Xu, A. (1996). Accessing information on the Internet: feasibility study of USMARC format and AACR2. http://www.oclc.org/man/colloq/xu.htm