An Argument for a Semantic Web Based FRBR Union Catalogue

Jillian C. Wallis

UCLA - Dept. of Information Studies
IS 277: User Centered Design
Prof. Phil Agre
June 14th, 2004


Abstract. IFLA's FRBR is a semantic expression of the relationships between items in the library catalog. The web technologies currently being developed by the W3C could be used to implement these expressions. A new layer would need to be developed on top of the MARC XML layer, to aggregate all of the holdings and descriptive data into a new union catalogue. Thus, the FRBR data could then live in this layer and give the library catalog the new functionality required by FRBR.

Introduction. The IFLA (International Federation of Library Associations) final report on Functional Requirements for Bibliographic Records (FRBR) has changed the way the library world perceives the library catalogue and the interaction of records with one another. FRBR describes relations between catalogue items using the concept of bibliographic families, showing both how closely items are related and the precise nature of each relationship. Mapping bibliographic families is only possible if all catalogue records contain the FRBR metadata and can be united into a single catalogue of holdings. The current union catalogues rely on MARC records. At this point in time, the MARC standard is too entrenched to accept these new FRBR specifications without serious renovation and record conversion. A better solution would be to layer the FRBR metadata on top of the existing MARC metadata.

The semantic web is slowly growing in scope and maturity, and its promise lies in its capacity to layer. This ability could be harnessed by the library community to make the relationships described in FRBR explicit. As a step in the right direction, the XML expression of MARC records has already been embraced. This could serve as an extensible layer on which it would be possible to add the more semantic FRBR layer using RDF or one of the other truly semantic XML versions. Using harvesting tools to extract holdings information from the MARC XML records, this FRBR layer could then form a union catalogue that contains all of the FRBR relationships down to the holdings information for each item in the catalogue.

The Semantic Web. At the moment, the internet is ruled by HTML (HyperText Markup Language), a presentation standard that allows text to be "marked up" with instructions for display within a browser window. HTML is platform independent, but some browsers are more liberal with HTML tags than others. For example, a page that has been designed for Internet Explorer may not look the same when displayed in Mozilla. HTML allowed the internet revolution to occur, because of this independence and the ease with which new pages can be added to the web and hyperlinked to other pages and items.

As the web grows larger, just having a presentation standard is no longer sufficient. The sheer volume of information needs some measure of control. This control can come in the form of attached metadata to sufficiently describe web resources, using author, title, subject, and other information, similar to the way books are described. While HTML includes provision for metadata in the header material tags, this is not descriptive enough to satisfy most information communities.

In order to fill this need, the eXtensible Markup Language (XML) has been developed by the W3C (World Wide Web Consortium), which is headed by Tim Berners-Lee, the so-called father of the World Wide Web. XML makes it possible for a community to define its own set of tags that can be used to mark up text, as well as to set up relations between tags using a Schema or DTD (Document Type Definition) to form an ontology that is defined by the community's domain. Thus XML is a context standard, instead of a presentation standard, although it can still control the presentation of text through the use of style-sheets. The community-designed ontology becomes a content standard, as it helps to define what items should be described by each tag. If this is compared to the world of cataloguing, the ontology serves a similar function to the Anglo-American Cataloguing Rules 2 (AACR2), and the tags as defined by XML would be similar to the MAchine-Readable Cataloging (MARC) fields.

Tim Berners-Lee's ultimate vision for XML is the Semantic Web. By layering a logic layer over the top of this new metadata layer, it is possible for the information on the web to have logical or semantic relationships. With logical relationships enumerated it would be possible to build logic engines as well as search engines that could find precisely what the user was querying, thus making the Web accessible once more (Berners-Lee). In this vein, the Resource Description Framework (RDF) was designed to set up this logical framework for semantic relationships. Using the RDF model, it is possible to make assertions using "triples." A triple consists of two nodes, one subject and one object, which are connected with a predicate relationship. In this way it is possible to start forming relationships between concrete items in the world that can be located using URIs (Uniform Resource Identifiers). As these concrete relationships grow in number, more abstract relationships begin to emerge (Fitch).
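
The triple model described above can be illustrated with a minimal sketch in Python. This is not RDF syntax or any particular RDF library; it only mimics the subject, predicate, object structure with plain tuples. All of the URIs and the helper names (Triple, objects, birthplaces_of_creators) are hypothetical and chosen for illustration only.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style assertion: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical placeholder URIs, not terms from any real vocabulary.
HAMLET = "http://example.org/work/hamlet"
SHAKESPEARE = "http://example.org/person/shakespeare"
STRATFORD = "http://example.org/place/stratford-upon-avon"
CREATOR = "http://example.org/terms/creator"
BORN_IN = "http://example.org/terms/bornIn"

# A tiny graph is just a set of assertions.
graph = {
    Triple(HAMLET, CREATOR, SHAKESPEARE),
    Triple(SHAKESPEARE, BORN_IN, STRATFORD),
}


def objects(graph, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return {t.obj for t in graph
            if t.subject == subject and t.predicate == predicate}


def birthplaces_of_creators(graph, work):
    """Chain two concrete assertions: work -> creator -> birthplace."""
    places = set()
    for creator in objects(graph, work, CREATOR):
        places |= objects(graph, creator, BORN_IN)
    return places
```

Chaining the two concrete assertions yields a relationship that was never stated directly (the work and the birthplace of its creator), which is a small instance of more abstract relationships emerging from concrete ones.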

In order to form richer and more strict relationships, other tools that work with RDF and XML are being developed. DAML+OIL is an extension of RDF that is used to define ontologies. Ontologies defined with the DAML+OIL namespace are able to use more predefined triples that are called "primitives." For example setting up a catalogue entry with an author in RDF would not imply that the author has a birth date, but in DAML+OIL, if the author is defined as a person, then the relationship between an author and their birthday is already made (Fitch). This allows for richer relationship construction with less syntax being defined in the ontology.

Another way to define richer relationships is the use of Topic Maps. Topic Maps are also defined in triples, in this case the triples consist of topics, associations, and scopes. Topics themselves are defined by three characteristics: name, occurrences, and roles placed in associations. These associations are always reified, or stand in for real world associations, and can be used in other triples (Coverpages). Hence, Topic Maps tend to address abstract relationships and work down to real world instances, which is the opposite of the RDF model (Fitch).

The ability to set up community defined ontologies that are able to create semantic meaning for web information is a vision, a vision that is slowly resolving. In order to make the vision reality, semantic tools are being developed to assist in the creation and maintenance of XML documents. There are validators to essentially debug XML and there are over-arching ontologies that serve as compilers in different namespaces. XML search tools are slowly coming to fruition, as with Swoogle. Logic engines have been developed for other programming languages, such as ALE and PALE, but not necessarily for these XML semantic description schemes.

MARC XML. For over 30 years, the MARC standard has been used by the library community to hold bibliographic and authority record information. Cataloguing itself is a highly codified practice, with rules to guide the cataloguer through any decision in the process, such as AACR2 and LCSH (Library of Congress Subject Headings). Because the process was already so codified, it was ripe for automation, at least of record creation. Thus, the output of the cataloguing process is a MARC record in one of its formats, depending on the type of record being created.

As a standard, MARC has slowly evolved to accommodate new fields for the description of not only books, but electronic files, music, movies, and other file formats (McCallum). The high level of uniform use of the standard has been key to exchanging record data between institutions and the creation of union catalogues, as well as tools for copy-cataloguing. The standard has been implemented worldwide, with only minor differences, if any, between each country's implementation.

When XML technology began to emerge, task forces were created to determine if this was a direction in which MARC should be moved. The extensibility of XML was an attractive feature, as was the ability to create scripts that could automate the conversion into a new format. Conversion in both directions is lossless, so the integrity of the MARC record format would not be compromised. This conversion would also extend to other XML standards, such as Dublin Core (DC). Additionally, this new web-based version of MARC would be open to web harvesters, such as the Open Archives Initiative (OAI), which could increase visibility for these deep web objects. And so it came to be that MARC 21 was translated into an XML Schema, named MARC XML, which is maintained by the Library of Congress (LoC).

The MARC XML Schema essentially marks up the traditional MARC data fields, in order to compose a database of MARC authority and bibliographic records. Unfortunately more advanced semantic capabilities are unavailable in this layer of the semantic web. So it is not possible to set up abstract relationships and be able to logically deduce anything from the MARC XML records, without another layer of RDF (or DAML+OIL or Topic Maps) pulling information from the MARC XML databases.
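
As a rough illustration of how MARC XML marks up the traditional data fields, the sketch below reads one record with Python's standard library. The record content is invented for the example (the ISBN is a placeholder), and the helper name subfield is hypothetical. The tags used are the usual MARC 21 ones: 020 for ISBN, 100 for main entry personal name, and 245 for title statement.

```python
import xml.etree.ElementTree as ET

# Namespace of the MARC XML ("MARC 21 slim") schema.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# An invented, heavily abbreviated record used only for illustration.
SAMPLE_RECORD = """\
<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="020" ind1=" " ind2=" ">
    <subfield code="a">0000000000</subfield>
  </datafield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Shakespeare, William</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Hamlet</subfield>
  </datafield>
</record>
"""


def subfield(record, tag, code):
    """Return the text of the first matching subfield, or None if absent."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    node = record.find(path, MARC_NS)
    return node.text if node is not None else None


record = ET.fromstring(SAMPLE_RECORD)
title = subfield(record, "245", "a")
author = subfield(record, "100", "a")
isbn = subfield(record, "020", "a")
```

Note that nothing in the record says how this Hamlet relates to any other; the markup only labels fields, which is why a further semantic layer is needed.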

The process of adding new fields to the standard can take years and involves the appointment of task forces to review the possible additions before making a decision. While this slow pace of change has made the standard very stable, it has also made it very resistant to change, especially change of a radical nature. This movement to MARC XML is the largest change to occur in the world of MARC records for some time, and it should be noted that the move to XML has not affected the underlying structure of the MARC record. Unfortunately, if FRBR were implemented as a part of the record, the underlying structure would need to change to accommodate new functionality. As such the FRBR specifications could not just be added to the existing MARC formats or to MARC XML.

Union Catalogues. In order to understand how union catalogues function, their reasons for existing must first be examined. There is a certain librarian mode that dictates the actions of librarians and the principles of library science. "[T]he librarian way of organizing communication is very much oriented towards aggregation of information" (Gradmann). Librarians are concerned with providing access to information; as such there is much emphasis on collecting information for library users and imposing a structure that helps the user find what they are looking for. The ultimate expression of this need to aggregate information is the creation of union catalogues.

Library union catalogues, such as OCLC or RLIN, bring together holdings from many different libraries, to create a list of all items available in libraries, as well as their locations. This is accomplished by having a copy of the local catalogue exist both in each library and at the union catalogue server. The centralized nature of the union catalogue means that even if a library's local server is down, the information can still be found at the union catalogue. Centralization also allows for quicker searching, because the search does not need to be broadcast to a distributed database network (Coyle). This type of catalogue sharing allows copy-cataloguing and inter-library loan or reciprocal borrowing systems to exist because all of the holdings information is available. Alas, these are subscription systems, which can sometimes be out of the price range for smaller libraries.

Union catalogues would not have been able to exist without MARC and a transmission standard that would facilitate the movement of MARC records. When the ARPANET was implemented, data transmission protocols and standards were devised to talk over the network and share data. It was at this point that the library community started work on what would eventually become the "Information Retrieval (Z39.50); Application Service Definition and Protocol Specification, ANSI/NISO Z39.50-1995" (Z39.50) standard. As can be seen in the full title, Z39.50 is recognized by both the National Information Standards Organization (NISO) and the American National Standards Institute (ANSI).

As it was developed, the Z39.50 standard "is a protocol which specifies data structures and interchange rules that allow a client machine (called an 'origin' in the standard) to search databases on a server machine (called a 'target' in the standard) and retrieve records that are identified as a result of such a search" (Lynch). Because this standard did not try to determine the specifics of how the data was stored and only focussed on how information was exchanged as a part of the query process, Z39.50 existed as a standard layer between different types of databases with different specifications.

More recently, a number of libraries have attempted to set up distributed databases using the Z39.50 standard. The entire state of Iowa set up a virtual union catalog in 1997, in order to avoid the overwhelming cost of subscribing to OCLC or RLIN. The participating libraries each had different OPAC vendors, but searching still worked because of the standard's ability to work with different database structures. They found that the standard was able to satisfy their users and maintained all of their normal services (Stark). The UC system also implemented a distributed union catalogue that utilized Z39.50 in order to compare the performance of a distributed and a centralized union catalogue. In the case of the UC libraries the centralized catalog was faster and more reliable, because of its built in redundancy (Coyle).

Z39.50 is similar to XML in a number of different respects. As was mentioned above they can both form distributed databases. Z39.50 also has semantic capabilities, or would if there were not a number of problems in implementation. There is no community consensus for the structure or attributes of information content classes. Without this agreement on an ontology there is no interoperability. There is also some belief that semantic capabilities were out of the standard's scope (Lynch). All the more reason to move toward the semantic web to fully realize IFLA's FRBR.

FRBR. The objective of cataloguing is to make resources available to the user. This objective can be manifest in a number of different ways, but the end result is the same. If searching by title, author, or subject were all that were needed to make resources available, the cataloguer's job would be simple. Unfortunately, there are many other means of access to catalogue records, and not all of these avenues have yet been explored. The composite of all of the MARC record fields tries to capture all of the information about an item, but the "network of potential related editions and translations of works" (Leazer & Smiraglia) is not made explicit by this framework.

Bibliographic families are this network of related editions and translations, as well as revisions, abridged or illustrated editions, parodies, etc. Bringing together items of the same family, say everything pertaining to Shakespeare's Hamlet, allows the user to look for any one of a number of printings of the main work, as well as annotated versions, or analyses and critiques of this work. This method of collocating materials had not been explored in library catalogues until IFLA's final report on the Functional Requirements for Bibliographic Records outlined it as one of the functional requirements. Since then, researchers in the information retrieval arena have been trying to implement the resulting scheme, affectionately known as FRBR (fûr•bûr).

FRBR is broken into three groups of entity relationships. The first group consists of work, expression, manifestation, and item. These function as the four levels of detail in actually showing relationships, with work being the overall bibliographic family, and the item being a specific holding. The second group comprises those responsible for the work, expression, manifestation, or item. These can be either a person or a corporate body, and they must have a role that defines their responsibility. The third entity group can include the entities from the previous two groups, as well as concept, object, event, and place. As such this third group is what the work is about (IFLA). These three groups of entities reflect the traditional descriptive elements that are used to catalogue a work. The group I entities are analogous to title, group II entities are analogous to the statement of responsibility, and group III is the subject. Beyond this point, the resemblance to traditional descriptive cataloguing ceases.

The group I entities form a hierarchical relationship. Starting at the top, a bibliographic family, or "work", is "a distinct intellectual or artistic creation." This work normally lends a title to the bibliographic family that follows. The expression is "the intellectual or artistic realization of a work." An expression could be the print run or reproduction of a work; a new expression could be a translation or revision. For each expression a relationship is defined with the work above it, and those responsible for the expression are made explicit. The manifestation is "the physical embodiment of an expression." In other words the manifestation is the actual print run of a book, or edition, and as such carries the publisher and other edition information. And finally the item is "a single exemplar of a manifestation" (O'Neill). The item is similar to an instance of a book, and reveals the holdings of an item.
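
The group I hierarchy can be sketched as a small set of nested Python data classes. This is only one possible reading of the four levels, not a structure taken from the IFLA report; the attribute names (title, form, publisher, location), the sample publishers and libraries, and the holdings helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    location: str  # the library holding this single exemplar


@dataclass
class Manifestation:
    publisher: str  # edition-level information lives here
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    form: str  # e.g. original text, translation, revision
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    title: str  # stated once for the whole bibliographic family
    expressions: List[Expression] = field(default_factory=list)


def holdings(work):
    """Walk work -> expression -> manifestation -> item and list locations."""
    return [
        item.location
        for expression in work.expressions
        for manifestation in expression.manifestations
        for item in manifestation.items
    ]


# Invented sample data for illustration.
hamlet = Work("Hamlet", [
    Expression("original English text", [
        Manifestation("Example Press", [Item("Library A"), Item("Library B")]),
    ]),
    Expression("German translation", [
        Manifestation("Beispiel Verlag", [Item("Library C")]),
    ]),
])
```

Calling holdings(hamlet) walks the whole family from the work down to the individual exemplars, while the title is recorded only once at the work level.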

There are many different types of relationships that can be expressed using the FRBR model. There are work-to-work relationships, such as successor, adaptation, or criticism. There are expression-to-expression relationships between expressions of the same work, such as translation, or between expressions of different works, such as supplement. There is also the expression's relationship to the work. There are manifestation-to-manifestation relationships, such as reproduction or a whole/part relationship. There is the manifestation-to-item relationship, and finally item-to-item relationships, such as reconfiguration (Beacom).

FRBR sets up semantic relationships between different items within a bibliographic family. The use of a hierarchy also reduces the redundancy of information a user may be presented with. For example, if the user is looking at different manifestations of Hamlet, the fact that all of the items are titled Hamlet, and are by Shakespeare, is implicit. This semantic relationship between members of the same family can also be exploited graphically to generate maps of bibliographic families. These families can then be studied using visual pattern recognition.

A FRBR Union catalogue. Even from this brief description of the FRBR hierarchy and relationships, it is possible to start forming triples of two nodes that are described by an association. Because of the inherent semantic structure of FRBR, RDF, DAML+OIL, or Topic Maps could all be used to set up a FRBR ontology with different classes that describe the possible relationships between items, as well as the hierarchy between the work, expression, manifestation, and item.

Once all of the MARC records have been converted to MARC XML, harvesting holdings information would be easy. Then unique identifiers for each item could be created using a combination of the ISBN and the location. Given an RDF Schema for FRBR and those who are willing to actually discover and document the possible relationships within and between bibliographic families, it would be possible to automatically combine the holdings information into a FRBR union catalogue. Having a union catalogue is necessary, because only in the presence of large amounts of data can a rich network of relationships be created.
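
The identifier step might look something like the following Python sketch. It assumes the harvester has already produced (ISBN, location) pairs from the MARC XML records. The identifier format, the function names item_id and union_catalogue, and the use of the ISBN as a stand-in key for the manifestation are all assumptions made for illustration, not part of FRBR or MARC.

```python
from collections import defaultdict


def normalize_isbn(isbn):
    """Strip hyphens and surrounding whitespace so variants compare equal."""
    return isbn.replace("-", "").strip()


def item_id(isbn, location):
    """Combine ISBN and holding location into one identifier (hypothetical format)."""
    slug = location.strip().lower().replace(" ", "-")
    return f"urn:example:item:{normalize_isbn(isbn)}:{slug}"


def union_catalogue(harvested):
    """Group item identifiers under their ISBN, used here as a manifestation key."""
    catalogue = defaultdict(set)
    for isbn, location in harvested:
        catalogue[normalize_isbn(isbn)].add(item_id(isbn, location))
    return catalogue


# Invented harvest output; the ISBN is a placeholder value.
harvested = [
    ("0-00-000000-0", "Library A"),
    ("0000000000", "Library B"),
    ("0-00-000000-0", "Library A"),  # duplicate report of the same holding
]
catalogue = union_catalogue(harvested)
```

Using a set per ISBN means a holding reported twice is recorded once. Linking these manifestation-level groups upward to expressions and works would still require the documented relationships that the paragraph above calls for.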

This is not the first paper to proclaim that at least RDF would be a possible means of implementing FRBR (Fitch, Powell). In fact there have already been some attempts at implementing FRBR as a part of a library catalog system. VTLS, an Integrated Library System (ILS) vendor, has examples of FRBR records. VTLS provides an open source ILS called Virtua, that is based in XML, their value added being the installation and support staff. VTLS has also created an RDF implementation of FRBR that is able to interact with the XML based catalog.

Other projects that are implementing FRBR include AusLIT: Australian Literature Gateway, which is using a combination of RDF, DAML+OIL, and Topic Maps to create new methods of resource description (Fitch). VisualCat, a Danish cataloguing client, uses XML and RDF to manage traditional cataloguing structures as well as FRBR (Beacom). One of the world's largest union catalogues, OCLC, has been performing research on "FRBR-izing" its holdings. OCLC has managed to create a downloadable FRBR Work-Set Algorithm to convert MARC 21 record databases into FRBR catalogues, and has also developed a FRBR tool called FictionFinder, which can be used on its fiction collection.

Conclusion. A FRBR union catalogue would not be a replacement for the existing MARC based catalogs, union or otherwise. The FRBR model acts as an enhancement to the existing means of access to resources within the catalogue. FRBR acts to create new meaning with the same information by highlighting implicit semantic relationships. It would only be possible to create such a union catalogue with a set of semantic tools, and as such the semantic web presents the perfect opportunity for FRBR to finally be implemented. RDF could serve as a semantic layer that is expressive enough to make the FRBR relationships explicit and can also be layered over MARC XML, which is a rich data resource.

Works Cited.

* only available to those with a subscription