A Proposal for a Global Digital Format Registry
Harvard University Library
September 29, 2005
1 Executive Summary
2 Plan of Work
2.3 Work to date
2.4 Program of new work
2.5 Management oversight
2.6 Schedule and deliverables
Appendix A Participants in the DLF Invitational Workshops
Appendix B Use Cases
Appendix C Provisional Data Model
Appendix D Letters of Support
Appendix E Staff Curriculum Vitae
Appendix F Project Timeline
Appendix G Project Budget
1 Executive Summary
The Global Digital Format Registry (GDFR) will provide sustainable distributed services to store, discover, and deliver representation information about digital formats.
The format of a digital object must be known in order to interpret the information content of that object properly. Without knowledge of its format, a digital object is merely a collection of undifferentiated bits. Thus, format typing is fundamental to the effective use, interchange, and preservation of all digitally-encoded content. In terms of the Open Archival Information System (OAIS) Reference Model, the format typing of a digital object is representation information about that object; that is, it provides “information that maps the Data Object into more meaningful concepts.”¹ However, in order to implement that mapping it is necessary to have complete representation information about the format itself: its syntactic and semantic rules for encoding information into digital form. As noted in the recent NSF-DELOS report, Invest to Save, “Registries of digital formats provide keys to understanding the nature of digital objects, guide the managing of their transition from one state to another, and inform the choice of preservation method for material in specific formats.”² In so doing, format registries fall directly into the scope of the digital preservation research agenda identified in the NSF/Library of Congress workshop report, It’s About Time, playing a key role in enabling and supporting technical architectures and tools “to acquire archival data, prepare data for long-term storage, and manage data over several generations of technology.”³
The wide diversity and rapid pace of adoption and abandonment of digital formats present an ongoing problem for long-term preservation efforts. As noted in the Library of Congress planning report, Preserving Our Digital Heritage: Plan for the National Digital Information Infrastructure Preservation Program (NDIIPP), “Longevity of digital data and the ability to read those data in the future depend upon standards for encoding and describing, but standards change over time.”⁴ The purpose of the GDFR project is to address this concern by providing a sustainable resource for managing format-critical representation information necessary to the preservation function.
The Global Digital Format Registry will provide services for:
• The centrally-organized collection of format representation information
• The distributed storage, discovery, and delivery of that information
On a larger scale, a sustainable GDFR will provide:
• A common mechanism to pool and share scarce technical expertise on a global basis, reducing the necessity for duplicative local effort
• A channel for the widest possible distribution of the fruits of that expertise to all actors engaged in preservation activities
• A process for generating community-wide agreement as to the normative definitions of format syntax and semantics, promoting best practices and effective interchange of digital assets between preservation institutions, programs, and systems
• A foundation for additional value-added services requiring detailed knowledge of digital formats
The NDIIPP initiative defines its evolving infrastructure in terms of two conceptual facets: a digital preservation architecture that provides the technical framework for preservation activities undertaken by a network of partners; and a digital preservation network of actors who collaborate to preserve digital content. The functions provided by GDFR will support preservation infrastructures with regard to both of these facets: in terms of a preservation architecture, the GDFR will be a service provider of critical format representation information necessary for effective preservation collaboration; and in terms of a preservation network, the GDFR will enable the work of the institutional, consortial, and regional actors engaged in the long-term preservation of digital assets of cultural, scientific, and economic significance.
This proposal lays out a two-year plan of work leading to the operational deployment of the Global Digital Format Registry populated with representation information for a significant number of digital formats in most common contemporary use. The project will incorporate the widest possible consultation with international stakeholders to achieve community-wide consensus and instill levels of trust and ownership necessary to the long-term sustainability of the GDFR.
1 ISO 14721:2003, Space data and information transfer systems – Open archival information system – Reference model, February 24, 2003. Previously available as CCSDS 650.0-B-1: Reference Model for an Open Archival Information System (OAIS), Blue Book, Issue 1, January 2002.
2 National Science Foundation/DELOS, Invest to Save: Report and Recommendations of the NSF-DELOS Working Group on Digital Archiving and Preservation, 2003 <http://delos-noe.iei.pi.cnr.it/activities/internationalforum/
3 National Science Foundation/Library of Congress, It’s About Time: Research Challenges in Digital Archiving and Long-term Preservation, August 2003 <http://www.digitalpreservation.gov/repor/NSF_LC_Final_Report.pdf>.
4 Library of Congress, Preserving Our Digital Heritage: Plan for the National Digital Information Infrastructure Preservation Program, October 2002 <http://www.digitalpreservation.gov/repor/ndiipp_plan.pdf>.
2 Plan of Work
Technical documentation about digital formats will necessarily be a core part of any preservation program. In the absence of a generally accessible, reliable, and persistent registry of such data, each individual preservation program will need to collect and maintain its own documentation. Not only is this wasteful in terms of large-scale duplication of effort, but it would also require each program to have access to highly sophisticated staff with the skills to document each format the program ingests. Such expertise is scarce and expensive and many programs will likely be unable to support the activity at an appropriate level.
The Global Digital Format Registry therefore represents a highly fruitful area for shared infrastructure upon which to build a distributed program of digital preservation. The GDFR will save individual programs time and effort and will provide the entire community with access to expertise that would otherwise be unavailable to most institutional programs. It will provide a means for more preservation programs to have more sophisticated information about more formats and thus will contribute significantly to developing a robust preservation services environment distributed across institutions.
The extant Internet Assigned Numbers Authority (IANA) MIME type registry is insufficient for many preservation purposes.¹ It does not mandate disclosure of technical information; it does not define a consistent set of technical properties; it is not amenable to automated discovery and delivery; and it defines format typing at a very coarse level of granularity. The Global Digital Format Registry will provide a standardized set of representation information, accessible through human and machine interfaces, about formats defined at arbitrary granularity. While the GDFR cannot "mandate" disclosure by the owners of proprietary formats, it is hoped that the GDFR will be able to engage those owners on behalf of the international digital preservation community in an effort to build sufficient levels of trust so that the appropriate format descriptions will be deposited in the GDFR.
The most significant architectural aspect of the GDFR is its distributed nature. The aim of the project is not to build a single centralized registry, but rather to define a common network protocol by which multiple independent, but cooperating, registries can communicate with each other and synchronize their holdings of format representation information. Such a scheme for redundant decentralized services is an important factor contributing towards the robustness of the GDFR, by decoupling its global long-term sustainability from the effects of local short-term policy making. The selection of an appropriate network protocol for data propagation is critical. The desirable characteristics of such a protocol include efficiency, fault-tolerance, automated operation, and ease of implementation. The protocol selection process, along with other major registry design and implementation decisions, will occupy the first six months of the GDFR project. Project staff will pay particular attention to the results of other distributed registry efforts, such as the OCKHAM Digital Library Services Registry, which uses the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) for data propagation.²
Each cooperating registry is a node in the GDFR network (see Figure 1). The synchronization services of the GDFR protocol ensure that data is automatically propagated across the entire network, using tiered delegation for efficiency. Under tiered delegation, any given node in the network generally needs to be cognizant only of its parent node, from which it receives information, and its immediate children, to whom it sends information. Failover procedures ensure that a node can receive information from alternative parents should the primary parent node become unavailable. The registration of child nodes is a matter of local policy and practice.
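The tiered delegation and failover behavior described above can be illustrated with a minimal sketch. It is not the GDFR protocol, which this proposal leaves to be selected; the `Node` class, its method names, and the push-style delivery are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """One registry in a hypothetical GDFR-style tree (illustrative only)."""
    name: str
    parents: list = field(default_factory=list)    # primary parent first, then alternates
    children: list = field(default_factory=list)   # immediate children only
    holdings: dict = field(default_factory=dict)   # format identifier -> information
    available: bool = True

    def add_child(self, child):
        """Register a child node; the child records this node as a parent."""
        self.children.append(child)
        child.parents.append(self)

    def active_parent(self):
        """Return the first reachable parent: the primary, or an alternate on failover."""
        return next((p for p in self.parents if p.available), None)

    def publish(self, format_id, info):
        """Store a record locally, then send it one tier down."""
        if not self.available:
            return
        self.holdings[format_id] = info
        for child in self.children:
            child.receive(self, format_id, info)

    def receive(self, sender, format_id, info):
        """Accept data only from the currently active parent, then pass it on."""
        if self.available and sender is self.active_parent():
            self.publish(format_id, info)


root, a, b, leaf = Node("root"), Node("a"), Node("b"), Node("leaf")
root.add_child(a)
root.add_child(b)
a.add_child(leaf)   # a is the leaf's primary parent
b.add_child(leaf)   # b is an alternate parent used on failover
root.publish("example:fmt/1", "first record")
a.available = False
root.publish("example:fmt/2", "second record")   # reaches the leaf through b
```

Each node here knows only its parents and its immediate children, so no node needs a view of the whole network.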
As evidenced by the accompanying letters of support (see Appendix D), a number of important institutional actors in the international digital preservation community have already indicated their intent to participate as contributors to the GDFR. At the discretion of the contributor, new format representation information can be introduced by any node in the GDFR network in one of two modes: vetted or non-vetted. Non-vetted information is immediately propagated through the network without further technical review; its credibility is based solely on the reputation of the submitting agent. Vetted representation information is subject to an editorial process to ensure its technical veracity prior to being propagated. (The relationship between vetted and non-vetted representation information with respect to technical review is thus similar to that between the IETF and vendor/personal trees of the IANA MIME type registry.) As a result of the technical review process, preservation programs can freely make use of vetted information with a high degree of confidence as to its technical completeness, authenticity, and reliability; non-vetted information should be approached with greater caution. However, the non-vetted avenue does permit the quicker dissemination of format information. It is also useful for defining local format profiles for which a centralized evaluation is not necessary or is not practical.
The editorial review process will enlist the participation of international experts in a manner similar to the Internet Engineering Task Force (IETF) Internet Standards process.³ Newly submitted representation information will be placed under the scrutiny of both a public review by interested stakeholders and a private review by recruited experts functioning as GDFR technical editors. As with the GDFR protocol, the specific communication mechanism utilized for this review process will be determined during the initial phase of the project. Since this process requires focused communication between human agents, rather than global communication between automated systems, it will probably employ a protocol different from that used for the propagation of data between GDFR nodes.
Figure 1. GDFR architecture (diagram labels: Submissions for technical vetting; Editorial process; Vetted for propagation; Root GDFR node; Data propagation; GDFR protocol; GDFR node)
The GDFR will coordinate the collection by participating registries and contributors of format specification documents in electronic and print form. Since IPR concerns would probably preclude their legal redistribution by the GDFR, the data model will include provision for the storage of these documents both internal and external to the GDFR network itself.
One node in the network is designated as the root node, with administrative responsibility to coordinate registration of new top-level nodes (the immediate children of the root node), manage the global GDFR namespace, and formally release vetted representation information for propagation. (For vetted information, the relationship between the editorial process and the root node is thus similar to that between the Internet Engineering Steering Group (IESG) and IANA in the current MIME type registration process: IANA does not update the MIME registry until directed to do so by a decision of the IESG; similarly, the root node in the GDFR network will not release representation information for propagation until directed to do so by a decision of the editorial review board.⁴)
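The two submission modes and the root node's release rule can be sketched as a small gate. This is an illustration under assumptions, not a design from the proposal: the names `Mode`, `Submission`, and `ReleaseGate` are hypothetical, and routing non-vetted submissions through the same object is a simplification (the text allows any node to introduce them).

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    VETTED = "vetted"
    NON_VETTED = "non-vetted"


@dataclass
class Submission:
    format_id: str
    info: str
    mode: Mode
    approved: bool = False   # set only by an editorial decision


@dataclass
class ReleaseGate:
    """Hypothetical stand-in for the release step performed at the root node."""
    released: dict = field(default_factory=dict)   # eligible for propagation
    pending: list = field(default_factory=list)    # awaiting editorial decision

    def submit(self, sub):
        if sub.mode is Mode.NON_VETTED:
            self.released[sub.format_id] = sub     # propagated without review
        else:
            self.pending.append(sub)               # held until the board decides

    def record_decision(self, format_id, approve):
        """Apply an editorial decision; only approved items are released."""
        for sub in list(self.pending):
            if sub.format_id == format_id:
                self.pending.remove(sub)
                if approve:
                    sub.approved = True
                    self.released[format_id] = sub


gate = ReleaseGate()
gate.submit(Submission("example:local-profile", "local profile", Mode.NON_VETTED))
gate.submit(Submission("example:tiff", "TIFF description", Mode.VETTED))
```

A consumer could then treat `approved` as the marker of vetted information and apply more caution to everything else.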
One of the project software deliverables is a reference implementation of a fully-functioning GDFR node, including a data store, service layer, and inter-nodal protocol handler. Note, however, that an institution wishing to participate in the GDFR network is not required to use this reference implementation. Compliance with GDFR standards applies at the level of the inter-nodal protocol, its underlying abstract data and service models, and namespace rules; any system that correctly implements the protocol and conforms to GDFR standards and practices can participate as a node in the GDFR network.
1 Internet Assigned Numbers Authority, MIME Media Types (August 23, 2005) <http://www.iana.org/assignments/
2 OCKHAM Initiative, The OCKHAM Registry Service (May 14, 2005) <http://wiki.osuosl.org/display/OCKPub/RegistryService>; Open Archives Initiative, The Open Archives Initiative Protocol for Metadata Harvesting, Version 2.0, October 12, 2004 <http://www.openarchives.org/OAI/openarchivesprotocol.html>.
3 S. Bradner, The Internet Standards Process – Version 3, RFC 2026, BCP 9, October 1996 <http://www.ietf.org/
4 N. Freed, J. Klensin, and J. Postel, Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures, RFC 2048, BCP 13, November 1996 <http://www.ietf.org/rfc/rfc2048.txt>.
2.3 Work to date
Recognizing the importance of the issue of format registries, the Digital Library Federation (DLF) sponsored two invitational meetings of international policy makers and technical experts in 2003. (See Appendix A for a list of the participants.) The results of these meetings and subsequent work by the ad hoc working group include:
• A clear rationale for a Global Digital Format Registry
The working group affirmed the principles stated in a paper on the GDFR presented at the 2003 IFLA conference: "Proper interpretation of otherwise opaque content streams is dependent upon knowledge of how typed digital content is represented. For purposes of long-term preservation of digital objects, this knowledge of representation formats must be sustainable over archival time-spans. Additionally, effective interchange of digital objects between repositories and other consuming agents requires mutual agreement on format syntax and semantics. In order to facilitate the complementary goals of archival preservation and interoperability, what is needed is a sustainable public registry for the authority control of identifiers of digital representation formats. Such a registry will provide an unambiguous and persistent association between an identifier for a format and a set of important syntactic and semantic information about that format, which can be recovered now or in the future in order to facilitate the operation of digital repositories that make use of that format."¹
• A series of use cases illustrating the integration of the GDFR into repository operations
Over 30 use cases were contributed by the institutional participants. These cases identify specific applications of the GDFR that occur within most functional components of the OAIS reference model for preservation repositories.
As an example, an institutional repository cannot expect to receive detailed technical metadata about objects submitted for deposit; often not even the MIME type is known definitively. However, the repository can interrogate the GDFR to retrieve the salient technical characteristics of various formats or to determine one or more software tools that are capable of identifying the formats of digital objects. Using this information or these tools, the formats of the objects can be recovered. Once the formats are known, the GDFR can be interrogated to determine software tools capable of extracting the technical characteristics from the objects.
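As a rough illustration of the identification step in this use case, the sketch below matches the leading bytes of an object against internal signatures. The signature table is a hard-coded stand-in for information a repository would retrieve from the registry; the `SIGNATURES` table and `identify` function are hypothetical names, and real identification tools check far more than a prefix.

```python
# Well-known leading byte sequences ("magic numbers") for a few formats.
# In the use case above these would be retrieved from the registry;
# here they are hard-coded for illustration.
SIGNATURES = {
    "GIF": [b"GIF87a", b"GIF89a"],
    "JPEG": [b"\xff\xd8\xff"],
    "PDF": [b"%PDF-"],
    "TIFF": [b"II*\x00", b"MM\x00*"],   # little-endian and big-endian variants
}


def identify(data: bytes) -> list:
    """Return the names of all formats whose signature prefixes the data."""
    return [
        name
        for name, signatures in SIGNATURES.items()
        if any(data.startswith(sig) for sig in signatures)
    ]


candidates = identify(b"%PDF-1.4\n...")
```

An empty result means none of the known signatures matched, in which case the repository would fall back on other evidence such as the file extension.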
Once the objects have been deposited in the repository it is necessary to perform periodic monitoring for the risk of obsolescence. One important measure of obsolescence is the availability of tools for rendering digital objects. The repository can construct a list of all of the formats used by the objects under its managed storage and periodically interrogate the GDFR to determine the availability of rendering tools for those formats. The decline over time in the number of viable tools is a prime indicator of incipient obsolescence.
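The monitoring loop in this use case might reduce to something like the following sketch. It assumes the repository has recorded, for each format it holds, the number of rendering tools reported at each periodic check; the `at_risk` function, the `min_tools` threshold, and the sample identifiers are hypothetical.

```python
def at_risk(history, min_tools=2):
    """Flag formats whose rendering-tool support is low or shrinking.

    history maps a format identifier to the tool counts observed at
    successive checks, oldest first.
    """
    flagged = []
    for format_id, counts in history.items():
        if not counts:
            continue   # no observations yet, nothing to judge
        declining = len(counts) >= 2 and counts[-1] < counts[0]
        if counts[-1] < min_tools or declining:
            flagged.append(format_id)
    return sorted(flagged)


observations = {
    "example:a": [5, 5, 6],   # stable or growing support
    "example:b": [4, 3, 3],   # fewer tools than at the first check
    "example:c": [1, 1],      # below the minimum
}
flagged = at_risk(observations)
```

Comparing the latest count with the first one is the simplest possible trend test; a production system would likely weigh tool viability as well as raw counts.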
Suppose the repository uses a preservation strategy of migration. For a given at-risk format, the GDFR can be interrogated to discover other formats capable of representing the same abstract content with an acceptable level of information loss. Then, the GDFR can be interrogated to determine services or systems that can perform transformations from the at-risk source format to a target format deemed to have continuing viability. Note that there may not necessarily be a direct transformation between the desired source and target formats. However, due to the transitive nature of migration operations, the GDFR can be used to discover more complicated transformative sequences from format S to format T by way of a number of intermediate formats, e.g., $S \to I_1 \to I_2 \to \dots \to I_n \to T$.
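Discovering such a sequence is a path search over a graph whose edges are the available transformations. The sketch below uses breadth-first search, which returns a sequence with the fewest steps; it assumes the transformation data has already been retrieved from the registry into a plain dictionary, and the `migration_path` function is a hypothetical name. Fewest steps is not the same as least information loss, which would need weighted edges.

```python
from collections import deque


def migration_path(transforms, source, target):
    """Return a shortest list of formats leading from source to target, or None.

    transforms maps a format to the formats it can be converted to directly.
    """
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for next_format in transforms.get(path[-1], ()):
            if next_format not in seen:
                seen.add(next_format)
                queue.append(path + [next_format])
    return None   # no sequence of transformations reaches the target


transforms = {
    "S": ["I1"],
    "I1": ["I2", "X"],
    "I2": ["T"],
    "X": [],
}
path = migration_path(transforms, "S", "T")
```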
Additional use cases developed by the DLF-sponsored ad hoc working group are provided in Appendix B.
• Provisional data and service models
As defined by the working group, the GDFR will provide a persistent and unambiguous means of identifying a registered format and binding that identity to significant descriptive, administrative, and technical information about the format, including:
• Canonical and variant format names
• MIME type
• Nominal file extension(s) and other customary external signatures
• Unique internal signature (e.g., "magic number")
• Format author, IPR holder, and maintenance agency
• Authoritative specification document(s)
• Ontological classification
• Relationships to other formats (e.g., subtype-of, new-version-of, can-be-
• Links to systems, services, and tools that support the format as an input or output
The details of the provisional data model are provided in Appendix C.
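To make the listed properties concrete, here is one way a single registry record could be shaped. This is only a sketch: the provisional data model is the one in Appendix C, and the `FormatRecord` class, its field names, and the `example:` identifiers are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FormatRecord:
    """Hypothetical record binding an identifier to format information."""
    identifier: str                                   # persistent, unambiguous identifier
    canonical_name: str
    variant_names: Tuple[str, ...] = ()
    mime_type: Optional[str] = None
    extensions: Tuple[str, ...] = ()                  # customary external signatures
    internal_signatures: Tuple[bytes, ...] = ()       # "magic numbers"
    author: Optional[str] = None
    ipr_holder: Optional[str] = None
    maintenance_agency: Optional[str] = None
    specifications: Tuple[str, ...] = ()              # references to specification documents
    classification: Tuple[str, ...] = ()              # ontological classification terms
    relationships: Tuple[Tuple[str, str], ...] = ()   # (relation, other identifier)
    tools: Tuple[str, ...] = ()                       # systems, services, and tools

    def related(self, relation: str) -> Tuple[str, ...]:
        """Identifiers of formats linked to this one by the given relation."""
        return tuple(other for rel, other in self.relationships if rel == relation)


gif89a = FormatRecord(
    identifier="example:gif/89a",          # illustrative identifier, not a real GDFR name
    canonical_name="GIF 89a",
    variant_names=("Graphics Interchange Format",),
    mime_type="image/gif",
    extensions=(".gif",),
    internal_signatures=(b"GIF89a",),
    relationships=(("new-version-of", "example:gif/87a"),),
)
```

Making the record immutable reflects the idea that a registered identifier stays bound to its description; updates would produce a new record with its own provenance.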
Agreement as to the authoritativeness of specification documents would emerge through consensus during the technical vetting process. The GDFR data model will include provision for tagging specifications with an indication of their authenticity and technical reliability.
Any repository holding an instance of a digital object in a registered format can simply record the GDFR format identifier to satisfy many of the requirements for documenting the format of objects to be preserved.
The service model developed by the working group includes two categories:
• Management services, for administrative operations of the registry itself
o Maintenance – For addition, update, and deletion of format representation information
o Approval – For optional review and approval of format information under the
o Synchronization – For the propagation of data across the GDFR network
• Access services, for public discovery and delivery of format representation information
o Description – For retrieval of format representation information by public identifier or identifying characteristics
o Notification – For subscription-based notification of important format-related
o Export – For bulk delivery of registry content
o Introspection – For discovery of the capabilities, policies, and coverage of
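A few of these services can be pictured as methods on a single in-memory object. The sketch below covers only Maintenance, Description, Notification, and Export, and every name in it (`InMemoryRegistry`, `put`, `find`, and so on) is hypothetical; the actual service interfaces are to be settled during the design phase.

```python
import json


class InMemoryRegistry:
    """Toy registry illustrating a subset of the service categories."""

    def __init__(self):
        self._records = {}       # identifier -> dict of properties
        self._subscribers = []   # callables taking (event, identifier)

    def _notify(self, event, identifier):
        for callback in self._subscribers:
            callback(event, identifier)

    # Maintenance: addition, update, and deletion
    def put(self, identifier, properties):
        event = "updated" if identifier in self._records else "added"
        self._records[identifier] = dict(properties)
        self._notify(event, identifier)

    def delete(self, identifier):
        if self._records.pop(identifier, None) is not None:
            self._notify("deleted", identifier)

    # Description: retrieval by identifier or by identifying characteristics
    def describe(self, identifier):
        return self._records.get(identifier)

    def find(self, **characteristics):
        return sorted(
            identifier
            for identifier, properties in self._records.items()
            if all(properties.get(k) == v for k, v in characteristics.items())
        )

    # Notification: subscription-based callbacks
    def subscribe(self, callback):
        self._subscribers.append(callback)

    # Export: bulk delivery of registry content
    def export(self):
        return json.dumps(self._records, sort_keys=True)


events = []
registry = InMemoryRegistry()
registry.subscribe(lambda event, identifier: events.append((event, identifier)))
registry.put("example:gif", {"mime_type": "image/gif"})
registry.put("example:tiff", {"mime_type": "image/tiff"})
registry.delete("example:tiff")
```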
Subsequent to these activities the concept of the GDFR has been presented at a number of international forums²,³,⁴,⁵ to universal approval. A “proof-of-concept” registry based on the provisional DLF data and service models is operated as a testbed for experimentation by Dr. John Mark Ockerbloom at the University of Pennsylvania as part of the TOM research project. (More information about the GDFR prototype can be found at <http://tom.library.upenn.edu/fred/>. This testbed is useful for illustrating the types of representation information captured in the provisional data model. However, it was never intended to be used directly as the basis for the production GDFR. As stated in the accompanying letter of support from the University of Pennsylvania, TOM is seen as being complementary to the GDFR; the University of Pennsylvania intends to collaborate with the GDFR project as a contributor and, potentially, a participating node in the network.)
The UK National Archives (TNA) has been a significant contributor to the working group. The PRONOM system developed by TNA is an important nascent effort at collecting format representation information. The PRONOM data model is currently being enhanced to bring it into consistency with the provisional GDFR data model for purposes of interoperability. Note that PRONOM is a centralized system; the main contribution of the GDFR project will be the development of a network protocol by which conforming registries, such as PRONOM, can interoperate with each other, facilitating a decentralized process of capturing format representation information and the widest possible dissemination of that information to the digital preservation community. TNA has indicated that it remains committed to the GDFR concept and anticipates that PRONOM would become part of the GDFR network.
1 Stephen L. Abrams and David Seaman, “Towards a Global Digital Format Registry,” World Library and Information Congress: 69th IFLA General Conference and Council, Berlin, August 1-9, 2003
2 Stephen L. Abrams and David Seaman, “Global Digital Format Registry,” IS&T Archiving Conference, San Antonio, April 20-23, 2004 <http://www.imaging.org/store/epub.cfm?abstrid=30295>.
3 Stephen L. Abrams, “The Role of Format Registries in Digital Preservation,” International Conference on Archiving Web Resources, National Library of Australia, Canberra, November 9-11, 2004
4 Stephen L. Abrams, “Establishing a Global Digital Format Registry,” Library Trends 54.1 (Summer 2005, to
5 Stephen L. Abrams, “Digital Formats and Preservation,” International Conference on Preservation of Digital Objects, Göttingen, Germany, September 15-16, 2005 <http://rdd.sub.uni-goettingen.de/conferences/ipres/download/Digital%20Formats%20And%20Preservation%20-%20Stephen%20Abrams.pdf>.
2.4 Program of new work
The foundational work of the project will be the design and development of an open source reference implementation of a GDFR node. Conceptually, the node will encompass a data store implementing the registry data model to hold format representation information, a business logic layer implementing the registry service model to perform appropriate operations on the stored data, and a presentation layer to provide views of stored data to external agents. The data model elements will fall into four categories:
• General descriptive properties, including canonical and alias identifiers for formats
• Characterization properties, detailing the syntactic and semantic properties for formats
• Processing properties, describing systems and services for which registered formats are inputs or outputs
• Administrative properties, capturing important events in a registration’s provenance
The design of the data model will draw on the existing provisional model developed by the DLF-sponsored ad hoc working group (see Appendix C), which was strongly informed by the OAIS concept of representation information and the OCLC/RLG whitepaper on preservation metadata, itself drawn from a review of preservation projects undertaken by CEDARS (CURL Exemplars in Digital Archives), NEDLIB (Networked European Deposit Library), National Library of Australia (NLA), Online Computer Library Center (OCLC), and the Research Libraries Group (RLG), and the follow-on work described in the recently released PREMIS report.¹,²,³ The data model design will be refined through a process of extensive consultation with interested stakeholders during the initial phase of the project. The model will also draw suggestions for useful administrative properties from the ISO/IEC 11179 and OASIS/ebXML standards.⁴,⁵ One key early decision in the design process will be the selection of the appropriate technology for the data store, which will be either a relational database management system (RDBMS) or an XML-based system. Regardless, the reference implementation will be developed in a platform-agnostic manner using Java, and any necessary third party software packages (such as those implementing the selected data store technology) will be selected from open source choices.
The initial population of the GDFR will be performed by Harvard staff using extensive format representation information collected as part of the JHOVE project (JSTOR/Harvard Object Validation Environment), and will include the following widely used formats: AIFF, ASCII, GIF, JPEG, JPEG 2000, PDF, TIFF, UTF-8, WAVE, and XML, and their popular variants and profiles.⁶ Many of the institutions providing letters of support for the GDFR project have also stated their intention to become contributors (see Appendix D). As format representation information begins to be contributed, the technical editorial process will be designed and put into place. This process will be designed to establish consensus opinion regarding the authenticity, completeness, and reliability of submitted data. To support such a consensual process, a number of potential collaborative communication models will be investigated, including the use of email distribution lists, newsgroups, RSS feeds, or Wikis.
The GDFR service model defines the actions that can be taken by external agents. These include the contribution of new data to the registry, the public discovery and delivery of that information, and administrative activities necessary to the operation of the registry itself, including the synchronization of the various nodes on the GDFR network. Public discovery will be enabled for both human agents (via web interfaces) and machine agents (via web services interfaces). As was the case with the data model, the final service model will be based on the work done previously by the DLF-sponsored ad hoc working group with additional enhancements suggested during the initial design and consultation phase of the project. Suggestions for administrative services will also be drawn from the ANSI X3.285 and OASIS/ebXML standards.⁷,⁸ The services will be implemented by the business logic layer of the GDFR node.
As the GDFR is explicitly conceived of as a distributed system, the inter-nodal protocol used to synchronize data across the GDFR network is of paramount importance. The redundant replication of data across geographically dispersed nodes will increase the long-term sustainability of that data and the important preservation services making use of that data. A distributed design also reduces the computational and network load on any particular node. The inter-nodal protocol will initially be tested using multiple GDFR instances hosted locally at Harvard. Beyond that, the UK National Archives (TNA) has agreed in principle to participate in the protocol test. At least two additional external testing partners will be solicited from among the institutions providing letters of support for this project (see Appendix D), peer institutions in the Digital Library Federation, and other institutions downloading the open source GDFR reference implementation for evaluation purposes. (Considerable institutional interest in participating in the GDFR development and testing process has been expressed informally in response to the many public presentations made on GDFR over the last two years.)
Once the reference implementation for a GDFR node and the inter-nodal protocol testing are complete, the project will move into a repository integration testing phase. This is primarily intended to test the automated discovery and delivery of format representation information, but will also be used to ensure that the GDFR data and service models are capable of responding to the needs of repository preservation programs. Initially, this testing will be done with Harvard’s Digital Repository Service (DRS), a large-scale production repository in operation for over five years, currently holding more than 3.1 million digital objects (12 TB) contributed by 30 administrative units of the university.⁹ Additional external institutional participants in the repository interoperability testing will be solicited from the same candidate pool used for the protocol testing. Efforts will be made to include institutions deploying a variety of technical infrastructures, including, for example, the DSpace, Fedora, and Greenstone repositories.¹⁰,¹¹,¹²
1 ISO 14721, Space data and information transfer systems – Open archival information system – Reference model, February 24, 2003.
2 OCLC/RLG Working Group on Preservation Metadata, A Metadata Framework to Support the Preservation of Digital Objects, June 2002 <http://www.oclc.org/research/projects/pmwg/pm_framework.pdf>.
3 OCLC/RLG PREMIS Working Group, Data Dictionary for Preservation Metadata: Final Report of the PREMIS Working Group, May 2005 <http://www.oclc.org/research/projects/pmwg/premis-final.pdf>.
4 ISO/IEC 11179-1:2004, Information technology – Metadata registries (MDR) – Part 1: Framework, 2004
5 OASIS, OASIS/ebXML Registry Information Model Version 3.0, May 2, 2005 <http://www.oasisopen.org/
6 Harvard University Library, JHOVE – JSTOR/Harvard Object Validation Environment, May 26, 2005
7 ANSI X3.285, Metamodel for the Management of Shareable Data, February 20, 1999 <http://metadata-
8 OASIS, OASIS/ebXML Registry Services and Protocols Version 3.0, May 2, 2005 <http://www.oasisopen.org/
9 Harvard University Library, Overview: Digital Repository Service (DRS), (August 11, 2004)