A Proposal for a Global Digital Format Registry


    Stephen Abrams

    Dale Flecker

    Harvard University Library

    Cambridge, Massachusetts

    September 29, 2005

1 Executive Summary

2 Plan of Work
    2.1 Rationale
    2.2 Architecture
    2.3 Work to date
    2.4 Program of new work
    2.5 Management oversight
    2.6 Schedule and deliverables
    2.7 Staffing
    2.8 Budget

3 Conclusion

Appendix A Participants in the DLF Invitational Workshops
Appendix B Use Cases
Appendix C Provisional Data Model
Appendix D Letters of Support
Appendix E Staff Curriculum Vitae
Appendix F Project Timeline
Appendix G Project Budget

1 Executive Summary

    The Global Digital Format Registry (GDFR) will provide sustainable distributed services to store, discover, and deliver representation information about digital formats.

    The format of a digital object must be known in order to interpret the information content of that object properly. Without knowledge of its format, a digital object is merely a collection of undifferentiated bits. Thus, format typing is fundamental to the effective use, interchange, and preservation of all digitally-encoded content. In terms of the Open Archival Information System (OAIS) Reference Model, the format typing of a digital object is representation information about that object; that is, it provides information that maps the Data Object into more meaningful concepts.[1] However, in order to implement that mapping it is necessary to have complete representation information about the format itself: its syntactic and semantic rules for encoding information into digital form. As noted in the recent NSF-DELOS report, Invest to Save, “Registries of digital formats provide keys to understanding the nature of digital objects, guide the managing of their transition from one state to another, and inform the choice of preservation method for material in specific formats.”[2] In so doing, format registries fall directly into the scope of the digital preservation research agenda identified in the NSF/Library of Congress workshop report, It’s About Time, playing a key role in enabling and supporting technical architectures and tools “to acquire archival data, prepare data for long-term storage, and manage data over several generations of technology.”[3]

    The wide diversity and rapid pace of adoption and abandonment of digital formats present an ongoing problem for long-term preservation efforts. As noted in the Library of Congress planning report, Preserving Our Digital Heritage: Plan for the National Digital Information Infrastructure Preservation Program (NDIIPP), “Longevity of digital data and the ability to read those data in the future depend upon standards for encoding and describing, but standards change over time.”[4] The purpose of the GDFR project is to address this concern by providing a sustainable resource for managing format-critical representation information necessary to the preservation function.

The Global Digital Format Registry will provide services for:

    • The centrally-organized collection of format representation information

    • The distributed storage, discovery, and delivery of that information

On a larger scale, a sustainable GDFR will provide:

    • A common mechanism to pool and share scarce technical expertise on a global basis, reducing the necessity for duplicative local effort

    • A channel for the widest possible distribution of the fruits of that expertise to all actors engaged in preservation activities

    • A process for generating community-wide agreement as to the normative definitions of format syntax and semantics, promoting best practices and effective interchange of digital assets between preservation institutions, programs, and systems

    • A foundation for additional value-added services requiring detailed knowledge of digital formats

    The NDIIPP initiative defines its evolving infrastructure in terms of two conceptual facets: a digital preservation architecture that provides the technical framework for preservation activities undertaken by a network of partners; and a digital preservation network of actors who collaborate to preserve digital content. The functions provided by GDFR will support preservation infrastructures with regard to both of these facets: in terms of a preservation architecture, the GDFR will be a service provider of critical format representation information necessary for effective preservation collaboration; and in terms of a preservation network, the GDFR will enable the work of the institutional, consortial, and regional actors engaged in the long-term preservation of digital assets of cultural, scientific, and economic significance.

    This proposal lays out a two-year plan of work leading to the operational deployment of the Global Digital Format Registry populated with representation information for a significant number of digital formats in most common contemporary use. The project will incorporate the widest possible consultation with international stakeholders to achieve community-wide consensus and instill levels of trust and ownership necessary to the long-term sustainability of the GDFR.


    [1] ISO 14721:2003, Space data and information transfer systems – Open archival information system – Reference model, February 24, 2003. Previously available as CCSDS 650.0-B-1: Reference Model for an Open Archival Information System (OAIS), Blue Book, Issue 1, January 2002.

    [2] National Science Foundation/DELOS, Invest to Save: Report and Recommendations of the NSF-DELOS Working Group on Digital Archiving and Preservation, 2003 <>.

    [3] National Science Foundation/Library of Congress, It’s About Time: Research Challenges in Digital Archiving and Long-term Preservation, August 2003 <>.

    [4] Library of Congress, Preserving Our Digital Heritage: Plan for the National Digital Information Infrastructure Preservation Program, October 2002 <>.

2 Plan of Work

2.1 Rationale

    Technical documentation about digital formats will necessarily be a core part of any preservation program. In the absence of a generally accessible, reliable, and persistent registry of such data, each individual preservation program will need to collect and maintain its own documentation. Not only is this wasteful in terms of large-scale duplication of effort, but it would also require each program to have access to highly sophisticated staff with the skills to document each format the program ingests. Such expertise is scarce and expensive and many programs will likely be unable to support the activity at an appropriate level.

    The Global Digital Format Registry therefore represents a highly fruitful area for shared infrastructure upon which to build a distributed program of digital preservation. The GDFR will save individual programs time and effort and will provide the entire community with access to expertise that would otherwise be unavailable to most institutional programs. It will provide a means for more preservation programs to have more sophisticated information about more formats and thus will contribute significantly to developing a robust preservation services environment distributed across institutions.

    The extant Internet Assigned Numbers Authority (IANA) MIME type registry is insufficient for many preservation purposes.[1] It does not mandate disclosure of technical information; it does not define a consistent set of technical properties; it is not amenable to automated discovery and delivery; and it defines format typing at a very coarse level of granularity. The Global Digital Format Registry will provide a standardized set of representation information, accessible through human and machine interfaces, about formats defined at arbitrary granularity. While the GDFR cannot “mandate” disclosure by the owners of proprietary formats, it is hoped that the GDFR will be able to engage those owners on behalf of the international digital preservation community in an effort to build sufficient levels of trust so that the appropriate format descriptions will be deposited in the GDFR.
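    The difference in granularity can be illustrated with a small sketch. The record structure, field names, and identifiers below are hypothetical and are not taken from the provisional data model; the only facts relied on are that PDF files of every version share the MIME type application/pdf and that PDF/A-1 is based on PDF 1.4.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FormatRecord:
    """Hypothetical registry entry; field names are illustrative only."""
    identifier: str                # made-up registry identifier
    name: str
    mime_type: str                 # coarse IANA media type
    parent: Optional[str] = None   # identifier of the broader format, if any


# Three entries that share one MIME type but differ at finer granularity.
RECORDS = {
    record.identifier: record
    for record in (
        FormatRecord("pdf", "PDF (any version)", "application/pdf"),
        FormatRecord("pdf/1.4", "PDF 1.4", "application/pdf", parent="pdf"),
        FormatRecord("pdf/a-1", "PDF/A-1", "application/pdf", parent="pdf/1.4"),
    )
}


def by_mime_type(mime_type: str) -> List[FormatRecord]:
    """Return every record that the coarse MIME type cannot tell apart."""
    return [r for r in RECORDS.values() if r.mime_type == mime_type]


def lineage(identifier: str) -> List[str]:
    """Walk from a specific format up to its most general ancestor."""
    chain = []
    current = identifier
    while current is not None:
        chain.append(current)
        current = RECORDS[current].parent
    return chain
```

    A lookup by MIME type returns all three records, whereas a lookup by the finer-grained identifier leads to a single record and its chain of broader formats.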

2.2 Architecture

    The most significant architectural aspect of the GDFR is its distributed nature. The aim of the project is not to build a single centralized registry, but rather to define a common network protocol by which multiple independent, but cooperating, registries can communicate with each other and synchronize their holdings of format representation information. Such a scheme for redundant decentralized services is an important factor contributing towards the robustness of the GDFR, by decoupling its global long-term sustainability from the effects of local short-term policy making. The selection of an appropriate network protocol for data propagation is critical. The desirable characteristics of such a protocol include efficiency, fault-tolerance, automated operation, and ease of implementation. The protocol selection process, along with other major registry design and implementation decisions, will occupy the first six months of the GDFR project. Project staff will pay particular attention to the results of other distributed registry efforts, such as the OCKHAM Digital Library Services Registry, which uses the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) for data propagation.[2]
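    As an illustration of what harvesting-based propagation involves, the following sketch builds an OAI-PMH ListRecords request of the kind a node might issue to another registry. The GDFR protocol has not yet been selected, so this is only an example of one candidate approach; the base URL and the function name are hypothetical, while the verb and argument names (verb, metadataPrefix, from, resumptionToken) come from the OAI-PMH specification.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    from_date: Optional[str] = None,
    resumption_token: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL (no network access)."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument: it is sent
        # with the verb alone to continue a partially delivered list.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            # Selective harvesting: only records changed since this date.
            params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical registry endpoint, used purely for illustration.
print(list_records_url("https://registry.example.org/oai", from_date="2005-09-29"))
```

    Incremental harvesting by date is one way such a protocol could meet the efficiency and automated-operation criteria listed above.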

    Each cooperating registry is a node in the GDFR network (see Figure 1). The synchronization services of the GDFR protocol ensure that data is automatically propagated across the entire network, using tiered delegation for efficiency. Under tiered delegation, any given node in the network generally needs to be cognizant only of its parent node, from which it receives information, and its immediate children, to whom it sends information. Failover procedures ensure that a node can receive information from alternative parents should the primary parent node become unavailable. The registration of child nodes is a matter of local policy and practice.
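    Tiered delegation with failover can be sketched in a few lines. This is a minimal in-memory model under stated assumptions, not the GDFR protocol: the class and method names are hypothetical, each node keeps an ordered list of parents (primary first, then alternates), and synchronization is modeled as a simple pull of missing records.

```python
class Node:
    """Hypothetical GDFR node that knows only its parents and children."""

    def __init__(self, name):
        self.name = name
        self.available = True
        self.parents = []    # primary parent first, then alternates
        self.children = []
        self.holdings = {}   # format identifier -> representation information

    def add_child(self, child):
        """Register a child node (a matter of local policy in the proposal)."""
        self.children.append(child)
        child.parents.append(self)

    def active_parent(self):
        """Return the first available parent, failing over to alternates."""
        for parent in self.parents:
            if parent.available:
                return parent
        return None

    def pull(self):
        """Copy missing records from the active parent, then cascade down."""
        parent = self.active_parent()
        if parent is not None:
            for key, record in parent.holdings.items():
                self.holdings.setdefault(key, record)
        for child in self.children:
            if child.available:
                child.pull()


# Two tiers: the leaf is registered with a primary parent and an alternate.
root, primary, alternate, leaf = (Node(n) for n in ("root", "primary", "alternate", "leaf"))
root.add_child(primary)
root.add_child(alternate)
primary.add_child(leaf)
alternate.add_child(leaf)

root.holdings["example-format"] = "representation information"
primary.available = False   # simulate an outage of the primary parent
root.pull()
print(leaf.active_parent().name, sorted(leaf.holdings))
```

    In this model the leaf still receives the record through its alternate parent while the primary is down, and no node needs to know about any node other than its own parents and children.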

    As evidenced by the accompanying letters of support (see Appendix D), a number of important institutional actors in the international digital preservation community have already indicated their intent to participate as contributors to the GDFR. At the discretion of the contributor, new format representation information can be introduced by any node in the GDFR network in one of two modes: vetted or non-vetted. Non-vetted information is immediately propagated through the network without further technical review; its credibility is based solely on the reputation of the submitting agent. Vetted representation information is subject to an editorial process to ensure its technical veracity prior to being propagated. (The relationship between vetted and non-vetted representation information with respect to technical review is thus similar to that between the IETF and vendor/personal trees of the IANA MIME type registry.) As a result of the technical review process, preservation programs can freely make use of vetted information with a high degree of confidence as to its technical completeness, authenticity, and reliability; non-vetted information should be approached with greater caution. However, the non-vetted avenue does permit the quicker dissemination of format information. It is also useful for defining local format profiles for which a centralized evaluation is not necessary or is not practical.
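    The two submission modes can be modeled as a simple routing decision. The sketch below is illustrative only; the class names, fields, and the single approval step are assumptions and do not describe the actual editorial workflow, which is to be determined during the project.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    VETTED = "vetted"
    NON_VETTED = "non-vetted"


@dataclass
class Submission:
    """Hypothetical submission of format representation information."""
    format_name: str
    submitter: str
    mode: Mode


class SubmissionRouter:
    """Routes submissions either to review or straight to propagation."""

    def __init__(self):
        self.review_queue = []   # awaiting editorial review
        self.propagated = []     # released to the network

    def submit(self, submission):
        if submission.mode is Mode.NON_VETTED:
            # Propagated immediately; credibility rests on the submitter.
            self.propagated.append(submission)
        else:
            # Held back until the editorial process approves it.
            self.review_queue.append(submission)

    def approve(self, submission):
        """Release a reviewed submission for propagation."""
        self.review_queue.remove(submission)
        self.propagated.append(submission)
```

    A consumer of the registry could then use the mode recorded with each entry to decide how much confidence to place in it.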

    The editorial review process will enlist the participation of international experts in a manner similar to the Internet Engineering Task Force (IETF) Internet Standards process.[3] Newly submitted representation information will be placed under the scrutiny of both a public review by interested stakeholders and a private review by recruited experts functioning as GDFR technical editors. As with the GDFR protocol, the specific communication mechanism utilized for this review process will be determined during the initial phase of the project. Since this process requires focused communication between human agents, rather than global communication between automated systems, it will probably employ a protocol different from that used for the propagation of data between GDFR nodes.

    [Figure 1 diagram. Labels: Editorial process; Submissions for technical vetting; Vetted for propagation; Root GDFR node; GDFR node (several); Data propagation; GDFR protocol.]

    Figure 1. GDF