Summary Document for “Science Archives in the 21st Century”
A workshop held at the University of Maryland University College Inn and
Conference Center, on April 25 - 26, 2007
On April 25 - 26, 2007, the NSSDC sponsored a workshop entitled “Science Archives in the 21st Century” at the University of Maryland University College Inn and Conference Center, to facilitate communication and to elicit best practices and outstanding challenges from practicing science data managers. Emphasis was placed on good stewardship of NASA's Heliophysics, Planetary, Astrophysics, and Earth science data, as well as on perspectives from other science archives in the US and internationally.
The agenda included a keynote presentation by Raymond Walker / UCLA and invited talks by Robert Hanisch / Space Telescope Science Institute and Aaron Roberts / NASA, and was structured into sessions on “Long-Term Preservation,” “Archival Policies and Implementation,” “Emerging Archival Standards and Technologies,” “Meeting User Needs,” and “Provider Interactions.” Poster presentations were an integral part of the workshop: presenters introduced their poster topics to all participants in a “Poster Madness” session, and four separate poster sessions were set aside for one-on-one interaction.
Fifty-four people participated, representing 1) US government agencies such as NASA and NOAA, 2) international space agencies such as the European Space Agency, the European Space Astronomy Centre, and the Japan Aerospace Exploration Agency, 3) academic institutions such as Caltech, The Johns Hopkins University - Applied Physics Laboratory, New Mexico State University, San Diego Supercomputer Center, Washington University, University of California Los Angeles, and the University of Maryland, and 4) other institutions such as the Carl Sagan Center, the Center for International Earth Science Information Network, the Centre d'Etude Spatiale des Rayonnements, the Heliospheric Physics Laboratory, the Rutherford Appleton Laboratory, the Smithsonian Astrophysical Observatory, and the Southwest Research Institute.
The Executive Planning Committee for the workshop consisted of: Ed Grayzeck (chair)/NSSDC, Don Sawyer (co-chair)/NSSDC, Ben Kobler (logistics)/NASA GSFC Code 586, Mike A'Hearn/University of Maryland, Jeanne Behnke/EOS, Tom McGlynn/HEASARC, Bob McGuire/SPDF, and Michele Weiss/APL. A complete list of all participants, the agenda, and all presentations is available at: http://nssdc.gsfc.nasa.gov/nost/conf/archive21st/.
II. WORKSHOP OVERVIEW
Ed Grayzeck started off the workshop by reintroducing the three goals of the gathering:
1. To establish the level of commonality of problems and best practices, as seen by the archives, and their interest in continuing to communicate on such matters.
2. To identify broadly based techniques and best practices that address common concerns, and to get these identified and documented in a summary document.
3. To establish more frequent and alternative modes of communication among the archives. This may include the establishment of ad hoc working groups to address particular issues and/or the development of best practices documents.
Ed outlined the response. He highlighted the breadth of the experience of the 54 participants as a benefit to the group. Our challenge was to find, in the five prime topics (long-term preservation, policies and implementation, standards and technology, meeting user needs, and provider interactions), common ground, lessons learned, and future actions. He remarked that the initial invitations had gone out to select diverse participants from earth science, planetary studies, astrophysics, and solar/space physics. The resulting group came as managers and scientists from NASA, sister government agencies, university environments, and international data partners. He further pointed out that the poster sessions would be interleaved with the oral talks so as to get full participation.
After a short introduction of the supporting staff and NSSDC sponsorship, all were invited to introduce themselves, giving a concise background. The official welcome was presented by Joe Bredekamp, NASA headquarters, who gave us the history of the NASA effort to unify the data environment and its evolution along scientific lines.
B. Keynote Presentation
Ray Walker gave the keynote presentation, “The Path Toward Data System Integration.” As a scientist involved in archiving over the past 30 years, he pointed to a persistent dream: a global data environment in which all Earth and space science data are organized in a common way with “one stop shopping” for any data product. He outlined his experience and derived five attainable goals:
1. Help scientists locate data required for a given study.
2. Provide scientists with access to those data.
3. Assure that those data are useable.
4. Preserve the data forever.
5. Aid scientists in using the data.
Of these goals, the fifth is new, and Ray sees archiving as interleaved with data distribution. He cautioned that we need to work with existing standards, to evolve them, perhaps to re-establish the core needs, and to develop an interlingua that permits speaking across the science disciplines. He highlighted two examples from his experience: first, the Planetary Data System with its rich data model and protocols; second, the development of SPASE as a tool to harness the diverse space physics community.
He then identified the following evolving challenges.
• Data are found worldwide.
• Science may require data from multiple sources.
• Missions & instruments are more complex.
• Data volumes are increasing.
• Data complexity is increasing.
During the remainder of the workshop, the participants discussed these challenges and brought out news, especially relating to metadata and establishing data quality levels.
C. Session on “Long-Term Preservation”
The session on Long-Term Preservation started with three perspectives from the astrophysical, social and earth science, and computer science arenas: Bob Hanisch spoke on “Long-Term Preservation of Astronomical Research Results”, Bob Chen spoke on “Government-University Collaboration in Long-Term Archiving of Scientific Data”, and Reagan Moore spoke on “Rule Based Preservation Systems”.
The themes followed on from the keynote: assure data is preserved (>20 yrs), useable, and findable. In modern scientific inquiry, the source of the data is worldwide, and international efforts are needed to streamline interoperability; three such efforts are the IVOA, IPDA, and SPASE. There is a tension between the need to preserve and the need to serve the data. Libraries and universities have a long history of preservation but are usually centralized. More recently, governments and international agencies have taken a role. The archive must decide on its role as preserver in the digital arena and should look at lessons learned by analog archives. Centralized archives in the digital age are evolving and becoming more distributed.
A newer approach that builds on this loose federation is the data grid, which provides a central preservation function through the use of storage resource brokers and supports infrastructure independence. Here preservation is thought of as communicating with the future: future technology will be different from today's technology, and the preserved records need to be migrated onto it. But preservation is also communication from the past. In order to make assertions about authenticity, chain of custody, and integrity, we need to be able to characterize the policies that governed prior management of the records. The management policies and preservation processes comprise representation information about the preservation environment. Preservation requires provision of representation information about both the records and the preservation environment.
With each of the respective archives acting as independent sites, we need guidelines for identifying when an archive is robust, such as the OCLC work and the Trusted Repository Assessment Criteria. In addition, data needs metadata, and there should be quality flags on both. There also needs to be recognition that science data is not normally just text.
In the panel discussions, the provenance issue was raised and was declared very important, i.e., it is best to track the data as it is migrated both in content and format. The question of a centralized archive was debated, and most found the trend was to distribute both the data and the expertise. Most agreed that we need to keep on top of the fixity issue as well as technology for any migration and long-term preservation.
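As an illustration of the fixity issue mentioned above, the following is a minimal sketch of one common approach: record a checksum for every file at ingest and re-verify it after each migration. It was not presented at the workshop. The function names (`sha256_of`, `build_manifest`, `verify_manifest`) and the choice of SHA-256 are assumptions made for this example only.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose content changed.

    Hypothetical helper: an empty list means every recorded file
    still matches the checksum taken at ingest.
    """
    problems = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(target) != expected:
            problems.append(f"changed: {rel}")
    return problems
```

In this sketch the manifest would be built once at ingest and stored apart from the data; running `verify_manifest` against the migrated copy then reports any file whose bits did not survive the move. It checks only bit-level integrity and says nothing about whether the format remains usable.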
D. Session on “Policies and their Implementation”
There were three oral presentations to identify current practices in three science areas within NASA (Heliophysics, Planetary Science and Earth Science). Aaron Roberts spoke on “Archiving in the Data Environment of Heliophysics with NASA”, Reta Beebe spoke on “NASA Planetary Data System: Structure, Mission Interfaces and Distribution”, and Jeanne Behnke spoke on “Evolving a Ten Year Old Data Archive”.
The common theme was the goal of NASA policies for space science: to ensure data sharing. There can be different models for a specific scientific community, but in all cases that community must be involved. The models range from a centralized system that evolves to be more inclusive, through a confederation of curator groups, to a series of operating missions and data repositories that are loosely managed inside NASA.
A few simple lessons were given to the workshop:
1. Involve the archiving group early in the process and interact often through various
means such as an enunciated data policy, formal agreements, or archivists on the
2. Get the data providers to do the archiving in production (right from the start).
3. Provide adequate guidelines on standards, formats, and the end-to-end process.
4. A review of the archiving process and results is essential either by the community
at large or through more organized forums such as “peer reviews”.
5. In the final analysis, the user community of scientists will be the final judge of
how the data is used.
The discussions revolved around questions of implementation and cost savings. All agreed that standards must be customer based and that higher level data was best.
E. Session on “Emerging Archival Standards and Technologies”
In this session, Don Sawyer spoke on “An Overview of Selected ISO Standards Applicable to Digital Archives”, David Giaretta spoke on “Toward an International Standard for Audit and Certification of Digital Repositories”, and Joey Mukherjee spoke on “Usability Issues Facing 21st Century Data Archives”.
There are a number of international standards addressing digital data with particular reference to archives as addressed in “An Overview of Selected ISO Standards Applicable to Digital Archives”. Some are full ISO standards and others are in development. The ones highlighted during this session addressed the following topic areas:
• Reference Model of an Archive and its Information (ISO)
• Checklist of Activities between Data Providers and Archives (ISO)
• Packaging Data and Metadata with an XML Manifest (developing)
• Describing Data and Sending it to an Archive (developing)
• Ensuring Archives can be Trusted to Preserve Information (developing)
All of these are applicable across the science domain and are not specific to any discipline. It can take several years for such standards to become recognized and extensively used. The uptake can also vary greatly across different communities. For example, the OAIS reference model (Reference Model for an Open Archival Information System (OAIS)) has become very widely adopted by all types of organizations. It was the right standard at the right time and it continues to meet the critical need to be able to communicate about archival systems and their information models.
The newest of the above efforts, and potentially one that will have a very wide impact, is the certification of archives. The presentation “Toward an International Standard for Audit and Certification of Digital Repositories” describes the current situation.
Experience has shown that it is difficult to preserve bits over a long time period, and even more difficult to preserve their information content, and thus there is wide interest in identifying criteria by which an archive/repository can be judged. Several efforts have developed documents addressing such criteria, and particularly noteworthy is the TRAC document (“Trustworthy Repositories Audit & Certification: Criteria and Checklist”). However all have been developed by groups with limited participation. The ISO standardization process is taking these documents as input and is open to participation by all. One can obtain these materials and participate by going to
In the presentation “Usability Issues Facing 21st Century Data Archives”, the focus was on making data archives more useful and easier to maintain for providers, users, and management. It is argued that the current archiving reality does not capture enough of the data needed by future scientists and that the quality of what is captured is uneven. Quality processed data should flow from the processing team and eventually get to the long-term archive. What is needed is a better format that meets all these needs, one that is simpler to use, easy to extend, and widely applicable so that it becomes widely adopted. Further, it might already exist or be some combination of the best features of a number of common formats such as HDF, IDFS, FITS, etc. It would need buy-in from visualization tool vendors and from archivists as well as archives.
During the discussion session regarding the emerging ISO standards, it was noted that very small repositories/archives may have difficulty keeping up with such standards.
Some participants had read the TRAC document and reactions were varied. One noted that he would be afraid to show it to his management, while another found it readily useful and applicable. Some leveling of the criteria seemed needed, and it was unclear how the evaluation would actually be done. It was noted that, particularly where there might be competition between archives, these criteria could become important. Also there may eventually be a high level management requirement for certification.
Regarding a new format, or broad adoption of some newly emerging format, the prospects for securing buy-in were a central concern. Will adequate tools ensure buy-in? One comment was that what is needed is better interoperability through mapping of scientific content, not a new format.
The advisability and practicality of holding all data in a single format was questioned, as it may be difficult to ensure adequate data cleanup for higher level products, such as maps. In some cases the low level data needs to be saved because it has critical information, but in other cases it is never requested. Still, it is generally not a problem to save the low level data. The value of storing data that is no longer actively requested in a useful form is clear; a recent example is NSSDC lunar data that was not looked at for many years and is now of interest for future missions.
F. Session on “Meeting User Needs”
This session tried to deal with the new goal presented by Ray Walker, namely that aiding a scientist user these days involves more than simple data access. Four approaches were outlined by the following presenters: Arnold Rots spoke on “Associating Persistent Identifiers between Trustworthy Repositories”, Vincent Genot spoke on “Science Archives Need to Communicate more than Data: the Example of AMDA and CDPP”, Christophe Arviset spoke on “ESA Scientific Archives and Virtual Observatory Systems”, and Mark Showalter spoke on “Accessing Diverse Data Sets at the PDS Rings Node”.
In the digital age, the accessibility and distribution of data/metadata are prominent, and evolving archives both centralized and decentralized have valuable lessons.
In the former category (ESAC), a function-ordered approach permits the reuse of software, and the knowledge base is maintained from mission to mission. By separating by function, the archive can handle both proprietary data sets and widely public offerings. Interoperability is largely gained by insisting on one simple format - FITS. The Planetary Data System (PDS) is an example of a confederation which handles diverse data through a series of independent discipline nodes that customize data access and distribute data to specialized communities. In addition, translation tools are provided to convert a wide variety of formats. Interoperability is achieved through higher order processing.
A loose federation of missions, virtual observatories and resident archives is illustrated in the heliophysics data environment. A common data model and interlingua (SPASE) allows cross-discipline interaction. A few concepts such as time bases and simple tools provide the structure to agree on working formats in a few areas. The idea of having archives as publishing houses was also discussed, since the web now allows instant exposure of the data but carries no implication about quality. DOIs and other identifiers remove ambiguity and can be offered by archives, societies, and commercial entities.
The panel discussion showcased how this goal was fast evolving and must be customized for users in their respective scientific communities.
G. Session on “Provider Interactions”
This session had three presentations from working archives on how they are streamlining the input from data providers. Andrew Davis spoke on “Integrating an ACE Science Data Center and SAMPEX Resident Archive into the Emerging Virtual Observatory System: Practical Experience and Perspectives,” Bruce Berriman spoke on “Best Practices in Ingestion and Data Access at the InfraRed Processing and Analysis Center,” and Dan Kowal spoke on “Applying Submission Agreements to Long Existing Data Flows – A NOAA Story.”
There are a few lessons learned that can make the job of the data provider easier. First, during the ingestion process, make sure that an outline of the submission agreement or package is clearly understood. The rule of thumb is that it is essential to make the data useable and combinable from the start. Second, the provider needs to follow community standards on formats but the archive must make its usability criteria known. The goal is to produce well-documented data that is then bundled with the final product. Third, the archive needs to respond to the user community through a set of tools that guide it in setting the services available. These tools need to be modular so they can be reused or modified for later submissions.
All agreed that you can't start early enough.
H. Poster Presentation Summary
A poster summary is given in Appendix A. It contains a short summary of the posters from notes taken during the poster author's 2-minute presentation.
Rapporteur reports on the posters were prepared by Lou Reich, Kathy Fontaine, and Steve Joy. A summary of those reports appears in Appendix B.
The posters and the full rapporteur reports are found on the Science Archives in the 21st Century Web Site.
III. FEEDBACK (paraphrased):
- Workshop has been much more useful than I thought it would be. We're all engaged in our areas and keeping busy, but it's good to learn from each other.
- A couple of days is a good length for a meeting.
- The scope of this workshop was very broad for 2 days, and I'm feeling overwhelmed. One thing that hasn't come out since Ray's talk is about convergence; we wanted to see some ideas or thoughts about that.
- We need to understand the niche we're in.
- From an earth science perspective, one can see the idea here is interdisciplinary collaborations; it's a little like they do in DAC meetings too.
- This was very different, and I enjoyed visiting your planet for a change. I'm very interested in metadata, and on my planet we like the international metadata standards. Some of what I've seen in this meeting would be addressed by those international standards. For example, what level of granularity should you use? There is lots of that in the international standards. You should look at ISO 19115 (geospatial) and 11179 (data elements) and 11915.2.
- Found it took 15 minutes to view a poster and understand it.
- Focused poster sessions worked well, overviews and summaries were well done.
- When giving a poster you don't get to see many of those in your room.
- Two minute summaries made it easier to choose whose poster to skip.
- Room is small enough to see when you had a customer; otherwise one could look at others in the same room.
- Regarding the panels, might have had more coordination?
- We had a lot of archive managers here, but only a small fraction of what there are in the world. There are a lot of things going on that we haven't talked about, especially in earth science such as IPY, etc. Even with a focus on astrophysics and space physics, we've only scratched the surface. Would it be more useful to go to the bigger, more general meetings?
- We don't have biologists or chemists here; we can get some interesting points if we go even further to other scientists and historians. We will get different perspectives. What do you mean by science archives?
- NSSDC: We wanted representation from most of the NASA space science archives and some representation from Earth science, other agencies, and some technologists.
- We didn't get into details in a number of areas; I'm interested in ingest and also formats. If we're considering what to do with a series of meetings, maybe it could focus on a couple of sub-sections for annual meetings, or general meetings less often.
- Formats would be a subset, maybe 30-40 people. This meeting was intentionally small, so maybe that's a good approach. You're right; we do want to focus in a couple of areas. Formats was one of the areas I heard again and again.
- There was a lot of discussion about services, including Echo from earth science and IVOA. We should not have to relearn what has already been done.
- IVOA and Echo were looked at for PDS. They're very different.
- From an earth science perspective, earth science informatics was a new topic a couple of years ago. We wanted to know about data systems and services; this went from a couple of people to now a standard section of the AGU meeting: earth and space informatics.
- We need to have some type of technology watch, as it won't change on schedule or be tidy. For example, see xml.gov with their piece on high level prototypes where you can connect with others and see about their services. That sort of tech watch would be very valuable. Should result in some type of readable document. Formal activities are more difficult because everyone is busy.
- Regarding the tech watch, there is a tech watch from a preservation group who has done storage and services and identifiers; at least there are ways to get in touch. If we're going to share services, it would be good to have a common component, between design and architecture, which could be valuable.
- A desire for collaboration on physical media was mentioned. Mike Martin did some of the spade work on this, but there is a need to broaden beyond hard media to tape systems, etc. There is also the suggestion that we look at optical means for storing data.
- There is a 'PV' group that meets every 2 years on (scientific and technical) data preservation and adding value, hosted around Europe. The next session is in Germany this year. At the first meeting, most attendees were from Europe. At the second, they had more Americans, including the auto industry.
- There is a DCC (Digital Curation Centre) meeting in Washington this December and PV 2007 in October and another conference in China, then Codata in Kiev in 2008. They'd appreciate sessions organized for some of these meetings.
- There is a new Federal working group on digital data. There will be lots of pressure on federal government data, so it is a good idea to get your act together and show you're getting the best value for the science dollar. One can see questions about cost recovery, etc., coming up.
- There was a recent meeting at University of North Carolina in concert with their efforts to develop a digital preservation curriculum. This was attended by archivists and librarians and other broad groups concerned with data preservation. This is a growing area of interest.
- I remember a 'formats evolution process committee' formats group being organized in 1988.
- Essentially none of those formats has gone away, although there has recently been a merging of HDF and NETCDF.
- We haven't answered this sufficiently. It revolves around those services which are discipline neutral. We won't get very far trying to find a uniform query service. But persistent identifiers are a common theme. How many of these will survive more than a decade or so? How many are required in a multidisciplinary way? Those discipline neutral systems are the core of what we're talking about. Biologists talk about annotation services. Time varying datasets are important for biologists, but may be important to others as well.
- Holding another workshop every couple of years seems about right, and that's about the timescale of sea-changes and tides of the field.
- NSSDC: our original plan was to do a workshop every 3 years, and we started with this eclectic group to see what would percolate up.
- However, don't we have the IEEE mass storage meetings? Yes.
- Regarding additional collaboration, you need to consider how broad to make a collaboration, and how much value do you get from each?
Future Wikis, Newsletters
- A Yahoo newsgroup will only work for me if I get a monthly newsletter or email when there's something posted that's of interest to me.
- If a Yahoo group is very broad and very diverse, what value added do I get out of this?
- Are there any other groups that have a newsletter? Maybe we need to do a monthly newsletter?
- All of these need a hot enough issue to keep folks interested; a wiki only works when there's an interest hot enough.
- I would be motivated to read things if it will help solve my problems at work. Otherwise, it's nice, it's interesting, but I don't have time.
- I won't participate in yet another wiki, as I'm already following 8 others.
- Todd volunteered for a working group for Services. A Yahoo group would help such a working group.
- Codata publishes a data science journal, and it is appreciated that there is a place to publish those results in a refereed journal, in a formal academic way. At GMU we have a data science program which was a graduate level program only and is now enlarging to include data sciences for undergraduates, maybe making it a general requirement for all students: data science 101.
Questions with Answers
- There is the impression that IVOA was accepting more than astrophysics data. Is it? -> Not really, but is it worth including another interface or should it keep on the straight and narrow? There's a diversity of opinion. I see an opportunity for planetary here. As was pointed out at lunch, it doesn't apply to all types of planetary data. Of course that doesn't mean we shouldn't do some of it.
- There is the issue of professional development and a community of experts who feel some allegiance of some sort. One of Codata's issues is the structure to have a union and international members, but they don't really have an international group looking at it. What are the professional standards adhered to? Would you join an international data organization? -> I'm a comet scientist, so I wouldn't.
IV. SUMMARY AND CONCLUSIONS
The workshop was favorably received by all that participated and some were quite