Cross-domain Resource Discovery: Integrated Discovery and Use of
Textual, Numeric, and Spatial Data:
Annual Report: 1 October 1999 – 30 September 2000
Ray R. Larson
(University of California, Berkeley)
Paul B. Watry
(University of Liverpool)
The pursuit of knowledge by scholars, scientists, government agencies, and ordinary citizens
requires that the seeker be familiar with the diverse information resources available. They must
be able to identify those information resources that relate to the goals of their inquiry, and must
have the knowledge and skills required to navigate those resources, once identified, and extract
the salient data that are relevant to their inquiry. The widespread distribution of recorded
knowledge across the emerging networked landscape is only the beginning of the problem. The
reality is that the repositories of recorded knowledge are only a small part of an environment with
a bewildering variety of search engines, metadata, and protocols of very different kinds and of
varying degrees of completeness and incompatibility. The challenge is not only to decide how
to mix, match, and combine one or more search engines with one or more knowledge repositories
for any given inquiry, but also to have detailed understanding of the endless complexities of
largely incompatible metadata, transfer protocols, and so on. This report describes our progress in
the 1 October 1999 – 30 September 2000 period on NSF/JISC award #IIS-9975164 in building an information access system that provides a new paradigm for information discovery and retrieval
by exploiting the fundamental interconnections between diverse information resources including
textual and bibliographic information, numerical databases, and geo-spatial information systems.
This system is intended to provide an object-oriented architecture and framework for integrating
knowledge and software capabilities for enhanced access to diverse distributed information resources.
This annual report both discusses the practical application of existing technology to the problems
of cross-domain resource discovery (using the Cheshire II system) and describes the design
and basic systems architecture for our next-generation distributed object-oriented
information retrieval system (Cheshire III). For the first purpose we have been
refining and making ready for production a next-generation information retrieval system based on
international standards (Z39.50 and SGML) which is already being used for cross-domain
searching in a number of applications within the Arts and Humanities Data Service (AHDS)
(specifically for the History Data Service hosted at the University of Essex) and the Higher
Education Archives Hub (hosted at Manchester Computing) in the UK. We are at work to include
additional data sources including the CURL (Consortium of University Research Libraries), the
Online Archive of California (OAC) and the Making of America II (MOA2) database as principal
repositories. The current Cheshire II system is being set up as a “turn-key” search environment
advanced retrieval methods to full-scale realistic databases. This system is being
“hardened” and additional user tools are being developed so that this system can be easily
deployed for providing access to SGML/XML collections.
2. The Cheshire III system that is a complete redesign, and indeed is an experimental
system incorporating cutting-edge technologies.
The remainder of this section describes our progress in developing and implementing databases using the Cheshire II system. In it we also describe our progress on the client-side implementation. The following section (section 3) describes the current design and implementation status for the next-generation Cheshire III system.
2.1 Cheshire II development
The continuing development of the Cheshire II client/server system is based on a particular vision of how information access tools will develop and, in particular, how they must respond to the requirements of a large population of users scattered around the globe who wish simultaneously to access the complete contents of thousands of archives, museums, and libraries, containing a mixture of text, images, digital maps, and sound recordings. Such a virtual library must be a network-based distributed system with local servers responsible for maintaining individual collections of digital documents, which will conform to a specific set of standards for document description, representation, and communications protocols. We believe, based on the current directions of research and adoption of standards by libraries, museums and other institutions, that a major portion of this emerging global virtual library will be based on SGML (Standard Generalized Markup Language), and especially its XML subset, and the Z39.50
information retrieval protocol for resource discovery and cross-database searching. (We also assume that the forthcoming versions of the HTTP protocol will continue to provide document delivery and hypertext linking services, and that SQL3, when finalized, will provide the low-level retrieval and data manipulation semantics for relational and object-relational databases). The Cheshire II retrieval system, in supporting Z39.50 “Explain” semantics for navigating digital
collections, allows users to locate and retrieve information about collections that are organized hierarchically and distributed across servers. It will enable coherent expressions of relationships among objects and collections, showing for any given collection superior, subordinate, related, and context collections. These are essential prerequisites for the development of cross-domain resource discovery tools, which will enable users to access diverse collections through a single interface. It specifically addresses the critical issue of “vocabulary control” by supporting probabilistic “best match” ranked searching (as discussed below) and “Entry Vocabulary Modules” (EVMs) that provide a mapping between a searcher's natural language and controlled vocabularies used in the description of digital objects and collections. It also allows users to “navigate” collections (the “drilling down approach”) through distributed Z39.50 “explain” databases and through the use of SGML as the primary database format, particularly for
collection-level descriptions such as the EAD DTD. The system will follow the recommendations of the Third National Resource Discovery Workshop by providing fully distributed access to existing catalogues, and is designed to support cross-domain “clumps” to facilitate resource
discovery. Finally, the proposed server anticipates the critical issue of displaying non-western character sets in its ability to handle UNICODE (in addition to the standard ASCII/ISO8859 character sets).
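The idea behind an Entry Vocabulary Module can be illustrated with a minimal sketch. It assumes a simple co-occurrence count between free-text words and the controlled terms assigned to the same record; this is a stand-in chosen for brevity, not the association measure used in Cheshire II. The function names, records, and subject headings are all hypothetical.

```python
from collections import Counter, defaultdict


def build_evm(training_records):
    """Count how often each free-text word co-occurs with each controlled term."""
    assoc = defaultdict(Counter)
    for text, controlled_terms in training_records:
        for word in set(text.lower().split()):
            for term in controlled_terms:
                assoc[word][term] += 1
    return assoc


def suggest_terms(assoc, query, top_n=3):
    """Rank controlled terms by summed co-occurrence with the query words."""
    scores = Counter()
    for word in query.lower().split():
        # Words never seen in training contribute nothing.
        scores.update(assoc.get(word, {}))
    return [term for term, _ in scores.most_common(top_n)]


# Hypothetical training data: (descriptive text, assigned subject headings).
records = [
    ("letters from a cotton merchant", ["Textile industry"]),
    ("cotton mill wage books", ["Textile industry", "Labor"]),
    ("dock workers strike papers", ["Labor"]),
]
evm = build_evm(records)
print(suggest_terms(evm, "cotton mill"))
```

A searcher who types "cotton mill" is offered the controlled terms most strongly associated with those words, which can then be used to search the controlled-vocabulary index.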
2.1.1 Cheshire II Development History
The development of the Cheshire system began in the early 1990s at the University of California, Berkeley, as a means of testing the use of “probabilistic information retrieval methods” upon MARC bibliographic data. It was found that these advanced retrieval methods developed at Berkeley were far more effective than traditional Boolean methods (or vector space model
methods) in accessing records from a bibliographic database. Needless to say, the deployment of
these “probabilistic” retrieval algorithms has very important economies particularly in the
searching of databases or documents such as EAD which normally do not use a controlled vocabulary.
The second version of Cheshire, currently deployed at both the University of Liverpool and the
University of California, Berkeley, was designed to extend the format of the server to include
SGML-encoded data. Because SGML is increasingly becoming the markup language of choice
for research institutions, it was critical to extend Cheshire's capabilities to support the kinds of
SGML metadata that are likely to be included in national bibliographies. These are: TEI (Text
Encoding Initiative), EAD (Encoded Archival Description), DDI (for Social Science Data
Services), CIMI (Consortium for the Interchange of Museum Information) records, as well as the
SGML version of USMARC released by the Library of Congress (based on the USMARC DTD
developed by Jerome McDonough for the Cheshire project).
The third version extends the use of SGML handling capabilities for these search indexes. This
version was developed by Berkeley and Liverpool for the Arts and Humanities Data Service,
enabling GRS-1 syntax conversion for nested SGML data, component indexing and retrieval of
SGML formatted documents, and automatic generation of Z39.50 Explain databases from system
configuration files. The current version of the server is now able to include an element in an
SGML record that is a reference to an external digital object (such as a file name, URL or URN)
that contains full-text to be parsed and indexed; these can be local files or URL and URN
referenced files anywhere on the internet. It also enhances the users' ability to perform somewhat
less directed searching through Boolean and probabilistic search capabilities that can be
combined at the user's direction. This version of Cheshire can display a number of data types
ranging from full-text documents and structured bibliographic records to complex hypertext
and multimedia documents. At its current stage of development, Cheshire forms a bridge between
the realms of purely bibliographic information and the rapidly expanding full-text and multimedia
collections available online.
2.1.2 Features of Cheshire II
The Cheshire II system includes the following features:
1. It supports SGML and XML as the primary database format of the underlying search
engine. The system also provides support for full-text data linked to SGML or XML
metadata records. MARC format records for traditional online catalog databases are
supported using MARC to SGML conversion software developed for the project.
2. It is a client/server application where the interfaces (clients) communicate with the
search engine (server) using the Z39.50 v.3 Information Retrieval Protocol. The
system also provides a general Z39.50 Gateway with support for mapping Z39.50
queries to local Cheshire databases and to relational databases.
3. It includes a programmable graphical direct manipulation interface under X on Unix
and Windows NT. There is also a CGI interpreter version that combines client and
server capabilities. These interfaces permit searches of the Cheshire II search engine
as well as any other Z39.50 compatible search engine on the network.
4. It permits users to enter natural language queries and these may be combined with
Boolean logic for users who wish to use it.
5. It uses probabilistic ranking methods based on the Logistic Regression research
carried out at Berkeley to match the user's initial query with documents in the
database. In some databases it can provide two-stage searching where a set of
“classification clusters” (Larson 1991) is first retrieved in decreasing order of
probable relevance to the user's search statement. These clusters can then be used to
provide feedback about the primary topical areas of the query, and retrieve
documents within the topical area of the selected clusters. This aids the user in
subject focusing and topic/treatment discrimination. Similar facilities are used in the
Unfamiliar Metadata Vocabularies project at Berkeley for mapping users' natural
language expressions of topics to appropriate controlled vocabularies.
6. It supports open-ended, exploratory browsing through following dynamically
established linkages between records in the database, in order to retrieve materials
related to those already found. These can be dynamically generated “hypersearches”
that let users issue a Boolean query with a mouse click to find all items that share
some field with a displayed record.
7. It uses the user's selection of relevant citations to refine the initial search statement
and automatically construct new search statements for relevance feedback searching.
8. All of the client and server facilities can be adapted to specific applications using the
Tcl scripting language.
9. Image content retrieval using BlobWorld.
10. Support for the SDLIP (Simple Digital Library Interoperability Protocol) for search
and as a Z39.50 Gateway.
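The probabilistic ranking described in feature 5 can be sketched as follows. This is a minimal illustration of logistic-regression ranking in general, not the Cheshire II implementation: the feature set and the coefficient values are hypothetical placeholders, where a real system would fit them on a collection with relevance judgments.

```python
import math

# Hypothetical coefficients; real values would be fit by logistic regression
# on a test collection with known relevance judgments.
COEFFS = {"intercept": -3.5, "matches": 1.0, "avg_log_tf": 0.5, "log_doc_len": -0.2}


def features(query_terms, doc_terms):
    """Compute simple query/document match statistics, or None if no term matches."""
    matched = [t for t in set(query_terms) if t in doc_terms]
    if not matched:
        return None
    avg_log_tf = sum(math.log(doc_terms.count(t)) for t in matched) / len(matched)
    return {
        "matches": len(matched),
        "avg_log_tf": avg_log_tf,
        "log_doc_len": math.log(len(doc_terms)),
    }


def relevance_probability(query_terms, doc_terms):
    """Estimate P(relevant | query, document) with a logistic model."""
    x = features(query_terms, doc_terms)
    if x is None:
        return 0.0
    log_odds = COEFFS["intercept"] + sum(COEFFS[name] * value for name, value in x.items())
    return 1.0 / (1.0 + math.exp(-log_odds))


def rank(query, docs):
    """Return ids of matching documents in decreasing order of estimated relevance."""
    q = query.lower().split()
    scored = [(relevance_probability(q, text.lower().split()), doc_id)
              for doc_id, text in docs.items()]
    return [doc_id for p, doc_id in sorted(scored, reverse=True) if p > 0]


docs = {
    "a": "probabilistic retrieval of archival finding aids",
    "b": "retrieval retrieval methods",
    "c": "sheet music",
}
print(rank("probabilistic retrieval", docs))
```

Because the output is a probability rather than a Boolean match, documents that satisfy only part of a natural-language query are still returned, ordered by their estimated relevance.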
2.1.3 Current Usage of Cheshire II
The Cheshire II system is currently in use in a wide variety of ongoing WWW and Z39.50
implementations. Current usage of the Cheshire II system includes:
• Berkeley NSF/NASA/ARPA Digital Library
    o Includes support for full-text and page-level search.
    o Experimental Blob-World image search
• World Conservation Digital Library
• SunSite (UC Berkeley Science Libraries)
• University of Essex, HDS (part of AHDS)
• Oxford Text Archive (test only)
• California Sheet Music Project
• Cha-Cha (Berkeley Intranet Search Engine)
• Berkeley Metadata project cross-language demo
• Univ. of Virginia (test implementations)
• JISC data sets at MIMAS
• University of Liverpool Special Collections and Archives
• University of Warwick, Modern Records Centre
• Bodleian Library, Oxford
• The HE Archives Hub (currently numbers 20 repositories, but to be extended to include
approximately 70 HE/FE repositories throughout the United Kingdom)
• DeMontfort University (MASTER project)
• University of London Library
• Online Archive of California
• CIAO, University of California
• University of Liverpool Museum and Art Gallery
3 Background and Design
The first year of this project has been largely concerned with the design and initial development
of the next-generation Distributed Object Retrieval Architecture. This is the basis for our planned
distributed system for cross-domain retrieval. In the initial proposal we expected to be using
CORBA for distributed objects in the new system, but recent developments in Java have led us to
choose instead the 'JavaSpaces' framework based on the LINDA system from Yale University.
JavaSpaces will provide the ability to distribute the system and data in a much more effective way
than is possible with CORBA. As noted in the original proposal, established standards have been
followed in the on-going development of the Cheshire II system. While we have been designing
and beginning development on Cheshire III we have continued to update the Cheshire II system
and make it available for use as discussed in the preceding sections.
We see the architecture for the evolution of distributed information access systems as a highly
extensible and dynamic system. In such a system both the data (digital objects instantiating
information resources) and the programs that operate on that data (methods) to achieve the needs
and desires of the users of the system for display and manipulation of the data (behaviours) will
be implemented in a distributed object environment. The basic architecture is a three-tiered
division of data and functionality. The tiers are:
1. The Client. The basic client for the distributed Cheshire system can be any JAVA-
enabled WWW Browser. The primary data delivery format will be XML (for initial
versions), and the methods for manipulating and navigating within the data will be
implemented as JAVA applets, delivered on demand to the browser.
2. The Application Tier. Applications for search and manipulation of data are distributed
between the client and network servers (including the repositories) to provide distributed
functionality (and to provide new behaviours to clients on demand from any compliant
network server). The application tier or layer would both provide JAVA applets for
execution on the client, as well as providing server-side methods invoked directly on
objects in the repository either via direct invocations or indirectly via requests from other
protocols (e.g. Z39.50 or Open Geo-spatial Datastore Interface (OGDI) for network
access to heterogeneous geographic data held in multiple GIS formats and spatial
reference systems). For example, a client browser might download an applet that can
display MARC records, and invoke a server-side method to convert repository objects in
XML to MARC format. We expect, for performance reasons, that many operations on
stored objects will be server-side methods with primarily display functions on the client.
3. The Repository. Digital objects and metadata describing them will reside in the
repositories tier or layer.
Repositories can be implemented in a variety of ways, ranging from conventional Relational,
Object-Relational, or Object-Oriented database systems and Text retrieval engines, to metadata
repositories referencing physical collections in libraries.
The following is derived from documents available on our WWW site as the basic design
documents for the Cheshire III system.
3.1 Cheshire III Design: Introduction
As indicated in the preceding sections on Cheshire II, the Cheshire system is a client/server
information retrieval system that brings modern information retrieval techniques to a wide array
of data domains. Cheshire provides uniform document storage in the form of SGML/XML,
supports probabilistic search, and supports Z39.50 interoperability with dozens of library
information systems around the world.
Cheshire II is now several years old. Its program logic is coded entirely in C and most of its user
interface is done in Tcl/Tk. Further development is being inhibited by a system complexity that
has outgrown its original design and its dated software technologies. New technologies are now
available that can dramatically reduce coding effort and enhance robustness, maintainability, and
interoperability. This section of the annual report describes how we are trying to re-engineer
Cheshire into a modern software system, with the hope of ensuring its future viability as a
platform for information retrieval research.
The next section outlines the system objectives of a next generation system. Section 3.3 discusses
the technologies available for meeting those objectives. Section 3.4 examines issues in migrating
the existing system to the new design. Section 3.5 concludes the discussion of the current design.
3.2 Cheshire III: Design Objectives
To continue Cheshire's viability as a research and production platform, the system must appeal to
users and developers alike. It must satisfy the information needs of users, and it must also make it
easy for developers to modify and experiment with the system. We identify below eight sets of
features that we see as desirable in a next-generation Cheshire.
• Distributed Queries. We want Cheshire to be able to satisfy a user's cross-domain
information need. To do that, a Cheshire server needs to look at not only the
database it maintains, but also other information sources reachable over the
Internet. Each server needs access to meta information about other information
sources, from which it can decide whether to query that particular source to
satisfy a particular information need.
• Interoperability. To maximize its domain reach, Cheshire systems need to
interoperate not only with each other, but with as many different types of systems as
possible. Cheshire needs to support international standards such as Z39.50,
emerging protocols such as SDLIP, and be ready to adopt future interfaces.
• Concurrency, Scalability and Robustness. A research system is most useful if the
results of good research can be immediately deployed to serve a large number of
users. Cheshire should be able to efficiently use machine resources to serve
concurrent requests, grow linearly in throughput as more hardware resources are
added, and offer reliable operations suitable for academic environments.
• Web-Based System Administration. The system should be easy to deploy and easy
to administer. The administrator should be presented with a unified view of
system operations and given easy means to customize them. A web-based
administration tool will give administrators the most flexibility in accessing the
system.
• Dynamic Databases. An administrator should be able to incrementally grow the
database without interrupting user services. The database should be automatically
indexed as it is grown.
• Simplified, Structured, Maintainable Code. A research platform exists to serve
innovation. It is constantly evolving as new ideas are tested and old ideas
discarded. Developers come and go. This process becomes more vibrant if the
system is accessible to new developers and is amenable to change.
• Unprivileged Deployment. A research system is a toy for everyone. Its users will
not always have privileged access to the operating system. Cheshire should not
require such access for deployment.
• A focus on high-performance, network-of-workstations style operations. A
scalable, extensible platform for information retrieval research. Scalable
performance allows us to explore more resource intensive IR techniques.
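The Distributed Queries objective, in which a server consults meta information about other sources before deciding whether to query them, might be sketched as below. The sketch assumes that the meta information is simply a set of representative terms per source and that selection is by term overlap; both are simplifying assumptions, and the source names and summaries are hypothetical.

```python
def select_sources(query, source_summaries, min_overlap=1):
    """Return names of sources whose term summary overlaps the query, best first."""
    query_terms = set(query.lower().split())
    scored = []
    for name, summary_terms in source_summaries.items():
        overlap = len(query_terms & summary_terms)
        if overlap >= min_overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Hypothetical meta information held by one server about its peers.
summaries = {
    "archives_hub": {"archive", "papers", "correspondence", "liverpool"},
    "history_data": {"census", "statistics", "population", "liverpool"},
    "sheet_music": {"song", "piano", "score"},
}
print(select_sources("liverpool census statistics", summaries))
```

Only the sources that pass the overlap test would then be sent the query, which keeps a cross-domain search from contacting every known server.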
3.3 Cheshire III: New Technologies
A number of new software technologies exist today that can help achieve the objectives outlined
in the previous section.
• Java Programming Language. Java has become the development language of
choice for most new internet development. It includes a variety of features that
make it desirable for development of the server. One of the major promises of
Java is that once a developer writes Java code it should be instantly portable to
almost every platform. Java is strongly typed and object oriented from the ground
up, giving us programs that are more robust, more maintainable, and more
expandable. Networking is at the core of Java. Everything from naming, to its
execution environment, to API design is tailored to networked computing. A rich
set of reusable Java components is available to provide infrastructure support,
and Java integration is available for nearly every other major language (to permit
inclusion of “legacy” components). Java performance appears to be adequate to
provide the systems glue, with performance critical components written in other
languages. We also expect that fully optimized native compilation for Java will
become available soon. Java 1.3 promises to provide significant improvements in
client-side performance. Java tools are mostly free. Java programmers are easy to
find. And because Java is one of the most successful computing platforms of our era,
users can expect continued expansions, upgrades, and community support.
• Java RMI. Remote Method Invocation makes it extremely simple to implement
client/server communications. A method on a remote object is called exactly the
same as a local method. The network is nearly invisible.
• Java HotSpot Server (V2.0 available for Solaris and Linux in fall, 2000). High
performance multithreaded Virtual Machine for server applications. This can be
used in Cheshire for high performance concurrency.
• Berkeley DB 3.1.x. Better concurrency support. Java API included.
• JavaServer Pages. A Java architecture for generating dynamic content, to be
delivered through standard web servers. We can use this to build a web interface
for Cheshire. More and more thin clients such as PDAs will have web browsing
capability. Cheshire may see radically different kinds of use in the future. For
example, a book store patron may want to check on his PalmPilot to find out if an
expensive book is available at a public library and, if so, immediately make a
time limited reservation.
• DOM Compliant XML Parsers. Available with Java APIs. Conform to the DOM
standard for efficient (fewer passes) parsing of XML.
• Forte Integrated Development Environment. A Java IDE that includes an
integrated debugger, GUI access to object tree, etc.
• Java Naming and Directory Interface. Comes with a reference directory service.
Cheshire servers connected to the Internet can discover each other through this
service, allowing them to cooperate both in the local area and in the wide area.
• SDLIP. A simple information retrieval protocol available with Java transport and
CORBA and HTTP bindings. SDLIP is simple to understand and elegant in
design. SDLIP is far simpler to implement than Z39.50 and may find more
support from a wider array of information sources, particularly from the fast
evolving web search engines. The simplicity of SDLIP can encourage
experimentation in user interface, server architecture, and retrieval algorithms.
SDLIP may reduce the barrier to entry for distributed library information systems
the way the web reduced barriers to publishing. The web paradigm has been a
move away from structure, away from careful selection and long range planning,
and a move toward universal, low barrier access while powerful computers mine
structure from information whose human makers were spared the effort.
• JavaSpaces. A high-level coordination mechanism for distributed systems.
Provides a light-weight publish/subscribe distributed programming model.
'Messages' in a JavaSpace should be small; direct, point-to-point mechanisms
should be used for bulk data transfers. We assume that Cheshire will have low-volume
transactions. In JavaSpaces, distributed transactions, leasing, and events come for
'free'. We intend to use JavaSpaces to govern distributed transactions for Search,
Display and Update of the database(s). Using JavaSpaces we are adopting a
single operational model for Cheshire that encompasses single node installations,
uniformly administered clusters, as well as independently administered
federations of servers. Some of the characteristics are:
    i. every operation is a distributed operation
    ii. an operation is applied over a set of collections
    iii. collections may be:
        1. Single node or cluster: can be partitions of other collections
        2. Federation: can be partitions or subsets of other collections. In
        other words, collections in a loosely coupled federation may
        have overlapping records.
        3. Virtual Collection: the external interface (or view) to
        collections. A VC may present only part of the underlying real
        collection in its interface. A VC may grow or shrink dynamically
        within the bounds of the real collection. A search only needs to
        be done over documents in the VC, not all documents in the
        real collection.
This gives us a way to logically partition a collection across a number of
machines for performance increase, but with built-in redundancy in the
case of node failures. When a node fails, its VC is simply distributed
(logically) to other nodes in the cluster. Using this design Cheshire
servers can be organized into server groups. A server group can be
thought of as an administrative unit.
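The virtual collection (VC) scheme described above can be illustrated with a minimal sketch. It assumes that every node in a cluster can reach the whole real collection, so that a VC is only a logical assignment of documents to nodes; the `Cluster` class, its round-robin assignment, and the node names are hypothetical and are not part of the Cheshire III design documents.

```python
class Cluster:
    """Nodes that each search only their own virtual collection (VC)."""

    def __init__(self, node_names, doc_ids):
        # Assumption: each node can reach the full real collection, so a VC
        # is just the subset of document ids the node is responsible for.
        self.vcs = {name: set() for name in node_names}
        for i, doc in enumerate(sorted(doc_ids)):
            self.vcs[node_names[i % len(node_names)]].add(doc)

    def fail(self, name):
        """Logically redistribute a failed node's VC over the surviving nodes."""
        orphaned = sorted(self.vcs.pop(name))
        survivors = sorted(self.vcs)
        if orphaned and not survivors:
            raise RuntimeError("no surviving node can take over the VC")
        for i, doc in enumerate(orphaned):
            self.vcs[survivors[i % len(survivors)]].add(doc)

    def search(self, predicate):
        """Each node scans only its own VC; the results are then merged."""
        hits = []
        for name in sorted(self.vcs):
            hits.extend(doc for doc in self.vcs[name] if predicate(doc))
        return sorted(hits)


cluster = Cluster(["n1", "n2", "n3"], range(9))
before = cluster.search(lambda doc: doc % 2 == 0)
cluster.fail("n2")
after = cluster.search(lambda doc: doc % 2 == 0)
print(before == after)
```

No data moves when a node fails; only the assignment of documents to nodes changes, so a search over the surviving VCs still covers the whole collection.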
3.3.1 Designing for parallelism and scalability:
• One I/O worker per disk (conceptually). The philosophy here is that we don't
want the OS to attempt more parallelism in software than actually exists in
hardware. Because our database software is intelligent about concurrent accesses
to disk, we already have this effect and don't need to do anything special here.
• One data worker for each data resource. Here, a data resource should not straddle
multiple disks. The philosophy is that there is an optimal way to concurrently
access each resource, and the data worker is responsible for scheduling this
access.
• One compute worker for each CPU (conceptually). The OS can be thought to be
exactly this kind of worker. The OS knows about its CPU resources and
schedules them accordingly.
• One task worker for each task. A task may be protocol translation, search, display,
update, etc. A distinction should be made between task and data workers in the
same address space and those in different address spaces. In general, data
intensive transfers should happen only between data and task workers in the same
address space. Task workers have direct read access to data. Writes are handled
by background data workers.
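The division of labour in the list above, with task workers reading data directly and a background data worker handling writes, can be sketched with threads. This is an illustrative sketch only: the `DataResource` class and `search_task` function are hypothetical, an in-memory dictionary stands in for a real data resource, and the task-worker pool is sized by CPU count only as a rough analogue of one compute worker per CPU.

```python
import os
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


class DataResource:
    """One data resource with a single background data worker for writes."""

    def __init__(self):
        self.records = {}
        self._writes = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def _drain(self):
        # The data worker serialises all writes to this resource.
        while True:
            item = self._writes.get()
            if item is None:  # shutdown sentinel
                self._writes.task_done()
                break
            key, value = item
            self.records[key] = value
            self._writes.task_done()

    def read(self, key):
        # Task workers read directly, without going through the data worker.
        return self.records.get(key)

    def write(self, key, value):
        self._writes.put((key, value))

    def flush(self):
        self._writes.join()  # wait until all queued writes are applied

    def close(self):
        self._writes.put(None)
        self._worker.join()


def search_task(resource, key):
    """A task worker: here just a direct read from the shared resource."""
    return resource.read(key)


resource = DataResource()
for i in range(5):
    resource.write(f"doc{i}", f"text {i}")
resource.flush()

# One pool slot per CPU, as an approximation of one compute worker per CPU.
with ThreadPoolExecutor(max_workers=os.cpu_count() or 1) as pool:
    results = list(pool.map(lambda key: search_task(resource, key),
                            ["doc0", "doc4", "missing"]))
resource.close()
print(results)
```

Funnelling writes through one worker per resource means that concurrent readers never contend with each other over write scheduling, which is the point of keeping task workers and data workers in the same address space.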
All of the technologies listed above are freely available. Source code is available for most of them.
3.3.2 Client Technologies
The project is a client/server system with the development of the client largely being written at
the University of Liverpool and the server at the University of California, Berkeley. After some
investigation, we decided not to adopt some of the original technologies incorporated in the
original proposal, as follows:
Changes in Client Design:
Although the project proposal implied a Java based system for the client, we have subsequently
decided to utilize the Mozilla framework being developed in an Open Source fashion by the
Netscape Corporation. This has greatly increased the existing resources available for use in the
creation of the client.
The client must be usable under any operating system, be that Windows, Linux, or MacOS,
and on as many hardware platforms as possible. Mozilla has been designed to fit
this from the ground up, with the ability to be cross-platform of primary concern in the
The client interface must be familiar to as many users as possible. As can be seen in the
information technology marketplace every day, most new products have the same style of
interface as those that have gone before in order to make the learning curve for consumers as
brief as possible. By simply extending Mozilla to handle the Z39.50 protocol, this learning curve
will be minimised, as most of the functionality of the client is already present in the original, and
thus already familiar.
Mozilla is made up of small discrete components that fit together into a larger whole. As such,
adding additional functions to it is just a matter of writing a component that works together with
the other components already written. This reduces the development time as well as ensuring that
when Mozilla is updated, the changes needed to keep the Z39.50 component synchronised will be
minimised while still having access to all future advances in the Mozilla framework.
Mozilla implements XPI - the Cross Platform Installer. This is a means of having new
components or User Interface modifications installed automatically by clicking a button on a web
page or similar method. As in any system accessible via the Internet, clients will be run over