Cross-domain Resource Discovery Integrated Discovery and use of

By Carmen Kennedy,2014-07-11 17:30

Cross-domain Resource Discovery: Integrated Discovery and Use of Textual, Numeric, and Spatial Data

    Annual Report: 1 October 1999 to 30 September 2000

    Ray R. Larson

    (University of California, Berkeley)

    Paul B. Watry

    (University of Liverpool)

1 Introduction

The pursuit of knowledge by scholars, scientists, government agencies, and ordinary citizens

    requires that the seeker be familiar with the diverse information resources available. They must

    be able to identify those information resources that relate to the goals of their inquiry, and must

    have the knowledge and skills required to navigate those resources, once identified, and extract

    the salient data that are relevant to their inquiry. The widespread distribution of recorded

    knowledge across the emerging networked landscape is only the beginning of the problem. The

    reality is that the repositories of recorded knowledge are only a small part of an environment with

    a bewildering variety of search engines, metadata, and protocols of very different kinds and of

    varying degrees of completeness and incompatibility. The challenge is not only to decide how

    to mix, match, and combine one or more search engines with one or more knowledge repositories

    for any given inquiry, but also to have detailed understanding of the endless complexities of

    largely incompatible metadata, transfer protocols, and so on. This report describes our progress in

    the period from 1 October 1999 to 30 September 2000 on NSF/JISC award #IIS-9975164 in building an information access system that provides a new paradigm for information discovery and retrieval

    by exploiting the fundamental interconnections between diverse information resources including

    textual and bibliographic information, numerical databases, and geo-spatial information systems.

    This system is intended to provide an object-oriented architecture and framework for integrating

    knowledge and software capabilities for enhanced access to diverse distributed information resources.


1.1 Overview

This annual report discusses both the practical application of existing technology to the problems

    of cross-domain resource discovery (using the Cheshire II system), and also describes the design

    and basic systems architecture for our next-generation distributed object-oriented

    information retrieval system (Cheshire III). For the first purpose we have been

    refining and making ready for production a next-generation information retrieval system based on

    international standards (Z39.50 and SGML) which is already being used for cross-domain

    searching in a number of applications within the Arts and Humanities Data Service (AHDS)

    (specifically for the History Data Service hosted at the University of Essex) and the Higher

    Education Archives Hub (hosted at Manchester Computing) in the UK. We are at work to include

    additional data sources including the CURL (Consortium of University Research Libraries), the

    Online Archive of California (OAC) and the Making of America II (MOA2) database as principal

    repositories. The current Cheshire II system is being set up as a “turn-key” search environment


    advanced retrieval methods to full-scale realistic databases. This system is being

    “hardened” and additional user tools are being developed so that this system can be easily

    deployed for providing access to SGML/XML collections.

    2. The Cheshire III system, which is a complete redesign and indeed an experimental

    system incorporating cutting-edge technologies.

    The remainder of this section describes our progress in developing and implementing databases using the Cheshire II system. In it we also describe our progress on the client-side implementation. The following section (section 3) describes the current design and implementation status for the next-generation Cheshire III system.

2.1 Cheshire II development

    The continuing development of the Cheshire II client/server system is based on a particular vision of how information access tools will develop, in particular, how they must respond to the requirements of a large population of users scattered around the globe who wish simultaneously to access the complete contents of thousands of archives, museums, and libraries, containing a mixture of text, images, digital maps, and sound recordings. Such a virtual library must be a network-based distributed system with local servers responsible for maintaining individual collections of digital documents, which will conform to a specific set of standards for documentation description, representation, and communications protocols. We believe, based on the current directions of research and adoption of standards by libraries, museums and other institutions, that a major portion of this emerging global virtual library will be based on SGML (Standard Generalized Markup Language), and especially its XML subset, and the Z39.50

    information retrieval protocol for resource discovery and cross-database searching. (We also assume that the forthcoming versions of the HTTP protocol will continue to provide document delivery and hypertext linking services, and that SQL3, when finalized, will provide the low-level retrieval and data manipulation semantics for relational and object-relational databases). The Cheshire II retrieval system, in supporting Z39.50 “Explain” semantics for navigating digital

    collections, allows users to locate and retrieve information about collections that are organized hierarchically and distributed across servers. It will enable coherent expressions of relationships among objects and collections, showing for any given collection superior, subordinate, related, and context collections. These are essential prerequisites for the development of cross-domain resource discovery tools, which will enable users to access diverse collections through a single interface. It specifically addresses the critical issue of “vocabulary control” by supporting probabilistic “best match” ranked searching (as discussed below) and “Entry Vocabulary Modules” (EVMs) that provide a mapping between a searcher's natural language and controlled vocabularies used in the description of digital objects and collections. It also allows users to “navigate” collections (the “drilling down approach”) through distributed Z39.50 “explain” databases and through the use of SGML as the primary database format, particularly for

    collection-level descriptions such as the EAD DTD. The system will follow the recommendations of the Third National Resource Discovery Workshop by providing fully distributed access to existing catalogues, and is designed to support cross-domain “clumps” to facilitate resource

    discovery. Finally, the proposed server anticipates the critical issue of displaying non-western character sets in its ability to handle UNICODE (in addition to the standard ASCII/ISO8859 character sets).
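The idea behind an Entry Vocabulary Module, as described above, can be illustrated with a minimal sketch. It assumes a set of training records that carry both free text and controlled headings, and it ranks headings by raw co-occurrence counts; the function names, the sample data, and the use of simple counts (rather than whatever statistical association measure a production EVM would use) are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def build_entry_vocabulary(training_records):
    """Count how often each word co-occurs with each controlled heading.

    training_records is an iterable of (free_text, headings) pairs
    (hypothetical training data, not a Cheshire data format).
    """
    association = defaultdict(Counter)
    for text, headings in training_records:
        for word in set(text.lower().split()):
            for heading in headings:
                association[word][heading] += 1
    return association


def suggest_headings(query, association, limit=3):
    """Return the controlled headings most associated with the query words."""
    totals = Counter()
    for word in set(query.lower().split()):
        totals.update(association.get(word, {}))
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [heading for heading, _ in ranked[:limit]]


if __name__ == "__main__":
    records = [
        ("steam engines and railways", ["Locomotives"]),
        ("railway timetables", ["Railroad timetables"]),
        ("steam locomotives in Wales", ["Locomotives"]),
    ]
    vocabulary = build_entry_vocabulary(records)
    print(suggest_headings("railway steam", vocabulary))
```

A searcher who types everyday words is thus offered the headings that cataloguers actually used, which can then be submitted as a controlled-vocabulary search.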

2.1.1 Cheshire II Development History

    The development of the Cheshire system began in the early 1990s at the University of California, Berkeley, as a means of testing the use of “probabilistic information retrieval methods” upon MARC bibliographic data. It was found that these advanced retrieval methods developed at Berkeley were far more effective than traditional Boolean methods (or vector space model

methods) in accessing records from a bibliographic database. Needless to say, the deployment of

    these “probabilistic” retrieval algorithms has very important economies particularly in the

    searching of databases or documents such as EAD which normally do not use a controlled vocabulary.


The second version of Cheshire, currently deployed at both the University of Liverpool and the

    University of California, Berkeley, was designed to extend the format of the server to include

    SGML-encoded data. Because SGML is increasingly becoming the markup language of choice

    for research institutions, it was critical to extend Cheshire's capabilities to support the kinds of

    SGML metadata that are likely to be included in national bibliographies. These are: TEI (Text

    Encoding Initiative), EAD (Encoded Archival Description), DDI (for Social Science Data

    Services), CIMI (Consortium for the Interchange of Museum Information) records, as well as the

    SGML version of USMARC released by the Library of Congress (based on the USMARC DTD

    developed by Jerome McDonough for the Cheshire project).

The third version extends the use of SGML handling capabilities for these search indexes. This

    version was developed by Berkeley and Liverpool for the Arts and Humanities Data Service,

    enabling GRS-1 syntax conversion for nested SGML data, component indexing and retrieval of

    SGML formatted documents, and automatic generation of Z39.50 Explain databases from system

    configuration files. The current version of the server is now able to include an element in an

    SGML record that is a reference to an external digital object (such as a file name, URL or URN)

    that contains full-text to be parsed and indexed; these can be local files or files referenced by URL or URN

    anywhere on the internet. It also enhances the users' ability to perform somewhat

    less directed searching provided by Boolean and probabilistic search capabilities that can be

    combined at the user's direction. This version of Cheshire can display a number of data types

    ranging from full-text documents and structured bibliographic records to complex hypertext

    and multimedia documents. At its current stage of development, Cheshire forms a bridge between

    the realms of purely bibliographic information and the rapidly expanding full-text and multimedia

    collections available online.

2.1.2 Features of Cheshire II

The Cheshire II system includes the following features:

    1. It supports SGML and XML as the primary database format of the underlying search

    engine. The system also provides support for full-text data linked to SGML or XML

    metadata records. MARC format records for traditional online catalog databases are

    supported using MARC to SGML conversion software developed for the project.

    2. It is a client/server application where the interfaces (clients) communicate with the

    search engine (server) using the Z39.50 v.3 Information Retrieval Protocol. The

    system also provides a general Z39.50 Gateway with support for mapping Z39.50

    queries to local Cheshire databases and to relational databases.

    3. It includes a programmable graphical direct manipulation interface under X on Unix

    and Windows NT. There is also a CGI interpreter version that combines client and

    server capabilities. These interfaces permit searches of the Cheshire II search engine

    as well as any other Z39.50-compatible search engine on the network.

    4. It permits users to enter natural language queries and these may be combined with

    Boolean logic for users who wish to use it.

    5. It uses probabilistic ranking methods based on the Logistic Regression research

    carried out at Berkeley to match the user's initial query with documents in the

    database. In some databases it can provide two-stage searching where a set of

    “classification clusters” (Larson 1991) is first retrieved in decreasing order of

    probable relevance to the user's search statement. These clusters can then be used to

    provide feedback about the primary topical areas of the query, and retrieve

    documents within the topical area of the selected clusters. This aids the user in

    subject focusing and topic/treatment discrimination. Similar facilities are used in the

    Unfamiliar Metadata Vocabularies project at Berkeley for mapping users' natural

    language expressions of topics to appropriate controlled vocabularies.


    6. It supports open-ended, exploratory browsing through following dynamically

    established linkages between records in the database, in order to retrieve materials

    related to those already found. These can be dynamically generated “hypersearches”

    that let users issue a Boolean query with a mouse click to find all items that share

    some field with a displayed record.

    7. It uses the user's selection of relevant citations to refine the initial search statement

    and automatically construct new search statements for relevance feedback searching.

    8. All of the client and server facilities can be adapted to specific applications using the

    Tcl scripting language.

    9. Image content retrieval using BlobWorld.

    10. Support for the SDLIP (Simple Digital Library Interoperability Protocol) for search

    and as a Z39.50 Gateway.
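The probabilistic ranking named in feature 5 can be sketched as follows. This is an illustration of logistic-regression ranking in general, not the Cheshire II implementation: the three match statistics and all coefficient values are hypothetical, and a real system would fit its coefficients to relevance judgements.

```python
import math
from collections import Counter

# Hypothetical coefficients, chosen only for illustration. They are NOT the
# values used by Cheshire II; a real system estimates them from training data.
INTERCEPT = -3.5
W_MATCH_FRACTION = 2.0   # fraction of distinct query terms found in the document
W_LOG_TF = 1.0           # mean log term frequency of the matching terms
W_LOG_DOC_LEN = -0.3     # penalty that grows with document length


def relevance_probability(query, document):
    """Estimate P(relevant | query, document) from simple match statistics."""
    q_terms = set(query.lower().split())
    d_terms = document.lower().split()
    if not q_terms or not d_terms:
        return 0.0
    tf = Counter(d_terms)
    matches = [t for t in q_terms if tf[t] > 0]
    if not matches:
        return 0.0
    match_fraction = len(matches) / len(q_terms)
    mean_log_tf = sum(math.log(1 + tf[t]) for t in matches) / len(matches)
    log_odds = (INTERCEPT
                + W_MATCH_FRACTION * match_fraction
                + W_LOG_TF * mean_log_tf
                + W_LOG_DOC_LEN * math.log(len(d_terms)))
    # The logistic function turns log odds into a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


def rank(query, documents):
    """Return (doc_id, probability) pairs in decreasing order of probability."""
    scored = [(doc_id, relevance_probability(query, text))
              for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "a": "archival finding aids for medieval manuscripts",
        "b": "medieval manuscripts and medieval charters in archives",
        "c": "digital maps of coastal erosion",
    }
    for doc_id, probability in rank("medieval manuscripts", docs):
        print(doc_id, round(probability, 3))
```

Because every document receives a probability rather than a yes/no match, a natural language query returns a ranked list even when no single record contains all of the query terms.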

2.1.3 Current Usage of Cheshire II

    The Cheshire II system currently has a wide variety of ongoing implementations using WWW

    and Z39.50. Current usage of the Cheshire II system includes:

    • Berkeley NSF/NASA/ARPA Digital Library
        o Includes support for full-text and page-level search.
        o Experimental Blob-World image search
    • World Conservation Digital Library
    • SunSite (UC Berkeley Science Libraries)
    • University of Essex, HDS (part of AHDS)
    • Oxford Text Archive (test only)
    • California Sheet Music Project
    • Cha-Cha (Berkeley Intranet Search Engine)
    • Berkeley Metadata project cross-language demo
    • Univ. of Virginia (test implementations)
    • JISC data sets at MIMAS
    • University of Liverpool Special Collections and Archives
    • University of Warwick, Modern Records Centre
    • Bodleian Library, Oxford
    • The HE Archives Hub (currently numbers 20 repositories, but to be extended to include
      approximately 70 HE/FE repositories throughout the United Kingdom)
    • DeMontfort University (MASTER project)
    • University of London Library
    • Online Archive of California
    • CIAO, University of California
    • University of Liverpool Museum and Art Gallery

3 Background and Design

The first year of this project has been largely concerned with the design and initial development

    of the next-generation Distributed Object Retrieval Architecture. This is the basis for our planned

    distributed system for cross-domain retrieval. In the initial proposal we expected to be using

    CORBA for distributed objects in the new system, but recent developments in Java have led us to

    choose instead the 'JavaSpaces' framework based on the LINDA system from Yale University.

    JavaSpaces will provide the ability to distribute the system and data in a much more effective way

    than is possible with CORBA. As noted in the original proposal, established standards have been

    followed in the on-going development of the Cheshire II system. While we have been designing

    and beginning development on Cheshire III we have continued to update the Cheshire II system

    and make it available for use as discussed in the preceding sections.

We see the architecture for the evolution of distributed information access systems as a highly

    extensible and dynamic system. In such a system both the data (digital objects instantiating

    information resources) and the programs that operate on that data (methods) to achieve the needs

    and desires of the users of the system for display and manipulation of the data (behaviours) will

    be implemented in a distributed object environment. The basic architecture is a three-tiered

    division of data and functionality. The tiers are:

    1. The Client. The basic client for the distributed Cheshire system can be any JAVA-

    enabled WWW Browser. The primary data delivery format will be as XML (for initial

    versions), and the methods for manipulating and navigating within the data will be

    implemented as JAVA applets, delivered on demand to the browser.

    2. The Application Tier. Applications for search and manipulation of data are distributed

    between the client and network servers (including the repositories) to provide distributed

    functionality (and to provide new behaviours to clients on demand from any compliant

    network server). The application tier or layer would both provide JAVA applets for

    execution on the client, as well as providing server-side methods invoked directly on

    objects in the repository either via direct invocations or indirectly via requests from other

    protocols (e.g. Z39.50 or Open Geo-spatial Datastore Interface (OGDI) for network

    access to heterogeneous geographic data held in multiple GIS formats and spatial

    reference systems). For example, a client browser might download an applet that can

    display MARC records, and invoke a server-side method to convert repository objects in

    XML to MARC format. We expect, for performance reasons, that many operations on

    stored objects will be server-side methods with primarily display functions on the client.


    3. The Repository. Digital objects and metadata describing them will reside in the

    repositories tier or layer.

Repositories can be implemented in a variety of ways, ranging from conventional Relational,

    Object-Relational, or Object-Oriented database systems and Text retrieval engines, to metadata

    repositories referencing physical collections in libraries.
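The division of labour among the three tiers can be sketched in a few lines. This is a simplified illustration under stated assumptions: the class names, the XML record layout, and the two-field mapping to MARC tags (100 for the personal-name main entry, 245 for the title statement) are hypothetical, and the sketch is written in Python rather than as Java applets and server-side methods.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from element names in an illustrative XML record to
# MARC field tags (100 = main entry, personal name; 245 = title statement).
ELEMENT_TO_MARC_TAG = {"author": "100", "title": "245"}


class Repository:
    """Repository tier: stores digital objects as XML strings."""

    def __init__(self):
        self._objects = {}

    def put(self, object_id, xml_text):
        self._objects[object_id] = xml_text

    def get(self, object_id):
        return self._objects[object_id]


class ApplicationTier:
    """Application tier: server-side methods invoked on repository objects."""

    def __init__(self, repository):
        self._repository = repository

    def as_marc_fields(self, object_id):
        """Convert a stored XML record to a sorted list of (tag, value) pairs."""
        root = ET.fromstring(self._repository.get(object_id))
        fields = [(ELEMENT_TO_MARC_TAG[child.tag], (child.text or "").strip())
                  for child in root if child.tag in ELEMENT_TO_MARC_TAG]
        return sorted(fields)


def client_display(fields):
    """Client tier: only formats what the server-side method returned."""
    return "\n".join(f"{tag} $a {value}" for tag, value in fields)


if __name__ == "__main__":
    repository = Repository()
    repository.put("r1", "<record><title>Alice's Adventures in Wonderland</title>"
                         "<author>Carroll, Lewis</author></record>")
    application = ApplicationTier(repository)
    print(client_display(application.as_marc_fields("r1")))
```

The conversion runs next to the data, and the client receives only the small result it needs to display, which is the performance argument made for server-side methods above.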

The following is derived from documents available on our WWW site as the basic design

    documents for the Cheshire III system.

3.1 Cheshire III Design: Introduction

As indicated in the preceding sections on Cheshire II, the Cheshire system is a client/server

    information retrieval system that brings modern information retrieval techniques to a wide array

    of data domains. Cheshire provides uniform document storage in the form of SGML/XML,

    supports probabilistic search, and supports Z39.50 interoperability with dozens of library

    information systems around the world.

Cheshire II is now several years old. Its program logic is coded entirely in C and most of its user

    interface is done in Tcl/Tk. Further development is being inhibited by a system complexity that

    has outgrown its original design and its dated software technologies. New technologies are now

    available that can dramatically reduce coding effort and enhance robustness, maintainability, and

    interoperability. This section of the annual report describes how we are trying to re-engineer

    Cheshire into a modern software system, with the hope of ensuring its future viability as a

    platform for information retrieval research.

The next section outlines the system objectives of a next generation system. Section 3.3 discusses

    the technologies available for meeting those objectives. Section 3.4 examines issues in migrating

    the existing system to the new design. Section 3.5 concludes the discussion of the current design.

3.2 Cheshire III: Design Objectives

To continue Cheshire's viability as a research and production platform, the system must appeal to

    users and developers alike. It must satisfy the information needs of users, and it must also make it

    easy for developers to modify and experiment with the system. We identify below seven sets of

    features that we see as desirable in a next-generation Cheshire.

    • Distributed Queries. We want Cheshire to be able to satisfy a user's cross-domain
      information need. To do that, a Cheshire server needs to look at not only the
      database it maintains, but also other information sources reachable over the
      Internet. Each server needs access to meta information about other information
      sources, from which it can decide whether to query that particular source to
      satisfy a particular information need.

    • Interoperability. To maximize its domain reach, Cheshire systems need to
      interoperate not only with each other, but also with as many different types of
      systems as possible. Cheshire needs to support international standards such as Z39.50
      and emerging protocols such as SDLIP, and be ready to adopt future interfaces.

    • Concurrency, Scalability and Robustness. A research system is most useful if the
      results of good research can be immediately deployed to serve a large number of
      users. Cheshire should be able to efficiently use machine resources to serve
      concurrent requests, grow linearly in throughput as more hardware resources are
      added, and offer reliable operations suitable for academic environments.

    • Web-Based System Administration. The system should be easy to deploy and easy
      to administer. The administrator should be presented with a unified view of
      system operations and given easy means to customize them. A web-based
      administration tool will give administrators the most flexibility in accessing the

    • Dynamic Databases. An administrator should be able to incrementally grow the
      database without interrupting user services. The database should be automatically
      indexed as it is grown.

    • Simplified, Structured, Maintainable Code. A research platform exists to serve
      innovation. It is constantly evolving as new ideas are tested and old ideas
      discarded. Developers come and go. This process becomes more vibrant if the
      system is accessible to new developers and is amenable to change.

    • Unprivileged Deployment. A research system is a toy for everyone. Its users will
      not always have privileged access to the operating system. Cheshire should not
      require such access for deployment.

    • A focus on high performance, network-of-workstations style operations. A
      scalable, extensible platform for information retrieval research. Scalable
      performance allows us to explore more resource-intensive IR techniques.
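The Distributed Queries objective turns on a server deciding, from meta information, which other sources are worth querying. A minimal sketch of that decision is given below. The form of the meta information (a per-source table of term counts), the scoring rule (number of query terms the source covers), and the function name are all assumptions for illustration; the design documents do not prescribe them.

```python
def select_sources(query, source_summaries, min_score=1):
    """Pick the sources worth querying, best first.

    source_summaries maps a source name to a dict of term -> number of
    records containing that term (hypothetical meta information).
    """
    terms = set(query.lower().split())
    scored = []
    for name, summary in source_summaries.items():
        # Score a source by how many distinct query terms it covers.
        score = sum(1 for term in terms if summary.get(term, 0) > 0)
        if score >= min_score:
            scored.append((score, name))
    # Sort by descending coverage, then by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


if __name__ == "__main__":
    summaries = {
        "archives": {"charter": 40, "medieval": 12},
        "maps": {"coast": 9},
        "texts": {"medieval": 3},
    }
    print(select_sources("medieval charter", summaries))
```

Sources that cover none of the query terms are never contacted, which keeps a cross-domain search from fanning out to every reachable server.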

3.3 Cheshire III: New Technologies

    A number of new software technologies exist today that can help achieve the objectives outlined

    in the previous section.

    • Java Programming Language. Java has become the development language of
      choice for most new internet development. It includes a variety of features that
      make it desirable for development of the server. One of the major promises of
      Java is that once a developer writes Java code it should be instantly portable to
      almost every platform. Java is strongly typed and object oriented from the ground
      up, giving us programs that are more robust, more maintainable, and more
      expandable. Networking is at the core of Java. Everything from naming, to its
      execution environment, to API design is tailored to networked computing. A rich
      set of reusable Java components is available to provide infrastructure support,
      and Java integration is available for nearly every other major language (to permit
      inclusion of “legacy” components). Java performance appears to be adequate to
      provide the systems glue, with performance-critical components written in other
      languages. We also expect that fully optimized native compilation for Java will
      become available soon. Java 1.3 promises to provide significant improvements in
      client-side performance. Java tools are mostly free. Java programmers are easy to
      find. And because Java is one of the most successful computing platforms of our
      era, users can expect continued expansions, upgrades, and community support.

    • Java RMI. Remote Method Invocation makes it extremely simple to implement
      client/server communications. A method on a remote object is called in exactly the
      same way as a local method. The network is nearly invisible.

    • Java HotSpot Server (V2.0 available for Solaris and Linux in fall, 2000). High
      performance multithreaded Virtual Machine for server applications. This can be
      used in Cheshire for high performance concurrency.

    • Berkeley DB 3.1.x. Better concurrency support. Java API included.

    • JavaServer Pages. A Java architecture for generating dynamic content, to be
      delivered through standard web servers. We can use this to build a web interface
      for Cheshire. More and more thin clients such as PDAs will have web browsing
      capability. Cheshire may see radically different kinds of use in the future. For
      example, a book store patron may want to check on his PalmPilot to find out if an
      expensive book is available at a public library and, if so, immediately make a
      time-limited reservation.

    • DOM Compliant XML Parsers. Available with Java APIs. Conform to the DOM
      standard for efficient (fewer passes) parsing of XML.

    • Forte Integrated Development Environment. A Java IDE that includes an
      integrated debugger, GUI access to object tree, etc.

    • Java Naming and Directory Interface. Comes with a reference directory service.
      Cheshire servers connected to the Internet can discover each other through this
      service, allowing them to cooperate both in the local area and in the wide area.

    • SDLIP. A simple information retrieval protocol available with Java transport and
      CORBA and HTTP bindings. SDLIP is simple to understand and elegant in
      design. SDLIP is far simpler to implement than Z39.50 and may find more
      support from a wider array of information sources, particularly from the
      fast-evolving web search engines. The simplicity of SDLIP can encourage
      experimentation in user interface, server architecture, and retrieval algorithms.
      SDLIP may reduce the barrier to entry for distributed library information systems
      the way the web reduced barriers to publishing. The web paradigm has been a
      move away from structure, away from careful selection and long range planning,
      and a move toward universal, low barrier access while powerful computers mine
      structure from information where its human makers were spared from the effort.

    • JavaSpaces. A high-level coordination mechanism for distributed systems.
      Provides a light-weight publish/subscribe distributed programming model.
      'Messages' in a JavaSpace should be small; direct, point-to-point mechanisms
      should be used for bulk data transfers. We assume that Cheshire will have
      low-volume transactions. In JavaSpaces, distributed transactions, leasing, and
      events come for 'free'. We intend to use JavaSpaces to govern distributed
      transactions for Search, Display and Update of the database(s). Using JavaSpaces
      we are adopting a single operational model for Cheshire that encompasses single
      node installations, uniformly administered clusters, as well as independently
      administered federations of servers. Some of the characteristics are:

        i. every operation is a distributed operation

        ii. an operation is applied over a set of collections

        iii. collections may be:

            1. Single node or cluster: can be partitions of other collections

            2. Federation: can be partitions or subsets of other collections. In
               other words, collections in a loosely coupled federation may
               have overlapping records.

            3. Virtual Collection: the external interface (or view) to
               collections. A VC may present only part of the underlying real
               collection in its interface. A VC may grow or shrink dynamically
               within the bounds of the real collection. A search only needs to
               be done over documents in VC, not all documents in the

      This gives us a way to logically partition a collection across a number of
      machines for performance increase, but with built-in redundancy in the
      case of node failures. When a node fails, its VC is simply distributed
      (logically) to other nodes in the cluster. Using this design Cheshire
      servers can be organized into server groups. A server group can be
      thought of as an administrative unit.
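Two ideas from the JavaSpaces item can be made concrete with a short sketch: coordination through a shared space of small tuples, and the logical redistribution of a failed node's virtual collection. The code below is a Python analogue written for illustration, not the JavaSpaces API. The class and function names are hypothetical, `take` here returns immediately instead of blocking with a timeout, and the reassignment assumes that surviving nodes can reach the failed node's records (for example through replication).

```python
import threading


class TupleSpace:
    """In-memory, Linda-style space: write small tuples, take them by template."""

    def __init__(self):
        self._tuples = []
        self._lock = threading.Lock()

    def write(self, entry):
        with self._lock:
            self._tuples.append(tuple(entry))

    @staticmethod
    def _matches(template, entry):
        # None in a template acts as a wildcard field.
        return len(template) == len(entry) and all(
            t is None or t == e for t, e in zip(template, entry))

    def take(self, template):
        """Remove and return the first matching tuple, or None if absent."""
        with self._lock:
            for index, entry in enumerate(self._tuples):
                if self._matches(template, entry):
                    return self._tuples.pop(index)
        return None


def reassign_partitions(assignment, failed_node):
    """Spread a failed node's logical partitions over the surviving nodes.

    assignment maps node name -> list of partition ids (that node's virtual
    collection). Returns a new mapping that omits the failed node.
    """
    survivors = sorted(node for node in assignment if node != failed_node)
    if not survivors:
        raise ValueError("no surviving nodes to take over the partitions")
    result = {node: list(assignment[node]) for node in survivors}
    # Deal the orphaned partitions out round-robin.
    for index, partition in enumerate(assignment.get(failed_node, [])):
        result[survivors[index % len(survivors)]].append(partition)
    return result


if __name__ == "__main__":
    space = TupleSpace()
    space.write(("search", "q1", "medieval charters"))
    print(space.take(("search", None, None)))
    print(reassign_partitions({"n1": [0, 1], "n2": [2], "n3": [3]}, "n1"))
```

A search request is a small tuple that any idle server can take from the space, while bulk results would travel point-to-point, in line with the guidance that messages in the space should stay small.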

3.3.1 Designing for parallelism and scalability:

    • One I/O worker per disk (conceptually). The philosophy here is that we don't
      want the OS to attempt more parallelism in software than actually exists in
      hardware. Because our database software is intelligent about concurrent accesses
      to disk, we already have this effect and don't need to do anything special here.

    • One data worker for each data resource. Here, a data resource should not straddle
      multiple disks. The philosophy is that there is an optimal way to concurrently
      access each resource. And the data worker is responsible for scheduling this

    • One compute worker for each CPU (conceptually). The OS can be thought to be
      exactly this kind of worker. The OS knows about its CPU resources and
      schedules them accordingly.

    • One task worker for each task. A task may be protocol translation, search, display,
      update, etc. A distinction should be made between task and data workers in the
      same address space and those in different address spaces. In general, data
      intensive transfers should happen only between data and task workers in the same
      address space. Task workers have direct read access to data. Writes are handled
      by background data workers.
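The split between task workers and data workers can be sketched as follows: task workers read the data directly, while all writes pass through a queue that a single background data worker drains. This is a minimal Python illustration under stated assumptions. The class and function names are hypothetical, an in-memory dictionary stands in for the data resource, and the safety of unlocked reads relies on the behaviour of the standard CPython interpreter.

```python
import queue
import threading


class DataResource:
    """One data resource with a single background data worker for writes."""

    def __init__(self):
        self._records = {}
        self._writes = queue.Queue()
        self._worker = threading.Thread(target=self._apply_writes, daemon=True)
        self._worker.start()

    def _apply_writes(self):
        # The data worker is the only thread that mutates the records.
        while True:
            item = self._writes.get()
            if item is None:          # sentinel: stop the worker
                self._writes.task_done()
                break
            key, value = item
            self._records[key] = value
            self._writes.task_done()

    def read(self, key):
        """Direct read access for task workers in the same address space."""
        return self._records.get(key)

    def submit_write(self, key, value):
        """Queue a write; the background data worker applies it later."""
        self._writes.put((key, value))

    def flush(self):
        """Block until every queued write has been applied."""
        self._writes.join()

    def close(self):
        """Stop the data worker after it has drained the queue."""
        self._writes.put(None)
        self._worker.join()


def search_task(resource, keys):
    """A task worker: reads records directly and returns those that exist."""
    found = {}
    for key in keys:
        value = resource.read(key)
        if value is not None:
            found[key] = value
    return found


if __name__ == "__main__":
    resource = DataResource()
    resource.submit_write("doc1", "finding aid")
    resource.flush()
    print(search_task(resource, ["doc1", "doc2"]))
    resource.close()
```

Funnelling writes through one worker per resource lets that worker choose the order in which the resource is accessed, while searches are never held up waiting on a write lock.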

All of the technologies listed above are freely available. Source code is available for most of them.

3.3.2 Client Technologies

The project is a client/server system, with the client largely being developed at

    the University of Liverpool and the server at the University of California, Berkeley. After some

    investigation, we decided not to adopt some of the original technologies incorporated in the

    original proposal, as follows:

Changes in Client Design:

Although the project proposal implied a Java based system for the client, we have subsequently

    decided to utilize the Mozilla framework being developed in an Open Source fashion by the

    Netscape Corporation. This has greatly increased the existing resources available for use in the

    creation of the client.

The client must be usable under any operating system, be that Windows, Linux, or MacOS,

    and on as many hardware platforms as possible. Mozilla has been designed to fit

    this from the ground up, with the ability to be cross-platform of primary concern in the

    implementation strategy.

The client interface must be familiar to as many users as possible. As can be seen in the

    information technology marketplace every day, most new products have the same style of

    interface as those that have gone before in order to make the learning curve for consumers as

    brief as possible. By simply extending Mozilla to handle the Z39.50 protocol, this learning curve

    will be minimised as most of the functionality of the client is already present in the original, and

    thus already familiar.

Mozilla is made up of small discrete components that fit together into a larger whole. As such,

    adding additional functions to it is just a matter of writing a component that works together with

    the other components already written. This reduces the development time as well as ensuring that

    when Mozilla is updated, the changes needed to keep the Z39.50 component synchronised will be

    minimised while still having access to all future advances in the Mozilla framework.

Mozilla implements XPI - the Cross Platform Installer. This is a means of having new

    components or User Interface modifications installed automatically by clicking a button on a web

    page or similar method. As in any system accessible via the Internet, clients will be run over
