Advanced Knowledge Technologies
Interdisciplinary Research Collaboration
Mid-term Review September 2003
Nigel Shadbolt, Fabio Ciravegna, John Domingue, Wendy Hall,
Enrico Motta, Kieron O’Hara, David Robertson, Derek Sleeman,
Austin Tate, Yorick Wilks
In a celebrated essay on the new electronic media, Marshall McLuhan wrote in 1962:
Our private senses are not closed systems but are endlessly translated into each
other in that experience which we call consciousness. Our extended senses,
tools, technologies, through the ages, have been closed systems incapable of
interplay or collective awareness. Now, in the electric age, the very
instantaneous nature of co-existence among our technological instruments has
created a crisis quite new in human history. Our extended faculties and senses
now constitute a single field of experience which demands that they become
collectively conscious. Our technologies, like our private senses, now demand
an interplay and ratio that makes rational co-existence possible. As long as our
technologies were as slow as the wheel or the alphabet or money, the fact that
they were separate, closed systems was socially and psychically supportable.
This is not true now when sight and sound and movement are simultaneous and
global in extent. (McLuhan 1962, p.5, emphasis in original)
Over forty years later, the seamless interplay that McLuhan demanded between our
technologies is still barely visible. McLuhan's predictions of the spread, and increased
importance, of electronic media have of course been borne out, and the worlds of
business, science and knowledge storage and transfer have been revolutionised. Yet
the integration of electronic systems as open systems remains in its infancy.
The Advanced Knowledge Technologies IRC (AKT) aims to address this problem: to
create a unified view of knowledge and its management across its lifecycle, and to
research and create the services and technologies that such unification will require.
Halfway through its six-year span, the results are beginning to come through, and this paper
will explore some of the services, technologies and methodologies that have been
developed. We hope to give a sense in this paper of the potential for the next three
years, to discuss the insights and lessons learnt in the first phase of the project, to
articulate the challenges and issues that remain.
1. Authorship of the Scientific Report has been a collaborative endeavour, with all
members of AKT having contributed.
2. All references in this document can be found in section 15 of Appendix 2.
2. The semantic web and knowledge management
AKT Midterm Report Appendix 2 Page 2 of 62

The WWW provided the original context that made the AKT approach to knowledge
management (KM) possible. AKT was initially proposed in 1999; it brought together
an interdisciplinary consortium with the technological breadth and complementarity to
create the conditions for a unified approach to knowledge across its lifecycle (Table 1).
The combination of this expertise, and the time and space afforded the consortium by
the IRC structure, suggested the opportunity for a concerted effort to develop an
approach to advanced knowledge technologies, based on the WWW as a basic infrastructure.
AKT consortium member   Expertise
Aberdeen                KBSs, databases, V&V
Edinburgh               Knowledge representation, planning, workflow modelling, ontologies
OU                      Knowledge modelling, visualisation
Sheffield               Human language technology
Southampton             Multimedia, dynamic linking, knowledge acquisition, modelling, ontologies

Table 1: Some of the specialisms of the AKT consortium
The technological context of AKT altered for the better in the short period between
the development of the proposal and the beginning of the project itself, with the
emergence of the semantic web (SW), which foresaw much more intelligent
manipulation and querying of knowledge. The opportunities that the SW provided
(e.g. for more intelligent retrieval) put AKT at the centre of information technology
innovation and knowledge management services; the AKT skill set would clearly be
central for the exploitation of those opportunities.
The SW, as an extension of the WWW, provides an interesting set of constraints to
the knowledge management services AKT tries to provide. As a medium for the
semantically-informed coordination of information, it has suggested a number of ways
in which the objectives of AKT can be achieved, most obviously through the
provision of knowledge management services delivered over the web as opposed to
the creation and provision of technologies to manage knowledge.
AKT is working on the assumption that many web services will be developed and
provided for users. The KM problem in the near future will be one of deciding which
services are needed and of coordinating them. Many of these services will be largely
or entirely legacies of the WWW, and so the capabilities of the services will vary. As
well as providing useful KM services in their own right, AKT will be aiming to
exploit this opportunity, by reasoning over services, brokering between them, and
providing essential meta-services for SW knowledge service management.
Ontologies will be a crucial tool for the SW. The AKT consortium brings together a
great deal of expertise on ontologies, and ontologies were always going to be a key part of
the strategy. All kinds of knowledge sharing and transfer activities will be mediated
by ontologies, and ontology management will be an important enabling task. Different
applications will need to cope with inconsistent ontologies, or with the problems that
will follow the automatic creation of ontologies (e.g. merging of pre-existing ontologies to create a third). Ontology mapping, and the elimination of conflicts of reference, will be important tasks. All of these issues are discussed along with our proposed technologies.
Similarly, specifications of tasks will be used for the deployment of knowledge services over the SW, but in general it cannot be expected that in the medium term there will be standards for task (or service) specifications. The brokering meta-services that are envisaged will have to deal with this heterogeneity.

The emerging picture of the SW is one of great opportunity, but it will not be a well-ordered, certain or consistent environment. It will comprise many repositories of legacy data, outdated and inconsistent stores, and requirements for common understandings across divergent formalisms. There is clearly a role for standards to play in bringing much of this context together; AKT is playing a significant role in these efforts (section 5.1.6 of Management Report). But standards take time to emerge, they take political power to enforce, and they have been known to stifle innovation (in the short term).

AKT is keen to understand the balance between principled inference and statistical processing of web content. Logical inference on the Web is tough. Complex queries using traditional AI inference methods bring most distributed computer systems to their knees. Do we set up semantically well-behaved areas of the Web? Is any part of the Web in which semantic hygiene prevails interesting enough to reason in? These and many other questions need to be addressed if we are to provide effective knowledge technologies for our content on the web.
3. AKT knowledge lifecycle: the challenges
Since AKT is concerned with providing the tools and services for managing knowledge throughout its lifecycle, it is essential that it has a model of that lifecycle. The aim of the AKT knowledge lifecycle is not to provide, as most lifecycle models are intended to do, a template for knowledge management task planning. Rather, the original conceptualisation of the AKT knowledge lifecycle was to understand what the difficulties and challenges there are for managing knowledge whether in corporations or within or across repositories.
The AKT conceptualisation of the knowledge lifecycle comprises six challenges: those of acquiring, modelling, reusing, retrieving, publishing and maintaining knowledge (O'Hara 2002, pp.38-43). The six-challenge approach does not come with formal definitions and standards of correct application; rather, the aim is to classify the functions of AKT services and technologies in a straightforward manner.
Figure 1: AKT's six knowledge challenges
This paper will examine AKT's current thinking on these challenges. An orthogonal challenge, when KM is conceived in this way (indeed, whenever KM is conceived as a series of stages), is to integrate the approach within some infrastructure. Therefore the discussion in this paper will consider the challenges in turn (sections 4-9), followed by integration and infrastructure (section 10). We will then see the AKT approach in action, as applications are examined (section 11). Theoretical considerations (section 12) and future work (section 13) conclude the review.

4. Acquisition
Traditionally, in knowledge engineering, knowledge acquisition (KA) has been regarded as a bottleneck (Shadbolt & Burton, 1990). The SW has exacerbated this bottleneck problem; it will depend for its efficacy on the creation of a vast amount of annotation and metadata for documents and content, much of which will have to be created automatically or semi-automatically, and much of which will have to be created for legacy documents by people who are not those documents' authors.
KA is not only the science of extracting information from the environment, but rather of finding a mapping from the environment to concepts described in the appropriate modelling formalism. Hence, in a way that was not true during the development of the field of KA in the 1970s and 80s, KA is now focused strongly on the acquisition of ontologies. This trend is discernible in the evolution of methodologies for knowledge-intensive modelling (Schreiber et al, 2000).
Therefore, in the context of the SW, an important aspect of KA is the acquisition of knowledge to build and populate ontologies, and furthermore to maintain and adapt ontologies to allow their reuse, or to extend their useful lives. Particular problems include the development and maintenance of large ontologies, creating and maintaining ontologies by exploiting the most common, but relatively intractable, source of natural language texts. However, the development of ontologies is also something that can inform KA, by providing templates for acquisition.
AKT has a number of approaches to the KA bottleneck, and in a paper of this size it is necessary to be selective (this will be the case for all the challenges). In this section, we will chiefly discuss the harvesting and capture of large-scale content from web pages and other resources (section 4.1), content extraction of ontologies from text (section 4.2), and the extraction of knowledge from text (section 4.3). These approaches constitute the AKT response to the new challenges posed by the SW; however, AKT has not neglected other, older KA issues. A more traditional, expert-oriented KA tool approach will be discussed in section 4.4.

4.1. Harvesting
AKT includes in its objectives the investigation of technologies to process a variety of knowledge on a web scale. There are currently insufficient resources marked up with meta-content in machine-readable form. In the short to medium term we cannot see such resources becoming available. One of the important objectives is to have up to date information, and so the ability to regularly harvest, capture and update content is fundamental. There has been a range of activities to support large-scale harvesting of content.
4.1.1 Early harvesting

Scripts were written to "screen scrape" university web sites (the leading CS research departments were chosen), using a new tool, Dome (Leonard & Glaser 2001), which is an output of the research of an EPSRC student.
Dome is a programmable XML/HTML editor. Users load in a page from the target site and record a sequence of editing operations to extract the desired information. This sequence can then be replayed automatically on the rest of the site's pages. If irregularities in the pages are discovered during this process, the program can be paused and amended to cope with the new input.
Figure 2 shows the system running, processing a personal web page (also shown). A Dome program has been recorded which removes all unnecessary elements from the source of this page, leaving just the desired data; the element names and layout have been changed to the desired output format, RDF.
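Dome itself is not reproduced here, but the recorded edit-and-replay idea can be sketched in a few lines of Python: strip a page down to the fields of interest, then re-emit them as RDF. The HTML structure, the class names, and the `akt` namespace URI below are all invented for illustration; they are not Dome's actual internals or the AKT ontology.

```python
from html.parser import HTMLParser

class PersonPageScraper(HTMLParser):
    """Keeps only text from elements whose class marks wanted data,
    mimicking a recorded sequence of Dome editing operations."""
    KEEP = {"name", "position"}  # hypothetical class names on the target page

    def __init__(self):
        super().__init__()
        self._current = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.KEEP:
            self._current = cls

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] = data.strip()
            self._current = None

def to_rdf(fields):
    """Render the scraped fields as RDF/XML against a hypothetical namespace."""
    props = "\n".join(f"    <akt:{k}>{v}</akt:{k}>" for k, v in sorted(fields.items()))
    return (
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n'
        '         xmlns:akt="http://example.org/akt-ontology#">\n'
        '  <rdf:Description>\n' + props + '\n  </rdf:Description>\n</rdf:RDF>'
    )

page = ('<html><body><h1 class="name">A. Researcher</h1>'
        '<p class="position">Lecturer</p><p>Irrelevant boilerplate</p></body></html>')
scraper = PersonPageScraper()
scraper.feed(page)
rdf = to_rdf(scraper.fields)
```

Replayed across all of a site's pages (as Dome does), such a script turns a regularly structured web site into an RDF feed.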
Figure 2: A Dome Script to produce RDF from a Web Page
Other scripts have been written using appropriate standard programming tools to harvest data from other sources. These scripts are run on a nightly basis to ensure that the information we glean is as up to date as possible. As the harvesting has progressed, it has also been done by direct access to databases, where possible. In addition, other sites are beginning to provide RDF to us directly, as planned.
The theory behind this process is that of a bootstrap. Initially, AKT harvests from the web without involving the personnel at the sources at all. (This also finesses any problems of Data Protection, since all information is publicly available.) Once the benefits to the sources of having their information harvested become clear, some will contact us to cooperate. The cooperation can take various forms, such as sending us the data or RDF, or making the website more accessible, but the preferred solution is for them to publish the data on their website on a nightly basis in RDF (according to our ontology). These techniques are best suited to data which is well structured (such as university and agency websites), and especially that which is generated from an underlying database.
As part of the harvesting activity, and as a service to the community, the data was put in almost raw form on a website registered for the purpose: www.hyphen.info. Figure 3 shows a snapshot of the range of data we were able to make available in this form.
Figure 3: www.hyphen.info CS UK Page
4.1.2 Late harvesting

The techniques above will continue to be used for suitable data sources. A knowledge mining system to extract information from several sources automatically has also been built (Armadillo – cf. section 7.2.2), exploiting the redundancy found on the Internet, apparent in the presence of multiple citations of the same facts in superficially different formats. This redundancy can be exploited to bootstrap the annotation process needed for IE, thus enabling production of machine-readable content for the SW. For example, the fact that a system knows the name of an author can be used to identify a number of other author names using resources present on the Internet, instead of using rule-based or statistical applications, or hand-built gazetteers. By combining a multiplicity of information sources, internal and external to the system, texts can be annotated with a high degree of accuracy with minimal or no manual intervention. Armadillo utilizes multiple strategies (Named Entity Recognition, external databases, existing gazetteers, various information extraction engines such as Amilcare – section 7.1.1 – and Annie) to model a domain by connecting different entities and objects.
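The redundancy bootstrap can be illustrated with a deliberately small sketch: known seed names are located in text, the surface cues around them are collected as patterns, and the patterns are then applied to discover new names. The corpus, the seed, and the one-word-cue "pattern learning" below are toy stand-ins, not Armadillo's actual machinery.

```python
import re

# Tiny corpus standing in for redundant web pages (invented for illustration).
corpus = [
    "Machine Learning, a paper by J. Smith, appeared in 1999.",
    "A survey by K. Jones covers the same ground.",
    "The report by L. Brown extends the idea.",
]

seeds = {"J. Smith"}  # author names the system already knows

def induce_patterns(texts, known):
    """Collect the word immediately preceding each known name as a cue."""
    cues = set()
    for text in texts:
        for name in known:
            for m in re.finditer(re.escape(name), text):
                before = text[:m.start()].split()
                if before:
                    cues.add(before[-1])
    return cues

def apply_patterns(texts, cues):
    """Use the cues to spot new capitalised-name candidates."""
    found = set()
    for text in texts:
        for cue in cues:
            for m in re.finditer(re.escape(cue) + r"\s+([A-Z]\.\s[A-Z][a-z]+)", text):
                found.add(m.group(1))
    return found

cues = induce_patterns(corpus, seeds)
new_names = apply_patterns(corpus, cues) - seeds  # names discovered, not seeded
```

Each newly confirmed name can be fed back in as a seed, so coverage grows with each pass over the redundant sources.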
4.2. Extracting ontologies from text: Adaptiva

Existing ontology construction methodologies involve high levels of expertise in the domain and in the encoding process. While a great deal of effort is going into planning how to use ontologies, much less has been achieved with respect to automating their construction. We need a feasible computational process to effect knowledge capture.

The tradition in ontology construction is that it is an entirely manual process. Large teams of editors or so-called 'knowledge managers' are occupied in editing knowledge bases for eventual use by a wider community in their organisation. The process of knowledge capture or ontology construction involves three major steps: first, the construction of a concept hierarchy; secondly, the labelling of relations between concepts; and thirdly, the association of content with each node in the ontology (Brewster et al 2001a).
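The three steps can be made concrete with a minimal data-structure sketch. The class, the concept labels, and the relation name below are invented for illustration; real ontology languages (e.g. RDF Schema or OWL) carry much richer semantics.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """A node in a toy ontology: a parent link (step 1, the concept
    hierarchy), named relations to other concepts (step 2), and
    content attached to the node (step 3)."""
    label: str
    parent: "Concept" = None
    relations: dict = field(default_factory=dict)   # relation name -> Concept
    content: list = field(default_factory=list)     # documents, notes, examples

# Step 1: build a small concept hierarchy.
artefact = Concept("Artefact")
painting = Concept("Painting", parent=artefact)
artist = Concept("Artist")

# Step 2: label the relation between two concepts.
painting.relations["painted_by"] = artist

# Step 3: associate content with a node.
painting.content.append("A work of art made with pigment on a surface.")
```

Automating ontology construction amounts to filling in these three kinds of structure from text rather than by hand.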
In the past a number of researchers have proposed methods for creating conceptual hierarchies or taxonomies of terms by processing texts. The work has sought to apply methods from Information Retrieval (term distribution in documents) and Information Theory (mutual information) (Brewster 2002). It is relatively easy to show that two terms are associated in some manner or to some degree of strength. It is also possible to group terms into hierarchical structures of varying degrees of coherence. However, the most significant challenge is to be able to label the nature of the relationship between the terms.

This has led to the development of Adaptiva (Brewster et al 2001b), an ontology building environment which implements a user-centred approach to the process of ontology learning. It is based on using multiple strategies to construct an ontology, reducing human effort by using adaptive information extraction. Adaptiva is a Technology Integration Experiment (TIE – section 3.1 of the Management Report).
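The mutual-information measure mentioned above can be sketched over document co-occurrence counts. The four-document "corpus" is a toy example; real systems compute this over large collections and many term pairs.

```python
import math

# Each document reduced to its set of terms (toy data for illustration).
docs = [
    {"ontology", "knowledge", "web"},
    {"ontology", "knowledge"},
    {"web", "browser"},
    {"knowledge", "management"},
]

def pmi(term_a, term_b, documents):
    """Pointwise mutual information of two terms over document co-occurrence:
    log2( P(a,b) / (P(a) * P(b)) ). Positive means the terms co-occur more
    often than chance; -inf means they never co-occur."""
    n = len(documents)
    pa = sum(term_a in d for d in documents) / n
    pb = sum(term_b in d for d in documents) / n
    pab = sum(term_a in d and term_b in d for d in documents) / n
    if pab == 0:
        return float("-inf")
    return math.log2(pab / (pa * pb))

score = pmi("ontology", "knowledge", docs)
```

A score like this ranks candidate term pairs, but, as noted above, it says nothing about *what* the relationship between the terms is.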
The ontology learning process starts with the provision of a seed ontology, which is either imported to the system, or provided manually by the user. A seed may consist of just two concepts and one relationship. The terms used to denote concepts in the ontology are used to retrieve the first set of examples in the corpus. The sentences are then presented to the user to decide whether they are positive or negative examples of the ontological relation under consideration.
In Adaptiva, we have integrated Amilcare (discussed in greater detail below in section 7.1.1). Amilcare is a tool for adaptive Information Extraction (IE) from text, designed to support active annotation of documents for Knowledge Management (KM). It performs IE by enriching texts with XML annotations. The outcome of the validation process is used by Amilcare, functioning as a pattern learner. Once the learning process is completed, the induced patterns are applied to an unseen corpus and new examples are returned for further validation by the user. This iterative process may continue until the user is satisfied that a high proportion of exemplars is correctly classified automatically by the system.
Using Amilcare, positive and negative examples are transformed into a training corpus where XML annotations are used to identify the occurrence of relations in positive examples. The learner is then launched and patterns are induced and generalised. After testing, the best, most generic patterns are retained and are then applied to the unseen corpus to retrieve other examples. From Amilcare's point of view, the task of ontology learning is transformed into a task of text annotation: the examples are transformed into annotations, and annotations are used to learn how to reproduce such annotations.
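The iterative learn-apply-validate cycle described above can be sketched as a loop. The "pattern learning" here is reduced to a single-line heuristic (words shared by all confirmed examples become cues), which stands in for Amilcare's induction; the corpus and the oracle are invented for illustration.

```python
def ontology_learning_loop(corpus, seed_examples, oracle, rounds=3):
    """Adaptiva-style loop (sketch): learn cues from validated positives,
    apply them to unseen sentences, ask the user (oracle) to validate the
    new hits, and repeat until nothing new is confirmed."""
    positives = set(seed_examples)
    unseen = [s for s in corpus if s not in positives]
    for _ in range(rounds):
        # Toy 'pattern learner': any word shared by all positives is a cue.
        cues = set.intersection(*(set(s.lower().split()) for s in positives))
        candidates = [s for s in unseen if cues & set(s.lower().split())]
        newly_confirmed = {s for s in candidates if oracle(s)}
        if not newly_confirmed:
            break
        positives |= newly_confirmed
        unseen = [s for s in unseen if s not in positives]
    return positives

corpus = [
    "Renoir was born in Limoges",
    "Monet was born in Paris",
    "Paris is a large city",
    "Degas was born in Paris",
]
# The 'user' confirms sentences expressing the born-in relation.
oracle = lambda s: "born" in s
result = ontology_learning_loop(corpus, ["Renoir was born in Limoges"], oracle)
```

Each round the learned cues grow more precise (here they converge to "was born in"), so the user validates progressively fewer, better candidates.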
Experiments are under way to evaluate the effectiveness of this approach. Various factors, such as the size and composition of the corpus, have been considered. Some experiments indicate that, because domain-specific corpora take the shared ontology as background knowledge, it is only by going beyond the corpus that adequate explicit information can be identified for the acquisition of the relevant knowledge (Brewster et al. 2003). Using the principles underlying the Armadillo technology (cf. section 7.2.2), a model has been proposed for a web service that will identify relevant knowledge sources outside the specific domain corpus, thereby compensating for the lack of explicit specification of the domain knowledge.
4.3. KA from text: Artequakt

Given the amount of content on the web, there is every likelihood that in some domains the knowledge that we might want to acquire is out there. Annotations on the SW could facilitate acquiring such knowledge, but annotations are rare and in the near future will probably not be rich or detailed enough to support the capture of extended amounts of integrated content. In the Artequakt work we have developed tools able to search and extract specific knowledge from the Web, guided by an
ontology that details what type of knowledge to harvest. Artequakt is an Integrated Feasibility Demonstrator (IFD) that combines expertise and resources from three projects – Artiste, the Equator and AKT IRCs.
Many information extraction (IE) systems rely on predefined templates and pattern-based extraction rules or machine learning techniques in order to identify and extract entities within text documents. Ontologies can provide domain knowledge in the form of concepts and relationships. Linking ontologies to IE systems could provide richer knowledge guidance about what information to extract, the types of relationships to look for, and how to present the extracted information. We discuss IE in more detail in section 7.1.
There exist many IE systems that enable the recognition of entities within documents (e.g. 'Renoir' is a 'Person', '25 Feb 1841' is a 'Date'). However, such information is sometimes insufficient without acquiring the relation between these entities (e.g. 'Renoir' was born on '25 Feb 1841'). Extracting such relations automatically is difficult, but crucial to complete the acquisition of knowledge fragments and ontology population.
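The step from entities to relations can be illustrated with a pattern-based sketch: each surface pattern binds two entities and names the ontological relation between them. The patterns and relation names below are invented for illustration and are far simpler than Artequakt's actual extraction machinery.

```python
import re

# Hypothetical surface patterns mapped to ontology relation names.
RELATION_PATTERNS = [
    (re.compile(r"(?P<person>[A-Z][a-z]+) was born on (?P<date>\d{1,2} \w+ \d{4})"),
     "date_of_birth"),
    (re.compile(r"(?P<person>[A-Z][a-z]+) died in (?P<place>[A-Z][a-z]+)"),
     "place_of_death"),
]

def extract_relations(text):
    """Return (subject, relation, object) triples found by the patterns."""
    triples = []
    for pattern, relation in RELATION_PATTERNS:
        for m in pattern.finditer(text):
            groups = m.groupdict()
            subject = groups.pop("person")
            # The remaining named group is the object of the relation.
            triples.append((subject, relation, next(iter(groups.values()))))
    return triples

triples = extract_relations("Renoir was born on 25 Feb 1841. Renoir died in Cagnes.")
```

The triples produced this way are exactly the knowledge fragments needed to populate an ontology, which is why relation extraction, not just entity recognition, is the crucial step.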
When analysing documents and extracting information, it is inevitable that duplicated and contradictory information will be extracted. Handling such information is challenging for automatic extraction and ontology population approaches. Artequakt (Alani et al 2003b, Kim et al 2002) implements a system that searches the Web and extracts knowledge about artists, based on an ontology describing that domain. This knowledge is stored in a knowledge base to be used for automatically producing tailored biographies of artists.
Artequakt's architecture (Figure 4) comprises three key areas. The first concerns the knowledge extraction tools used to extract factual information items from documents and pass them to the ontology server. The second is information management and storage: the information is stored by the ontology server and consolidated into a knowledge base that can be queried via an inference engine. The final area is narrative generation. The Artequakt server takes requests from a reader via a simple Web interface. The reader's request will include an artist and the style of biography to be generated (chronology, summary, fact sheet, etc.). The server uses story templates to render a narrative from the information stored in the knowledge base, using a combination of original text fragments and natural language generation.
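The final, narrative-generation area can be sketched as template rendering over the knowledge base. The facts, template strings, and style names below are invented for illustration; Artequakt's own generation also weaves in original text fragments, which this sketch omits.

```python
# Hypothetical knowledge-base entry, as the extraction stage might populate it.
kb = {
    "Renoir": {"born": "25 Feb 1841", "birthplace": "Limoges",
               "movement": "Impressionism"},
}

# Story templates keyed by the biography style requested by the reader.
TEMPLATES = {
    "summary": "{name} (born {born}, {birthplace}) is associated with {movement}.",
    "fact sheet": "Name: {name}\nBorn: {born} in {birthplace}\nMovement: {movement}",
}

def generate_biography(name, style):
    """Render a biography for the requested artist in the requested style."""
    facts = kb[name]
    return TEMPLATES[style].format(name=name, **facts)

bio = generate_biography("Renoir", "summary")
```

Because the facts live in the knowledge base rather than in the text, the same entry can be rendered as a chronology, a summary, or a fact sheet on demand.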
Figure 4: Artequakt's architecture
The first stage of this project consisted of developing an ontology for the domain of artists and paintings. The main part of this ontology was constructed from selected sections of the CIDOC Conceptual Reference Model ontology. The ontology informs the extraction tool of the type of knowledge to search for and extract. An information extraction tool was developed and applied that automatically populates the ontology with information extracted from online documents. The information extraction tool makes use of an ontology, coupled with a general-purpose lexical database (WordNet) and an entity recogniser (GATE; Cunningham et al 2002 – see section 10.4), as guidance for identifying knowledge fragments consisting not just of entities, but also of the relationships between them. Automatic term expansion is used to increase the scope of text analysis to cover syntactic patterns that imprecisely match our definitions.