    Advanced Knowledge Technologies

    Interdisciplinary Research Collaboration

    Mid-term Review September 2003

    Scientific report

    1Nigel Shadbolt, Fabio Ciravegna, John Domingue, Wendy Hall,

    Enrico Motta, Kieron O’Hara, David Robertson, Derek Sleeman,

    Austin Tate, Yorick Wilks

    1. Introduction

    In a celebrated essay on the new electronic media, Marshall McLuhan wrote in 1962

    Our private senses are not closed systems but are endlessly translated into each

    other in that experience which we call consciousness. Our extended senses,

    tools, technologies, through the ages, have been closed systems incapable of

    interplay or collective awareness. Now, in the electric age, the very

    instantaneous nature of co-existence among our technological instruments has

    created a crisis quite new in human history. Our extended faculties and senses

    now constitute a single field of experience which demands that they become

    collectively conscious. Our technologies, like our private senses, now demand

    an interplay and ratio that makes rational co-existence possible. As long as our

    technologies were as slow as the wheel or the alphabet or money, the fact that

    they were separate, closed systems was socially and psychically supportable.

    This is not true now when sight and sound and movement are simultaneous and

    global in extent. (McLuhan 1962, p.5, emphasis in original)2

    Over forty years later, the seamless interplay that McLuhan demanded between our

    technologies is still barely visible. McLuhan's predictions of the spread, and increased

    importance, of electronic media have of course been borne out, and the worlds of

    business, science and knowledge storage and transfer have been revolutionised. Yet

    the integration of electronic systems as open systems remains in its infancy.

    The Advanced Knowledge Technologies IRC (AKT) aims to address this problem, to

    create a view of knowledge and its management across its lifecycle, to research and

    create the services and technologies that such unification will require. Half way

    through its six-year span, the results are beginning to come through, and this paper

    will explore some of the services, technologies and methodologies that have been

    developed. We hope to give a sense in this paper of the potential for the next three

    years, to discuss the insights and lessons learnt in the first phase of the project, to

    articulate the challenges and issues that remain.

1 Authorship of the Scientific Report has been a collaborative endeavour with all

    members of AKT having contributed.

2 All references in this document can be found in section 15 of Appendix 2.

2. The semantic web and knowledge management

    The WWW provided the original context that made the AKT approach to knowledge

    management (KM) possible. AKT was initially proposed in 1999; it brought together

    an interdisciplinary consortium with the technological breadth and complementarity to

    create the conditions for a unified approach to knowledge across its lifecycle (Table 1).

    The combination of this expertise, and the time and space afforded the consortium by

    the IRC structure, suggested the opportunity for a concerted effort to develop an

    approach to advanced knowledge technologies, based on the WWW as a basic

    infrastructure.

    AKT consortium member   Expertise
    Aberdeen                KBSs, databases, V&V
    Edinburgh               Knowledge representation, planning, workflow modelling, ontologies
    OU                      Knowledge modelling, visualisation, reasoning services
    Sheffield               Human language technology
    Southampton             Multimedia, dynamic linking, knowledge acquisition, modelling, ontologies

    Table 1: Some of the specialisms of the AKT consortium

    The technological context of AKT altered for the better in the short period between

    the development of the proposal and the beginning of the project itself, with the

    development of the semantic web (SW), which foresaw much more intelligent

    manipulation and querying of knowledge. The opportunities that the SW provided for

    e.g., more intelligent retrieval, put AKT in the centre of information technology

    innovation and knowledge management services; the AKT skill set would clearly be

    central for the exploitation of those opportunities.

    The SW, as an extension of the WWW, provides an interesting set of constraints to

    the knowledge management services AKT tries to provide. As a medium for the

    semantically-informed coordination of information, it has suggested a number of ways

    in which the objectives of AKT can be achieved, most obviously through the

    provision of knowledge management services delivered over the web as opposed to

    the creation and provision of technologies to manage knowledge.

    AKT is working on the assumption that many web services will be developed and

    provided for users. The KM problem in the near future will be one of deciding which

    services are needed and of coordinating them. Many of these services will be largely

    or entirely legacies of the WWW, and so the capabilities of the services will vary. As

    well as providing useful KM services in their own right, AKT will be aiming to

    exploit this opportunity, by reasoning over services, brokering between them, and

    providing essential meta-services for SW knowledge service management.

    Ontologies will be a crucial tool for the SW. The AKT consortium brings a lot of

    expertise on ontologies together, and ontologies were always going to be a key part of

    the strategy. All kinds of knowledge sharing and transfer activities will be mediated

    by ontologies, and ontology management will be an important enabling task. Different

    applications will need to cope with inconsistent ontologies, or with the problems that

    will follow the automatic creation of ontologies (e.g. merging of pre-existing ontologies to create a third). Ontology mapping, and the elimination of conflicts of reference, will be important tasks. All of these issues are discussed along with our proposed technologies.

    Similarly, specifications of tasks will be used for the deployment of knowledge services over the SW, but in general it cannot be expected that in the medium term there will be standards for task (or service) specifications. The brokering meta-services that are envisaged will have to deal with this heterogeneity.

    The emerging picture of the SW is one of great opportunity, but it will not be a well-ordered, certain or consistent environment. It will comprise many repositories of legacy data, outdated and inconsistent stores, and requirements for common understandings across divergent formalisms. There is clearly a role for standards to play in bringing much of this context together, and AKT is playing a significant role in these efforts (section 5.1.6 of the Management Report). But standards take time to emerge, they require political power to enforce, and they have been known to stifle innovation (in the short term).

    AKT is keen to understand the balance between principled inference and statistical processing of web content. Logical inference on the Web is tough: complex queries using traditional AI inference methods bring most distributed computer systems to their knees. Do we set up semantically well-behaved areas of the Web? Is any part of the Web in which semantic hygiene prevails interesting enough to reason in? These and many other questions need to be addressed if we are to provide effective knowledge technologies for our content on the web.

    3. AKT knowledge lifecycle: the challenges

    Since AKT is concerned with providing the tools and services for managing knowledge throughout its lifecycle, it is essential that it has a model of that lifecycle. The aim of the AKT knowledge lifecycle is not to provide, as most lifecycle models are intended to do, a template for knowledge management task planning. Rather, the original conceptualisation of the AKT knowledge lifecycle was to understand what difficulties and challenges there are in managing knowledge, whether in corporations or within or across repositories.

    The AKT conceptualisation of the knowledge lifecycle comprises six challenges, those of acquiring, modelling, reusing, retrieving, publishing and maintaining knowledge (O'Hara 2002, pp.38-43). The six-challenge approach does not come with

    formal definitions and standards of correct application; rather the aim is to classify the functions of AKT services and technologies in a straightforward manner.

     Figure 1: AKT's six knowledge challenges

    This paper will examine AKT's current thinking on these challenges. An orthogonal challenge, when KM is conceived in this way (indeed, whenever KM is conceived as a series of stages), is to integrate the approach within some infrastructure. Therefore the discussion in this paper will consider the challenges in turn (sections 4-9), followed by integration and infrastructure (section 10). We will then see the AKT approach in action, as applications are examined (section 11). Theoretical considerations (section 12) and future work (section 13) conclude the review.

    4. Acquisition

    Traditionally, in knowledge engineering, knowledge acquisition (KA) has been regarded as a bottleneck (Shadbolt & Burton, 1990). The SW has exacerbated this bottleneck problem; it will depend for its efficacy on the creation of a vast amount of annotation and metadata for documents and content, much of which will have to be created automatically or semi-automatically, and much of which will have to be created for legacy documents by people who are not those documents' authors.

    KA is not only the science of extracting information from the environment, but rather of finding a mapping from the environment to concepts described in the appropriate modelling formalism. Hence, the importance of this for acquisition is that, in a way

    that was not true during the development of the field of KA in the 1970s and 80s,

    KA is now focused strongly on the acquisition of ontologies. This trend is discernible in the evolution of methodologies for knowledge-intensive modelling (Schreiber et al, 2000).

    Therefore, in the context of the SW, an important aspect of KA is the acquisition of knowledge to build and populate ontologies, and furthermore to maintain and adapt ontologies to allow their reuse, or to extend their useful lives. Particular problems include the development and maintenance of large ontologies, and the creation and maintenance of ontologies by exploiting the most common, but relatively intractable, source: natural language texts. However, the development of ontologies is also something that can inform KA, by providing templates for acquisition.

    AKT has a number of approaches to the KA bottleneck, and in a paper of this size it is necessary to be selective (this will be the case for all the challenges). In this section, we will chiefly discuss the harvesting and capture of large-scale content from web pages and other resources (section 4.1), the extraction of ontologies from text (section 4.2), and the extraction of knowledge from text (section 4.3). These approaches constitute the AKT response to the new challenges posed by the SW; however, AKT has not neglected other, older KA issues. A more traditional, expert-oriented KA tool approach will be discussed in section 4.4.

    4.1. Harvesting

    AKT includes in its objectives the investigation of technologies to process a variety of knowledge on a web scale. There are currently insufficient resources marked up with meta-content in machine-readable form. In the short to medium term we cannot see such resources becoming available. One of the important objectives is to have up to date information, and so the ability to regularly harvest, capture and update content is fundamental. There has been a range of activities to support large-scale harvesting of content.

    4.1.1 Early harvesting

    Scripts were written to "screen scrape" university web sites (the leading CS research departments were chosen), using a new tool, Dome (Leonard & Glaser 2001), which is an output of the research of an EPSRC student.

    Dome is a programmable XML/HTML editor. Users load in a page from the target site and record a sequence of editing operations to extract the desired information. This sequence can then be replayed automatically on the rest of the site's pages. If irregularities in the pages are discovered during this process, the program can be paused and amended to cope with the new input.
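    To make the record-and-replay idea concrete, the following sketch shows how a comparable extraction could be expressed in Python. It is illustrative only: Dome itself is an interactive tool, and the page layout, URL, ontology namespace and property names below are assumptions made for this example rather than part of the AKT infrastructure.

        # Illustrative sketch only: Dome is an interactive, programmable XML/HTML
        # editor, and the CSS selectors, page URL, ontology namespace and property
        # names here are assumptions.  The point is the "record once, replay over
        # every page of a site" pattern, emitting RDF for each page processed.

        import urllib.request

        from bs4 import BeautifulSoup                                # pip install beautifulsoup4
        from rdflib import RDF, Graph, Literal, Namespace, URIRef    # pip install rdflib

        AKT = Namespace("http://www.aktors.org/ontology/portal#")    # assumed namespace

        def extract_person(url: str, graph: Graph) -> None:
            """Replay a 'recorded' extraction: keep the name and phone number, drop the rest."""
            html = urllib.request.urlopen(url).read()
            soup = BeautifulSoup(html, "html.parser")
            person = URIRef(url + "#person")
            graph.add((person, RDF.type, AKT["Person"]))
            # The selectors stand in for the editing steps a user records in Dome.
            graph.add((person, AKT["full-name"],
                       Literal(soup.select_one("h1").get_text(strip=True))))
            graph.add((person, AKT["has-telephone-number"],
                       Literal(soup.select_one(".phone").get_text(strip=True))))

        if __name__ == "__main__":
            g = Graph()
            for page in ["http://www.example.ac.uk/people/jsmith.html"]:   # hypothetical staff pages
                extract_person(page, g)
            print(g.serialize(format="turtle"))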

    Figure 2 shows the system running and processing a personal web page, which is also shown. A Dome program has been recorded which removes all unnecessary elements from the source of this page, leaving just the desired data; the element names and layout have been changed to the desired output format, RDF.

    Figure 2: A Dome Script to produce RDF from a Web Page

    Other scripts have been written using appropriate standard programming tools to harvest data from other sources. These scripts are run on a nightly basis to ensure that the information we glean is as up to date as possible. As the harvesting has progressed, it has also been done by direct access to databases, where possible. In addition, other sites are beginning to provide RDF to us directly, as planned.

    The theory behind this process is that of a bootstrap. Initially, AKT harvests from the web without involving the personnel at the sources at all. (This also finesses any problems of Data Protection, since all information is publicly available.) Once the benefits to the sources of having their information harvested become clear, some will contact us to cooperate. The cooperation can take various forms, such as sending us the data or RDF, or making the website more accessible, but the preferred solution is for them to publish the data on their website on a nightly basis in RDF (according to our ontology). These techniques are best suited to data which is well-structured (such as university and agency websites), and especially that which is generated from an underlying database.

    As part of the harvesting activity, and as a service to the community, the data was put in almost raw form on a website registered for the purpose: www.hyphen.info. Figure 3 shows a snapshot of the range of data we were able to make available in this form.

    Figure 3: www.hyphen.info CS UK Page

    4.1.2 Late harvesting

    The techniques above will continue to be used for suitable data sources. A knowledge mining system to extract information from several sources automatically has also been built (Armadillo, cf. section 7.2.2), exploiting the redundancy found on the Internet,

    apparent in the presence of multiple citations of the same facts in superficially different formats. This redundancy can be exploited to bootstrap the annotation process needed for IE, thus enabling production of machine-readable content for the SW. For example, the fact that a system knows the name of an author can be used to identify a number of other author names using resources present on the Internet, instead of using rule-based or statistical applications, or hand-built gazetteers. By combining a multiplicity of information sources, internal and external to the system, texts can be annotated with a high degree of accuracy with minimal or no manual intervention. Armadillo utilises multiple strategies (Named Entity Recognition, external databases, existing gazetteers, and various information extraction engines such as Amilcare (section 7.1.1) and Annie) to model a domain by connecting different

    entities and objects.
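    The bootstrapping step can be illustrated with a small sketch. The code below is not Armadillo's implementation: the toy corpus, the crude left-context patterns and the name-matching expression are stand-ins chosen only to show how known names yield patterns, and patterns in turn yield further candidate names.

        # A simplified sketch of the redundancy idea behind Armadillo, not its actual
        # code: seed author names are located in a small document collection, the
        # contexts in which they occur become crude extraction patterns, and those
        # patterns then yield further candidate names.  The corpus is a toy example.

        import re

        def contexts_for(name: str, documents: list[str]) -> set[str]:
            """Turn each occurrence of a known name into a reusable left-context pattern."""
            patterns = set()
            for doc in documents:
                for match in re.finditer(re.escape(name), doc):
                    left = doc[max(0, match.start() - 20):match.start()]
                    patterns.add(left.strip()[-15:])
            return patterns

        def bootstrap_names(seeds: set[str], documents: list[str], rounds: int = 2) -> set[str]:
            """Alternate between inducing patterns from known names and matching new ones."""
            known = set(seeds)
            for _ in range(rounds):
                patterns = set().union(*(contexts_for(n, documents) for n in known))
                for doc in documents:
                    for pat in filter(None, patterns):
                        for m in re.finditer(re.escape(pat) + r"\s*([A-Z][a-z]+ [A-Z][a-z]+)", doc):
                            known.add(m.group(1))      # a candidate found via redundancy
            return known

        docs = [
            "Papers by Yorick Wilks and Fabio Ciravegna appear in the proceedings.",
            "Papers by Wendy Hall were also cited on several superficially different pages.",
        ]
        print(bootstrap_names({"Yorick Wilks"}, docs))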

    4.2. Extracting ontologies from text: Adaptiva

    Existing ontology construction methodologies involve high levels of expertise in the domain and the encoding process. While a great deal of effort is going into the planning of how to use ontologies, much less has been achieved with respect to automating their construction. We need a feasible computational process to effect knowledge capture.

    The tradition in ontology construction is that it is an entirely manual process. There are large teams of editors, or so-called 'knowledge managers', who are occupied in

    editing knowledge bases for eventual use by a wider community in their organisation. The process of knowledge capture or ontology construction involves three major steps: first, the construction of a concept hierarchy; secondly, the labeling of relations between concepts, and thirdly, the association of content with each node in the ontology (Brewster et al 2001a).

    In the past a number of researchers have proposed methods for creating conceptual hierarchies or taxonomies of terms by processing texts. The work has sought to apply methods from Information Retrieval (term distribution in documents) and Information Theory (mutual information) (Brewster 2002). It is relatively easy to show that two terms are associated in some manner or to some degree of strength. It is possible also

    to group terms into hierarchical structures of varying degree of coherence. However, the most significant challenge is to be able to label the nature of the relationship between the terms.

    This has led to the development of Adaptiva (Brewster et al 2001b), an ontology building environment which implements a user-centred approach to the process of ontology learning. It is based on using multiple strategies to construct an ontology, reducing human effort by using adaptive information extraction. Adaptiva is a Technology Integration Experiment (TIE, section 3.1 of the Management Report).

    The ontology learning process starts with the provision of a seed ontology, which is either imported to the system, or provided manually by the user. A seed may consist of just two concepts and one relationship. The terms used to denote concepts in the ontology are used to retrieve the first set of examples in the corpus. The sentences are then presented to the user to decide whether they are positive or negative examples of the ontological relation under consideration.

    In Adaptiva, we have integrated Amilcare (discussed in greater detail below in section 7.1.1). Amilcare is a tool for adaptive Information Extraction (IE) from text designed for supporting active annotation of documents for Knowledge Management (KM). It performs IE by enriching texts with XML annotations. The outcome of the validation process is used by Amilcare, functioning as a pattern learner. Once the learning process is completed, the induced patterns are applied to an unseen corpus and new examples are returned for further validation by the user. This iterative process may continue until the user is satisfied that a high proportion of exemplars is correctly classified automatically by the system.

    Using Amilcare, positive and negative examples are transformed into a training corpus where XML annotations are used to identify the occurrence of relations in positive examples. The learner is then launched and patterns are induced and generalised. After testing, the best, most generic, patterns are retained and are then applied to the unseen corpus to retrieve other examples. From Amilcare's point of

    view the task of ontology learning is transformed into a task of text annotation: the examples are transformed into annotations and annotations are used to learn how to reproduce such annotations.
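    The overall validate-learn-apply cycle can be summarised in a schematic Python sketch. This is not Amilcare's API: the trivial cue-phrase learner, the ask_user callback and the corpus handling below are placeholders for the adaptive IE machinery described above.

        # A schematic sketch of the Adaptiva loop described above.  The learner
        # interface, corpus handling and user-validation callback are stand-ins:
        # Amilcare's real API is not shown, and the cue-phrase learner below is
        # deliberately trivial.

        from dataclasses import dataclass, field

        @dataclass
        class TrivialPatternLearner:
            """Stands in for Amilcare: learns literal cue words from validated examples."""
            cues: set = field(default_factory=set)

            def train(self, positives: list[str], negatives: list[str]) -> None:
                # Keep any word that occurs in positive examples but never in negative ones.
                pos_words = {w for s in positives for w in s.lower().split()}
                neg_words = {w for s in negatives for w in s.lower().split()}
                self.cues = pos_words - neg_words

            def annotate(self, sentence: str) -> bool:
                return any(cue in sentence.lower().split() for cue in self.cues)

        def ontology_learning_loop(seed_terms, corpus, ask_user, iterations=3):
            """Retrieve candidate sentences, have the user validate them, retrain, repeat."""
            learner = TrivialPatternLearner()
            positives, negatives = [], []
            candidates = [s for s in corpus if any(t in s for t in seed_terms)]
            for _ in range(iterations):
                for sentence in candidates:
                    (positives if ask_user(sentence) else negatives).append(sentence)
                learner.train(positives, negatives)
                # Apply the induced patterns to unseen sentences for the next round.
                candidates = [s for s in corpus
                              if s not in positives and s not in negatives
                              and learner.annotate(s)]
                if not candidates:          # stop when nothing new is retrieved
                    break
            return learner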

    Experiments are under way to evaluate the effectiveness of this approach. Various factors such as the size and composition of the corpus have been considered. Some experiments indicate that, because domain-specific corpora take the shared ontology as background knowledge, it is only by going beyond the corpus that adequate explicit information can be identified for the acquisition of the relevant knowledge (Brewster et al. 2003). Using the principles underlying the Armadillo technology (cf. section

    7.2.2), a model has been proposed for a web service that will identify relevant knowledge sources outside the specific domain corpus, thereby compensating for the lack of explicit specification of the domain knowledge.

    4.3. KA from text: Artequakt

    Given the amount of content on the web, there is every likelihood that in some domains the knowledge that we might want to acquire is out there. Annotations on the SW could facilitate acquiring such knowledge, but annotations are rare and in the near future will probably not be rich or detailed enough to support the capture of extended amounts of integrated content. In the Artequakt work we have developed tools able to search and extract specific knowledge from the Web, guided by an

    ontology that details what type of knowledge to harvest. Artequakt is an Integrated Feasibility Demonstrator (IFD) that combines expertise and resources from three projects: Artiste, and the Equator and AKT IRCs.

    Many information extraction (IE) systems rely on predefined templates and pattern-based extraction rules or machine learning techniques in order to identify and extract entities within text documents. Ontologies can provide domain knowledge in the form of concepts and relationships. Linking ontologies to IE systems could provide richer knowledge guidance about what information to extract, the types of relationships to look for, and how to present the extracted information. We discuss IE in more detail in section 7.1.

    There exist many IE systems that enable the recognition of entities within documents (e.g. 'Renoir' is a 'Person', '25 Feb 1841' is a 'Date'). However, such information is sometimes insufficient without acquiring the relation between these entities (e.g. 'Renoir' was born on '25 Feb 1841'). Extracting such relations automatically is difficult, but crucial to complete the acquisition of knowledge fragments and ontology population.
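    The difference between recognising entities and extracting the relation that links them can be shown with a short sketch. The single lexical pattern and the date_of_birth slot below are illustrative stand-ins, not the ontology-guided extraction rules used in Artequakt.

        # Illustration of the gap described above: recognising the entities alone is
        # not enough, the relation between them must also be extracted.  The single
        # regular-expression pattern and the date_of_birth slot are toy stand-ins for
        # the ontology-guided extraction used in Artequakt.

        import re

        SENTENCE = "Pierre-Auguste Renoir was born on 25 Feb 1841 in Limoges."

        # Entity recognition alone yields a Person and a Date, but no link between them.
        PERSON = r"(?P<person>[A-Z][\w-]+(?: [A-Z][\w-]+)+)"
        DATE = r"(?P<date>\d{1,2} [A-Z][a-z]{2} \d{4})"

        # Relation extraction: the lexical cue "was born on" ties the two entities
        # together, so the result can populate a slot in the ontology.
        BORN_ON = re.compile(PERSON + r" was born on " + DATE)

        match = BORN_ON.search(SENTENCE)
        if match:
            fact = {"instance": match.group("person"),
                    "relation": "date_of_birth",          # hypothetical ontology slot
                    "value": match.group("date")}
            print(fact)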

    When analysing documents and extracting information, it is inevitable that duplicated and contradictory information will be extracted. Handling such information is challenging for automatic extraction and ontology population approaches. Artequakt (Alani et al 2003b, Kim et al 2002) implements a system that searches the Web and extracts knowledge about artists, based on an ontology describing that domain. This knowledge is stored in a knowledge base to be used for automatically producing tailored biographies of artists.

    Artequakt's architecture (Figure 4) comprises three key areas. The first concerns the knowledge extraction tools used to extract factual information items from documents and pass them to the ontology server. The second key area is information management and storage. The information is stored by the ontology server and consolidated into a knowledge base that can be queried via an inference engine. The final area is the narrative generation. The Artequakt server takes requests from a reader via a simple Web interface. The reader request will include an artist and the style of biography to be generated (chronology, summary, fact sheet, etc.). The server uses story templates to render a narrative from the information stored in the knowledge base using a combination of original text fragments and natural language generation.

    Figure 4: Artequakt's architecture
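    As a rough illustration of the third area, the sketch below renders a biography from facts held in a knowledge base using a story template. The knowledge-base contents, template strings and style names are hypothetical; the real Artequakt server combines original text fragments with natural language generation rather than simple string formatting.

        # A minimal sketch of the narrative-generation step: a story template is
        # filled from facts held in the knowledge base.  The knowledge-base contents,
        # templates and style names are hypothetical; the Artequakt server combines
        # original text fragments with natural language generation, not format strings.

        knowledge_base = {
            "Renoir": {
                "full_name": "Pierre-Auguste Renoir",
                "date_of_birth": "25 February 1841",
                "place_of_birth": "Limoges",
                "movement": "Impressionism",
            }
        }

        TEMPLATES = {
            # One template per biography style requested through the Web interface.
            "summary": ("{full_name} was born on {date_of_birth} in {place_of_birth} "
                        "and is associated with {movement}."),
            "fact sheet": "Name: {full_name}\nBorn: {date_of_birth}, {place_of_birth}",
        }

        def generate_biography(artist: str, style: str = "summary") -> str:
            """Answer a reader request by rendering the chosen template from KB facts."""
            return TEMPLATES[style].format(**knowledge_base[artist])

        print(generate_biography("Renoir", "fact sheet"))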

    The first stage of this project consisted of developing an ontology for the domain of artists and paintings. The main part of this ontology was constructed from selected sections in the CIDOC Conceptual Reference Model ontology. The ontology informs the extraction tool of the type of knowledge to search for and extract. An information extraction tool was developed and applied that automatically populates the ontology with information extracted from online documents. The information extraction tool makes use of an ontology, coupled with a general-purpose lexical database, WordNet, and an entity recogniser, GATE (Cunningham et al 2002; see section 10.4), as

    guidance tools for identifying knowledge fragments consisting not just of entities, but also the relationships between them. Automatic term expansion is used to increase the scope of text analysis to cover syntactic patterns that imprecisely match our definitions.
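    A minimal sketch of such term expansion, using WordNet through the NLTK interface, is shown below. The seed verb is only an example, and the exact way Artequakt couples WordNet with GATE is not reproduced here.

        # A small sketch of automatic term expansion with WordNet via NLTK; the exact
        # way Artequakt couples WordNet with GATE is not reproduced here, and the
        # seed verb is only an example.

        import nltk
        from nltk.corpus import wordnet as wn

        nltk.download("wordnet", quiet=True)    # fetch the WordNet data on first use

        def expand_term(term: str, pos=wn.VERB) -> set[str]:
            """Collect synonyms so extraction patterns match more lexical variants."""
            expansions = set()
            for synset in wn.synsets(term, pos=pos):
                for lemma in synset.lemma_names():
                    expansions.add(lemma.replace("_", " "))
            return expansions

        # The returned set can be used to widen extraction patterns beyond the seed term.
        print(expand_term("paint"))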
