DOC

Clustering algorithms group objects into classes based on some ---

By Victoria Ferguson,2014-08-09 01:12
20 views 0
Clustering algorithms group objects into classes based on some ---

    Journal of Machine Learning Research 3 (2002) 747-780 Submitted: 12/01; Published: 12/02

    Coupled Clustering: A Method for Detecting

    Structural Correspondence

    Zvika Marx MARXZV@CS.BIU.AC.IL

    The Interdisciplinary Center for Neural Computation

    The Hebrew University of Jerusalem

    Givat-Ram, Jerusalem 91904, Israel

    and Department of Computer Science

    Bar-Ilan University

    Ramat-Gan 52900, Israel

    Ido Dagan DAGAN@CS.BIU.AC.IL

    Department of Computer Science

    Bar-Ilan University

    Ramat-Gan 52900, Israel

    Joachim M. Buhmann JB@INFORMATIK.UNI-BONN.DE

    Institut für Informatik III

    University of Bonn

    Römerstr. 164, D-53117 Bonn, Germany

    Eli Shamir SHAMIR@CS.HUJI.AC.IL

    School of Computer Science and Engineering

    The Hebrew University of Jerusalem

    Givat-Ram, Jerusalem 91904, Israel

Editors: Carla E. Brodley and Andrea Danyluk

    Abstract

    This paper proposes a new paradigm and a computational framework for revealing

    equivalencies (analogies) between sub-structures of distinct composite systems that

    are initially represented by unstructured data sets. For this purpose, we introduce

    and investigate a variant of traditional data clustering, termed coupled clustering,

    which outputs a configuration of corresponding subsets of two such representative

    sets. We apply our method to synthetic as well as textual data. Its achievements in

    detecting topical correspondences between textual corpora are evaluated through

    comparison to performance of human experts.

    Keywords: Unsupervised learning, Clustering, Structure mapping, Data mining in

    texts, Natural language processing

    1. Introduction

    Unsupervised learning methods aim at the analysis of data, based on patterns within the data itself while no supplementary directions are provided. Two extensively studied unsupervised tasks are: (a) assessing similarity between object pairs, typically quantified by a single value denoting an overall similarity level, and (b) detecting, through various techniques, the detailed structure of individual composite objects. The common approach to similarity assessment has been to examine feature-based vectorial representations of each object in order to calculate some distance or proximity measure between object pairs (see Subsection 2.2 below for examples). A natural extension to this task would be to analyze in detail what makes two composite

    ?2002 Zvika Marx, Ido Dagan, Joachim M. Buhmann and Eli Shamir

    MARX, DAGAN, BUHMANN AND SHAMIR

    Data Set AData Set B

    Figure 1: The coupled clustering framework. The diamonds represent elements of the

    two clustered data sets A and B. Closed contours represent coupled clusters,

    capturing corresponding sub-structures of the two sets.

    objects similar. According to the common view, two given objects are considered similar if they share a relatively large subset of common features, which are identifiable independently of the role they perform within the internal organization of the compared objects. Nonetheless, one might wonder how faithfully a single value (or a list of common features) represents the whole richness and subtlety of what could be conceived as similar. Indeed, cognitive studies make a distinction between surface-level similar appearance and deep structure-based correspondence relationships, such as in analogies and metaphors. Motivated by this conception of structure-related similarity, we introduce in this paper a novel conceptual and algorithmic framework for unsupervised identification of structural correspondences. Data clustering techniquesrelying themselves on vectorial representations or

    similarities within the clustered elementsimpose elementary structure on a given

    unstructured corpus of data by partitioning it into disjoint clusters (see Subsection 2.1 for more details). The method that we introduce herecoupled clusteringextends

    the standard clustering methods for a setting consisting of a pair of distinct data sets. We study the problem of partitioning these given two sets into corresponding subsets, so that every subset is matched with a counterpart in the other data set. Each pair of matched subsets forms jointly a coupled cluster. A resulting configuration of coupled

    clusters is sketched in Figure 1. A coupled cluster consists of elements that are similar to one another and distinct from elements in other clusters, subject to the context imposed by aligning the clustered data sets with respect to each other. Coupled clustering is intended to reflect comparison-dependent equivalencies rather than overall similarity. It produces interesting results in cases where the two clustered data sets are overall not very similar to each other. In such cases, coupled clustering yields inherently different outcome than standard clustering applied to the union of the two data sets: standard clustering might be inclined to produce clusters that are exclusive to elements of either data set. Coupled clustering, on the other hand, is directed to include representatives from both data sets in every cluster. Coupled clustering is a newly defined computational task of general purpose. Although it is ill posed, similarly to the standard data-clustering problem, it is potentially usable for a variety of applications.

    Coupled clustering can potentially be utilized to reveal equivalencies in any type of data that can be represented by unstructured data sets. Representative data sets might contain, for instance, pixels or contours sampled from two image collections, or patterns of physiological measurements collected from distinct populations and so on.

     748

    COUPLED CLUSTERING: A METHOD FOR DETECTING STRUCTURAL CORRESPONDENCE

    Figure 2: Keyword samples from news articles regarding two conflicts. Examples of

    coupled clusters, i.e. matched pairs of corresponding keyword subsets, are

    marked by curved contours.

    The current work concentrates on textual data, while the variety of other conceivable applications should be investigated within subsequent projects. Specifically, we apply coupled clustering to pairs of textual sourcesdocument collections or corpora