Automatic summarization
Automatic summarization is the creation of a shortened version of a text by a computer program. The product of this procedure still contains the most important points of the original text.
The phenomenon of information overload means that access to coherent and well-developed summaries is vital. As access to data has increased, so has interest in
automatic summarization. An example of the use of summarization technology is search engines such as Google.
Technologies that can produce a coherent summary of any kind of text need to take into account
several variables, such as length, writing style, and syntax, to make a useful summary.
1 Introduction
  1.1 Extraction and abstraction
  1.2 Types of summaries
2 Keyphrase extraction
  2.1 Task description and example
  2.2 Keyphrase extraction as supervised learning
    2.2.1 Design choices
      2.2.1.1 What are the examples?
      2.2.1.2 What are the features?
      2.2.1.3 How many keyphrases to return?
      2.2.1.4 What learning algorithm?
  2.3 Unsupervised keyphrase extraction: TextRank
    2.3.1 Design choices
      2.3.1.1 What should vertices be?
      2.3.1.2 How should we create edges?
      2.3.1.3 How are the final keyphrases formed?
    2.3.2 Why it works
3 Document summarization
  3.1 Overview of supervised learning approaches
  3.2 Unsupervised approaches: TextRank and LexRank
    3.2.1 Design choices
      3.2.1.1 What are the vertices?
      3.2.1.2 What are the edges?
      3.2.1.3 How are summaries formed?
    3.2.2 TextRank and LexRank differences
  3.3 Why unsupervised summarization works
4 Incorporating diversity: GRASSHOPPER algorithm
5 Maximum entropy-based summarization
6 Aided summarization
7 Evaluation
  7.1 Current difficulties in evaluating summaries automatically
  7.2 Evaluating summaries qualitatively
8 Further reading
9 See also
10 References
Automatic summarization involves reducing a text document or a larger corpus of multiple
documents into a short set of words or a paragraph that conveys the main meaning of the text.
Extractive methods work by selecting a subset of existing words, phrases, or sentences in the
original text to form the summary. In contrast, abstractive methods build an internal semantic
representation and then use natural language generation techniques to create a summary that is
closer to what a human might generate. Such a summary might contain words not explicitly present
in the original. The state-of-the-art abstractive methods are still quite weak, so most research has
focused on extractive methods, and this is what we will cover. Two particular types of
summarization often addressed in the literature are keyphrase extraction, where the goal is to
select individual words or phrases to "tag" a document, and document summarization, where the
goal is to select whole sentences to create a short paragraph summary.
Extraction and abstraction
Broadly, one distinguishes two approaches: extraction and abstraction.
Extraction techniques merely copy the information deemed most important by the system to the
summary (for example, key clauses, sentences or paragraphs), while abstraction involves
paraphrasing sections of the source document. In general, abstraction can condense a text more
strongly than extraction, but the programs that can do this are harder to develop, as they require the use of natural language generation technology, which is itself a growing field.
Types of summaries
There are different types of summaries depending on what the summarization program focuses on to make the summary of the text, for example generic summaries or query-relevant
summaries (sometimes called query-biased summaries).
Summarization systems are able to create both query-relevant text summaries and generic machine-generated summaries, depending on what the user needs. Summarization
of multimedia documents, e.g. pictures or movies, is also possible.
Some systems will generate a summary based on a single source document, while others can use multiple source documents (for example, a cluster of news stories on the same topic). These
systems are known as multi-document summarization systems.
Keyphrase extraction
Task description and example
The task is the following. You are given a piece of text, such as a journal article, and you must produce a list of keywords or keyphrases that capture the primary topics discussed in the text. In the case of research articles, many authors provide manually assigned keywords, but most text lacks pre-existing keyphrases. For example, news articles rarely have keyphrases attached, but it would be useful to be able to automatically do so for a number of applications discussed below. Consider the example text from a recent news article:
"The Army Corps of Engineers, rushing to meet President Bush’s promise to protect New Orleans by the start of the 2006 hurricane season, installed defective flood-control pumps last year despite warnings from its own expert that the equipment would fail during a storm, according to documents obtained by The Associated Press."
An extractive keyphrase extractor might select "Army Corps of Engineers," "President Bush," "New Orleans," and "defective flood-control pumps" as keyphrases. These are pulled directly from the text. In contrast, an abstractive keyphrase system would somehow internalize the content and generate keyphrases that might be more descriptive and more like what a human would produce, such as "political negligence" or "inadequate protection from floods." Note that these terms do not appear in the text and require a deep understanding, which makes it difficult for a computer to produce such keyphrases. Keyphrases have many applications, such as improving document browsing by providing a short summary. Keyphrases can also improve information retrieval: if documents have
keyphrases assigned, a user could search by keyphrase to produce more reliable hits than a
full-text search. Finally, automatic keyphrase extraction can be useful in generating index entries for a large text corpus.
Keyphrase extraction as supervised learning
Beginning with the Turney paper, many researchers have approached keyphrase extraction as a supervised machine learning problem. Given a document, we construct an example for each unigram, bigram, and trigram found in the text (though other text units are also possible, as discussed below). We then compute various features describing each example (e.g., does the phrase begin with an upper-case letter?). We assume there are known keyphrases available for a set of training documents. Using the known keyphrases, we can assign positive or negative labels to the examples. Then we learn a classifier that can discriminate between positive and negative examples as a function of the features. Some classifiers make a binary classification for a test example, while others assign a probability of being a keyphrase. For instance, in the above text, we might learn a rule that says phrases with initial capital letters are likely to be keyphrases.

After training a learner, we can select keyphrases for test documents in the following manner. We apply the same example-generation strategy to the test documents, then run each example through the learner. We can determine the keyphrases by looking at binary classification decisions or probabilities returned from our learned model. If probabilities are given, a threshold is used to select the keyphrases.

Keyphrase extractors are generally evaluated using precision and recall. Precision measures how many of the proposed keyphrases are actually correct. Recall measures how many of the true keyphrases the system proposed. The two measures can be combined in an F-score, which is the harmonic mean of the two (F = 2PR/(P + R)). Matches between the proposed
keyphrases and the known keyphrases can be checked after stemming or applying some other text normalization.
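The evaluation described above can be sketched as follows. The phrase lists are invented for illustration, and the crude `normalize` function stands in for real stemming or text normalization:

```python
# Toy evaluation of a keyphrase extractor: precision, recall, and F-score.

def normalize(phrase):
    """Crude stand-in for stemming/text normalization: lowercase and strip."""
    return phrase.lower().strip()

def evaluate(proposed, gold):
    proposed = {normalize(p) for p in proposed}
    gold = {normalize(g) for g in gold}
    matches = proposed & gold
    precision = len(matches) / len(proposed) if proposed else 0.0
    recall = len(matches) / len(gold) if gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

p, r, f = evaluate(
    proposed=["New Orleans", "hurricane season", "levee repair"],
    gold=["new orleans", "hurricane season", "flood-control pumps", "army corps"],
)
print(p, r, f)  # 2 of 3 proposals correct, 2 of 4 gold phrases found; F ≈ 0.571
```

Here two of the three proposed phrases match the gold set, so precision is 2/3, recall is 2/4, and F = 2PR/(P + R) = 4/7.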
Designing a supervised keyphrase extraction system involves making several design choices (some of which also apply to unsupervised methods):
What are the examples?
The first choice is exactly how to generate examples. Turney and others have used all possible unigrams, bigrams, and trigrams without intervening punctuation and after removing stopwords. Hulth showed that you can get some improvement by selecting examples to be sequences of tokens that match certain patterns of part-of-speech tags. Ideally, the mechanism for generating examples produces all the known labeled keyphrases as candidates, though this is often not the case. For example, if we use only unigrams, bigrams, and trigrams, then we will never be able to extract a known keyphrase containing four words. Thus, recall may suffer. However, generating too many examples can also lead to low precision.
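The Turney-style candidate generation described above (all unigrams through trigrams that do not cross punctuation, with stopwords removed) can be sketched as follows. The tiny stopword list is illustrative only, not a real lexicon:

```python
import re

# Sketch of candidate (example) generation for supervised keyphrase
# extraction: all unigrams, bigrams, and trigrams that never span
# punctuation, after stopwords are removed.
STOPWORDS = {"the", "of", "by", "to", "a", "its", "own", "that", "from"}

def candidates(text, max_n=3):
    # Split on punctuation first, so that no n-gram crosses it.
    chunks = re.split(r"[^\w\s'-]+", text.lower())
    found = set()
    for chunk in chunks:
        tokens = [t for t in chunk.split() if t not in STOPWORDS]
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                found.add(" ".join(tokens[i:i + n]))
    return found

cands = candidates("The Army Corps of Engineers, rushing to meet the deadline")
print(sorted(cands))  # includes "army corps engineers" but no phrase spanning the comma
```

Note the recall limitation mentioned above: with `max_n=3`, a four-word gold keyphrase can never be generated as a candidate.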
What are the features?
We also need to create features that describe the examples and are informative enough to allow a learning algorithm to discriminate keyphrases from non-keyphrases. Typically, features involve various term frequencies (how many times a phrase appears in the current text or in a larger corpus), the length of the example, the relative position of its first occurrence, various boolean syntactic features (e.g., contains all caps), etc. The Turney paper used about 12 such features. Hulth uses a reduced set of features, which were found most successful in the KEA (Keyphrase Extraction Algorithm) work derived from Turney’s seminal paper.
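A hypothetical feature extractor for a single candidate phrase might compute the kinds of features listed above. The particular feature names and choices here are illustrative, not those of Turney or KEA:

```python
# Compute a small feature vector for one candidate phrase: term frequency,
# phrase length, relative position of first occurrence, and two boolean
# syntactic features. Names and choices are illustrative.

def features(phrase, document):
    doc_lower = document.lower()
    phrase_lower = phrase.lower()
    first = doc_lower.find(phrase_lower)
    return {
        "term_frequency": doc_lower.count(phrase_lower),
        "num_tokens": len(phrase.split()),
        # Early-occurring phrases are often better keyphrase candidates.
        "relative_first_position": first / max(len(document), 1) if first >= 0 else 1.0,
        "starts_with_capital": phrase[:1].isupper(),
        "all_caps": phrase.isupper(),
    }

doc = "The Army Corps of Engineers installed pumps. The Corps was warned."
print(features("Corps", doc))
```

A classifier would then be trained on vectors like this one, labeled positive or negative against the known keyphrases of the training documents.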
How many keyphrases to return?
In the end, the system will need to return a list of keyphrases for a test document, so we need a way to limit the number. Ensemble methods (i.e., using votes from several classifiers) have been used to produce numeric scores that can be thresholded to provide a user-specified number of keyphrases. This is the technique used by Turney with C4.5 decision trees. Hulth used a single binary classifier, so the learning algorithm implicitly determines the appropriate number.
What learning algorithm?
Once examples and features are created, we need a way to learn to predict keyphrases. Virtually any supervised learning algorithm could be used, such as decision trees, Naive Bayes, and rule induction. In the case of Turney’s GenEx algorithm, a genetic algorithm is used to learn parameters for a domain-specific keyphrase extraction algorithm. The extractor follows a series of heuristics to identify keyphrases. The genetic algorithm optimizes parameters for these heuristics with respect to performance on training documents with known keyphrases.
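The GenEx idea can be illustrated with a toy genetic algorithm that tunes a single parameter of a stand-in heuristic extractor against tiny training documents. The "heuristic" here (a simple frequency threshold) and the training data are invented for illustration and are not Turney's actual method:

```python
import random

# Toy illustration of the GenEx idea: a genetic algorithm tunes the
# parameter of a heuristic extractor against training documents with
# known keyphrases. The heuristic keeps tokens whose frequency passes
# a tunable threshold -- a stand-in, not Turney's actual heuristics.
random.seed(0)

TRAIN = [
    ("the pumps failed during the storm . defective pumps were installed",
     {"pumps", "storm"}),
    ("the levee broke . engineers repaired the levee after the flood",
     {"levee", "flood"}),
]
STOP = {"the", "during", "were", "after", "."}

def extract(text, min_freq):
    tokens = [t for t in text.split() if t not in STOP]
    return {t for t in set(tokens) if tokens.count(t) >= min_freq}

def fitness(min_freq):
    """Average F-score of the extractor over the training documents."""
    score = 0.0
    for text, gold in TRAIN:
        got = extract(text, min_freq)
        matches = len(got & gold)
        p = matches / len(got) if got else 0.0
        r = matches / len(gold)
        score += 2 * p * r / (p + r) if p + r else 0.0
    return score / len(TRAIN)

# Minimal GA: keep the fittest parameter values, mutate them, repeat.
population = [random.randint(1, 4) for _ in range(6)]
for _ in range(10):
    population.sort(key=fitness, reverse=True)
    parents = population[:3]
    population = parents + [max(1, p + random.choice([-1, 1])) for p in parents]
best = max(population, key=fitness)
print("best min_freq:", best)
```

On this toy data a threshold of 2 beats a threshold of 1 (which over-generates and hurts precision), so the search settles on the stricter value; GenEx optimizes many such heuristic parameters at once.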