Automatic method of extracting summarization using feature probabilities
First Claim
1. A processor implemented method of automatically extracting a subset of sentences from sentences of a natural language document presented in machine readable form to the processor, the document including a second multiplicity of sentences, the processor being coupled to a memory storing machine readable instructions for extracting sentences, the method comprising the steps of:
a) designating a sentence of the document as a selected sentence;
b) determining values for the selected sentence of each feature of a feature set, the feature set including a location feature and an upper case feature, the location feature having a first location value, a second location value, and a third location value, the first location value indicating that the selected sentence is included within a beginning portion of the document, the second location value indicating that the selected sentence is included within a middle portion of the document, and the third location value indicating that the selected sentence is included within an ending portion of the document, the upper case feature having a first upper case value and a second upper case value, the first upper case value indicating that selected sentence does not include any of a multiplicity of selected upper case phrases, the second upper case value indicating the selected sentence includes a one of the selected upper case phrases;
c) for each feature increasing a score for the selected sentence based upon the value of the feature for the selected sentence and upon a probability associated with the value of the feature;
d) if all sentences of the document have not been designated as the selected sentence, repeating steps a) through c); and
e) selecting the subset of sentences to be extracted based upon sentence scores.
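Steps a) through e) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation. The probability table `PROBS`, the `edge_fraction` used to split the document into beginning, middle, and ending portions, the phrase list, and the choice to add each probability directly to the score are all hypothetical; the claim does not specify them.

```python
# Hypothetical probabilities for each value of each feature (made-up numbers).
PROBS = {
    "location": {"beginning": 0.5, "middle": 0.2, "end": 0.3},
    "upper_case": {"absent": 0.4, "present": 0.6},
}


def location_value(index, total, edge_fraction=0.2):
    """Classify a sentence position as beginning, middle, or end.

    The size of the beginning and ending portions is an assumption.
    """
    n_edge = max(1, int(total * edge_fraction))
    if index < n_edge:
        return "beginning"
    if index >= total - n_edge:
        return "end"
    return "middle"


def upper_case_value(sentence, phrases):
    """Report whether the sentence contains any selected upper case phrase."""
    return "present" if any(p in sentence for p in phrases) else "absent"


def extract(sentences, phrases, k, probs=PROBS):
    """Score every sentence, then return the k best in document order."""
    scores = []
    for index, sentence in enumerate(sentences):          # steps a) and d)
        values = {                                        # step b)
            "location": location_value(index, len(sentences)),
            "upper_case": upper_case_value(sentence, phrases),
        }
        score = 0.0
        for feature, value in values.items():             # step c)
            # Assumption: the probability itself is added to the score.
            score += probs[feature][value]
        scores.append(score)
    # Step e): pick the k highest scoring sentences (ties keep document order).
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    return [sentences[i] for i in sorted(ranked[:k])]
```

Calling `extract(sentences, {"NASA"}, k=2)` on a list of sentences returns the two highest scoring sentences in their original order. Returning the extract in document order is a design choice made here for readability and is not required by the claim.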
Abstract
A method of automatically generating document extracts. The method makes use of feature value probabilities generated from a statistical analysis of manually generated summaries to extract the same set of sentences an expert might. The method is based upon an iterative approach. First, the computer system designates a sentence of the document as a selected sentence. Second, the computer system determines values for the selected sentence of each feature of a feature set. Third, the computer system increases a score for the selected sentence based upon the value of the feature for the selected sentence and upon the probability associated with that value. Fourth, after scoring all of the sentences of the document, the computer system selects a subset of the highest scoring sentences to be extracted.
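The abstract states only that the feature value probabilities come from a statistical analysis of manually generated summaries. One plausible reading, shown here purely as an assumption, is a relative frequency estimate: for each feature value, the fraction of summary sentences that carry that value. The function name `estimate_probabilities` and the input layout are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_probabilities(labeled_sentences):
    """Estimate a probability for each value of each feature.

    labeled_sentences is an iterable of (features, in_summary) pairs, where
    features maps a feature name to its value for one sentence and
    in_summary says whether a human summarizer selected that sentence.
    Assumption: the probability of a value is its relative frequency among
    the sentences that appear in the manual summaries.
    """
    summary_total = 0
    counts = defaultdict(Counter)
    for features, in_summary in labeled_sentences:
        if not in_summary:
            continue  # only summary sentences contribute to the estimate
        summary_total += 1
        for name, value in features.items():
            counts[name][value] += 1
    if summary_total == 0:
        raise ValueError("no summary sentences to estimate from")
    return {
        name: {value: n / summary_total for value, n in value_counts.items()}
        for name, value_counts in counts.items()
    }
```

The resulting nested dictionary has one probability per feature value, so it could serve as the lookup table consulted when scoring each sentence. A real system would likely also need smoothing for values never seen in a summary, which this sketch omits.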
7 Claims
7. An article of manufacture comprising:
a) a memory;
b) data stored in the memory, the data including a probability for each value of each feature of a feature set, the probabilities being generated from a statistical analysis of a document corpus and an associated corpus of manually generated summaries;
c) instructions stored in the memory, the instructions being accessible for extracting a subset of sentences from sentences of a natural language document in machine readable form, the document including a second multiplicity of sentences, the instructions representing the steps of;
1) designating a sentence of the document as a selected sentence;
2) determining values for the selected sentence of each feature of a feature set, the feature set including a location feature and an upper case feature, the location feature having a first location value, a second location value, and a third location value, the first location value indicating that the selected sentence is included within a beginning portion of the document, the second location value indicating that the selected sentence is included within a middle portion of the document, and the third location value indicating that the selected sentence is included within an ending portion of the document, the upper case feature having a first upper case value and a second upper case value, the first upper case value indicating that the selected sentence does not include any of a multiplicity of selected upper case phrases, the second upper case value indicating that the selected sentence includes a one of the selected upper case phrases;
3) for each feature increasing a score for the selected sentence based upon the value of the feature for the selected sentence and upon a probability associated with the value of the feature;
4) if all sentences of the document have not designated as the selected sentence, repeating steps c1) through c3); and
5) selecting the subset of sentences to be extracted based upon sentence scores.
Specification