Automatic method of extracting summarization using feature probabilities

US 5,918,240 A
Filed: 06/28/1995
Issued: 06/29/1999
Est. Priority Date: 06/28/1995
Status: Expired due to Term

4 Assignments

0 Petitions

Abstract

A method of automatically generating document extracts. The method makes use of feature value probabilities generated from a statistical analysis of manually generated summaries to extract the same set of sentences an expert might. The method is based upon an iterative approach. First, the computer system designates a sentence of the document as a selected sentence. Second, the computer system determines values for the selected sentence of each feature of a feature set. Third, the computer system increases a score for the selected sentence based upon the value of each feature for the selected sentence and upon the probability associated with that value. Fourth, after scoring all of the sentences of the document, the computer system selects a subset of the highest scoring sentences to be extracted.
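The iterative approach described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `score_sentences` and `extract`, the representation of a feature as a function returning a discrete value, and the choice to add each probability directly to the score are all assumptions made for the example. The abstract says only that the score is increased based on the feature value and its associated probability.

```python
from typing import Callable, Dict, List

# Hypothetical representation: a feature maps (sentence index, all sentences)
# to one of a small set of discrete values.
Feature = Callable[[int, List[str]], str]


def score_sentences(
    sentences: List[str],
    features: Dict[str, Feature],
    probs: Dict[str, Dict[str, float]],
) -> List[float]:
    """Score every sentence by visiting each one in turn."""
    scores = []
    for i in range(len(sentences)):            # designate the selected sentence
        score = 0.0
        for name, feature in features.items():
            value = feature(i, sentences)      # determine the feature value
            score += probs[name][value]        # assumed rule: add its probability
        scores.append(score)
    return scores


def extract(
    sentences: List[str],
    features: Dict[str, Feature],
    probs: Dict[str, Dict[str, float]],
    k: int,
) -> List[str]:
    """Return the k highest scoring sentences, in document order."""
    scores = score_sentences(sentences, features, probs)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]


# Toy usage with a made-up length feature and made-up probabilities.
toy_features = {"length": lambda i, s: "long" if len(s[i].split()) > 5 else "short"}
toy_probs = {"length": {"long": 0.8, "short": 0.2}}
toy_doc = ["Short.", "This sentence is quite a bit longer than that.", "Tiny."]
print(extract(toy_doc, toy_features, toy_probs, k=1))
```

Returning the extract in document order rather than score order is a design choice of this sketch; the abstract does not specify the output order.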

112 Citations

7 Claims

1. A processor implemented method of automatically extracting a subset of sentences from sentences of a natural language document presented in machine readable form to the processor, the document including a second multiplicity of sentences, the processor being coupled to a memory storing machine readable instructions for extracting sentences, the method comprising the steps of:
a) designating a sentence of the document as a selected sentence;

b) determining values for the selected sentence of each feature of a feature set, the feature set including a location feature and an upper case feature, the location feature having a first location value, a second location value, and a third location value, the first location value indicating that the selected sentence is included within a beginning portion of the document, the second location value indicating that the selected sentence is included within a middle portion of the document, and the third location value indicating that the selected sentence is included within an ending portion of the document, the upper case feature having a first upper case value and a second upper case value, the first upper case value indicating that the selected sentence does not include any of a multiplicity of selected upper case phrases, the second upper case value indicating that the selected sentence includes a one of the selected upper case phrases;

c) for each feature increasing a score for the selected sentence based upon the value of the feature for the selected sentence and upon a probability associated with the value of the feature;

d) if all sentences of the document have not been designated as the selected sentence, repeating steps a) through c); and

e) selecting the subset of sentences to be extracted based upon sentence scores.

2. The method of claim 1 wherein the feature set further comprises a direct theme feature, the direct theme feature having a first value indicating that the selected sentence represents a theme of the document, the direct theme feature having a second value indicating that the selected sentence does not represent a theme of the document.

3. The method of claim 1 wherein the feature set further comprises a cue word feature, the cue word feature having a first value indicating that the selected sentence summarizes the document, the cue word feature having a second value indicating that the selected sentence does not summarize the document.

4. The method of claim 1 wherein the feature set further comprises a length feature, the length feature having a first value indicating that the selected sentence exceeds a minimum length, and the length feature having a second value indicating that the selected sentence does not exceed the minimum length.

5. The method of claim 2 wherein the feature set further comprises a cue word feature, the cue word feature having a first value indicating that the selected sentence summarizes the document, the cue word feature having a second value indicating that the selected sentence does not summarize the document.

6. The method of claim 5 wherein the feature set further comprises a length feature, the length feature having a first value indicating that the selected sentence exceeds a minimum length, and the length feature having a second value indicating that the selected sentence does not exceed the minimum length.

7. An article of manufacture comprising:
a) a memory;

b) data stored in the memory, the data including a probability for each value of each feature of a feature set, the probabilities being generated from a statistical analysis of a document corpus and an associated corpus of manually generated summaries;

c) instructions stored in the memory, the instructions being accessible for extracting a subset of sentences from sentences of a natural language document in machine readable form, the document including a second multiplicity of sentences, the instructions representing the steps of:

1) designating a sentence of the document as a selected sentence;

2) determining values for the selected sentence of each feature of a feature set, the feature set including a location feature and an upper case feature, the location feature having a first location value, a second location value, and a third location value, the first location value indicating that the selected sentence is included within a beginning portion of the document, the second location value indicating that the selected sentence is included within a middle portion of the document, and the third location value indicating that the selected sentence is included within an ending portion of the document, the upper case feature having a first upper case value and a second upper case value, the first upper case value indicating that the selected sentence does not include any of a multiplicity of selected upper case phrases, the second upper case value indicating that the selected sentence includes a one of the selected upper case phrases;

3) for each feature increasing a score for the selected sentence based upon the value of the feature for the selected sentence and upon a probability associated with the value of the feature;

4) if all sentences of the document have not been designated as the selected sentence, repeating steps c1) through c3); and

5) selecting the subset of sentences to be extracted based upon sentence scores.
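As an illustration of the two features named in the independent claims, and of how per-value probabilities might be derived from a document corpus paired with manual summaries, consider the sketch below. Several details are assumptions and do not come from the claims: the one-third cut points for the beginning, middle, and ending portions; simple substring matching for upper case phrases; the value labels; and the estimate itself, which is the fraction of sentences with a given value that appear in the manual summary. The function names and the corpus layout (pairs of a sentence list and a set of summary sentence indices) are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def location_value(index: int, total: int) -> str:
    """Three-valued location feature; the one-third cut points are assumed."""
    if index < total / 3:
        return "beginning"
    if index < 2 * total / 3:
        return "middle"
    return "ending"


def upper_case_value(sentence: str, phrases: Set[str]) -> str:
    """Two-valued upper case feature over a given set of selected phrases."""
    return "present" if any(p in sentence for p in phrases) else "absent"


def estimate_location_probabilities(
    corpus: List[Tuple[List[str], Set[int]]],
) -> Dict[str, float]:
    """Estimate, per location value, the share of sentences found in summaries.

    Each corpus entry pairs a document's sentences with the indices of the
    sentences that a human summarizer selected (a hypothetical layout).
    """
    seen: Counter = Counter()
    in_summary: Counter = Counter()
    for sentences, summary_indices in corpus:
        for i in range(len(sentences)):
            value = location_value(i, len(sentences))
            seen[value] += 1
            if i in summary_indices:
                in_summary[value] += 1
    return {value: in_summary[value] / seen[value] for value in seen}


# Toy corpus: two three-sentence documents with made-up summary choices.
toy_corpus = [(["a", "b", "c"], {0}), (["d", "e", "f"], {0, 2})]
print(estimate_location_probabilities(toy_corpus))
```

The same counting scheme could be applied to any other discrete feature by swapping in its value function.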

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Chen, Francine R.; Pedersen, Jan O.; Putz, Steven B.; Brotsky, Daniel C.; Kupiec, Julian M.
Primary Examiner(s): Hayes, Gail O.
Assistant Examiner(s): COSIMANO, EDWARD R
Application Number: US08/495,986
Time in Patent Office: 1,462 Days
Field of Search: 395/793, 395/752, 395/759, 382/203, 382/224, 382/173, 707/531, 704/2, 704/9
US Class Current: 715/243
CPC Class Codes: G06F 16/345 Summarisation for human users
