Automatic method of generating feature probabilities for automatic extracting summarization

US 5,778,397 A
Filed: 06/28/1995
Issued: 07/07/1998
Est. Priority Date: 06/28/1995
Status: Expired due to Term

Abstract

A method of automatically generating feature probabilities that allow later automatic generation of document extracts. The computer system generates the probabilities by analyzing the document corpus one document at a time. First, the computer system designates one of the documents as a selected document. Next, the computer system analyzes each sentence of the selected document to determine the value of the paragraph feature and the value of the uppercase feature. The computer system repeats this process for each document of the document corpus. Afterward, the number of occurrences of each value of each feature is counted and used to calculate feature value probabilities for all of the features.

223 Citations

8 Claims

1. A method of automatically generating feature probabilities from a document corpus, each document including a multiplicity of sentences, the method comprising the steps of:
- a) designating as a selected document a document of the document corpus;
  
  b) designating as a selected sentence a one of the sentences of the selected document;
  
  c) determining a value of a location feature for the selected sentence, the location feature having a first location value, a second location value, and a third location value, the first location value indicating that the selected sentence is included within a beginning portion of the selected document, the second location value indicating that the selected sentence is included within a middle portion of the selected document, and the third location value indicating that the selected sentence is included within an ending portion of the selected document;
  
  d) determining a value of an upper case feature for the selected sentence, the upper case feature having a first upper case value and a second upper case value, the first upper case value indicating that the selected sentence does not include any of a multiplicity of selected upper case phrases, the selected upper case phrases forming a subset of upper case phrases included within the selected document, the second upper case value indicating the selected sentence includes a one of the selected upper case phrases;
  
  e) incrementing a location counter associated with the value of the location feature for the selected sentence;
  
  f) incrementing an upper case counter associated with the value of the upper case feature for the selected document;
  
  g) if all sentences of the selected document have not been designated as the selected sentence, repeating steps b) through f);
  
  h) if all documents of the document corpus have not been designated as the selected document, repeating steps a) through g);
  
  i) determining probabilities for each value of the location feature using the associated counter for each location feature value;
  
  j) determining the probabilities for each value of the upper case feature using the associated counter for each upper case feature value; and
  
  k) generating an extract for a first document presented in machine readable form to the user using the upper case feature, the location feature and the probabilities for each value of the upper case feature and the location feature.
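
Steps a) through j) of claim 1 amount to a counting pass over the corpus followed by a normalisation. The Python sketch below is one illustrative reading, not the patented implementation. The boundary rule in `location_value` (first and last fifth of the sentences), the substring test in `upper_case_value`, the `corpus` layout, and the relative-frequency estimate are all assumptions, because the claim only states that the probabilities are determined using the associated counters.

```python
from collections import Counter


def location_value(index, n_sentences):
    """Hypothetical rule: the first and last fifth of a document's
    sentences count as its beginning and ending portions."""
    edge = max(1, n_sentences // 5)
    if index < edge:
        return "beginning"
    if index >= n_sentences - edge:
        return "ending"
    return "middle"


def upper_case_value(sentence, selected_phrases):
    """Assumed test: plain substring match against the selected phrases."""
    if any(phrase in sentence for phrase in selected_phrases):
        return "has_phrase"
    return "no_phrase"


def feature_probabilities(corpus):
    """corpus: list of (sentences, selected_phrases) pairs, one per document.
    Returns two dicts mapping each observed feature value to its
    relative frequency across all sentences of the corpus."""
    location_counts = Counter()
    upper_counts = Counter()
    for sentences, phrases in corpus:                  # steps a) and h)
        for i, sentence in enumerate(sentences):       # steps b) and g)
            # steps c) and e): location value and its counter
            location_counts[location_value(i, len(sentences))] += 1
            # steps d) and f): upper case value and its counter
            upper_counts[upper_case_value(sentence, phrases)] += 1

    def normalise(counts):
        # steps i) and j), assumed to be a relative-frequency estimate
        total = sum(counts.values())
        return {value: count / total for value, count in counts.items()}

    return normalise(location_counts), normalise(upper_counts)
```

Integer arithmetic is used for the portion boundaries so that short documents still get at least one beginning sentence; a real system would choose those boundaries from the specification rather than from this placeholder rule.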

2. A method of automatically generating feature probabilities from a document corpus and a summary corpus of model summaries, each document of the document corpus being associated with a summary of the summary corpus, each document including a multiplicity of sentences, the multiplicity of sentences including a plurality of matching sentences, each matching sentence matching a sentence of the associated summary, the method comprising the steps of:
- a) designating as a selected document a document of the document corpus;
  
  b) designating as a selected sentence a one of the sentences of the selected document;
  
  c) determining values for the selected sentence of each feature of a feature set, the feature set including a location feature and an upper case feature, the location feature having a first location value, a second location value, and a third location value, the first location value indicating that the selected sentence is included within a beginning portion of the selected document, the second location value indicating that the selected sentence is included within a middle portion of the selected document, and the third location value indicating that the selected sentence is included within an ending portion of the selected document, each value of the location feature having an associated total counter, and an associated matching counter, the upper case feature having a first upper case value and a second upper case value, the first upper case value indicating that the selected sentence does not include any of a multiplicity of selected upper case phrases, the second upper case value indicating the selected sentence includes a one of the selected upper case phrases, each value of the upper case feature having an associated total counter and an associated matching counter;
  
  d) for each feature incrementing the total counter associated with the feature value for the selected sentence;
  
  e) if the selected sentence is a one of the plurality of matching sentences, for each feature incrementing the matching counter associated with the feature value for the selected sentence;
  
  f) if all sentences of the selected document have not been designated as the selected sentence, repeating steps b) through e);
  
  g) if all documents of the document corpus have not been designated as the selected document, repeating steps a) through f);
  
  h) for each value of each feature determining a probability using the associated total counter and the associated matching counter; and
  
  i) generating an extract for a first document presented in machine readable form to the user using the feature set and the probabilities for each value of each feature.
- Dependent claims (3, 4, 5, 6, 7):
  - 3. The method of claim 2 wherein the feature set further comprises a direct theme feature, the direct theme feature having a first value indicating that the selected sentence represents a theme of the selected document, the direct theme feature having a second value indicating that the selected sentence does not represent a theme of the selected document.
  - 4. The method of claim 3 wherein the feature set further comprises a cue word feature, the cue word feature having a first value indicating that the selected sentence summarizes the selected document, the cue word feature having a second value indicating that the selected sentence does not summarize the selected document.
  - 5. The method of claim 4 wherein the feature set further comprises a sentence length feature, the sentence length feature having a first value indicating that the selected sentence exceeds a minimum length, and the sentence length feature having a second value indicating that the selected sentence does not exceed the minimum length.
  - 6. The method of claim 2 wherein the feature set further comprises a cue word feature, the cue word feature having a first value indicating that the selected sentence summarizes the selected document, the cue word feature having a second value indicating that the selected sentence does not summarize the selected document.
  - 7. The method of claim 2 wherein the feature set further comprises a sentence length feature, the sentence length feature having a first value indicating that the selected sentence exceeds a minimum length, and the sentence length feature having a second value indicating that the selected sentence does not exceed the minimum length.
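
Claim 2 keeps two counters per feature value, a total counter and a matching counter, and the dependent claims extend the feature set. A minimal Python sketch of that bookkeeping follows. It rests on several assumptions that are not in the claims: the probability for a feature value is taken to be matching count divided by total count, the extract in step i) is produced by multiplying those probabilities per sentence and keeping the top `n`, and the feature functions `location` and `long_enough` (with its `min_words` threshold) are hypothetical stand-ins.

```python
from collections import defaultdict


def train(corpus, features):
    """corpus: list of (sentences, matching_indices) pairs, one per document.
    features: dict mapping a feature name to a function
              (index, sentences) -> feature value.
    Returns {(feature name, value): matching count / total count}."""
    total = defaultdict(int)
    matching = defaultdict(int)
    for sentences, matching_indices in corpus:        # steps a) and g)
        for i in range(len(sentences)):               # steps b) and f)
            for name, fn in features.items():         # step c)
                key = (name, fn(i, sentences))
                total[key] += 1                       # step d)
                if i in matching_indices:             # step e)
                    matching[key] += 1
    # step h): assumed estimator, the share of sentences with this
    # feature value that matched a sentence of the model summary
    return {key: matching[key] / total[key] for key in total}


def extract(sentences, features, probs, n=1):
    """Step i), sketched: score each sentence by the product of its
    feature-value probabilities; return the top n in document order."""
    def score(i):
        s = 1.0
        for name, fn in features.items():
            s *= probs.get((name, fn(i, sentences)), 0.0)
        return s

    best = sorted(range(len(sentences)), key=score, reverse=True)[:n]
    return [sentences[i] for i in sorted(best)]


# Hypothetical feature functions, for illustration only.
def location(i, sentences):
    edge = max(1, len(sentences) // 5)
    if i < edge:
        return "beginning"
    if i >= len(sentences) - edge:
        return "ending"
    return "middle"


def long_enough(i, sentences, min_words=5):
    return len(sentences[i].split()) > min_words
```

Passing the feature set as a dictionary of functions mirrors how claims 3 through 7 add features to the set of claim 2: a direct theme, cue word, or sentence length feature would be one more entry, with no change to the counting loop.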

8. An article of manufacture comprising:
- a) a memory; and
  
  b) instructions stored in the memory, the instructions for automatically generating feature probabilities from a document corpus and a summary corpus of manually generated summaries, each document of the document corpus being associated with a summary of the summary corpus, each document including a multiplicity of sentences, the multiplicity of sentences including a plurality of matching sentences, each matching sentence matching a sentence of the associated summary, the instructions comprising the steps of:
  
  1) designating as a selected document a document of the document corpus;
  
  2) designating as a selected sentence a one of the sentences of the selected document;
  
  3) determining values for the selected sentence of each feature of a feature set, the feature set including a location feature and an upper case feature, the location feature having a first location value, a second location value, and a third location value, the first location value indicating that the selected sentence is included within a beginning portion of the selected document, the second location value indicating that the selected sentence is included within a middle portion of the selected document, and the third location value indicating that the selected sentence is included within an ending portion of the selected document, each value of the location feature having an associated total counter, and an associated matching counter, the upper case feature having a first upper case value and a second upper case value, the first upper case value indicating that the selected sentence does not include any of a multiplicity of selected upper case phrases, the second upper case value indicating the selected sentence includes a one of the selected upper case phrases, each value of the upper case feature having an associated total counter and an associated matching counter;
  
  4) for each feature incrementing the total counter associated with the feature value for the selected sentence;
  
  5) if the selected sentence is a one of the plurality of matching sentences, for each feature incrementing the matching counter associated with the feature value for the selected sentence;
  
  6) if all sentences of the selected document have not been designated as the selected sentence, repeating steps b2) through b5);
  
  7) if all documents of the document corpus have not been designated as the selected document, repeating steps b1) through b6); and
  
  8) for each value of each feature determining a probability using the associated total counter and the associated matching counter.

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Chen, Francine R.; Brotsky, Daniel C.; Pedersen, Jan O.; Kupiec, Julian M.; Putz, Steven B.
Primary Examiner(s): Hudspeth, David R.
Assistant Examiner(s): Thomas, Joseph
Application Number: US08/495,865
Time in Patent Office: 1,105 Days
Field of Search: 395/751, 395/759, 395/761, 395/793, 395/792, 395/20, 395/23, 395/50, 395/75, 395/77, 704/1, 704/9, 707/500, 707/530, 707/531
US Class Current: 715/243
CPC Class Codes: G06F 16/345 Summarisation for human users
