Automatic correlation method for generating summaries for text documents

US 7,017,114 B2
Filed: 08/31/2001
Issued: 03/21/2006
Est. Priority Date: 09/20/2000
Status: Expired due to Fees

Abstract

A method and program product to generate summaries for text documents. A user can also specify a query, topic, and terms that he/she is interested in. This method determines the importance of each sentence by using the linguistic salience of the word to the user profile, the similarity among the word, the query and topic provided by a user and the sum of scores of the sentence comprising the word. After computing the score for each word, this method computes the score for each sentence in the set of sentences according to the score of words composing it and the position of the sentence in a section and a paragraph.

16 Claims

1. An automatic method for generating summaries for text documents, comprising steps of:
- generating a set of sentences for a set of documents by document discourse analysis and a set of words by morphologic process;
  
  initializing a word score for each word in the set of words, a sentence score for each sentence in the set of sentences and a score sum;
  
  computing an aggregated word score for said each word according to an aggregate of sentence scores of sentences containing said each word and to a degree of correlation between said each word and user related information;
  
  wherein said aggregated word score (SCORE[w]) has a weighted (λ) relationship with each of said aggregated sentence score (SCORE[s]), linguistic salience of said each word to a user profile (salience(w, user summarization profile)), similarities among said each word, a query and a provided topic (salience(w, user's query or topic)), similarities among said each word and terms in titles of the documents (salience(w, title words)), a ratio of an occurrence number for said each word in a document to a total occurrence number for said each word in the set of documents (FREQUENCY(w/d)/FREQUENCY(w/D)), and a ratio of a number of documents including said each word to a total number of documents in the set of documents (NUMBER(d, dw)/NUMBER(D)), of the form

  $$
  \begin{aligned}
  \mathrm{SCORE}[w] ={}& \lambda_1 \cdot \mathrm{salience}(w, \text{user summarization profile}) \\
  &+ \lambda_2 \cdot \mathrm{salience}(w, \text{user's query or topic}) \\
  &+ \lambda_3 \cdot \sum_{s \ni w} \mathrm{SCORE}[s] \\
  &+ \lambda_4 \cdot \mathrm{salience}(w, \text{title words}) \\
  &+ \lambda_5 \cdot \frac{\mathrm{FREQUENCY}(w/d)}{\mathrm{FREQUENCY}(w/D)} \\
  &+ \lambda_6 \cdot \frac{\mathrm{NUMBER}(d, dw)}{\mathrm{NUMBER}(D)};
  \end{aligned}
  $$
  
  computing an aggregated sentence score for said each sentence according to an aggregate of word scores composing said each sentence and a respective sentence position in a section and a paragraph;
  
  comparing an aggregate sum with said score sum, said aggregate sum being a sum of aggregated word scores and aggregated sentence scores; and
  
  if said aggregate sum is different than said score sum, returning to the step of computing the aggregated word score;
  
  otherwise, outputting top-ranked sentences according to sentence score as a summary of the set of documents, top-ranked words according to word score as a keywords list of the set of documents.
- - 2. An automatic method according to claim 1, wherein the step of computing the aggregated word score for said each word comprises:
    - computing a score for said each word according to linguistic salience of said each word to a user profile.
  - 3. An automatic method according to claim 1, wherein the step of computing the aggregated word score for said each word comprises:
    - computing a score for said each word according to similarities among said each word, a query and a provided topic.
  - 4. An automatic method according to claim 1, wherein the step of computing the aggregated word score for said each word comprises:
    - computing a score for said each word according to similarities among said each word and terms in titles of the documents.
  - 5. An automatic method according to claim 1, wherein the step of computing the aggregated word score for said each word comprises:
    - computing a score for said each word according to a ratio of an occurrence number for said each word in a document to a total occurrence number for said each word in the set of documents.
  - 6. An automatic method according to claim 1, wherein the step of computing the aggregated word score for said each word comprises:
    - computing a score for said each word according to a ratio of a number of documents including said each word to a total number of documents in the set of documents.
  - 7. An automatic method according to claim 1, wherein document discourse analysis comprises identifying titles, sections, lists, paragraph boundaries and sentence boundaries of the documents.
  - 8. An automatic method according to claim 1, wherein said aggregate sentence score further has a weighted relationship with each of said aggregated word score, sentence position (position(s, d)) and similarity (similarity(s, S)) of the form

    $$
    \mathrm{SCORE}[s] = \lambda_7 \cdot \sum_{w \in s} \mathrm{SCORE}[w] + \lambda_8 \cdot \mathrm{position}(s, d) + \lambda_9 \cdot \mathrm{similarity}(s, S).
    $$
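
The loop recited in claims 1 and 8 (score words from sentences, score sentences from words, repeat until the aggregate sum stops changing) can be sketched roughly as follows. This is an illustrative reading, not the patented implementation: the function name `summarize`, its parameters, the `1 / (i + 1)` position term, the tolerance `tol`, the iteration cap `max_iter`, and the normalisation of sentence scores are all assumptions introduced here. The user-profile, query, title, frequency and similarity terms (λ₁, λ₂, λ₄, λ₅, λ₆, λ₉) are left out for brevity, and the input is assumed to be non-empty.

```python
from collections import defaultdict


def summarize(sentences, top_sentences=1, top_words=2,
              lam3=1.0, lam7=1.0, lam8=0.1, tol=1e-9, max_iter=200):
    """Rank sentences and words by mutually reinforcing scores.

    `sentences` is a non-empty list of token lists in document order.
    Returns (indices of top sentences, top words).
    """
    # Map each word to the indices of the sentences that contain it.
    containing = defaultdict(set)
    for i, tokens in enumerate(sentences):
        for w in tokens:
            containing[w].add(i)

    n = len(sentences)
    # Initialise word scores, sentence scores and the score sum.
    sent_score = [1.0 / n] * n
    word_score = {w: 0.0 for w in containing}
    score_sum = 0.0
    # Assumed position term: earlier sentences get a larger value.
    position = [1.0 / (i + 1) for i in range(n)]

    for _ in range(max_iter):
        # Word step: lambda_3 * sum of scores of sentences containing w.
        new_word = {w: lam3 * sum(sent_score[i] for i in idx)
                    for w, idx in containing.items()}
        # Sentence step: lambda_7 * sum of word scores + lambda_8 * position.
        raw = [lam7 * sum(new_word[w] for w in set(tokens)) + lam8 * position[i]
               for i, tokens in enumerate(sentences)]
        # Assumption: rescale sentence scores to sum to 1 so values stay bounded.
        norm = sum(raw)
        new_sent = [v / norm for v in raw]

        # Compare the aggregate sum with the stored score sum.
        aggregate = sum(new_word.values()) + sum(new_sent)
        converged = abs(aggregate - score_sum) <= tol
        word_score, sent_score, score_sum = new_word, new_sent, aggregate
        if converged:
            break

    ranked_sents = sorted(range(n), key=lambda i: -sent_score[i])
    ranked_words = sorted(word_score, key=lambda w: -word_score[w])
    return ranked_sents[:top_sentences], ranked_words[:top_words]
```

The claim tests for exact equality of the two sums; the sketch compares them within a small tolerance because floating-point scores rarely repeat exactly, and it rescales sentence scores each round because unnormalised sums would otherwise grow without bound.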

9. A computer program product for automatically generating summaries for text documents, said computer program product comprising a computer usable medium having computer readable program code thereon, said computer readable program code comprising:
- computer program code means for generating a set of sentences for a set of documents by document discourse analysis and a set of words by morphologic process;
  
  computer program code means for initializing a word score for each word in the set of words, a sentence score for each sentence in the set of sentences and a score sum;
  
  computer program code means for computing an aggregated word score for said each word according to an aggregate of sentence scores of sentences containing said each word and computing a degree of correlation between said each word and user related information;
  
  computer program code means for computing an aggregated sentence score for each sentence in the set of sentences according to an aggregate of word scores composing it and a respective sentence position in a section and a paragraph;
  
  wherein said aggregated word score (SCORE[w]) has a weighted (λ) relationship with each of said aggregated sentence score (SCORE[s]), linguistic salience of said each word to a user profile (salience(w, user summarization profile)), similarities among said each word, a query and a provided topic (salience(w, user's query or topic)), similarities among said each word and terms in titles of the documents (salience(w, title words)), a ratio of an occurrence number for said each word in a document to a total occurrence number for said each word in the set of documents (FREQUENCY(w/d)/FREQUENCY(w/D)), and a ratio of a number of documents including said each word to a total number of documents in the set of documents (NUMBER(d, dw)/NUMBER(D)), of the form

  $$
  \begin{aligned}
  \mathrm{SCORE}[w] ={}& \lambda_1 \cdot \mathrm{salience}(w, \text{user summarization profile}) \\
  &+ \lambda_2 \cdot \mathrm{salience}(w, \text{user's query or topic}) \\
  &+ \lambda_3 \cdot \sum_{s \ni w} \mathrm{SCORE}[s] \\
  &+ \lambda_4 \cdot \mathrm{salience}(w, \text{title words}) \\
  &+ \lambda_5 \cdot \frac{\mathrm{FREQUENCY}(w/d)}{\mathrm{FREQUENCY}(w/D)} \\
  &+ \lambda_6 \cdot \frac{\mathrm{NUMBER}(d, dw)}{\mathrm{NUMBER}(D)};
  \end{aligned}
  $$
  
  computer program code means for computing an aggregate sum from aggregated word scores and aggregated sentence scores;
  
  computer program code means for determining if said aggregate sum is different than said score sum and for selectively replacing said score sum with said aggregate sum, each said word score with a corresponding said aggregated word score and each said sentence score with a corresponding said aggregated sentence score; and
  
  computer program code means for outputting top-ranked sentences according to sentence score as a summary of the set of documents, top-ranked words according to word score as a keywords list of the set of documents.
- - 10. A computer program product for automatically generating summaries according to claim 9, wherein the computer program code means for computing the aggregated word score for said each word comprises:
    - computer program code means for computing a score for said each word according to linguistic salience of said each word to a user profile.
  - 11. A computer program product for automatically generating summaries according to claim 9, wherein the computer program code means for computing the aggregated word score for said each word comprises:
    - computer program code means for computing a score for said each word according to similarities among said each word, a query and a provided topic.
  - 12. A computer program product for automatically generating summaries according to claim 9, wherein the computer program code means for computing the aggregated word score for said each word comprises:
    - computer program code means for computing a score for said each word according to similarities among said each word and terms in titles of the documents.
  - 13. A computer program product for automatically generating summaries according to claim 9, wherein the computer program code means for computing the aggregated word score for said each word comprises:
    - computer program code means for computing a score for said each word according to a ratio of an occurrence number for said each word in a document to a total occurrence number for said each word in the set of documents.
  - 14. A computer program product for automatically generating summaries according to claim 9, wherein the computer program code means for computing the aggregated word score for said each word comprises:
    - computer program code means for computing a score for said each word according to a ratio of a number of documents including said each word to a total number of documents in the set of documents.
  - 15. A computer program product for automatically generating summaries according to claim 9, wherein computer program code means for generating a set of sentences for a set of documents by document discourse analysis comprises computer program code means for identifying titles, sections, lists, paragraph boundaries and sentence boundaries of the documents.
  - 16. A computer program product for automatically generating summaries according to claim 9, wherein said aggregate sentence score further has a weighted relationship with each of said aggregated word score, sentence position (position(s, d)) and similarity (similarity(s, S)) of the form

    $$
    \mathrm{SCORE}[s] = \lambda_7 \cdot \sum_{w \in s} \mathrm{SCORE}[w] + \lambda_8 \cdot \mathrm{position}(s, d) + \lambda_9 \cdot \mathrm{similarity}(s, S).
    $$
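
The two corpus ratios recited in claims 13 and 14 (and in claims 5 and 6) are plain counts, and a minimal sketch of one way to compute them is shown below. The helper name `corpus_ratios`, its arguments, the representation of documents as token lists, and the choice to return `0.0` when the word never occurs in the set are assumptions made for illustration; the claims do not specify them.

```python
from collections import Counter


def corpus_ratios(word, doc_index, documents):
    """Return (frequency ratio, document ratio) for `word`.

    `documents` is a list of token lists; `doc_index` selects document d.
    """
    counts = [Counter(tokens) for tokens in documents]

    in_doc = counts[doc_index][word]                         # FREQUENCY(w/d)
    in_set = sum(c[word] for c in counts)                    # FREQUENCY(w/D)
    docs_with_word = sum(1 for c in counts if c[word] > 0)   # NUMBER(d, dw)

    # Assumption: a word absent from the whole set gets a ratio of 0.
    freq_ratio = in_doc / in_set if in_set else 0.0
    doc_ratio = docs_with_word / len(documents)              # divided by NUMBER(D)
    return freq_ratio, doc_ratio
```

In the word score, these two values would be multiplied by λ₅ and λ₆ respectively before being added to the other terms.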

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Yang, Li Ping; Guo, Zhi Li
Primary Examiner(s)
Bashore, William
Assistant Examiner(s)
Nguyen, Chau T

Application Number

US09/943,341
Publication Number

US 20020052901A1
Time in Patent Office

1,663 Days
Field of Search

715/530, 715/531, 707/1, 707/3-4, 707/6
US Class Current

715/247
CPC Class Codes

G06F 16/345   Summarisation for human users

G06F 40/20   Natural language analysis s...

Y10S 707/99931   Database or file accessing

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99934   Query formulation, input pr...

Y10S 707/99936   Pattern matching access
