Automatic correlation method for generating summaries for text documents
First Claim
1. An automatic method for generating summaries for text documents, comprising steps of:
generating a set of sentences for a set of documents by document discourse analysis and a set of words by morphologic process;
initializing a word score for each word in the set of words, a sentence score for each sentence in the set of sentences and a score sum;
computing an aggregated word score for said each word according to an aggregate of sentence scores of sentences containing said each word and to a degree of correlation between said each word and user related information;
wherein said aggregated word score (SCORE[w]) has a weighted (λ) relationship with each of said aggregated sentence score (SCORE[s]), linguistic salience of said each word to a user profile (salience(w, user summarization profile)), similarities among said each word, a query and a provided topic (salience(w, user's query or topic)), similarities among said each word and terms in titles of the documents (salience(w, title words)), a ratio of an occurrence number for said each word in a document to a total occurrence number for said each word in the set of documents (FREQUENCY(w/d)/FREQUENCY(w/D)), and a ratio of a number of documents including said each word to a total number of documents in the set of documents (NUMBER(d, dw)/NUMBER(D)), of the form
SCORE[w] = λ₁*salience(w, user summarization profile) + λ₂*salience(w, user's query or topic) + λ₃*Σ(SCORE[s], s ∋ w) + λ₄*salience(w, title words) + λ₅*FREQUENCY(w/d)/FREQUENCY(w/D) + λ₆*NUMBER(d, dw)/NUMBER(D);
computing an aggregated sentence score for said each sentence according to an aggregate of word scores composing said each sentence and a respective sentence position in a section and a paragraph;
comparing an aggregate sum with said score sum, said aggregate sum being a sum of aggregated word scores and aggregated sentence scores; and
if said aggregate sum is different than said score sum, returning to the step of computing the aggregated word score;
otherwise, outputting top-ranked sentences according to sentence score as a summary of the set of documents, top-ranked words according to word score as a keywords list of the set of documents.
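The six-term weighted sum recited in the claim can be illustrated with a short Python sketch. It is an illustration under stated assumptions only: the claim does not say how the salience terms are computed, so they are passed in as precomputed numbers, and the names `WordFeatures` and `word_score`, as well as the example weights and values, are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class WordFeatures:
    """Inputs to SCORE[w] for one word w in one document d (hypothetical container)."""
    profile_salience: float            # salience(w, user summarization profile)
    query_salience: float              # salience(w, user's query or topic)
    sentence_scores: Sequence[float]   # SCORE[s] for each sentence s containing w
    title_salience: float              # salience(w, title words)
    freq_in_doc: int                   # FREQUENCY(w/d)
    freq_in_set: int                   # FREQUENCY(w/D)
    docs_with_word: int                # NUMBER(d, dw)
    total_docs: int                    # NUMBER(D)


def word_score(f: WordFeatures, lam: Sequence[float]) -> float:
    """Weighted sum of the six terms; lam holds the weights lambda_1..lambda_6."""
    if len(lam) != 6:
        raise ValueError("expected six weights")
    return (
        lam[0] * f.profile_salience
        + lam[1] * f.query_salience
        + lam[2] * sum(f.sentence_scores)
        + lam[3] * f.title_salience
        + lam[4] * f.freq_in_doc / f.freq_in_set
        + lam[5] * f.docs_with_word / f.total_docs
    )


# Made-up example values, with all weights set to 1 for readability.
features = WordFeatures(
    profile_salience=0.5, query_salience=1.0, sentence_scores=[0.25, 0.75],
    title_salience=0.0, freq_in_doc=2, freq_in_set=4, docs_with_word=1, total_docs=2,
)
example = word_score(features, lam=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
```

With these made-up inputs the terms are 0.5, 1.0, 1.0, 0.0, 0.5 and 0.5, so `example` is 3.5. Choosing the actual weights is left open by the claim.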
Abstract
A method and program product to generate summaries for text documents. A user can also specify a query, a topic, and terms of interest. The method determines the importance of each sentence by using the linguistic salience of each word to the user profile, the similarity among the word, the query, and the topic provided by the user, and the sum of the scores of the sentences containing the word. After computing the score for each word, the method computes the score for each sentence in the set of sentences according to the scores of the words composing it and the position of the sentence in a section and a paragraph.
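As a rough illustration of the sentence-scoring step, the Python sketch below combines the scores of the words in a sentence with a position factor. The source states only that word scores and the sentence position in a section and a paragraph both contribute; the multiplicative form, the `position_weight` function, and its "earlier counts more" rule are assumptions made for this example.

```python
from typing import Dict, List


def position_weight(index_in_section: int, index_in_paragraph: int) -> float:
    """Hypothetical weighting (an assumption): earlier sentences count more."""
    return 1.0 / (1 + index_in_section) + 1.0 / (1 + index_in_paragraph)


def sentence_score(words: List[str], word_scores: Dict[str, float],
                   index_in_section: int, index_in_paragraph: int) -> float:
    """Aggregate the scores of the sentence's words, then scale by position."""
    total = sum(word_scores.get(w, 0.0) for w in words)
    return total * position_weight(index_in_section, index_in_paragraph)


# Made-up word scores for illustration.
word_scores = {"summary": 2.0, "document": 1.0}
first = sentence_score(["summary", "document"], word_scores, 0, 0)
later = sentence_score(["summary", "document"], word_scores, 3, 1)
```

Here the same words score 6.0 in the opening sentence of a section and paragraph and 2.25 when placed fourth in the section and second in the paragraph. Any other monotone position weighting would serve the illustration equally well.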
16 Claims
1. An automatic method for generating summaries for text documents, comprising steps of:
generating a set of sentences for a set of documents by document discourse analysis and a set of words by morphologic process;
initializing a word score for each word in the set of words, a sentence score for each sentence in the set of sentences and a score sum;
computing an aggregated word score for said each word according to an aggregate of sentence scores of sentences containing said each word and to a degree of correlation between said each word and user related information;
wherein said aggregated word score (SCORE[w]) has a weighted (λ) relationship with each of said aggregated sentence score (SCORE[s]), linguistic salience of said each word to a user profile (salience(w, user summarization profile)), similarities among said each word, a query and a provided topic (salience(w, user's query or topic)), similarities among said each word and terms in titles of the documents (salience(w, title words)), a ratio of an occurrence number for said each word in a document to a total occurrence number for said each word in the set of documents (FREQUENCY(w/d)/FREQUENCY(w/D)), and a ratio of a number of documents including said each word to a total number of documents in the set of documents (NUMBER(d, dw)/NUMBER(D)), of the form
SCORE[w] = λ₁*salience(w, user summarization profile) + λ₂*salience(w, user's query or topic) + λ₃*Σ(SCORE[s], s ∋ w) + λ₄*salience(w, title words) + λ₅*FREQUENCY(w/d)/FREQUENCY(w/D) + λ₆*NUMBER(d, dw)/NUMBER(D);
computing an aggregated sentence score for said each sentence according to an aggregate of word scores composing said each sentence and a respective sentence position in a section and a paragraph;
comparing an aggregate sum with said score sum, said aggregate sum being a sum of aggregated word scores and aggregated sentence scores; and
if said aggregate sum is different than said score sum, returning to the step of computing the aggregated word score;
otherwise, outputting top-ranked sentences according to sentence score as a summary of the set of documents, top-ranked words according to word score as a keywords list of the set of documents.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A computer program product for automatically generating summaries for text documents, said computer program product comprising a computer usable medium having computer readable program code thereon, said computer readable program code comprising:
computer program code means for generating a set of sentences for a set of documents by document discourse analysis and a set of words by morphologic process;
computer program code means for initializing a word score for each word in the set of words, a sentence score for each sentence in the set of sentences and a score sum;
computer program code means for computing an aggregated word score for said each word according to an aggregate of sentence scores of sentences containing said each word and computing a degree of correlation between said each word and user related information;
computer program code means for computing an aggregated sentence score for each sentence in the set of sentences according to an aggregate of word scores composing it and a respective sentence position in a section and a paragraph;
wherein said aggregated word score (SCORE[w]) has a weighted (λ) relationship with each of said aggregated sentence score (SCORE[s]), linguistic salience of said each word to a user profile (salience(w, user summarization profile)), similarities among said each word, a query and a provided topic (salience(w, user's query or topic)), similarities among said each word and terms in titles of the documents (salience(w, title words)), a ratio of an occurrence number for said each word in a document to a total occurrence number for said each word in the set of documents (FREQUENCY(w/d)/FREQUENCY(w/D)), and a ratio of a number of documents including said each word to a total number of documents in the set of documents (NUMBER(d, dw)/NUMBER(D)), of the form
SCORE[w] = λ₁*salience(w, user summarization profile) + λ₂*salience(w, user's query or topic) + λ₃*Σ(SCORE[s], s ∋ w) + λ₄*salience(w, title words) + λ₅*FREQUENCY(w/d)/FREQUENCY(w/D) + λ₆*NUMBER(d, dw)/NUMBER(D);
computer program code means for computing an aggregate sum from aggregated word scores and aggregated sentence scores;
computer program code means for determining if said aggregate sum is different than said score sum and for selectively replacing said score sum with said aggregate sum, each said word score with a corresponding said aggregated word score and each said sentence score with a corresponding said aggregated sentence score; and
computer program code means for outputting top-ranked sentences according to sentence score as a summary of the set of documents, top-ranked words according to word score as a keywords list of the set of documents.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
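The update-and-compare loop recited above (compute aggregated scores, compare the aggregate sum with the stored score sum, replace the stored values if they differ, otherwise output the ranking) can be sketched in Python as follows. This is a simplified illustration, not the patented implementation: it keeps only the sentence-score term of SCORE[w] and omits the user-related, title, frequency, and position terms. The rescaling of scores by their maximum, the tolerance `tol`, the iteration cap `max_iter`, and the function name `rank` are all assumptions added so that the toy loop terminates.

```python
from typing import Dict, List, Tuple


def rank(sentences: List[List[str]], tol: float = 1e-9,
         max_iter: int = 100) -> Tuple[List[int], List[str]]:
    """Return sentence indices and words, each ordered from highest to lowest score."""
    words = sorted({w for s in sentences for w in s})
    word_score: Dict[str, float] = {w: 1.0 for w in words}   # initial word scores
    sent_score: List[float] = [1.0] * len(sentences)          # initial sentence scores
    score_sum = 0.0                                            # initial score sum

    for _ in range(max_iter):
        # Aggregated word score: sum of the scores of the sentences containing the word.
        agg_word = {w: sum(sc for s, sc in zip(sentences, sent_score) if w in s)
                    for w in words}
        top = max(agg_word.values(), default=0.0) or 1.0
        agg_word = {w: v / top for w, v in agg_word.items()}   # assumed rescaling

        # Aggregated sentence score: sum of the scores of the distinct words it contains.
        agg_sent = [sum(agg_word[w] for w in sorted(set(s))) for s in sentences]
        top = max(agg_sent, default=0.0) or 1.0
        agg_sent = [v / top for v in agg_sent]                 # assumed rescaling

        aggregate_sum = sum(agg_word.values()) + sum(agg_sent)
        if abs(aggregate_sum - score_sum) <= tol:
            break  # sums agree (within the assumed tolerance): stop iterating
        # Sums differ: replace the stored scores and the score sum, then repeat.
        word_score, sent_score, score_sum = agg_word, agg_sent, aggregate_sum

    sentence_order = sorted(range(len(sentences)), key=lambda i: -sent_score[i])
    keyword_order = sorted(words, key=lambda w: -word_score[w])
    return sentence_order, keyword_order


# Toy input: each sentence is already reduced to a list of word stems.
docs = [["text", "summary", "method"], ["summary", "score"], ["weather"]]
sentence_order, keywords = rank(docs)
```

In this toy input the first sentence ranks highest and "summary", which occurs in the two highest-scoring sentences, is the top keyword. Taking the first few entries of `sentence_order` and `keywords` corresponds to outputting the top-ranked sentences as the summary and the top-ranked words as the keyword list.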
Specification