DOCUMENT SUMMARIZATION
Abstract
Systems, methods, and other embodiments associated with automatically summarizing a document are described. One method embodiment includes computing term scores for members of a set of terms in a document to be summarized and computing sentence scores for sentences in a set of sentences in the document. The method embodiment also includes computing a set of entries for a term-sentence matrix that relates terms to sentences. The method embodiment also includes computing a dominant topic for the document and simultaneously ranking the set of terms and the set of sentences based on the dominant topic. The method embodiment provides a summarization item(s) selected from the set of terms and/or the set of sentences.
8 Claims
1. A method, comprising:
providing a summarization item for a document to be summarized, where the summarization item is selected based, at least in part, on a term score for a member of a set of terms in the document to be summarized, where the term score depends, at least in part, on the number of uni-grams in a term.
3. A method, comprising:
determining one or more sentence boundaries in a document to be summarized, a sentence containing one or more terms, a term appearing in one or more sentences;
establishing a set of terms in the document, a term comprising an n-gram, n being an integer from one to four;
applying a simplified Porter stemming process to one or more members of the set of terms;
selectively removing from the set of terms one or more terms that begin with a stop-word;
selectively removing from the set of terms one or more terms that end with a stop-word;
selectively removing from the set of terms one or more terms that are stop-words;
computing a term score for a member of the set of terms according to:

$$S(t) = \left(a - b\left(\frac{sf_t}{N} - c\right)^{2}\right) \cdot f(ng)$$

S(t) being the term score; a, b, and c being pre-determined, configurable constants; sf_t being the number of sentences in which term t occurs; N being the total number of sentences; and f(ng) being a function that returns a penalizing value for terms having a single uni-gram, a linearly increasing value for terms having two to four uni-grams, and a constant value for terms having more than four uni-grams;
computing a sentence score for a member of a set of sentences in the document, the sentence score depending, at least in part, on a length of the sentence measured in uni-grams and a position of the sentence in a paragraph in the document;
computing a set of entries for a term-sentence matrix that relates members of the set of terms to members of the set of sentences, a value of an entry in the set of entries for the term-sentence matrix depending, at least in part, on a term score and a sentence score;
computing a dominant topic for the document to be summarized by computing a term eigenvector, where a member of the term eigenvector represents a relevancy of a term to the dominant topic, and by computing a sentence eigenvector, where a member of the sentence eigenvector represents a relevancy of a sentence to the dominant topic;
simultaneously ranking the set of terms and the set of sentences based, at least in part, on the dominant topic;
providing a summarization item selected from one or more of, the set of terms, and the set of sentences, the summarization item being selected based, at least in part, on one or more of, a ranking of the set of terms, and a ranking of the set of sentences;
logically removing one or more members of the set of terms from the set of terms based on a relation with the dominant topic and logically removing one or more members of the set of sentences from the set of sentences based on a relation with the dominant topic; and
determining a subsequent dominant topic.
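The term-scoring steps of claim 3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the stop-word list, the constants a, b, and c, the numeric values returned by `f_ng`, and the naive sentence splitter are all assumptions chosen for illustration, and the simplified Porter stemming step is omitted.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the claim does not enumerate one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}
# Illustrative values only; the claim says a, b, c are configurable constants.
A, B, C = 1.0, 4.0, 0.5


def split_sentences(text):
    """Naive sentence-boundary detection: split after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def candidate_terms(sentence, max_n=4):
    """Return the 1- to 4-grams that neither begin nor end with a stop-word.

    A uni-gram that is itself a stop-word both begins and ends with one,
    so it is dropped by the same check.
    """
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    terms = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            terms.add(" ".join(gram))
    return terms


def f_ng(n):
    """Assumed shape of f(ng): penalty for one uni-gram, linear growth
    for two to four uni-grams, constant above four."""
    if n == 1:
        return 0.5
    if n <= 4:
        return 1.0 + 0.5 * (n - 2)
    return 2.0


def term_scores(text):
    """Compute S(t) = (a - b * (sf_t / N - c)**2) * f(ng) for each term."""
    sentences = split_sentences(text)
    n_sent = len(sentences)
    sf = defaultdict(int)  # sf[t]: number of sentences containing term t
    for sentence in sentences:
        for term in candidate_terms(sentence):
            sf[term] += 1
    return {
        t: (A - B * (sf[t] / n_sent - C) ** 2) * f_ng(len(t.split()))
        for t in sf
    }


scores = term_scores("The cat sat. The cat ran. A dog barked.")
```

With these assumed constants, the quadratic factor peaks for terms that occur in about half of the sentences, and the single-word term "cat" is penalized relative to the two-word term "cat sat".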
4. A method, comprising:
producing a term vector related to one or more terms in a text to be summarized; producing a sentence vector related to one or more sentences in the text, a sentence containing one or more terms; producing a term eigenvector based on the term vector and the sentence vector; producing a sentence eigenvector based on the term vector and the sentence vector; producing a principal eigenvalue related to the text; simultaneously calculating a term ranking and a sentence ranking based, at least in part, on the principal eigenvalue; and providing a text summary based, at least in part, on the term ranking and the sentence ranking, the text summary comprising one or more of, a term, and a sentence.
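One common way to obtain a term eigenvector, a sentence eigenvector, and a principal eigenvalue together is alternating power iteration on a term-sentence matrix. The sketch below assumes that approach; the claim does not name a numerical method, and the function `co_rank`, the example matrix, and the iteration count are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def co_rank(matrix, iterations=100):
    """Alternating power iteration on a non-negative term-sentence matrix.

    matrix[i][j] is the weight of term i in sentence j. Returns the term
    vector, the sentence vector, and the principal eigenvalue of
    matrix * matrix^T (the square of the largest singular value).
    """
    n_terms, n_sents = len(matrix), len(matrix[0])
    sent_vec = normalize([1.0] * n_sents)
    term_vec = [0.0] * n_terms
    sigma = 0.0
    for _ in range(iterations):
        # A term is relevant if it appears in relevant sentences.
        term_vec = normalize([
            sum(matrix[i][j] * sent_vec[j] for j in range(n_sents))
            for i in range(n_terms)
        ])
        # A sentence is relevant if it contains relevant terms.
        raw = [
            sum(matrix[i][j] * term_vec[i] for i in range(n_terms))
            for j in range(n_sents)
        ]
        sigma = math.sqrt(sum(x * x for x in raw))
        sent_vec = normalize(raw)
    return term_vec, sent_vec, sigma * sigma


# Hypothetical 3-term by 3-sentence matrix.
M = [
    [1.0, 1.0, 0.0],  # term 0 occurs in sentences 0 and 1
    [1.0, 0.0, 0.0],  # term 1 occurs in sentence 0
    [0.0, 0.0, 0.5],  # term 2 occurs only in sentence 2, weakly
]
terms, sents, eigenvalue = co_rank(M)
term_ranking = sorted(range(len(terms)), key=lambda i: terms[i], reverse=True)
sent_ranking = sorted(range(len(sents)), key=lambda j: sents[j], reverse=True)
```

Both rankings come out of the same iteration, which is one reading of "simultaneously calculating" in the claim. Under the same assumption, a subsequent topic could be found by removing the top-ranked terms and sentences from the matrix and repeating the iteration.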
Specification