Apparatus for and method of summarising text
First Claim
1. An apparatus including a processor to identify topics of document data, the apparatus comprising:
- a word ranker configured to rank words that are present in or representative of content of the document data;
- a co-occurrence ranker configured to rank co-occurrences of words that are present in or representative of the content of the document data;
- a phrase ranker configured to rank phrases in the document data;
- a words selector configured to select the words with a highest ranking;
- a co-occurrence identifier configured to identify which of the co-occurrences with a highest ranking contain at least one of the highest ranking words;
- a phrase identifier configured to identify the phrases containing at least one word from the identified co-occurrences by concatenating consecutive nouns, concatenating consecutive proper nouns, and concatenating consecutive adjectives with a final noun;
- a phrase selector configured to select one or ones of the identified phrases with a highest ranking as a topic or topics of the document data; and
- an outputter configured to output data relating to the selected topics.
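
The phrase identifier's three concatenation rules can be illustrated with a minimal Python sketch. It is not taken from the patent: the tag names (`NOUN`, `PROPN`, `ADJ`), the function names `identify_phrases` and `phrases_containing`, and the assumption that the text has already been split into (word, tag) pairs by some part-of-speech tagger are all hypothetical. The sketch also reads "a final noun" as exactly one noun, which is only one possible interpretation of the claim.

```python
from typing import List, Set, Tuple

# Hypothetical tag names; a real tagger may use a different tag set.
NOUN, PROPER_NOUN, ADJECTIVE = "NOUN", "PROPN", "ADJ"


def identify_phrases(tagged: List[Tuple[str, str]]) -> List[str]:
    """Collect candidate phrases from (word, tag) pairs."""
    phrases: List[str] = []
    i, n = 0, len(tagged)
    while i < n:
        tag = tagged[i][1]
        j = i
        if tag in (NOUN, PROPER_NOUN):
            # Rule 1 and 2: a run of consecutive nouns, or of consecutive proper nouns.
            while j < n and tagged[j][1] == tag:
                j += 1
        elif tag == ADJECTIVE:
            # Rule 3: a run of consecutive adjectives closed by one noun (assumption).
            while j < n and tagged[j][1] == ADJECTIVE:
                j += 1
            if j < n and tagged[j][1] == NOUN:
                j += 1
            else:
                i = j  # adjectives with no final noun form no phrase
                continue
        else:
            i += 1
            continue
        phrases.append(" ".join(word for word, _ in tagged[i:j]))
        i = j
    return phrases


def phrases_containing(phrases: List[str], words: Set[str]) -> List[str]:
    """Keep phrases that share at least one word with the given word set."""
    return [p for p in phrases if words & set(p.lower().split())]
```

In this sketch a single noun counts as a phrase of length one, and `phrases_containing` stands in for the restriction to phrases containing at least one word from the identified co-occurrences.
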
Abstract
Apparatus for identifying topics of document data has:
- a word ranker (171) for ranking words that are present in or representative of the content of the document data;
- a co-occurrence ranker (172) for ranking co-occurrences of words that are present in or representative of the content of the document data;
- a phrase ranker (170) for ranking phrases in the document data;
- a word selector (174) for selecting the highest ranking words;
- a co-occurrence identifier (176) for identifying which of the highest ranking co-occurrences contain at least one of the highest ranking words;
- a phrase identifier (177) for identifying the phrases containing at least one word from the identified co-occurrences;
- a phrase selector (178) for selecting the highest ranking one or ones of the identified phrases as the topic or topics of the document data; and
- an output device (40) for outputting data relating to the selected topics.
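
The chain of components listed above can be sketched end to end in Python. This is a hedged illustration, not the patented implementation: the source does not say here how rankings are computed, so the sketch assumes plain frequency counts, treats a co-occurrence as an unordered pair of distinct words in the same sentence, and expects candidate phrases to be supplied by the caller. The function name `identify_topics` and its parameters are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import List


def identify_topics(sentences: List[List[str]], phrases: List[str],
                    top_words: int = 3, top_pairs: int = 3,
                    top_phrases: int = 2) -> List[str]:
    """Return the highest ranking phrases linked to top words via co-occurrences."""
    # Word ranker: frequency of each word (assumed ranking measure).
    word_rank = Counter(w for s in sentences for w in s)
    # Co-occurrence ranker: frequency of each unordered word pair per sentence.
    pair_rank = Counter(
        pair for s in sentences for pair in combinations(sorted(set(s)), 2)
    )
    # Phrase ranker: frequency of each candidate phrase.
    phrase_rank = Counter(phrases)

    # Word selector: the highest ranking words.
    best_words = {w for w, _ in word_rank.most_common(top_words)}
    # Co-occurrence identifier: top pairs containing at least one top word.
    kept_pairs = [p for p, _ in pair_rank.most_common(top_pairs)
                  if best_words & set(p)]
    pair_words = {w for p in kept_pairs for w in p}
    # Phrase identifier: phrases containing a word from the kept pairs.
    candidates = [ph for ph in phrase_rank if pair_words & set(ph.split())]
    # Phrase selector: the highest ranking identified phrases become topics.
    candidates.sort(key=lambda ph: -phrase_rank[ph])
    return candidates[:top_phrases]
```

The cut-offs `top_words`, `top_pairs` and `top_phrases` are illustrative knobs; how many items count as "highest ranking" is left open by the text above.
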
111 Citations
36 Claims
1. An apparatus including a processor to identify topics of document data, the apparatus comprising:
- a word ranker configured to rank words that are present in or representative of content of the document data;
- a co-occurrence ranker configured to rank co-occurrences of words that are present in or representative of the content of the document data;
- a phrase ranker configured to rank phrases in the document data;
- a words selector configured to select the words with a highest ranking;
- a co-occurrence identifier configured to identify which of the co-occurrences with a highest ranking contain at least one of the highest ranking words;
- a phrase identifier configured to identify the phrases containing at least one word from the identified co-occurrences by concatenating consecutive nouns, concatenating consecutive proper nouns, and concatenating consecutive adjectives with a final noun;
- a phrase selector configured to select one or ones of the identified phrases with a highest ranking as a topic or topics of the document data; and
- an outputter configured to output data relating to the selected topics.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
18. An apparatus including a processor to identify topics of document data, the apparatus comprising:
- a word ranker configured to rank words that are present in or representative of the content of the document data;
- a co-occurrence ranker configured to rank co-occurrences of words that are present in or representative of the content of the document data;
- a phrase ranker configured to rank phrases in the document data;
- a words selector configured to select the highest ranking words;
- a co-occurrence identifier configured to identify which of the highest ranking co-occurrences contain at least one of the highest ranking words;
- a phrase identifier configured to identify the phrases containing at least one word from the identified co-occurrences;
- a phrase selector configured to select the highest ranking one or ones of the identified phrases as the topic or topics of the document data;
- an outputter configured to output data relating to the selected topics;
- a text splitter configured to split the document data into text segments;
- a classifier configured to classify the selected topics of the distribution in the text segments which define main and subsidiary topics in the document data, wherein the outputter is configured to output data relating to the classified topics; and
- a topic hierarchy identifier configured to identify a topic as being a child or subsidiary topic of another topic when text portions in which that subsidiary topic occurs represent a sub-set of the text portions in which the said other topic occurs, wherein the outputter is configured to output data relating to the identified topic hierarchy.
Dependent claims: 19, 20, 21
22. An apparatus including a processor to identify topics of document data, the apparatus comprising:
- a word ranker configured to rank words that are present in or representative of the content of the document data;
- a co-occurrence ranker configured to rank co-occurrences of words that are present in or representative of the content of the document data;
- a phrase ranker configured to rank phrases in the document data;
- a words selector configured to select the highest ranking words;
- a co-occurrence identifier configured to identify which of the highest ranking co-occurrences contain at least one of the highest ranking words;
- a phrase identifier configured to identify the phrases containing at least one word from the identified co-occurrences;
- a phrase selector configured to select the highest ranking one or ones of the identified phrases as the topic or topics of the document data;
- an outputter configured to output data relating to the selected topics;
- a text splitter configured to split the document data into text segments;
- a classifier configured to classify the selected topics of the distribution in the text segments which define main and subsidiary topics in the document data, wherein the outputter is configured to output data relating to the classified topics; and
- a topic hierarchy identifier configured to identify a topic as being a child or subsidiary topic of another topic when the text segments in which that subsidiary topic occurs represent a sub-set of the text segments in which the said other topic occurs, wherein the outputter is configured to output data relating to the identified topic hierarchy.
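
The sub-set test used by the topic hierarchy identifier can be sketched in a few lines of Python. This is an illustration under stated assumptions: text segments are represented by integer indices, the function name `topic_hierarchy` is hypothetical, and "sub-set" is read as a proper sub-set so that two topics with identical distributions are not made children of each other. The claim language does not settle that last point.

```python
from typing import Dict, List, Set


def topic_hierarchy(topic_segments: Dict[str, Set[int]]) -> Dict[str, List[str]]:
    """Map each topic to the topics whose segments are a sub-set of its own."""
    children: Dict[str, List[str]] = {topic: [] for topic in topic_segments}
    for parent, parent_segs in topic_segments.items():
        for child, child_segs in topic_segments.items():
            # Proper, non-empty sub-set (assumption): the child occurs only in
            # segments where the parent also occurs, and in fewer of them.
            if child != parent and child_segs and child_segs < parent_segs:
                children[parent].append(child)
    return children
```

A topic that occurs in a segment where no other topic occurs, such as an isolated aside, ends up with neither a parent nor children in this sketch.
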
23. An apparatus including a processor to identify topics of document data, the apparatus comprising:
- a word ranker configured to rank words that are present in or representative of content of the document data;
- a co-occurrence ranker configured to rank co-occurrences of words that are present in or representative of the content of the document data;
- a phrase ranker configured to rank phrases in the document data;
- a words selector configured to select the highest ranking words;
- a co-occurrence identifier configured to identify which of the highest ranking co-occurrences contain at least one of the highest ranking words;
- a phrase identifier configured to identify the phrases containing at least one word from the identified co-occurrences;
- a phrase selector configured to select the highest ranking one or ones of the identified phrases as the topic or topics of the document data; and
- a summary provider configured to provide summary data on the basis of the selected topics, wherein the summary provider comprises a sentence selector configured to select sentences to use in the summary data, wherein the sentence selector comprises:
  - a topic weight assigner configured to assign weights to the topics;
  - a sentence weight assigner configured to assign weights to sentences in the document data;
  - a scorer configured to score the sentences by summing the assigned topic and sentence weights;
  - a selector configured to select the sentence or sentences having the highest score or scores;
  - a topic weight adjuster configured to relatively reduce the weight allocated to the topic or topics in the selected sentence or sentences, wherein the topic weight adjuster is configured to set to zero the weight of any topic in the selected sentence or sentences;
  - a controller configured to cause the scorer, selector and topic weight adjuster to repeat the above operations until a predetermined number of sentences has been selected for the summary from the document data; and
- an outputter configured to output the summary data.
Dependent claims: 24, 25, 26, 27, 28, 29
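
The score, select, zero and repeat loop of the sentence selector can be sketched as a greedy procedure in Python. This is a minimal illustration, not the claimed apparatus: the function name `select_sentences` is hypothetical, topics are assumed to be lower-case strings, a topic is taken to occur in a sentence when it appears as a substring, and the weights are supplied by the caller because the claim does not say how they are derived.

```python
from typing import Dict, List


def select_sentences(sentences: List[str], topic_weights: Dict[str, float],
                     sentence_weights: List[float], count: int) -> List[int]:
    """Greedily pick `count` sentence indices, zeroing topics already covered."""
    weights = dict(topic_weights)  # work on a copy; the caller's dict is untouched
    chosen: List[int] = []

    def topics_in(index: int) -> List[str]:
        # Substring matching is a simplification (assumption).
        text = sentences[index].lower()
        return [t for t in weights if t in text]

    while len(chosen) < min(count, len(sentences)):
        remaining = [i for i in range(len(sentences)) if i not in chosen]
        # Scorer: sentence weight plus the current weights of its topics.
        scores = {i: sentence_weights[i] + sum(weights[t] for t in topics_in(i))
                  for i in remaining}
        # Selector: highest score wins; ties go to the earliest sentence.
        best = max(remaining, key=lambda i: scores[i])
        chosen.append(best)
        # Topic weight adjuster: set to zero every topic in the chosen sentence.
        for t in topics_in(best):
            weights[t] = 0.0
    return sorted(chosen)  # document order, a presentation choice of this sketch
```

Zeroing the weights of topics already covered steers later rounds towards sentences about topics the summary has not yet mentioned, which is why a slightly lower scoring sentence on a new topic can beat a near-duplicate of the first pick.
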
30. An apparatus including a processor to identify topics of document data, the apparatus comprising:
- a word ranker configured to rank words that are present in or representative of the content of the document data;
- a co-occurrence ranker configured to rank co-occurrences of words that are present in or representative of the content of the document data;
- a phrase ranker configured to rank phrases in the document data;
- a words selector configured to select the highest ranking words;
- a co-occurrence identifier configured to identify which of the highest ranking co-occurrences contain at least one of the highest ranking words;
- a phrase identifier configured to identify the phrases containing at least one word from the identified co-occurrences;
- a phrase selector configured to select the highest ranking one or ones of the identified phrases as the topic or topics of the document data;
- a summary provider configured to provide summary data on the basis of the selected topics, wherein the summary provider comprises a sentence selector configured to select sentences to use in the summary data;
- a chunk identifier configured to identify in sentences selected for a summary chunks that do not contain words in the selected topics; and
- a chunk modifier configured to modify the identified chunks wherein the chunk modifier is configured to modify chunks by causing them to be displayed which place less emphasis on the modified chunks; and
- an outputter configured to output the summary data, wherein the outputter is configured to output the summary data and the chunk modifier is configured to modify chunks to cause, when the outputter provides output data for display by a display, the modified chunks to be displayed using at least one of a smaller font size, a different font, a different font characteristic and a different font colour from the other chunks.
Dependent claims: 31, 32, 33, 34
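
The chunk identifier and chunk modifier can be illustrated with a short Python sketch that renders a sentence as HTML. Several things here are assumptions rather than content of the claim: HTML is chosen as the display format, the sentence is assumed to have been split into chunks upstream, the function name `render_sentence` is hypothetical, and the particular style (smaller, grey text) is just one of the de-emphasis options the claim lists.

```python
import html
from typing import List, Set


def render_sentence(chunks: List[str], topic_words: Set[str]) -> str:
    """Render chunks as HTML, de-emphasising those without any topic word."""
    parts: List[str] = []
    for chunk in chunks:
        # Chunk identifier: does this chunk contain a word from the topics?
        words = {w.strip(".,;:").lower() for w in chunk.split()}
        text = html.escape(chunk)
        if words & topic_words:
            parts.append(text)
        else:
            # Chunk modifier: smaller font size and a different font colour.
            parts.append(
                f'<span style="font-size:smaller;color:grey">{text}</span>'
            )
    return " ".join(parts)
```

For example, with topic words `{"topic", "ranker"}` the chunk "The topic ranker" is emitted unchanged, while "runs quickly" is wrapped in the de-emphasising `span`.
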
35. A method to identify topics of document data, the method comprising the steps of:
- ranking words that are present in or representative of content of the document data;
- ranking co-occurrences of words that are present in or representative of the content of the document data;
- ranking phrases in the document data;
- selecting the words with a highest ranking;
- identifying which of the co-occurrences with a highest ranking contain at least one of the highest ranking words;
- identifying the phrases containing at least one word from the identified co-occurrences by concatenating consecutive nouns, concatenating consecutive proper nouns, and concatenating consecutive adjectives with a final noun;
- selecting one or ones of the identified phrases with a highest ranking as the topic or topics of the document data; and
- outputting data relating to the selected topics.
36. A computer-executable program stored on a computer-readable storage medium, the program when executed by a computer, performing the steps of:
- ranking words that are present in or representative of content of the document data;
- ranking co-occurrences of words that are present in or representative of the content of the document data;
- ranking phrases in the document data;
- selecting the words with a highest ranking;
- identifying which of the co-occurrences with a highest ranking contain at least one of the highest ranking words;
- identifying the phrases containing at least one word from the identified co-occurrences by concatenating consecutive nouns, concatenating consecutive proper nouns, and concatenating consecutive adjectives with a final noun;
- selecting one or ones of the identified phrases with a highest ranking as the topic or topics of the document data; and
- outputting data relating to the selected topics.
Specification