Apparatus for and method of summarising text
Abstract
Apparatus for identifying topics of document data has:
a word ranker (171) for ranking words that are present in or representative of the content of the document data;
a co-occurrence ranker (172) for ranking co-occurrences of words that are present in or representative of the content of the document data;
a phrase ranker (170) for ranking phrases in the document data;
a word selector (174) for selecting the highest ranking words;
a co-occurrence identifier (176) for identifying which of the highest ranking co-occurrences contain at least one of the highest ranking words;
a phrase identifier (177) for identifying the phrases containing at least one word from the identified co-occurrences;
a phrase selector (178) for selecting the highest ranking one or ones of the identified phrases as the topic or topics of the document data; and
an output device (40) for outputting data relating to the selected topics.
72 Claims
1. Apparatus for identifying topics of document data, the apparatus comprising:
a word ranker operable to rank words that are present in or representative of the content of the document data;
a co-occurrence ranker operable to rank co-occurrences of words that are present in or representative of the content of the document data;
a phrase ranker operable to rank phrases in the document data;
a words selector operable to select the highest ranking words;
a co-occurrence identifier operable to identify which of the highest ranking co-occurrences contain at least one of the highest ranking words;
a phrase identifier operable to identify the phrases containing at least one word from the identified co-occurrences;
a phrase selector operable to select the highest ranking one or ones of the identified phrases as the topic or topics of the document data; and
an outputter operable to output data relating to the selected topics.
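The chain of rankers, selectors and identifiers in claim 1 can be illustrated with a minimal Python sketch. The claim does not say how ranking is done, so the sketch assumes raw frequency counts, treats a co-occurrence as an unordered pair of words in the same sentence, and takes candidate phrases as an input. The function name `identify_topics` and the parameters `n_words`, `n_cooccs` and `n_topics` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def identify_topics(sentences, phrases, n_words=3, n_cooccs=5, n_topics=2):
    """Sketch of the claimed pipeline; frequency-based ranking is an assumption.

    sentences: list of token lists.
    phrases:   list of candidate phrases, each a tuple of tokens.
    """
    # Word ranker: rank words by how often they occur.
    word_rank = Counter(tok for sent in sentences for tok in sent)
    # Co-occurrence ranker: rank unordered word pairs sharing a sentence.
    coocc_rank = Counter(
        pair
        for sent in sentences
        for pair in combinations(sorted(set(sent)), 2)
    )
    # Phrase ranker: rank phrases by how often they occur.
    phrase_rank = Counter(phrases)

    # Words selector: keep the highest ranking words.
    top_words = {w for w, _ in word_rank.most_common(n_words)}
    # Co-occurrence identifier: top pairs containing at least one top word.
    top_cooccs = [c for c, _ in coocc_rank.most_common(n_cooccs)]
    identified = [c for c in top_cooccs if top_words & set(c)]
    # Phrase identifier: phrases sharing a word with an identified pair.
    coocc_words = {w for c in identified for w in c}
    candidates = [p for p in phrase_rank if coocc_words & set(p)]
    # Phrase selector: the highest ranking identified phrases are the topics.
    candidates.sort(key=lambda p: -phrase_rank[p])
    return candidates[:n_topics]
```

A real system would likely replace the frequency counts with a weighting that discounts common words; the filtering order is the part taken from the claim.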
36. Co-occurrence significance calculating apparatus for use in text summarisation apparatus, the co-occurrence significance calculating apparatus comprising:
a co-occurrence identifier operable to identify as co-occurrences particular combinations of categories of words present in or representative of the content of document data;
a significance calculator operable to calculate a significance measure for the identified co-occurrences to determine significant ones of the identified co-occurrences; and
an outputter operable to output data representing the determined significant co-occurrences.
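As an illustration of claim 36, the sketch below identifies co-occurrences by word category and scores them. The claim names no particular significance measure, so pointwise mutual information over sentences is swapped in here purely as an assumption. The category labels, the `ALLOWED` combinations and the function name `significant_cooccurrences` are likewise hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical category combinations that count as co-occurrences.
ALLOWED = {frozenset(("NOUN", "VERB")), frozenset(("ADJ", "NOUN"))}


def significant_cooccurrences(tagged_sentences, threshold=0.0):
    """Score category-filtered word pairs with PMI (an assumed measure).

    tagged_sentences: list of sentences, each a list of (word, category).
    Returns {(word1, word2): score} for pairs scoring above threshold.
    """
    word_sents = Counter()  # number of sentences containing each word
    pair_sents = Counter()  # number of sentences containing each pair
    for sent in tagged_sentences:
        tokens = set(sent)
        word_sents.update({w for w, _ in tokens})
        for (w1, c1), (w2, c2) in combinations(sorted(tokens), 2):
            # Keep only pairs whose categories form an allowed combination.
            if w1 != w2 and frozenset((c1, c2)) in ALLOWED:
                pair_sents[(w1, w2)] += 1

    n = len(tagged_sentences)
    scores = {}
    for (w1, w2), joint in pair_sents.items():
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), estimated per sentence.
        pmi = math.log2(joint * n / (word_sents[w1] * word_sents[w2]))
        if pmi > threshold:
            scores[(w1, w2)] = pmi
    return scores
```

Pairs of the same category (for example noun with noun) are ignored under these assumed combinations, and a pair is kept only if it occurs together more often than its words' individual frequencies would suggest.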
39. Apparatus for searching document data, the apparatus comprising:
a receiver operable to receive query terms supplied by a user;
a co-occurrence determiner operable to identify, for each query term, co-occurrences of words present in or representative of the content of the document data that include the query terms; and
an outputter operable to output parts or portions of the document data containing the identified co-occurrences.
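The search of claim 39 might be sketched as follows. The sketch assumes that a "portion" is a sentence, that a co-occurrence is an unordered pair of words in the same sentence, and that tokenising on whitespace is good enough for illustration. The names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_index(sentences):
    """Map each co-occurrence (word pair) to the sentences containing it."""
    index = defaultdict(set)
    for i, sent in enumerate(sentences):
        words = sorted(set(sent.lower().split()))
        for pair in combinations(words, 2):
            index[pair].add(i)
    return index


def search(sentences, query_terms):
    """Return sentences holding a co-occurrence that includes a query term."""
    index = build_index(sentences)
    terms = {t.lower() for t in query_terms}
    hit_ids = set()
    for pair, ids in index.items():
        # Co-occurrence determiner: the pair must include a query term.
        if terms & set(pair):
            hit_ids |= ids
    # Outputter: matching portions, in document order.
    return [sentences[i] for i in sorted(hit_ids)]
```

In practice the index would be built once per document and restricted to significant co-occurrences rather than every pair; that refinement is omitted here.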
43. Apparatus for classifying topics in document data, which apparatus comprises:
a text splitter operable to split the document data into text segments;
a classifier operable to classify topics in the document data according to the distribution of the topics in the text segments so as to define main and subsidiary topics in the document data; and
an outputter operable to output data representing the classified topics.
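A minimal sketch of the classification in claim 43 follows. It assumes fixed-size segments of sentences, substring matching to detect a topic in a segment, and a simple rule that a topic found in at least a given fraction of segments is a main topic. The function name `classify_topics` and the parameters `segment_size` and `main_fraction` are hypothetical.

```python
def classify_topics(sentences, topics, segment_size=2, main_fraction=0.5):
    """Label topics as main or subsidiary by their spread across segments."""
    # Text splitter: group consecutive sentences into segments.
    segments = [
        " ".join(sentences[i:i + segment_size]).lower()
        for i in range(0, len(sentences), segment_size)
    ]
    result = {"main": [], "subsidiary": []}
    for topic in topics:
        # Count the segments in which the topic appears.
        hits = sum(topic.lower() in seg for seg in segments)
        if hits == 0:
            continue  # topic absent from the document
        spread = hits / len(segments)
        kind = "main" if spread >= main_fraction else "subsidiary"
        result[kind].append(topic)
    return result
```

The threshold rule is only one possible reading of "according to the distribution of the topics"; a topic concentrated in one segment is treated as subsidiary, while one spread across the document is treated as main.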
47. Apparatus for selecting sentences for use in a summary, the apparatus comprising:
a topic weight assigner operable to assign weights to topics in document data to be summarised;
a sentence weight assigner operable to assign weights to sentences in the document data;
a scorer operable to score each sentence in the document data by summing the assigned weights;
a selector operable to select the sentence or sentences having the highest score;
a topic weight adjuster operable to relatively reduce the weight allocated to topics in the selected sentence or sentences; and
a controller operable to cause the scorer, selector and topic weight adjuster to repeat the above operations until a certain number of sentences has been selected for the summary from the document data.
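The score, select and adjust loop of claim 47 can be sketched as below. The sketch assumes that a sentence's score is its own weight plus the weights of the topics it contains, that topics are single words, and that "relatively reducing" means multiplying by a factor `decay`. The function name `select_sentences` and its parameters are hypothetical.

```python
def select_sentences(sentences, topic_weights, sentence_weights, k, decay=0.5):
    """Greedily pick k sentences, down-weighting topics already covered."""
    weights = dict(topic_weights)  # working copy; the input is left unchanged
    word_sets = [set(s.lower().split()) for s in sentences]
    remaining = list(range(len(sentences)))
    chosen = []

    def score(i):
        # Scorer: sentence weight plus weights of the topics it contains.
        topic_part = sum(w for t, w in weights.items() if t in word_sets[i])
        return sentence_weights[i] + topic_part

    # Controller: repeat until k sentences are selected or none remain.
    while remaining and len(chosen) < k:
        best = max(remaining, key=score)  # selector
        chosen.append(best)
        remaining.remove(best)
        # Topic weight adjuster: reduce weights of topics just covered.
        for topic in weights:
            if topic in word_sets[best]:
                weights[topic] *= decay
    # Return the selection in document order (an assumption).
    return [sentences[i] for i in sorted(chosen)]
```

Reducing the weights of covered topics steers later selections toward sentences about topics not yet represented, which limits redundancy in the summary.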
49. Apparatus for providing a summary of document data, which apparatus comprises:
a receiver operable to receive data representing the topic or topics of the document data;
a locator operable to locate, for words in the or each topic, words in or representative of the content of the document data that co-occur with those words; and
an outputter operable to output summary data in which the or each topic is associated with subsidiary items comprising located co-occurring words.
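One way to picture the summary of claim 49 is as an outline in which each topic is followed by the words that most often share a sentence with it. The sketch below assumes sentence-level co-occurrence, whitespace tokenisation and frequency ordering. The function name `topic_outline` and the parameters `per_topic` and `stopwords` are hypothetical.

```python
from collections import Counter


def topic_outline(sentences, topics, per_topic=2, stopwords=frozenset()):
    """Attach the most frequent co-occurring words to each topic."""
    outline = {}
    for topic in topics:
        topic_words = set(topic.lower().split())
        counts = Counter()
        for sent in sentences:
            words = set(sent.lower().split())
            # Locator: sentences that share a word with the topic.
            if topic_words & words:
                counts.update(words - topic_words - stopwords)
        # Subsidiary items: the words co-occurring most often.
        outline[topic] = [w for w, _ in counts.most_common(per_topic)]
    return outline
```

Passing a stopword set keeps function words out of the subsidiary items; without one, common words would tend to dominate the counts.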
52. Apparatus for modifying chunks of sentences selected for a document data summary, which apparatus comprises:
a chunk identifier operable to identify chunks that do not contain words in topics representative of the content of the document data;
a chunk modifier operable to modify the identified chunks; and
an outputter operable to output the document data summary with the identified chunks of the selected sentences modified by the chunk modifier.
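Claim 52 leaves open both what a chunk is and how it is modified. Purely as an illustration, the sketch below assumes that chunks are comma-separated clauses and that the modification is deletion of chunks containing no topic word. The function name `trim_chunks` is hypothetical.

```python
def trim_chunks(sentence, topic_words):
    """Drop comma-delimited chunks that contain no topic word.

    Both the chunking rule and the choice of deletion are assumptions.
    """
    chunks = [c.strip() for c in sentence.split(",")]
    # Chunk identifier: chunks sharing no word with the topics are dropped.
    kept = [c for c in chunks if set(c.lower().split()) & topic_words]
    # Leave the sentence unchanged if every chunk would be removed.
    return ", ".join(kept) if kept else sentence
```

Other modifications, such as abbreviating a chunk rather than removing it, would fit the claim wording equally well.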
62. A method of identifying topics of document data, the method comprising a processor carrying out the steps of:
ranking words that are present in or representative of the content of the document data;
ranking co-occurrences of words that are present in or representative of the content of the document data;
ranking phrases in the document data;
selecting the highest ranking words;
identifying which of the highest ranking co-occurrences contain at least one of the highest ranking words;
identifying the phrases containing at least one word from the identified co-occurrences;
selecting the highest ranking one or ones of the identified phrases as the topic or topics of the document data; and
outputting data relating to the selected topics.
63. A method of calculating co-occurrence significances for use in text summarisation apparatus, the method comprising a processor carrying out the steps of:
identifying as co-occurrences particular combinations of categories of words present in or representative of the content of document data;
calculating a significance measure for the identified co-occurrences to determine significant ones of the identified co-occurrences; and
outputting data representing the determined significant co-occurrences.
64. A method of searching document data, the method comprising a processor carrying out the steps of:
receiving query terms supplied by a user;
identifying, for each query term, co-occurrences of words present in or representative of the content of the document data that include the query terms; and
outputting parts or portions of the document data containing the identified co-occurrences.
65. A method of classifying topics in document data, which method comprises a processor carrying out the steps of:
splitting the document data into text segments;
classifying topics in the document data according to the distribution of the topics in the text segments so as to define main and subsidiary topics in the document data; and
outputting data representing the classified topics.
66. A method of selecting sentences for use in a summary, the method comprising a processor carrying out the steps of:
assigning weights to topics in document data to be summarised;
assigning weights to sentences in the document data;
scoring each sentence in the document data by summing the assigned weights;
selecting the sentence or sentences having the highest score;
relatively reducing the weight allocated to topics in the selected sentence or sentences; and
repeating the scoring, selecting and topic weight adjusting steps until a certain number of sentences has been selected for the summary from the document data.
67. A method of providing a summary of document data, which method comprises a processor carrying out the steps of:
receiving data representing the topic or topics of the document data;
locating, for words in the or each topic, words in or representative of the content of the document data that co-occur with those words; and
outputting summary data in which the or each topic is associated with subsidiary items comprising located co-occurring words.
68. A method of modifying chunks of sentences selected for a document data summary, which method comprises a processor carrying out the steps of:
identifying chunks that do not contain words in topics representative of the content of the document data;
modifying the identified chunks; and
outputting the document data summary with the modified identified chunks of the selected sentences.
72. Apparatus for identifying topics of document data, the apparatus comprising:
word ranking means for ranking words that are present in or representative of the content of the document data;
co-occurrence ranking means for ranking co-occurrences of words that are present in or representative of the content of the document data;
phrase ranking means for ranking phrases in the document data;
words selecting means for selecting the highest ranking words;
co-occurrence identifying means for identifying which of the highest ranking co-occurrences contain at least one of the highest ranking words;
phrase identifying means for identifying the phrases containing at least one word from the identified co-occurrences;
phrase selecting means for selecting the highest ranking one or ones of the identified phrases as the topic or topics of the document data; and
output means for outputting data relating to the selected topics.
Specification