Apparatus for and method of summarising text

US 7,263,530 B2
Filed: 03/11/2004
Issued: 08/28/2007
Est. Priority Date: 03/12/2003
Status: Expired due to Fees

Abstract

Apparatus for identifying topics of document data has:

- a word ranker (171) for ranking words that are present in or representative of the content of the document data;
- a co-occurrence ranker (172) for ranking co-occurrences of words that are present in or representative of the content of the document data;
- a phrase ranker (170) for ranking phrases in the document data;
- a word selector (174) for selecting the highest ranking words;
- a co-occurrence identifier (176) for identifying which of the highest ranking co-occurrences contain at least one of the highest ranking words;
- a phrase identifier (177) for identifying the phrases containing at least one word from the identified co-occurrences;
- a phrase selector (178) for selecting the highest ranking one or ones of the identified phrases as the topic or topics of the document data; and
- an output device (40) for outputting data relating to the selected topics.
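
The chain of components listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: raw frequency as the ranking measure, same-sentence word pairs as co-occurrences, and the names `identify_topics`, `n_words`, `n_pairs` and `n_topics` are all assumptions introduced for the example.

```python
from collections import Counter
from itertools import combinations


def identify_topics(sentences, phrases, n_words=2, n_pairs=3, n_topics=2):
    """sentences: lists of content words; phrases: word tuples found in the text."""
    # Word ranker, co-occurrence ranker and phrase ranker.
    # Raw frequency is an assumed ranking measure, not one stated in the abstract.
    word_rank = Counter(w for s in sentences for w in s)
    pair_rank = Counter(
        pair for s in sentences for pair in combinations(sorted(set(s)), 2)
    )
    phrase_rank = Counter(phrases)

    # Word selector: keep the highest ranking words.
    top_words = {w for w, _ in word_rank.most_common(n_words)}

    # Co-occurrence identifier: highest ranking pairs containing a top word.
    top_pairs = [p for p, _ in pair_rank.most_common(n_pairs)]
    kept_pairs = [p for p in top_pairs if top_words & set(p)]

    # Phrase identifier: phrases sharing at least one word with a kept pair.
    pair_words = {w for p in kept_pairs for w in p}
    candidates = [ph for ph in phrase_rank if pair_words & set(ph)]

    # Phrase selector: the highest ranking candidates become the topics.
    candidates.sort(key=lambda ph: phrase_rank[ph], reverse=True)
    return candidates[:n_topics]
```

The point of the filtering chain is that a phrase is only eligible as a topic if it is tied, through a highly ranked co-occurrence, to a highly ranked word; phrase rank alone is not enough.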

111 Citations

36 Claims

1. An apparatus including a processor to identify topics of document data, the apparatus comprising:
- a word ranker configured to rank words that are present in or representative of content of the document data;
  
  a co-occurrence ranker configured to rank co-occurrences of words that are present in or representative of the content of the document data;
  
  a phrase ranker configured to rank phrases in the document data;
  
  a words selector configured to select the words with a highest ranking;
  
  a co-occurrence identifier configured to identify which of the co-occurrences with a highest ranking contain at least one of the highest ranking words;
  
  a phrase identifier configured to identify the phrases containing at least one word from the identified co-occurrences by concatenating consecutive nouns, concatenating consecutive proper nouns, and concatenating consecutive adjectives with a final noun;
  
  a phrase selector configured to select one or ones of the identified phrases with a highest ranking as a topic or topics of the document data; and
  
  an outputter configured to output data relating to the selected topics.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
  - 2. The apparatus of claim 1, wherein the words selector is configured to select as the highest ranking words a predetermined number of the highest ranking words, a number of the highest ranking words that represents a predetermined percentage of the words in the document data, or a number of the highest ranking words that represents a predetermined percentage of the number of ranked words.
  - 3. The apparatus of claim 1, wherein the co-occurrence identifier is configured to select as the highest ranking co-occurrences a predetermined number of co-occurrences, a number of the highest ranking co-occurrences that represents a predetermined percentage of the co-occurrences in the document data, or a number of the highest ranking co-occurrences that represents a predetermined percentage of the number of ranked co-occurrences.
  - 4. The apparatus of claim 1, wherein the phrase selector is configured to select as the highest ranking identified phrases a predetermined number of the identified phrases, a number of the highest ranking identified phrases that represents a predetermined percentage of the identified phrases in the document data, or a number of the highest ranking identified phrases that represents a predetermined percentage of the number of ranked phrases.
  - 5. The apparatus of claim 1, wherein at least one of the word ranker, co-occurrence ranker, and phrase ranker is configured to weight the items to be ranked in accordance with their position in the document data.
  - 6. The apparatus of claim 1, further comprising a co-occurrence determiner configured to determine word co-occurrences in the document data by identifying, as co-occurrences, word combinations comprising words in particular grammatical categories.
  - 7. The apparatus of claim 6, wherein the co-occurrence determiner is configured to ignore the order of the words in the word combinations.
  - 8. The apparatus of claim 1, further comprising a co-occurrence determiner configured to determine word co-occurrences in the document data by identifying as co-occurrences at least some of the following combinations:
    - noun and verb;
      
      noun and noun;
      
      noun and proper noun;
      
      verb and proper noun; and
      
      proper noun and proper noun.
  - 9. The apparatus of claim 1, wherein the co-occurrence ranker is configured to rank significant co-occurrences and the apparatus further comprises a co-occurrence determiner configured to determine word co-occurrences in the document data by identifying as co-occurrences word combinations comprising words in particular grammatical categories and a significance calculator configured to calculate a significance measure for the identified co-occurrences.
  - 10. The apparatus of claim 1, wherein the co-occurrence ranker is configured to rank significant co-occurrences and the apparatus further comprises a co-occurrence determiner configured to determine word co-occurrences in the document data by identifying as co-occurrences at least some of the following combinations:
    - noun and verb;
      
      noun and noun;
      
      noun and proper noun;
      
      verb and proper noun; and
      
      proper noun and proper noun, and a significance calculator configured to calculate a significance measure for the identified co-occurrences.
  - 11. The apparatus of claim 1, further comprising:
    - a text splitter configured to split the document data into text segments; and
      
      a classifier configured to classify the selected topics of a distribution in the text segments, to define main and subsidiary topics in the document data, wherein the outputter is configured to output data relating to the classified topics.
  - 12. The apparatus of claim 1, further comprising a summary provider configured to provide summary data on the basis of the selected topics, wherein the outputter is configured to output the summary data.
  - 13. The apparatus of claim 12, wherein the summary provider comprises a sentence selector configured to select sentences for use in the summary data.
  - 14. The apparatus of claim 13, wherein the sentence selector comprises:
    - a topic weight assigner configured to assign weights to the topics;
      
      a sentence weight assigner configured to assign weights to sentences in the document data;
      
      a scorer configured to score the sentences by summing the assigned topic and sentence weights;
      
      a selector configured to select the sentence or sentences having the highest score or scores;
      
      a topic weight adjuster configured to relatively reduce the weight allocated to the topic or topics in the selected sentence or sentences; and
      
      a controller configured to cause the scorer, selector and topic weight adjuster to repeat the above operations until a predetermined number of sentences has been selected for the summary from the document data.
  - 15. The apparatus of claim 13, further comprising:
    - a chunk identifier configured to identify in sentences selected for a summary chunks that do not contain words in the selected topics; and
      
      a chunk modifier configured to modify the identified chunks.
  - 16. The apparatus of claim 15, wherein the chunk modifier is configured to modify chunks by causing them to be displayed, to place less emphasis on the modified chunks.
  - 17. The apparatus of claim 1, further comprising a concept identifier configured to identify from the document data concepts that determine words representative of the content of the document data.
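
The co-occurrence determiner of claims 6 to 8 can be illustrated with a short sketch. It assumes the text has already been part-of-speech tagged, that co-occurrences are counted within a sentence, and that the tag names `NOUN`, `VERB` and `PROPN` are used; none of these details comes from the claims, and `determine_cooccurrences` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations

# Grammatical category combinations listed in claim 8.
# A one-element frozenset stands for a pair of the same category.
ALLOWED = {
    frozenset({"NOUN", "VERB"}),
    frozenset({"NOUN"}),           # noun and noun
    frozenset({"NOUN", "PROPN"}),
    frozenset({"VERB", "PROPN"}),
    frozenset({"PROPN"}),          # proper noun and proper noun
}


def determine_cooccurrences(tagged_sentences):
    """tagged_sentences: lists of (word, tag) pairs, one list per sentence."""
    counts = Counter()
    for sentence in tagged_sentences:
        for (w1, t1), (w2, t2) in combinations(sentence, 2):
            if w1 != w2 and frozenset({t1, t2}) in ALLOWED:
                # Sorting the pair ignores word order, as in claim 7.
                counts[tuple(sorted((w1, w2)))] += 1
    return counts
```

Storing each pair as a sorted tuple means "Canon ... printers" and "printers ... Canon" are counted as the same co-occurrence.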

18. An apparatus including a processor to identify topics of document data, the apparatus comprising:
- a word ranker configured to rank words that are present in or representative of the content of the document data;
  
  a co-occurrence ranker configured to rank co-occurrences of words that are present in or representative of the content of the document data;
  
  a phrase ranker configured to rank phrases in the document data;
  
  a words selector configured to select the highest ranking words;
  
  a co-occurrence identifier configured to identify which of the highest ranking co-occurrences contain at least one of the highest ranking words;
  
  a phrase identifier configured to identify the phrases containing at least one word from the identified co-occurrences;
  
  a phrase selector configured to select the highest ranking one or ones of the identified phrases as the topic or topics of the document data;
  
  an outputter configured to output data relating to the selected topics;
  
  a text splitter configured to split the document data into text segments;
  
  a classifier configured to classify the selected topics of the distribution in the text segments which define main and subsidiary topics in the document data, wherein the outputter is configured to output data relating to the classified topics; and
  
  a topic hierarchy identifier configured to identify a topic as being a child or subsidiary topic of another topic when text portions in which that subsidiary topic occurs represent a sub-set of the text portions in which the said other topic occurs, wherein the outputter is configured to output data relating to the identified topic hierarchy.
- Dependent Claims (19, 20, 21)
  - 19. The apparatus of claim 18, wherein the classifier is configured to determine that a topic is a main topic when the topic occurs in a predetermined percentage of the text segments and to classify any topic not meeting this requirement as a subsidiary or lesser topic.
  - 20. The apparatus of claim 18, wherein the classifier is configured to weight a topic in accordance with a position in the document data of the text segment containing the topic.
  - 21. The apparatus of claim 18, wherein the classifier is configured to weight a topic in accordance with a position in the document data of the text segments containing the topic, wherein a topic occurring in at least one of a first and last text segment of document data representing a document is given a higher weighting than topics occurring in the other text segments.
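
A minimal sketch of the classifier described in claims 19 to 21 follows. The threshold `main_fraction`, the factor `edge_weight` and the function name `classify_topics` are hypothetical; the claims only require a predetermined percentage and a higher weighting for the first and last segments.

```python
def classify_topics(topic_segments, n_segments, main_fraction=0.5, edge_weight=2.0):
    """topic_segments maps each topic to the set of segment indices containing it."""
    result = {}
    for topic, segments in topic_segments.items():
        # Claim 19: a main topic occurs in a predetermined share of the segments.
        share = len(segments) / n_segments
        label = "main" if share >= main_fraction else "subsidiary"
        # Claim 21: occurrences in the first or last segment weigh more.
        weight = sum(
            edge_weight if index in (0, n_segments - 1) else 1.0
            for index in segments
        )
        result[topic] = (label, weight)
    return result
```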

22. An apparatus including a processor to identify topics of document data, the apparatus comprising:
- a word ranker configured to rank words that are present in or representative of the content of the document data;
  
  a co-occurrence ranker configured to rank co-occurrences of words that are present in or representative of the content of the document data;
  
  a phrase ranker configured to rank phrases in the document data;
  
  a words selector configured to select the highest ranking words;
  
  a co-occurrence identifier configured to identify which of the highest ranking co-occurrences contain at least one of the highest ranking words;
  
  a phrase identifier configured to identify the phrases containing at least one word from the identified co-occurrences;
  
  a phrase selector configured to select the highest ranking one or ones of the identified phrases as the topic or topics of the document data;
  
  an outputter configured to output data relating to the selected topics;
  
  a text splitter configured to split the document data into text segments;
  
  a classifier configured to classify the selected topics of the distribution in the text segments which define main and subsidiary topics in the document data, wherein the outputter is configured to output data relating to the classified topics; and
  
  a topic hierarchy identifier configured to identify a topic as being a child or subsidiary topic of another topic when the text segments in which that subsidiary topic occurs represent a sub-set of the text segments in which the said other topic occurs, wherein the outputter is configured to output data relating to the identified topic hierarchy.
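
The sub-set test used by the topic hierarchy identifier can be sketched as follows. The sketch assumes a proper sub-set (so that two topics with identical segment sets do not become each other's parent), which is an interpretation rather than something the claim states; `topic_hierarchy` is a hypothetical name.

```python
def topic_hierarchy(topic_segments):
    """Map each child topic to the topics whose segment sets contain its own."""
    parents = {}
    for child, child_segments in topic_segments.items():
        for parent, parent_segments in topic_segments.items():
            # "<" on sets is the proper sub-set test (assumed interpretation).
            if child != parent and child_segments < parent_segments:
                parents.setdefault(child, []).append(parent)
    return parents
```

For example, a topic found only in segments 1 and 2 would be reported as a child of a topic found in segments 0 to 3.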

23. An apparatus including a processor to identify topics of document data, the apparatus comprising:
- a word ranker configured to rank words that are present in or representative of content of the document data;
  
  a co-occurrence ranker configured to rank co-occurrences of words that are present in or representative of the content of the document data;
  
  a phrase ranker configured to rank phrases in the document data;
  
  a words selector configured to select the highest ranking words;
  
  a co-occurrence identifier configured to identify which of the highest ranking co-occurrences contain at least one of the highest ranking words;
  
  a phrase identifier configured to identify the phrases containing at least one word from the identified co-occurrences;
  
  a phrase selector configured to select the highest ranking one or ones of the identified phrases as the topic or topics of the document data; and
  
  a summary provider configured to provide summary data on the basis of the selected topics, wherein the summary provider comprises a sentence selector configured to select sentences to use in the summary data;
  
  wherein the sentence selector comprises:
  
  a topic weight assigner configured to assign weights to the topics;
  
  a sentence weight assigner configured to assign weights to sentences in the document data;
  
  a scorer configured to score the sentences by summing the assigned topic and sentence weights;
  
  a selector configured to select the sentence or sentences having the highest score or scores;
  
  a topic weight adjuster configured to relatively reduce the weight allocated to the topic or topics in the selected sentence or sentences, wherein the topic weight adjuster is configured to set to zero the weight of any topic in the selected sentence or sentences;
  
  a controller configured to cause the scorer, selector and topic weight adjuster to repeat the above operations until a predetermined number of sentences has been selected for the summary from the document data; and
  
  an outputter configured to output the summary data.
- Dependent Claims (24, 25, 26, 27, 28, 29)
  - 24. The apparatus of claim 23, wherein the sentence selector comprises:
    - a topic weight assigner configured to assign weights to the topics;
      
      a sentence weight assigner configured to assign weights to sentences in the document data;
      
      a scorer configured to score the sentences by summing the assigned topic and sentence weights; and
      
      a selector configured to select the sentence or sentences having the highest score or scores for the summary.
  - 25. The apparatus of claim 23, wherein the summary provider comprises a locater configured to locate words present in or representative of the content of the document data that co-occur with words in the topics; and the outputter is configured to output summary data in which the or each topic is associated with subsidiary items comprising located co-occurring words.
  - 26. The apparatus of claim 23, wherein the summary provider further comprises a further locater configured to locate all words present in or representative of the content of the document data that co-occur with the subsidiary items and the outputter is configured to associate each such co-occurring word with the corresponding subsidiary item in the summary data.
  - 27. The apparatus of claim 23, wherein the summary provider further comprises a filter configured to filter the co-occurring words to select for the summary data those co-occurring words that themselves have co-occurrences with the subsidiary items.
  - 28. The apparatus of claim 23, wherein the concept identifier is configured to identify as concepts at least one of synonyms, hypernyms and hyponyms in or relating to the document data.
  - 29. The apparatus of claim 23, wherein the concept identifier is configured to access a lexical database to identify as concepts at least one of synonyms, hypernyms and hyponyms in or relating to the document data.
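
The sentence selection loop of claim 23 (score, select, reduce topic weights to zero, repeat) can be sketched as below. How the initial topic and sentence weights are assigned is not shown and is left to the caller; the function name `select_sentences` and its parameters are hypothetical.

```python
def select_sentences(sentence_topics, topic_weights, sentence_weights, n_select):
    """sentence_topics[i] is the set of topics occurring in sentence i.

    Returns the indices of the selected sentences in order of selection.
    """
    weights = dict(topic_weights)  # working copy; the caller's dict is untouched
    chosen = []
    while len(chosen) < min(n_select, len(sentence_topics)):
        def score(i):
            # Scorer: sum of the sentence weight and the weights of its topics.
            return sentence_weights[i] + sum(
                weights.get(t, 0.0) for t in sentence_topics[i]
            )

        remaining = [i for i in range(len(sentence_topics)) if i not in chosen]
        best = max(remaining, key=score)  # selector: highest scoring sentence
        chosen.append(best)
        # Topic weight adjuster: topics already covered are set to zero.
        for topic in sentence_topics[best]:
            weights[topic] = 0.0
    return chosen
```

Zeroing the weight of covered topics steers later iterations toward sentences that mention topics not yet represented in the summary, rather than repeating the strongest topic.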

30. An apparatus including a processor to identify topics of document data, the apparatus comprising:
- a word ranker configured to rank words that are present in or representative of the content of the document data;
  
  a co-occurrence ranker configured to rank co-occurrences of words that are present in or representative of the content of the document data;
  
  a phrase ranker configured to rank phrases in the document data;
  
  a words selector configured to select the highest ranking words;
  
  a co-occurrence identifier configured to identify which of the highest ranking co-occurrences contain at least one of the highest ranking words;
  
  a phrase identifier configured to identify the phrases containing at least one word from the identified co-occurrences;
  
  a phrase selector configured to select the highest ranking one or ones of the identified phrases as the topic or topics of the document data;
  
  a summary provider configured to provide summary data on the basis of the selected topics, wherein the summary provider comprises a sentence selector configured to select sentences to use in the summary data;
  
  a chunk identifier configured to identify in sentences selected for a summary chunks that do not contain words in the selected topics; and
  
  a chunk modifier configured to modify the identified chunks wherein the chunk modifier is configured to modify chunks by causing them to be displayed which place less emphasis on the modified chunks; and
  
  an outputter configured to output the summary data, wherein the outputter is configured to output the summary data and the chunk modifier is configured to modify chunks to cause, when the outputter provides output data for display by a display, the modified chunks to be displayed using at least one of a smaller font size, a different font, a different font characteristic and a different font colour from the other chunks.
- Dependent Claims (31, 32, 33, 34)
  - 31. The apparatus of claim 30, wherein the chunk modifier is configured to modify chunks by replacing them by ellipsis.
  - 32. The apparatus of claim 30, wherein the chunk modifier is configured to remove the identified chunks.
  - 33. The apparatus of claim 30, further comprising a processor configured to carry out syntactic or semantic processing on sentences from which chunks have been removed to maintain sentence coherence or cohesion.
  - 34. The apparatus of claim 30, wherein the chunk identifier is configured to identify chunks by using punctuation marks to define bounds of the chunks.
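
As an illustration of claims 31 and 34, the sketch below splits a selected sentence into chunks at punctuation marks and replaces chunks that contain no topic word with an ellipsis. The choice of comma, semicolon and colon as chunk boundaries, the merging of adjacent ellipses, and the name `reduce_sentence` are assumptions made for the example.

```python
import re


def reduce_sentence(sentence, topic_words):
    """Replace chunks without topic words by an ellipsis."""
    topics = {w.lower() for w in topic_words}
    # Chunk identifier: punctuation marks define the chunk bounds (claim 34).
    chunks = [c.strip() for c in re.split(r"[,;:]", sentence) if c.strip()]
    kept = []
    for chunk in chunks:
        words = set(re.findall(r"\w+", chunk.lower()))
        if words & topics:
            kept.append(chunk)
        elif not kept or kept[-1] != "...":
            # Chunk modifier: replace the chunk by an ellipsis (claim 31),
            # merging runs of adjacent replaced chunks into one.
            kept.append("...")
    return ", ".join(kept)
```

With topic words "printer" and "tray", the sentence "The printer jams often, as everyone in the office knows, so the paper tray was replaced." becomes "The printer jams often, ..., so the paper tray was replaced."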

35. A method to identify topics of document data, the method comprising the steps of:
- ranking words that are present in or representative of content of the document data;
  
  ranking co-occurrences of words that are present in or representative of the content of the document data;
  
  ranking phrases in the document data;
  
  selecting the words with a highest ranking;
  
  identifying which of the co-occurrences with a highest ranking contain at least one of the highest ranking words;
  
  identifying the phrases containing at least one word from the identified co-occurrences by concatenating consecutive nouns, concatenating consecutive proper nouns, and concatenating consecutive adjectives with a final noun;
  
  selecting one or ones of the identified phrases with a highest ranking as the topic or topics of the document data; and
  
  outputting data relating to the selected topics.

36. A computer-executable program stored on a computer-readable storage medium, the program when executed by a computer, performing the steps of:
- ranking words that are present in or representative of content of the document data;
  
  ranking co-occurrences of words that are present in or representative of the content of the document data;
  
  ranking phrases in the document data;
  
  selecting the words with a highest ranking;
  
  identifying which of the co-occurrences with a highest ranking contain at least one of the highest ranking words;
  
  identifying the phrases containing at least one word from the identified co-occurrences by concatenating consecutive nouns, concatenating consecutive proper nouns, and concatenating consecutive adjectives with a final noun;
  
  selecting one or ones of the identified phrases with a highest ranking as the topic or topics of the document data; and
  
  outputting data relating to the selected topics.
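
The phrase identification step recited in claims 1, 35 and 36 (concatenating consecutive nouns, consecutive proper nouns, and consecutive adjectives with a final noun) can be sketched as follows. The input is assumed to be part-of-speech tagged with the tag names `NOUN`, `PROPN` and `ADJ`, and an adjective run is assumed to absorb exactly one following noun; both are assumptions, and `identify_phrases` is a hypothetical name.

```python
def identify_phrases(tagged):
    """tagged: list of (word, tag) pairs for one sentence."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        tag = tagged[i][1]
        j = i
        if tag in ("NOUN", "PROPN"):
            # Concatenate a run of consecutive words with the same tag.
            while j < n and tagged[j][1] == tag:
                j += 1
            phrases.append(" ".join(word for word, _ in tagged[i:j]))
        elif tag == "ADJ":
            # Concatenate consecutive adjectives with a final noun.
            while j < n and tagged[j][1] == "ADJ":
                j += 1
            if j < n and tagged[j][1] == "NOUN":
                j += 1
                phrases.append(" ".join(word for word, _ in tagged[i:j]))
            # Adjectives not followed by a noun form no phrase.
        else:
            j = i + 1
        i = j
    return phrases
```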

Current Assignee: Canon Kabushiki Kaisha (Canon Inc.)
Original Assignee: Canon Kabushiki Kaisha (Canon Inc.)
Inventors: Hu, Jiawei; Imlah, William George
Primary Examiner(s): Gaffin, Jeffrey
Assistant Examiner(s): Ponikiewski, Tomasz
Application Number: US10/797,107
Publication Number: US 20040225667A1
Time in Patent Office: 1,265 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 16/345   Summarisation for human users
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99943   Generating database or data...
