Method and apparatus for summarizing documents according to theme

US 5,384,703 A
Filed: 07/02/1993
Issued: 01/24/1995
Est. Priority Date: 07/02/1993
Status: Expired due to Term

Abstract

A summary is automatically formed by selecting regions of a document. Each selected region includes at least two members of a seed list. The seed list is formed from a predetermined number of the most frequently occurring complex expressions in the document that are not on a stop list. If the summary is too long, the region-selection process is performed on the summary to produce a shorter summary. This region-selection process is repeated until a summary is produced having a desired length. Each time the region selection process is repeated, the seed list members are added to the stop list and the complexity level used to identify frequently occurring expressions is reduced.
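The loop described in the abstract can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes English text, treats each sentence as a region, uses word length as the complexity measure, and measures summary length in sentences. The names `summarize`, `split_sentences`, `tokens`, and all default parameter values are hypothetical. The `min_len <= 1` exit is a safety guard added for the sketch and is not part of the described method.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter on sentence-ending punctuation (illustrative assumption).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence):
    # Lowercased alphabetic words; no stemming is applied in this sketch.
    return re.findall(r"[a-z]+", sentence.lower())


def summarize(text, max_sentences, seed_size=6, min_len=8, stop_list=()):
    stop = set(stop_list)
    current = split_sentences(text)
    while True:
        # Count expressions that meet the complexity level and are not stopped.
        counts = Counter(
            w for s in current for w in tokens(s)
            if len(w) >= min_len and w not in stop
        )
        # Seed list: the most frequently occurring such expressions.
        seeds = {w for w, _ in counts.most_common(seed_size)}
        # Summary: regions (here, sentences) with at least two seed members,
        # kept in their original order.
        current = [s for s in current if len(seeds & set(tokens(s))) >= 2]
        # Stop when short enough. The min_len check is a guard added for
        # this sketch so the loop always terminates.
        if len(current) <= max_sentences or min_len <= 1:
            return current
        # Otherwise repeat on the summary with the seeds added to the stop
        # list and a reduced complexity level.
        stop |= seeds
        min_len -= 1


text = (
    "Summarization systems extract sentences. "
    "Summarization depends on frequency statistics. "
    "The weather was pleasant. "
    "Frequency statistics guide summarization. "
    "Dogs bark."
)
for sentence in summarize(text, max_sentences=2, seed_size=3, min_len=8):
    print(sentence)
```

With these inputs the seed list is "summarization", "frequency", and "statistics", so only the second and fourth sentences are kept. A single pass suffices because that summary already fits the two-sentence limit.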

55 Claims

1. An automated, computer implemented method of electronically processing a document stored in a memory of a computer, said document containing text represented by characters, said method comprising the steps of:
- a) using the computer, automatically determining a frequency of occurrence of expressions in the document not contained in a stop list and having at least a first predetermined level of complexity, said stop list stored in the memory of said computer;
  
  b) using the computer, automatically forming a seed list comprised of a second predetermined number of the most frequently occurring expressions determined in step (a), said seed list stored in said memory of said computer;
  
  c) using said computer, automatically forming a summary of the document comprised of regions in the document containing at least two members of said seed list, said summary stored in said memory of said computer; and
  
  d) using said computer, automatically repeating steps (a)-(c) on said summary until a length of said summary is no greater than a predetermined length, each time steps (a)-(c) are repeated, adding the members of said seed list to said stop list and reducing said first predetermined level of complexity.
  - 2. The method of claim 1, wherein each expression is represented in the document as at least one character.
  - 3. The method of claim 2, wherein at least some of said expressions are represented as character strings.
  - 4. The method of claim 3, wherein the level of complexity of each character string is determined by said computer based on a length of said character string.
  - 5. The method of claim 1, wherein the level of complexity of each expression is determined by said computer based on a length of said expression.
  - 6. The method of claim 1, wherein the level of complexity of each expression is determined by said computer based on a number of strokes contained in said expression.
  - 7. The method of claim 1, wherein each expression is a word, so that in step (a) said computer determines a frequency of occurrence of words in the document.
  - 8. The method of claim 7, wherein the complexity of each word is determined by said computer based on a length of said word.
  - 9. The method of claim 7, further comprising, using said computer, performing a word-stemming operation on words in said document prior to performing step (a).
  - 10. The method of claim 1, wherein each of said regions comprises a portion of the document containing at least two of said members of said seed list.
  - 11. The method of claim 10, wherein said portion is a sentence, so that each of said regions includes a sentence containing at least two of said members of said seed list.
  - 12. The method of claim 1, wherein each of said regions comprises a sentence containing at least two of said members of said seed list and immediately preceding and following sentences located in a same paragraph as said sentence containing at least two of said members of said seed list.
  - 13. The method of claim 1, further comprising including in a final summary of said document, any regions located in a first paragraph of said document, and contained in a summary of the document a first time step (c) is performed by said computer.
  - 14. The method of claim 1, further comprising including in a final summary of said document, any regions located in a last paragraph of said document, and contained in a summary of the document a first time step (c) is performed by said computer.
  - 15. The method of claim 1, further comprising including in a final summary of said document, any regions located in a first and a last paragraph of said document, and contained in a summary of the document a first time step (c) is performed by said computer.
  - 16. The method of claim 1, wherein said stop list is empty before step (a) is performed a first time.
  - 17. The method of claim 1, wherein said stop list includes a plurality of predefined stop expressions before step (a) is performed a first time.
  - 18. The method of claim 17, wherein said predefined stop expressions include conjunctions, articles and modals.
  - 19. The method of claim 1, wherein said summary is formed using said computer by extracting said regions from said document and constructing said summary from said extracted regions.
  - 20. The method of claim 19, wherein said extracted regions are maintained in a same order in said summary as in the document.
  - 21. The method of claim 1, wherein said second predetermined number is at least six.
  - 22. The method of claim 1, wherein said predetermined length of said summary is no greater than one page.
  - 23. The method of claim 1, further comprising using said computer, automatically outputting said summary when a length of said summary is no greater than said predetermined length.
  - 24. The method of claim 23, wherein said summary is output on a display screen.
  - 25. The method of claim 23, wherein said summary is output by a printer.
  - 26. The method of claim 1, wherein said document includes roman characters, and step (a) comprises, using said computer, determining a frequency of occurrence of words defined by said roman characters in the document not contained in said stop list and having at least a predetermined number of said roman characters.
  - 27. The method of claim 26, wherein said document is an English-language document.
  - 28. The method of claim 1, wherein said document is a Japanese language document containing text represented by hiragana characters, katakana characters, roman characters, and kanji characters, and wherein step (a) includes:
    - using said computer, determining a frequency of occurrence of character strings in the document not contained in said stop list, containing at least one of said hiragana, katakana, and roman characters, and having at least a predetermined number of characters.
  - 29. The method of claim 1, wherein said document is a Japanese language document containing text represented by hiragana characters, katakana characters, roman characters, and kanji characters, and wherein step (a) includes:
    - using said computer, determining a frequency of occurrence of kanji characters and repeated kanji character strings in the document not contained in said stop list, and containing at least a predetermined number of strokes.
  - 30. The method of claim 1, wherein said document is a Chinese-language document containing text represented by Chinese characters, and wherein:
    - step (a) includes using said computer, determining:
      
      a frequency of occurrence of Chinese characters in the document not contained in said stop list and having a predetermined number of strokes, and
      
      a frequency of Chinese character strings in the document not contained in said stop list, having at least a third predetermined number of Chinese characters, and having said predetermined number of strokes; and
      
      step (b) includes forming said seed list from said second predetermined number of the most frequently occurring characters and character strings determined in step (a).
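The region definition in claims 12 and 20 (a qualifying sentence plus its immediate neighbors in the same paragraph, kept in document order) could look roughly like the following. This is a sketch under assumptions: paragraphs arrive already split into lists of sentences, matching is done on lowercased alphabetic words, and `select_regions` is a hypothetical name rather than anything defined in the patent.

```python
import re


def select_regions(paragraphs, seeds):
    """Return summary sentences in document order.

    paragraphs: list of paragraphs, each a list of sentence strings.
    seeds: iterable of lowercase seed-list members.
    """
    seeds = set(seeds)
    summary = []
    for para in paragraphs:
        keep = set()
        for i, sentence in enumerate(para):
            words = set(re.findall(r"[a-z]+", sentence.lower()))
            if len(words & seeds) >= 2:
                # Keep the sentence and its neighbors, never crossing
                # the paragraph boundary.
                keep.update(j for j in (i - 1, i, i + 1) if 0 <= j < len(para))
        # Sorting the indices preserves the original sentence order.
        summary.extend(para[j] for j in sorted(keep))
    return summary


paragraphs = [
    ["Intro text.", "Cats chase mice.", "Later remark.", "Unrelated end."],
    ["Mice fear cats."],
    ["Nothing here."],
]
print(select_regions(paragraphs, {"cats", "mice"}))
```

Collecting indices in a set before extracting sentences avoids duplicating a sentence when two adjacent sentences both qualify.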

31. An automated, computer implemented method of electronically processing a document stored in a memory of a computer, said document containing words represented by roman characters, said method comprising the steps of:
- a) using the computer, automatically determining a frequency of occurrence of words in the document having at least a first predetermined number of characters and not contained in a stop list that is stored in the memory of the computer;
  
  b) using the computer, automatically forming a seed list comprised of a second predetermined number of the most frequently occurring words in the document having at least said first predetermined number of characters, said seed list stored in the memory of the computer;
  
  c) using the computer, automatically forming a summary of the document comprised of regions in the document containing at least two words in said seed list, said summary stored in the memory of the computer; and
  
  d) using the computer, automatically repeating steps (a)-(c) on said summary until a length of said summary is no greater than a predetermined length, each time steps (a)-(c) are repeated, adding the words on said seed list to said stop list and reducing a value of said first predetermined number.
  - 32. The method of claim 31, wherein said document is an English-language document.
  - 33. The method of claim 31, further comprising using said computer, performing a word-stemming operation on words in said document prior to performing step (a).
  - 34. The method of claim 31, wherein said regions include at least a sentence containing said at least two words in said seed list.
  - 35. The method of claim 31, wherein each of said regions comprises a sentence containing said at least two words in said seed list and immediately preceding and following sentences located in a same paragraph as said sentence containing said at least two words in said seed list.
  - 36. The method of claim 31, further comprising including in a final summary of said document, any regions located in a first paragraph of said document, and contained in a summary of the document a first time step (c) is performed by the computer.
  - 37. The method of claim 31, further comprising including in a final summary of said document, any regions located in a last paragraph of said document, and contained in a summary of the document a first time step (c) is performed by the computer.
  - 38. The method of claim 31, wherein said stop list is empty before step (a) is performed a first time.
  - 39. The method of claim 31, wherein said stop list includes a plurality of predefined stop words before step (a) is performed a first time.
  - 40. The method of claim 31, further comprising using said computer, automatically outputting said summary when a length of said summary is no greater than said predetermined length.
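Claims 36 and 37 (like claims 13 to 15) retain regions from the first or last paragraph that appeared in the first-pass summary, even if later passes dropped them. One way to express that merge is sketched below. It assumes sentences are identified by their integer position in the document; the function name `merge_edge_regions` and its parameters are hypothetical.

```python
def merge_edge_regions(first_pass_ids, final_ids, first_para_ids, last_para_ids):
    """Combine the final summary with first-pass regions from the edge paragraphs.

    All arguments are iterables of sentence positions in the document.
    """
    edge = set(first_para_ids) | set(last_para_ids)
    # Regions selected on the first pass that lie in the first or last paragraph.
    retained = {i for i in first_pass_ids if i in edge}
    # Sorted positions keep the merged summary in document order.
    return sorted(set(final_ids) | retained)


print(merge_edge_regions(
    first_pass_ids=[0, 1, 4, 7, 9],
    final_ids=[4],
    first_para_ids=[0, 1, 2],
    last_para_ids=[8, 9],
))
```

Passing an empty `last_para_ids` gives the first-paragraph-only behavior of claim 36, and an empty `first_para_ids` gives that of claim 37.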

41. An automated, computer implemented method of electronically processing a Japanese language document stored in a memory of a computer, said document containing text represented by hiragana characters, katakana characters, roman characters, and kanji characters, said method comprising the steps of:
- a) using the computer, automatically determining a frequency of occurrence of character strings in the document not contained in a stop list, containing at least one of said hiragana, katakana, and roman characters, and having at least a first predetermined number of characters, said stop list stored in the memory of the computer;
  
  b) using the computer, automatically determining a frequency of occurrence of kanji characters and repeated kanji character strings in the document not contained in said stop list, and containing at least a second predetermined number of strokes;
  
  c) using the computer, automatically forming a seed list stored in the memory of the computer, said seed list comprised of a third predetermined number of the most frequently occurring:
  
  character strings in the document having at least said first predetermined number of characters, and
  
  kanji characters and repeated kanji character strings in the document having at least said second predetermined number of strokes;
  
  d) using the computer, automatically forming a summary of the document comprised of all sentences in the document containing at least two members of said seed list and surrounding sentences, said summary stored in the memory of the computer; and
  
  e) using the computer, automatically repeating steps (a)-(d) on said summary until a length of said summary is no greater than a predetermined length, each time steps (a)-(d) are repeated, adding the members on said seed list to said stop list and reducing values of said first and second predetermined numbers.

42. An automated, computer implemented method of electronically processing a Chinese-Language document stored in a memory of a computer, said document containing text represented by Chinese characters, said method comprising the steps of:
- a) using the computer, automatically determining a frequency of occurrence of character strings in the document not contained in a stop list, having at least a first predetermined number of characters, and containing at least a second predetermined number of strokes, said stop list stored in the memory of the computer;
  
  b) using the computer, automatically determining a frequency of occurrence of characters in the document not contained in said stop list and containing at least said second predetermined number of strokes;
  
  c) using the computer, automatically forming a seed list comprised of a third predetermined number of the most frequently occurring characters and character strings determined in steps (a) and (b), said seed list stored in the memory of the computer;
  
  d) using the computer, automatically forming a summary of the document comprised of all sentences in the document containing at least two members of said seed list and surrounding sentences, said summary stored in the memory of the computer; and
  
  e) using the computer, automatically repeating steps (a)-(d) on said summary until a length of said summary is not greater than a predetermined length, each time steps (a)-(d) are repeated, adding the members on said seed list to said stop list and reducing values of said first and second predetermined numbers.

43. A computer apparatus for automatically processing a document containing text represented by characters, comprising:
- input means for automatically inputting a document into a memory of the computer apparatus to produce document data;
  
  a processor coupled to said memory and including:
  
  frequency determining means, receiving said document data from said memory, for automatically determining a frequency of occurrence of expressions in the document not contained in a stop list and having at least a first predetermined level of complexity, said frequency determining means outputting expression frequency data, said stop list stored in said memory;
  
  seed list defining means, receiving said expression frequency data from said frequency determining means, for automatically defining a seed list comprised of a second predetermined number of the most frequently occurring expressions in the document, said seed list stored in said memory;
  
  identifying means for automatically identifying portions of the document data containing at least two members of said seed list;
  
  summarizing means for automatically forming a summary by combining said portions identified by said identifying means, said summary stored in said memory;
  
  length determining means for automatically determining whether a length of said summary is not greater than a predetermined length; and
  
  control means for automatically outputting said summary as a document summary when said length determining means determines said summary to be not greater than said predetermined length, otherwise automatically inputting said summary as a document to said input means, adding members of said seed list to said stop list, and reducing said first predetermined level of complexity so that said document data is iteratively processed until a document summary is produced having a length not greater than said predetermined length.
  - 44. The apparatus of claim 43, wherein said input means produces said document data as decoded text data.
  - 45. The apparatus of claim 43, wherein said expressions are represented in the document as at least one character, and said frequency determining means automatically determines whether said expressions have said first predetermined level of complexity by determining a number of strokes in said expressions.
  - 46. The apparatus of claim 43, wherein said expressions are represented in the document as character strings, and said frequency determining means automatically determines whether said expressions have said first predetermined level of complexity by determining a number of characters in said expressions.
  - 47. The apparatus of claim 43, wherein said expressions are represented in the document as character strings, and said frequency determining means automatically determines whether said expressions have said first predetermined level of complexity by determining a number of strokes in said expressions.
  - 48. The apparatus of claim 43, wherein said expressions are represented in the document as words, and said frequency determining means automatically determines whether said expressions have said first predetermined level of complexity by determining a length of said words.
  - 49. The apparatus of claim 48, wherein said frequency determining means automatically determines the length of said words by determining a number of characters in said words.
  - 50. The apparatus of claim 43, wherein said identifying means automatically identifies sentences containing at least two of said members of said seed list.
  - 51. The apparatus of claim 50, wherein each of said portions includes a sentence containing at least two of said members of said seed list and immediately preceding and following sentences located in a same paragraph as said sentence containing at least two of said members of said seed list.
  - 52. The apparatus of claim 43, wherein said processor further includes means for automatically predefining members of said stop list before a document is input to said input means.
  - 53. The apparatus of claim 43, wherein said summarizing means forms said summary by extracting said portions from said document and constructing said summary from said extracted portions while maintaining said extracted portions in a same order in said summary as in the document.
  - 54. The apparatus of claim 43, further comprising a display screen for displaying said output document summary.
  - 55. The apparatus of claim 43, further comprising a printer for printing said output document summary.

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Withgott, M. Margaret; Cutting, Douglass R.
Primary Examiner(s): Hayes, Gail O.
Assistant Examiner(s): Shingala, Gita
Application Number: US08/085,385
Time in Patent Office: 571 Days
Field of Search: 364/419.1, 364/419.13, 364/419.19, 364/419.02, 364/419.08
US Class Current: 715/236
CPC Class Codes: G06F 16/345 Summarisation for human users
