Method and apparatus for summarizing documents according to theme
Abstract
A summary is automatically formed by selecting regions of a document. Each selected region includes at least two members of a seed list. The seed list is formed from a predetermined number of the most frequently occurring complex expressions in the document that are not on a stop list. If the summary is too long, the region-selection process is performed on the summary to produce a shorter summary. This region-selection process is repeated until a summary is produced having a desired length. Each time the region-selection process is repeated, the seed list members are added to the stop list and the complexity level used to identify frequently occurring expressions is reduced.
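The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes that "expressions" are lowercase words, that "complexity" is the character count of a word, that "regions" are sentences, and that summary length is measured in sentences. The names `summarize`, `split_sentences`, `max_sentences`, `seed_size`, and `min_len` are hypothetical, and the early exit when no sentence qualifies is an added safeguard that the source does not describe.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' (a simplifying assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    """Lowercase alphabetic words of a sentence."""
    return re.findall(r"[a-z]+", sentence.lower())


def summarize(text, max_sentences=2, seed_size=3, min_len=6, stop_list=()):
    """Repeatedly keep sentences holding at least two seed words until short enough."""
    stop = set(stop_list)
    sentences = split_sentences(text)
    while len(sentences) > max_sentences:
        # Frequency of expressions that are complex enough and not on the stop list.
        counts = Counter(
            w for s in sentences for w in words(s)
            if len(w) >= min_len and w not in stop
        )
        # Seed list: the most frequently occurring qualifying expressions.
        seeds = {w for w, _ in counts.most_common(seed_size)}
        # Summary: regions (sentences) containing at least two distinct seeds.
        selected = [s for s in sentences if len(seeds & set(words(s))) >= 2]
        if not selected:
            break  # safeguard (assumption): nothing qualifies, keep what we have
        sentences = selected
        # Before repeating: grow the stop list and lower the complexity level.
        stop |= seeds
        min_len = max(1, min_len - 1)
    return " ".join(sentences)
```

Because the seeds join the stop list on every pass, each repetition has to rely on different, progressively simpler expressions, which is what lets the summary keep shrinking.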
55 Claims
1. An automated, computer implemented method of electronically processing a document stored in a memory of a computer, said document containing text represented by characters, said method comprising the steps of:
a) using the computer, automatically determining a frequency of occurrence of expressions in the document not contained in a stop list and having at least a first predetermined level of complexity, said stop list stored in the memory of said computer;
b) using the computer, automatically forming a seed list comprised of a second predetermined number of the most frequently occurring expressions determined in step (a), said seed list stored in said memory of said computer;
c) using said computer, automatically forming a summary of the document comprised of regions in the document containing at least two members of said seed list, said summary stored in said memory of said computer; and
d) using said computer, automatically repeating steps (a)-(c) on said summary until a length of said summary is no greater than a predetermined length, each time steps (a)-(c) are repeated, adding the members of said seed list to said stop list and reducing said first predetermined level of complexity.
Dependent claims: 2-30
31. An automated, computer implemented method of electronically processing a document stored in a memory of a computer, said document containing words represented by roman characters, said method comprising the steps of:
a) using the computer, automatically determining a frequency of occurrence of words in the document having at least a first predetermined number of characters and not contained in a stop list that is stored in the memory of the computer;
b) using the computer, automatically forming a seed list comprised of a second predetermined number of the most frequently occurring words in the document having at least said first predetermined number of characters, said seed list stored in the memory of the computer;
c) using the computer, automatically forming a summary of the document comprised of regions in the document containing at least two words in said seed list, said summary stored in the memory of the computer; and
d) using the computer, automatically repeating steps (a)-(c) on said summary until a length of said summary is no greater than a predetermined length, each time steps (a)-(c) are repeated, adding the words on said seed list to said stop list and reducing a value of said first predetermined number.
Dependent claims: 32-40
41. An automated, computer implemented method of electronically processing a Japanese language document stored in a memory of a computer, said document containing text represented by hiragana characters, katakana characters, roman characters, and kanji characters, said method comprising the steps of:
a) using the computer, automatically determining a frequency of occurrence of character strings in the document not contained in a stop list, containing at least one of said hiragana, katakana, and roman characters, and having at least a first predetermined number of characters, said stop list stored in the memory of the computer;
b) using the computer, automatically determining a frequency of occurrence of kanji characters and repeated kanji character strings in the document not contained in said stop list, and containing at least a second predetermined number of strokes;
c) using the computer, automatically forming a seed list stored in the memory of the computer, said seed list comprised of a third predetermined number of the most frequently occurring: character strings in the document having at least said first predetermined number of characters, and kanji characters and repeated kanji character strings in the document having at least said second predetermined number of strokes;
d) using the computer, automatically forming a summary of the document comprised of all sentences in the document containing at least two members of said seed list and surrounding sentences, said summary stored in the memory of the computer; and
e) using the computer, automatically repeating steps (a)-(d) on said summary until a length of said summary is no greater than a predetermined length, each time steps (a)-(d) are repeated, adding the members on said seed list to said stop list and reducing values of said first and second predetermined numbers.
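Steps (a) and (b) of claim 41 apply different complexity measures to different scripts: character count for hiragana, katakana, and roman strings, and stroke count for kanji. The sketch below shows one possible way to do that, and it rests on several assumptions: scripts are identified by Unicode block, a "character string" is taken to be a maximal run of one script, and stroke counts come from a small lookup table (`STROKES`, covering only the four kanji listed). The function names and thresholds are hypothetical, and the counting of repeated kanji strings is omitted for brevity.

```python
from collections import Counter
from itertools import groupby

# Tiny illustrative stroke-count table; a real system would need a full dictionary.
STROKES = {"要": 9, "約": 9, "文": 4, "書": 10}


def script_of(ch):
    """Classify one character by Unicode block (assumed classification scheme)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "roman"
    return "other"


def count_candidates(text, min_chars=3, min_strokes=8, stop_list=()):
    """Count kana/roman runs by length and single kanji by stroke count."""
    stop = set(stop_list)
    counts = Counter()
    for script, run in groupby(text, key=script_of):
        run = "".join(run)
        if script in ("hiragana", "katakana", "roman"):
            # Step (a): strings with at least min_chars characters.
            if len(run) >= min_chars and run not in stop:
                counts[run] += 1
        elif script == "kanji":
            # Step (b): kanji with at least min_strokes strokes.
            for ch in run:
                if STROKES.get(ch, 0) >= min_strokes and ch not in stop:
                    counts[ch] += 1
    return counts
```

A seed list in the sense of step (c) could then be taken from `count_candidates(text).most_common(n)` for some chosen `n`; both thresholds would be lowered on each repetition, as step (e) requires.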
42. An automated, computer implemented method of electronically processing a Chinese-language document stored in a memory of a computer, said document containing text represented by Chinese characters, said method comprising the steps of:
a) using the computer, automatically determining a frequency of occurrence of character strings in the document not contained in a stop list, having at least a first predetermined number of characters, and containing at least a second predetermined number of strokes, said stop list stored in the memory of the computer;
b) using the computer, automatically determining a frequency of occurrence of characters in the document not contained in said stop list and containing at least said second predetermined number of strokes;
c) using the computer, automatically forming a seed list comprised of a third predetermined number of the most frequently occurring characters and character strings determined in steps (a) and (b), said seed list stored in the memory of the computer;
d) using the computer, automatically forming a summary of the document comprised of all sentences in the document containing at least two members of said seed list and surrounding sentences, said summary stored in the memory of the computer; and
e) using the computer, automatically repeating steps (a)-(d) on said summary until a length of said summary is not greater than a predetermined length, each time steps (a)-(d) are repeated, adding the members on said seed list to said stop list and reducing values of said first and second predetermined numbers.
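Step (d) of claims 41 and 42 keeps every sentence that contains at least two seed-list members, together with its surrounding sentences. A minimal sketch of that selection follows. The claim does not say how many neighbours count as "surrounding", so one sentence on each side is assumed here (the `context` parameter). The sentence splitter, which breaks on the full-width marks 。！？, and both function names are also hypothetical.

```python
import re


def split_cjk_sentences(text):
    """Split after the full-width sentence enders 。！？ (assumed delimiter set)."""
    return re.findall(r"[^。！？]+[。！？]?", text)


def select_with_context(sentences, seeds, context=1):
    """Keep sentences containing >= 2 distinct seeds, plus `context` neighbours."""
    keep = set()
    for i, sentence in enumerate(sentences):
        hits = sum(1 for seed in set(seeds) if seed in sentence)
        if hits >= 2:
            lo = max(0, i - context)
            hi = min(len(sentences), i + context + 1)
            keep.update(range(lo, hi))
    # Preserve the original document order in the summary.
    return [sentences[i] for i in sorted(keep)]
```

Collecting indices in a set before rebuilding the list avoids emitting a sentence twice when the context windows of two qualifying sentences overlap.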
43. A computer apparatus for automatically processing a document containing text represented by characters, comprising:
input means for automatically inputting a document into a memory of the computer apparatus to produce document data;
a processor coupled to said memory and including:
frequency determining means, receiving said document data from said memory, for automatically determining a frequency of occurrence of expressions in the document not contained in a stop list and having at least a first predetermined level of complexity, said frequency determining means outputting expression frequency data, said stop list stored in said memory;
seed list defining means, receiving said expression frequency data from said frequency determining means, for automatically defining a seed list comprised of a second predetermined number of the most frequently occurring expressions in the document, said seed list stored in said memory;
identifying means for automatically identifying portions of the document data containing at least two members of said seed list;
summarizing means for automatically forming a summary by combining said portions identified by said identifying means, said summary stored in said memory;
length determining means for automatically determining whether a length of said summary is not greater than a predetermined length; and
control means for automatically outputting said summary as a document summary when said length determining means determines said summary to be not greater than said predetermined length, otherwise automatically inputting said summary as a document to said input means, adding members of said seed list to said stop list, and reducing said first predetermined level of complexity so that said document data is iteratively processed until a document summary is produced having a length not greater than said predetermined length.
Dependent claims: 44-55
Specification