Automatic method of generating thematic summaries from a document image without performing character recognition
Abstract
A method is disclosed for automatically generating a thematic summary from a document image without performing character recognition to generate an ASCII representation of the document text. The method begins by decomposing the document image into text blocks and text lines. The main body of text is then identified using the median x-height of the text blocks. Next, word image equivalence classes and sentence boundaries within the blocks of the main body of text are determined. The word image equivalence classes are used to identify thematic words. These, in turn, are used to score the sentences within the main body of text, and the highest-scoring sentences are selected for extraction.
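The abstract does not say how the median x-height singles out the main body of text, so the following Python sketch is only one plausible reading. It assumes each text block is given as a list of per-line x-heights in pixels, that the main body is set in the x-height shared by the most text lines, and that a block belongs to the main body when its median x-height lies within a small tolerance of that value. The function name `main_body_blocks`, the input layout, and the `tolerance` parameter are all hypothetical.

```python
from collections import Counter
from statistics import median


def main_body_blocks(blocks, tolerance=1):
    """Return ids of blocks whose median x-height matches the dominant one.

    blocks: dict mapping a block id to the x-heights (pixels) of its text lines.
    tolerance: assumed allowable deviation, in pixels, from the dominant x-height.
    """
    # Median x-height of each non-empty block.
    block_height = {bid: median(hs) for bid, hs in blocks.items() if hs}

    # Assumption: the main body uses the x-height covering the most text lines.
    votes = Counter()
    for bid, height in block_height.items():
        votes[round(height)] += len(blocks[bid])
    body_height = votes.most_common(1)[0][0]

    return [bid for bid, height in block_height.items()
            if abs(height - body_height) <= tolerance]


page = {
    "title": [30],
    "body1": [12, 12, 13, 12],
    "body2": [12, 12, 12],
    "footnote": [8, 8],
}
print(main_body_blocks(page))
```

In this toy page the two body blocks share an x-height of 12 pixels, so the title and footnote blocks are excluded before any word or sentence analysis takes place.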
12 Claims
1. A processor implemented method of generating a thematic summary of a document image without performing character recognition using a processor, the document including a first multiplicity of sentences and a second multiplicity of word occurrences, the processor implementing the method by executing instructions stored in electronic form in a memory coupled to the processor, the processor implemented method comprising the steps of:
a) analyzing the document image to identify sentence boundaries;
b) analyzing the document image to identify a plurality of word image equivalence classes, each word image equivalence class including at least one word occurrence of the second multiplicity of word occurrences;
c) selecting as thematic word images a first number of word image equivalence classes, the first number being less than a second number of thematic sentences to be extracted;
d) scoring each sentence of the first multiplicity of sentences based upon occurrence of thematic word images in each sentence; and
e) selecting the second number of thematic sentences from the first multiplicity of sentences based upon the score of each sentence.
Dependent claims: 2, 3, 4, 5
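Steps c) through e) of claim 1 can be illustrated with a short Python sketch. It assumes steps a) and b) have already run, so that each sentence is available as a list of word image equivalence class ids (plain integers here) and no character recognition is involved. Claim 1 does not fix how the thematic classes are chosen or how the score is computed; this sketch assumes the most frequent classes are chosen and that a sentence's score is the number of thematic word occurrences it contains. The function name `thematic_summary` and its parameters are hypothetical.

```python
from collections import Counter


def thematic_summary(sentences, num_sentences, num_thematic=None):
    """Return indices of the selected thematic sentences, in reading order.

    sentences: list of sentences, each a list of equivalence class ids.
    num_sentences: the "second number", how many sentences to extract.
    num_thematic: the "first number", how many thematic classes to use.
    """
    if num_thematic is None:
        num_thematic = num_sentences - 1
    # Claim 1 requires the first number to be less than the second number.
    if not 1 <= num_thematic < num_sentences:
        raise ValueError("need 1 <= num_thematic < num_sentences")

    # Step c (assumed rule): take the most frequent classes as thematic.
    counts = Counter(cid for sentence in sentences for cid in sentence)
    thematic = {cid for cid, _ in counts.most_common(num_thematic)}

    # Step d (assumed rule): score = number of thematic occurrences.
    scores = [sum(1 for cid in sentence if cid in thematic)
              for sentence in sentences]

    # Step e: keep the highest-scoring sentences; ties go to earlier sentences.
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:num_sentences])


doc = [[1, 2, 3], [4, 5], [1, 1, 2], [6, 1, 7], [8, 9]]
print(thematic_summary(doc, num_sentences=3, num_thematic=2))
```

Returning sentence indices rather than text reflects the image-only setting: a summary would be assembled by cropping the corresponding sentence regions from the document image.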
6. A processor implemented method of generating a thematic summary of a document image without performing character recognition using a processor, the document including a first multiplicity of sentences and a second multiplicity of word occurrences, the processor implementing the method by executing instructions stored electronically in a memory coupled to the processor, the processor implemented method comprising the steps of:
a) analyzing the document image to identify sentence boundaries of the first multiplicity of sentences;
b) analyzing the document image to identify a plurality of word image equivalence classes, each word image equivalence class including at least one word occurrence of the second multiplicity of word occurrences, each word image equivalence class having a bounding box, each bounding box having a width;
c) determining a number of times each word image equivalence class occurs in the document;
d) identifying drop words within the word image equivalence classes based upon their bounding box width and the number of times each word image equivalence class occurs in the document;
e) generating a list of content words by eliminating the drop words from the word image equivalence classes;
f) selecting as thematic word images a first number of the word equivalence classes from the list of content words based upon the number of times each content word occurs in the document;
g) scoring each sentence of the first multiplicity of sentences that includes a thematic word image based upon occurrences of thematic word images within each sentence; and
h) selecting as thematic sentences a second number of the first multiplicity of sentences based upon the sentence scores.
Dependent claims: 7, 8, 9, 10
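Steps c) through f) of claim 6 can be sketched in Python as follows. The claim says only that drop words are identified from bounding box width and occurrence count, without giving a rule. This sketch assumes a class is a drop word when it is both narrow and frequent, on the reasoning that short function words tend to be both. The `WordClass` record, the function name `select_thematic`, and the two thresholds are hypothetical.

```python
from typing import NamedTuple


class WordClass(NamedTuple):
    class_id: int
    width: int   # bounding box width in pixels
    count: int   # number of occurrences in the document (step c)


def select_thematic(classes, num_thematic, max_drop_width, min_drop_count):
    """Return the ids of the thematic word image equivalence classes."""
    # Step d (assumed rule): narrow and frequent classes are drop words.
    drop = {c.class_id for c in classes
            if c.width <= max_drop_width and c.count >= min_drop_count}

    # Step e: content words are whatever remains after removing drop words.
    content = [c for c in classes if c.class_id not in drop]

    # Step f: pick the most frequent content words; ties broken by class id.
    content.sort(key=lambda c: (-c.count, c.class_id))
    return [c.class_id for c in content[:num_thematic]]


classes = [
    WordClass(0, 30, 40),   # narrow and very frequent
    WordClass(1, 25, 35),   # narrow and very frequent
    WordClass(2, 90, 12),
    WordClass(3, 110, 9),
    WordClass(4, 28, 2),    # narrow but rare, so kept as a content word
    WordClass(5, 70, 3),
]
print(select_thematic(classes, 2, max_drop_width=35, min_drop_count=10))
```

Requiring both conditions keeps a short but rare word, such as an acronym, in the content list, while a frequent wide word remains eligible as a thematic word.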
11. An article of manufacture comprising:
a) a memory; and
b) instructions stored in the memory for a method of generating a thematic summary of a document image without performing character recognition using a processor coupled to the memory, the document including a first multiplicity of sentences and a second multiplicity of word occurrences, the method comprising the steps of:
1) analyzing the document image to identify sentence boundaries;
2) analyzing the document image to identify word image equivalence classes, each word image equivalence class including at least one word occurrence of the second multiplicity of word occurrences;
3) selecting as thematic word images a first number of word image equivalence classes, the first number of thematic word images being less than a second number of thematic sentences to be extracted;
4) scoring each sentence of the first multiplicity of sentences based upon occurrence of thematic word images in each sentence; and
5) selecting the second number of thematic sentences from the first multiplicity of sentences based upon the score of each sentence.
12. A system for generating a thematic summary from a document image without performing character recognition, the system comprising:
a) a scanner for generating an electronic representation of the document image;
b) a processor coupled to the scanner;
c) a memory coupled to the processor for storing instructions for generating a thematic summary from the electronic representation of the document image, the instructions representing the steps of:
1) analyzing the document image to identify sentence boundaries;
2) analyzing the document image to identify a plurality of word image equivalence classes, each word image equivalence class including at least one word occurrence of the second multiplicity of word occurrences within the document;
3) selecting as thematic word images a first number of word image equivalence classes, the first number being less than a second number of thematic sentences to be extracted;
4) scoring each sentence of the first multiplicity of sentences based upon occurrence of thematic word images in each sentence; and
5) selecting the second number of thematic sentences from the first multiplicity of sentences based upon the score of each sentence.
Specification