Method and apparatus for computer understanding and manipulation of minimally formatted text documents
First Claim
1. An apparatus for analyzing a text document on a computer having a memory, said text document comprising one or more blocks of contiguous text, said blocks of contiguous text having one or more boundaries, said apparatus comprising:
digital imaging means for scanning the text document and for converting the text document into a matrix of picture elements representing a digital image of the text document;
spatial analysis means coupled to the digital imaging means for scanning the matrix of picture elements in two dimensions to identify horizontal and vertical boundaries of the blocks of contiguous text in the digital image;
character recognition means coupled to the spatial analysis means for converting the blocks of contiguous text into a text file, said text file comprising word patterns and block identifiers;
a grammar stored in said memory comprising predetermined word patterns and block identifiers;
extractor means coupled to the memory and character recognition means for matching the word patterns and block identifiers contained in the grammar with the word patterns and block identifiers in the text file; and
formatting means coupled to the extractor means for generating output in a predefined pattern using the matched word patterns and block identifiers.
Abstract
A method and apparatus which enables a computer to understand and manipulate minimally formatted text of such documents as resumes, purchase order forms, insurance forms, bank statements and similar items is disclosed. The documents are digitized by an optical scanner and translated into ASCII text by an optical character reader. The invention manipulates the digital image of the document to find blocks of contiguous text. After separating the text by block, each block is converted into an ASCII character file. Next, these files are processed by a Grammar, which uses pattern matching techniques and syntax rules to enable the host computer to understand the text. After further manipulation by the invention, the text is either stored or outputted in a form which greatly facilitates its use and readability. In this manner documents whose information content is partially location dependent can be understood despite the fact that the documents' text is written using English language phrases with little or no grammatical structure.
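The block-finding step described in the abstract can be illustrated with a projection-profile pass over a binarized page. This is only a minimal sketch under stated assumptions, not the patented method: the page is assumed to be a list of rows of 0/1 pixels (1 = ink), and the names `ink_runs`, `find_blocks`, and the `min_gap` threshold are hypothetical choices made for this example.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # assumed representation: 1 = ink, 0 = background


def ink_runs(profile: List[int], min_gap: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) spans of non-empty rows or columns.

    Spans separated by fewer than `min_gap` empty lines are merged, so that
    ordinary line or letter spacing does not split a block.
    """
    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, count in enumerate(profile):
        if count and start is None:
            start = i                      # a run of ink begins
        elif not count and start is not None:
            spans.append((start, i))       # the run ends at the first empty line
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


def find_blocks(image: Image, min_gap: int = 2) -> List[Tuple[int, int, int, int]]:
    """Return (top, bottom, left, right) boxes, each bound half-open.

    Horizontal boundaries come from the row profile; vertical boundaries
    come from the column profile inside each horizontal band.
    """
    row_profile = [sum(row) for row in image]
    blocks = []
    for top, bottom in ink_runs(row_profile, min_gap):
        band = image[top:bottom]
        col_profile = [sum(col) for col in zip(*band)]
        for left, right in ink_runs(col_profile, min_gap):
            blocks.append((top, bottom, left, right))
    return blocks


page = [
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0],
]
blocks = find_blocks(page, min_gap=2)
```

On this toy page the sketch separates two side-by-side blocks in the top band from a third block below them. A real scan would need binarization and noise handling first, which are outside this example.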
12 Claims
1. An apparatus for analyzing a text document on a computer having a memory, said text document comprising one or more blocks of contiguous text, said blocks of contiguous text having one or more boundaries, said apparatus comprising:
digital imaging means for scanning the text document and for converting the text document into a matrix of picture elements representing a digital image of the text document;
spatial analysis means coupled to the digital imaging means for scanning the matrix of picture elements in two dimensions to identify horizontal and vertical boundaries of the blocks of contiguous text in the digital image;
character recognition means coupled to the spatial analysis means for converting the blocks of contiguous text into a text file, said text file comprising word patterns and block identifiers;
a grammar stored in said memory comprising predetermined word patterns and block identifiers;
extractor means coupled to the memory and character recognition means for matching the word patterns and block identifiers contained in the grammar with the word patterns and block identifiers in the text file; and
formatting means coupled to the extractor means for generating output in a predefined pattern using the matched word patterns and block identifiers.
Dependent claims: 2, 3
4. A method executing on a computer to analyze a text document comprising one or more blocks of contiguous text, said blocks of contiguous text having one or more boundaries, the method comprising the steps of:
making a digital image of the printed document, said digital image comprising a matrix of picture elements;
scanning the digital image in two dimensions to identify horizontal and vertical boundaries of the blocks of contiguous text in the digital image;
converting the blocks of contiguous text into a text file, the text file comprising word patterns and block identifiers;
storing a grammar in the computer, said grammar comprising predetermined word patterns and block identifiers;
matching the word patterns and block identifiers from the text file with the word patterns and block identifiers in the computer data base; and
generating output in a predefined format using the matched word patterns and block identifiers.
Dependent claims: 5, 6, 7, 8, 9, 10, 11, 12
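The matching and output-generation steps recited in the claims can be pictured with a small sketch. Everything specific here is an assumption made for illustration: the grammar is modeled as a dictionary from block identifier to named regular expressions, the text file as a list of (block identifier, text) pairs, and the names `GRAMMAR`, `extract`, and `format_output`, along with the sample patterns, are hypothetical. The patent's actual grammar format is not shown in this section.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Hypothetical grammar: block identifier -> list of (field name, word pattern).
GRAMMAR: Dict[str, List[Tuple[str, Pattern[str]]]] = {
    "contact": [("phone", re.compile(r"\(?\d{3}\)?[ -]?\d{3}-\d{4}"))],
    "education": [("degree", re.compile(r"\b(?:B\.S\.|M\.S\.|Ph\.D\.)"))],
}


def extract(
    text_file: List[Tuple[str, str]],
    grammar: Dict[str, List[Tuple[str, Pattern[str]]]],
) -> Dict[str, str]:
    """Match each block's text against the patterns registered for its identifier."""
    matches: Dict[str, str] = {}
    for block_id, text in text_file:
        for field, pattern in grammar.get(block_id, []):
            found = pattern.search(text)
            if found:
                # Keep the first match seen for each field.
                matches.setdefault(field, found.group(0))
    return matches


def format_output(matches: Dict[str, str], template: List[Tuple[str, str]]) -> str:
    """Emit matched fields in a fixed, predefined order; missing fields stay blank."""
    return "\n".join(f"{label}: {matches.get(field, '')}" for label, field in template)


# Assumed output of the character recognition step for a short resume.
text_file = [
    ("contact", "Jane Doe (555) 123-4567"),
    ("education", "M.S. Computer Science, 1990"),
]
matches = extract(text_file, GRAMMAR)
report = format_output(matches, [("Phone", "phone"), ("Degree", "degree")])
```

Keying the patterns by block identifier is one way to capture the idea that meaning is partly location dependent: the same words can be treated differently depending on the block in which they appear.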
Specification