METHODS AND SYSTEMS THAT USE HIERARCHICALLY ORGANIZED DATA STRUCTURE CONTAINING STANDARD FEATURE SYMBOLS IN ORDER TO CONVERT DOCUMENT IMAGES TO ELECTRONIC DOCUMENTS
Abstract
The current application is directed to methods and systems that convert document images, which contain Arabic text and text in other languages in which symbols are joined together to produce continuous words and portions of words, into corresponding electronic documents. In one implementation, a document-image-processing method and system to which the current application is directed employs numerous techniques and features that render efficiently computable an otherwise intractable or impractical document-image-to-electronic-document conversion. These techniques and features include transformation of text-image morphemes and words into feature symbols with associated parameters, efficiently identifying similar morphemes and words in an electronic store of standard-feature-symbol-encoded morphemes and words, and identifying candidate inter-character division points and corresponding traversal paths using the similar morphemes and words identified in the word store.
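By way of illustration only, the store of standard-feature-symbol-encoded morphemes and words can be pictured as a trie keyed on feature symbols, in which entries similar to an observed symbol sequence are found by walking the trie while tracking an edit distance. The sketch below is an assumption-laden simplification: the symbol names (`"loop"`, `"stroke"`, and so on), the class `FeatureSymbolStore`, and the use of plain edit distance as the similarity measure are all hypothetical, and the parameters associated with feature symbols are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict = field(default_factory=dict)  # feature symbol -> Node
    entries: list = field(default_factory=list)   # entities whose encoding ends here


class FeatureSymbolStore:
    """Hypothetical trie keyed on sequences of standard feature symbols."""

    def __init__(self):
        self.root = Node()

    def add(self, symbols, entity):
        """Store a morpheme, word, or phrase under its feature-symbol sequence."""
        node = self.root
        for sym in symbols:
            node = node.children.setdefault(sym, Node())
        node.entries.append(entity)

    def similar(self, symbols, max_cost=1):
        """Return (entity, cost) pairs within max_cost edits of `symbols`."""
        results = []
        first_row = list(range(len(symbols) + 1))
        for sym, child in self.root.children.items():
            self._walk(child, sym, symbols, first_row, max_cost, results)
        return sorted(results, key=lambda r: (r[1], r[0]))

    def _walk(self, node, sym, symbols, prev_row, max_cost, results):
        # One row of the edit-distance table per trie level.
        row = [prev_row[0] + 1]
        for i in range(1, len(symbols) + 1):
            row.append(min(
                row[i - 1] + 1,                              # skip a query symbol
                prev_row[i] + 1,                             # skip a stored symbol
                prev_row[i - 1] + (symbols[i - 1] != sym),   # match or substitute
            ))
        if row[-1] <= max_cost:
            results.extend((entity, row[-1]) for entity in node.entries)
        if min(row) <= max_cost:  # prune subtrees that can no longer match
            for next_sym, child in node.children.items():
                self._walk(child, next_sym, symbols, row, max_cost, results)


store = FeatureSymbolStore()
store.add(["loop", "stroke", "dot"], "word_a")
store.add(["loop", "stroke", "hook"], "word_b")
store.add(["arc", "arc", "arc", "arc"], "word_c")
matches = store.similar(["loop", "stroke", "dot"], max_cost=1)
```

Because entries that share a symbol prefix share the same path, the distance row computed for that prefix is reused for all of them, and a whole subtree is skipped once every cell in the row exceeds the allowed cost.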
20 Claims
1. A system that transforms a document image into an electronic document, the system comprising:
- one or more processors;
- one or more electronic memories; and
- a hierarchically organized data structure, stored in one or more of the one or more electronic memories, each entry of which corresponds to one or more natural-language entities selected from among one or more morphemes, words, or phrases encoded as sequences of standard feature symbols; and
- computer instructions, digitally encoded and stored in one or more of the one or more electronic memories and executed on the one or more processors, that
  - receive an image of a block of text of an Arabic-like language,
  - identify images of lines of text within the received image of the block of text;
  - identify subimages within the image of the line of text corresponding to one or more of words and morphemes,
  - for each identified subimage,
    - identify sets of characters that represent candidate character-sequence representations of the subimage; and
    - use the candidate character-sequence representations of the subimages as hypotheses regarding the lexical identities of the subimages;
  - reconstruct a portion of an electronic document corresponding to the received image of the block of text using the hypotheses regarding the lexical identities of the subimages; and
  - store the reconstructed portion of the electronic document in one or more of the one or more electronic memories.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A method that transforms a document image into an electronic document within a system having one or more processors, one or more electronic memories, and the hierarchically organized data structure, the method comprising:
- receiving an image of a block of text of an Arabic-like language;
- identifying images of lines of text within the received image of the block of text;
- identifying subimages within the image of the line of text corresponding to one or more of words and morphemes;
- for each identified subimage,
  - identifying sets of characters that represent candidate character-sequence representations of the subimage, and
  - using the candidate character-sequence representations of the subimages as hypotheses regarding the lexical identities of the subimages;
- reconstructing a portion of an electronic document corresponding to the received image of the block of text using the hypotheses regarding the lexical identities of the subimages; and
- storing the reconstructed portion of the electronic document in one or more of the one or more electronic memories.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
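For orientation only, the order of the steps recited in the method claim (receive, identify lines, identify subimages, generate candidate character sequences, reconstruct, store) can be sketched as a simple control flow. Everything named here is hypothetical: `convert_block`, its callable parameters, and the toy stand-ins in which an "image" is merely a string of already-separated symbol codes. Taking the first candidate as the accepted hypothesis is likewise an assumption made for brevity; no image processing is performed.

```python
def convert_block(block_image, find_lines, find_subimages, candidates_for, memory):
    """Hypothetical skeleton that follows the step order of the method claim."""
    lines_out = []
    for line_image in find_lines(block_image):           # images of lines of text
        words = []
        for subimage in find_subimages(line_image):      # word/morpheme subimages
            hypotheses = candidates_for(subimage)        # candidate character sequences
            # Assumption: take the first candidate as the accepted hypothesis.
            words.append(hypotheses[0] if hypotheses else "?")
        lines_out.append(" ".join(words))
    text = "\n".join(lines_out)                          # reconstructed portion
    memory.append(text)                                  # store the result
    return text


# Toy stand-ins: a string plays the role of the image, and a small
# dictionary plays the role of the candidate generator.
LEXICON = {"k-t-b": ["kataba"], "q-l-m": ["qalam"]}
memory = []
result = convert_block(
    "k-t-b q-l-m\nq-l-m x-y-z",
    find_lines=str.splitlines,
    find_subimages=str.split,
    candidates_for=lambda sub: LEXICON.get(sub, []),
    memory=memory,
)
```

Passing the individual stages in as callables keeps the skeleton independent of how lines, subimages, and candidates are actually obtained.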
Specification