Methods and systems that use hierarchically organized data structure containing standard feature symbols in order to convert document images to electronic documents
Abstract
The current application is directed to methods and systems that convert document images, which contain Arabic text and text in other languages in which symbols are joined together to produce continuous words and portions of words, into corresponding electronic documents. In one implementation, a document-image-processing method and system to which the current application is directed employs numerous techniques and features that render efficiently computable an otherwise intractable or impractical document-image-to-electronic-document conversion. These techniques and features include transformation of text-image morphemes and words into feature symbols with associated parameters, efficiently identifying similar morphemes and words in an electronic store of standard-feature-symbol-encoded morphemes and words, and identifying candidate inter-character division points and corresponding traversal paths using the similar morphemes and words identified in the word store.
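The lookup of similar entries in a store of feature-symbol-encoded words can be illustrated with a minimal sketch. The abstract does not say how similarity is measured or how the store is searched efficiently, so everything here is an assumption: the feature-symbol names (`loop`, `stroke`, `dot`, `arc`), the placeholder words, the function names, and the use of plain edit distance with a linear scan, which is a simple stand-in rather than the efficient method the abstract refers to.

```python
from typing import Dict, List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two feature-symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, sym_a in enumerate(a, start=1):
        curr = [i]
        for j, sym_b in enumerate(b, start=1):
            cost = 0 if sym_a == sym_b else 1
            curr.append(min(prev[j] + 1,          # delete sym_a
                            curr[j - 1] + 1,      # insert sym_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similar_entries(store: Dict[str, Sequence[str]],
                    observed: Sequence[str],
                    max_distance: int) -> List[Tuple[str, int]]:
    """Return (word, distance) pairs within max_distance, closest first."""
    hits = [(word, edit_distance(symbols, observed))
            for word, symbols in store.items()]
    return sorted((h for h in hits if h[1] <= max_distance),
                  key=lambda pair: pair[1])


# Hypothetical word store: placeholder words keyed to made-up feature symbols.
store = {
    "word_1": ("loop", "stroke", "dot"),
    "word_2": ("loop", "dot"),
    "word_3": ("arc", "arc", "arc", "arc"),
}
observed = ("loop", "stroke", "dot")
matches = similar_entries(store, observed, max_distance=1)
print(matches)
```

In this sketch, `word_1` matches exactly and `word_2` differs by one missing symbol, so both are returned; `word_3` is too far from the observed sequence and is dropped.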
20 Claims
1. A system that transforms a document image into an electronic document, the system comprising:
one or more processors;
one or more electronic memories; and
a hierarchically organized data structure, stored in one or more of the one or more electronic memories, the hierarchically organized data structure comprising a plurality of entries corresponding to one or more natural-language entities selected from among one or more morphemes, words, or phrases encoded as sequences of standard feature symbols, wherein the plurality of entries are associated with a plurality of scores; and
computer instructions, digitally encoded and stored in one or more of the one or more electronic memories and executed on the one or more processors, that;
receive an image comprising text of a language;
identify a subimage within the image, the subimage corresponding to one or more of words and morphemes;
identify a set of character-sequences that represent candidate character-sequence representations of the subimage, wherein a character-sequence of the set is identified by traversing a path of the hierarchically organized data structure and accumulating a value for the character-sequence based on the scores on the path, wherein the value for the character-sequence in the set satisfies a predetermined threshold;
use the candidate character-sequence representations of the subimage as hypotheses regarding lexical identities of the subimage;
construct a portion of an electronic document corresponding to the received image of text using the hypotheses regarding the lexical identities of the subimage; and
store the constructed portion of the electronic document in one or more of the one or more electronic memories.
11. A method comprising:
receiving, by one or more processors, an image comprising text of a language;
identifying a subimage within the image, the subimage corresponding to one or more of words and morphemes;
identifying a set of character-sequences that represent candidate character-sequence representations of the subimage, wherein a character-sequence of the set is identified by traversing a path of the hierarchically organized data structure and accumulating a value for the character-sequence based on the scores on the path, wherein the value for the character-sequence in the set satisfies a predetermined threshold;
using the candidate character-sequence representations of the subimage as hypotheses regarding the lexical identities of the subimage;
constructing a portion of an electronic document corresponding to the received image comprising text using the hypotheses regarding the lexical identities of the subimages; and
storing the constructed portion of the electronic document in one or more of the one or more electronic memories.
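The traversal step recited in claims 1 and 11 (following paths through a hierarchically organized data structure, accumulating a value from the scores along each path, and keeping character sequences whose value satisfies a threshold) might be sketched as below. This is an illustration under stated assumptions, not the claimed implementation: the structure is modeled as a trie keyed by feature symbols, each entry's score is treated as a penalty added when the entry's symbol differs from the observed symbol, "satisfies the threshold" is read as "does not exceed it", and only entries of the same length as the observed sequence are compared. The names `Node`, `insert`, `candidates`, the symbols, and the placeholder words are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class Node:
    """One entry of a hypothetical trie keyed by feature symbols."""
    score: float = 0.0          # assumed penalty when this entry mismatches
    word: Optional[str] = None  # character sequence ending here, if any
    children: Dict[str, "Node"] = field(default_factory=dict)


def insert(root: Node, symbols: Sequence[str], word: str,
           scores: Sequence[float]) -> None:
    """Add one encoded word; scores[i] is attached to the entry for symbols[i]."""
    node = root
    for symbol, score in zip(symbols, scores):
        node = node.children.setdefault(symbol, Node(score=score))
    node.word = word


def candidates(root: Node, observed: Sequence[str],
               threshold: float) -> List[Tuple[str, float]]:
    """Return (word, value) pairs whose accumulated value is within threshold."""
    results: List[Tuple[str, float]] = []

    def walk(node: Node, depth: int, value: float) -> None:
        if value > threshold:           # prune paths that already fail
            return
        if depth == len(observed):
            if node.word is not None:
                results.append((node.word, value))
            return
        for symbol, child in node.children.items():
            penalty = 0.0 if symbol == observed[depth] else child.score
            walk(child, depth + 1, value + penalty)

    walk(root, 0, 0.0)
    return sorted(results, key=lambda pair: pair[1])


root = Node()
insert(root, ["a", "b", "c"], "word_1", [1.0, 1.0, 1.0])
insert(root, ["a", "b", "d"], "word_2", [1.0, 1.0, 0.5])
insert(root, ["x", "y", "z"], "word_3", [2.0, 2.0, 2.0])
hypotheses = candidates(root, ["a", "b", "c"], threshold=1.0)
print(hypotheses)
```

With these made-up scores, the path for `word_3` is pruned at its first symbol, while `word_1` and `word_2` survive as hypotheses with accumulated values 0.0 and 0.5. Pruning as soon as the running value exceeds the threshold is one way a tree-shaped store can avoid scoring every entry in full.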
Specification