Method and apparatus for converting bitmap image documents to editable coded data using a standard notation to record document recognition ambiguities
First Claim
1. A method of transforming a document represented as a bitmap image into an editable coded data stream defined using a machine readable document description language that records coded information resulting from the document transformation process and information regarding uncertainties in the document transformation process, comprising:
performing a first transformation operation on at least a text portion of said bitmap image using a character recognition apparatus, to transform at least said text portion of said bitmap image into coded information recognized with a level of confidence;
outputting and recording said coded information into one or more elements that are defined using said document description language, each element having a machine readable element type identifier that indicates the type of said coded information regarding the recognized bitmap image recorded in said element so that the type of coded information contained in each element can be known without examining the coded information contained in each element, the element type identifier for each element having been defined based on the type of coded information recorded in the element and the level of confidence with which said bitmap image represented by said coded information was recognized so that each of said elements is selectively identified, each element having coded information of a single type recorded therein; and
when said character recognition apparatus determines that the recognized bitmap image contained in an element has not been recognized with at least a predetermined level of confidence, recording in said element uncertainty information determined by said first recognition apparatus regarding said recognized bitmap image contained in said element;
wherein said element type identifier is a character-string-element or a questionable-character-element, each character-string-element containing a string of consecutive characters recognized by said character recognition apparatus with at least said predetermined level of confidence, each questionable-character-element containing said uncertainty information determined by said character recognition apparatus for a character which was not recognized with at least said predetermined level of confidence by said character recognition apparatus.
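The element stream described in claim 1 can be illustrated with a minimal sketch. It assumes the character recognizer delivers a list of (character, confidence) pairs, which the claim does not specify. The `Element` class, the `build_elements` function, the 0.9 threshold, and the `confidence` key are hypothetical names chosen for illustration; only the two element type identifiers come from the claim.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

CHARACTER_STRING = "character-string-element"
QUESTIONABLE_CHARACTER = "questionable-character-element"


@dataclass
class Element:
    # Type identifier: readable without inspecting the content.
    type_id: str
    # Coded information of a single type (here, recognized text).
    content: str
    # Uncertainty information; present only for questionable characters.
    uncertainty: Optional[dict] = None


def build_elements(chars: List[Tuple[str, float]],
                   threshold: float = 0.9) -> List[Element]:
    """Group recognizer output into a stream of typed elements.

    `chars` is an assumed input format: (character, confidence) pairs
    in reading order. `threshold` stands in for the claim's
    "predetermined level of confidence".
    """
    elements: List[Element] = []
    run: List[str] = []  # consecutive confidently recognized characters

    for ch, conf in chars:
        if conf >= threshold:
            run.append(ch)
            continue
        # A low-confidence character ends the current confident run.
        if run:
            elements.append(Element(CHARACTER_STRING, "".join(run)))
            run = []
        elements.append(
            Element(QUESTIONABLE_CHARACTER, ch, {"confidence": conf})
        )

    if run:
        elements.append(Element(CHARACTER_STRING, "".join(run)))
    return elements


if __name__ == "__main__":
    stream = build_elements([("c", 0.99), ("a", 0.95), ("t", 0.40), ("s", 0.97)])
    for element in stream:
        print(element.type_id, repr(element.content), element.uncertainty)
```

Because each element carries its own type identifier, a downstream editor could in principle pick out every questionable character for review by filtering on `type_id` alone, without parsing the recognized text.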
Abstract
Documents represented as bitmap images are transformed into coded textual data and coded graphics data by graphics and textual recognizers, which use a standard notation for recording the results of the document recognition processes, including any ambiguities, in a document description language. Recognized portions of the document, represented as editable coded data, such as for example ASCII, are placed in elements, defined in the document description language, with all contents of an element sharing some common characteristic. Elements can include, for example: character-string-elements, questionable-character-elements, questionable-word-elements, verified-word-elements, alternate-word-elements, segment-elements, and arc-elements.
61 Citations
20 Claims
- Dependent claims of claim 1: 2, 3, 4, 5, 6, 7, 8, 9
10. A method of transforming a document represented as a bitmap image into an editable coded data stream defined using a machine readable document description language that records coded information resulting from the document transformation process and information regarding uncertainties in the document transformation process, comprising the steps of:
a) analyzing text portions of said bitmap image using a character recognizer to transform said text portions of said bitmap image into said coded information recognized with a level of confidence;
b) outputting and recording said coded information into a series of elements that are defined using said document description language, each element having a machine readable element type identifier that indicates the type of said coded information regarding the recognized bitmap image recorded in said element so that the type of coded information contained in each element can be known without examining the coded information contained in each element;
c) the element type identifier for each element having been defined using said document description language based on the type of coded information recorded in the elements by defining the type identifier of each of said elements as character-string-elements or questionable-character-elements, each element defined as a character-string-element containing a string of consecutive characters recognized by said character recognizer with at least a predetermined level of confidence, each element defined as a questionable-character-element containing information pertaining to a character not recognized with at least said predetermined level of confidence, said character-string-elements and said questionable-character-elements being arranged as a continuous stream of elements based on an order of the characters in said bitmap image, each element having coded information of a single type recorded therein;
d) for each element defined as a questionable-character-element, analyzing the coded information pertaining to the character contained therein and adjacent confidently recognized characters contained in a same word as said questionable-character-element using a word recognizer, to transform and record said coded information and adjacent confidently recognized characters into one or more elements having type identifiers defined as verified-word-elements by substituting alternate characters for said questionable-character-element and determining whether one or more verified words created by said substituting are recognized by said word recognizer; and
e) when more than one verified-word-elements are transformed for each questionable-character-element, placing said more than one elements defined as verified-word-elements in an element having an element type identifier defined as an alternate-word-element, said questionable-character-element remaining when no verified words are recognized by said word recognizer.
- Dependent claims of claim 10: 11, 12, 13, 14, 15, 16
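Steps d) and e) of claim 10 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the word recognizer is modeled as a plain set lookup, the alternate characters are assumed to come from the questionable-character-element's uncertainty information, and `verify_word`, `lexicon`, and the dictionary keys are hypothetical names.

```python
from typing import Dict, List, Set


def verify_word(prefix: str, alternates: List[str], suffix: str,
                lexicon: Set[str]) -> Dict:
    """Resolve one questionable character inside a word.

    prefix / suffix: adjacent confidently recognized characters of the
    same word. alternates: candidate characters for the questionable
    position (assumed input). lexicon: stand-in for the word recognizer.
    """
    # Substitute each alternate character and keep the words the
    # word recognizer accepts.
    candidates = [prefix + alt + suffix for alt in alternates]
    verified = [w for w in candidates if w.lower() in lexicon]

    if not verified:
        # No verified word: the questionable-character-element remains.
        return {"type": "questionable-character-element",
                "alternates": alternates}

    words = [{"type": "verified-word-element", "content": w}
             for w in verified]
    if len(words) == 1:
        return words[0]

    # More than one verified word: wrap them in an alternate-word-element.
    return {"type": "alternate-word-element", "content": words}


if __name__ == "__main__":
    lexicon = {"cat", "cot", "dog"}
    print(verify_word("c", ["a", "o", "e"], "t", lexicon))
    print(verify_word("d", ["o", "0"], "g", lexicon))
    print(verify_word("x", ["q"], "z", lexicon))
```

The three calls exercise the three outcomes the claim distinguishes: several verified words grouped in an alternate-word-element, a single verified-word-element, and an unresolved questionable-character-element.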
17. An automatic document recognition apparatus for transforming text documents represented as bitmap image data into an editable coded data stream defined using a machine readable document description language that records coded data resulting from the document transformation process and information regarding uncertainties in the document transformation process, said apparatus comprising:
a character recognizer having;
a) first transformation means for performing a character recognition operation on said bitmap image representation of said document to transform said document into coded character data recognized with a level of confidence; and
b) first identification means for outputting and recording said coded character data into one or more elements that are defined using said document description language, said first identification means further defining a machine readable element type identifier for each element, said element type identifier indicating the type of said coded character data regarding the recognized bitmap image recorded in said element so that the type of coded character data contained in each element can be known without examining the coded character data contained in each element, said identification means selectively identifying said one or more elements by defining the element type identifier for each element based on the type of coded data recorded in the element and the level of confidence with which said bitmap image was recognized, each element having coded character data of a single type recorded therein, and, when said first transformation means determines that the coded character data contained in the element has not been transformed with a predetermined level of confidence, said identification means also recording in said element uncertainty information determined by said first transformation means regarding said coded character data contained in said element;
wherein said element type identifier is a character-string-element or a questionable-character-element, each character-string-element containing a string of consecutive characters recognized by said character recognition apparatus with at least said predetermined level of confidence, each questionable-character-element containing said uncertainty information determined by said character recognition apparatus for a character which was not recognized with at least said predetermined level of confidence by said character recognition apparatus.
- Dependent claims of claim 17: 18, 19, 20
Specification