Method for identifying and resolving erroneous characters output by an optical character recognition system
Abstract
A post-processing method for optical character recognition (OCR) that combines different OCR engines to identify and resolve characters, and attributes of the characters, that are erroneously recognized by multiple OCR engines. The characters can originate from many different types of character environments. The OCR engine outputs are synchronized using synchronization heuristics in order to detect matches and mismatches between them. The mismatches are resolved using resolution heuristics and neural networks. The resolution heuristics and neural networks are based on observing many different conventional OCR engines in different character environments to find which specific OCR engine correctly identifies a certain character having particular attributes. The results are encoded into the resolution heuristics and neural networks to create an optimal OCR post-processing solution.
58 Citations
13 Claims
1. A method executed by a computer as part of a computer program for identifying and resolving characters and attributes of said characters erroneously recognized by a plurality of different optical character recognition engines, said characters originating from different types of character environments, said computer connectable to receive a plurality of different optical character recognition (OCR) engine outputs from corresponding said different OCR engines, said method comprising the steps of:
a) synchronizing said different OCR engine outputs from said different OCR engines to each other to detect matches and mismatches between said different OCR engine outputs from said different OCR engines by executing one or more synchronization heuristics to pattern match said OCR engine outputs, by varying a character substitution ratio and a number of look-ahead characters to determine whether the corresponding number of look-ahead characters in said OCR engine outputs match;
b) resolving each of said mismatches from said different OCR engines if any mismatch is detected in step (a); and
c) outputting said matches and said resolved mismatches.
2. A method executed by a computer as part of a computer program for identifying and resolving characters and attributes of said characters erroneously recognized by a plurality of different optical character recognition engines, said characters originating from different types of character environments, said computer connectable to receive a plurality of different optical character recognition (OCR) engine outputs from corresponding said different OCR engines, said method comprising the steps of:
a) synchronizing said different OCR engine outputs from said different OCR engines to each other to detect matches and mismatches between said different OCR engine outputs from said different OCR engines by performing the steps of:
a1) converting each of said OCR engine outputs into a corresponding character list;
a2) comparing each of said character lists to each other; and
a3) identifying said matches and said mismatches between said OCR engine outputs based on said comparing in step (a2);
b) resolving each of said mismatches from said different OCR engines if any mismatch is detected in step (a); and
c) outputting said matches and said resolved mismatches.
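Steps (a1) to (a3) of claim 2 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes the engine outputs are plain strings that are already aligned position by position, and the function names `to_char_list` and `compare_char_lists` are hypothetical.

```python
from itertools import zip_longest
from typing import List, Optional, Sequence, Tuple

def to_char_list(ocr_output: str) -> List[str]:
    """Step (a1): convert one OCR engine output into a character list."""
    return list(ocr_output)

def compare_char_lists(
    outputs: Sequence[str],
) -> List[Tuple[int, Tuple[Optional[str], ...], str]]:
    """Steps (a2) and (a3): compare the lists and label each position."""
    char_lists = [to_char_list(output) for output in outputs]
    results = []
    # Assumption: position-wise alignment. A shorter list is padded with None,
    # so a missing character is reported as a mismatch.
    for position, chars in enumerate(zip_longest(*char_lists, fillvalue=None)):
        status = "match" if len(set(chars)) == 1 else "mismatch"
        results.append((position, chars, status))
    return results

for row in compare_char_lists(["cat", "cot", "cat"]):
    print(row)
```

Position-wise comparison breaks down as soon as one engine inserts or drops a character, which is the situation the look-ahead synchronization of claims 1 and 11 addresses.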
3. A method executed by a computer as part of a computer program for identifying and resolving characters and attributes of said characters erroneously recognized by a plurality of different optical character recognition engines, said characters originating from different types of character environments, said computer connectable to receive a plurality of different optical character recognition (OCR) engine outputs from corresponding said different OCR engines, said method comprising the steps of:
a) synchronizing said different OCR engine outputs from said different OCR engines to each other to detect matches and mismatches between said different OCR engine outputs from said different OCR engines by performing the steps of:
a1) converting each of said OCR engine outputs into a corresponding character list;
a2) comparing each of said character lists to each other; and
a3) identifying character substitution errors between said character lists as a mismatch based on said comparing in step (a2);
b) resolving each of said mismatches from said different OCR engines if any mismatch is detected in step (a); and
c) outputting said matches and said resolved mismatches.
Dependent claims: 4
5. A method executed by a computer as part of a computer program for identifying and resolving characters and attributes of said characters erroneously recognized by a plurality of different optical character recognition engines, said characters originating from different types of character environments, said computer connectable to receive a plurality of different optical character recognition (OCR) engine outputs from corresponding said different OCR engines, said method comprising the steps of:
a) synchronizing said different OCR engine outputs from said different OCR engines to each other to detect matches and mismatches between said different OCR engine outputs from said different OCR engines;
b) resolving each of said mismatches from said different OCR engines if any mismatch is detected in step (a) by performing the steps of:
b1) determining whether one or more resolution heuristics will resolve a mismatch of said mismatches;
b2) resolving said mismatch by applying said one or more resolution heuristics based on said determining in step (b1); and
b3) executing one of a plurality of neural networks to resolve said mismatch if none of said resolution heuristics are capable of resolving said mismatch; and
c) outputting said matches and said resolved mismatches.
Dependent claims: 6, 7, 8
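The control flow of steps (b1) to (b3) in claim 5, heuristics first and a neural network only as a fallback, can be sketched as follows. Everything named here is an assumption for illustration: `majority_vote` is a hypothetical heuristic that the claim does not specify, and `fallback` is a plain callable standing in for whichever neural network would be selected.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# A heuristic returns the resolved character, or None if it cannot decide.
Heuristic = Callable[[Sequence[str]], Optional[str]]

def majority_vote(candidates: Sequence[str]) -> Optional[str]:
    """Hypothetical heuristic: accept a character reported by a strict majority."""
    top, count = Counter(candidates).most_common(1)[0]
    return top if count > len(candidates) / 2 else None

def resolve_mismatch(
    candidates: Sequence[str],
    heuristics: Sequence[Heuristic],
    fallback: Callable[[Sequence[str]], str],
) -> str:
    """Steps (b1)/(b2): try each heuristic; step (b3): otherwise use the fallback."""
    for heuristic in heuristics:
        resolved = heuristic(candidates)
        if resolved is not None:
            return resolved
    # Placeholder for executing one of the neural networks.
    return fallback(candidates)

# Three engines disagree; two of them report "a".
print(resolve_mismatch(["a", "o", "a"], [majority_vote], lambda c: c[0]))  # a
```

Keeping the fallback as an injected callable leaves open how the network is chosen and what input features it receives, since the claim does not say.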
9. A method executed by a computer as part of a computer program for identifying and resolving characters and attributes of said characters erroneously recognized by a plurality of different optical character recognition engines, said characters originating from different types of character environments, said computer connectable to receive a plurality of different optical character recognition (OCR) engine outputs from corresponding said different OCR engines, said method comprising the steps of:
a) synchronizing said different OCR engine outputs from said different OCR engines to each other to detect matches and mismatches between said different OCR engine outputs from said different OCR engines by performing the steps of:
a1) converting each of said OCR engine outputs into a corresponding character list and character-attribute list;
a2) comparing each of said character lists to each other;
a3) identifying character substitution errors between said character lists as a mismatch based on said comparing in step (a2);
a4) comparing attribute information of each of said matches and said mismatches; and
a5) identifying character attribute errors between said character-attribute lists as a mismatch based on said comparing in step (a4);
b) resolving each of said mismatches from said different OCR engines if any mismatch is detected in step (a); and
c) outputting said matches and said resolved mismatches.
Dependent claims: 10
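Claim 9 distinguishes character substitution errors from character attribute errors. A minimal sketch of that distinction for a single synchronized position is given below. The `Glyph` type, the attribute names such as "bold", and the function `classify_position` are assumptions made for this example; the claim does not define how attributes are represented.

```python
from typing import FrozenSet, List, NamedTuple, Sequence

class Glyph(NamedTuple):
    """One recognized character plus its attributes (hypothetical representation)."""
    char: str
    attrs: FrozenSet[str]

def classify_position(glyphs: Sequence[Glyph]) -> List[str]:
    """Return the kinds of mismatch found among the engines at one position."""
    errors = []
    # Steps (a2)/(a3): the engines disagree on the character itself.
    if len({glyph.char for glyph in glyphs}) > 1:
        errors.append("character substitution")
    # Steps (a4)/(a5): the engines disagree on the attributes.
    if len({glyph.attrs for glyph in glyphs}) > 1:
        errors.append("character attribute")
    return errors

engine_1 = Glyph("a", frozenset({"bold"}))
engine_2 = Glyph("a", frozenset())
print(classify_position([engine_1, engine_2]))  # ['character attribute']
```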
11. A synchronization method for matching characters from a plurality of different character lists output by a corresponding plurality of different OCR engines, comprising the steps of:
a) adjusting a number of look-ahead characters which defines how many characters are being matched in each of said character lists from the different optical character recognition (OCR) engines;
b) adjusting a character substitution ratio which defines how many characters are being ignored in each of said character lists;
c) ignoring a number of characters in each of said character lists based on said character substitution ratio;
d) comparing a number of characters following said ignored characters in each of said character lists based on said number of look-ahead characters; and
e) identifying a character substitution error if said number of look-ahead characters in each of said character lists match.
Dependent claims: 12, 13
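One way to read steps (a) to (e) of claim 11 is sketched below for two character lists. The sketch makes several assumptions that the claim does not state: the character substitution ratio is modeled as a pair (characters ignored in list A, characters ignored in list B), the candidate ratios and look-ahead counts are tried in a fixed order, and the function name `find_resync` and its default values are hypothetical.

```python
from typing import Optional, Sequence, Tuple

def find_resync(
    list_a: Sequence[str],
    list_b: Sequence[str],
    pos_a: int,
    pos_b: int,
    ratios: Sequence[Tuple[int, int]] = ((1, 1), (1, 2), (2, 1)),
    lookaheads: Sequence[int] = (3, 2, 1),
) -> Optional[Tuple[int, int, int]]:
    """Search for a point where the two lists agree again after a mismatch.

    Returns (ignored_in_a, ignored_in_b, lookahead) or None if nothing matches.
    """
    for lookahead in lookaheads:            # step (a): vary the look-ahead count
        for skip_a, skip_b in ratios:       # step (b): vary the substitution ratio
            # Step (c): ignore skip_a / skip_b characters.
            # Step (d): compare the look-ahead characters that follow them.
            start_a, start_b = pos_a + skip_a, pos_b + skip_b
            window_a = list_a[start_a:start_a + lookahead]
            window_b = list_b[start_b:start_b + lookahead]
            if len(window_a) == lookahead and list(window_a) == list(window_b):
                # Step (e): the ignored characters form a substitution error.
                return skip_a, skip_b, lookahead
    return None

# "rn" read as "m": the lists disagree at index 4 and agree again on " da".
print(find_resync(list("modern day"), list("modem day"), 4, 4))  # (2, 1, 3)
```

Trying the longest look-ahead first favors the most strongly confirmed resynchronization point; a real system might order or weight the candidates differently.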
Specification