System and method for capturing and processing business data
Abstract
A method and a system for interpreting information in a document are provided, with the system receiving an image of a document from a remote source and converting it into multiple sets of blocks of characters. Tags indicating likely meaning of blocks are assigned to them. At least some of the blocks have an associated score representing the probability that the characters in the block correctly represent the characters in the original image. The system selects one set from multiple sets based on the scores associated with certain blocks determined by accessing remote information over the Internet.
39 Claims
1. A method of interpreting information in a document comprising:
- receiving an image of a document from a remote source;
- representing the image as text comprising characters, wherein at least some of the characters have alternative versions with associated confidence probabilities;
- representing the text as tokens, wherein the tokens comprise collections of characters and wherein different tokens are defined for different versions of a character;
- combining tokens into tokenizations, wherein each tokenization is a set of tokens, wherein for characters with different versions only one version is included in a tokenization;
- assigning one or more tags to those tokens, wherein the tags indicate a possible meaning of a corresponding token, and assigning a score value indicating a probability of accuracy of a corresponding tag;
- parsing each of said tokenizations based on a predetermined grammar so as to obtain multiple tokenizations wherein only one tag with associated score is assigned to each token based on both dictionary and grammar matching;
- assigning each tokenization an aggregate score based on compliance with the grammar and scores of all tokens; and
- selecting one tokenization with tags using the aggregated score as a metric of success so as to obtain a final tokenization from the multiple tokenizations with tags.
Dependent claims: 2-16
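The steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the OCR alternatives in `OCR_CHARS`, the tag patterns in `TAG_RULES`, the one-rule `GRAMMAR`, the grammar penalty of 0.01, and the choice to multiply probabilities into an aggregate score are all assumptions made for illustration. For brevity the sketch picks one version per ambiguous character and then splits the text on whitespace, rather than building separate token objects for each character version first.

```python
import re
from itertools import product
from math import prod

# Hypothetical OCR output: each position lists alternative characters with
# confidence probabilities (unambiguous characters have one alternative).
OCR_CHARS = [
    [("T", 1.0)], [("o", 0.6), ("0", 0.4)], [("t", 1.0)], [("a", 1.0)],
    [("l", 0.7), ("1", 0.3)], [(" ", 1.0)],
    [("1", 0.8), ("l", 0.2)], [("2", 1.0)], [(".", 1.0)],
    [("5", 0.9), ("S", 0.1)], [("0", 0.7), ("O", 0.3)],
]

# Toy tagger (assumed): tag name, pattern, and score that the tag is correct.
TAG_RULES = [
    ("LABEL", re.compile(r"total", re.IGNORECASE), 0.9),
    ("AMOUNT", re.compile(r"\d+\.\d{2}"), 0.95),
    ("WORD", re.compile(r"\S+"), 0.2),
]

# Toy grammar (assumed): the only accepted tag sequence for a line.
GRAMMAR = {("LABEL", "AMOUNT")}


def tokenizations(ocr_chars):
    """Yield (tokens, ocr_probability), using one version per character."""
    for choice in product(*ocr_chars):
        text = "".join(ch for ch, _ in choice)
        yield text.split(), prod(p for _, p in choice)


def candidate_tags(token):
    """Return every (tag, score) whose pattern matches the whole token."""
    return [(tag, s) for tag, rx, s in TAG_RULES if rx.fullmatch(token)]


def parse(tokens):
    """Assign exactly one tag per token; return the best (tags, score)."""
    best = None
    for combo in product(*(candidate_tags(t) for t in tokens)):
        tags = tuple(tag for tag, _ in combo)
        score = prod(s for _, s in combo)
        if tags not in GRAMMAR:
            score *= 0.01  # assumed penalty for violating the grammar
        if best is None or score > best[1]:
            best = (tags, score)
    return best


def select_best(ocr_chars):
    """Return (aggregate_score, tokens, tags) of the top tokenization."""
    scored = []
    for tokens, ocr_p in tokenizations(ocr_chars):
        tags, tag_score = parse(tokens)
        scored.append((ocr_p * tag_score, tokens, tags))
    return max(scored, key=lambda item: item[0])


if __name__ == "__main__":
    score, tokens, tags = select_best(OCR_CHARS)
    print(tokens, tags, round(score, 4))
```

In this toy input, readings such as "T0tal" or "l2.50" survive OCR with non-trivial probability but match neither the dictionary-style patterns nor the grammar, so their aggregate scores fall well below that of the reading tagged as a label followed by an amount.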
17. A method of interpreting information in a document comprising the steps of:
- receiving an image of a document from a remote source;
- converting said image into multiple sets of blocks of characters, wherein said blocks in said sets have been assigned tags indicating their likely meaning and at least some of said blocks have an associated score representing the probability that the characters in the block are assigned the tag correctly representing the meaning of the characters in the image; and
- selecting one final set from the multitude of sets based on the scores associated with at least some of the blocks and based on information provided as a result of accessing remote information over the Internet.
Dependent claims: 18-28
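The selection step of claim 17 combines per-block scores with externally obtained information. The sketch below shows one way that combination could work, under stated assumptions: the `Block` type, the `KNOWN_VENDORS` set, the `vendor_exists` check, and the 1.5 and 0.5 multipliers are hypothetical. The remote lookup is stubbed with a local set so the example runs offline; a real system would query a remote service instead.

```python
from dataclasses import dataclass
from math import prod
from typing import Callable, List


@dataclass(frozen=True)
class Block:
    text: str
    tag: str
    score: float  # probability that the tag is correct for this text


# Stand-in for information fetched over the Internet, for example a
# business directory. This local set is an assumption for illustration.
KNOWN_VENDORS = {"ACME HARDWARE"}


def vendor_exists(name: str) -> bool:
    """Hypothetical remote check, stubbed locally so the sketch runs."""
    return name.upper() in KNOWN_VENDORS


def set_score(blocks: List[Block], lookup: Callable[[str], bool]) -> float:
    """Combine block scores with an assumed boost or penalty from lookup."""
    score = prod(b.score for b in blocks)
    for b in blocks:
        if b.tag == "VENDOR":
            score *= 1.5 if lookup(b.text) else 0.5
    return score


def select_final_set(candidates, lookup=vendor_exists):
    """Return the candidate set of blocks with the highest combined score."""
    return max(candidates, key=lambda blocks: set_score(blocks, lookup))


# Two competing interpretations of the same image (made-up data).
CANDIDATES = [
    [Block("ACNE HARDWARE", "VENDOR", 0.55), Block("12.50", "AMOUNT", 0.9)],
    [Block("ACME HARDWARE", "VENDOR", 0.45), Block("12.50", "AMOUNT", 0.9)],
]

if __name__ == "__main__":
    final = select_final_set(CANDIDATES)
    print([b.text for b in final])
```

On block scores alone the first candidate would win; the external check is what tips the selection toward the second, which is the kind of correction the claim's reliance on remote information suggests.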
29. A system for interpreting information in a document comprising:
- storage for an image of a document received from a remote source;
- software for converting said image into multiple sets of blocks of characters, wherein said blocks in said sets have tags indicating their meaning and at least some of said blocks have an associated score representing probability that the characters in the block are assigned the tag correctly representing the meaning of the characters in the image; and
- software for selecting one final set from the multitude of sets based on the scores associated with at least some of blocks and based on information provided as a result of accessing remote information over the Internet.
Dependent claims: 30-39
Specification