System and method for capturing and processing business data
First Claim
1. A server device for use in interpreting information in a document, comprising:
a storage component arranged to receive and store an image of a document received from a remote source; and
a processor that includes data and instructions configured to perform actions, including:
representing the image as text that includes a plurality of characters, some of the characters in the plurality having alternative versions with associated confidence probabilities;
generating a set of tokenizations, each tokenization comprising a set of unique tokens that comprise collections of characters, wherein different tokens are defined for different versions of a character, and wherein for characters with different versions a single version is included in a tokenization;
assigning one or more tags to the tokens, the tags indicating a possible meaning of a corresponding token, and at least some of the tags having a score value indicating a probability of accuracy;
parsing each tokenization in the set of tokenizations based on a determined grammar to obtain multiple tokenizations with a single tag being assigned to each token;
assigning each tokenization an aggregate score based at least on compliance with the determined grammar; and
selecting as a final tokenization one tokenization with tags based on the aggregate score from the multiple tokenizations.
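The steps recited in the claim can be illustrated with a minimal sketch. It is not the patented implementation: the sample OCR data, the `tag_token` tagger, the two-tag `GRAMMAR`, and the choice to multiply character confidences by tag scores are all assumptions made for illustration.

```python
from itertools import product
from math import prod

# Hypothetical OCR output: each position lists alternative character
# versions with confidence probabilities (illustrative values only).
ocr_chars = [
    [("1", 0.6), ("l", 0.4)],
    [("0", 0.7), ("O", 0.3)],
    [(" ", 1.0)],
    [("U", 1.0)],
    [("S", 0.8), ("5", 0.2)],
    [("D", 1.0)],
]

def candidate_texts(chars):
    """Yield (text, confidence), choosing a single version per character."""
    for combo in product(*chars):
        text = "".join(ch for ch, _ in combo)
        confidence = prod(p for _, p in combo)
        yield text, confidence

def tag_token(token):
    """Assumed tagger: map a token to {tag: score} for its possible meanings."""
    tags = {}
    if token.isdigit():
        tags["AMOUNT"] = 0.9
    if token in {"USD", "EUR"}:
        tags["CURRENCY"] = 0.95
    return tags or {"UNKNOWN": 0.1}

# Assumed toy grammar: an AMOUNT token followed by a CURRENCY token.
GRAMMAR = ("AMOUNT", "CURRENCY")

def parse(tokens):
    """Return (tags, score) with one tag per token, or None if the grammar fails."""
    if len(tokens) != len(GRAMMAR):
        return None
    score = 1.0
    for token, expected in zip(tokens, GRAMMAR):
        tag_scores = tag_token(token)
        if expected not in tag_scores:
            return None
        score *= tag_scores[expected]
    return list(GRAMMAR), score

def select_final(chars):
    """Pick the grammar-compliant tokenization with the highest aggregate score."""
    best = None
    for text, confidence in candidate_texts(chars):
        tokens = text.split()
        parsed = parse(tokens)
        if parsed is None:
            continue  # non-compliant tokenizations are discarded in this sketch
        tags, tag_score = parsed
        aggregate = confidence * tag_score
        if best is None or aggregate > best[2]:
            best = (tokens, tags, aggregate)
    return best

final = select_final(ocr_chars)
print(final)
```

In this toy input only the reading "10 USD" satisfies the assumed grammar, so it is selected; a real grammar and tagger would typically leave several compliant tokenizations to rank.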
Abstract
A method and a system for interpreting information in a document are provided. The system receives an image of a document from a remote source and converts it into multiple sets of blocks of characters. Tags indicating the likely meaning of the blocks are assigned to them. At least some of the blocks have an associated score representing the probability that the characters in the block correctly represent the characters in the original image. The system selects one set from the multiple sets based on the scores associated with certain blocks determined by accessing remote information over the Internet.
15 Citations
20 Claims
1. A server device for use in interpreting information in a document, comprising:
a storage component arranged to receive and store an image of a document received from a remote source; and
a processor that includes data and instructions configured to perform actions, including:
representing the image as text that includes a plurality of characters, some of the characters in the plurality having alternative versions with associated confidence probabilities;
generating a set of tokenizations, each tokenization comprising a set of unique tokens that comprise collections of characters, wherein different tokens are defined for different versions of a character, and wherein for characters with different versions a single version is included in a tokenization;
assigning one or more tags to the tokens, the tags indicating a possible meaning of a corresponding token, and at least some of the tags having a score value indicating a probability of accuracy;
parsing each tokenization in the set of tokenizations based on a determined grammar to obtain multiple tokenizations with a single tag being assigned to each token;
assigning each tokenization an aggregate score based at least on compliance with the determined grammar; and
selecting as a final tokenization one tokenization with tags based on the aggregate score from the multiple tokenizations.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A computer-readable storage medium that includes data and instructions, wherein the execution of the instructions on a server device provides for interpreting information in a document by enabling actions, comprising:
receiving an image of the document over a network from a remote source;
converting the image into multiple sets of blocks of characters, each block having tags indicating an associated meaning and at least some of the blocks having an associated score representing a probability that the characters in the block correctly represent the image;
parsing each set of blocks based on a predetermined grammar to remove certain tags, leaving a single tag per block; and
selecting a final set from the multiple sets based on the scores associated with at least some of the blocks and based on information provided as a result of accessing remote information over the network.
Dependent claims: 9, 10, 11, 12, 13, 14, 15, 16
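The tag-pruning and selection steps of claim 8 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the sample blocks, the positional `GRAMMAR`, the `lookup_vendor` stub (which stands in for a real network request), and the 0.5 penalty for unconfirmed vendors are all hypothetical.

```python
# Each candidate set is a list of blocks: (characters, OCR score, possible tags).
candidate_sets = [
    [("ACME Corp", 0.70, {"VENDOR", "ADDRESS"}), ("42.00", 0.90, {"TOTAL", "DATE"})],
    [("ACNE Corp", 0.75, {"VENDOR", "ADDRESS"}), ("42.00", 0.90, {"TOTAL", "DATE"})],
]

# Assumed predetermined grammar: the tag allowed at each block position.
GRAMMAR = ["VENDOR", "TOTAL"]

def prune_tags(blocks, grammar):
    """Remove tags the grammar rules out, leaving a single tag per block."""
    if len(blocks) != len(grammar):
        return None
    pruned = []
    for (text, score, tags), allowed in zip(blocks, grammar):
        if allowed not in tags:
            return None  # this set cannot satisfy the grammar
        pruned.append((text, score, allowed))
    return pruned

def lookup_vendor(name):
    """Hypothetical stand-in for remote information, e.g. a business directory."""
    known_vendors = {"ACME Corp"}
    return name in known_vendors

def score_set(pruned, lookup=lookup_vendor):
    """Combine block scores with the result of the (simulated) remote lookup."""
    total = 1.0
    for text, score, tag in pruned:
        total *= score
        if tag == "VENDOR" and not lookup(text):
            total *= 0.5  # assumed penalty when the remote source has no match
    return total

def select_final_set(sets, grammar):
    """Return the grammar-compliant set with the highest combined score."""
    parsed = [p for p in (prune_tags(s, grammar) for s in sets) if p is not None]
    return max(parsed, key=score_set) if parsed else None

final_set = select_final_set(candidate_sets, GRAMMAR)
print(final_set)
```

Here the second set has the higher OCR score for the vendor block, but the simulated remote lookup confirms only "ACME Corp", so the first set is selected.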
17. A system that is configured to interpret information in a document, comprising:
a receiving component configured to receive an image of the document over a network; and
a processor executing instructions on a computer that perform actions, comprising:
converting the image into multiple sets of blocks of characters, each block having tags indicating an associated meaning and at least some of the blocks having an associated score representing a probability that the characters in the block correctly represent the image;
parsing each set of blocks based on a predetermined grammar to remove certain tags, leaving a single tag per block; and
selecting a final set from the multiple sets based on the scores associated with at least some of the blocks, and based on information provided as a result of accessing remote content over the network.
Dependent claims: 18, 19, 20
Specification