System and method for capturing and processing business data

US 7,450,760 B2
Filed: 07/06/2005
Issued: 11/11/2008
Est. Priority Date: 05/18/2005
Status: Expired due to Fees

Abstract

A method and a system for interpreting information in a document are provided, with the system receiving an image of a document from a remote source and converting it into multiple sets of blocks of characters. Tags indicating likely meaning of blocks are assigned to them. At least some of the blocks have an associated score representing the probability that the characters in the block correctly represent the characters in the original image. The system selects one set from multiple sets based on the scores associated to certain blocks determined by accessing remote information over the Internet.

39 Claims

1. A method of interpreting information in a document comprising:
- receiving an image of a document from a remote source;
- representing the image as text comprising characters, wherein at least some of the characters have alternative versions with associated confidence probabilities;
- representing the text as tokens, wherein the tokens comprise collections of characters and wherein different tokens are defined for different versions of a character;
- combining tokens into tokenizations, wherein each tokenization is a set of tokens, wherein for characters with different versions only one version is included in a tokenization;
- assigning one or more tags to those tokens, wherein the tags indicate a possible meaning of a corresponding token, and assigning a score value indicating a probability of accuracy of a corresponding tag;
- parsing each of said tokenizations based on a predetermined grammar so as to obtain multiple tokenizations wherein only one tag with associated score is assigned to each token based on both dictionary and grammar matching;
- assigning each tokenization an aggregate score based on compliance with the grammar and scores of all tokens; and
- selecting one tokenization with tags using the aggregated score as a metric of success so as to obtain a final tokenization from the multiple tokenizations with tags.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16):
  - 2. The method of claim 1, wherein information represented in the document comprises a limited number of identifiable semantic structures and wherein a subset of such structure is presented only once and has a unique meaning.
  - 3. The method of claim 1, wherein said steps of assigning tags and scores include the step of:
    - filtering tokens of the tokenizations.
  - 4. The method of claim 3, wherein the step of filtering includes the step of:
    - looking up a token in a dictionary, wherein such dictionary includes both tag and score representing probability of the tag assignment being correct.
  - 5. The method of claim 3, wherein the step of filtering includes the step of:
    - identifying certain tokens as common words expected in the document and assigning tags to neighboring words based on position relative to such common words.
  - 6. The method of claim 1, wherein the parsing step is configured to begin anywhere in the document and includes searching both forward and backward to satisfy conditions of grammar rules.
  - 7. The method of claim 1, further comprising a step of:
    - converting the final tokenization into a data structure wherein the tags specify located fields of the data structure and the tokens provide data for such fields.
  - 8. The method of claim 1, including:
    - receiving a flag indicating which set of grammars and dictionaries shall be employed for processing the document.
  - 9. The method of claim 1, wherein the step of selecting further comprises providing a first portion of a given tokenization from the multiple tokenizations to an external database so as to find a record matching said first portion.
  - 10. The method of claim 9, further comprising the step of:
    - determining whether a match exists between a second portion of the given tokenization with information in said record.
  - 11. The method of claim 10, further comprising the step of:
    - increasing the score of the given tokenization as the final tokenization if said match exists.
  - 12. The method of claim 10, further comprising the step of:
    - correcting the second portion of the given tokenization if a partial match has been found for the second portion.
  - 13. The method of claim 1, wherein the step of selecting one tokenization further comprises the step of:
    - searching Internet websites using a first portion of a given tokenization from the set of multiple tokenizations so as to find pages matching said first portion.
  - 14. The method of claim 13, further comprising the step of:
    - determining whether there is a match between a second portion of the given tokenization with information in said pages.
  - 15. The method of claim 14, further comprising the step of:
    - increasing probability of selecting the given tokenization as the final tokenization based on a number of pages where there is a match.
  - 16. The method of claim 1, further comprising the steps of:
    - determining the measure of likelihood that the final tokenization and the tags are correct; and
    - if the measure is insufficient, providing the document and its image for manual processing.
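
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the OCR alternatives, the contents of `DICTIONARY` and `GRAMMAR`, the tag names, the 0.1 grammar penalty, and the choice to multiply probabilities into an aggregate score are all assumptions made for illustration. For brevity it enumerates every combination of character versions, which grows exponentially with the number of ambiguous characters.

```python
from itertools import product
from math import prod

# Hypothetical OCR output: one list of (character, confidence)
# alternatives per character position.
OCR = [
    [("T", 0.9), ("7", 0.1)], [("e", 1.0)], [("l", 1.0)], [(" ", 1.0)],
    [("5", 1.0)], [("5", 1.0)], [("5", 0.7), ("S", 0.3)],
]

# Hypothetical dictionary: token -> candidate (tag, score) pairs.
DICTIONARY = {"Tel": [("PHONE_LABEL", 0.9), ("WORD", 0.1)]}

# Hypothetical grammar: the tag sequences it accepts.
GRAMMAR = [("PHONE_LABEL", "PHONE_NUMBER")]


def tokenizations(ocr):
    """Yield (tokens, ocr_probability), using one version per character."""
    for choice in product(*ocr):
        text = "".join(ch for ch, _ in choice)
        yield text.split(), prod(p for _, p in choice)


def candidate_tags(token):
    """Return candidate (tag, score) pairs for a token."""
    if token in DICTIONARY:
        return DICTIONARY[token]
    if token.isdigit():
        return [("PHONE_NUMBER", 0.8), ("ZIP", 0.2)]
    return [("UNKNOWN", 0.05)]


def parse(tokens):
    """Keep one tag per token, penalizing sequences the grammar rejects."""
    best = None
    for assignment in product(*(candidate_tags(t) for t in tokens)):
        tags = tuple(tag for tag, _ in assignment)
        grammar_factor = 1.0 if tags in GRAMMAR else 0.1
        score = grammar_factor * prod(s for _, s in assignment)
        if best is None or score > best[1]:
            best = (list(zip(tokens, tags)), score)
    return best


def interpret(ocr):
    """Return the highest-scoring tagged tokenization and its score."""
    scored = []
    for tokens, ocr_prob in tokenizations(ocr):
        tagged, parse_score = parse(tokens)
        scored.append((tagged, ocr_prob * parse_score))
    return max(scored, key=lambda item: item[1])


final_tagged, final_score = interpret(OCR)

# In the spirit of claim 7: tags name the fields, tokens supply the data.
record = {tag: token for token, tag in final_tagged}
```

In this toy input the reading "Tel 555" is selected because it has both the most likely characters and a tag sequence the grammar accepts; readings such as "7el 55S" are penalized on both counts.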

17. A method of interpreting information in a document comprising the steps of:
- receiving an image of a document from a remote source;
- converting said image into multiple sets of blocks of characters, wherein said blocks in said sets have been assigned tags indicating their likely meaning and at least some of said blocks have an associated score representing the probability that the characters in the block are assigned the tag correctly representing the meaning of the characters in the image; and
- selecting one final set from the multitude of sets based on the scores associated with at least some of the blocks and based on information provided as a result of accessing remote information over the Internet.
- Dependent claims (18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28):
  - 18. The method of claim 17, wherein the step of converting said image comprises the step of:
    - converting the image into text comprising characters, wherein at least some of the characters have alternative versions with assigned confidence probability.
  - 19. The method of claim 18, further comprising the step of:
    - forming sets of groups of characters based on alternative versions provided by the converting step and assigning one or more tags to each set.
  - 20. The method of claim 19, further comprising the step of:
    - parsing each set of blocks based on a predetermined grammar so as to remove or reduce the score of certain tags, leaving only one highest scoring tag per block.
  - 21. The method of claim 20, wherein the parsing step is configured to begin anywhere in the document and includes searching both forward and backward to satisfy conditions of grammar rules.
  - 22. The method of claim 17, wherein the step of selecting one final set further comprises the steps of:
    - providing a first one or more blocks of a set from the multiple sets to an external database so as to find a record matching said one or more blocks; and
    - assigning this block a score as a result of such match.
  - 23. The method of claim 22, further comprising the step of:
    - determining whether a match exists between a second one or more blocks of the set with information in said record.
  - 24. The method of claim 23, further comprising the step of:
    - increasing the score of the set as the final set if said match exists.
  - 25. The method of claim 24, further comprising the step of:
    - correcting the second one or more blocks if no exact match has been found for the second one or more blocks.
  - 26. The method of claim 17, wherein the step of selecting one final set further comprises the step of:
    - searching Internet websites using a first one or more blocks of a set from the multiple sets so as to find pages matching said first one or more blocks.
  - 27. The method of claim 26, further comprising the step of:
    - determining whether there is a match between a second one or more blocks of the set with information in said pages.
  - 28. The method of claim 27, further comprising the step of:
    - increasing probability of selecting the set as the final set based on a number of pages where there is a match.
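
The database check described in claims 22 to 25 can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: `DIRECTORY` is an in-memory stand-in for the external database, and the `verify` function, the `COMPANY` and `PHONE` tags, the 1.5 boost factor, and the 0.8 similarity threshold (computed with Python's `difflib.SequenceMatcher`) are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for an external database: company name -> phone.
DIRECTORY = {"Acme Corp": "555-0100"}


def verify(candidate, directory, boost=1.5, min_similarity=0.8):
    """Return a copy of candidate adjusted by the directory lookup."""
    result = {"blocks": dict(candidate["blocks"]), "score": candidate["score"]}
    # First block: look up a record by company name.
    record_phone = directory.get(result["blocks"].get("COMPANY"))
    if record_phone is None:
        return result  # no matching record, leave the candidate unchanged
    # Second block: compare the phone number with the record.
    phone = result["blocks"].get("PHONE", "")
    if phone == record_phone:
        result["score"] *= boost  # exact match raises the score
    elif SequenceMatcher(None, phone, record_phone).ratio() >= min_similarity:
        result["blocks"]["PHONE"] = record_phone  # near miss is corrected
    return result


# Hypothetical candidate sets of tagged blocks with scores.
candidates = [
    {"blocks": {"COMPANY": "Acme Corp", "PHONE": "555-0100"}, "score": 0.40},
    {"blocks": {"COMPANY": "Acme Corp", "PHONE": "555-O100"}, "score": 0.45},
    {"blocks": {"COMPANY": "Acne Corp", "PHONE": "555-0100"}, "score": 0.42},
]

verified = [verify(c, DIRECTORY) for c in candidates]
best = max(verified, key=lambda c: c["score"])
```

With these values the first candidate overtakes the others after its boost, the second has its misread letter "O" replaced by the digit from the record, and the third is left unchanged because its company name finds no record.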

29. A system for interpreting information in a document comprising:
- storage for an image of a document received from a remote source;
- software for converting said image into multiple sets of blocks of characters, wherein said blocks in said sets have tags indicating their meaning and at least some of said blocks have an associated score representing probability that the characters in the block are assigned the tag correctly representing the meaning of the characters in the image; and
- software for selecting one final set from the multitude of sets based on the scores associated with at least some of blocks and based on information provided as a result of accessing remote information over the Internet.
- Dependent claims (30, 31, 32, 33, 34, 35, 36, 37, 38, 39):
  - 30. The system of claim 29, wherein the software for converting comprises software for forming sets of groups of characters and assigning one or more tags to each set.
  - 31. The system of claim 30, further comprising:
    - software for parsing each set of blocks based on a predetermined grammar so as to remove certain tags, leaving only one tag per block.
  - 32. The system of claim 31, wherein the parsing software is configured to parse anywhere in the document and includes means for searching both forward and backward to satisfy conditions of grammar rules.
  - 33. The system of claim 29, wherein the software for selecting one final set further comprises software for providing a first one or more blocks of a given set from the multiple sets to an external database so as to find a record matching said one or more blocks.
  - 34. The system of claim 33, further comprising:
    - software for determining whether a match exists between a second one or more blocks from the given set with information in said record.
  - 35. The system of claim 34, further comprising:
    - software for increasing probability of selecting the given set as the final set if said match exists.
  - 36. The system of claim 29, wherein the software for selecting one final set further comprises:
    - software for searching Internet websites using a first one or more blocks of a given set from the multiple sets so as to find pages matching said first one or more blocks.
  - 37. The system of claim 36, further comprising:
    - software for determining whether there is a match between a second one or more blocks of the given set with information in said pages.
  - 38. The system of claim 37, further comprising:
    - software for increasing the score of the given set as the final set based on a number of pages where there is a match.
  - 39. The software of claim 29, further comprising:
    - software for (1) determining the measure of likelihood that a final tokenization and the tags are correct and (2) if the measure is insufficient, providing the document and its image for manual processing.
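
Claims 36 to 39 can be sketched in the same hedged way: count the pages that confirm a second block, raise the score accordingly, and route low-confidence results to manual processing. The function names, the 0.1 per-page boost, and the 0.5 threshold are assumptions for illustration, and `pages` is a list of already-fetched page texts rather than the result of any real search API.

```python
def page_match_count(pages, value):
    """Count pages whose text contains the given block value."""
    return sum(1 for page in pages if value in page)


def boosted_score(score, matches, per_page_boost=0.1):
    """Raise a score in proportion to the number of matching pages."""
    return score * (1 + per_page_boost * matches)


def route(score, threshold=0.5):
    """Decide whether a result is accepted or sent for manual processing."""
    return "automatic" if score >= threshold else "manual"


# Hypothetical pages returned by searching for the first block "Acme Corp".
pages = [
    "Acme Corp, call 555-0100",
    "Contact Acme Corp at 555-0100",
    "Acme Corp careers",
]

matches = page_match_count(pages, "555-0100")
confirmed = boosted_score(0.45, matches)
decision = route(confirmed)
unconfirmed_decision = route(boosted_score(0.30, 0))
```

Here two of the three pages contain the phone number, which lifts the score from 0.45 to about 0.54 and above the assumed threshold; a set with a score of 0.30 and no matching pages would be sent for manual processing.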

Current Assignee: Flint Mobile, Inc.
Original Assignee: scanR, Inc.
Inventors: Mutz, Andrew H.; Molnar, Joseph; Ferreira, Paulo; Nieuwland, Dan
Primary Examiner(s): MARIAM, DANIEL G
Application Number: US11/176,592
Publication Number: US 20060262910A1
Time in Patent Office: 1,224 Days
Field of Search: 382/176-177, 382/180-181, 382/209, 382/229, 382/231, 382/100, 382/224, 707/1-10
US Class Current: 382/181
CPC Class Codes:
- G06V 10/95   structured as a network, e....
- G06V 30/10   Character recognition
- G06V 30/274   Syntactic or semantic conte...
- G06V 30/416   Extracting the logical stru...
- H04M 1/2755   by optical scanning
- H04M 2250/52   including functional featur...
