Methods and systems for assessing the quality of automatically generated text

US 8,682,648 B2
Filed: 04/16/2013
Issued: 03/25/2014
Est. Priority Date: 02/05/2009
Status: Active Grant

Abstract

A set of ordered characters is received in association with information specifying the locations of the characters within the image of the document. Language-conditional character probabilities for each character are determined based on a set of language models and the ordering of the characters. Neighbor characters associated with a target character are identified based on the locations of the characters. Language-conditional character probabilities associated with the neighbor characters and language-conditional character probabilities associated with the target character are combined to generate a local language-conditional likelihood associated with the target character, the local language-conditional likelihood representing a concordance of the target character to a language model.
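
The combination step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the probabilities are combined, so averaging log-probabilities over the target and its neighbors is an assumption made here for illustration. The names `Char`, `neighbors`, `local_likelihood`, and the `radius` parameter are hypothetical and do not come from the patent.

```python
import math
from dataclasses import dataclass


@dataclass
class Char:
    symbol: str
    x: float          # horizontal location in the document image
    y: float          # vertical location in the document image
    log_probs: dict   # language -> log-probability of this character under that language model


def neighbors(chars, target, radius):
    """Return characters located within `radius` of the target (assumed neighbor rule)."""
    return [
        c for c in chars
        if c is not target
        and math.hypot(c.x - target.x, c.y - target.y) <= radius
    ]


def local_likelihood(chars, target, radius):
    """Combine the target's and its neighbors' language-conditional
    log-probabilities into one local score per language.

    Averaging is an assumption; the source only says the values are combined.
    """
    group = [target] + neighbors(chars, target, radius)
    return {
        lang: sum(c.log_probs[lang] for c in group) / len(group)
        for lang in target.log_probs
    }
```

A higher local score for a language suggests that the text around the target character agrees better with that language's model.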

15 Citations

21 Claims

1. A computer-implemented method of assessing quality of computer-generated text in a document image, the method comprising:
- identifying text quality scores associated with a plurality of digital text characters generated from the document image, a text quality score for a target character describing a likelihood of the target character being at a location of the target character within the document image;
  
  identifying a plurality of subsets of characters having associated text quality scores that differ from text quality scores associated with neighboring characters in the document image by more than a threshold value;
  
  segmenting the document image into a plurality of segments associated with different text quality scores responsive to the identified plurality of subsets of characters;
  
  determining a representative text quality score for each segment of the document image; and
  
  storing the representative text quality scores in association with the segments.
  - 2. The method of claim 1, wherein segmenting the document image in a plurality of segments comprises:
    - forming boundaries between a first subset of characters neighboring a second subset of characters in response to a difference between a first text quality score associated with the first subset of characters and a second text quality score associated with the second subset of characters exceeding the threshold value.
  - 3. The method of claim 1, wherein identifying a subset of characters having associated text quality scores that differ from text quality scores associated with neighboring characters in the image comprises:
    - identifying neighboring characters in the document image having similar text quality scores; and
      
      adding the neighboring characters in the document image having similar text quality scores to the subset of characters.
  - 4. The method of claim 1, wherein identifying a subset of characters having associated text quality scores that differ from text quality scores associated with neighboring characters in the image comprises:
    - combining the text quality scores of the characters in the identified subset to produce a combined text quality score representing the identified subset of characters; and
      
      wherein the segmenting segments the document image responsive to the combined text quality score.
  - 5. The method of claim 1, wherein the threshold value defines a maximum difference of text quality scores for characters to be included in a given subset of characters.
  - 6. The method of claim 5, further comprising adding a neighboring character to the given subset of characters in response to determining that a difference between an average value of text quality scores associated with the characters included in the given subset and a text quality score associated with the neighboring character subset is less than the threshold value.
  - 7. The method of claim 1, wherein storing the representative text quality scores further comprises:
    - storing coordinates specifying a location in the document image of the segment associated with each representative text quality score.
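
The steps recited in claims 1 through 7 can be sketched roughly as follows. This is a simplified illustration, not the patented implementation: it treats neighbors as adjacent characters in reading order, uses the mean as the representative score, and merges bounding boxes to obtain segment coordinates. All three choices are assumptions, and the names `ScoredChar`, `segment`, and `summarize` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ScoredChar:
    symbol: str
    score: float   # text quality score for this character
    box: tuple     # (left, top, right, bottom) in image coordinates


def segment(chars, threshold):
    """Group characters in reading order into subsets of similar score.

    A character joins the current subset while the difference between the
    subset's average score and the character's score stays below the
    threshold (compare claim 6); otherwise a boundary is formed and a new
    subset starts (compare claim 2).
    """
    segments, current = [], []
    for ch in chars:
        if current and abs(mean(c.score for c in current) - ch.score) >= threshold:
            segments.append(current)
            current = []
        current.append(ch)
    if current:
        segments.append(current)
    return segments


def summarize(seg):
    """Return a representative score and enclosing coordinates for a segment."""
    return {
        "text": "".join(c.symbol for c in seg),
        "score": mean(c.score for c in seg),  # assumed representative score
        "box": (
            min(c.box[0] for c in seg),
            min(c.box[1] for c in seg),
            max(c.box[2] for c in seg),
            max(c.box[3] for c in seg),
        ),
    }
```

A two-dimensional variant would need a neighbor relation based on character locations rather than list order; the one-dimensional version is used here only to keep the sketch short.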

8. A computer system for segmenting a document image, comprising:
- a processor for executing computer program instructions;
  
  a non-transitory computer-readable storage medium storing executable computer program instructions, the computer-readable storage medium comprising:
  
  a score module configured to identify text quality scores associated with a plurality of digital text characters generated from the document image, a text quality score for a target character describing a likelihood of the target character being at a location of the target character within the document image;
  
  a score analysis module configured to:
  
  identify a plurality of subsets of characters having associated text quality scores that differ from text quality scores associated with neighboring characters in the document image by more than a threshold value;
  
  segment the document image into a plurality of segments associated with different text quality scores responsive to the identified plurality of subsets of characters; and
  
  determine a representative text quality score for each segment of the document image; and
  
  a text quality database configured to store the representative text quality scores in association with the segments.
  - 9. The system of claim 8, wherein segmenting the document image in a plurality of segments comprises:
    - forming boundaries between a first subset of characters neighboring a second subset of characters in response to a difference between a first text quality score associated with the first subset of characters and a second text quality score associated with the second subset of characters exceeding the threshold value.
  - 10. The system of claim 8, wherein identifying a subset of characters having associated text quality scores that differ from text quality scores associated with neighboring characters in the image comprises:
    - identifying neighboring characters in the document image having similar text quality scores; and
      
      adding the neighboring characters in the document image having similar text quality scores to the subset of characters.
  - 11. The system of claim 8, wherein identifying a subset of characters having associated text quality scores that differ from text quality scores associated with neighboring characters in the image comprises:
    - combining the text quality scores of the characters in the identified subset to produce a combined text quality score representing the identified subset of characters; and
      
      wherein the segmenting segments the document image responsive to the combined text quality score.
  - 12. The system of claim 8, wherein the threshold value defines a maximum difference of text quality scores for characters to be included in a given subset of characters.
  - 13. The system of claim 12, further comprising adding a neighboring character to the given subset of characters in response to determining that a difference between an average value of text quality scores associated with the characters included in the given subset and a text quality score associated with the neighboring character subset is less than the threshold value.
  - 14. The system of claim 8, wherein storing the representative text quality scores further comprises:
    - storing coordinates specifying a location in the document image of the segment associated with each representative text quality score.
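
The text quality database of claim 8, together with the coordinate storage of claim 14, could be approximated with a small relational table. The sketch below uses Python's standard `sqlite3` module. The table name `segment_quality`, its columns, and the helper functions are hypothetical choices for illustration; the claims do not specify a schema or a storage engine.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store for per-segment representative scores."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS segment_quality (
               doc_id    TEXT NOT NULL,
               left_px   INTEGER,
               top_px    INTEGER,
               right_px  INTEGER,
               bottom_px INTEGER,
               score     REAL NOT NULL
           )"""
    )
    return conn


def store_segments(conn, doc_id, segments):
    """Store (box, score) pairs, where box is (left, top, right, bottom)."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO segment_quality VALUES (?, ?, ?, ?, ?, ?)",
            [(doc_id, *box, score) for box, score in segments],
        )


def low_quality_segments(conn, doc_id, cutoff):
    """Return segments of a document whose score falls below `cutoff`."""
    rows = conn.execute(
        "SELECT left_px, top_px, right_px, bottom_px, score "
        "FROM segment_quality WHERE doc_id = ? AND score < ? ORDER BY score",
        (doc_id, cutoff),
    )
    return rows.fetchall()
```

Keeping the coordinates next to each score lets a later process locate the poorly recognized regions of a page without re-running the segmentation.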

15. A non-transitory computer-readable storage medium storing executable computer program instructions for performing steps comprising:
- identifying text quality scores associated with a plurality of digital text characters generated from the document image, a text quality score for a target character describing a likelihood of the target character being at a location of the target character within the document image;
  
  identifying a plurality of subsets of characters having associated text quality scores that differ from text quality scores associated with neighboring characters in the document image by more than a threshold value;
  
  segmenting the document image into a plurality of segments associated with different text quality scores responsive to the identified plurality of subsets of characters;
  
  determining a representative text quality score for each segment of the document image; and
  
  storing the representative text quality scores in association with the segments.
  - 16. The storage medium of claim 15, wherein segmenting the document image in a plurality of segments comprises:
    - forming boundaries between a first subset of characters neighboring a second subset of characters in response to a difference between a first text quality score associated with the first subset of characters and a second text quality score associated with the second subset of characters exceeding the threshold value.
  - 17. The storage medium of claim 15, wherein identifying a subset of characters having associated text quality scores that differ from text quality scores associated with neighboring characters in the image comprises:
    - identifying neighboring characters in the document image having similar text quality scores; and
      
      adding the neighboring characters in the document image having similar text quality scores to the subset of characters.
  - 18. The storage medium of claim 15, wherein identifying a subset of characters having associated text quality scores that differ from text quality scores associated with neighboring characters in the image comprises:
    - combining the text quality scores of the characters in the identified subset to produce a combined text quality score representing the identified subset of characters; and
      
      wherein the segmenting segments the document image responsive to the combined text quality score.
  - 19. The storage medium of claim 15, wherein the threshold value defines a maximum difference of text quality scores for characters to be included in a given subset of characters.
  - 20. The storage medium of claim 19, further comprising adding a neighboring character to the given subset of characters in response to determining that a difference between an average value of text quality scores associated with the characters included in the given subset and a text quality score associated with the neighboring character subset is less than the threshold value.
  - 21. The storage medium of claim 15, wherein storing the representative text quality scores further comprises:
    - storing coordinates specifying a location in the document image of the segment associated with each representative text quality score.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Popat, Ashok
Primary Examiner(s): AZAD, ABUL K
Application Number: US13/864,180
Publication Number: US 20130259378A1
Time in Patent Office: 343 Days
Field of Search: 704/1-10, 382/291, 382/292
US Class Current: 704/9
CPC Class Codes:
- G06F 40/253   Grammatical analysis; Style...
- G06F 40/51   Translation evaluation
- G06V 30/153   using recognition of charac...
- G06V 30/246   using linguistic properties...
- G06V 30/268   Lexical context
