Display of document image optimized for reading

US 8,254,681 B1
Filed: 06/24/2009
Issued: 08/28/2012
Est. Priority Date: 02/05/2009
Status: Active Grant

Abstract

Semantically meaningful segments of an image of a document, such as tables of contents, page numbers, footnotes, and the like, are identified. These segments form a model of the document image, which may then be rendered differently for different client devices. The rendering may be based on a display parameter provided by a client device, such as a display resolution of the client device, or a requested display format.
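
The device-dependent rendering described in the abstract can be pictured with a minimal sketch. The `Segment` class, the `kind` labels, the `render` function, and the 480-pixel threshold are all hypothetical names and values chosen for illustration; the patent does not prescribe this policy.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    kind: str  # hypothetical semantic label: "paragraph", "page_number", "footnote"
    text: str


def render(segments, display_width_px, narrow_threshold_px=480):
    """Return the lines to show on a client with the given display width.

    Assumed policy: narrow displays reflow the text, so printed page
    numbers are dropped; footnotes are marked inline on every display.
    """
    narrow = display_width_px < narrow_threshold_px
    lines = []
    for seg in segments:
        if narrow and seg.kind == "page_number":
            continue  # the printed page number no longer matches a reflowed layout
        if seg.kind == "footnote":
            lines.append(f"[note] {seg.text}")
        else:
            lines.append(seg.text)
    return lines
```

Because the same list of segments feeds every call, the document model is built once and only the rendering step varies with the display parameter.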

22 Claims

1. A computer-implemented method for identifying semantically meaningful segments of an image of a document, the image of the document having a plurality of pages, the method comprising:
- applying an optical character recognition algorithm to the image of the document to obtain a plurality of document segments, each document segment corresponding to a region of the image of the document and having associated recognized text;
  
  calculating a text quality score for at least one document segment of the plurality of document segments, the text quality score characterizing a quality of optical character recognition for the document segment;
  
  identifying a semantic component of the document comprised of one or more of the document segments;
  
  creating a document representation comprising the document segments and identified semantic components;
  
  storing the document representation in association with an identifier of the image of the document; and
  
  for the document segment, determining, based at least in part on the text quality score associated with the document segment, whether to display the document segment as the associated recognized text of the document segment, or as a portion of the image of the document corresponding to the document segment.
  - 2. The computer-implemented method of claim 1, further comprising modifying a background of a textual portion to have a uniform shade of color before applying the optical character recognition algorithm to the image of the document.
  - 3. The computer-implemented method of claim 1, wherein identifying the semantic component comprises identifying a type of a page of the image of the document, the identifying comprising evaluating the page according to a classifier for each of a plurality of possible page types, the classifier for a possible page type being derived from performing machine learning on a set of pages of that type.
  - 4. The computer-implemented method of claim 1, wherein identifying the semantic component comprises:
    - scoring a page of the image of the document based at least in part on degrees to which portions of the page occur within other pages of the image of the document; and
      
      responsive to the score being above a threshold score, determining that the page represents a table of contents.
  - 5. The computer-implemented method of claim 1, wherein the semantic component identified is a page number, the identifying comprising:
    - for a location of a page containing the identified semantic component:
      
      determining a number of words parsing as valid page numbers and located at that location;
      
      determining an expected probability of the location containing page numbers; and
      
      identifying the location as containing a page number based at least in part on the determined number of words and the determined expected probability.
  - 6. The computer-implemented method of claim 1, further comprising:
    - merging a first and a second segment representing paragraphs, responsive at least in part to the first segment being the last paragraph on a page and the second segment being the first paragraph on a next page after the page, and responsive at least in part to indentation and punctuation characteristics of a beginning of the second segment.
  - 7. The computer-implemented method of claim 1, further comprising providing to a client device a representation of the document, wherein a semantic component is displayed differently from its visual representation in the image of the document, responsive at least in part to one or more of a display resolution of the client device, a requested display format, and a designation of whether to paginate the document with page breaks of sizes corresponding to a size of a display of the client device.
  - 8. The computer-implemented method of claim 1, further comprising:
    - distinguishing textual and picture portions of the image of the document; and
      
      excluding the picture portions from processing by the optical character recognition algorithm.
  - 21. The computer-implemented method of claim 1, wherein calculating the text quality score for the document segment comprises:
    - storing a language model for a language of the document, the language model specifying probabilities of different characters appearing in the language of the document;
      
      using the language model to determine a measure of how well text characters in the document match the language of the document; and
      
      calculating the text quality score using the measure.
  - 22. The computer-implemented method of claim 4, further comprising:
    - for a first portion of the table of contents page that occurs within other pages of the document, inserting a link in the document at the portion, the link linking to a second portion of the document containing text that matches text of the first portion.
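
The text quality score of claim 21 and the text-or-image decision at the end of claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the character-unigram model, the mean log-probability measure, the `floor` probability for unseen characters, and the function names are all hypothetical choices.

```python
import math
from collections import Counter


def build_char_model(sample_text):
    """Estimate per-character probabilities from a sample of the language."""
    counts = Counter(sample_text.lower())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def text_quality_score(text, model, floor=1e-6):
    """Mean log-probability per character; values closer to 0 match better.

    Characters missing from the model get the small `floor` probability,
    so OCR garbage pulls the score down sharply.
    """
    if not text:
        return float("-inf")
    return sum(math.log(model.get(ch, floor)) for ch in text.lower()) / len(text)


def choose_display(text, model, threshold):
    """Show recognized text when the score clears the threshold, else the image region."""
    return "text" if text_quality_score(text, model) >= threshold else "image"
```

A segment whose recognized text is full of symbols unseen in the language sample scores far below a clean segment, so it falls back to the corresponding portion of the page image. The threshold would need tuning against real OCR output.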

9. A non-transitory computer-readable storage medium having executable computer program instructions embodied therein for identifying semantically meaningful segments of an image of a document, the image of the document having a plurality of pages, actions of the computer program instructions comprising:
- applying an optical character recognition algorithm to the image of the document to obtain a plurality of document segments, each document segment corresponding to a region of the image of the document and having associated recognized text;
  
  calculating a text quality score for at least one document segment of the plurality of document segments, the text quality score characterizing quality of optical character recognition for the document segment;
  
  identifying a semantic component of the document comprised of one or more of the document segments;
  
  creating a document representation comprising the document segments and identified semantic components;
  
  storing the document representation in association with an identifier of the image of the document; and
  
  for the document segment, determining, based at least in part on the text quality score associated with the document segment, whether to display the document segment as the associated recognized text of the document segment, or as a portion of the image of the document corresponding to the document segment.
  - 10. The computer-readable storage medium of claim 9, the actions of the computer program instructions further comprising modifying a background of a textual portion to have a uniform shade of color before applying the optical character recognition algorithm to the image of the document.
  - 11. The computer-readable storage medium of claim 9, wherein identifying the semantic component comprises identifying a type of a page of the image of the document, the identifying comprising evaluating the page according to a classifier for each of a plurality of possible page types, the classifier for a possible page type being derived from performing machine learning on a set of pages of that type.
  - 12. The computer-readable storage medium of claim 9, wherein identifying the semantic component comprises:
    - scoring a page of the image of the document based at least in part on degrees to which portions of the page occur within other pages of the image of the document; and
      
      responsive to the score being above a threshold score, determining that the page represents a table of contents.
  - 13. The computer-readable storage medium of claim 9, wherein the semantic component identified is a page number, the identifying comprising:
    - for a location of a page containing the identified semantic component:
      
      determining a number of words parsing as valid page numbers and located at that location;
      
      determining an expected probability of the location containing page numbers; and
      
      identifying the location as containing a page number based at least in part on the determined number of words and the determined expected probability.
  - 14. The computer-readable storage medium of claim 9, the actions of the computer program instructions further comprising:
    - merging a first and a second segment representing paragraphs, responsive at least in part to the first segment being the last paragraph on a page and the second segment being the first paragraph on a next page after the page, and responsive at least in part to indentation and punctuation characteristics of a beginning of the second segment.
  - 15. The computer-readable storage medium of claim 9, the actions of the computer program instructions further comprising providing to a client device a representation of the document, wherein a semantic component is displayed differently from its visual representation in the image of the document, responsive at least in part to one or more of a display resolution of the client device, a requested display format, and a designation of whether to paginate the document with page breaks of sizes corresponding to a size of a display of the client device.
  - 16. The computer-readable storage medium of claim 9, the actions of the computer program instructions further comprising:
    - distinguishing textual and picture portions of the image of the document; and
      
      excluding the picture portions from processing by the optical character recognition algorithm.
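
The page-number identification of claims 5 and 13 combines two signals for a location: how many words found there parse as valid page numbers, and how likely that location is to hold page numbers at all. A minimal sketch follows. Multiplying the prior by the hit fraction, the 0.3 threshold, and the function names are assumptions made for illustration; the claims only require that both quantities contribute.

```python
import re

_ARABIC = re.compile(r"^\d{1,4}$")
_ROMAN = re.compile(
    r"^(?=[ivxlcdm])m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$",
    re.IGNORECASE,
)


def parses_as_page_number(word):
    """True for short Arabic numerals or Roman numerals, ignoring decoration."""
    w = word.strip(".-[]() ")
    return bool(_ARABIC.match(w) or _ROMAN.match(w))


def page_number_location_score(words_at_location, prior):
    """Score one location across pages.

    words_at_location holds the word found at that location on each page
    (None where the page has nothing there); prior is the expected
    probability that this location contains page numbers.
    """
    if not words_at_location:
        return 0.0
    hits = sum(1 for w in words_at_location if w and parses_as_page_number(w))
    return prior * hits / len(words_at_location)


def is_page_number_location(words_at_location, prior, threshold=0.3):
    return page_number_location_score(words_at_location, prior) >= threshold
```

A footer position where most pages carry "1", "2", "iii" and so on scores high, while a position in the body that occasionally contains a year or a word that happens to be a valid Roman numeral stays low because of its small prior.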

17. A computer system for providing a client device with a version of an image of a digital document formatted for the client device, the system comprising:
- a computer processor; and
  
  a computer-readable storage medium having executable computer program instructions embodied therein that when executed by the computer processor perform actions comprising:
  
  applying an optical character recognition algorithm to the image of the document to obtain a plurality of document segments, each document segment corresponding to a region of the image of the document and having associated recognized text;
  
  calculating a text quality score for at least one document segment of the plurality of document segments, the text quality score characterizing quality of optical character recognition for the document segment;
  
  identifying a semantic component of the document comprised of one or more of the document segments;
  
  creating a document representation comprising the document segments and identified semantic components;
  
  storing the document representation in association with an identifier of the image of the document; and
  
  for the document segment, determining, based at least in part on the text quality score associated with the document segment, whether to display the document segment as the associated recognized text of the document segment, or as a portion of the image of the document corresponding to the document segment.
  - 18. The computer system of claim 17, the actions of the computer program further comprising modifying a background of a textual portion to have a uniform shade of color before applying the optical character recognition algorithm to the image of the document.
  - 19. The computer system of claim 17, wherein identifying the semantic component comprises:
    - scoring a page of the image of the document based at least in part on degrees to which portions of the page occur within other pages of the image of the document; and
      
      responsive to the score being above a threshold score, determining that the page represents a table of contents.
  - 20. The computer system of claim 17, the actions of the computer program further comprising providing to a client device a representation of the document, wherein a semantic component is displayed differently from its visual representation in the image of the document, responsive at least in part to one or more of a display resolution of the client device, a requested display format, and a designation of whether to paginate the document with page breaks of sizes corresponding to a size of a display of the client device.
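
The table-of-contents test of claims 4, 12, and 19 scores a page by how much of it recurs on other pages and compares the score with a threshold. One way to picture it is the sketch below, which assumes each page is a list of text lines. Treating each line as a "portion", stripping dot leaders and trailing page numbers, matching by substring, and the 0.6 threshold are all illustrative assumptions rather than details taken from the patent.

```python
import re


def _normalize(line):
    """Drop dot leaders and a trailing page number, collapse spaces, lowercase."""
    line = re.sub(r"[.\s]*\d+\s*$", "", line)
    return re.sub(r"\s+", " ", line).strip().lower()


def toc_score(page_lines, other_pages):
    """Fraction of the page's lines whose text also occurs on some other page."""
    entries = [e for e in (_normalize(l) for l in page_lines) if e]
    if not entries:
        return 0.0
    haystacks = [re.sub(r"\s+", " ", " ".join(p)).lower() for p in other_pages]
    found = sum(1 for e in entries if any(e in h for h in haystacks))
    return found / len(entries)


def is_table_of_contents(page_lines, other_pages, threshold=0.6):
    """The claims speak of a score above a threshold, hence the strict comparison."""
    return toc_score(page_lines, other_pages) > threshold
```

The matches found while scoring also identify where each entry's text occurs in the body, which is the information claim 22 relies on when inserting links from table-of-contents entries to the matching portions of the document.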

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Poncin, Guillaume; Ratnakar, Viresh
Primary Examiner(s): Strege, John
Application Number: US12/491,152
Time in Patent Office: 1,161 Days
Field of Search: 382/173, 382/180, 382/181, 382/182, 382/229
US Class Current: 382/180
CPC Class Codes:
G06F 40/20   Natural language analysis s...
G06V 30/10   Character recognition
G06V 30/19173   Classification techniques
G06V 30/412   Layout analysis of document...
G06V 30/413   Classification of content, ...
