Display of document image optimized for reading
Abstract
Semantically meaningful segments of an image of a document, such as tables of contents, page numbers, footnotes, and the like, are identified. These segments form a model of the document image, which may then be rendered differently for different client devices. The rendering may be based on a display parameter provided by a client device, such as a display resolution of the client device, or a requested display format.
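As a rough illustration of rendering driven by a client-supplied display parameter, the sketch below picks a rendering mode from a display width or a requested format. The function name `choose_rendering`, the mode names, and the 600-pixel cutoff are all assumptions made for this example; the abstract does not specify how the choice is made.

```python
from typing import Optional

# Hypothetical cutoff below which a display is treated as "narrow".
NARROW_WIDTH_PX = 600


def choose_rendering(display_width_px: Optional[int] = None,
                     requested_format: Optional[str] = None) -> str:
    """Pick a rendering mode from client-supplied display parameters.

    Illustrative only: the mode names and the priority order are assumptions.
    """
    # An explicitly requested display format takes priority in this sketch.
    if requested_format is not None:
        return requested_format
    # Narrow displays get reflowed text; everything else gets the page image.
    if display_width_px is not None and display_width_px < NARROW_WIDTH_PX:
        return "reflowed_text"
    return "page_image"
```

In this sketch a phone-sized client reporting a 320-pixel width would receive reflowed text, while a desktop client would receive the original page image unless it asked for another format.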
43 Citations
22 Claims
1. A computer-implemented method for identifying semantically meaningful segments of an image of a document, the image of the document having a plurality of pages, the method comprising:
applying an optical character recognition algorithm to the image of the document to obtain a plurality of document segments, each document segment corresponding to a region of the image of the document and having associated recognized text;
calculating a text quality score for at least one document segment of the plurality of document segments, the text quality score characterizing a quality of optical character recognition for the document segment;
identifying a semantic component of the document comprised of one or more of the document segments;
creating a document representation comprising the document segments and identified semantic components;
storing the document representation in association with an identifier of the image of the document; and
for the document segment, determining, based at least in part on the text quality score associated with the document segment, whether to display the document segment as the associated recognized text of the document segment, or as a portion of the image of the document corresponding to the document segment.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 21, 22
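The data flow recited in claim 1 can be sketched as follows. This is a minimal illustration and not the patented implementation: the class names, the 0.0 to 1.0 score scale, and the fixed threshold of 0.8 are assumptions. The claim only requires that the display decision be based at least in part on the text quality score.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DocumentSegment:
    page: int
    region: Tuple[int, int, int, int]  # (left, top, right, bottom) in image pixels
    recognized_text: str
    text_quality: float  # assumed scale: 0.0 (worst OCR) to 1.0 (best OCR)


@dataclass
class SemanticComponent:
    kind: str                   # e.g. "table_of_contents", "footnote"
    segment_indices: List[int]  # segments that make up this component


@dataclass
class DocumentRepresentation:
    image_id: str               # identifier of the document image
    segments: List[DocumentSegment]
    components: List[SemanticComponent]


def display_mode(segment: DocumentSegment, threshold: float = 0.8) -> str:
    """Return "text" when OCR quality is high enough, otherwise "image".

    The threshold rule is a hypothetical policy chosen for this sketch.
    """
    return "text" if segment.text_quality >= threshold else "image"


def render_plan(doc: DocumentRepresentation, threshold: float = 0.8):
    """For each segment, emit either its recognized text or its image region."""
    plan = []
    for seg in doc.segments:
        mode = display_mode(seg, threshold)
        payload = seg.recognized_text if mode == "text" else seg.region
        plan.append((mode, payload))
    return plan
```

With this policy, a cleanly recognized paragraph is shown as text, while a segment with a poor score (for example a smudged footnote) falls back to the corresponding crop of the page image.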
9. A non-transitory computer-readable storage medium having executable computer program instructions embodied therein for identifying semantically meaningful segments of an image of a document, the image of the document having a plurality of pages, actions of the computer program instructions comprising:
applying an optical character recognition algorithm to the image of the document to obtain a plurality of document segments, each document segment corresponding to a region of the image of the document and having associated recognized text;
calculating a text quality score for at least one document segment of the plurality of document segments, the text quality score characterizing quality of optical character recognition for the document segment;
identifying a semantic component of the document comprised of one or more of the document segments;
creating a document representation comprising the document segments and identified semantic components;
storing the document representation in association with an identifier of the image of the document; and
for the document segment, determining, based at least in part on the text quality score associated with the document segment, whether to display the document segment as the associated recognized text of the document segment, or as a portion of the image of the document corresponding to the document segment.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A computer system for providing a client device with a version of an image of a digital document formatted for the client device, the system comprising:
a computer processor; and
a computer-readable storage medium having executable computer program instructions embodied therein that when executed by the computer processor perform actions comprising;
applying an optical character recognition algorithm to the image of the document to obtain a plurality of document segments, each document segment corresponding to a region of the image of the document and having associated recognized text;
calculating a text quality score for at least one document segment of the plurality of document segments, the text quality score characterizing quality of optical character recognition for the document segment;
identifying a semantic component of the document comprised of one or more of the document segments;
creating a document representation comprising the document segments and identified semantic components;
storing the document representation in association with an identifier of the image of the document; and
for the document segment, determining, based at least in part on the text quality score associated with the document segment, whether to display the document segment as the associated recognized text of the document segment, or as a portion of the image of the document corresponding to the document segment.
Dependent claims: 18, 19, 20
Specification