Reordering text from unstructured sources to intended reading flow
First Claim
1. An information handling system comprising:
one or more processors;
a memory coupled to at least one of the processors; and
a set of computer program instructions stored in the memory and executed by at least one of the processors in order to perform actions of:
identifying a plurality of sections from a sequence of characters included in a Portable Document Format (PDF) source file, wherein each section includes a unique set of coordinate positions;
building a plurality of directional links between the plurality of sections based on a relative position of each sections' coordinate positions in relation to other sections' coordinate positions along an axis; and
repeatedly merging two or more sections to form increasingly larger sections, wherein the merged two or more sections are selected based on the directional links built between the two or more sections, wherein the repeatedly merging further comprises building one or more new directional links between the increasingly larger sections and one or more remaining sections selected from the plurality of sections, and wherein the repeatedly merging continues until the plurality of sections are exhausted and consolidated into a final larger section, wherein the final larger section is arranged in an intended reading order.
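The link-building step can be illustrated with a minimal sketch. The claim does not say how a link is chosen, so everything here is an assumption made for illustration: the `Section` class, its bounding-box fields, the `build_links` function, the convention that y grows downward, and the rule that a section links to its nearest follower along the chosen axis when the two overlap on the other axis.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Section:
    text: str
    x0: float  # left edge
    y0: float  # top edge (assumption: y grows downward)
    x1: float  # right edge
    y1: float  # bottom edge

def spans_overlap(a_lo, a_hi, b_lo, b_hi):
    """True when two 1-D intervals share more than a single point."""
    return min(a_hi, b_hi) > max(a_lo, b_lo)

def build_links(sections, axis="x"):
    """Map each section index to the index of its nearest follower along `axis`."""
    links = {}
    for i, a in enumerate(sections):
        best = None  # (gap, index) of the closest follower found so far
        for j, b in enumerate(sections):
            if i == j:
                continue
            if axis == "x":
                follows = b.x0 >= a.x1 and spans_overlap(a.y0, a.y1, b.y0, b.y1)
                gap = b.x0 - a.x1
            else:
                follows = b.y0 >= a.y1 and spans_overlap(a.x0, a.x1, b.x0, b.x1)
                gap = b.y0 - a.y1
            if follows and (best is None or gap < best[0]):
                best = (gap, j)
        if best is not None:
            links[i] = best[1]
    return links

# Sections stored out of visual order, as can happen in a PDF content stream.
sections = [
    Section("world", 60, 0, 100, 10),
    Section("Hello", 0, 0, 50, 10),
    Section("Next line", 0, 20, 100, 30),
]
print(build_links(sections, "x"))  # {1: 0}
print(build_links(sections, "y"))  # {0: 2, 1: 2}
```

Along the x axis, "Hello" links to "world" even though "world" is stored first; along the y axis, both first-line sections link down to "Next line".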
Abstract
An approach is provided in which a number of sections from a sequence of characters included in a Portable Document Format (PDF) file are identified. Each of the identified sections includes a unique set of coordinate positions. The approach builds links between the sections based on a relative position of each of the sections in relation to the other sections along an axis. The approach repeatedly merges sections based on the links that were built to form increasingly larger sections until a final larger section is generated with the characters appearing in a manner consistent with human reading of the rendered PDF document rather than the placement of the characters found within the original PDF file.
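The merge loop described in the abstract might look like the following sketch. It is not the patented method. The `Box` class, the `merge_all` function, the "smallest gap first" selection rule, and the fallback for unlinked boxes are all hypothetical choices. The sketch also assumes that gaps between words are smaller than gaps between lines, which in turn are smaller than column gutters, and that y grows downward.

```python
from dataclasses import dataclass
from itertools import permutations

@dataclass(frozen=True)
class Box:
    text: str
    x0: float
    y0: float  # assumption: y grows downward
    x1: float
    y1: float

def _overlap(a_lo, a_hi, b_lo, b_hi):
    return min(a_hi, b_hi) > max(a_lo, b_lo)

def _links(boxes):
    """Yield (gap, priority, a, b, joiner) for each directed link a -> b."""
    for a, b in permutations(boxes, 2):
        if b.x0 >= a.x1 and _overlap(a.y0, a.y1, b.y0, b.y1):
            yield (b.x0 - a.x1, 0, a, b, " ")   # b lies to the right of a
        elif b.y0 >= a.y1 and _overlap(a.x0, a.x1, b.x0, b.x1):
            yield (b.y0 - a.y1, 1, a, b, "\n")  # b lies below a

def merge_all(boxes):
    """Merge boxes pairwise until one remains; return its text."""
    boxes = list(boxes)
    while len(boxes) > 1:
        # Links are rebuilt every pass, so each merged box gets new links.
        links = list(_links(boxes))
        if links:
            # Smallest gap wins; horizontal links break ties.
            _, _, a, b, joiner = min(links, key=lambda link: link[:2])
        else:
            # Fallback (assumption): take two boxes in top-down, left-right order.
            a, b = sorted(boxes, key=lambda s: (s.y0, s.x0))[:2]
            joiner = "\n"
        merged = Box(a.text + joiner + b.text,
                     min(a.x0, b.x0), min(a.y0, b.y0),
                     max(a.x1, b.x1), max(a.y1, b.y1))
        boxes = [s for s in boxes if s is not a and s is not b] + [merged]
    return boxes[0].text if boxes else ""

# Two-column layout, listed in scrambled storage order.
page = [
    Box("column", 175, 0, 230, 10),
    Box("again", 0, 15, 40, 25),
    Box("world", 45, 0, 90, 10),
    Box("Second", 120, 0, 170, 10),
    Box("Hello", 0, 0, 40, 10),
]
print(merge_all(page).split())
# ['Hello', 'world', 'again', 'Second', 'column']
```

Words merge into lines first, lines into a column block next, and the two columns last, so the left column is read in full before the right one.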
16 Claims
1. An information handling system, as set out in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A computer program product stored in a computer readable storage medium, comprising computer program code that, when executed by an information handling system, causes the information handling system to perform actions comprising:
identifying a plurality of sections from a sequence of characters included in a Portable Document Format (PDF) source file, wherein each section includes a unique set of coordinate positions;
building a plurality of directional links between the plurality of sections based on a relative position of each sections' coordinate positions in relation to other sections' coordinate positions along an axis; and
repeatedly merging two or more sections to form increasingly larger sections, wherein the merged two or more sections are selected based on the directional links built between the two or more sections, wherein the repeatedly merging further comprises building one or more new directional links between the increasingly larger sections and one or more remaining sections selected from the plurality of sections, and wherein the repeatedly merging continues until the plurality of sections are exhausted and consolidated into a final larger section, wherein the final larger section is arranged in an intended reading order.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
Specification