Apparatus and method for extracting and manipulating the reading order of text to prepare a display document for analysis
Abstract
A method for preparing a display document for analysis includes: extracting character data from the display document; determining a first order associated with processing of the character data and a second order associated with a logical order of the character data; determining whether the first order is different from the second order; and reversing at least a portion of the character data in response to the determination that the first order is different from the second order.
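The flow in the abstract can be sketched in a few lines of Python. This is only an illustrative outline, not the patented implementation: the `Order` enum, the `prepare_for_analysis` function, and the pluggable `detect_logical_order` callback are hypothetical names introduced here, and the sketch reverses the whole string rather than a selected portion.

```python
from enum import Enum
from typing import Callable


class Order(Enum):
    """Hypothetical labels for a character ordering."""
    LEFT_TO_RIGHT = "ltr"
    RIGHT_TO_LEFT = "rtl"


def prepare_for_analysis(
    character_data: str,
    processing_order: Order,
    detect_logical_order: Callable[[str], Order],
) -> str:
    """Reverse the data only when the processing order and logical order differ."""
    # "Second order": the logical (reading) order, supplied by a detector.
    logical_order = detect_logical_order(character_data)
    # "First order": the order in which the characters were processed.
    if logical_order != processing_order:
        return character_data[::-1]
    return character_data
```

The detector is passed in as a callback so that any order-detection strategy can be plugged in without changing the comparison and reversal steps.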
23 Claims
1. A method for preparing a display document for analysis comprising:
- extracting character data from said display document, wherein a language of said character data in said display document is unknown when said character data is extracted;
- determining a first order associated with processing of said character data and a second order associated with a logical order of said character data, including comparing said character data against a set of dictionaries to determine said second order based on a match between said character data and a word listed in a dictionary of said set of dictionaries, each dictionary corresponding to a particular language and listing words of that language, wherein comparing said character data against a set of dictionaries further comprises, if a first comparison of said character data to said dictionaries does not determine a language of said character data, reversing an order of said character data and making a second comparison of said reversed character data against said set of dictionaries;
- determining whether said first order is different from said second order; and
- reversing at least a portion of said character data in response to said determination that said first order is different from said second order.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
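The dictionary comparison with a reversed second attempt, as recited in claim 1, might look roughly like the following. This is a minimal sketch under stated assumptions: the `DICTIONARIES` table holds toy word lists, the function names are hypothetical, and reversal is done per code point, which ignores complications such as combining marks.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical toy dictionaries; a real system would load full word lists.
DICTIONARIES: Dict[str, Set[str]] = {
    "english": {"hello", "world", "document"},
    "hebrew": {"שלום", "עולם"},
}


def match_language(text: str) -> Optional[str]:
    """Return the first language whose dictionary lists a word found in text."""
    words = text.lower().split()
    for language, vocabulary in DICTIONARIES.items():
        if any(word in vocabulary for word in words):
            return language
    return None


def detect_with_reversal(text: str) -> Tuple[Optional[str], bool]:
    """Return (language, needs_reversal) using a first and a reversed comparison."""
    # First comparison: the data as extracted.
    language = match_language(text)
    if language is not None:
        return language, False
    # Second comparison: the same data with its character order reversed.
    language = match_language(text[::-1])
    if language is not None:
        return language, True
    # Neither comparison matched, so the order stays undetermined.
    return None, False


def to_logical_order(text: str) -> str:
    """Reverse the data only if the reversed form matched a dictionary."""
    _, needs_reversal = detect_with_reversal(text)
    return text[::-1] if needs_reversal else text
```

For example, `to_logical_order("dlrow olleh")` yields `"hello world"` here, because only the reversed string matches an entry in the toy English dictionary.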
13. A method for preparing a display document for analysis comprising:
- extracting character data from said display document, wherein a language of said character data in said display document is unknown when said character data is extracted;
- determining a first order associated with processing of said character data and a second order associated with a logical order of said character data;
- determining whether said first order is different from said second order; and
- reversing at least a portion of said character data in response to said determination that said first order is different from said second order;
- wherein determining the second order comprises identifying a punctuation character that is position dependent such that a space character will appear on only one side of the punctuation character, where the side of the punctuation character on which the space character appears depends on said second order; and comparing characters around said punctuation character data against a rule to determine said second order.
Dependent claims: 14
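The punctuation rule in claim 13 can be illustrated with a simple voting heuristic. The sketch below is an assumption-laden example, not the claimed rule itself: the `POSITION_DEPENDENT` set and both function names are hypothetical, and it assumes a script in which these marks attach to the preceding word and are followed by a space when the text is in logical order.

```python
from typing import Optional, Tuple

# Hypothetical set of marks that normally attach to the preceding word.
POSITION_DEPENDENT = {",", ".", ";", ":", "!", "?"}


def punctuation_votes(text: str) -> Tuple[int, int]:
    """Count marks whose single adjacent space is after them vs. before them."""
    forward = backward = 0
    for i, ch in enumerate(text):
        if ch not in POSITION_DEPENDENT:
            continue
        before = text[i - 1] if i > 0 else ""
        after = text[i + 1] if i + 1 < len(text) else ""
        space_before = before == " "
        space_after = after == " "
        if space_after and before and not space_before:
            forward += 1   # "word, word": consistent with logical order
        elif space_before and after and not space_after:
            backward += 1  # "drow ,drow": consistent with reversed order
    return forward, backward


def is_reversed(text: str) -> Optional[bool]:
    """Return True if the text looks reversed, False if not, None if undecided."""
    forward, backward = punctuation_votes(text)
    if forward == backward:
        return None
    return backward > forward
```

Marks at the very start or end of the string, or with spaces on both sides, cast no vote, so short fragments may leave the order undecided.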
15. An apparatus for preparing a display document for analysis comprising a processor implementing:
- an extractor for extracting character data from said display document, wherein the character data comprises image data representing an image of a number of characters without including character codes;
- an order identifier for determining a first order associated with processing of said character data and a second order associated with a logical order of said character data, and for determining whether said first order is different from said second order, wherein determining the second order comprises identifying a punctuation character that is position dependent such that a space character will appear on only one side of the punctuation character, where the side of the punctuation character on which the space character appears depends on said second order; and
- a reverse component for reversing at least a portion of said character data, responsive to said order identifier determining that said first order is different from said second order.
Dependent claims: 16, 17, 18, 19, 20, 21
22. A computer program product for preparing a Portable Document Format (PDF) document for analysis, the computer program product comprising:
- a non-transitory computer usable storage medium having computer usable program code embodied therewith, the computer usable program code comprising;
- non-transitory computer usable program code configured to extract character data from said PDF document, wherein a language of said character data in said PDF document is unknown when said character data is extracted;
- non-transitory computer usable program code configured to determine a first order associated with processing of said character data and a second order associated with a logical order of said character data, including comparing said character data against a set of dictionaries to determine said second order based on a match between said character data and a word listed in a dictionary of said set of dictionaries, each dictionary corresponding to a particular language and listing words of that language, wherein comparing said character data against a set of dictionaries further comprises, if a first comparison of said character data to said dictionaries does not determine a language of said character data, reversing an order of said character data and making a second comparison of said reversed character data against said set of dictionaries;
- non-transitory computer usable program code configured to determine whether said first order is different from said second order; and
- non-transitory computer usable program code configured to reverse said character data in response to a determination that said first order is different from said second order.
23. An apparatus for preparing a display document for analysis comprising a processor implementing:
- an extractor to extract character data from said display document, wherein a language of said character data in said display document is unknown when said character data is extracted;
- an order identifier to determine a first order associated with processing of said character data and a second order associated with a logical order of said character data, including comparing said character data against a set of dictionaries to determine said second order based on a match between said character data and a word listed in a dictionary of said set of dictionaries, each dictionary corresponding to a particular language and listing words of that language, wherein comparing said character data against a set of dictionaries further comprises, if a first comparison of said character data to said dictionaries does not determine a language of said character data, reversing an order of said character data and making a second comparison of said reversed character data against said set of dictionaries; and
- a reverse component for reversing at least a portion of said character data, responsive to said order identifier determining that said first order is different from said second order.
Specification