Systems and methods for processing documents of unknown or unspecified format
First Claim
1. A computer implemented method for extracting meaningful text from a document of unknown or unspecified format, comprising:
reading the document to extract raw encoded text;
analyzing the raw encoded text to identify one or more text chunks; and
for a text chunk of the one or more text chunks:
performing a compression identification analysis to determine whether compression is likely and, if compression is likely, performing a decompression process;
performing an encoding identification process to identify a likely character encoding protocol; and
converting the text chunk using the identified likely character encoding protocol to output the text chunk as readable text, wherein the encoding identification process includes, for each protocol of a plurality of encoding protocols, using a common word selection of words in each protocol to identify matches.
Abstract
A computer implemented method for extracting meaningful text from a document of unknown or unspecified format. In a particular embodiment, the method includes reading the document, thereby to extract raw encoded text, analysing the raw encoded text, thereby to identify one or more text chunks, and for a given chunk, performing compression identification analysis to determine whether compression is likely. The method can further include performing a decompression process, performing an encoding identification process thereby to identify a likely character encoding protocol, and converting the chunk using the identified likely character encoding protocol, thereby to output the chunk as readable text.
22 Claims
1. A computer implemented method for extracting meaningful text from a document of unknown or unspecified format, comprising:
reading the document to extract raw encoded text;
analyzing the raw encoded text to identify one or more text chunks; and
for a text chunk of the one or more text chunks:
performing a compression identification analysis to determine whether compression is likely and, if compression is likely, performing a decompression process;
performing an encoding identification process to identify a likely character encoding protocol; and
converting the text chunk using the identified likely character encoding protocol to output the text chunk as readable text, wherein the encoding identification process includes, for each protocol of a plurality of encoding protocols, using a common word selection of words in each protocol to identify matches.
Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 18, 19, 20, 21, 22)
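The compression identification and decompression steps of claim 1 can be illustrated with a minimal sketch. The claim does not say how the likelihood of compression is determined, so the header check below (gzip magic number, zlib header checksum) is an assumption chosen for illustration, and the function names `compression_likely` and `maybe_decompress` are hypothetical.

```python
import zlib


def compression_likely(chunk: bytes) -> bool:
    """Heuristic guess (an assumption, not the claimed analysis):
    does the chunk begin like a gzip or zlib stream?"""
    if chunk[:2] == b"\x1f\x8b":  # gzip magic number
        return True
    if len(chunk) >= 2:
        cmf, flg = chunk[0], chunk[1]
        # zlib header: compression method 8 (deflate) in the low nibble,
        # and the two header bytes form a multiple of 31.
        if cmf & 0x0F == 8 and ((cmf << 8) | flg) % 31 == 0:
            return True
    return False


def maybe_decompress(chunk: bytes) -> bytes:
    """Decompress the chunk if compression looks likely; otherwise
    return it unchanged. A failed attempt also returns the input."""
    if not compression_likely(chunk):
        return chunk
    try:
        # wbits=47 (32 + 15) asks zlib to auto-detect a gzip or zlib header.
        return zlib.decompress(chunk, 47)
    except zlib.error:
        return chunk
```

Falling back to the original bytes on failure keeps a false positive from the heuristic from discarding a chunk that was never compressed.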
12. A computer implemented method for extracting meaningful text from a document of unknown or unspecified format, comprising:
reading the document to extract raw encoded text;
analyzing the raw encoded text to identify one or more text chunks; and
for a text chunk of the one or more text chunks:
performing a compression identification analysis to determine whether compression is likely and, if compression is likely, performing a decompression process;
performing an encoding identification process to identify a likely character encoding protocol; and
converting the text chunk using the identified likely character encoding protocol to output the text chunk as readable text, wherein the encoding identification process includes:
setting a first language and character encoding protocol combination;
scoring the first language and character encoding protocol combination based on an identification of words from a common word selection for the first language and character encoding protocol combination;
repeating the scoring for additional language and character encoding protocol combinations; and
identifying a likely language and character encoding protocol combination based on the relative scores.
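The scoring loop recited in claim 12 might look like the following sketch. The word lists, the set of encodings, and the names `COMMON_WORDS`, `ENCODINGS`, `score_combination` and `identify_combination` are all assumptions made for illustration; the claim only requires that each language and encoding combination be scored against a common word selection and that the scores be compared.

```python
# Hypothetical common-word selections; a real system would use larger lists.
COMMON_WORDS = {
    "english": ["the", "and", "of", "to"],
    "german": ["der", "und", "die", "nicht"],
    "russian": ["и", "не", "на", "что"],
}
ENCODINGS = ["utf-8", "utf-16-le", "cp1251", "latin-1"]


def score_combination(chunk: bytes, language: str, encoding: str) -> int:
    """Count occurrences of the language's common words, as they would
    appear in the raw bytes if the chunk used the given encoding."""
    score = 0
    for word in COMMON_WORDS[language]:
        try:
            needle = f" {word} ".encode(encoding)
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the word at all
        score += chunk.count(needle)
    return score


def identify_combination(chunk: bytes):
    """Score every (language, encoding) pair and return the best one,
    or None when nothing matched."""
    scores = {
        (lang, enc): score_combination(chunk, lang, enc)
        for lang in COMMON_WORDS
        for enc in ENCODINGS
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Once a combination is identified, the conversion step reduces to `chunk.decode(encoding)`. Note that plain ASCII text scores identically under several ASCII-compatible encodings, so this sketch resolves such ties arbitrarily by dictionary order.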
13. A computer implemented method for extracting meaningful text from a document of unknown or unspecified format, comprising:
reading the document to extract raw encoded text;
analyzing the raw encoded text to identify one or more text chunks; and
for a text chunk of the one or more text chunks:
performing a compression identification analysis to determine whether compression is likely and, if compression is likely, performing a decompression process;
performing an encoding identification process to identify a likely character encoding protocol; and
converting the text chunk using the identified likely character encoding protocol to output the text chunk as readable text, wherein the encoding identification process includes, for each text chunk:
reading an input portion of the text chunk, the input portion having a first predetermined size;
processing the input portion to generate a set of n-grams;
for a plurality of dictionaries that each contain known n-grams in a respective language and character encoding protocol combination, tallying matches between the generated set of n-grams and the known n-grams to define a score for each of the plurality of dictionaries;
normalizing scores of the plurality of dictionaries; and
identifying a likely language and character encoding protocol combination based on the relative scores.
Dependent Claims (14, 15, 16)
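The n-gram variant in claim 13 can be sketched as follows. Several details are assumptions: byte trigrams as the n-gram type, a 1024-byte input portion, dictionaries built on the fly from tiny sample sentences (a real system would presumably ship precompiled dictionaries), and normalization by dictionary size, which the claim does not specify. All names are hypothetical.

```python
from collections import Counter

INPUT_SIZE = 1024  # assumed "first predetermined size", in bytes
N = 3              # assumed n-gram length (byte trigrams)


def ngrams(data: bytes, n: int = N) -> Counter:
    """Count every run of n consecutive bytes in the data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def build_dictionary(sample: str, encoding: str) -> set:
    """Toy dictionary: the set of n-grams seen in an encoded sample."""
    return set(ngrams(sample.encode(encoding)))


EN = "the quick brown fox and the lazy dog of the north"
RU = "это не то что я хотел сказать на самом деле"
DICTIONARIES = {
    ("english", "utf-8"): build_dictionary(EN, "utf-8"),
    ("english", "utf-16-le"): build_dictionary(EN, "utf-16-le"),
    ("russian", "cp1251"): build_dictionary(RU, "cp1251"),
}


def identify_by_ngrams(chunk: bytes):
    """Return the best (language, encoding) key and all normalized scores."""
    grams = ngrams(chunk[:INPUT_SIZE])
    normalized = {}
    for key, known in DICTIONARIES.items():
        # Tally how many generated n-grams appear in this dictionary.
        raw = sum(count for gram, count in grams.items() if gram in known)
        # Assumed normalization: divide by dictionary size so that a
        # larger dictionary does not win merely by being larger.
        normalized[key] = raw / len(known)
    return max(normalized, key=normalized.get), normalized
```

Working on raw bytes lets a single pass separate encodings as well as languages, since the same sentence yields different byte n-grams under UTF-8 and UTF-16.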
17. A computer implemented method for extracting meaningful text from a document of unknown or unspecified format, comprising:
reading the document to extract raw encoded text;
analyzing the raw encoded text to identify one or more text chunks; and
for a text chunk of the one or more text chunks:
performing a compression identification analysis to determine whether compression is likely and, if compression is likely, performing a decompression process;
performing an encoding identification process to identify a likely character encoding protocol; and
converting the text chunk using the identified likely character encoding protocol to output the text chunk as readable text, wherein the analyzing the raw encoded text to identify the one or more text chunks is based on a recorded set of delimiters/markers derived from a learning method leveraging a set of comparison files generated using an application responsible for generating the document of unknown or unspecified format, the set of comparison files including:
an empty document;
a first document defined by a first paragraph of text;
a second document defined by the first paragraph of text followed immediately by a second paragraph of text; and
a third document defined by the first paragraph of text followed immediately by a third paragraph of text, followed immediately by the second paragraph of text.
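One way the four comparison files of claim 17 could yield delimiters is sketched below. The claim names the files but not the comparison procedure, so everything here is an assumption: that the paragraph bytes are known and stored uncompressed, that the empty document shares a fixed header with the others, and that the bytes between adjacent paragraphs act as the separator. The function and key names are hypothetical.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the longest shared prefix of two byte strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def learn_markers(empty: bytes, doc_a: bytes, doc_ab: bytes, doc_acb: bytes,
                  para_a: bytes, para_b: bytes, para_c: bytes) -> dict:
    """Derive candidate markers by comparing the four generated files."""
    # Bytes shared with the empty document are treated as a fixed header.
    header_len = common_prefix_len(empty, doc_a)
    # Whatever sits between the header and the first paragraph is taken
    # to be the marker that opens a text chunk.
    a_start = doc_a.index(para_a)
    start_marker = doc_a[header_len:a_start]
    # In the two-paragraph file, the bytes between the paragraphs are
    # taken to be the paragraph separator.
    a_end = doc_ab.index(para_a) + len(para_a)
    separator = doc_ab[a_end:doc_ab.index(para_b, a_end)]
    # The three-paragraph file checks that the same separator also
    # appears when a paragraph is inserted in the middle.
    expected = para_a + separator + para_c + separator + para_b
    return {
        "header": doc_a[:header_len],
        "start_marker": start_marker,
        "separator": separator,
        "confirmed": expected in doc_acb,
    }
```

The recorded markers could then be used to split the raw bytes of an unseen document from the same application into text chunks; a `confirmed` value of `False` would suggest the format stores paragraphs in some way this simple comparison does not capture.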
Specification