SYSTEMS AND METHODS FOR PROCESSING DOCUMENTS OF UNKNOWN OR UNSPECIFIED FORMAT
First Claim
1. A computer implemented method for extracting meaningful text from a document of unknown or unspecified format, the method including the steps of:
reading the document, thereby to extract raw encoded text;
analysing the raw encoded text, thereby to identify one or more text chunks; and
for a given chunk:
performing compression identification analysis to determine whether compression is likely and, in the event that compression is likely, performing a decompression process;
performing an encoding identification process thereby to identify a likely character encoding protocol; and
converting the chunk using the identified likely character encoding protocol, thereby to output the chunk as readable text.
Abstract
A computer implemented method for extracting meaningful text from a document of unknown or unspecified format. In a particular embodiment, the method includes reading the document, thereby to extract raw encoded text, analysing the raw encoded text, thereby to identify one or more text chunks, and, for a given chunk, performing compression identification analysis to determine whether compression is likely and, in the event that compression is likely, performing a decompression process. The method can further include performing an encoding identification process thereby to identify a likely character encoding protocol, and converting the chunk using the identified likely character encoding protocol, thereby to output the chunk as readable text.
23 Claims
1. A computer implemented method for extracting meaningful text from a document of unknown or unspecified format, the method including the steps of:
reading the document, thereby to extract raw encoded text;
analysing the raw encoded text, thereby to identify one or more text chunks; and
for a given chunk:
performing compression identification analysis to determine whether compression is likely and, in the event that compression is likely, performing a decompression process;
performing an encoding identification process thereby to identify a likely character encoding protocol; and
converting the chunk using the identified likely character encoding protocol, thereby to output the chunk as readable text.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17.
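The per-chunk steps of claim 1 can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: it assumes zlib is the only compression scheme of interest, tests for it with the two-byte zlib header check, and picks the first candidate encoding that decodes strictly into printable text. The names `maybe_decompress`, `looks_readable`, `chunk_to_text` and the `CANDIDATE_ENCODINGS` list are hypothetical.

```python
import zlib
from typing import Optional, Tuple

# Hypothetical candidate list, tried in order; a real system would cover many more.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "utf-16-le")


def maybe_decompress(chunk: bytes) -> bytes:
    """Inflate the chunk if it plausibly starts with a zlib header, else return it as-is."""
    looks_compressed = (
        len(chunk) >= 2
        and (chunk[0] & 0x0F) == 8                   # compression method 8 = deflate
        and ((chunk[0] << 8) | chunk[1]) % 31 == 0   # zlib header check value
    )
    if not looks_compressed:
        return chunk
    try:
        return zlib.decompress(chunk)
    except zlib.error:
        return chunk  # the header matched by coincidence; treat as uncompressed


def looks_readable(text: str) -> bool:
    """Accept non-empty text made only of printable characters and whitespace."""
    return bool(text) and all(c.isprintable() or c.isspace() for c in text)


def chunk_to_text(chunk: bytes) -> Optional[Tuple[str, str]]:
    """Return (readable text, encoding name), or None if no candidate fits."""
    data = maybe_decompress(chunk)
    for encoding in CANDIDATE_ENCODINGS:
        try:
            text = data.decode(encoding)  # strict decoding: invalid bytes raise
        except UnicodeDecodeError:
            continue
        if looks_readable(text):
            return text, encoding
    return None


if __name__ == "__main__":
    raw = zlib.compress("Hello world".encode("utf-16-le"))
    print(chunk_to_text(raw))
```

The "first strict decode that looks readable" rule is a deliberately crude stand-in for the encoding identification process; claims 20 and 21 describe scoring approaches that could take its place.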
18. A learning method for identifying delimiters/markers in raw encoded text created using a native application, the method including:
inputting four documents created using the native application, the documents including:
an empty document;
a document defined by a first paragraph of text;
a document defined by the first paragraph of text followed immediately by a second paragraph of text; and
a document defined by the first paragraph of text followed immediately by a third paragraph of text, followed immediately by the second paragraph of text; and
comparing those documents thereby to identify delimiters/markers.
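One way the four-document comparison of claim 18 might work is sketched below. The sketch rests on strong simplifying assumptions that the claim does not state: each document is a fixed header, the paragraphs joined by a fixed delimiter, and a fixed footer; the footer does not begin with the same byte as the first paragraph; and the second and third paragraphs begin with different bytes. The function names are hypothetical.

```python
from typing import Tuple


def common_prefix(a: bytes, b: bytes) -> bytes:
    """Longest byte string that both a and b start with."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def common_suffix(a: bytes, b: bytes) -> bytes:
    """Longest byte string that both a and b end with."""
    return common_prefix(a[::-1], b[::-1])[::-1]


def learn_markers(empty: bytes, one: bytes, two: bytes, three: bytes) -> Tuple[bytes, bytes, bytes]:
    """Return (header, footer, paragraph delimiter) inferred from the four documents.

    empty : document with no text
    one   : P1
    two   : P1 followed by P2
    three : P1 followed by P3 followed by P2
    """
    # Header and footer are whatever the empty and one-paragraph documents share.
    header = common_prefix(empty, one)
    footer = common_suffix(empty[len(header):], one[len(header):])

    def body(doc: bytes) -> bytes:
        return doc[len(header): len(doc) - len(footer)]

    # After P1, document two continues "delim + P2" and document three
    # continues "delim + P3 + ...", so their shared start is the delimiter.
    p1 = body(one)
    tail_two = body(two)[len(p1):]
    tail_three = body(three)[len(p1):]
    return header, footer, common_prefix(tail_two, tail_three)


if __name__ == "__main__":
    H, F, D = b"\x01HDR\x02", b"\x03END", b"\r\n\x1e"
    p1, p2, p3 = b"Alpha paragraph.", b"Beta paragraph.", b"Gamma paragraph."
    print(learn_markers(H + F, H + p1 + F, H + p1 + D + p2 + F, H + p1 + D + p3 + D + p2 + F))
```

Inserting the third paragraph between the first and second is what lets the comparison separate the delimiter from the paragraph text that follows it.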
19. A learning method for identifying delimiters/markers in raw encoded text created using a native application, the method including:
inputting a set of documents created using the native application;
receiving data indicative of known text portions known to exist in each of the documents;
processing the documents on the basis of a set of operations thereby to identify the known text portions; and
based on the identification of the known text portions, identifying the delimiters/markers.
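The known-text approach of claim 19 could look something like the following sketch. It assumes, purely for illustration, that the "set of operations" is a small table of decoding functions, and that the markers are whatever text immediately surrounds the known portion in every document. The `OPERATIONS` table and `learn_markers` are hypothetical names.

```python
import os.path
import zlib
from typing import List, Optional, Tuple

# Hypothetical operations, tried in order until one exposes the known text.
OPERATIONS = {
    "utf-8": lambda raw: raw.decode("utf-8"),
    "utf-16-le": lambda raw: raw.decode("utf-16-le"),
    "zlib+utf-8": lambda raw: zlib.decompress(raw).decode("utf-8"),
}


def common_suffix(items: List[str]) -> str:
    """Longest string that every item ends with."""
    return os.path.commonprefix([s[::-1] for s in items])[::-1]


def learn_markers(samples: List[Tuple[bytes, str]]) -> Optional[Tuple[str, str, str]]:
    """samples holds (raw document bytes, text known to be in that document).

    Returns (operation name, marker before the text, marker after the text),
    or None if no operation reveals the known text in every document.
    """
    for name, operation in OPERATIONS.items():
        befores, afters = [], []
        for raw, known in samples:
            try:
                text = operation(raw)
            except (UnicodeDecodeError, zlib.error):
                break  # this operation does not apply to the document
            pos = text.find(known)
            if pos < 0:
                break  # decoded, but the known text is not visible
            befores.append(text[:pos])
            afters.append(text[pos + len(known):])
        else:
            # Every document exposed its known text: the shared surroundings
            # are the candidate markers.
            return name, common_suffix(befores), os.path.commonprefix(afters)
    return None


if __name__ == "__main__":
    docs = [
        (zlib.compress(b"hdrA<p>Hello there</p>x"), "Hello there"),
        (zlib.compress(b"hdrBB<p>Second sample</p>yy"), "Second sample"),
    ]
    print(learn_markers(docs))
```

With only two sample documents the shared surroundings may include incidental text; more samples with varied headers and footers would narrow the result toward the true markers.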
20. A method for determining a likely language/encoding protocol combination for a portion of raw encoded text, thereby to allow extraction of meaningful text, the method including:
inputting the raw encoded text;
setting a first language and encoding protocol combination;
scoring the language/protocol combination based on identification of words from a common word selection for that language/protocol combination;
repeating the scoring for additional language/protocol combinations; and
identifying a likely language/protocol combination based on the relative scores.
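The common-word scoring of claim 20 can be illustrated with a small sketch. The word lists, the encoding list, and the scoring rule (a plain count of decoded words that appear in the common-word selection) are all assumptions made for the example; the claim does not fix any of them.

```python
import re
from typing import Dict, Tuple

# Hypothetical, deliberately tiny common-word selections.
COMMON_WORDS = {
    "english": {"the", "and", "of", "to", "in", "is"},
    "french": {"le", "la", "et", "les", "des", "est"},
    "german": {"der", "die", "und", "das", "ist", "nicht"},
}
ENCODINGS = ("utf-8", "cp1252", "utf-16-le")


def score(raw: bytes, language: str, encoding: str) -> int:
    """Count decoded words that belong to the language's common-word selection."""
    text = raw.decode(encoding, errors="replace").lower()
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    common = COMMON_WORDS[language]
    return sum(1 for word in words if word in common)


def likely_combination(raw: bytes) -> Tuple[Tuple[str, str], Dict[Tuple[str, str], int]]:
    """Score every language/encoding combination and return the best one with all scores."""
    scores = {
        (language, encoding): score(raw, language, encoding)
        for language in COMMON_WORDS
        for encoding in ENCODINGS
    }
    # On a tie, max() keeps the first combination in iteration order.
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    raw = "le chat est sur la table et les chiens".encode("utf-16-le")
    best, scores = likely_combination(raw)
    print(best, scores[best])
```

Pure ASCII text decodes identically under several encodings, so ties are expected for such input; a fuller implementation would need a tie-breaking rule or a richer score.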
21. A method for determining a likely language/encoding protocol combination for a portion of raw encoded text, thereby to allow extraction of meaningful text, the method including:
reading an input portion of the raw encoded text, the input portion having a first predetermined size;
processing the input portion, thereby to generate a set of n-grams;
for a plurality of dictionaries that each contain known n-grams in a respective language/character encoding combination, tallying the matches between the generated n-grams and known n-grams thereby to define a score for each dictionary;
normalising the scores; and
identifying a likely language/protocol combination based on the relative scores.
Dependent claims: 22, 23.
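A minimal sketch of the n-gram tallying in claim 21 follows. Several choices here are assumptions rather than anything the claim specifies: the n-grams are byte trigrams, the dictionaries are built from one short sample sentence per language, the input portion size is 4096 bytes, and normalisation divides each tally by the size of the dictionary. All names are hypothetical.

```python
from typing import Dict, Set, Tuple

PORTION_SIZE = 4096  # hypothetical "first predetermined size", in bytes
N = 3                # byte trigrams

# Hypothetical training samples; real dictionaries would come from large corpora.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "french": "le renard brun saute par dessus le chien paresseux et le chat",
}
ENCODINGS = ("utf-8", "utf-16-le")


def ngrams(data: bytes, n: int = N) -> Set[bytes]:
    """Set of distinct byte n-grams in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


# One dictionary of known n-grams per language/encoding combination.
DICTIONARIES = {
    (language, encoding): ngrams(text.encode(encoding))
    for language, text in SAMPLES.items()
    for encoding in ENCODINGS
}


def likely_combination(raw: bytes) -> Tuple[Tuple[str, str], Dict[Tuple[str, str], float]]:
    """Tally n-gram matches per dictionary, normalise, and return the best combination."""
    grams = ngrams(raw[:PORTION_SIZE])
    scores = {
        key: len(grams & known) / len(known)  # tally, normalised by dictionary size
        for key, known in DICTIONARIES.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    best, scores = likely_combination(SAMPLES["french"].encode("utf-16-le"))
    print(best, scores[best])
```

Working on byte n-grams means the language and the encoding are identified together, since the same text yields different byte sequences under different encodings.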
Specification