Systems and methods for processing documents of unknown or unspecified format

US 9,122,898 B2
Filed: 03/22/2012
Issued: 09/01/2015
Est. Priority Date: 05/16/2011
Status: Expired due to Fees

Abstract

A computer implemented method for extracting meaningful text from a document of unknown or unspecified format. In a particular embodiment, the method includes reading the document, thereby to extract raw encoded text, analysing the raw encoded text, thereby to identify one or more text chunks, and for a given chunk, performing compression identification analysis to determine whether compression is likely. The method can further include performing a decompression process, performing an encoding identification process thereby to identify a likely character encoding protocol, and converting the chunk using the identified likely character encoding protocol, thereby to output the chunk as readable text.
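
The sequence of steps in the abstract can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the patented implementation: the `<<CHUNK>>` marker, the candidate encoding list, and all function names are hypothetical, the compression test is simplified to attempting zlib and falling back on failure, and the encoding guess is a placeholder that takes the first candidate that decodes cleanly.

```python
import zlib
from typing import List

# Hypothetical chunk marker; a real format would use its own delimiters.
DELIMITER = b"<<CHUNK>>"
# Candidate encodings tried in order; latin-1 accepts any byte sequence.
CANDIDATES = ("utf-8", "latin-1")


def split_chunks(raw: bytes) -> List[bytes]:
    """Split the raw bytes on the marker and drop empty pieces."""
    return [piece for piece in raw.split(DELIMITER) if piece]


def maybe_decompress(chunk: bytes) -> bytes:
    """Simplification: attempt zlib and keep the chunk unchanged on failure."""
    try:
        return zlib.decompress(chunk)
    except zlib.error:
        return chunk


def guess_encoding(chunk: bytes) -> str:
    """Placeholder heuristic: first candidate that decodes without error."""
    for encoding in CANDIDATES:
        try:
            chunk.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return CANDIDATES[-1]


def extract_text(raw: bytes) -> List[str]:
    """Run each chunk through decompression, encoding guess, and decoding."""
    readable = []
    for chunk in split_chunks(raw):
        data = maybe_decompress(chunk)
        readable.append(data.decode(guess_encoding(data)))
    return readable


# One plain chunk followed by one zlib-compressed UTF-8 chunk.
raw = b"hello world" + DELIMITER + zlib.compress("héllo".encode("utf-8"))
print(extract_text(raw))
```

Each stage is a separate function so that a statistical compression test or a scoring-based encoding identifier could replace the placeholders without changing `extract_text`.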

22 Claims

1. A computer implemented method for extracting meaningful text from a document of unknown or unspecified format, comprising:
- reading the document to extract raw encoded text;
  
  analyzing the raw encoded text to identify one or more text chunks; and
  
  for a text chunk of the one or more text chunks:
  
  performing a compression identification analysis to determine whether compression is likely and, if compression is likely, performing a decompression process;
  
  performing an encoding identification process to identify a likely character encoding protocol; and
  
  converting the text chunk using the identified likely character encoding protocol to output the text chunk as readable text, wherein the encoding identification process includes, for each protocol of a plurality of encoding protocols, using a common word selection of words in each protocol to identify matches.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 18, 19, 20, 21, 22):
  - 2. The method of claim 1, wherein the compression identification analysis and the encoding identification process are performed for each text chunk.
  - 3. The method of claim 2, wherein the compression identification analysis and the encoding identification process are performed in a modified manner for one or more later text chunks based on input indicative of a result of the compression identification analysis and the encoding identification process for one or more earlier text chunks.
  - 4. The method of claim 1, wherein the analyzing the raw encoded text to identify the one or more text chunks is based on a recorded set of delimiters/markers.
  - 5. The method of claim 4, wherein the recorded set of delimiters/markers are derived from a learning method.
  - 6. The method of claim 5, wherein the learning method leverages a set of comparison files generated using an application responsible for generating the document of unknown or unspecified format.
  - 7. The method of claim 6, wherein the set of comparison files has controlled content.
  - 8. The method of claim 6, wherein the set of comparison files includes portions of known text.
  - 9. The method of claim 1, wherein the compression identification analysis includes calculating an information density and, if the information density is greater than a predetermined threshold, determining that compression is likely.
  - 10. The method of claim 1, wherein the decompression process includes applying a plurality of decompression algorithms and determining which of those decompression algorithms provides a best result, wherein the best result is issued as input for the encoding identification process.
  - 11. The method of claim 1, wherein the likely encoding protocol is identified as that with the most matches.
  - 18. The method of claim 1, wherein the document of unknown or unspecified format includes non-text data.
  - 19. The method of claim 1, wherein the analyzing includes determining whether the one or more text chunks are indicative of non-text data.
  - 20. The method of claim 19, further comprising excluding the one or more text chunks from the output upon a determination that the one or more text chunks are indicative of non-text data.
  - 21. The method of claim 19, further comprising passing the one or more text chunks to a component operative to recognize a format of the one or more text chunks indicative of non-text data.
  - 22. The method of claim 1, further comprising providing the readable text to at least one of a search engine for searching and a text application available in a client device for rendering the readable text in a text document.
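
Claims 9 and 10 describe an information-density test followed by a trial of several decompression algorithms. A minimal sketch of that idea follows, under these assumptions: information density is taken to be Shannon entropy in bits per byte, the 6.5 threshold is an illustrative value, the candidate algorithms are three Python standard-library decompressors, and the "best result" is taken to be the output with the lowest density. None of these choices is stated in the claims.

```python
import bz2
import gzip
import math
import zlib
from collections import Counter
from typing import Optional, Tuple


def information_density(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def compression_likely(data: bytes, threshold: float = 6.5) -> bool:
    """The threshold is an illustrative value, not one taken from the patent."""
    return information_density(data) > threshold


# Candidate decompressors, all from the Python standard library.
DECOMPRESSORS = {
    "zlib": zlib.decompress,
    "gzip": gzip.decompress,
    "bz2": bz2.decompress,
}


def best_decompression(data: bytes) -> Tuple[Optional[str], bytes]:
    """Apply every decompressor and keep the lowest-density output.

    Returns (None, data) when no decompressor accepts the input.
    """
    best_name, best_output = None, data
    best_density = float("inf")
    for name, decompress in DECOMPRESSORS.items():
        try:
            output = decompress(data)
        except (zlib.error, OSError, EOFError, ValueError):
            continue  # This algorithm does not recognise the data.
        density = information_density(output)
        if density < best_density:
            best_name, best_output, best_density = name, output, density
    return best_name, best_output


text = " ".join(str(i * i) for i in range(5000)).encode("ascii")
packed = zlib.compress(text)
print(compression_likely(text), compression_likely(packed))
print(best_decompression(packed)[0])
```

Entropy estimates are unreliable on very short inputs, because a chunk of N bytes can never exceed log2(N) bits per byte, so a practical threshold would likely need to account for chunk length.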

12. A computer implemented method for extracting meaningful text from a document of unknown or unspecified format, comprising:
- reading the document to extract raw encoded text;
  
  analyzing the raw encoded text to identify one or more text chunks; and
  
  for a text chunk of the one or more text chunks:
  
  performing a compression identification analysis to determine whether compression is likely and, if compression is likely, performing a decompression process;
  
  performing an encoding identification process to identify a likely character encoding protocol; and
  
  converting the text chunk using the identified likely character encoding protocol to output the text chunk as readable text, wherein the encoding identification process includes:
  
  setting a first language and character encoding protocol combination;
  
  scoring the first language and character encoding protocol combination based on an identification of words from a common word selection for the first language and character encoding protocol combination;
  
  repeating the scoring for additional language and character encoding protocol combinations; and
  
  identifying a likely language and character encoding protocol combination based on the relative scores.
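
The scoring loop recited in claim 12 can be illustrated with a short sketch. The word lists, the three encodings, and the function names below are assumptions chosen for the example; the claim does not specify them. Each common word is encoded with the candidate encoding and counted in the raw chunk, so a combination scores well only when both the language and the byte representation fit.

```python
from typing import Dict, List, Tuple

# Tiny illustrative word lists; a practical selection would be far larger.
COMMON_WORDS: Dict[str, List[str]] = {
    "english": ["the", "and", "of", "to"],
    "german": ["der", "und", "die", "für"],
    "french": ["le", "et", "de", "être"],
}
ENCODINGS = ["utf-8", "utf-16-le", "latin-1"]

Combination = Tuple[str, str]  # (language, encoding)


def score(chunk: bytes, language: str, encoding: str) -> int:
    """Count space-delimited common words as they would appear in `encoding`."""
    return sum(
        chunk.count(f" {word} ".encode(encoding))
        for word in COMMON_WORDS[language]
    )


def identify(chunk: bytes) -> Tuple[Combination, Dict[Combination, int]]:
    """Score every combination in turn and return the highest scorer."""
    scores = {
        (language, encoding): score(chunk, language, encoding)
        for language in COMMON_WORDS
        for encoding in ENCODINGS
    }
    return max(scores, key=scores.get), scores


chunk = "das ist ein Geschenk für dich und die Katze und der Hund".encode("latin-1")
best, scores = identify(chunk)
print(best, scores[best])
```

In this example the German word "für" separates Latin-1 from UTF-8, since its byte form differs between the two. Words made only of ASCII letters encode identically in both, so a word selection intended to distinguish such encodings would need to include non-ASCII words.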

13. A computer implemented method for extracting meaningful text from a document of unknown or unspecified format, comprising:
- reading the document to extract raw encoded text;
  
  analyzing the raw encoded text to identify one or more text chunks; and
  
  for a text chunk of the one or more text chunks:
  
  performing a compression identification analysis to determine whether compression is likely and, if compression is likely, performing a decompression process;
  
  performing an encoding identification process to identify a likely character encoding protocol; and
  
  converting the text chunk using the identified likely character encoding protocol to output the text chunk as readable text, wherein the encoding identification process includes, for each text chunk:
  
  reading an input portion of the text chunk, the input portion having a first predetermined size;
  
  processing the input portion to generate a set of n-grams;
  
  for a plurality of dictionaries that each contain known n-grams in a respective language and character encoding protocol combination, tallying matches between the generated set of n-grams and the known n-grams to define a score for each of the plurality of dictionaries;
  
  normalizing scores of the plurality of dictionaries; and
  
  identifying a likely language and character encoding protocol combination based on the relative scores.
- Dependent claims (14, 15, 16):
  - 14. The method of claim 13, wherein, if there is a plurality of highest scores in a predetermined range, the method includes:
    - reading a further input portion of the text chunk, the further input portion having a second predetermined size which is greater than the first predetermined size;
      
      processing the further input portion to generate a set of n-grams;
      
      for the plurality of dictionaries that each contain known n-grams in a respective language and character encoding protocol combination, tallying matches between the generated set of n-grams and the known n-grams thereby to define a score for each of the plurality of dictionaries;
      
      normalizing scores of the plurality of dictionaries; and
      
      identifying a likely language and character encoding protocol combination based on the relative scores.
  - 15. The method of claim 14, further comprising, repeating the method of claim 14 for a third predetermined size greater than the second predetermined size if there is a plurality of highest scores in a predetermined range.
  - 16. The method of claim 14, wherein the raw encoded text is defined as a single text chunk.
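
The n-gram procedure of claims 13 to 15 can be sketched as follows. Several details are assumptions made for illustration only: n-grams are byte trigrams, dictionaries are built from one short sample sentence each, normalization divides each tally by the sum of all tallies, the portion sizes are 16, 64 and 256 bytes, and two scores count as tied when they differ by no more than 0.05.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Combination = Tuple[str, str]  # (language, encoding)


def byte_ngrams(data: bytes, n: int = 3) -> List[bytes]:
    """Return every run of n consecutive bytes in the input."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def build_dictionary(sample: str, encoding: str) -> FrozenSet[bytes]:
    """Hypothetical helper: derive known n-grams from a sample text."""
    return frozenset(byte_ngrams(sample.encode(encoding)))


def normalized_scores(
    portion: bytes, dictionaries: Dict[Combination, FrozenSet[bytes]]
) -> Dict[Combination, float]:
    """Tally n-gram matches per dictionary, then divide by the total tally."""
    grams = byte_ngrams(portion)
    tallies = {c: sum(g in known for g in grams) for c, known in dictionaries.items()}
    total = sum(tallies.values())
    return {c: (t / total if total else 0.0) for c, t in tallies.items()}


def identify(
    chunk: bytes,
    dictionaries: Dict[Combination, FrozenSet[bytes]],
    sizes: Tuple[int, ...] = (16, 64, 256),
    tie_range: float = 0.05,
) -> Optional[Combination]:
    """Grow the input portion until one combination leads by more than tie_range."""
    best = None
    for size in sizes:
        scores = normalized_scores(chunk[:size], dictionaries)
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        if not ranked:
            return None
        best = ranked[0][0]
        if len(ranked) == 1 or ranked[0][1] - ranked[1][1] > tie_range:
            break  # Clear leader; no larger portion is needed.
    return best


ENGLISH = "the quick brown fox jumps over the lazy dog and the cat"
GERMAN = "der schnelle braune fuchs springt über den faulen hund"
dictionaries = {
    ("english", "utf-8"): build_dictionary(ENGLISH, "utf-8"),
    ("english", "utf-16-le"): build_dictionary(ENGLISH, "utf-16-le"),
    ("german", "latin-1"): build_dictionary(GERMAN, "latin-1"),
}
print(identify("the dog and the fox".encode("utf-16-le"), dictionaries))
```

If the scores are still tied at the largest size, this sketch returns the top-ranked combination from that last pass; a stricter variant could return `None` instead.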

17. A computer implemented method for extracting meaningful text from a document of unknown or unspecified format, comprising:
- reading the document to extract raw encoded text;
  
  analyzing the raw encoded text to identify one or more text chunks; and
  
  for a text chunk of the one or more text chunks:
  
  performing a compression identification analysis to determine whether compression is likely and, if compression is likely, performing a decompression process;
  
  performing an encoding identification process to identify a likely character encoding protocol; and
  
  converting the text chunk using the identified likely character encoding protocol to output the text chunk as readable text, wherein the analyzing the raw encoded text to identify the one or more text chunks is based on a recorded set of delimiters/markers derived from a learning method leveraging a set of comparison files generated using an application responsible for generating the document of unknown or unspecified format, the set of comparison files including:
  
  an empty document;
  
  a first document defined by a first paragraph of text;
  
  a second document defined by the first paragraph of text followed immediately by a second paragraph of text; and
  
  a third document defined by the first paragraph of text followed immediately by a third paragraph of text, followed immediately by the second paragraph of text.
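
One way the four comparison files of claim 17 could yield delimiters is sketched below. The claim does not say how the learning method works, so everything here is an assumption: the known paragraphs are assumed to be stored uncompressed in a known encoding, the separator is inferred from the bytes between the first and second paragraphs, and it is confirmed against the third document. The `fake_save` function stands in for the generating application and uses an invented byte layout.

```python
from typing import List, Optional, Tuple


def between(data: bytes, left: bytes, right: bytes) -> Optional[bytes]:
    """Return the bytes between `left` and the next `right`, or None."""
    start = data.find(left)
    if start < 0:
        return None
    start += len(left)
    end = data.find(right, start)
    return None if end < 0 else data[start:end]


def learn_paragraph_delimiter(
    doc2: bytes, doc3: bytes, p1: bytes, p2: bytes, p3: bytes
) -> Optional[bytes]:
    """Propose the separator seen in doc2 and confirm it twice in doc3."""
    candidate = between(doc2, p1, p2)
    if not candidate:
        return None
    confirmed = between(doc3, p1, p3) == candidate == between(doc3, p3, p2)
    return candidate if confirmed else None


def learn_wrapper(empty: bytes, doc1: bytes, p1: bytes) -> Optional[Tuple[bytes, bytes]]:
    """Return the bytes around the first paragraph if the empty file shares them."""
    index = doc1.find(p1)
    if index < 0:
        return None
    prefix, suffix = doc1[:index], doc1[index + len(p1):]
    if empty.startswith(prefix) and empty.endswith(suffix):
        return prefix, suffix
    return None


def fake_save(paragraphs: List[bytes]) -> bytes:
    """Stand-in for the generating application; the byte layout is invented."""
    return b"\x01HDR" + b"\x02\x03".join(paragraphs) + b"\x04END"


p1, p2, p3 = b"alpha paragraph", b"beta paragraph", b"gamma paragraph"
empty, doc1 = fake_save([]), fake_save([p1])
doc2, doc3 = fake_save([p1, p2]), fake_save([p1, p3, p2])
print(learn_paragraph_delimiter(doc2, doc3, p1, p2, p3))
print(learn_wrapper(empty, doc1, p1))
```

The third document serves as a check: inserting a paragraph between the first and second shows whether the learned separator is stable or merely an artefact of one particular file.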

Current Assignee: Hyland Switzerland, Sarl
Original Assignee: Lexmark International Technology, S.A. (Ninestar Corporation)
Inventors: Coles, Scott; Truscott, Ben; Davies, Ian; Murphy, Derek
Primary Examiner(s): Koziol, Stephen R
Assistant Examiner(s): LE, TOTAM HA
Application Number: US13/427,506
Publication Number: US 20130077855A1
Time in Patent Office: 1,258 Days
Field of Search: 382/155
US Class Current: 1/1
CPC Class Codes: G06F 18/00 Pattern recognition; G06F 40/237 Lexical tools
