SYSTEMS AND METHODS FOR PROCESSING DOCUMENTS OF UNKNOWN OR UNSPECIFIED FORMAT

US 20130077855A1
Filed: 03/22/2012
Published: 03/28/2013
Est. Priority Date: 05/16/2011
Status: Active Grant

Abstract

A computer implemented method for extracting meaningful text from a document of unknown or unspecified format. In a particular embodiment, the method includes reading the document, thereby to extract raw encoded text; analysing the raw encoded text, thereby to identify one or more text chunks; and, for a given chunk, performing compression identification analysis to determine whether compression is likely and, in the event that compression is likely, performing a decompression process. The method can further include performing an encoding identification process thereby to identify a likely character encoding protocol, and converting the chunk using the identified likely character encoding protocol, thereby to output the chunk as readable text.

23 Claims

1. A computer implemented method for extracting meaningful text from a document of unknown or unspecified format, the method including the steps of:
- reading the document, thereby to extract raw encoded text;
  
  analysing the raw encoded text, thereby to identify one or more text chunks; and
  
  for a given chunk;
  
  performing compression identification analysis to determine whether compression is likely and, in the event that compression is likely, performing a decompression process;
  
  performing an encoding identification process thereby to identify a likely character encoding protocol; and
  
  converting the chunk using the identified likely character encoding protocol, thereby to output the chunk as readable text.
  - 2. A method according to claim 1 wherein the compression identification analysis and encoding identification process are performed for each chunk.
  - 3. A method according to claim 2 wherein the compression identification analysis and encoding identification process are performed in a modified manner for one or more later chunks based on input indicative of the results of the compression identification analysis and encoding identification process for one or more earlier chunks.
  - 4. A method according to claim 1 wherein analysing the raw encoded text thereby to identify text chunks is based on a recorded set of delimiters/markers.
  - 5. A method according to claim 4 wherein the delimiters/markers are derived from a learning method.
  - 6. A method according to claim 5 wherein the learning method leverages a set of comparison files generated using the application responsible for generating the document of unknown or unspecified format.
  - 7. A method according to claim 6 wherein the set of comparison files have controlled content.
  - 8. A method according to claim 6 wherein the set of comparison documents include portions of known text.
  - 9. A method according to claim 1 wherein the compression identification analysis includes calculating information density and, in the event that the information density is greater than a predetermined threshold, determining that compression is likely.
  - 10. A method according to claim 1 wherein the decompression process includes applying a plurality of known decompression algorithms, and determining which of those decompression algorithms provides the best result, wherein that best result is issued as input for the encoding identification process.
  - 11. A method according to claim 1 wherein the encoding identification process includes, for each of a plurality of encoding protocols, using a common word selection of words in that protocol to identify matches.
  - 12. A method according to claim 11 wherein the likely encoding protocol is identified as that with the most matches.
  - 13. A method according to claim 1 wherein the encoding identification process includes:
    - setting a first language and encoding protocol combination;
      
      scoring the language/protocol combination based on identification of words from a common word selection for that language/protocol combination;
      
      repeating the scoring for additional language/protocol combinations; and
      
      identifying a likely language/protocol combination based on the relative scores.
  - 14. A method according to claim 1 wherein the encoding identification process includes, for a given chunk:
    - reading an input portion of a chunk, the input portion having a first predetermined size;
      
      processing the input portion, thereby to generate a set of n-grams;
      
      for a plurality of dictionaries that each contain known n-grams in a respective language/character encoding combination, tallying the matches between the generated n-grams and known n-grams thereby to define a score for each dictionary;
      
      normalising the scores; and
      
      identifying a likely language/protocol combination based on the relative scores.
  - 15. A method according to claim 14 wherein, in the case that there are a plurality of highest scores in a predetermined range, the method includes:
    - reading a further input portion of a chunk, the input portion having a second predetermined size which is greater than the first predetermined size;
      
      processing the further input portion, thereby to generate a set of n-grams;
      
      for a plurality of dictionaries that each contain known n-grams in a respective language/character encoding combination, tallying the matches between the generated n-grams and known n-grams thereby to define a score for each dictionary;
      
      normalising the scores; and
      
      identifying a likely language/protocol combination based on the relative scores.
  - 16. A method according to claim 15 including, in the case that there are a plurality of highest scores in a predetermined range, repeating the method of claim 15 for a third predetermined size greater than the second predetermined size.
  - 17. A method according to claim 15 wherein the raw encoded text is defined as a single chunk.
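
The compression steps of claims 9 and 10 can be illustrated with a minimal Python sketch. Several details are assumptions that do not come from the claims: information density is taken to be Shannon entropy of the byte histogram, the threshold value of 7.0 bits per byte is arbitrary, the candidate decompressors are three from the Python standard library, and the "best result" is taken to be the successful output with the lowest density. The names `information_density`, `compression_likely`, and `best_decompression` are hypothetical.

```python
import bz2
import lzma
import math
import zlib
from collections import Counter

# Assumed cut-off in bits per byte; the claim only says "predetermined threshold".
DENSITY_THRESHOLD = 7.0

# Hypothetical candidate set standing in for "a plurality of known
# decompression algorithms".
DECOMPRESSORS = {
    "zlib": zlib.decompress,
    "bz2": bz2.decompress,
    "lzma": lzma.decompress,
}


def information_density(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def compression_likely(chunk: bytes, threshold: float = DENSITY_THRESHOLD) -> bool:
    """Density above the threshold is treated as a sign of compression."""
    return information_density(chunk) > threshold


def best_decompression(chunk: bytes):
    """Try every candidate; return (name, output) with the lowest density, or None."""
    best = None
    for name, decompress in DECOMPRESSORS.items():
        try:
            output = decompress(chunk)
        except Exception:  # each library raises its own error type on bad input
            continue
        if not output:
            continue
        density = information_density(output)
        if best is None or density < best[0]:
            best = (density, name, output)
    return None if best is None else (best[1], best[2])
```

In this sketch a chunk would first be passed to `compression_likely`, and only if that returns true would `best_decompression` be called, with its output handed on to encoding identification. Encrypted or already random data would also exceed the density threshold, which is one reason the check can only indicate that compression is likely.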

18. A learning method for identifying delimiters/markers in raw encoded text created using a native application, the method including:
- inputting four documents created using the native application, the documents including;
  
  an empty document;
  
  a document defined by a first paragraph of text;
  
  a document defined by the first paragraph of text followed immediately by a second paragraph of text; and
  
  a document defined by the first paragraph of text followed immediately by a third paragraph of text, followed immediately by the second paragraph of text; and
  
  comparing those documents thereby to identify delimiters/markers.
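
One way the four-document comparison of claim 18 might work is sketched below in Python. The sketch assumes that the learner authored the sample documents and therefore knows the paragraph texts as bytes, and that each paragraph is stored contiguously and uncompressed. Neither assumption is stated in the claim. The function name `learn_markers` and the returned field names are hypothetical.

```python
def common_prefix(a: bytes, b: bytes) -> bytes:
    """Longest run of bytes shared at the start of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def common_suffix(a: bytes, b: bytes) -> bytes:
    """Longest run of bytes shared at the end of a and b."""
    return common_prefix(a[::-1], b[::-1])[::-1]


def learn_markers(empty, one, two, three, p1, p2, p3):
    """Compare the four sample documents and return the inferred markers.

    empty: document with no text        one:   document holding p1
    two:   document holding p1, p2      three: document holding p1, p3, p2
    """
    # Everything around p1 in the one-paragraph document is structure.
    start = one.index(p1)
    before, after = one[:start], one[start + len(p1):]

    # Bytes between p1 and p2 in the two-paragraph document.
    end_p1 = two.index(p1) + len(p1)
    separator = two[end_p1:two.index(p2, end_p1)]

    # Inserting p3 in the middle should reproduce the same separator twice.
    end_p1 = three.index(p1) + len(p1)
    start_p3 = three.index(p3, end_p1)
    end_p3 = start_p3 + len(p3)
    start_p2 = three.index(p2, end_p3)
    if not (three[end_p1:start_p3] == separator == three[end_p3:start_p2]):
        raise ValueError("paragraph separator is not consistent across samples")

    # The empty document separates document-level framing from
    # paragraph-level markers.
    header = common_prefix(empty, one)
    footer = common_suffix(empty, one)
    return {
        "header": header,
        "footer": footer,
        "paragraph_open": before[len(header):],
        "paragraph_close": after[:len(after) - len(footer)],
        "separator": separator,
    }
```

The third document is what makes the result trustworthy in this sketch: if the bytes between paragraphs changed when a paragraph was inserted, the format would need a richer model than fixed markers.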

19. A learning method for identifying delimiters/markers in raw encoded text created using a native application, the method including:
- inputting a set of documents created using the native application;
  
  receiving data indicative of known text portions known to exist in each of the documents;
  
  processing the documents on the basis of a set of operations thereby to identify the known text portions; and
  
  based on the identification of the known text portions, identifying the delimiters/markers.
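
Claim 19 leaves the "set of operations" open. The Python sketch below assumes the operations are candidate character encodings: each known text portion is encoded under a candidate, searched for in the raw bytes, and the bytes that consistently precede and follow the hits are reported as markers. The encoding list, the function name `learn_from_known_text`, and the return shape are all hypothetical.

```python
from functools import reduce


def common_prefix(a: bytes, b: bytes) -> bytes:
    """Longest run of bytes shared at the start of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def common_suffix(a: bytes, b: bytes) -> bytes:
    """Longest run of bytes shared at the end of a and b."""
    return common_prefix(a[::-1], b[::-1])[::-1]


def learn_from_known_text(documents, known_texts,
                          encodings=("utf-8", "cp1252", "utf-16-le")):
    """Return (encoding, start_marker, end_marker), or None if nothing fits.

    documents:   list of raw byte strings
    known_texts: list of lists of str, the text known to be in each document
    """
    for encoding in encodings:
        before, after = [], []
        try:
            for doc, texts in zip(documents, known_texts):
                for text in texts:
                    needle = text.encode(encoding)
                    pos = doc.find(needle)
                    if pos < 0:
                        raise LookupError(text)
                    before.append(doc[:pos])
                    after.append(doc[pos + len(needle):])
        except (LookupError, UnicodeEncodeError):
            continue  # this operation does not expose every known portion
        if not before:
            return None
        # Bytes that always precede a known portion, and bytes that always follow.
        start_marker = reduce(common_suffix, before)
        end_marker = reduce(common_prefix, after)
        return encoding, start_marker, end_marker
    return None
```

The first encoding that locates every known portion wins in this sketch, so the order of `encodings` matters when the text is pure ASCII and several encodings produce identical bytes. Known text containing non-ASCII characters separates the candidates more reliably.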

20. A method for determining a likely language/encoding protocol combination for a portion of raw encoded text, thereby to allow extraction of meaningful text, the method including:
- inputting the raw encoded text;
  
  setting a first language and encoding protocol combination;
  
  scoring the language/protocol combination based on identification of words from a common word selection for that language/protocol combination;
  
  repeating the scoring for additional language/protocol combinations; and
  
  identifying a likely language/protocol combination based on the relative scores.
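
The scoring loop of claim 20 can be sketched in Python as follows. The word lists are tiny illustrative samples, not the common word selections the claim has in mind, and matching whole words by searching for the space-delimited encoded form is an assumption. The names `COMMON_WORDS`, `ENCODINGS`, `score_combination`, and `likely_combination` are hypothetical.

```python
# Illustrative word lists only; a real selection would be far larger.
COMMON_WORDS = {
    "english": ["the", "and", "of", "to", "in", "is", "that"],
    "german": ["der", "die", "und", "das", "ist", "nicht", "für"],
    "french": ["le", "la", "et", "les", "des", "est", "être"],
}
ENCODINGS = ["utf-8", "cp1252", "utf-16-le"]


def score_combination(raw: bytes, language: str, encoding: str) -> int:
    """Count occurrences of each common word, encoded under the given protocol."""
    score = 0
    for word in COMMON_WORDS[language]:
        try:
            needle = (" " + word + " ").encode(encoding)
        except UnicodeEncodeError:
            continue  # the word cannot be written in this encoding
        score += raw.count(needle)
    return score


def likely_combination(raw: bytes):
    """Score every language/encoding pair and return (best_pair, all_scores)."""
    scores = {
        (language, encoding): score_combination(raw, language, encoding)
        for language in COMMON_WORDS
        for encoding in ENCODINGS
    }
    best = max(scores, key=scores.get)  # ties resolve to the first pair listed
    return best, scores
```

Words that contain non-ASCII characters do most of the work of telling encodings apart here, because ASCII-only words encode to the same bytes under UTF-8 and cp1252.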

21. A method for determining a likely language/encoding protocol combination for a portion of raw encoded text, thereby to allow extraction of meaningful text, the method including:
- reading an input portion of the raw encoded text, the input portion having a first predetermined size;
  
  processing the input portion, thereby to generate a set of n-grams;
  
  for a plurality of dictionaries that each contain known n-grams in a respective language/character encoding combination, tallying the matches between the generated n-grams and known n-grams thereby to define a score for each dictionary;
  
  normalising the scores; and
  
  identifying a likely language/protocol combination based on the relative scores.
  - 22. A method according to claim 21 wherein, in the case that there are a plurality of highest scores in a predetermined range, the method includes:
    - reading a further input portion of a chunk, the input portion having a second predetermined size which is greater than the first predetermined size;
      
      processing the further input portion, thereby to generate a set of n-grams;
      
      for a plurality of dictionaries that each contain known n-grams in a respective language/character encoding combination, tallying the matches between the generated n-grams and known n-grams thereby to define a score for each dictionary;
      
      normalising the scores; and
      
      identifying a likely language/protocol combination based on the relative scores.
  - 23. A method according to claim 21 including, in the case that there are a plurality of highest scores in a predetermined range, repeating the method of claim 21 for a third predetermined size greater than the second predetermined size.
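
The n-gram procedure of claims 21 to 23 can be sketched in Python as below. The sketch assumes byte trigrams, normalises each score by dividing the match tally by the number of n-grams generated, and treats "a plurality of highest scores in a predetermined range" as the top two scores lying within a fixed margin of each other. The sizes, the margin, and all function names are hypothetical, and `dictionaries` is assumed to be non-empty.

```python
def byte_ngrams(data: bytes, n: int = 3) -> list:
    """All overlapping n-byte windows of data."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def build_dictionary(sample: str, encoding: str, n: int = 3) -> set:
    """Known n-grams for one language/encoding pair, from sample text."""
    return set(byte_ngrams(sample.encode(encoding), n))


def score_portion(portion: bytes, dictionaries: dict, n: int = 3) -> dict:
    """Tally matches per dictionary, normalised by the number of n-grams read."""
    grams = byte_ngrams(portion, n)
    if not grams:
        return {key: 0.0 for key in dictionaries}
    return {
        key: sum(1 for g in grams if g in known) / len(grams)
        for key, known in dictionaries.items()
    }


def identify_combination(raw: bytes, dictionaries: dict,
                         sizes=(64, 256, 1024), margin: float = 0.05, n: int = 3):
    """Return (best_key, size_used, scores), reading more input while scores tie."""
    result = None
    for size in sizes:
        scores = score_portion(raw[:size], dictionaries, n)
        ranked = sorted(scores, key=scores.get, reverse=True)
        result = (ranked[0], size, scores)
        clear_winner = (len(ranked) < 2
                        or scores[ranked[0]] - scores[ranked[1]] > margin)
        if clear_winner or size >= len(raw):
            break  # decided, or no further input to read
    return result
```

Starting with a small portion keeps the common case cheap in this sketch; the larger reads only happen when the leading candidates cannot be separated.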

Current Assignee: Hyland Switzerland, Sarl
Original Assignee: Isys Search Software Pty Ltd. (Kofax Incorporated)
Inventors: Coles, Scott; Truscott, Ben; Murphy, Derek; Davies, Ian
Granted Patent: US 9,122,898 B2
US Class Current: 382/155
CPC Class Codes: G06F 18/00 (Pattern recognition); G06F 40/237 (Lexical tools)
