SYSTEMS AND METHODS FOR PROCESSING DOCUMENTS OF UNKNOWN OR UNSPECIFIED FORMAT

  • US 20130077855A1
  • Filed: 03/22/2012
  • Published: 03/28/2013
  • Est. Priority Date: 05/16/2011
  • Status: Active Grant
First Claim

1. A computer implemented method for extracting meaningful text from a document of unknown or unspecified format, the method including the steps of:

  • reading the document, thereby to extract raw encoded text;

  • analysing the raw encoded text, thereby to identify one or more text chunks; and

  • for a given chunk:

      • performing compression identification analysis to determine whether compression is likely and, in the event that compression is likely, performing a decompression process;

      • performing an encoding identification process, thereby to identify a likely character encoding protocol; and

      • converting the chunk using the identified likely character encoding protocol, thereby to output the chunk as readable text.
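
The claimed steps can be illustrated with a minimal Python sketch. It is not the patented implementation, and every function name is hypothetical. The claim does not say how chunks are delimited, how compression is detected, or which encodings are considered, so the sketch makes three assumptions: chunks are separated by a run of four NUL bytes, compression is judged "likely" from a zlib or gzip header, and the encoding is chosen by checking for a byte-order mark, then trying UTF-8, then falling back to Latin-1.

```python
import zlib

# Assumed chunk delimiter; the claim does not define how chunks are found.
CHUNK_SEPARATOR = b"\x00\x00\x00\x00"


def split_chunks(raw: bytes, sep: bytes = CHUNK_SEPARATOR) -> list[bytes]:
    """Identify text chunks by splitting on an assumed separator."""
    return [chunk for chunk in raw.split(sep) if chunk]


def looks_compressed(chunk: bytes) -> bool:
    """Heuristic: does the chunk start with a gzip or zlib header?"""
    if len(chunk) < 2:
        return False
    if chunk[:2] == b"\x1f\x8b":  # gzip magic number
        return True
    # zlib header: deflate method in the low nibble of the first byte,
    # and the two header bytes form a multiple of 31.
    return (chunk[0] & 0x0F) == 8 and ((chunk[0] << 8) | chunk[1]) % 31 == 0


def maybe_decompress(chunk: bytes) -> bytes:
    """Decompress if compression is likely; otherwise return the chunk."""
    if not looks_compressed(chunk):
        return chunk
    try:
        # wbits=47 lets zlib auto-detect a zlib or gzip wrapper.
        return zlib.decompress(chunk, wbits=47)
    except zlib.error:
        # The header check was a false positive; keep the original bytes.
        return chunk


def identify_encoding(chunk: bytes) -> str:
    """Pick a likely encoding (assumed order: BOM, UTF-8, Latin-1)."""
    if chunk.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    if chunk.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    try:
        chunk.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"  # decodes any byte sequence


def extract_chunk_text(chunk: bytes) -> str:
    """Run the per-chunk steps: decompress, identify encoding, convert."""
    data = maybe_decompress(chunk)
    return data.decode(identify_encoding(data))


def extract_text(raw: bytes) -> list[str]:
    """Run the whole pipeline on raw document bytes."""
    return [extract_chunk_text(chunk) for chunk in split_chunks(raw)]
```

A header check only shows that compression is likely, which is why the sketch falls back to the original bytes when decompression fails. A fuller implementation would presumably score several candidate encodings rather than accept the first one that decodes.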

  • 9 Assignments