Systems and methods for processing documents of unknown or unspecified format

  • US 9,122,898 B2
  • Filed: 03/22/2012
  • Issued: 09/01/2015
  • Est. Priority Date: 05/16/2011
  • Status: Expired due to Fees
First Claim

1. A computer implemented method for extracting meaningful text from a document of unknown or unspecified format, comprising:

  • reading the document to extract raw encoded text;

    analyzing the raw encoded text to identify one or more text chunks; and

    for a text chunk of the one or more text chunks;

    performing a compression identification analysis to determine whether compression is likely and, if compression is likely, performing a decompression process;

    performing an encoding identification process to identify a likely character encoding protocol; and

    converting the text chunk using the identified likely character encoding protocol to output the text chunk as readable text, wherein the encoding identification process includes, for each protocol of a plurality of encoding protocols, using a common word selection of words in each protocol to identify matches.
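
The per-chunk steps of the claim (compression check, optional decompression, encoding identification by common-word matching, conversion to readable text) can be illustrated with a minimal Python sketch. It is not the patented implementation. Everything specific in it is an assumption made for illustration: the candidate encoding list, the English common-word list, the zlib header heuristic, the choice to score each encoding by decoding the chunk and counting common words, and the function names `looks_compressed`, `maybe_decompress`, `score_encoding`, `identify_encoding`, and `extract_text`. The earlier steps of reading the document and splitting it into text chunks are taken as already done.

```python
import zlib

# Assumed for illustration; the claim names neither the encodings nor the words.
CANDIDATE_ENCODINGS = ["utf-8", "utf-16-le", "utf-16-be"]
COMMON_WORDS = {"the", "and", "of", "to", "in", "is", "that", "for"}


def looks_compressed(chunk: bytes) -> bool:
    """Heuristic check for a zlib header (RFC 1950): method 8 and a valid checksum."""
    if len(chunk) < 2:
        return False
    cmf, flg = chunk[0], chunk[1]
    return (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0


def maybe_decompress(chunk: bytes) -> bytes:
    """Decompress when compression looks likely; otherwise return the chunk unchanged."""
    if looks_compressed(chunk):
        try:
            return zlib.decompress(chunk)
        except zlib.error:
            pass  # The header matched by chance; treat the chunk as uncompressed.
    return chunk


def score_encoding(chunk: bytes, encoding: str) -> int:
    """Decode with one candidate encoding and count common-word matches."""
    decoded = chunk.decode(encoding, errors="replace").lower()
    return sum(1 for word in decoded.split() if word in COMMON_WORDS)


def identify_encoding(chunk: bytes) -> str:
    """Pick the candidate encoding with the most matches (ties go to the earliest)."""
    return max(CANDIDATE_ENCODINGS, key=lambda enc: score_encoding(chunk, enc))


def extract_text(chunk: bytes) -> tuple[str, str]:
    """Run the per-chunk steps and return (likely encoding, readable text)."""
    data = maybe_decompress(chunk)
    encoding = identify_encoding(data)
    return encoding, data.decode(encoding, errors="replace")


if __name__ == "__main__":
    sample = "the fox is in the garden and that is fine for the dog"
    print(extract_text(zlib.compress(sample.encode("utf-8"))))
    print(extract_text(sample.encode("utf-16-le")))
```

Scoring by decoded word matches is only one possible reading of "common word selection". A variant could instead encode each common word in every candidate protocol and search the raw bytes for those patterns, which avoids decoding but needs care with byte alignment in multi-byte encodings.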
