Identifying a property of a document

  • US 8,380,488 B1
  • Filed: 04/19/2007
  • Issued: 02/19/2013
  • Est. Priority Date: 04/19/2006
  • Status: Active Grant
First Claim

1. A computer-implemented method, comprising:

    receiving a sequence of bytes representing text in a document;

    identifying a plurality of byte-n-grams occurring in the sequence of bytes, each byte-n-gram comprising n adjacent bytes from the sequence of bytes, where n is an integer greater than 1;

    identifying, for multiple encodings, a respective likelihood of each byte-n-gram occurring in each of the respective multiple encodings;

    receiving multiple likelihoods associated with a top-level-domain of the document, each of the multiple likelihoods indicating a likelihood that documents from the top level domain are in a respective one of the multiple encodings;

    determining a respective encoding score for each of the multiple encodings, each respective encoding score being based on the likelihood of the plurality of byte-n-grams occurring in the corresponding one of the multiple encodings and based on the likelihoods associated with the top-level-domain the corresponding one of the multiple encodings; and

    identifying a most likely encoding of the document based on a highest encoding score among the encoding scores.
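
The steps of the claim can be illustrated with a minimal sketch. It is not the patented implementation: the claim does not say how the byte-n-gram likelihoods and the top-level-domain likelihoods are combined, so this sketch assumes a naive-Bayes-style sum of log-likelihoods. The function names, the `floor` value for unseen n-grams, and the toy likelihood tables are all hypothetical and chosen only for illustration.

```python
import math


def byte_ngrams(data: bytes, n: int = 2) -> list[bytes]:
    """Return every run of n adjacent bytes in data (n must exceed 1)."""
    if n < 2:
        raise ValueError("n must be an integer greater than 1")
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def encoding_scores(data, ngram_likelihoods, tld_likelihoods, n=2, floor=1e-6):
    """Score each candidate encoding for the given byte sequence.

    ngram_likelihoods: {encoding: {byte-n-gram: likelihood}}
    tld_likelihoods:   {encoding: likelihood for the document's TLD}
    Assumption: likelihoods are combined by summing logarithms, and
    anything missing from a table gets the small `floor` likelihood.
    """
    grams = byte_ngrams(data, n)
    scores = {}
    for encoding, table in ngram_likelihoods.items():
        # Start from the top-level-domain likelihood for this encoding.
        score = math.log(tld_likelihoods.get(encoding, floor))
        # Add the likelihood of each observed byte-n-gram.
        for gram in grams:
            score += math.log(table.get(gram, floor))
        scores[encoding] = score
    return scores


def most_likely_encoding(data, ngram_likelihoods, tld_likelihoods, n=2):
    """Pick the encoding with the highest score."""
    scores = encoding_scores(data, ngram_likelihoods, tld_likelihoods, n)
    return max(scores, key=scores.get)


# Toy data, invented for illustration only.
SAMPLE = "éé".encode("utf-8")  # b"\xc3\xa9\xc3\xa9"
TOY_NGRAM_LIKELIHOODS = {
    "UTF-8": {b"\xc3\xa9": 0.05, b"\xa9\xc3": 0.01},
    "ISO-8859-1": {b"\xc3\xa9": 0.0001, b"\xa9\xc3": 0.0001},
}
TOY_TLD_LIKELIHOODS = {"UTF-8": 0.6, "ISO-8859-1": 0.4}  # e.g. a ".fr" page

if __name__ == "__main__":
    print(most_likely_encoding(SAMPLE, TOY_NGRAM_LIKELIHOODS, TOY_TLD_LIKELIHOODS))
```

In this sketch the top-level-domain likelihood acts like a prior: when the byte-n-gram evidence for two encodings is close, the encoding more common on that domain wins.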
