Identifying a property of a document
Abstract
Methods, systems and apparatus, including computer program products, for identifying properties of an electronic document. In one aspect, a sequence of bytes representing text in a document is received. A plurality of byte-n-grams are identified from the bytes. For multiple encodings, a respective likelihood of each byte-n-gram occurring in each of the respective multiple encodings is identified. A respective encoding score for each of the multiple encodings is determined. A most likely encoding of the document is identified based on a highest encoding score among the encoding scores. In another aspect, a sequence of characters, having an encoding, is identified in a document. The sequence is segmented into features, each corresponding to two or more characters. A respective score for each of multiple languages is determined based on the features and a respective language model. A language of the document is identified based on the scores.
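The second aspect of the abstract (language identification from character features) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the function names, the use of overlapping two-character features, the tiny hand-written language models, the floor value for unseen features, and the choice to combine feature likelihoods by summing logarithms.

```python
import math

def char_features(text, size=2):
    """Split text into overlapping features of `size` characters (size >= 2)."""
    if size < 2:
        raise ValueError("each feature must cover two or more characters")
    return [text[i:i + size] for i in range(len(text) - size + 1)]

def language_scores(text, models, size=2, floor=1e-6):
    """Score each language as the sum of log-likelihoods of the features.

    `models` maps a language name to a dict of feature -> likelihood.
    Features missing from a model get the small `floor` likelihood.
    """
    feats = char_features(text, size)
    return {
        lang: sum(math.log(model.get(f, floor)) for f in feats)
        for lang, model in models.items()
    }

def identify_language(text, models, size=2):
    """Return the language with the highest score."""
    scores = language_scores(text, models, size)
    return max(scores, key=scores.get)

# Hypothetical toy models; real models would be estimated from training text.
TOY_MODELS = {
    "english": {"th": 0.03, "he": 0.03},
    "german": {"ch": 0.03, "en": 0.03, "he": 0.01},
}

print(identify_language("the", TOY_MODELS))  # english, for these toy numbers
```

Working in log space avoids numeric underflow when many small likelihoods are multiplied; that is a design choice of this sketch, not something the abstract specifies.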
121 Citations
15 Claims
1. A computer-implemented method, comprising:
receiving a sequence of bytes representing text in a document;
identifying a plurality of byte-n-grams occurring in the sequence of bytes, each byte-n-gram comprising n adjacent bytes from the sequence of bytes, where n is an integer greater than 1;
identifying, for multiple encodings, a respective likelihood of each byte-n-gram occurring in each of the respective multiple encodings;
receiving multiple likelihoods associated with a top-level-domain of the document, each of the multiple likelihoods indicating a likelihood that documents from the top level domain are in a respective one of the multiple encodings;
determining a respective encoding score for each of the multiple encodings, each respective encoding score being based on the likelihood of the plurality of byte-n-grams occurring in the corresponding one of the multiple encodings and based on the likelihoods associated with the top-level-domain the corresponding one of the multiple encodings; and
identifying a most likely encoding of the document based on a highest encoding score among the encoding scores.
Dependent claims: 2, 3, 4, 5
6. A computer program product, encoded on a non-transitory computer-readable medium, operable to cause data processing apparatus to:
receive a sequence of bytes representing text in a document;
identify a plurality of byte-n-grams occurring in the sequence of bytes, each byte-n-gram comprising n adjacent bytes from the sequence of bytes, where n is an integer greater than 1;
identify, for multiple encodings, a respective likelihood of each byte-n-gram occurring in each of the respective multiple encodings;
receive multiple likelihoods associated with a top-level-domain of the document, each of the multiple likelihoods indicating a likelihood that documents from the top level domain are in a respective one of the multiple encodings;
determine a respective encoding score for each of the multiple encodings, each respective encoding score being based on the likelihood of the plurality of byte-n-grams occurring in the corresponding one of the multiple encodings and based on the likelihoods associated with the top-level-domain the corresponding one of the multiple encodings; and
identify a most likely encoding of the document based on a highest encoding score among the encoding scores.
Dependent claims: 7, 8, 9, 10
11. A system comprising:
means for receiving a sequence of bytes representing text in a document;
means for identifying a plurality of byte-n-grams occurring in the sequence of bytes, each byte-n-gram comprising n adjacent bytes from the sequence of bytes, where n is an integer greater than 1;
means for identifying, for multiple encodings, a respective likelihood of each byte-n-gram occurring in each of the respective multiple encodings;
means for receiving multiple likelihoods associated with a top-level-domain of the document, each of the multiple likelihoods indicating a likelihood that documents from the top level domain are in a respective one of the multiple encodings;
means for determining a respective encoding score for each of the multiple encodings, each respective encoding score being based on the likelihood of the plurality of byte-n-grams occurring in the corresponding one of the multiple encodings and based on the likelihoods associated with the top-level-domain the corresponding one of the multiple encodings; and
means for identifying a most likely encoding of the document based on a highest encoding score among the encoding scores.
Dependent claims: 12, 13, 14, 15
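The steps recited in the independent claims can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the claimed implementation: the claims say only that each encoding score is "based on" the byte-n-gram likelihoods and the top-level-domain likelihoods, so combining them as a sum of logarithms is an assumption of this sketch, as are the function names, the floor value for unseen n-grams, and the toy likelihood tables.

```python
import math

def byte_ngrams(data: bytes, n: int = 2):
    """Return every run of n adjacent bytes in `data` (n must exceed 1)."""
    if n < 2:
        raise ValueError("n must be an integer greater than 1")
    return [data[i:i + n] for i in range(len(data) - n + 1)]

def encoding_scores(data, ngram_likelihoods, tld_likelihoods, n=2, floor=1e-6):
    """Score each candidate encoding for the byte sequence `data`.

    ngram_likelihoods: encoding -> {byte-n-gram: likelihood in that encoding}
    tld_likelihoods:   encoding -> likelihood that documents from the
                       document's top-level domain use that encoding
    Assumption: score = log(TLD likelihood) + sum of log(n-gram likelihoods).
    """
    grams = byte_ngrams(data, n)
    scores = {}
    for enc, table in ngram_likelihoods.items():
        score = math.log(tld_likelihoods.get(enc, floor))
        for gram in grams:
            score += math.log(table.get(gram, floor))
        scores[enc] = score
    return scores

def most_likely_encoding(scores):
    """Pick the encoding with the highest score."""
    return max(scores, key=scores.get)

# Hypothetical toy tables; real tables would be estimated from a corpus.
NGRAM_LIKELIHOODS = {
    "UTF-8": {b"ca": 0.01, b"af": 0.01, b"f\xc3": 0.01, b"\xc3\xa9": 0.05},
    "ISO-8859-1": {b"ca": 0.01, b"af": 0.01, b"f\xc3": 0.0001, b"\xc3\xa9": 0.0001},
}
TLD_LIKELIHOODS = {"fr": {"UTF-8": 0.7, "ISO-8859-1": 0.3}}

scores = encoding_scores("café".encode("utf-8"), NGRAM_LIKELIHOODS, TLD_LIKELIHOODS["fr"])
print(most_likely_encoding(scores))  # UTF-8, for these toy numbers
```

When the byte evidence is inconclusive (for example, a short all-ASCII document), the top-level-domain term is the only part of the score that differs between encodings, so in this sketch it acts as a tie-breaker.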
Specification