Identifying language attributes through probabilistic analysis

  • US 7,386,438 B1
  • Filed: 08/04/2003
  • Issued: 06/10/2008
  • Est. Priority Date: 08/04/2003
  • Status: Active Grant
First Claim

1. A system for identifying language attributes through probabilistic analysis, comprising:

    a storage system adapted to store a set of language classes, which each identify a language and a character set encoding, and further adapted to store a plurality of training documents;

    an attribute modeler adapted to train an attribute model by evaluating occurrences of one or more document properties within the training documents and, for each language class, calculating a probability for a set of the one or more document properties, the probability conditioned on the occurrence of the language class, the trained attribute model stored in the storage;

    wherein the document properties comprise at least one of top level domain, HTTP content character set encoding and language header parameters, and HTML content character set encoding and language metatags;

    a text modeler adapted to train a text model by evaluating byte occurrences within the training documents and, for each language class, calculating a probability for a set of byte occurrences, the probability conditioned on the occurrence of the language class, the trained text model stored in the storage; and

    a training engine adapted to calculate an overall probability for at least one of the set of language classes by evaluating the probability for the document properties set based on the attribute model and the probability for the byte occurrences based on the text model.
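The elements of claim 1 resemble a naive Bayes classifier: per-class probabilities for document properties and for byte occurrences are learned from training documents, then combined into an overall score per language class. The following is a minimal sketch under stated assumptions, not the patented implementation. The function names, the property strings such as `tld=jp`, the use of single bytes rather than longer byte sequences, the add-one smoothing, and the class prior are all illustrative choices that the claim does not specify.

```python
import math
from collections import Counter, defaultdict


def train(docs, alpha=1.0):
    """Count property and byte occurrences per language class.

    docs: iterable of (language_class, properties, body_bytes), where
    language_class is a (language, charset) tuple (hypothetical layout).
    """
    class_counts = Counter()
    prop_counts = defaultdict(Counter)  # attribute model: class -> property -> count
    byte_counts = defaultdict(Counter)  # text model: class -> byte value -> count
    for cls, props, body in docs:
        class_counts[cls] += 1
        prop_counts[cls].update(props)
        byte_counts[cls].update(body)   # iterating bytes yields ints 0..255
    return class_counts, prop_counts, byte_counts, alpha


def _log_prob(counter, key, vocab_size, alpha):
    # Smoothed estimate of P(key | class); smoothing is an assumption here.
    total = sum(counter.values())
    return math.log((counter[key] + alpha) / (total + alpha * vocab_size))


def classify(model, props, body):
    """Return the best-scoring language class and all log scores."""
    class_counts, prop_counts, byte_counts, alpha = model
    n_docs = sum(class_counts.values())
    # Number of distinct properties seen in training, plus one slot for unseen ones.
    prop_vocab = len({p for c in prop_counts.values() for p in c}) + 1
    scores = {}
    for cls, n in class_counts.items():
        score = math.log(n / n_docs)  # class prior (assumed, not in the claim)
        score += sum(_log_prob(prop_counts[cls], p, prop_vocab, alpha) for p in props)
        score += sum(_log_prob(byte_counts[cls], b, 256, alpha) for b in body)
        scores[cls] = score
    return max(scores, key=scores.get), scores


training_docs = [
    (("ja", "Shift_JIS"), ["tld=jp", "http_charset=shift_jis"],
     "こんにちは世界".encode("shift_jis")),
    (("en", "US-ASCII"), ["tld=com", "http_charset=us-ascii"],
     b"hello world"),
]
model = train(training_docs)
best_ja, _ = classify(model, ["tld=jp"], "世界".encode("shift_jis"))
best_en, _ = classify(model, ["tld=com"], b"hello")
```

Summing log probabilities rather than multiplying raw probabilities avoids numeric underflow on long documents. The two-document training set is only there to make the sketch runnable; a realistic model would need many documents per language class.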

  • 2 Assignments