Identifying language attributes through probabilistic analysis

US 7,386,438 B1
Filed: 08/04/2003
Issued: 06/10/2008
Est. Priority Date: 08/04/2003
Status: Active Grant

Abstract

A system and method for identifying language attributes through probabilistic analysis is described. A set of language classes and a plurality of training documents are defined. Each language class identifies a language and a character set encoding. Occurrences of one or more document properties within each training document are evaluated. For each language class, a probability for the document properties set conditioned on the occurrence of the language class is calculated. Byte occurrences within each training document are evaluated. For each language class, a probability for the byte occurrences conditioned on the occurrence of the language class is calculated.

27 Claims

1. A system for identifying language attributes through probabilistic analysis, comprising:
- a storage system adapted to store a set of language classes, which each identify a language and a character set encoding, and further adapted to store a plurality of training documents;
  
  an attribute modeler adapted to train an attribute model by evaluating occurrences of one or more document properties within the training documents and, for each language class, calculating a probability for a set of the one or more document properties, the probability conditioned on the occurrence of the language class, the trained attribute model stored in the storage;
  
  wherein the document properties comprise at least one of top level domain, HTTP content character set encoding and language header parameters, and HTML content character set encoding and language metatags;
  
  a text modeler adapted to train a text model by evaluating byte occurrences within the training documents and, for each language class, calculating a probability for a set of byte occurrences, the probability conditioned on the occurrence of the language class, the trained text model stored in the storage; and
  
  a training engine adapted to calculate an overall probability for at least one of the set of language classes by evaluating the probability for the document properties set based on the attribute model and the probability for the byte occurrences based on the text model.
- - 2. A system according to claim 1, further comprising:
    - an assignment module adapted to assign the overall probability for a language class in accordance with the formula:
      
      $\underset{cls}{\arg \max} P (text | cls) \cdot P (props | cls) \cdot P (cls)$ where cls is the language class, text is the byte occurrences set, props are the document properties, P(text|cls) is the probability for the byte occurrences, and P(props|cls) is the probability for the document properties set.
  - 3. A system according to claim 1, wherein the document properties comprise at least one of top level domain, HTTP content character set encoding and language header parameters, and HTML content character set encoding and language metatags.
  - 4. A system according to claim 3, further comprising:
    - an assignment module adapted to assign the probability for the document properties set based on the attribute model in accordance with the formula:
      
      $P (tld, enc | cls) \cdot P (cls)$ where tld is the top level domain, enc is the character set encoding and cls is the language class.
  - 5. A system according to claim 1, further comprising:
    - a counting module adapted to count byte co-occurrences within a training document, and determine the probability for the byte occurrences based on the byte co-occurrences.
  - 6. A system according to claim 5, wherein the byte co-occurrences comprise a set of trigrams, further comprising:
    - a probability module adapted to calculate a probability of a trigram as the number of occurrences of the trigram divided by the total number of trigram occurrences in the training documents for a language class.
  - 7. A system according to claim 6, further comprising:
    - an assignment module adapted to assign the probability for the byte occurrences set based on the text model in accordance with the formula:
      
      $P (text | cls)$ where text is the set of trigrams and cls is the language class.
  - 8. A system according to claim 1, further comprising:
    - a training engine adapted to perform iterative training by providing the probability for the document properties set and the probability for the byte occurrences set respectively to the evaluation of byte occurrences and assignment of the set of language classes.
  - 9. A system according to claim 1, further comprising:
    - a back off module adapted to evaluate less frequently occurring document properties by calculating a probability for a less frequently occurring document property conditioned on the occurrence of the language class.
  - 10. A system according to claim 1, wherein at least one training document comprises one of a Web page and a news message.
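
As an illustration only, the trigram estimate of claim 6 and the arg max of claim 2 can be sketched in Python. The function names, the `FLOOR` value for unseen trigrams, and the treatment of P(text|cls) as a product of independent trigram probabilities are assumptions made for this sketch; the claims do not specify them.

```python
import math
from collections import Counter

FLOOR = 1e-6  # assumption: fixed floor probability for trigrams never seen in training


def byte_trigrams(data: bytes):
    """Yield every overlapping 3-byte window of the input."""
    for i in range(len(data) - 2):
        yield data[i:i + 3]


def train_text_model(docs_by_class):
    """Estimate P(trigram | cls) as the trigram's count divided by the
    total number of trigram occurrences in that class's training documents."""
    model = {}
    for cls, docs in docs_by_class.items():
        counts = Counter(t for doc in docs for t in byte_trigrams(doc))
        total = sum(counts.values())
        model[cls] = {t: n / total for t, n in counts.items()}
    return model


def log_p_text(text: bytes, cls, model) -> float:
    """log P(text | cls), assuming independent trigrams (a simplification)."""
    probs = model[cls]
    return sum(math.log(probs.get(t, FLOOR)) for t in byte_trigrams(text))


def best_class(text: bytes, log_p_props, log_prior, model):
    """arg max over cls of P(text|cls) * P(props|cls) * P(cls), computed in log space."""
    return max(
        model,
        key=lambda cls: log_p_text(text, cls, model) + log_p_props[cls] + log_prior[cls],
    )


# Hypothetical language classes: (language, character set encoding) pairs.
EN = ("English", "US-ASCII")
DE = ("German", "ISO-8859-1")
model = train_text_model({
    EN: [b"the quick brown fox the lazy dog"],
    DE: [b"der schnelle braune fuchs und der hund"],
})
uniform = {EN: 0.0, DE: 0.0}  # equal log P(props|cls) and log P(cls) for the demo
print(best_class(b"the dog", uniform, uniform, model))
```

Working in log space avoids underflow when many small trigram probabilities are multiplied, and does not change which class attains the maximum.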

11. A method for identifying language attributes through probabilistic analysis, comprising:
- defining a set of language classes, which each identify a language and a character set encoding, and a plurality of training documents;
  
  evaluating occurrences of one or more document properties within the training documents and, for each language class, calculating a probability for the document properties set conditioned on the occurrence of the language class by an attribute model wherein the document properties comprise at least one of top level domain, HTTP content character set encoding and language header parameters, and HTML content character set encoding and language metatags;
  
  evaluating byte occurrences within the training documents and, for each language class, calculating a probability for the byte occurrences conditioned on the occurrence of the language class by a text model; and
  
  calculating an overall probability for ones of the set of language classes by evaluating the probability for the document properties set by the attribute model and the probability for the byte occurrences by the text model.
- - 12. A method according to claim 11, further comprising:
    - assigning the overall probability for a language class in accordance with the formula:
      
      $\underset{cls}{\arg \max} P (text | cls) \cdot P (props | cls) \cdot P (cls)$ where cls is the language class, text is the byte occurrences set, props are the document properties, P(text|cls) is the probability for the byte occurrences, and P(props|cls) is the probability for the document properties set.
  - 13. A method according to claim 11, wherein the document properties comprise at least one of top level domain, HTTP content character set encoding and language header parameters, and HTML content character set encoding and language metatags.
  - 14. A method according to claim 13, further comprising:
    - assigning the probability for the document properties set based on the attribute model in accordance with the formula:
      
      $P (tld, enc | cls) \cdot P (cls)$ where tld is the top level domain, enc is the character set encoding and cls is the language class.
  - 15. A method according to claim 11, further comprising:
    - counting byte co-occurrences within a training document; and
      
      determining the probability for the byte occurrences based on the byte co-occurrences.
  - 16. A method according to claim 15, wherein the byte co-occurrences comprise a set of trigrams, further comprising:
    - calculating a probability of a trigram as the number of occurrences of the trigram divided by the total number of trigram occurrences in the training documents for a language class.
  - 17. A method according to claim 16, further comprising:
    - assigning the probability for the byte occurrences set based on the text model in accordance with the formula:
      
      $P (text | cls)$ where text is the set of trigrams and cls is the language class.
  - 18. A method according to claim 11, further comprising:
    - performing iterative training by providing the probability for the document properties set and the probability for the byte occurrences set respectively to the evaluation of byte occurrences and assignment of the set of language classes.
  - 19. A method according to claim 11, further comprising:
    - evaluating less frequently occurring document properties by calculating a probability for a less frequently occurring document property conditioned on the occurrence of the language class.
  - 20. A method according to claim 11, wherein at least one training document comprises one of a Web page and a news message.
  - 21. A computer-readable storage medium holding code for performing the method according to claim 11.
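
For illustration, the attribute model of claim 14 and a back-off in the spirit of claim 19 might look like the following Python sketch. The claims do not state how the back-off probability is formed; the choice here (falling back to P(tld|cls)·P(enc|cls) when a pair is seen fewer than `min_count` times) and all names, thresholds, and the `floor` value are assumptions.

```python
from collections import Counter, defaultdict


def train_attribute_model(labeled_docs):
    """Count (tld, enc) observations per language class.

    labeled_docs: iterable of (tld, enc, cls) tuples taken from
    labeled training documents.
    """
    pair = defaultdict(Counter)   # cls -> Counter of (tld, enc) pairs
    tld = defaultdict(Counter)    # cls -> Counter of top level domains
    enc = defaultdict(Counter)    # cls -> Counter of declared encodings
    total = Counter()             # cls -> number of training documents
    for t, e, cls in labeled_docs:
        pair[cls][(t, e)] += 1
        tld[cls][t] += 1
        enc[cls][e] += 1
        total[cls] += 1
    return {"pair": pair, "tld": tld, "enc": enc, "total": total}


def p_props(model, t, e, cls, min_count=2, floor=1e-6):
    """Estimate P(tld, enc | cls), backing off for rarely seen pairs."""
    n = model["total"][cls]
    if n == 0:
        return floor  # class never seen in training
    c = model["pair"][cls][(t, e)]
    if c >= min_count:
        return c / n  # enough evidence: use the joint relative frequency
    # Assumed back-off: treat tld and enc as independent given the class.
    p_t = model["tld"][cls][t] / n
    p_e = model["enc"][cls][e] / n
    return max(p_t * p_e, floor)


# Hypothetical training observations for one language class.
DE = ("German", "ISO-8859-1")
attr_model = train_attribute_model([
    ("de", "iso-8859-1", DE),
    ("de", "iso-8859-1", DE),
    ("de", "utf-8", DE),
    ("at", "iso-8859-1", DE),
])
print(p_props(attr_model, "de", "iso-8859-1", DE))  # joint estimate
print(p_props(attr_model, "at", "utf-8", DE))       # backed-off estimate
```

The back-off lets a pairing that never appeared in training, such as `at` with `utf-8` above, still receive a nonzero probability from its individually observed components.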

22. A system for identifying documents by language using probabilistic analysis of language attributes, comprising:
- a set of language classes, each language class comprising a language name and a character set encoding name;
  
  a training corpora comprising a plurality of training documents;
  
  an attribute modeler adapted to train an attribute model by evaluating a top level domain and character set encoding associated with the training documents and, for each language class, calculating a probability for each such top level domain and character set encoding conditioned on the occurrence of the each language class wherein the document properties comprise at least one of top level domain, HTTP content character set encoding and language header parameters, and HTML content character set encoding and language metatags;
  
  a text modeler adapted to train a text model by evaluating co-occurrences of a plurality of bytes within the training documents and, for each language class, calculating a probability for the byte co-occurrences conditioned on the occurrence of the each language class; and
  
  a training engine adapted to calculate an overall probability for ones of the set of language classes by evaluating the probability for the top level domain and character set encoding based on the attribute model and the probability for the byte occurrences based on the text model.
- - 23. A system according to claim 22, further comprising:
    - a plurality of unlabeled documents; and
      
      a classifier classifying one or more unlabeled documents by at least one language class, comprising:
      
      an attribute evaluator determining document properties within the documents and initializing language class probability to each document from the attribute model;
      
      a text evaluator evaluating byte occurrences in the documents and updating the language class probability of the each document from the text model;
      
      a pruner pruning at least one language class falling below a predetermined probability threshold; and
      
      an assignment module assigning at least one language class based on the language class probability of each document.
  - 26. A computer-readable storage medium holding code for performing the method according to claim 22.

24. A method for identifying documents by language using probabilistic analysis of language attributes, comprising:
- defining a set of language classes, each language class comprising a language name and a character set encoding name;
  
  assembling a training corpora comprising a plurality of training documents;
  
  training an attribute model by evaluating a top level domain and character set encoding associated with each training document and, for each language class, calculating a probability for each such top level domain and character set encoding conditioned on the occurrence of the each language class;
  
  training a text model by evaluating co-occurrences of a plurality of bytes within each training document and, for each language class, calculating a probability for the byte co-occurrences conditioned on the occurrence of the each language class; and
  
  calculating an overall probability for ones of the set of language classes by evaluating the probability for the top level domain and character set encoding based on the attribute model and the probability for the byte occurrences based on the text model.
- - 25. A method according to claim 24, further comprising:
    - accessing a plurality of unlabeled documents; and
      
      classifying one or more unlabeled documents by at least one language class, comprising:
      
      determining document properties within the documents and initializing language class probability to each document from the attribute model;
      
      evaluating byte occurrences in each document and updating the language class probability of the document from the text model;
      
      pruning at least one language class falling below a predetermined probability threshold; and
      
      assigning at least one language class based on the language class probability of the document.
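
The classification steps recited in claims 23 and 25 (initialize from the attribute model, update from the text model, prune, assign) can be sketched as below. This is a minimal sketch under stated assumptions: scores are kept as log probabilities, pruning drops any class whose score trails the current best by more than a hypothetical `prune_margin`, and unseen trigrams receive a hypothetical `floor` probability. The claims only require a predetermined probability threshold and do not fix these details.

```python
import math


def classify(text: bytes, init_log_probs, trigram_probs,
             prune_margin=20.0, floor=1e-6):
    """Return (best_class, surviving_scores) for one unlabeled document.

    init_log_probs: {cls: log(P(props|cls) * P(cls))} from the attribute model.
    trigram_probs:  {cls: {trigram_bytes: probability}} from the text model.
    """
    # Step 1: initialize each class score from the attribute model.
    scores = dict(init_log_probs)
    # Step 2: update the scores one byte trigram at a time.
    for i in range(len(text) - 2):
        tri = text[i:i + 3]
        for cls in scores:
            scores[cls] += math.log(trigram_probs[cls].get(tri, floor))
        # Step 3: prune classes that have fallen too far behind the leader.
        best = max(scores.values())
        scores = {c: s for c, s in scores.items() if best - s <= prune_margin}
    # Step 4: assign the highest-scoring surviving class.
    return max(scores, key=scores.get), scores


# Hypothetical toy models for two classes.
trigram_probs = {
    "en": {b"the": 0.5, b"he ": 0.5},
    "de": {b"der": 0.5, b"er ": 0.5},
}
init = {"en": math.log(0.5), "de": math.log(0.5)}
label, remaining = classify(b"the the", init, trigram_probs)
print(label, sorted(remaining))
```

Pruning early means later trigrams are scored against fewer candidate classes, which reduces work on long documents at the cost of possibly discarding a class that would have recovered.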

27. An apparatus for identifying documents by language using probabilistic analysis of language attributes, comprising:
- means for defining a set of language classes, each language class comprising a language name and a character set encoding name;
  
  means for training an attribute model by assigning at least one top level domain and character set encoding pairing to at least one language class for each of a plurality of training documents and calculating a probability for each such top level domain and character set encoding pairing conditioned on the occurrence of the assigned language class;
  
  means for training a text model by evaluating co-occurrences of a plurality of bytes within each training document and, for each language class, calculating a probability for the byte co-occurrences conditioned on the occurrence of the language class based on the attribute model; and
  
  means for calculating an overall probability for ones of the set of language classes by evaluating the probability for the top level domain and character set encoding based on the attribute model and the probability for the byte occurrences based on the text model.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Jackson, Eric; Zhou, Jenny; Diament, Benjamin; Franz, Alexander; Milch, Brian
Primary Examiner(s): Hudspeth, David
Assistant Examiner(s): Hernandez, Josiah J
Application Number: US10/634,616
Time in Patent Office: 1,772 Days
Field of Search: 704/5, 704/251, 704/1, 704/9, 704/8, 704/6, 704/2, 704/7, 704/257, 704/240, 704/233, 707/6, 707/3, 707/5, 707/102, 707/2, 709/246, 709/224
US Class Current: 704/8
CPC Class Codes: G06F 40/263 Language identification
