Identifying language attributes through probabilistic analysis
First Claim
1. A system for identifying language attributes through probabilistic analysis, comprising:
a storage system adapted to store a set of language classes, which each identify a language and a character set encoding, and further adapted to store a plurality of training documents;
an attribute modeler adapted to train an attribute model by evaluating occurrences of one or more document properties within the training documents and, for each language class, calculating a probability for a set of the one or more document properties, the probability conditioned on the occurrence of the language class, the trained attribute model stored in the storage;
wherein the document properties comprise at least one of top level domain, HTTP content character set encoding and language header parameters, and HTML content character set encoding and language metatags;
a text modeler adapted to train a text model by evaluating byte occurrences within the training documents and, for each language class, calculating a probability for a set of byte occurrences, the probability conditioned on the occurrence of the language class, the trained text model stored in the storage; and
a training engine adapted to calculate an overall probability for at least one of the set of language classes by evaluating the probability for the document properties set based on the attribute model and the probability for the byte occurrences based on the text model.
Abstract
A system and method for identifying language attributes through probabilistic analysis is described. A set of language classes and a plurality of training documents are defined. Each language class identifies a language and a character set encoding. Occurrences of one or more document properties within each training document are evaluated. For each language class, a probability for the document properties set conditioned on the occurrence of the language class is calculated. Byte occurrences within each training document are evaluated. For each language class, a probability for the byte occurrences conditioned on the occurrence of the language class is calculated.
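The attribute-model step described in the abstract can be sketched as a count-based estimate of the probability of a document-properties set given a language class. This is only an illustrative sketch: the function names, the representation of a language class as a (language, encoding) tuple, and the additive smoothing parameter `alpha` are assumptions that do not come from the patent text.

```python
import math
from collections import Counter, defaultdict


def train_attribute_model(training_docs, alpha=1.0):
    """Estimate log P(properties | language_class) from labeled documents.

    training_docs: iterable of (language_class, properties) pairs, where
    properties is a hashable tuple such as (top_level_domain, charset).
    alpha is an assumed additive-smoothing constant (not from the source).
    """
    counts = defaultdict(Counter)
    vocabulary = set()
    for language_class, properties in training_docs:
        counts[language_class][properties] += 1
        vocabulary.add(properties)

    model = {}
    for language_class, prop_counts in counts.items():
        total = sum(prop_counts.values())
        # One extra slot is reserved for property sets never seen in training.
        denom = total + alpha * (len(vocabulary) + 1)
        model[language_class] = {
            "seen": {p: math.log((c + alpha) / denom) for p, c in prop_counts.items()},
            "unseen": math.log(alpha / denom),
        }
    return model


def attribute_log_prob(model, language_class, properties):
    """Look up the smoothed log-probability of a properties set for one class."""
    entry = model[language_class]
    return entry["seen"].get(properties, entry["unseen"])


# Hypothetical training data: language class = (language, encoding).
JA = ("ja", "Shift_JIS")
EN = ("en", "ISO-8859-1")
docs = [
    (JA, ("jp", "shift_jis")),
    (JA, ("jp", "shift_jis")),
    (JA, ("com", "shift_jis")),
    (EN, ("com", "iso-8859-1")),
]
attribute_model = train_attribute_model(docs)
```

Working in log space keeps later multiplication with the text-model probability numerically stable; the smoothing ensures a property set absent from one class still receives a small nonzero probability.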
27 Claims
1. A system for identifying language attributes through probabilistic analysis, comprising:
a storage system adapted to store a set of language classes, which each identify a language and a character set encoding, and further adapted to store a plurality of training documents;
an attribute modeler adapted to train an attribute model by evaluating occurrences of one or more document properties within the training documents and, for each language class, calculating a probability for a set of the one or more document properties, the probability conditioned on the occurrence of the language class, the trained attribute model stored in the storage;
wherein the document properties comprise at least one of top level domain, HTTP content character set encoding and language header parameters, and HTML content character set encoding and language metatags;
a text modeler adapted to train a text model by evaluating byte occurrences within the training documents and, for each language class, calculating a probability for a set of byte occurrences, the probability conditioned on the occurrence of the language class, the trained text model stored in the storage; and
a training engine adapted to calculate an overall probability for at least one of the set of language classes by evaluating the probability for the document properties set based on the attribute model and the probability for the byte occurrences based on the text model.
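The document properties named in claim 1 (top level domain, HTTP header character set, HTML metatag character set) could be collected along the following lines. This is a rough sketch under stated assumptions: the helper names are hypothetical, the regular expressions are simplified and do not implement full HTTP or HTML parsing, and the claims do not specify how the properties are extracted.

```python
import re
from urllib.parse import urlsplit


def top_level_domain(url):
    """Return the last label of the URL's host name, e.g. 'jp' (hypothetical helper)."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower() if "." in host else ""


def header_charset(content_type):
    """Pull the charset parameter out of an HTTP Content-Type header value."""
    match = re.search(r"charset\s*=\s*\"?([\w.:-]+)", content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None


# Simplified pattern: finds a charset declaration inside the first matching <meta> tag.
_META_CHARSET = re.compile(rb"<meta[^>]*charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def meta_charset(html_bytes):
    """Pull the charset from an HTML metatag, working on raw bytes."""
    match = _META_CHARSET.search(html_bytes)
    return match.group(1).decode("ascii").lower() if match else None


def document_properties(url, content_type, html_bytes):
    """Bundle the extracted properties into one hashable tuple."""
    return (top_level_domain(url), header_charset(content_type), meta_charset(html_bytes))
```

The metatag search operates on raw bytes because the encoding of the document is exactly what is unknown at this point; charset names themselves are ASCII, so they can be decoded safely once matched.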
11. A method for identifying language attributes through probabilistic analysis, comprising:
defining a set of language classes, which each identify a language and a character set encoding, and a plurality of training documents;
evaluating occurrences of one or more document properties within the training documents and, for each language class, calculating a probability for the document properties set conditioned on the occurrence of the language class by an attribute model wherein the document properties comprise at least one of top level domain, HTTP content character set encoding and language header parameters, and HTML content character set encoding and language metatags;
evaluating byte occurrences within the training documents and, for each language class, calculating a probability for the byte occurrences conditioned on the occurrence of the language class by a text model; and
calculating an overall probability for ones of the set of language classes by evaluating the probability for the document properties set by the attribute model and the probability for the byte occurrences by the text model.
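The final step of the method, calculating an overall probability from the two models, might be sketched as follows. The claim does not state how the two probabilities are combined; this sketch assumes a naive-Bayes-style product (a sum in log space) with an optional class prior, normalized across classes. The function and parameter names are hypothetical.

```python
import math


def overall_probabilities(attr_log_probs, text_log_probs, log_priors=None):
    """Combine per-class log-probabilities from the two models.

    attr_log_probs / text_log_probs: dicts mapping language class to
    log P(properties | class) and log P(bytes | class) respectively.
    log_priors: optional dict of log P(class); a uniform prior is assumed
    when omitted. Independence of the two evidence sources given the
    class is an assumption of this sketch, not a statement of the claim.
    """
    classes = attr_log_probs.keys() & text_log_probs.keys()
    scores = {}
    for language_class in classes:
        prior = log_priors[language_class] if log_priors else 0.0
        scores[language_class] = (
            prior + attr_log_probs[language_class] + text_log_probs[language_class]
        )

    # Log-sum-exp normalization avoids underflow for very small probabilities.
    peak = max(scores.values())
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores.values()))
    return {c: math.exp(s - log_norm) for c, s in scores.items()}


# Hypothetical per-class values for a single document.
example = overall_probabilities(
    {"ja/Shift_JIS": math.log(0.6), "en/ISO-8859-1": math.log(0.2)},
    {"ja/Shift_JIS": math.log(0.5), "en/ISO-8859-1": math.log(0.5)},
)
```

Here the text model is indifferent between the two classes, so the attribute evidence alone decides the outcome, giving 0.75 for the first class and 0.25 for the second.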
22. A system for identifying documents by language using probabilistic analysis of language attributes, comprising:
a set of language classes, each language class comprising a language name and a character set encoding name;
a training corpora comprising a plurality of training documents;
an attribute modeler adapted to train an attribute model by evaluating a top level domain and character set encoding associated with the training documents and, for each language class, calculating a probability for each such top level domain and character set encoding conditioned on the occurrence of the each language class wherein the document properties comprise at least one of top level domain, HTTP content character set encoding and language header parameters, and HTML content character set encoding and language metatags;
a text modeler adapted to train a text model by evaluating co-occurrences of a plurality of bytes within the training documents and, for each language class, calculating a probability for the byte co-occurrences conditioned on the occurrence of the each language class; and
a training engine adapted to calculate an overall probability for ones of the set of language classes by evaluating the probability for the top level domain and character set encoding based on the attribute model and the probability for the byte occurrences based on the text model.
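The text modeler of claim 22 evaluates co-occurrences of a plurality of bytes. One possible reading, assumed here purely for illustration, is counting adjacent byte pairs (bigrams) per language class and scoring a document by the smoothed probability of its bigrams. The function names and the smoothing constant `alpha` are hypothetical and not taken from the claim.

```python
import math
from collections import Counter


def byte_bigrams(data: bytes):
    """Yield adjacent byte pairs; one assumed interpretation of 'co-occurrence'."""
    return zip(data, data[1:])


def train_text_model(training_docs):
    """Count byte bigrams per language class.

    training_docs: iterable of (language_class, raw_bytes) pairs.
    Returns a dict mapping each class to (bigram_counts, total_bigrams).
    """
    counts = {}
    for language_class, data in training_docs:
        counts.setdefault(language_class, Counter()).update(byte_bigrams(data))
    return {c: (bc, sum(bc.values())) for c, bc in counts.items()}


def text_log_prob(model, language_class, data, alpha=1.0):
    """Smoothed log P(byte bigrams | language_class) for one document."""
    bigram_counts, total = model[language_class]
    # 256 * 256 possible byte pairs share the additive smoothing mass.
    denom = total + alpha * 256 * 256
    return sum(
        math.log((bigram_counts[pair] + alpha) / denom)
        for pair in byte_bigrams(data)
    )


# Tiny hypothetical corpus; real training sets would be far larger.
text_model = train_text_model([("en", b"the then"), ("xx", b"zzzz")])
```

Operating on raw bytes rather than decoded characters means the same model separates both the language and the character set encoding, since the same text produces different byte pairs under different encodings.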
24. A method for identifying documents by language using probabilistic analysis of language attributes, comprising:
defining a set of language classes, each language class comprising a language name and a character set encoding name;
assembling a training corpora comprising a plurality of training documents;
training an attribute model by evaluating a top level domain and character set encoding associated with each training document and, for each language class, calculating a probability for each such top level domain and character set encoding conditioned on the occurrence of the each language class;
training a text model by evaluating co-occurrences of a plurality of bytes within each training document and, for each language class, calculating a probability for the byte co-occurrences conditioned on the occurrence of the each language class; and
calculating an overall probability for ones of the set of language classes by evaluating the probability for the top level domain and character set encoding based on the attribute model and the probability for the byte occurrences based on the text model.
27. An apparatus for identifying documents by language using probabilistic analysis of language attributes, comprising:
means for defining a set of language classes, each language class comprising a language name and a character set encoding name;
means for training an attribute model by assigning at least one top level domain and character set encoding pairing to at least one language class for each of a plurality of training documents and calculating a probability for each such top level domain and character set encoding pairing conditioned on the occurrence of the assigned language class;
means for training a text model by evaluating co-occurrences of a plurality of bytes within each training document and, for each language class, calculating a probability for the byte co-occurrences conditioned on the occurrence of the language class based on the attribute model; and
means for calculating an overall probability for ones of the set of language classes by evaluating the probability for the top level domain and character set encoding based on the attribute model and the probability for the byte occurrences based on the text model.
Specification