AUTOMATED IDENTIFICATION OF DOCUMENTS AS NOT BELONGING TO ANY LANGUAGE
First Claim
1. A method for identifying documents as not belonging to any language in a plurality of candidate languages wherein each candidate language has an associated language model, the method comprising:
for each language in a plurality of candidate languages, computing a document score for a test document using the language model of that language;
selecting a most likely language for the test document from the plurality of candidate languages based on the respective document scores for each language in the plurality of candidate languages;
accessing an impostor profile for the most likely language, wherein the impostor profile for the most likely language includes a parameter set consisting of values characterizing a score distribution expected for documents in the most likely language when scored using the respective language models of one or more impostor languages in an impostor set associated with the most likely language;
comparing the document scores for the one or more impostor languages in the impostor set associated with the most likely language to the impostor profile for the most likely language;
determining whether the test document is in the most likely language or in no language based at least in part on a result of comparing the document scores; and
storing, in a computer readable storage medium, language information for the test document, the language information including a result of the determination.
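The steps recited above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the claim does not say which values make up the impostor profile or how the comparison is performed, so the sketch assumes a mean and standard deviation per impostor language and a fixed tolerance `max_z`. The names `identify_language`, `Scorer`, `Stats`, and `max_z` are hypothetical, and higher scores are assumed to indicate a better fit.

```python
from typing import Callable, Dict, Optional

Scorer = Callable[[str], float]   # hypothetical: maps text to a document score
Stats = Dict[str, float]          # hypothetical: {"mean": ..., "std": ...}


def identify_language(
    text: str,
    models: Dict[str, Scorer],
    impostor_profiles: Dict[str, Dict[str, Stats]],
    max_z: float = 3.0,           # assumed tolerance, not taken from the claims
) -> Dict[str, object]:
    # Score the test document under every candidate language model.
    scores = {lang: score(text) for lang, score in models.items()}

    # Select the most likely language (assumes a higher score is a better fit).
    most_likely = max(scores, key=scores.get)

    # Access the impostor profile associated with the most likely language.
    profile = impostor_profiles[most_likely]

    # Compare each impostor score with the distribution the profile expects.
    consistent = all(
        abs(scores[imp] - stats["mean"]) <= max_z * stats["std"]
        for imp, stats in profile.items()
    )

    # Decide between the most likely language and "no language" (None).
    language: Optional[str] = most_likely if consistent else None

    # Return a record that a caller could write to storage.
    return {"language": language, "most_likely": most_likely, "scores": scores}
```

In this sketch a document is reported as being in no language when any impostor score falls outside the assumed tolerance; other decision rules would fit the claim language equally well. The storing step is left to the caller.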
Abstract
An “impostor profile” for a language is used to determine whether documents are in that language or no language. The impostor profile for a given language provides statistical information about the expected results of applying a language model for one or more other (“impostor”) languages to a document that is in fact in the given language. After a most likely language for a test document is identified, the impostor profile is used together with the scores for the test document in the various impostor languages to determine whether to identify the test document as being in the most likely language or in no language.
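One way such statistical information might be gathered is sketched below, assuming the profile is simply the mean and sample standard deviation of the scores that each impostor model assigns to training documents known to be in the given language. The abstract does not specify these statistics; the function name `build_impostor_profile` and the dictionary layout are hypothetical.

```python
from statistics import mean, stdev
from typing import Callable, Dict, List


def build_impostor_profile(
    training_docs: List[str],
    impostor_models: Dict[str, Callable[[str], float]],
) -> Dict[str, Dict[str, float]]:
    """Summarize how impostor models score documents of one known language.

    training_docs are assumed to be in the given language; at least two
    documents are needed because the sample standard deviation is used.
    """
    profile = {}
    for impostor, score in impostor_models.items():
        values = [score(doc) for doc in training_docs]
        profile[impostor] = {"mean": mean(values), "std": stdev(values)}
    return profile
```

A test document whose impostor scores sit far from these summaries behaves unlike genuine documents in the given language, which is the signal the abstract describes.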
32 Claims
1. A method for identifying documents as not belonging to any language in a plurality of candidate languages wherein each candidate language has an associated language model, the method comprising:
for each language in a plurality of candidate languages, computing a document score for a test document using the language model of that language;
selecting a most likely language for the test document from the plurality of candidate languages based on the respective document scores for each language in the plurality of candidate languages;
accessing an impostor profile for the most likely language, wherein the impostor profile for the most likely language includes a parameter set consisting of values characterizing a score distribution expected for documents in the most likely language when scored using the respective language models of one or more impostor languages in an impostor set associated with the most likely language;
comparing the document scores for the one or more impostor languages in the impostor set associated with the most likely language to the impostor profile for the most likely language;
determining whether the test document is in the most likely language or in no language based at least in part on a result of comparing the document scores; and
storing, in a computer readable storage medium, language information for the test document, the language information including a result of the determination.
Dependent claims: 2-13.
14. A computer program product comprising a computer readable storage medium encoded with program code usable to control operation of a computer system, the program code including:
program code for computing, for each language in a plurality of candidate languages, a document score for a test document using a language model associated with that language;
program code for selecting a most likely language from the plurality of candidate languages based on the document scores for each language;
program code for determining whether the test document is in the most likely language or in no language, wherein the determination is based at least in part on comparing the document scores for one or more impostor languages in an impostor set associated with the most likely language to an impostor profile for the most likely language, wherein the impostor profile for the most likely language includes a parameter set consisting of values characterizing a score distribution expected for documents in the most likely language when scored using the respective language models of the one or more impostor languages in the impostor set associated with the most likely language; and
program code for storing, in a computer readable storage medium, language information for the test document, the language information including a result of the determination.
Dependent claims: 15-24.
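The claim leaves the form of the language model open. Purely as an illustration of what "computing a document score using a language model" could look like, the sketch below trains a toy character-bigram model with add-one smoothing and scores text by its average log-probability per bigram. The model type, the smoothing scheme, and the name `train_bigram_model` are all assumptions made for this example.

```python
import math
from collections import Counter
from typing import Callable


def train_bigram_model(corpus: str) -> Callable[[str], float]:
    """Return a scoring function backed by character-bigram counts."""
    counts = Counter(zip(corpus, corpus[1:]))
    total = sum(counts.values())
    # One extra slot stands in for all bigrams never seen in training.
    vocab = len(counts) + 1

    def score(text: str) -> float:
        pairs = list(zip(text, text[1:]))
        if not pairs:
            return float("-inf")  # too short to score
        log_prob = sum(
            math.log((counts[p] + 1) / (total + vocab)) for p in pairs
        )
        # Averaging keeps scores comparable across document lengths.
        return log_prob / len(pairs)

    return score
```

Under this toy model, text that reuses bigrams seen in training scores higher than text made of unseen bigrams, so one such function per candidate language yields the per-language document scores the claim refers to.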
25. A computer system comprising:
a language model data store configured to store a plurality of language models corresponding to a plurality of languages, each language model including information usable to determine a score reflecting a probability that a document is in the language corresponding to that language model, the language model data store being further configured to store an impostor profile associated with each of the plurality of languages, wherein the impostor profile for each of the plurality of languages includes a parameter set consisting of values characterizing a score distribution expected for documents in that language when scored using the respective language models of one or more impostor languages in an impostor set associated with that language; and
control logic coupled to the language model data store configured to compute, for at least some of the plurality of languages, a document score for a test document, the document score being computed based on at least some of the language models stored in the language model data store, and to select a most likely language for the test document based on the computed document scores, wherein document scores are also computed for the impostor languages in the impostor set associated with the most likely language, the control logic being further configured to compare the document scores computed for the impostor languages in the impostor set associated with the most likely language to the impostor profile for the most likely language and to determine whether the test document is in the most likely language or in no language based at least in part on a result of comparing the document scores.
Dependent claims: 26-32.
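A data store of the kind recited in claim 25 might be laid out as sketched below, with one entry per language holding both its language model and its impostor profile. The layout is an assumption made for illustration: the class names `LanguageEntry` and `LanguageModelStore`, the `(mean, std)` tuple used as the parameter set, and the consistency check are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[str], float]            # hypothetical: text -> score
Profile = Dict[str, Tuple[float, float]]   # impostor language -> (mean, std)


@dataclass
class LanguageEntry:
    model: Scorer
    impostor_profile: Profile


@dataclass
class LanguageModelStore:
    entries: Dict[str, LanguageEntry] = field(default_factory=dict)

    def add(self, lang: str, model: Scorer, impostor_profile: Profile) -> None:
        self.entries[lang] = LanguageEntry(model, impostor_profile)

    def impostor_set(self, lang: str) -> List[str]:
        # The impostor set is taken to be the languages named in the profile.
        return sorted(self.entries[lang].impostor_profile)

    def missing_impostor_models(self) -> List[Tuple[str, str]]:
        # Control logic must score the impostor languages too, so each
        # impostor named in a profile needs a stored model of its own.
        return [
            (lang, imp)
            for lang in sorted(self.entries)
            for imp in self.impostor_set(lang)
            if imp not in self.entries
        ]
```

Keeping the model and the profile in the same entry means the control logic can fetch everything it needs for the most likely language in one lookup; the check for missing impostor models is a convenience of this sketch rather than something the claim requires.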
Specification