AUTOMATED IDENTIFICATION OF DOCUMENTS AS NOT BELONGING TO ANY LANGUAGE

US 20100125448A1
Filed: 11/20/2008
Published: 05/20/2010
Est. Priority Date: 11/20/2008
Status: Active Grant


Abstract

An “impostor profile” for a language is used to determine whether documents are in that language or no language. The impostor profile for a given language provides statistical information about the expected results of applying a language model for one or more other (“impostor”) languages to a document that is in fact in the given language. After a most likely language for a test document is identified, the impostor profile is used together with the scores for the test document in the various impostor languages to determine whether to identify the test document as being in the most likely language or in no language.
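
The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `identify_language`, the `scorers` and `impostor_profiles` structures, and the per-impostor z-score cutoff `max_z` are all assumptions introduced here. The abstract only says that the impostor profile is used together with the impostor scores. A simple deviation check stands in for that comparison.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical types: a scorer maps a document to a score (higher = better fit),
# and a profile maps each impostor language to an expected (mean, std) score.
Scorer = Callable[[str], float]
Profile = Dict[str, Tuple[float, float]]


def identify_language(
    document: str,
    scorers: Dict[str, Scorer],
    impostor_profiles: Dict[str, Profile],
    max_z: float = 3.0,  # assumed cutoff, not taken from the source
) -> Tuple[Optional[str], Dict[str, float]]:
    """Return (language or None for "no language", all document scores)."""
    # Score the document under every candidate language model.
    scores = {lang: scorer(document) for lang, scorer in scorers.items()}
    # The most likely language is the one with the best score.
    best = max(scores, key=scores.get)
    # Compare the impostor scores with what a genuine document would produce.
    for impostor, (mean, std) in impostor_profiles.get(best, {}).items():
        z = abs(scores[impostor] - mean) / std
        if z > max_z:
            return None, scores  # impostor score is atypical: no language
    return best, scores
```

The intuition is that a genuine document in the most likely language should score under related languages roughly the way training documents did. A junk document may still have a "best" language, but its impostor scores tend to fall far from the profiled values.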

32 Claims

1. A method for identifying documents as not belonging to any language in a plurality of candidate languages wherein each candidate language has an associated language model, the method comprising:
- for each language in a plurality of candidate languages, computing a document score for a test document using the language model of that language;
  
  selecting a most likely language for the test document from the plurality of candidate languages based on the respective document scores for each language in the plurality of candidate languages;
  
  accessing an impostor profile for the most likely language, wherein the impostor profile for the most likely language includes a parameter set consisting of values characterizing a score distribution expected for documents in the most likely language when scored using the respective language models of one or more impostor languages in an impostor set associated with the most likely language;
  
  comparing the document scores for the one or more impostor languages in the impostor set associated with the most likely language to the impostor profile for the most likely language;
  
  determining whether the test document is in the most likely language or in no language based at least in part on a result of comparing the document scores; and
  
  storing, in a computer readable storage medium, language information for the test document, the language information including a result of the determination.
  - 2. The method of claim 1 wherein the language model for each language in the plurality of candidate languages is a bigram-based language model.
  - 3. The method of claim 1 further comprising:
    - for each language in the plurality of candidate languages, defining an impostor profile; and
      
      storing the impostor profile for each language in a computer database, wherein accessing the impostor profile for the most likely language includes reading the impostor profile from the computer database.
  - 4. The method of claim 3 wherein defining the impostor profile for one of the languages (L) in the plurality of candidate languages includes:
    - calculating, using the language model for an alternative language M that is not the language L, a respective alternative training score for each of a plurality of training documents in language L;
      
      calculating, using the language model for the language L, a respective true training score for each of the plurality of training documents;
      
      determining a degree of closeness between the alternative language M and the language L based on the alternative training scores for the alternative language M and the true training scores; and
      
      determining whether to include the alternative language M in the impostor set for the language L based at least in part on the degree of closeness between the alternative language M and the language L.
  - 5. The method of claim 4 wherein determining the degree of closeness between the alternative language M and the language L includes comparing a mean of the alternative training scores for the alternative language M and a mean of the true training scores.
  - 6. The method of claim 5 wherein the acts of calculating a respective alternative training score for each of the plurality of training documents and determining a degree of closeness between the alternative language M and the language L are performed for each of a plurality of alternative languages.
  - 7. The method of claim 6 wherein determining whether to include one of the alternative languages in the impostor set for the language L is based on the respective degrees of closeness of each of the plurality of alternative languages to the language L.
  - 8. The method of claim 7 wherein the number of impostor languages included in the impostor set for the language L is limited to a predetermined maximum number.
  - 9. The method of claim 7 wherein the impostor set for the language L includes all of the alternative languages for which the degree of closeness meets a threshold condition.
  - 10. The method of claim 1 wherein the parameter set for the impostor profile for the most likely language includes a respective mean and standard deviation characterizing the score distribution for each of the impostor languages in the impostor set for the most likely language.
  - 11. The method of claim 10 wherein comparing the document scores for the one or more impostor languages in the impostor set associated with the most likely language to the impostor profile for the most likely language includes:
    - applying a chi-square test to the document score using the means and standard deviations of all of the impostor languages in the impostor set.
  - 12. The method of claim 10 wherein comparing the document scores for the one or more impostor languages in the impostor set associated with the most likely language to the impostor profile for the most likely language includes:
    - applying a similarity test to the document score and each impostor language in the impostor set for the most likely language, wherein the similarity test is applied separately for each impostor language.
  - 13. The method of claim 10 wherein comparing the document scores for the one or more impostor languages in the impostor set associated with the most likely language to the impostor profile for the most likely language includes:
    - applying a likelihood ratio test to the document score and each impostor language in the impostor set for the most likely language, wherein the likelihood ratio test is applied separately for each impostor language.
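
Claims 4 through 10 describe how an impostor profile might be defined from training documents. The sketch below is one possible reading, with several assumptions: the name `build_impostor_profile`, the `scorers` mapping, and the use of a simple difference of means as the "degree of closeness" are illustrative choices. Claims 8 and 9 present the maximum-count limit and the threshold condition as alternatives. The function accepts both only for convenience.

```python
from statistics import mean, stdev
from typing import Callable, Dict, List, Optional, Tuple

Scorer = Callable[[str], float]  # hypothetical: document -> score, higher is better


def build_impostor_profile(
    target: str,
    training_docs: List[str],           # documents known to be in `target`
    scorers: Dict[str, Scorer],
    max_impostors: Optional[int] = 2,   # cap on the impostor set size (claim 8)
    max_gap: Optional[float] = None,    # closeness threshold (claim 9)
) -> Dict[str, Tuple[float, float]]:
    """Map each chosen impostor language to (mean, std) of its training scores."""
    # True training scores: the target model applied to its own documents.
    true_mean = mean(scorers[target](doc) for doc in training_docs)
    candidates = []
    for lang, scorer in scorers.items():
        if lang == target:
            continue
        # Alternative training scores: another model applied to the same documents.
        alt_scores = [scorer(doc) for doc in training_docs]
        # Assumed closeness measure: gap between the two means (smaller = closer).
        gap = true_mean - mean(alt_scores)
        if max_gap is not None and gap > max_gap:
            continue
        # stdev needs at least two training documents.
        candidates.append((gap, lang, mean(alt_scores), stdev(alt_scores)))
    candidates.sort()  # closest languages first
    if max_impostors is not None:
        candidates = candidates[:max_impostors]
    return {lang: (mu, sd) for _, lang, mu, sd in candidates}
```

The returned mapping holds a mean and standard deviation per impostor language, which matches the parameter set recited in claim 10.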

14. A computer program product comprising a computer readable storage medium encoded with program code usable to control operation of a computer system, the program code including:
- program code for computing, for each language in a plurality of candidate languages, a document score for a test document using a language model associated with that language;
  
  program code for selecting a most likely language from the plurality of candidate languages based on the document scores for each language;
  
  program code for determining whether the test document is in the most likely language or in no language, wherein the determination is based at least in part on comparing the document scores for one or more impostor languages in an impostor set associated with the most likely language to an impostor profile for the most likely language, wherein the impostor profile for the most likely language includes a parameter set consisting of values characterizing a score distribution expected for documents in the most likely language when scored using the respective language models of the one or more impostor languages in the impostor set associated with the most likely language; and
  
  program code for storing, in a computer readable storage medium, language information for the test document, the language information including a result of the determination.
  - 15. The computer program product of claim 14 wherein the language model for each language is a bigram-based language model.
  - 16. The computer program product of claim 14 further comprising:
    - program code for defining an impostor profile for a language L in the plurality of candidate languages and storing the impostor profile for the language L in a data store.
  - 17. The computer program product of claim 16 wherein the program code for defining the impostor profile for the language L includes program code for analyzing a set of training documents known to be in the language L to determine an alternative score for each document in the set of training documents under the language model for a language other than the language L.
  - 18. The computer program product of claim 17 wherein the program code for defining the impostor profile for the language L further includes program code for determining a degree of closeness between the language L and the language other than the language L based at least in part on the alternative scores for the documents in the set of training documents.
  - 19. The computer program product of claim 18 wherein the program code for defining the impostor profile for the language L provides that the number of languages included in the impostor set for the language L is limited to a predetermined maximum number.
  - 20. The computer program product of claim 18 wherein the program code for defining the impostor profile for the language L provides that the impostor set for the language L includes all languages in the plurality of candidate languages, other than the language L, for which the degree of closeness meets a threshold condition.
  - 21. The computer program product of claim 14 wherein the parameter set for the impostor profile for the most likely language L₀ includes a respective mean and standard deviation characterizing the score distribution for each of the languages in the impostor set for the most likely language L₀.
  - 22. The computer program product of claim 21 wherein the program code for determining whether the test document is in the most likely language or in no language includes program code for applying a chi-square test to the document score using the means and standard deviations of the impostor language in the impostor set, wherein the determination whether the test document is in the most likely language or in no language is based at least in part on a result of the chi-square test.
  - 23. The computer program product of claim 21 wherein the program code for determining whether the test document is in the most likely language or in no language includes program code for applying a similarity test to the document score and each impostor language in the impostor set for the most likely language, wherein the similarity test is applied separately for each impostor language and wherein the determination whether the test document is in the most likely language or in no language is based at least in part on a result of the similarity test.
  - 24. The computer program product of claim 21 wherein the program code for determining whether the test document is in the most likely language or in no language includes program code for applying a likelihood ratio test to the document score and each impostor language in the impostor set for the most likely language, wherein the likelihood ratio test is applied separately for each language M, and wherein the determination whether the test document is in the most likely language or in no language is based at least in part on a result of the likelihood ratio test.
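
Claims 11 and 22 recite a chi-square test that uses the means and standard deviations of the impostor languages. One common way to form such a statistic is to sum the squared standardized deviations of the document's impostor scores, as sketched below. This assumes the impostor scores are roughly normal and independent, which the claims do not state. The names `chi_square_statistic` and `in_most_likely_language` and the choice of a 95% critical value are illustrative assumptions.

```python
from typing import Dict, Optional, Tuple

Profile = Dict[str, Tuple[float, float]]  # impostor language -> (mean, std)

# Standard upper 5% critical values of the chi-square distribution,
# keyed by degrees of freedom (here: number of impostor languages).
CHI2_CRITICAL_95 = {1: 3.841, 2: 5.991, 3: 7.815}


def chi_square_statistic(doc_scores: Dict[str, float], profile: Profile) -> float:
    """Sum of squared z-scores of the document's impostor scores."""
    return sum(
        ((doc_scores[lang] - mu) / sd) ** 2 for lang, (mu, sd) in profile.items()
    )


def in_most_likely_language(
    doc_scores: Dict[str, float],
    profile: Profile,
    critical: Optional[float] = None,  # assumed: caller may supply a cutoff
) -> bool:
    """True if the impostor scores look typical for a genuine document."""
    if critical is None:
        if len(profile) not in CHI2_CRITICAL_95:
            raise ValueError("supply `critical` for this number of impostors")
        critical = CHI2_CRITICAL_95[len(profile)]
    return chi_square_statistic(doc_scores, profile) <= critical
```

A single combined statistic uses all impostor languages at once. The similarity and likelihood ratio tests in claims 23 and 24, by contrast, are applied separately for each impostor language.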

25. A computer system comprising:
- a language model data store configured to store a plurality of language models corresponding to a plurality of languages, each language model including information usable to determine a score reflecting a probability that a document is in the language corresponding to that language model, the language model data store being further configured to store an impostor profile associated with each of the plurality of languages, wherein the impostor profile for each of the plurality of languages includes a parameter set consisting of values characterizing a score distribution expected for documents in that language when scored using the respective language models of one or more impostor languages in an impostor set associated with that language; and
  
  control logic coupled to the language model data store configured to compute, for at least some of the plurality of languages, a document score for a test document, the document score being computed based on at least some of the language models stored in the language model data store, and to select a most likely language for the test document based on the computed document scores, wherein document scores are also computed for the impostor languages in the impostor set associated with the most likely language, the control logic being further configured to compare the document scores computed for the impostor languages in the impostor set associated with the most likely language to the impostor profile for the most likely language and to determine whether the test document is in the most likely language or in no language based at least in part on a result of comparing the document scores.
  - 26. The computer system of claim 25 further comprising:
    - a document information data store configured to store information about a plurality of documents including the test document, wherein the control logic is further configured to store, in the document information data store, language information for the test document, the language information including a result of the determination.
  - 27. The computer system of claim 25 wherein the language models are n-gram-based language models.
  - 28. The computer system of claim 25 wherein the control logic is further configured to define the impostor profile for each of the plurality of languages.
  - 29. The computer system of claim 28 wherein the control logic is further configured such that defining the impostor profile for a first one of the plurality of languages includes analyzing a set of documents known to be in the first one of the plurality of languages to determine a score for each document in the set of documents under the language model for a language other than the first one of the plurality of languages.
  - 30. The computer system of claim 25 wherein the control logic is further configured such that determining whether the test document is in the most likely language or in no language includes applying a chi-square test to the computed document score for the most likely language and the respective computed document scores for the languages in the impostor set associated with the most likely language L₀.
  - 31. The computer system of claim 25 wherein the control logic is further configured such that determining whether the test document is in the most likely language or in no language includes applying a similarity test to the computed document score for the most likely language and the respective computed document scores for the languages in the impostor set associated with the most likely language.
  - 32. The computer system of claim 25 wherein the control logic is further configured such that determining whether the test document is in the most likely language or in no language includes applying a likelihood ratio test to the computed document score for the most likely language and the respective computed document scores for the languages in the impostor set associated with the most likely language.
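
Claims 2, 15 and 27 refer to bigram-based or n-gram-based language models that yield a score reflecting the probability that a document is in a given language. The claims do not say how such a model is built, so the following is only a sketch under stated assumptions: character-level bigrams, add-one smoothing over an assumed alphabet size `vocab_size`, and a length-normalized log-probability as the document score. The class name `BigramModel` is hypothetical.

```python
import math
from collections import Counter


class BigramModel:
    """Toy character-bigram model (illustrative; not the patent's model)."""

    def __init__(self, training_text: str, vocab_size: int = 256):
        # Count adjacent character pairs and how often each character leads a pair.
        self.pair_counts = Counter(zip(training_text, training_text[1:]))
        self.first_counts = Counter(training_text[:-1])
        self.vocab_size = vocab_size  # assumed alphabet size for smoothing

    def score(self, document: str) -> float:
        """Average log-probability per bigram (higher means a better fit)."""
        pairs = list(zip(document, document[1:]))
        if not pairs:
            return float("-inf")  # too short to score
        total = 0.0
        for first, second in pairs:
            # Add-one smoothing keeps unseen bigrams from zeroing the probability.
            prob = (self.pair_counts[(first, second)] + 1) / (
                self.first_counts[first] + self.vocab_size
            )
            total += math.log(prob)
        return total / len(pairs)
```

Normalizing by the number of bigrams keeps scores comparable across documents of different lengths, which matters when scores are later compared with a profiled mean and standard deviation.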

Current Assignee
Micro Focus LLC (Open Text Corporation)
Original Assignee
Stratify, Inc. (Open Text Corporation)
Inventors
Goswami, Sauraj

Granted Patent

US 8,224,642 B2
US Class Current

704/8
CPC Class Codes

G06F 40/263 Language identification
