Automated identification of documents as not belonging to any language

  • US 8,224,642 B2
  • Filed: 11/20/2008
  • Issued: 07/17/2012
  • Est. Priority Date: 11/20/2008
  • Status: Expired due to Fees
First Claim

1. A method for identifying documents as not belonging to any language in a plurality of candidate languages wherein each candidate language has an associated language model, the method comprising:

    for each language in a plurality of candidate languages, computing, by a processor, a document score for a test document using the language model of that language;

    selecting a most likely language for the test document from the plurality of candidate languages based on the respective document scores for each language in the plurality of candidate languages;

    accessing an impostor profile for the most likely language, wherein the impostor profile for the most likely language includes a parameter set consisting of values characterizing a score distribution expected for documents in the most likely language when scored using the respective language models of one or more impostor languages in an impostor set associated with the most likely language;

    comparing the document scores for the one or more impostor languages in the impostor set associated with the most likely language to the impostor profile for the most likely language;

    determining whether the test document is in the most likely language or in no language based at least in part on a result of comparing the document scores; and

    storing, in a computer readable storage medium, language information for the test document, the language information including a result of the determination.
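The steps of claim 1 can be illustrated with a minimal sketch. This is not the patented implementation: the claim does not say what form the language models take, which values make up the impostor profile, or how the comparison is made. The sketch assumes each profile holds a (mean, standard deviation) pair per impostor language and that the comparison is a z-score threshold. The names `letter_model`, `identify_language`, and `max_z`, and the toy languages "aa" and "bb", are hypothetical.

```python
def letter_model(letters):
    """Toy stand-in for a language model (assumption): mean per-character score."""
    def score(doc):
        return sum(1.0 if ch in letters else -1.0 for ch in doc) / max(len(doc), 1)
    return score


def identify_language(doc, models, impostor_profiles, max_z=3.0):
    """Return a record naming the most likely language, or None for 'no language'."""
    # Compute a document score under every candidate language model.
    scores = {lang: model(doc) for lang, model in models.items()}
    # Select the most likely language: here, the one with the highest score.
    best = max(scores, key=scores.get)
    # Access the impostor profile of the most likely language and compare the
    # document's impostor-language scores against it (assumed z-score test).
    language = best
    for impostor, (mean, std) in impostor_profiles[best].items():
        if abs(scores[impostor] - mean) / std > max_z:
            language = None  # impostor score is atypical for a genuine document
            break
    # The returned record is what a caller would persist to storage.
    return {"language": language, "most_likely": best, "scores": scores}


# Hypothetical toy data: two "languages" defined by their letter sets.
models = {"aa": letter_model("ab"), "bb": letter_model("bc")}
# Assumed profile content: (mean, std) of the scores that genuine documents
# in the keyed language receive under each impostor language model.
impostor_profiles = {
    "aa": {"bb": (0.0, 0.2)},
    "bb": {"aa": (0.0, 0.2)},
}

genuine = identify_language("abab", models, impostor_profiles)
junk = identify_language("axax", models, impostor_profiles)
```

In this toy setup, "abab" scores highest under "aa" and its "bb" score matches the profile, so it is accepted as "aa". The string "axax" also scores highest under "aa", but its "bb" score falls far from the expected distribution, so it is reported as belonging to no language.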
