Language identification for documents containing multiple languages

  • US 8,224,641 B2
  • Filed: 11/19/2008
  • Issued: 07/17/2012
  • Est. Priority Date: 11/19/2008
  • Status: Expired due to Fees
First Claim

1. A method of identifying one or more languages for a document, the languages being selected from a set of candidate languages, the method comprising:

  • for each language (M) in the set of candidate languages, defining a set of non-overlapping languages (N(M)), the set N(M) consisting of one or more languages (L), wherein each language L in the set N(M) does not overlap with the language M;

  • obtaining n-gram data for a target document;

  • for each language M in the set of candidate languages, using the n-gram data to determine a final score SF(M) based on relative probabilities of a first hypothesis that the target document is entirely in the language M and a second hypothesis that one portion of the target document is in the language M while another portion of the target document is in a language L selected from the set N(M);

  • identifying, by a processor, one or more of the candidate languages as being languages of the document based on the final scores SF(M) for different languages M; and

  • storing, in a computer readable storage device, information indicating the one or more languages of the document.
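
The steps of the claim can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the patented implementation: the claim does not fix a scoring formula, so the sketch assumes character unigrams as the n-gram data, toy uniform language models built from alphabets, "non-overlapping" meaning disjoint alphabets, and a final score equal to the better of the two hypotheses' average log-likelihoods, with a fixed penalty on the two-language hypothesis. The names ALPHABETS, FLOOR, PENALTY, non_overlapping, final_score and identify_languages are all hypothetical.

```python
import math
from collections import Counter

# Toy candidate languages, each reduced to an alphabet (assumption for this sketch).
ALPHABETS = {
    "en": set("abcdefghijklmnopqrstuvwxyz"),
    "es": set("abcdefghijklmnopqrstuvwxyzñáéíóú"),
    "el": set("αβγδεζηθικλμνξοπρσςτυφχψω"),
}
FLOOR = math.log(1e-6)  # assumed log-probability of a character outside the alphabet
PENALTY = 0.5           # assumed per-n-gram cost of the two-language hypothesis


def non_overlapping(alphabets):
    """Build N(M): here, the languages whose alphabet is disjoint from M's."""
    return {
        m: [l for l in alphabets if l != m and not (alphabets[m] & alphabets[l])]
        for m in alphabets
    }


def ngram_counts(text):
    """Obtain n-gram data for the document (character unigrams, letters only)."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def log_prob(lang, gram):
    """Uniform toy model: equal probability inside the alphabet, FLOOR outside."""
    alphabet = ALPHABETS[lang]
    return -math.log(len(alphabet)) if gram in alphabet else FLOOR


def final_score(m, counts, n_of_m):
    """Return (SF(M), partner), where partner is L if the second hypothesis wins."""
    total = sum(counts.values())
    # First hypothesis: the whole document is in M.
    h1 = sum(c * log_prob(m, g) for g, c in counts.items()) / total
    best = (h1, None)
    # Second hypothesis: each n-gram belongs to M or to some L in N(M).
    for l in n_of_m:
        h2 = sum(
            c * max(log_prob(m, g), log_prob(l, g)) for g, c in counts.items()
        ) / total - PENALTY
        if h2 > best[0]:
            best = (h2, l)
    return best


def identify_languages(text):
    """Pick the language with the best final score, plus its partner if any."""
    counts = ngram_counts(text)
    if not counts:
        return []
    n_sets = non_overlapping(ALPHABETS)
    scores = {m: final_score(m, counts, n_sets[m]) for m in ALPHABETS}
    best_m = max(scores, key=lambda m: scores[m][0])
    partner = scores[best_m][1]
    return sorted({best_m} | ({partner} if partner else set()))


print(identify_languages("hello world"))   # ['en']
print(identify_languages("hello κοσμος"))  # ['el', 'en']
```

In this sketch, "es" shares letters with "en", so neither appears in the other's N(M); only "el" is a permitted second language for either. The final storing step of the claim is omitted; the returned list is what would be written to storage.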
