Language identification for documents containing multiple languages

  • US 8,938,384 B2
  • Filed: 07/16/2012
  • Issued: 01/20/2015
  • Est. Priority Date: 11/19/2008
  • Status: Expired due to Fees
First Claim

1. A method of identifying one or more languages for a document, the languages being selected from a set of candidate languages, the method comprising:

    dividing the set of candidate languages into a plurality of disjoint subsets, wherein any two languages that are in different disjoint subsets do not overlap with each other;

    segmenting the document into one or more segments (t) of consecutive characters, wherein each segment t contains n-grams that have greater than a default probability of occurrence only for languages in an active one of the disjoint subsets (At);

    for each segment t, generating a segment score (St(L)) for each language (L) in the active one of the disjoint subsets At;

    identifying, by a processor, one or more languages as being languages of the document based on the segment scores St(L) for all of the segments t and languages L; and

    storing, in a computer readable storage device, information indicating the one or more languages of the document.
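
The steps recited in claim 1 can be illustrated with a small sketch. It is not the patented implementation: the toy character models, the `DEFAULT_P` floor, the use of single characters (1-grams) rather than longer n-grams, and the rule that a language is reported when it wins at least one segment are all assumptions chosen to keep the example short. The names `MODELS`, `SUBSETS`, `subset_of`, `segment`, `segment_scores`, and `identify_languages` are hypothetical.

```python
import math

# Hypothetical toy models: per-language character (1-gram) probabilities.
# Real systems would use longer n-grams trained on large corpora.
MODELS = {
    "en": {"t": 0.3, "h": 0.3, "e": 0.4},
    "fr": {"l": 0.3, "e": 0.4, "s": 0.3},
    "ru": {"д": 0.4, "ж": 0.2, "ы": 0.4},
    "uk": {"ї": 0.4, "є": 0.4, "ж": 0.2},
}

# Disjoint subsets: no character has above-default probability
# in languages belonging to two different subsets.
SUBSETS = [frozenset({"en", "fr"}), frozenset({"ru", "uk"})]

DEFAULT_P = 1e-6  # assumed default probability for unseen characters


def subset_of(ch):
    """Return the subset whose languages model `ch`, or None if neutral."""
    for subset in SUBSETS:
        if any(ch in MODELS[lang] for lang in subset):
            return subset
    return None


def segment(text):
    """Split text into runs of consecutive characters sharing one active subset.

    Neutral characters (spaces, digits, ...) stay in the current segment.
    """
    segments, active, buf = [], None, []
    for ch in text:
        s = subset_of(ch)
        if s is not None and active is not None and s != active:
            segments.append((active, "".join(buf)))
            buf = []
            active = s
        elif s is not None and active is None:
            active = s
        buf.append(ch)
    if buf and active is not None:
        segments.append((active, "".join(buf)))
    return segments


def segment_scores(active, seg):
    """Score S_t(L): summed log-probability of the segment under each language L."""
    return {
        lang: sum(
            math.log(MODELS[lang].get(ch, DEFAULT_P))
            for ch in seg
            if subset_of(ch) is not None  # skip neutral characters
        )
        for lang in active
    }


def identify_languages(text):
    """Report every language that has the best score in at least one segment."""
    winners = set()
    for active, seg in segment(text):
        scores = segment_scores(active, seg)
        winners.add(max(scores, key=scores.get))
    return sorted(winners)
```

Calling `identify_languages` on a string that mixes Latin and Cyrillic characters yields one winning language per segment. Because each segment is scored only against the languages in its active subset, the work per segment scales with the subset size rather than with the full candidate set. The claim's final storing step is omitted here; the returned list is what would be persisted.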

  • 10 Assignments