Methods and systems for determining a language of a document

  • US 7,191,116 B2
  • Filed: 06/19/2001
  • Issued: 03/13/2007
  • Est. Priority Date: 06/19/2001
  • Status: Active Grant
First Claim

1. A system for automatically determining a language of a document from a set of candidate languages, the system comprising:

  • a database containing probability data for a plurality of text strings each having a predetermined length equal to each other, each text string of the plurality of text strings having an associated probability value indicating a probability that the text string occurs within a language based on occurrences of the text string in all of the candidate languages;

    logic for setting a negative assumption value for each of the candidate languages indicating the document is not one of the candidate languages;

    an extractor for extracting a character string from the document, the character string having a length equal to the predetermined length of the plurality of text strings contained in the database; and

    a language analyzer for determining a probability value for each of the candidate languages that the character string does not belong to the candidate languages by retrieving the probability value associated to the character string from the database for each of the candidate languages, and includes logic for adjusting the negative assumption value based on the probability value, the language analyzer determining that the document is one language of the candidate languages when the negative assumption value passes a threshold value.
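
The elements of claim 1 can be illustrated with a minimal Python sketch. This is one possible reading of the claim language, not the patented implementation. The function names (`build_database`, `detect_language`), the string length of 3, the multiplicative update `negative *= (1 - p)`, the cap of 0.9, and the threshold of 0.05 are all assumptions introduced for illustration; the claim does not specify any of them.

```python
from collections import Counter, defaultdict

N = 3  # predetermined string length (assumed; the claim fixes no value)


def ngrams(text, n=N):
    """Extractor: every character string of length n in the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_database(samples, n=N):
    """Build {string: {language: probability}} from training text.

    The probability is the share of a string's occurrences that fall in
    one language, counted across all candidate languages.
    """
    counts = defaultdict(Counter)
    for lang, text in samples.items():
        for gram in ngrams(text, n):
            counts[gram][lang] += 1
    database = {}
    for gram, per_lang in counts.items():
        total = sum(per_lang.values())
        database[gram] = {lang: c / total for lang, c in per_lang.items()}
    return database


def detect_language(document, database, languages,
                    threshold=0.05, n=N, cap=0.9):
    """Return the first language whose negative assumption value passes
    (here: drops below) the threshold, or None if none does."""
    # Negative assumption: start fully convinced that the document is
    # NOT written in each candidate language.
    negative = {lang: 1.0 for lang in languages}
    for gram in ngrams(document, n):
        probs = database.get(gram)
        if probs is None:
            continue  # string never seen in training; no evidence
        for lang in languages:
            # Cap the probability so one string cannot decide alone
            # (an assumption of this sketch).
            p = min(probs.get(lang, 0.0), cap)
            # (1 - p) is the probability that the string does not
            # belong to this language; fold it into the assumption.
            negative[lang] *= 1.0 - p
            if negative[lang] < threshold:
                return lang
    return None
```

With a database built from one short English and one short German sample, `detect_language("the lazy fox", db, ["en", "de"])` stops after the second matching string, because two strings seen only in English reduce the English negative assumption to 0.1 and then 0.01. Stopping as soon as the threshold is passed means the whole document need not be read, which is one plausible motivation for tracking a negative assumption rather than scoring every language over the full text.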
