Language Identification in Multilingual Text

  • US 20120095748A1
  • Filed: 10/14/2010
  • Published: 04/19/2012
  • Est. Priority Date: 10/14/2010
  • Status: Active Grant
First Claim

1. A computer-implemented system for identifying multilingual text in a document using computer processor, memory, and data storage subsystems, the computer-implemented system comprising:

  • a code-page conversion component to identify the character encoding used by a document and to decode said encoding into a universal representative encoding via the processor;

  • a section breaking and classification component to divide plain-text content of the document into one or more weighted sections;

  • a language scoring component to obtain language likelihood scores of each word, phrase, or character n-gram in the one or more weighted sections, and to combine the obtained language likelihood scores according to language; and

  • an output language selection component to select a primary language for the document based upon a highest combined language likelihood score.
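
The four claimed components can be read as a pipeline: decode, split into weighted sections, score per language, pick the top score. The Python sketch below is only an illustration of that flow, not the patented implementation. Every name in it is hypothetical, and several details are assumptions the claim does not specify: the toy word lists standing in for real language models, the rule that the first section counts double, the blank-line section breaks, and the trial-decoding approach to encoding detection.

```python
from collections import defaultdict

# Hypothetical toy lexicons; a real system would use trained language models.
TOY_LEXICONS = {
    "en": {"the", "and", "is", "of", "language"},
    "fr": {"le", "la", "et", "est", "langue"},
    "de": {"der", "die", "und", "ist", "sprache"},
}


def decode_document(raw: bytes, candidates=("utf-8", "cp1252")) -> str:
    """Code-page conversion: try candidate encodings, return Unicode text."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going with replacement characters.
    return raw.decode("utf-8", errors="replace")


def split_sections(text: str):
    """Section breaking: split on blank lines and attach a weight.

    Assumption: the first section is treated as a title and weighted 2.0.
    """
    blocks = [block.strip() for block in text.split("\n\n") if block.strip()]
    return [(block, 2.0 if i == 0 else 1.0) for i, block in enumerate(blocks)]


def score_sections(sections):
    """Language scoring: add each section's weight per matching word."""
    scores = defaultdict(float)
    for block, weight in sections:
        for word in block.lower().split():
            token = word.strip(".,;:!?\"'()")
            for lang, lexicon in TOY_LEXICONS.items():
                if token in lexicon:
                    scores[lang] += weight
    return dict(scores)


def select_primary_language(scores):
    """Output selection: the language with the highest combined score."""
    return max(scores, key=scores.get) if scores else None


def identify(raw: bytes):
    """Run the four stages in order and return (primary language, scores)."""
    text = decode_document(raw)
    scores = score_sections(split_sections(text))
    return select_primary_language(scores), scores


if __name__ == "__main__":
    sample = "La langue\n\nThe language is the key and le reste est ici"
    print(identify(sample.encode("utf-8")))
```

In the sample, the body contains more English words than French ones, but the doubled weight on the French first section tips the combined score toward French, which is the kind of effect section weighting is meant to allow.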
