Language identification

  • US 9,910,847 B2
  • Filed: 09/30/2014
  • Issued: 03/06/2018
  • Est. Priority Date: 09/30/2014
  • Status: Active Grant
First Claim
1. A computer-implemented method for language identification, comprising:

    receiving, by at least one hardware processor, a plurality of documents in each of a plurality of languages;

    creating, by the at least one hardware processor, a Latent Semantic Indexing (LSI) index from the plurality of documents;

    training, by the at least one hardware processor, a language classification model based on the LSI index, where training the language classification model based on the LSI index comprises:

        determining a vector for each of the plurality of documents in each of the plurality of languages from the LSI index, where the vectors are generated from singular value decomposition of the plurality of documents in each of a plurality of languages;

        determining a combination of dimensions in the vector for each document that is indicative of the language of the document;

        determining rules from the combinations of dimensions determined from the vectors for the plurality of documents; and

        generating the language classification model from the rules;

    receiving, by the at least one hardware processor, a document to be identified by language;

    generating, by the at least one hardware processor, a vector in the LSI index for the document to be identified by language;

    evaluating, by the at least one hardware processor, the vector of the document to be identified by language against the language classification model; and

    identifying a language of the received document as being one of the plurality of languages based on the evaluating.
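
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes NumPy, a hypothetical six-document toy corpus, whitespace-separated words as terms, and k = 2 retained dimensions. The claim does not specify what form the rules take, so the sketch swaps in a simple stand-in: one centroid per language in the LSI space, with a new document assigned to the language whose centroid has the highest cosine similarity. All function and variable names (`build_lsi`, `to_vector`, `train`, `identify`, `TRAIN`) are illustrative only.

```python
import numpy as np

# Hypothetical toy corpus of (language label, text) pairs; not from the patent.
TRAIN = [
    ("en", "the cat sat on the mat"),
    ("en", "the dog sat on the log"),
    ("en", "the cat and the dog"),
    ("fr", "le chat est sur le tapis"),
    ("fr", "le chien est sur le banc"),
    ("fr", "le chat et le chien"),
]


def build_lsi(texts, k):
    """Build a term-document count matrix and keep its top-k left singular vectors."""
    terms = sorted({w for t in texts for w in t.split()})
    vocab = {term: i for i, term in enumerate(terms)}
    A = np.zeros((len(vocab), len(texts)))
    for j, t in enumerate(texts):
        for w in t.split():
            A[vocab[w], j] += 1.0
    # Singular value decomposition; columns of U are ordered by singular value.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return vocab, U[:, :k]


def to_vector(text, vocab, Uk):
    """Project a document into the k-dimensional LSI space; unseen terms are ignored."""
    q = np.zeros(len(vocab))
    for w in text.split():
        if w in vocab:
            q[vocab[w]] += 1.0
    return q @ Uk


def train(corpus, k=2):
    """Assumed stand-in for the claimed rules: one centroid per language."""
    vocab, Uk = build_lsi([text for _, text in corpus], k)
    by_lang = {}
    for lang, text in corpus:
        by_lang.setdefault(lang, []).append(to_vector(text, vocab, Uk))
    centroids = {lang: np.mean(vecs, axis=0) for lang, vecs in by_lang.items()}
    return vocab, Uk, centroids


def identify(text, model):
    """Return the language whose centroid is most similar, or None if no term is known."""
    vocab, Uk, centroids = model
    v = to_vector(text, vocab, Uk)
    norm = np.linalg.norm(v)
    if norm == 0:
        return None

    def cosine(c):
        return float(v @ c) / (norm * np.linalg.norm(c))

    return max(centroids, key=lambda lang: cosine(centroids[lang]))


model = train(TRAIN)
print(identify("the dog and the cat", model))
print(identify("le chien et le chat", model))
```

In this toy corpus the two languages share no terms, so their documents fall along orthogonal directions of the two-dimensional LSI space and the centroid comparison separates them cleanly. A realistic corpus would have overlapping vocabulary and would need more dimensions and a more careful rule derivation.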
