Language identification
Abstract
A plurality of documents in each of a plurality of languages can be received. A Latent Semantic Indexing (LSI) index can be created from the plurality of documents. A language classification model can be trained from the LSI index. A document to be identified by language can be received. A vector in the LSI index can be generated for the document to be identified by language. The vector can be evaluated against the language classification model.
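As an illustration of the index-creation and vector-generation steps summarized above, the following is a minimal sketch, not the patented implementation. It assumes NumPy, whitespace-separated words as terms, raw term counts, and a truncated singular value decomposition of rank k. The function names build_lsi_index and fold_in are hypothetical and do not appear in the patent.

```python
import numpy as np


def build_lsi_index(docs, k):
    """Build a rank-k LSI index from whitespace-tokenized documents (assumed tokenization)."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    term_id = {w: i for i, w in enumerate(vocab)}

    # Term-document matrix of raw counts (rows: terms, columns: documents).
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            A[term_id[w], j] += 1.0

    # Singular value decomposition, truncated to the k largest singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]

    # One k-dimensional vector per training document.
    doc_vectors = Vt[:k, :].T * s_k
    return term_id, U_k, s_k, doc_vectors


def fold_in(text, term_id, U_k):
    """Project a new document into the existing LSI space; unknown terms are ignored."""
    q = np.zeros(U_k.shape[0])
    for w in text.lower().split():
        if w in term_id:
            q[term_id[w]] += 1.0
    # Same coordinate system as doc_vectors: a training document maps to its own row.
    return q @ U_k
```

In this sketch, fold_in applied to a training document reproduces that document's row of doc_vectors, so new and training documents can be compared in the same space. Other term weightings (for example, character n-grams or TF-IDF) would be equally plausible choices.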
18 Claims
1. A computer-implemented method for language identification, comprising:
receiving, by at least one hardware processor, a plurality of documents in each of a plurality of languages;
creating, by the at least one hardware processor, a Latent Semantic Indexing (LSI) index from the plurality of documents;
training, by the at least one hardware processor, a language classification model based on the LSI index, where training the language classification model based on the LSI index comprises:
determining a vector for each of the plurality of documents in each of the plurality of languages from the LSI index, where the vectors are generated from singular value decomposition of the plurality of documents in each of a plurality of languages;
determining a combination of dimensions in the vector for each document that is indicative of the language of the document;
determining rules from the combinations of dimensions determined from the vectors for the plurality of documents; and
generating the language classification model from the rules;
receiving, by the at least one hardware processor, a document to be identified by language;
generating, by the at least one hardware processor, a vector in the LSI index for the document to be identified by language;
evaluating, by the at least one hardware processor, the vector of the document to be identified by language against the language classification model; and
identifying a language of the received document as being one of the plurality of languages based on the evaluating.
Dependent claims: 2, 3, 4, 5, 6
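The training sub-steps of claim 1 (a combination of dimensions per document, rules from those combinations, a model from the rules) can be sketched as follows. This is one hypothetical reading, not the claimed method: the claim does not say how a combination is chosen or what form a rule takes. Here the "combination" is assumed to be the n largest-magnitude dimensions of a document vector, and a "rule" maps each observed combination to the language most often seen with it. The names dominant_dims, train_rules, and identify are invented for illustration.

```python
from collections import Counter, defaultdict


def dominant_dims(vector, n):
    """Return the indices of the n largest-magnitude dimensions as a sorted tuple."""
    ranked = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    return tuple(sorted(ranked[:n]))


def train_rules(vectors, languages, n=1):
    """Map each observed combination of dominant dimensions to its most common language."""
    votes = defaultdict(Counter)
    for vec, lang in zip(vectors, languages):
        votes[dominant_dims(vec, n)][lang] += 1
    # The resulting dictionary of rules serves as the classification model in this sketch.
    return {combo: counts.most_common(1)[0][0] for combo, counts in votes.items()}


def identify(vector, rules, n=1):
    """Evaluate a document vector against the rules; None means no rule matched."""
    return rules.get(dominant_dims(vector, n))
```

A caller would pass the LSI vectors of the labeled training documents to train_rules, then call identify on the LSI vector of the document to be identified, using the same n in both calls. A decision tree or other rule learner over the same vectors would be an equally valid alternative under the claim language.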
7. A computer program product, comprising:
a non-transitory computer-readable storage device having computer-executable program instructions embodied thereon that when executed by a computer cause the computer to identify a language of a communication, the computer-executable program instructions comprising:
computer-executable program instructions to receive a plurality of documents in each of a plurality of languages;
computer-executable program instructions to create a Latent Semantic Indexing (LSI) index from the plurality of documents;
computer-executable program instructions to train a language classification model based on the LSI index, wherein to train the language classification model based on the LSI index comprises:
determine a vector for each of the plurality of documents in each of the plurality of languages from the LSI index, where the vectors are generated from singular value decomposition of the plurality of documents in each of a plurality of languages;
for each document, determine a combination of dimensions in the vector for the document that is indicative of the language of the document;
determine rules from the combination of dimensions for each document for identifying a language of a document; and
generate the language classification model from the rules;
computer-executable program instructions to receive a document to be identified by language;
computer-executable program instructions to generate a vector in the LSI index for the document to be identified by language;
computer-executable program instructions to evaluate the vector of the document to be identified by language against the language classification model; and
computer-executable program instructions to identify a language of the received document as being in one of the plurality of languages based on an evaluation of the vector against the language classification model.
Dependent claims: 8, 9, 10, 11, 12
13. A system to identify the language of a communication, the system comprising:
a storage device; and
a processor communicatively coupled to the storage device, wherein the processor executes application code instructions received from the storage device to:
receive a plurality of documents in each of a plurality of languages;
create a Latent Semantic Indexing (LSI) index from the plurality of documents;
train a language classification model based on the LSI index, wherein to train the language classification model based on the LSI index comprises:
determine a vector for each of the plurality of documents in each of the plurality of languages from the LSI index, where the vectors are generated from singular value decomposition of the plurality of documents in each of a plurality of languages;
for each document, determine a combination of dimensions in the vector for the document that is indicative of the language of the document;
determine rules from the combination of dimensions for each document for identifying a language of a document; and
generate the language classification model from the rules;
receive a document to be identified by language;
generate a vector in the LSI index for the document to be identified by language;
evaluate the vector of the document to be identified by language against the language classification model; and
identify the language of the received document as being in one of the plurality of languages based on an evaluation of the vector against the language classification model.
Dependent claims: 14, 15, 16, 17, 18
Specification