Systems and Methods for Language Detection
First Claim
1. A computer-implemented method of identifying a language in a message, the method comprising:
obtaining a text message;
removing non-language characters from the text message to generate a sanitized text message;
detecting at least one of an alphabet and a script present in the sanitized text message, wherein detecting comprises at least one of:
(i) performing an alphabet-based language detection test to determine a first set of scores, wherein each score in the first set of scores represents a likelihood that the sanitized text message comprises the alphabet for one of a plurality of different languages; and
(ii) performing a script-based language detection test to determine a second set of scores, wherein each score in the second set of scores represents a likelihood that the sanitized text message comprises the script for one of the plurality of different languages; and
identifying the language in the sanitized text message based on at least one of the first set of scores, the second set of scores, and a combination of the first and second sets of scores.
Abstract
Implementations of the present disclosure are directed to a method, a system, and a computer program storage device for identifying a language in a message. Non-language characters are removed from a text message to generate a sanitized text message. An alphabet and/or a script are detected in the sanitized text message by performing at least one of (i) an alphabet-based language detection test to determine a first set of scores and (ii) a script-based language detection test to determine a second set of scores. Each score in the first set of scores represents a likelihood that the sanitized text message includes the alphabet for one of a plurality of different languages. Each score in the second set of scores represents a likelihood that the sanitized text message includes the script for one of the plurality of different languages. The language in the sanitized text message is identified based on at least one of the first set of scores, the second set of scores, and a combination of the first and second sets of scores.
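The flow described in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation. The alphabet tables, the script-to-language mapping, the use of the first word of a Unicode character name as a stand-in for its script, and the simple averaging of the two score sets are all assumptions made for illustration. All function names are hypothetical.

```python
import unicodedata
from collections import Counter

# Hypothetical alphabet tables (illustrative, not taken from the patent).
ALPHABETS = {
    "en": set("abcdefghijklmnopqrstuvwxyz"),
    "es": set("abcdefghijklmnopqrstuvwxyzáéíóúüñ"),
    "ru": set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
}

# Hypothetical mapping from a script label to candidate languages.
SCRIPT_LANGUAGES = {"LATIN": ["en", "es"], "CYRILLIC": ["ru"]}


def sanitize(message):
    """Replace non-language characters (digits, punctuation, symbols)
    with spaces, then collapse runs of whitespace."""
    kept = (
        ch if unicodedata.category(ch)[0] in ("L", "M") else " "
        for ch in message
    )
    return " ".join("".join(kept).split())


def alphabet_scores(text):
    """First set of scores: fraction of letters found in each alphabet."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return {lang: 0.0 for lang in ALPHABETS}
    return {
        lang: sum(ch in alphabet for ch in letters) / len(letters)
        for lang, alphabet in ALPHABETS.items()
    }


def script_scores(text):
    """Second set of scores: fraction of letters whose script is used by
    each language. The script is approximated by the first word of the
    Unicode character name (an assumption of this sketch)."""
    scores = {lang: 0.0 for lang in ALPHABETS}
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return scores
    counts = Counter(unicodedata.name(ch, "UNKNOWN").split()[0] for ch in letters)
    for script, n in counts.items():
        for lang in SCRIPT_LANGUAGES.get(script, []):
            scores[lang] += n / len(letters)
    return scores


def identify_language(message):
    """Sanitize, run both tests, and pick the best combined score.
    Averaging is one possible combination; ties go to the first entry."""
    clean = sanitize(message)
    first = alphabet_scores(clean)
    second = script_scores(clean)
    combined = {lang: (first[lang] + second[lang]) / 2 for lang in first}
    return max(combined, key=combined.get)
```

In this sketch the script-based test alone cannot separate English from Spanish, since both use Latin script, so the alphabet-based scores break the tie when accented letters are present.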
20 Claims
1. A computer-implemented method of identifying a language in a message, the method comprising:
obtaining a text message;
removing non-language characters from the text message to generate a sanitized text message;
detecting at least one of an alphabet and a script present in the sanitized text message, wherein detecting comprises at least one of:
(i) performing an alphabet-based language detection test to determine a first set of scores, wherein each score in the first set of scores represents a likelihood that the sanitized text message comprises the alphabet for one of a plurality of different languages; and
(ii) performing a script-based language detection test to determine a second set of scores, wherein each score in the second set of scores represents a likelihood that the sanitized text message comprises the script for one of the plurality of different languages; and
identifying the language in the sanitized text message based on at least one of the first set of scores, the second set of scores, and a combination of the first and second sets of scores.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A computer-implemented system for identifying a language in a message, comprising:
a sanitizer module, wherein the sanitizer module obtains a text message and removes non-language characters from the text message to generate a sanitized text message;
a grouper module, wherein the grouper module detects at least one of an alphabet and a script present in the sanitized text message, and wherein the grouper module is operable to perform operations comprising at least one of:
performing an alphabet-based language detection test to determine a first set of scores, wherein each score in the first set of scores represents a likelihood that the sanitized text message comprises the alphabet for one of a plurality of different languages; and
performing a script-based language detection test to determine a second set of scores, wherein each score in the second set of scores represents a likelihood that the sanitized text message comprises the script for one of the plurality of different languages; and
a language detector module, wherein the language detector module identifies the language in the sanitized text message based on at least one of the first set of scores, the second set of scores, and a combination of the first and second sets of scores.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19
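The three-module arrangement recited in claim 11 could be wired together roughly as follows. This is a hedged sketch, not the claimed system. The class names, the `run` methods, the `GrouperResult` container, the `toy_script_test` helper, and the averaging rule are all hypothetical. The point it illustrates is the "at least one of" structure: the grouper may run either test or both, and the detector accepts the first set of scores, the second set, or a combination.

```python
import unicodedata
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Scores = Dict[str, float]
ScoreTest = Callable[[str], Scores]


class SanitizerModule:
    def run(self, message: str) -> str:
        # Keep letters and combining marks; everything else becomes a space.
        kept = (
            ch if unicodedata.category(ch)[0] in ("L", "M") else " "
            for ch in message
        )
        return " ".join("".join(kept).split())


@dataclass
class GrouperResult:
    first_scores: Optional[Scores] = None   # alphabet-based test, if run
    second_scores: Optional[Scores] = None  # script-based test, if run


class GrouperModule:
    def __init__(self, alphabet_test: Optional[ScoreTest] = None,
                 script_test: Optional[ScoreTest] = None):
        if alphabet_test is None and script_test is None:
            raise ValueError("at least one detection test is required")
        self.alphabet_test = alphabet_test
        self.script_test = script_test

    def run(self, sanitized: str) -> GrouperResult:
        return GrouperResult(
            first_scores=self.alphabet_test(sanitized) if self.alphabet_test else None,
            second_scores=self.script_test(sanitized) if self.script_test else None,
        )


class LanguageDetectorModule:
    def run(self, result: GrouperResult) -> str:
        first, second = result.first_scores, result.second_scores
        if first is not None and second is not None:
            # One possible combination: average over the union of languages.
            langs = set(first) | set(second)
            combined = {
                lang: (first.get(lang, 0.0) + second.get(lang, 0.0)) / 2
                for lang in langs
            }
        else:
            combined = first if first is not None else second
        return max(combined, key=combined.get)


def toy_script_test(text: str) -> Scores:
    """Illustrative script test: share of Cyrillic versus Latin letters."""
    letters = [ch for ch in text if ch.isalpha()]
    total = len(letters) or 1
    cyr = sum(unicodedata.name(ch, "").startswith("CYRILLIC") for ch in letters)
    lat = sum(unicodedata.name(ch, "").startswith("LATIN") for ch in letters)
    return {"ru": cyr / total, "en": lat / total}


def detect(message: str, grouper: GrouperModule) -> str:
    sanitized = SanitizerModule().run(message)
    return LanguageDetectorModule().run(grouper.run(sanitized))
```

Passing the tests in as callables keeps the grouper independent of any particular scoring method, which is a design choice of this sketch rather than something the claim requires.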
20. An article, comprising:
a non-transitory computer-readable medium having instructions stored thereon that, when executed by one or more computers, cause the computers to perform operations comprising:
obtaining a text message;
removing non-language characters from the text message to generate a sanitized text message;
detecting at least one of an alphabet and a script present in the sanitized text message, wherein detecting comprises at least one of:
(i) performing an alphabet-based language detection test to determine a first set of scores, wherein each score in the first set of scores represents a likelihood that the sanitized text message comprises the alphabet for one of a plurality of different languages; and
(ii) performing a script-based language detection test to determine a second set of scores, wherein each score in the second set of scores represents a likelihood that the sanitized text message comprises the script for one of the plurality of different languages; and
identifying the language in the sanitized text message based on at least one of the first set of scores, the second set of scores, and a combination of the first and second sets of scores.
Specification