Method and system for language identification
First Claim
1. A system for language identification, comprising:
at least one processor;
at least one computer readable storage medium;
a feature set of a plurality of character strings of varying length with associated information;
the associated information including one or more significance scores for one of the character strings for one or more of a plurality of languages, wherein the significance scores include a basic significance score and an additional significance score, wherein the additional significance score is for application in response to detection of a characteristic in a syllable other than the character string within a word containing the character string, and wherein the characteristic comprises the syllable containing a letter matching a letter contained in a predetermined set of one or more letters; and
program code executable on the at least one processor and stored on the at least one computer readable storage medium, for detecting the character string from the feature set within a token from an input text and for detecting the characteristic in a syllable other than the character string within a word containing the character string within the input text responsive to detecting the character string within the input text.
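The interplay between the basic and the additional significance score can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the vowel-based syllable splitter, the reading of "a syllable other than the character string" as a syllable that does not overlap the matched string, the `Score` record, the numeric values, and the Turkish-style example in which the string `lar` earns extra weight when another syllable contains a back vowel.

```python
import re
from dataclasses import dataclass

# Assumption: a crude vowel inventory and syllable rule, good enough for a demo.
VOWELS = "aeiouöüı"
SYLLABLE = re.compile(rf"[^{VOWELS}]*[{VOWELS}]+(?:[^{VOWELS}]+$)?")


@dataclass(frozen=True)
class Score:
    """Hypothetical record of the scores attached to one character string."""
    language: str
    basic: float
    additional: float = 0.0
    trigger_letters: frozenset = frozenset()  # the predetermined set of letters


def syllable_spans(word):
    """Return (start, end) spans of naive syllables: consonants plus a vowel group."""
    return [m.span() for m in SYLLABLE.finditer(word)]


def score_word(word, feature, score):
    """Basic score if `feature` occurs in `word`; add the additional score when
    some syllable not overlapping the match contains a trigger letter."""
    start = word.find(feature)
    if start < 0:
        return 0.0
    end = start + len(feature)
    total = score.basic
    for s, e in syllable_spans(word):
        overlaps_match = s < end and start < e
        if not overlaps_match and set(word[s:e]) & score.trigger_letters:
            total += score.additional
            break  # apply the additional score at most once per word
    return total


# Invented scores: "lar" with a back vowel elsewhere in the word.
lar = Score("tr", basic=1.0, additional=0.5, trigger_letters=frozenset("aıou"))
print(score_word("kitaplar", "lar", lar))  # basic + additional
print(score_word("lar", "lar", lar))       # basic only, no other syllable
```

Note that the characteristic is only looked for after the character string has been found, which mirrors the "responsive to detecting the character string" wording of the claim.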
Abstract
A method and system for language identification are provided. The system includes a feature set of a plurality of character strings of varying length with associated information. The associated information includes one or more significance scores for a character string for one or more of a plurality of languages. Means are provided for detecting character strings from the feature set within a token from an input text. The system uses a finite-state device and the associated information is provided as glosses at the final nodes of the finite-state device for each character string. The associated information can also include significance scores based on linguistic rules.
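One way to picture a finite-state device with glosses at its final nodes is a character trie whose accepting nodes carry the per-language scores. The sketch below assumes that reading; the `TrieNode` class, the function names, the sample strings, and the score values are all hypothetical and are not drawn from the patent.

```python
class TrieNode:
    """One state of the device; `gloss` is set only on final nodes."""
    __slots__ = ("children", "gloss")

    def __init__(self):
        self.children = {}
        self.gloss = None  # e.g. {"en": 0.9}, language -> significance score


def build_trie(feature_set):
    """Build a trie from a mapping of character string -> gloss."""
    root = TrieNode()
    for string, gloss in feature_set.items():
        node = root
        for ch in string:
            node = node.children.setdefault(ch, TrieNode())
        node.gloss = gloss
    return root


def detect(root, token):
    """Return (start, string, gloss) for every feature string found in the token,
    trying each start position and walking the trie as far as it goes."""
    hits = []
    for start in range(len(token)):
        node = root
        for end in range(start, len(token)):
            node = node.children.get(token[end])
            if node is None:
                break
            if node.gloss is not None:
                hits.append((start, token[start:end + 1], node.gloss))
    return hits


# Invented feature set with strings of varying length.
trie = build_trie({
    "th": {"en": 0.4},
    "the": {"en": 0.9},
    "sch": {"de": 0.8, "nl": 0.5},
})
for hit in detect(trie, "then"):
    print(hit)
```

Because strings of varying length share prefixes in the trie, a single walk from one start position can report both a shorter and a longer match.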
35 Claims
1. A system for language identification, comprising:
at least one processor;
at least one computer readable storage medium;
a feature set of a plurality of character strings of varying length with associated information;
the associated information including one or more significance scores for one of the character strings for one or more of a plurality of languages, wherein the significance scores include a basic significance score and an additional significance score, wherein the additional significance score is for application in response to detection of a characteristic in a syllable other than the character string within a word containing the character string, and wherein the characteristic comprises the syllable containing a letter matching a letter contained in a predetermined set of one or more letters; and
program code executable on the at least one processor and stored on the at least one computer readable storage medium, for detecting the character string from the feature set within a token from an input text and for detecting the characteristic in a syllable other than the character string within a word containing the character string within the input text responsive to detecting the character string within the input text.
Dependent claims: 2-17
18. A method for language identification embodied in at least one computer system, comprising:
inputting, by the computer system, a text;
dividing, by the computer system, the input text into tokens;
detecting, by the computer system, character strings within the tokens from a feature set of a plurality of character strings of varying length with associated information, the associated information including one or more significance scores for a character string for one or more of a plurality of languages, wherein the significance scores include a basic significance score and an additional significance score for at least one of the character strings, wherein the additional significance score is for application in response to detection of a characteristic in a syllable other than the character string within a word containing the character string, and wherein the characteristic comprises the syllable containing a letter matching a letter contained in a predetermined set of one or more letters; and
detecting, by the computer system, the at least one characteristic in a syllable other than the character string within a word containing the character string within the input text responsive to detecting the character string within the input text.
Dependent claims: 19-34
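The first three steps of the method (inputting a text, dividing it into tokens, detecting feature strings within the tokens) can be sketched as follows. The regex tokenizer, the flat dictionary used as a feature set, the score values, and in particular the summing of scores per language are assumptions added for illustration; the claim itself does not say how the scores are combined, and the syllable-based additional score is left out here for brevity.

```python
import re
from collections import defaultdict

# Invented feature set: character string -> {language: basic significance score}.
FEATURES = {
    "the": {"en": 1.0},
    "sch": {"de": 1.0, "nl": 0.5},
    "ij": {"nl": 1.0},
}


def tokenize(text):
    """Divide the input text into lowercase tokens made of letters only."""
    return re.findall(r"[^\W\d_]+", text.lower())


def tally(text, features):
    """Sum, per language, the scores of every feature string found in any token.
    Summation is an assumed combination rule, not one stated in the claim."""
    totals = defaultdict(float)
    for token in tokenize(text):
        for string, scores in features.items():
            if string in token:
                for language, score in scores.items():
                    totals[language] += score
    return dict(totals)


print(tokenize("The school, 2 times!"))
print(tally("The school", FEATURES))
```

A real system would likely replace the inner loop over all feature strings with a single lookup structure, and would apply the characteristic check to each word in which a string was detected.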
35. A computer program product stored on a computer readable storage medium, the computer readable storage medium having program code stored thereon for language identification, the program code comprising:
program code for inputting a text;
program code for dividing the input text into tokens;
program code for detecting character strings within the tokens from a feature set of a plurality of character strings of varying length with associated information, the associated information including one or more significance scores for a character string for one or more of a plurality of languages, wherein the significance scores include a basic significance score and an additional significance score for at least one of the character strings, wherein the additional significance score is for application in response to detection of a characteristic in a syllable other than the character string within a word containing the character string, and wherein the characteristic comprises the syllable containing a letter matching a letter contained in a predetermined set of one or more letters; and
program code for detecting the at least one characteristic in a syllable other than the character string within a word containing the character string within the input text responsive to detecting the character string within the input text.
Specification