Method and system for language identification
Abstract
A method and system for language identification are provided. The system includes a feature set of a plurality of character strings of varying length with associated information. The associated information includes one or more significance scores for a character string for one or more of a plurality of languages. Means are provided for detecting character strings from the feature set within a token from an input text. The system uses a finite-state device and the associated information is provided as glosses at the final nodes of the finite-state device for each character string. The associated information can also include significance scores based on linguistic rules.
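The arrangement described in the abstract can be illustrated with a small sketch. It assumes the finite-state device is a plain character trie and that each gloss is a mapping from language code to significance score; the names `Node`, `add_feature`, and `detect`, and all example strings and scores, are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """One state of the trie; `gloss` is set only on final nodes."""
    children: Dict[str, "Node"] = field(default_factory=dict)
    gloss: Optional[Dict[str, float]] = None  # language -> significance score


def add_feature(root: Node, string: str, scores: Dict[str, float]) -> None:
    """Insert a character string and attach its scores as a gloss at the final node."""
    node = root
    for ch in string:
        node = node.children.setdefault(ch, Node())
    node.gloss = dict(scores)


def detect(root: Node, token: str) -> List[Tuple[int, str, Dict[str, float]]]:
    """Return (start, string, gloss) for every feature string found inside the token."""
    hits = []
    for start in range(len(token)):
        node = root
        for end in range(start, len(token)):
            node = node.children.get(token[end])
            if node is None:
                break  # no feature string continues with this character
            if node.gloss is not None:
                hits.append((start, token[start:end + 1], node.gloss))
    return hits


# Illustrative feature set only; the strings and scores are invented.
root = Node()
add_feature(root, "th", {"en": 0.6})
add_feature(root, "the", {"en": 0.9, "de": 0.1})
add_feature(root, "sch", {"de": 0.8})

hits = detect(root, "the")
```

Because strings of different lengths share a path from the root, a single walk over the token finds both "th" and "the" and returns the gloss stored at each final node.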
30 Claims
1. A system for language identification, comprising:
a feature set of a plurality of character strings of varying length with associated information;
the associated information including one or more significance scores for a character string for one or more of a plurality of languages;
means for detecting character strings from the feature set within a token from an input text.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A method for language identification, comprising:
inputting a text;
dividing the text into tokens;
detecting character strings within a token from a feature set of a plurality of character strings of varying length with associated information, the associated information including one or more significance scores for a character string for one or more of a plurality of languages.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
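The three steps of claim 12 (inputting a text, dividing it into tokens, and detecting feature strings within each token) can be sketched as below. The feature set contents, the regular-expression tokenizer, and the final step of summing scores per language are all assumptions made for illustration; the claim itself does not say how the detected scores are combined.

```python
import re
from collections import defaultdict

# Hypothetical feature set: character string -> {language: significance score}.
FEATURES = {
    "the": {"en": 1.0},
    "ing": {"en": 0.5},
    "sch": {"de": 1.0},
    "und": {"de": 0.75, "en": 0.25},
}


def tokenize(text):
    """Divide the input text into lowercase word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def detect_in_token(token, features):
    """Return every feature string that occurs as a substring of the token."""
    max_len = max(map(len, features))
    found = []
    for i in range(len(token)):
        # Only try substrings up to the longest feature string.
        for j in range(i + 1, min(len(token), i + max_len) + 1):
            candidate = token[i:j]
            if candidate in features:
                found.append(candidate)
    return found


def score_text(text, features):
    """Sum the significance scores of all detected strings, per language (assumption)."""
    totals = defaultdict(float)
    for token in tokenize(text):
        for string in detect_in_token(token, features):
            for lang, score in features[string].items():
                totals[lang] += score
    return dict(totals)


totals = score_text("The Schule und", FEATURES)
best = max(totals, key=totals.get)
```

Under these assumptions the language with the highest total would be reported as the identified language.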
23. A computer program product having a computer readable storage medium, the computer readable storage medium having program code stored thereon for language identification, the program code comprising:
program code for inputting a text;
program code for dividing the text into tokens; and
program code for detecting character strings within a token from a feature set of a plurality of character strings of varying length with associated information, the associated information including one or more significance scores for a character string for one or more of a plurality of languages.
24. A system for compiling a feature set, comprising:
means for compiling a plurality of character strings of varying length;
means for associating information with a character string, including means for allocating one or more significance scores for one or more of a plurality of languages.
25. A method for compiling a feature set, comprising:
compiling a plurality of character strings of varying length; and
associating information with a character string, including allocating one or more significance scores for one or more of a plurality of languages.
Dependent claims: 26, 27
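One way the compiling method of claim 25 might look in practice is sketched below. It assumes the character strings are substrings of 2 to 3 characters drawn from per-language word lists, and that a significance score is a string's share of occurrences in one language relative to all languages. Both choices, and the function names, are hypothetical; the claim does not prescribe how strings are selected or how scores are allocated.

```python
from collections import Counter


def substrings(word, min_len, max_len):
    """Yield every substring of the word whose length is within the given bounds."""
    for n in range(min_len, max_len + 1):
        for i in range(len(word) - n + 1):
            yield word[i:i + n]


def compile_feature_set(samples, min_len=2, max_len=3):
    """Build {string: {language: score}} from {language: [sample words]}.

    The score is the fraction of a string's occurrences that fall in each
    language (an assumed scoring scheme, used here only for illustration).
    """
    counts = {lang: Counter() for lang in samples}
    for lang, words in samples.items():
        for word in words:
            counts[lang].update(substrings(word, min_len, max_len))

    feature_set = {}
    all_strings = set().union(*counts.values())
    for string in all_strings:
        total = sum(counter[string] for counter in counts.values())
        feature_set[string] = {
            lang: counter[string] / total
            for lang, counter in counts.items()
            if counter[string]
        }
    return feature_set


# Tiny invented word lists; real training data would be far larger.
samples = {"en": ["the", "then"], "de": ["der", "den"]}
feature_set = compile_feature_set(samples)
```

With these sample words, "th" receives a score only for English, while "en" occurs once in each list and so receives an equal score for both languages.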
28. A computer program product having a computer readable storage medium, the computer readable storage medium having stored thereon program code for compiling a feature set, the program code comprising:
program code for compiling a plurality of character strings of varying length; and
program code for associating information with a character string, including allocating one or more significance scores for one or more of a plurality of languages.
29. A computer data signal embodied in a carrier wave, the computer data signal including program code for language identification, the program code comprising:
program code for inputting a text;
program code for dividing the text into tokens; and
program code for detecting character strings within a token from a feature set of a plurality of character strings of varying length with associated information, the associated information including one or more significance scores for a character string for one or more of a plurality of languages.
30. A computer data signal embodied in a carrier wave, the computer data signal including code for compiling a feature set, the program code comprising:
program code for compiling a plurality of character strings of varying length; and
program code for associating information with a character string, including allocating one or more significance scores for one or more of a plurality of languages.
Specification