Efficient language identification
Abstract
A system and methods of language identification of natural language text are presented. The system includes stored expected character counts and variances for a list of characters found in a natural language. Expected character counts and variances are stored for multiple languages to be considered during language identification. At run-time, one or more languages are identified for a text sample based on comparing actual and expected character counts. The present methods can be combined with upstream analyzing of Unicode ranges for characters in the text sample to limit the number of languages considered. Further, n-gram methods can be used in downstream processing to select the most probable language from among the languages identified by the present system and methods.
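The comparison of actual and expected character counts described in the abstract might be sketched as follows. This is a minimal illustration, not the patented implementation: the profile values, the `PROFILES` table, the function names, the window size of 100 characters, and the squared z-score threshold are all assumptions introduced here.

```python
from collections import Counter

# Hypothetical per-language profiles: character -> (expected count per 100
# characters, variance). The numbers are illustrative only, not measured.
PROFILES = {
    "en": {"e": (12.0, 9.0), "t": (9.0, 7.0), "h": (6.0, 5.0), "w": (2.0, 2.0)},
    "de": {"e": (16.0, 11.0), "n": (10.0, 8.0), "c": (3.0, 3.0), "ü": (1.0, 1.0)},
}


def deviation_score(text, profile, window=100):
    """Mean squared z-score between actual and expected character counts."""
    counts = Counter(text.lower())
    scale = len(text) / window  # rescale expectations to the sample length
    total = 0.0
    for ch, (mean, var) in profile.items():
        expected = mean * scale
        spread = var * scale  # assumes windows are independent of each other
        total += (counts[ch] - expected) ** 2 / spread
    return total / len(profile)


def candidate_languages(text, profiles=PROFILES, threshold=4.0):
    """Return languages whose expected counts fit the sample, best fit first."""
    if not text:
        return []
    scores = {lang: deviation_score(text, p) for lang, p in profiles.items()}
    kept = [lang for lang, score in scores.items() if score <= threshold]
    return sorted(kept, key=scores.get)
```

Because the function returns every language under the threshold rather than a single winner, its output could feed a downstream n-gram step of the kind the abstract mentions.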
20 Claims
1. A method of identifying the natural language of text comprising the steps of:
- receiving text documents written in a known natural language;
- counting occurrences of unique features in the text documents to generate expected feature counts; and
- using a probability distribution and the expected feature counts to generate probability values as a function of actual feature occurrence.
Dependent claims: 2, 3, 4, 5, 6
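The training steps recited in claim 1 could be sketched as below. The claim does not name a particular probability distribution; a binomial distribution is assumed here purely for illustration, and the function names, the use of single characters as features, and the `window` parameter are hypothetical.

```python
from collections import Counter
from math import comb


def expected_counts(documents, window=100):
    """Average occurrences of each character per `window` characters."""
    counts = Counter()
    total_chars = 0
    for doc in documents:
        counts.update(doc)
        total_chars += len(doc)
    if total_chars == 0:
        return {}
    return {ch: n * window / total_chars for ch, n in counts.items()}


def probability_table(expected, window=100):
    """For each feature, P(k occurrences in one window) for k = 0..window.

    Assumes a binomial model: each of the `window` positions holds the
    feature independently with probability mean / window.
    """
    table = {}
    for ch, mean in expected.items():
        p = mean / window
        table[ch] = [
            comb(window, k) * p**k * (1 - p) ** (window - k)
            for k in range(window + 1)
        ]
    return table
```

Precomputing the table once per language means that run-time identification only needs a lookup indexed by the observed count, rather than evaluating the distribution for every window.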
7. A method of identifying the natural language of text comprising the steps of:
- receiving a text sample written in an unidentified natural language;
- determining a current count for at least one feature in at least one window of characters in the text sample;
- obtaining expected probability information for the at least one feature for a plurality of candidate languages;
- identifying at least one language for the text sample from among the plurality of candidate languages based on the current count and the obtained expected probability information.
Dependent claims: 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
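The run-time steps of claim 7 might look like the following sketch. The stored table `EXPECTED`, its probability values, the `FLOOR` fallback for unseen counts, and the scoring by summed log-probabilities are assumptions made for this example; the claim itself does not specify how the counts and probabilities are combined.

```python
import math

# Hypothetical stored probabilities: language -> feature -> {count: P(count)}
# for a window of 10 characters. Values are illustrative, not trained.
EXPECTED = {
    "en": {"e": {0: 0.3, 1: 0.4, 2: 0.3}, "ñ": {0: 0.98, 1: 0.02}},
    "es": {"e": {0: 0.25, 1: 0.4, 2: 0.35}, "ñ": {0: 0.8, 1: 0.2}},
}
FLOOR = 1e-6  # probability assumed for counts missing from the table


def windows(text, size):
    """Yield consecutive, non-overlapping windows; the last may be shorter."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


def identify(text, expected=EXPECTED, window=10):
    """Rank candidate languages by summed log-probability of window counts."""
    scores = dict.fromkeys(expected, 0.0)
    for chunk in windows(text, window):
        for lang, features in expected.items():
            for feature, probs in features.items():
                current = chunk.count(feature)  # current count in this window
                scores[lang] += math.log(probs.get(current, FLOOR))
    return sorted(scores, key=scores.get, reverse=True)
```

Returning a ranked list instead of a single label keeps the "at least one language" wording of the claim: a caller can take the top entry or pass the leading few to a later disambiguation stage.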
18. A computer readable medium including instructions which, when implemented, cause a computer to perform language identification, the instructions comprising:
- a module adapted to construct and store for each of a plurality of natural languages a feature list and expected probability values associated with each of the listed features; and
- a module adapted to count actual features in a text sample and access the stored expected probability values associated with the actual features to identify at least one natural language for the text sample.
Dependent claims: 19, 20
Specification