Trigram-based method of language identification
Abstract
A mechanism for examining a body of text and identifying its language compares successive trigrams into which the body of text is parsed with a library of sets of trigrams. For a respective language-specific key set of trigrams, if the ratio of the number of trigrams in the text, for which a match in the key set has been found, to the total number of trigrams in the text is at least equal to a prescribed value, then the text is identified as being possibly written in the language associated with that respective key set. Each respective trigram key set is associated with a respectively different language and contains those trigrams that have been predetermined to occur at a frequency that is at least equal to a prescribed frequency of occurrence of trigrams for that respective language. Successive key sets for other languages are processed as above, and the language for which the percentage of matches is greatest, and for which the percentage exceeded the prescribed value as above, is selected as the language in which the body of text is written.
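The matching procedure summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the function names, the `threshold` parameter, and the tiny `toy_key_sets` are hypothetical, and the key sets are hand-picked rather than derived from real frequency data. The text does not say how ties between languages are resolved; this sketch keeps the first language seen.

```python
from typing import Dict, List, Optional, Set


def trigrams(text: str) -> List[str]:
    """Return every run of three successive character/space positions."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def identify_language(text: str,
                      key_sets: Dict[str, Set[str]],
                      threshold: float) -> Optional[str]:
    """Pick the language whose key set matches the largest share of the
    text's trigrams, provided that share is at least `threshold`."""
    grams = trigrams(text)
    if not grams:
        return None  # fewer than three positions: nothing to compare
    best_language, best_ratio = None, 0.0
    for language, key_set in key_sets.items():
        matches = sum(1 for g in grams if g in key_set)
        ratio = matches / len(grams)
        # Must reach the prescribed value and beat every earlier language.
        if ratio >= threshold and ratio > best_ratio:
            best_language, best_ratio = language, ratio
    return best_language


# Hypothetical, hand-picked key sets for illustration only.
toy_key_sets = {
    "english": {"the", "he ", " th", "e t"},
    "german": {"der", "er ", " de", "ie "},
}

print(identify_language("the cat and the hat", toy_key_sets, 0.2))  # prints: english
```

In the sample, 5 of the 17 trigrams appear in the toy English set (a ratio of about 0.29) and none in the toy German set, so raising `threshold` to 0.5 would make the function return `None`.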
6 Claims
1. A method of determining in what language a body of text is written comprising the steps of:
(a) parsing said body of text into a plurality of trigrams so that at least some of the trigrams overlap adjacent words, each trigram comprising the contents of three successive character/space positions of said body of text;
(b) comparing each of the trigrams that has been parsed from said body of text in step (a) with a plurality of trigram key sets, each respective trigram key set being associated with a respectively different language and containing those trigrams that have been predetermined to occur at a frequency that is at least equal to a prescribed frequency of occurrence of trigrams for that respective language; and
(c) in response to the ratio of the number of trigrams of said body of text compared in step (b), that correspond to trigrams of a respective key set, to the total number of trigrams of said body of text being at least equal to a prescribed value and greater than such ratios for alternative languages, identifying the body of text as being written in the language associated with said respective key set.
Dependent claims: 2, 3
4. A method of determining in what language a body of text is written, said body of text containing N sequential character/space positions, comprising the steps of:
(a) parsing said body of text into each of (N-2) trigrams that are sequentially definable by said N sequential character/space positions so that at least some of the trigrams overlap adjacent words;
(b) comparing each of the trigrams parsed in step (a) with a plurality of trigram key sets, each respective trigram key set being associated with a respectively different language and containing those trigrams that have been predetermined to occur at a frequency that is at least equal to a prescribed frequency of occurrence of trigrams for that respective language; and
(c) in response to the ratio of the number of trigrams of said body of text compared in step (b), that correspond to trigrams of a respective key set, to the total number of trigrams of said body of text, being at least equal to a prescribed value and exceeding such ratios for alternative languages, identifying the body of text as being written in the language associated with said respective set.
Dependent claims: 5
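The (N-2) count in step (a) of claim 4 follows from sliding a three-position window one position at a time across the text. The short Python sketch below illustrates that count and picks out the trigrams that straddle two words. The helper name `parse_trigrams` and the sample string are hypothetical, and treating "a space in the middle position" as the test for overlapping adjacent words is one simple reading, not something the claim defines.

```python
def parse_trigrams(text):
    """Slide a 3-position window over the text, one position at a time."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


sample = "to be or not"          # N = 12 character/space positions
grams = parse_trigrams(sample)

# A text of N positions yields N - 2 trigrams.
assert len(grams) == len(sample) - 2

# Trigrams with a space in the middle carry one character from each of
# two adjacent words.
spanning = [g for g in grams if g[1] == " "]
print(spanning)  # prints: ['o b', 'e o', 'r n']
```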
6. A method of generating a key set of trigrams to be used in determining in what language text has been written, based upon a comparison of trigrams, into which said text is to be parsed, with said set of trigrams, comprising the steps of:
(a) parsing a body of text of a prescribed language into a plurality of trigrams so that at least some of the trigrams overlap adjacent words, each trigram comprising the contents of three successive character/space positions of said body of text;
(b) counting the number of occurrences of each of the trigrams that has been parsed from said body of text in step (a);
(c) determining the ratio of each of the numbers of occurrences of the trigrams counted in step (b) to the total number of trigrams into which said body of text has been parsed in step (a), and deriving therefrom a characteristic representative of the frequency of trigram occurrence of each trigram that may be formed using the characters of said prescribed language and a space position;
(d) from the characteristic derived in step (c), identifying the frequency of occurrence of trigrams for said prescribed language that is associated with a selected frequency of occurrence; and
(e) generating, as said key set of trigrams, those trigrams whose frequency of occurrence is at least equal to the frequency of occurrence identified in step (d).
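Steps (a) through (e) of claim 6 can be sketched as follows. The claim does not spell out how the cutoff in step (d) is chosen, so this sketch assumes one possible reading: the cutoff is the frequency at which the most frequent trigrams together account for a chosen share of all trigram occurrences. The `coverage` parameter, the function name `build_key_set`, and the sample text are hypothetical and are not taken from the patent.

```python
from collections import Counter
from typing import Set


def build_key_set(training_text: str, coverage: float) -> Set[str]:
    """Build a key set from text in one known language (illustrative only)."""
    # (a) parse into overlapping three-position trigrams
    grams = [training_text[i:i + 3] for i in range(len(training_text) - 2)]
    if not grams:
        return set()
    # (b) count occurrences; (c) each count / total is that trigram's frequency
    counts = Counter(grams)
    total = len(grams)
    ranked = counts.most_common()  # most frequent first
    # (d) ASSUMPTION: walk down the ranking until the cumulative share of
    # occurrences reaches `coverage`; the count reached there is the cutoff.
    cutoff_count = ranked[-1][1]
    cumulative = 0
    for _, count in ranked:
        cumulative += count
        if cumulative >= coverage * total:
            cutoff_count = count
            break
    # (e) keep every trigram at least as frequent as the cutoff
    return {g for g, c in counts.items() if c >= cutoff_count}


key_set = build_key_set("the then the", coverage=0.5)
print(sorted(key_set))  # prints: [' th', 'the']
```

Comparing integer counts rather than floating-point ratios avoids rounding surprises; dividing each count by `total` gives the same ordering, so the resulting key set is unchanged.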
Specification