Detecting writing systems and languages
First Claim
1. A computer-implemented method comprising:
receiving text at a computer system having at least one processor;
detecting, at the computer system, a first language and a second language represented in the text by segmenting the text into n-grams of size x;
determining, at the computer system, whether the first language is substantially similar to the second language;
when the first language is substantially similar to the second language, processing, at the computer system, the text by segmenting the text into n-grams of size y to identify a particular language that is represented in the text, where y > x; and
when the first language is not substantially similar to the second language, identifying the particular language that is represented in the text based on the segmenting the text into n-grams of size x.
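The two-pass flow in this claim can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the claim does not say how n-grams are scored or how "substantially similar" is decided, so the frequency-profile scoring, the `similar_pairs` lookup table, and all function names here are hypothetical.

```python
from collections import Counter


def ngrams(text, n):
    """Segment text into overlapping character n-grams of size n."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(sample, n):
    """Relative frequency of each size-n n-gram in a training sample."""
    counts = Counter(ngrams(sample, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def score(text, sample, n):
    """Sum of the sample's n-gram frequencies over the text's n-grams."""
    profile = build_profile(sample, n)
    return sum(profile.get(gram, 0.0) for gram in ngrams(text, n))


def detect(text, samples, similar_pairs, x=2, y=4):
    """Detect with size-x n-grams; re-segment with size y only if needed.

    samples: dict mapping a language label to training text (needs >= 2).
    similar_pairs: set of frozensets naming languages assumed to be
    substantially similar (a hypothetical stand-in for the claim's test).
    """
    if y <= x:
        raise ValueError("y must be greater than x")
    ranked = sorted(samples, key=lambda lang: score(text, samples[lang], x),
                    reverse=True)
    first, second = ranked[0], ranked[1]
    if frozenset((first, second)) in similar_pairs:
        # Similar languages: larger n-grams carry more context, so use
        # them to choose between the two candidates.
        return max((first, second),
                   key=lambda lang: score(text, samples[lang], y))
    # Dissimilar languages: the size-x result is taken as final.
    return first
```

With toy samples such as `{"A": "abcxxxxxx", "B": "abxbc"}` and the text `"abc"`, the size-2 pass favors `"B"`, while a size-3 pass (triggered only when the pair is listed in `similar_pairs`) favors `"A"`, which shows how the larger n-gram size can change the outcome for closely related languages.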
2 Assignments
0 Petitions
Abstract
Methods, systems, and apparatus, including computer program products, for detecting writing systems and languages are disclosed. In one implementation, a method is provided. The method includes receiving text; identifying portions of the text as being non-repetitive, the identifying including: compressing underlying data of a first portion of the text, identifying a data compression ratio based on the amount of compression of the underlying data, and determining whether the first portion of the text is non-repetitive based on the data compression ratio; and identifying the first portion of the text as candidate text for use in language detection based on the portions of the text that are determined to be non-repetitive.
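The compression-based filter described in the abstract might look like the sketch below. It assumes zlib as the compressor, defines the ratio as compressed size divided by original size, and uses an arbitrary threshold of 0.5; the abstract specifies none of these, and the function names are hypothetical.

```python
import zlib


def compression_ratio(text):
    """Compressed size divided by original size of the UTF-8 bytes."""
    data = text.encode("utf-8")
    if not data:
        return 1.0  # Nothing to compress; treat as incompressible.
    return len(zlib.compress(data)) / len(data)


def is_non_repetitive(text, threshold=0.5):
    """Repetitive text compresses well, so a high ratio means non-repetitive.

    The threshold is an assumed value for illustration only.
    """
    return compression_ratio(text) > threshold


def candidate_portions(portions, threshold=0.5):
    """Keep only the portions judged non-repetitive for language detection."""
    return [p for p in portions if is_non_repetitive(p, threshold)]
```

A long run of repeated characters such as `"ab" * 500` compresses to a small fraction of its size and is filtered out, while a short sentence of ordinary prose barely compresses and is kept as candidate text.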
41 Citations
5 Claims
1. A computer-implemented method comprising:
receiving text at a computer system having at least one processor;
detecting, at the computer system, a first language and a second language represented in the text by segmenting the text into n-grams of size x;
determining, at the computer system, whether the first language is substantially similar to the second language;
when the first language is substantially similar to the second language, processing, at the computer system, the text by segmenting the text into n-grams of size y to identify a particular language that is represented in the text, where y > x; and
when the first language is not substantially similar to the second language, identifying the particular language that is represented in the text based on the segmenting the text into n-grams of size x.
Dependent claims: 2, 3
4. A computer program product, encoded on a tangible, non-transitory computer readable storage medium, operable to cause data processing apparatus to perform operations comprising:
receiving text;
detecting a first language and a second language represented in the text by segmenting the text into n-grams of size x;
determining whether the first language is substantially similar to the second language;
when the first language is substantially similar to the second language, processing the text by segmenting the text into n-grams of size y to identify a particular language that is represented in the text, where y > x; and
when the first language is not substantially similar to the second language, identifying the particular language that is represented in the text based on the segmenting the text into n-grams of size x.
5. A system, comprising:
a machine-readable storage device including a program product; and
one or more computers operable to execute the program product and perform operations comprising:
receiving text;
detecting a first language and a second language represented in the text by segmenting the text into n-grams of size x;
determining whether the first language is substantially similar to the second language;
when the first language is substantially similar to the second language, processing the text by segmenting the text into n-grams of size y to identify a particular language that is represented in the text, where y > x; and
when the first language is not substantially similar to the second language, identifying the particular language that is represented in the text based on the segmenting the text into n-grams of size x.
Specification