Detecting writing systems and languages
Abstract
Methods, systems, and apparatus, including computer program products, for detecting writing systems and languages are disclosed. In one implementation, a method is provided. The method includes receiving text; detecting a first segment of the text, where a substantial amount of the first segment represents a first language; detecting a second segment of the text, where a substantial amount of the second segment represents a second language; identifying scores for each n-gram of size x included in the text; and detecting an edge that identifies a transition from the first language to the second language in the text based on variations of the scores.
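The scoring step can be illustrated with a minimal sketch. The claims define each score as a difference between a first language likelihood and a second language likelihood for an n-gram of size x. Everything else here is an assumption made for illustration: the names `char_ngrams` and `ngram_scores` are hypothetical, the n-grams are taken over characters with a stride of one, the likelihoods come from toy dictionaries, and an n-gram missing from a dictionary is given a likelihood of 0.0.

```python
def char_ngrams(text, x):
    """Return the overlapping character n-grams of size x, in text order.

    Assumption: n-grams are taken over characters with a stride of one.
    """
    return [text[i:i + x] for i in range(len(text) - x + 1)]


def ngram_scores(text, x, first_model, second_model):
    """Score each n-gram as (first language likelihood - second language likelihood).

    first_model and second_model are hypothetical dicts mapping an n-gram to a
    likelihood; an n-gram absent from a dict is treated as likelihood 0.0.
    """
    scores = []
    for gram in char_ngrams(text, x):
        first_likelihood = first_model.get(gram, 0.0)
        second_likelihood = second_model.get(gram, 0.0)
        scores.append(first_likelihood - second_likelihood)
    return scores


# Toy likelihood tables, invented purely for this example.
first_language = {"th": 0.5, "he": 0.25}
second_language = {"ch": 0.5, "he": 0.125}
print(ngram_scores("the", 2, first_language, second_language))
```

Positive scores favor the first language and negative scores favor the second. A real system might prefer log-likelihoods with smoothing; the plain difference is used here only because it follows the wording of the claims most directly.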
3 Claims
1. A computer-implemented method comprising:
receiving text at a computer system having one or more processors;
detecting, at the computer system, a first segment of the text, where a substantial amount of the first segment represents a first language;
detecting, at the computer system, a second segment of the text, where a substantial amount of the second segment represents a second language;
obtaining, at the computer system, a first language likelihood for each n-gram of size x included in the text;
obtaining, at the computer system, a second language likelihood for each n-gram of size x included in the text;
identifying, at the computer system, a score for each n-gram of size x included in the text, where each score represents a difference between the first language likelihood and the second language likelihood; and
detecting, at the computer system, an edge including:
calculating a first average of the scores for a first group of consecutive n-grams, where consecutive n-grams are defined as including a third n-gram including a first left context and a first right context and a fourth n-gram including a second left context and a second right context, where the second left context is the first right context, where the first group of consecutive n-grams is defined as including a specified number of consecutive n-grams that includes an ending n-gram,
calculating a second average of the scores for a second group of consecutive n-grams, and the second group of consecutive n-grams is defined as including a same number of consecutive n-grams that includes a starting n-gram, where the ending n-gram is adjacent to the starting n-gram, and
identifying the edge based on a difference between the first average and the second average.
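The edge-detection step of claim 1 can be sketched as follows. The claim specifies two equal-sized groups of consecutive n-gram scores, the first ending at an n-gram adjacent to the one that starts the second, and an edge identified from the difference between the two averages. The function name `detect_edge`, the `window` parameter (standing in for the "specified number"), and the rule of picking the position with the largest absolute difference are assumptions made for this illustration; the claim does not state how the difference is turned into a decision.

```python
def detect_edge(scores, window):
    """Return (index, difference) for the strongest candidate edge, or None.

    scores: one score per consecutive n-gram, in text order.
    window: hypothetical name for the claim's "specified number" of n-grams.
    index i means the edge lies between scores[i - 1] (the ending n-gram of
    the first group) and scores[i] (the starting n-gram of the second group).
    """
    best = None
    # Both groups must fit entirely inside the score sequence.
    for i in range(window, len(scores) - window + 1):
        first_average = sum(scores[i - window:i]) / window
        second_average = sum(scores[i:i + window]) / window
        difference = abs(first_average - second_average)
        # Assumption: keep the position where the averages differ the most.
        if best is None or difference > best[1]:
            best = (i, difference)
    return best


# Scores that favor the first language, then the second.
print(detect_edge([1.0, 1.0, 1.0, -1.0, -1.0, -1.0], window=2))  # (3, 2.0)
```

Averaging over a window, rather than comparing individual scores, keeps a single ambiguous n-gram from being mistaken for a language transition. A practical variant might also require the difference to exceed a threshold before reporting an edge.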
2. A computer program product, encoded on a non-transitory computer-readable storage medium, operable to cause data processing apparatus to perform operations comprising:
receiving text;
detecting a first segment of the text, where a substantial amount of the first segment represents a first language;
detecting a second segment of the text, where a substantial amount of the second segment represents a second language;
obtaining a first language likelihood for each n-gram of size x included in the text;
obtaining a second language likelihood for each n-gram of size x included in the text;
identifying a score for each n-gram of size x included in the text, where each score represents a difference between the first language likelihood and the second language likelihood; and
detecting an edge including:
calculating a first average of the scores for a first group of consecutive n-grams, where consecutive n-grams are defined as including a third n-gram including a first left context and a first right context and a fourth n-gram including a second left context and a second right context, where the second left context is the first right context, where the first group of consecutive n-grams is defined as including a specified number of consecutive n-grams that includes an ending n-gram,
calculating a second average of the scores for a second group of consecutive n-grams, and the second group of consecutive n-grams is defined as including a same number of consecutive n-grams that includes a starting n-gram, where the ending n-gram is adjacent to the starting n-gram, and
identifying the edge based on a difference between the first average and the second average.
3. A system, comprising:
a machine-readable storage device including a program product; and
one or more computers operable to execute the program product and perform operations comprising:
receiving text;
detecting a first segment of the text, where a substantial amount of the first segment represents a first language;
detecting a second segment of the text, where a substantial amount of the second segment represents a second language;
obtaining a first language likelihood for each n-gram of size x included in the text;
obtaining a second language likelihood for each n-gram of size x included in the text;
identifying a score for each n-gram of size x included in the text, where each score represents a difference between the first language likelihood and the second language likelihood; and
detecting an edge including:
calculating a first average of the scores for a first group of consecutive n-grams, where consecutive n-grams are defined as including a third n-gram including a first left context and a first right context and a fourth n-gram including a second left context and a second right context, where the second left context is the first right context, where the first group of consecutive n-grams is defined as including a specified number of consecutive n-grams that includes an ending n-gram,
calculating a second average of the scores for a second group of consecutive n-grams, and the second group of consecutive n-grams is defined as including a same number of consecutive n-grams that includes a starting n-gram, where the ending n-gram is adjacent to the starting n-gram, and
identifying the edge based on a difference between the first average and the second average.
Specification