Detecting writing systems and languages

US 8,326,602 B2
Filed: 06/05/2009
Issued: 12/04/2012
Est. Priority Date: 06/05/2009
Status: Expired due to Fees

Abstract

Methods, systems, and apparatus, including computer program products, for detecting writing systems and languages are disclosed. In one implementation, a method is provided. The method includes receiving text; detecting a first segment of the text, where a substantial amount of the first segment represents a first language; detecting a second segment of the text, where a substantial amount of the second segment represents a second language; identifying scores for each n-gram of size x included in the text; and detecting an edge that identifies a transition from the first language to the second language in the text based on variations of the scores.
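The scoring step can be illustrated with a minimal sketch, assuming overlapping character n-grams and toy log-likelihood tables. The function names (`ngrams`, `score_ngrams`), the table values, and the fallback value for unseen n-grams are hypothetical and do not come from the patent. Each score is taken as the first-language likelihood minus the second-language likelihood, which is how the claims define it.

```python
from typing import Dict, List


def ngrams(text: str, x: int) -> List[str]:
    """Return every overlapping character n-gram of size x in text."""
    return [text[i:i + x] for i in range(len(text) - x + 1)]


def score_ngrams(text: str, x: int,
                 first_lang: Dict[str, float],
                 second_lang: Dict[str, float],
                 unseen: float = -10.0) -> List[float]:
    """Score each n-gram as first-language minus second-language likelihood.

    `unseen` is an assumed fallback log-likelihood for n-grams missing
    from a table; the patent record does not specify one.
    """
    return [first_lang.get(g, unseen) - second_lang.get(g, unseen)
            for g in ngrams(text, x)]


# Toy log-likelihood tables (made-up values, for illustration only).
english = {"th": -1.0, "he": -1.5}
french = {"le": -1.0, "he": -4.0}

scores = score_ngrams("thele", 2, english, french)
print(scores)  # [9.0, 2.5, 0.0, -9.0]
```

Under this sign convention, positive scores favor the first language and negative scores favor the second, so a transition between languages shows up as a shift in the sign or magnitude of nearby scores.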

3 Claims

1. A computer-implemented method comprising:
- receiving text at a computer system having one or more processors;
- detecting, at the computer system, a first segment of the text, where a substantial amount of the first segment represents a first language;
- detecting, at the computer system, a second segment of the text, where a substantial amount of the second segment represents a second language;
- obtaining, at the computer system, a first language likelihood for each n-gram of size x included in the text;
- obtaining, at the computer system, a second language likelihood for each n-gram of size x included in the text;
- identifying, at the computer system, a score for each n-gram of size x included in the text, where each score represents a difference between the first language likelihood and the second language likelihood; and
- detecting, at the computer system, an edge including:
  - calculating a first average of the scores for a first group of consecutive n-grams, where consecutive n-grams are defined as including a third n-gram including a first left context and a first right context and a fourth n-gram including a second left context and a second right context, where the second left context is the first right context, where the first group of consecutive n-grams is defined as including a specified number of consecutive n-grams that includes an ending n-gram,
  - calculating a second average of the scores for a second group of consecutive n-grams, and the second group of consecutive n-grams is defined as including a same number of consecutive n-grams that includes a starting n-gram, where the ending n-gram is adjacent to the starting n-gram, and
  - identifying the edge based on a difference between the first average and the second average.

2. A computer program product, encoded on a non-transitory computer-readable storage medium, operable to cause data processing apparatus to perform operations comprising:
- receiving text;
- detecting a first segment of the text, where a substantial amount of the first segment represents a first language;
- detecting a second segment of the text, where a substantial amount of the second segment represents a second language;
- obtaining a first language likelihood for each n-gram of size x included in the text;
- obtaining a second language likelihood for each n-gram of size x included in the text;
- identifying a score for each n-gram of size x included in the text, where each score represents a difference between the first language likelihood and the second language likelihood; and
- detecting an edge including:
  - calculating a first average of the scores for a first group of consecutive n-grams, where consecutive n-grams are defined as including a third n-gram including a first left context and a first right context and a fourth n-gram including a second left context and a second right context, where the second left context is the first right context, where the first group of consecutive n-grams is defined as including a specified number of consecutive n-grams that includes an ending n-gram,
  - calculating a second average of the scores for a second group of consecutive n-grams, and the second group of consecutive n-grams is defined as including a same number of consecutive n-grams that includes a starting n-gram, where the ending n-gram is adjacent to the starting n-gram, and
  - identifying the edge based on a difference between the first average and the second average.

3. A system, comprising:
- a machine-readable storage device including a program product; and
- one or more computers operable to execute the program product and perform operations comprising:
  - receiving text;
  - detecting a first segment of the text, where a substantial amount of the first segment represents a first language;
  - detecting a second segment of the text, where a substantial amount of the second segment represents a second language;
  - obtaining a first language likelihood for each n-gram of size x included in the text;
  - obtaining a second language likelihood for each n-gram of size x included in the text;
  - identifying a score for each n-gram of size x included in the text, where each score represents a difference between the first language likelihood and the second language likelihood; and
  - detecting an edge including:
    - calculating a first average of the scores for a first group of consecutive n-grams, where consecutive n-grams are defined as including a third n-gram including a first left context and a first right context and a fourth n-gram including a second left context and a second right context, where the second left context is the first right context, where the first group of consecutive n-grams is defined as including a specified number of consecutive n-grams that includes an ending n-gram,
    - calculating a second average of the scores for a second group of consecutive n-grams, and the second group of consecutive n-grams is defined as including a same number of consecutive n-grams that includes a starting n-gram, where the ending n-gram is adjacent to the starting n-gram, and
    - identifying the edge based on a difference between the first average and the second average.
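The edge-detection step recited in the claims can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the claims only say the edge is identified "based on a difference" between the two averages, so the choice to pick the largest absolute difference above a threshold is an assumption, as are the names `detect_edge`, `window`, and `threshold`.

```python
from typing import List, Optional


def detect_edge(scores: List[float], window: int,
                threshold: float) -> Optional[int]:
    """Return the index of the starting n-gram of the second group at the
    position where the two adjacent group averages differ the most.

    Returns None if no difference exceeds `threshold` (an assumed cutoff).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    best_index: Optional[int] = None
    best_diff = threshold
    # i is the starting n-gram of the second group; the first group ends
    # at i - 1, so the ending and starting n-grams are adjacent.
    for i in range(window, len(scores) - window + 1):
        first_avg = sum(scores[i - window:i]) / window
        second_avg = sum(scores[i:i + window]) / window
        diff = abs(first_avg - second_avg)
        if diff > best_diff:
            best_index, best_diff = i, diff
    return best_index


# Made-up scores: positive favors the first language, negative the second.
example_scores = [2.0, 2.0, 2.0, -2.0, -2.0, -2.0]
print(detect_edge(example_scores, window=2, threshold=1.0))  # 3
```

Both groups contain the same number of n-grams, matching the "same number of consecutive n-grams" wording. Averaging over a group rather than comparing single scores keeps one atypical n-gram from being mistaken for a language transition.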

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Sites, Richard L.
Primary Examiner(s): Shah, Paras D
Application Number: US12/479,522
Publication Number: US 20100312545A1
Time in Patent Office: 1,278 Days
Field of Search: 704/1-9, 715/255, 715/264, 715/265
US Class Current: 704/8
CPC Class Codes: G06F 40/263 Language identification
