Detecting writing systems and languages

US 8,468,011 B1
Filed: 06/05/2009
Issued: 06/18/2013
Est. Priority Date: 06/05/2009
Status: Expired due to Fees

Abstract

Methods, systems, and apparatus, including computer program products, for detecting writing systems and languages are disclosed. In one implementation, a method is provided. The method includes receiving text; identifying portions of the text as being non-repetitive, the identifying including: compressing underlying data of a first portion of the text, identifying a data compression ratio based on the amount of compression of the underlying data, and determining whether the first portion of the text is non-repetitive based on the data compression ratio; and identifying the first portion of the text as candidate text for use in language detection based on the portions of the text that are determined to be non-repetitive.
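
The repetitiveness test described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the abstract does not name a compression algorithm or a cutoff, so the use of zlib, the 0.5 threshold, and the function names `compression_ratio`, `is_non_repetitive`, and `candidate_portions` are all assumptions made for this example.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by original size, both in bytes.

    Highly repetitive text compresses well and gives a ratio near 0.
    Empty input returns 0.0 so that it is never selected as a candidate.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


def is_non_repetitive(text: str, threshold: float = 0.5) -> bool:
    # Assumption: 0.5 is an arbitrary illustrative cutoff, not a value
    # taken from the patent.
    return compression_ratio(text) > threshold


def candidate_portions(portions, threshold: float = 0.5):
    """Keep only the portions judged non-repetitive, preserving order."""
    return [p for p in portions if is_non_repetitive(p, threshold)]
```

A portion such as `"spam " * 200` shrinks to a small fraction of its size and is filtered out, while an ordinary sentence of about a hundred characters barely compresses and is kept as candidate text for language detection.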

5 Claims

1. A computer-implemented method comprising:
- receiving text at a computer system having at least one processor;
- detecting, at the computer system, a first language and a second language represented in the text by segmenting the text into n-grams of size x;
- determining, at the computer system, whether the first language is substantially similar to the second language;
- when the first language is substantially similar to the second language, processing, at the computer system, the text by segmenting the text into n-grams of size y to identify a particular language that is represented in the text, where y > x; and
- when the first language is not substantially similar to the second language, identifying the particular language that is represented in the text based on the segmenting the text into n-grams of size x.

2. The method of claim 1, where the first language and the second language belong to a same language family.

3. The method of claim 1, where the first language and the second language share a common linguistic structure.
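
The two-stage flow recited in claim 1 can be sketched as below. This is a toy illustration under stated assumptions, not the patented implementation: the claims do not say how n-grams are scored or how similarity is decided, so the character n-grams, the overlap score, the hand-written `SIMILAR` table, the tiny `SAMPLES` texts, the default sizes x=2 and y=4, and the choice to re-score only the two leading candidates in the second stage are all hypothetical.

```python
def ngrams(text, n):
    """Segment text into overlapping character n-grams of size n."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def score(text, sample, n):
    """Fraction of the text's n-grams that also occur in the sample."""
    profile = set(ngrams(sample, n))
    grams = ngrams(text, n)
    if not grams:
        return 0.0
    return sum(g in profile for g in grams) / len(grams)


def rank(text, samples, n):
    """Language codes ordered from best to worst score at n-gram size n."""
    return sorted(samples, key=lambda lang: score(text, samples[lang], n),
                  reverse=True)


def identify(text, samples, similar_pairs, x=2, y=4):
    """Return (language, n-gram size used); needs at least two languages."""
    if y <= x:
        raise ValueError("y must be greater than x")
    # Stage 1: detect a first and a second language with n-grams of size x.
    first, second = rank(text, samples, x)[:2]
    if frozenset((first, second)) in similar_pairs:
        # Stage 2: similar languages, so re-segment with the larger size y.
        finalists = {lang: samples[lang] for lang in (first, second)}
        return rank(text, finalists, y)[0], y
    # Dissimilar languages: the size-x result stands.
    return first, x


# Hypothetical training snippets and similarity table, for illustration only.
SAMPLES = {
    "id": "saya bisa berbicara bahasa indonesia karena",
    "ms": "saya boleh bercakap bahasa melayu kerana",
    "en": "i can speak the english language because",
}
SIMILAR = {frozenset(("id", "ms"))}
```

With these samples, a Malay input ranks Malay and Indonesian as the top two at size x, so the sketch falls through to the size-y pass; an English input is resolved at size x because its runner-up is not listed as similar.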

4. A computer program product, encoded on a tangible, non-transitory computer readable storage medium, operable to cause data processing apparatus to perform operations comprising:
- receiving text;
- detecting a first language and a second language represented in the text by segmenting the text into n-grams of size x;
- determining whether the first language is substantially similar to the second language;
- when the first language is substantially similar to the second language, processing the text by segmenting the text into n-grams of size y to identify a particular language that is represented in the text, where y > x; and
- when the first language is not substantially similar to the second language, identifying the particular language that is represented in the text based on the segmenting the text into n-grams of size x.

5. A system, comprising:
- a machine-readable storage device including a program product; and
- one or more computers operable to execute the program product and perform operations comprising:
  - receiving text;
  - detecting a first language and a second language represented in the text by segmenting the text into n-grams of size x;
  - determining whether the first language is substantially similar to the second language;
  - when the first language is substantially similar to the second language, processing the text by segmenting the text into n-grams of size y to identify a particular language that is represented in the text, where y > x; and
  - when the first language is not substantially similar to the second language, identifying the particular language that is represented in the text based on the segmenting the text into n-grams of size x.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Sites, Richard L.
Primary Examiner(s): Shah, Paras D
Application Number: US12/479,564
Time in Patent Office: 1,474 Days
Field of Search: 704/2-9, 715/256, 715/264, 715/265
US Class Current: 704/8
CPC Class Codes:
G06F 40/131 Fragmentation of text files...
G06F 40/263 Language identification
