Identifying language origin of words

US 8,185,376 B2
Filed: 03/20/2006
Issued: 05/22/2012
Est. Priority Date: 03/20/2006
Status: Active Grant

Abstract

The language of origin of a word is determined by analyzing non-uniform letter sequence portions of the word.

12 Claims

1. A method for determining a language of origin of a word comprising analyzing non-uniform letter sequence portions of the word, wherein analyzing comprises:
- using one or more processors of a computing system, segmenting the word into strings of letter chunks based on different criteria, the letter chunks being of non-uniform length of one or more letters;
  
  using one or more processors of a computing system, ascertaining a probability of the word belonging to a selected language by using a plurality of N-gram models based directly on the letter chunks segmented with the different criteria for each of a plurality of different languages, and providing results from using the plurality of N-gram models based directly on letter chunks extracted with the different criteria to a combined classifier that merges the results from the plurality of N-gram models to provide a hypothesis of the language of origin, wherein the combined classifier comprises a plurality of Gaussian mixture models wherein scores from multiple letter chunks models are treated as an eigenvector of a word and a Gaussian mixture model is provided for each of the plurality of different languages, and wherein the results from the plurality of N-gram models are scored by each of the Gaussian mixture models; and
  
  outputting the hypothesis of the language of origin of the word provided by the combined classifier.
- Dependent claims:
  - 2. The method of claim 1 wherein the step of ascertaining includes using an N-gram model based on syllable-based letter chunk.
  - 3. The method of claim 1 wherein the step of ascertaining includes using a list of selected syllables for the selected language.
  - 4. The method of claim 1 wherein the step of ascertaining includes using an N-gram model based on a language having a closed set of syllables.
  - 5. The method of claim 1, in which the step of ascertaining further comprises comparing the letter chunks with known closed sets of letter chunks corresponding to certain languages having closed sets of possible syllables, and using the known closed sets of letter chunks to detect whether letter chunks are valid and reject possible words that do not correspond to valid letter chunks in the closed sets.
  - 6. The method of claim 1, further comprising using a finite set of letter chunks with frequencies higher than a pre-set threshold in a list sorted in descending order of frequency, as base units in N-gram training in the N-gram model.
  - 7. The method of claim 1 further comprising selecting the word from within a context in a first language and identifying the word as being out of the vocabulary of the first language.
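
As an illustration of the chunk-based scoring that claims 1, 5, and 6 describe, the following is a minimal sketch and not the patented implementation. It assumes a greedy longest-match segmenter, a bigram model with add-one smoothing, and a dynamic-programming check against a closed syllable set. All names (`select_chunks`, `segment`, `fits_closed_set`, `ChunkBigramModel`) and the toy data are hypothetical.

```python
import math
from collections import Counter

def select_chunks(chunk_counts, threshold):
    """Keep chunks with frequency above the threshold, in descending order of frequency."""
    ranked = sorted(chunk_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [chunk for chunk, count in ranked if count > threshold]

def segment(word, inventory):
    """Greedy longest-match split into chunks; unknown material falls back to single letters."""
    chunks, i = [], 0
    max_len = max((len(c) for c in inventory), default=1)
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if size == 1 or piece in inventory:
                chunks.append(piece)
                i += size
                break
    return chunks

def fits_closed_set(word, syllables):
    """True if the whole word can be split into syllables drawn from the closed set."""
    reachable = [True] + [False] * len(word)
    for end in range(1, len(word) + 1):
        reachable[end] = any(
            reachable[start] and word[start:end] in syllables for start in range(end)
        )
    return reachable[-1]

class ChunkBigramModel:
    """Bigram model over letter chunks with add-one smoothing (an assumed smoothing choice)."""

    def __init__(self, inventory):
        self.inventory = set(inventory)
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = {"</w>"}

    def _sequence(self, word):
        return ["<w>"] + segment(word, self.inventory) + ["</w>"]

    def train(self, words):
        for word in words:
            seq = self._sequence(word)
            self.vocab.update(seq[1:])
            for prev, cur in zip(seq, seq[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, word):
        seq = self._sequence(word)
        size = len(self.vocab) + 1  # one extra slot for unseen chunks
        return sum(
            math.log((self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + size))
            for prev, cur in zip(seq, seq[1:])
        )

# Toy usage with made-up data: one model per language, compared by log-probability.
ja_model = ChunkBigramModel({"to", "kyo"})
ja_model.train(["tokyo", "kyoto"])
```

In a fuller system there would be one such model per language and per segmentation criterion, and the resulting log-probabilities would be passed on to the combined classifier rather than compared directly.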

8. A method for determining a language of origin of a word comprising analyzing non-uniform letter sequence portions of the word wherein analyzing comprises:
- using one or more processors of a computing system, segmenting the word into strings of letter chunks based on different criteria, the letter chunks being of non-uniform length of one or more letters;
  
  using one or more processors of a computing system, ascertaining a probability of the word belonging to a selected language by using a plurality of N-gram models based on the letter chunks segmented with the different criteria for each of a plurality of different languages, and providing results from using the plurality of N-gram models based on letter chunks extracted with the different criteria to a combined classifier that merges the results from the plurality of N-gram models to provide a hypothesis of the language of origin, wherein the combined classifier uses at least one of a first form of adaptive boosting and a second form of adaptive boosting, the first form of adaptive boosting comprising wherein a classifier is provided for and associated with each of the plurality of different languages, each classifier receiving the plurality of results and used to ascertain whether the word is from the associated language or not, and the second form of adaptive boosting comprising calculating a posterior probability for each language; and
  
  outputting the hypothesis of the language of origin of the word provided by the combined classifier.
- Dependent claims:
  - 9. The method of claim 8 wherein the step of ascertaining includes using an N-gram model based on at least one of MI (Mutual Information) and MDL (Minimum Description Length) letter chunk.
  - 10. The method of claim 8 wherein the step of ascertaining includes using an N-gram model based on LZ (Lempel-Ziv) letter chunk.
  - 11. The method of claim 8 wherein the step of ascertaining includes using an N-gram model based on syllable-based letter chunk.
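
Claim 10 refers to Lempel-Ziv letter chunks without stating which variant is meant. The sketch below assumes LZ78-style incremental parsing, in which each chunk is the shortest prefix of the remaining text that has not been seen before. The function name `lz78_chunks` is hypothetical, and the patent's actual procedure may differ.

```python
def lz78_chunks(text):
    """Split text into LZ78 phrases: each phrase is the shortest prefix not yet in the dictionary."""
    dictionary = set()
    phrases = []
    current = ""
    for ch in text:
        current += ch
        if current not in dictionary:
            dictionary.add(current)
            phrases.append(current)
            current = ""
    if current:
        # The text ended in the middle of a phrase that was already in the dictionary.
        phrases.append(current)
    return phrases

# Chunks grow longer as repeated letter sequences are encountered.
chunks = lz78_chunks("abababa")
```

Because the phrases have non-uniform length and always concatenate back to the input, they can serve as the base units of an N-gram model in the same way as syllable-based chunks.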

12. A method for determining a language of origin of a word comprising analyzing non-uniform letter sequence portions of the word, wherein analyzing comprises:
- using one or more processors of a computing system, segmenting the word into strings of letter chunks based on different criteria, the letter chunks being of non-uniform length of one or more letters;
  
  using one or more processors of a computing system, ascertaining a probability of the word belonging to a selected language by using a plurality of N-gram models based directly on the letter chunks segmented with the different criteria for each of a plurality of different languages, and providing results from using the plurality of N-gram models based directly on letter chunks extracted with the different criteria to a combined classifier that merges the results from the plurality of N-gram models to provide a hypothesis of the language of origin, wherein the combined classifier comprises a plurality of Gaussian mixture models wherein a Gaussian mixture model is provided for each of the plurality of different languages, and wherein the results from the plurality of N-gram models are scored by each of the Gaussian mixture models; and
  
  outputting the hypothesis of the language of origin of the word provided by the combined classifier.
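
The combined classifier in claims 1 and 12 scores the vector of N-gram results with one Gaussian mixture model per language. The following is a minimal sketch under stated assumptions: diagonal covariances, mixture parameters supplied by hand rather than trained, and the highest log-likelihood taken as the hypothesis. The names `gmm_log_likelihood` and `classify` and every number in the example are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance at point x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, components):
    """Log-likelihood of x under a mixture given as (weight, mean, variance) tuples."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components]
    top = max(terms)
    # Log-sum-exp keeps the computation stable for very negative terms.
    return top + math.log(sum(math.exp(t - top) for t in terms))

def classify(score_vector, gmms):
    """Return the language whose mixture gives the score vector the highest log-likelihood."""
    return max(gmms, key=lambda lang: gmm_log_likelihood(score_vector, gmms[lang]))

# Made-up mixtures over two-dimensional score vectors (one score per chunk model).
gmms = {
    "en": [(1.0, [-5.0, -6.0], [1.0, 1.0])],
    "ja": [(0.5, [-12.0, -3.0], [1.0, 1.0]), (0.5, [-10.0, -4.0], [1.0, 1.0])],
}
hypothesis = classify([-5.2, -5.9], gmms)
```

In practice the mixture parameters would be estimated from score vectors of training words with known origin, for example with expectation-maximization.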

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Chu, Min; Chen, Yi Ning; Kuo, Shiun-Zu; He, Xiaodong; Riley, Megan; Feige, Kevin E.; Gong, Yifan
Primary Examiner: Hudspeth, David R
Assistant Examiner: Baker, Matthew H
Application Number: US11/384,401
Publication Number: US 20070219777A1
Time in Patent Office: 2,255 Days
Field of Search: 704/1-10
US Class Current: 704/8
CPC Class Codes: G06F 40/263 (Language identification)