Efficient language identification

US 20060184357A1
Filed: 02/11/2005
Published: 08/17/2006
Est. Priority Date: 02/11/2005
Status: Active Grant

Abstract

A system and methods of language identification of natural language text are presented. The system includes stored expected character counts and variances for a list of characters found in a natural language. Expected character counts and variances are stored for multiple languages to be considered during language identification. At run-time, one or more languages are identified for a text sample based on comparing actual and expected character counts. The present methods can be combined with upstream analyzing of Unicode ranges for characters in the text sample to limit the number of languages considered. Further, n-gram methods can be used in downstream processing to select the most probable language from among the languages identified by the present system and methods.
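
The upstream Unicode-range step mentioned in the abstract could be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the `SCRIPT_RANGES` table, the language lists attached to each range, and the function name `candidate_languages` are all hypothetical and exist only for this example.

```python
from collections import Counter

# Hypothetical lookup table: (first code point, last code point, languages).
# The code point ranges follow standard Unicode blocks; the language lists
# are illustrative assumptions and do not come from the patent.
SCRIPT_RANGES = [
    (0x0041, 0x024F, ("English", "French", "German", "Spanish")),  # Latin
    (0x0370, 0x03FF, ("Greek",)),
    (0x0400, 0x04FF, ("Bulgarian", "Russian", "Ukrainian")),       # Cyrillic
    (0x0590, 0x05FF, ("Hebrew",)),
    (0x0600, 0x06FF, ("Arabic", "Persian", "Urdu")),
    (0x3040, 0x30FF, ("Japanese",)),                               # Hiragana, Katakana
    (0xAC00, 0xD7A3, ("Korean",)),                                 # Hangul syllables
]


def candidate_languages(sample):
    """Return the languages attached to the range that covers most letters."""
    hits = Counter()
    for ch in sample:
        if not ch.isalpha():
            continue  # digits, punctuation and spaces carry no script signal
        cp = ord(ch)
        for index, (lo, hi, _langs) in enumerate(SCRIPT_RANGES):
            if lo <= cp <= hi:
                hits[index] += 1
                break
    if not hits:
        return ()  # no letter fell inside a known range
    best_index, _count = hits.most_common(1)[0]
    return SCRIPT_RANGES[best_index][2]


print(candidate_languages("Привет, мир"))
print(candidate_languages("Hello, world"))
```

A filter of this kind only narrows the candidate set; the character-count comparison described in the abstract would then run over the languages that remain.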

20 Claims

1. A method of identifying the natural language of text comprising the steps of:
- receiving text documents written in a known natural language;
- counting occurrences of unique features in the text documents to generate expected feature counts; and
- using a probability distribution and the expected feature counts to generate probability values as a function of actual feature occurrence.

Dependent Claims (2, 3, 4, 5, 6)
- 2. The method of claim 1, wherein using a probability distribution comprises using a discrete or continuous probability distribution.
- 3. The method of claim 2, wherein using a probability distribution comprises using a binomial or Gaussian distribution.
- 4. The method of claim 1, and further comprises constructing a table of probability values for each of a plurality of candidate languages.
- 5. The method of claim 4, and further comprising:
  - receiving a text sample written in an unidentified natural language;
  - determining actual feature counts for some of the features in the text sample; and
  - accessing the tables of probability values to identify at least one of the candidate languages for the text sample based on the actual feature counts.
- 6. The method of claim 4, and further comprising scoring each candidate language by multiplying probability values associated with the actual feature counts.
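
One possible reading of claims 1 through 6 is sketched below: per-letter rates are estimated from text in a known language, a binomial distribution turns them into a table of probability values per candidate language, and a sample is scored by combining the table entries for its actual counts. All function names, the restriction to alphabetic characters, and the fixed `window` size are assumptions made for the example. Log probabilities are summed instead of multiplying raw values, which ranks languages identically while avoiding numeric underflow.

```python
import math
from collections import Counter


def expected_rates(training_text):
    """Per-letter occurrence rate estimated from text in a known language."""
    letters = [c for c in training_text.lower() if c.isalpha()]
    if not letters:
        raise ValueError("training text contains no letters")
    counts = Counter(letters)
    return {ch: n / len(letters) for ch, n in counts.items()}


def binomial_pmf(k, n, p):
    """P(exactly k occurrences in n characters) for a per-character rate p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def build_table(rates, window):
    """For each feature, precompute P(count = k) for k = 0..window."""
    return {ch: [binomial_pmf(k, window, p) for k in range(window + 1)]
            for ch, p in rates.items()}


def log_score(sample, table, window):
    """Sum of log probabilities; same ranking as multiplying the values."""
    letters = [c for c in sample.lower() if c.isalpha()][:window]
    actual = Counter(letters)
    total = 0.0
    for ch, probs in table.items():
        p = probs[actual.get(ch, 0)]
        total += math.log(max(p, 1e-300))  # floor avoids log(0)
    return total


def identify(sample, tables, window):
    """Return the candidate language whose table scores the sample highest."""
    return max(tables, key=lambda lang: log_score(sample, tables[lang], window))
```

Letters that appear in the sample but not in a language's table are ignored in this sketch; a fuller version would probably penalize them.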

7. A method of identifying the natural language of text comprising the steps of:
- receiving a text sample written in an unidentified natural language;
- determining a current count for at least one feature in at least one window of characters in the text sample;
- obtaining expected probability information for the at least one feature for a plurality of candidate languages;
- identifying at least one language for the text sample from among the plurality of candidate languages based on the current count and the obtained expected probability information.

Dependent Claims (8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
- 8. The method of claim 7, wherein obtaining expected probability information comprises receiving probability values based on a binomial distribution for the at least one feature.
- 9. The method of claim 7, and further comprising sampling a training corpus to estimate the expected probability information comprising average counts of the at least one feature per a selectively sized sample.
- 10. The method of claim 7, and further comprising using n-gram language profiles of the at least one identified language to identify the most probable language for the test sample.
- 11. The method of claim 7, and further comprising using Unicode values to identify the plurality of candidate languages.
- 12. The method of claim 7, wherein identifying the at least one language comprises generating language scores for each of the plurality of candidate languages based on comparing the current count of the at least one feature to the obtained expected probability information.
- 13. The method of claim 12, wherein generating language scores comprises estimating a joint probability of a plurality of the features having the determined current counts in the text sample.
- 14. The method of claim 7, wherein generating language scores comprises positively scoring a candidate language when the current count for the at least one character falls within a variance of the obtained expected probability information.
- 15. The method of claim 7, wherein generating language scores comprises negative scoring a candidate language when the current count for the at least one character falls outside a variance of the obtained expected probability information.
- 16. The method of claim 7, wherein generating scores comprises negatively scoring a candidate language for a non-occurrence of an expected feature in the sample text.
- 17. The method of claim 7, further comprising estimating a confidence score for each of the identified at least one language.
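
The window-based scoring of claims 7, 9 and 14 through 16 might look something like the sketch below. It is one interpretation only: "within a variance" is read here as "within `tolerance` standard deviations of the mean count", and the point values (+1, -1, -2), the `min_var` floor, and every function name are assumptions introduced for illustration.

```python
from collections import Counter
from statistics import mean, pvariance


def train_profile(corpus, window):
    """Mean and variance of each character's count per window-sized sample."""
    chunks = [corpus[i:i + window]
              for i in range(0, len(corpus) - window + 1, window)]
    if not chunks:
        raise ValueError("corpus is shorter than one window")
    profile = {}
    for ch in set(corpus):
        counts = [chunk.count(ch) for chunk in chunks]
        profile[ch] = (mean(counts), pvariance(counts))
    return profile


def score_window(text_window, profile, tolerance=1.0, min_var=0.25):
    """Score one window of the sample against one language profile."""
    actual = Counter(text_window)
    score = 0
    for ch, (mu, var) in profile.items():
        count = actual.get(ch, 0)
        # Assumed band: `tolerance` standard deviations, with a variance floor
        # so that zero-variance training data still allows a small deviation.
        band = tolerance * max(var, min_var) ** 0.5
        if count == 0 and mu >= 1:
            score -= 2  # an expected feature did not occur at all
        elif abs(count - mu) <= band:
            score += 1  # count falls inside the expected band
        else:
            score -= 1  # count falls outside the expected band
    return score


def rank_languages(text_window, profiles):
    """Candidate languages ordered from best to worst score."""
    scored = [(score_window(text_window, p), lang)
              for lang, p in profiles.items()]
    return [lang for _score, lang in sorted(scored, reverse=True)]
```

Integer scores like these are cheap to compute, which fits the efficiency goal in the title, but they do not amount to the joint probability estimate of claim 13.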

18. A computer readable medium including instructions which, when implemented, cause a computer to perform language identification, the instructions comprising:
- a module adapted to construct and store for each of a plurality of natural languages a feature list and expected probability values associated with each of the listed features; and
- a module adapted to count actual features in a text sample and access the stored expected probability values associated with the actual features to identify at least one natural language for the text sample.

Dependent Claims (19, 20)
- 19. The computer readable medium of claim 18, and further comprising a module adapted to determine confidence scores for the identified natural languages and to rank natural languages based on the confidence scores.
- 20. The computer readable medium of claim 18, further comprising a module adapted to access an n-gram language profile for each of the at least one identified natural language to perform language identification on the text sample.
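
The downstream n-gram step of claims 10 and 20 is not detailed in this record, so the following is only a generic sketch of how a shortlist could be re-ranked with character trigram profiles. The overlap measure, the function names, and the shape of the `profiles` dictionary are assumptions; the overlap value is a stand-in and should not be read as the confidence score of claims 17 and 19.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Relative frequency of each character n-gram in whitespace-normalized text."""
    text = " ".join(text.lower().split())
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}


def rank_by_ngrams(sample, shortlist, profiles, n=3):
    """Order shortlisted languages by n-gram overlap with the sample."""
    sample_profile = ngram_profile(sample, n)

    def overlap(lang):
        reference = profiles[lang]
        # Shared probability mass: a value between 0 and 1.
        return sum(min(weight, reference.get(gram, 0.0))
                   for gram, weight in sample_profile.items())

    scored = [(lang, overlap(lang)) for lang in shortlist]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Tiny illustrative training texts; real profiles would need far more data.
profiles = {
    "English": ngram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "Spanish": ngram_profile("el rapido zorro marron salta sobre el perro perezoso y el gato"),
}
print(rank_by_ngrams("the dog and the fox", ["English", "Spanish"], profiles))
```

Because only the shortlisted languages are compared, the more expensive n-gram lookup runs on a handful of candidates instead of the full language set.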

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: William D. Ramsey; Kevin R. Powell; Patricia M. Schmid
Granted Patent: US 8,027,832 B2
US Class Current: 704/9
CPC Class Codes:
G06F 40/263   Language identification
