Systems and methods for translating Chinese pinyin to Chinese characters

US 7,478,033 B2
Filed: 03/16/2004
Issued: 01/13/2009
Est. Priority Date: 03/16/2004
Status: Active Grant

Abstract

Systems and methods to process and translate pinyin to Chinese characters and words are disclosed. A Chinese language model is trained by extracting unknown character strings from Chinese inputs, e.g., documents and/or user inputs/queries, determining valid words from the unknown character strings, and generating a transition matrix based on the Chinese inputs for predicting a word string given the context. A method for translating a pinyin input generally includes generating a set of Chinese character strings from the pinyin input using a Chinese dictionary including words derived from the Chinese inputs and a language model trained based on the Chinese inputs, each character string having a weight indicating the likelihood that the character string corresponds to the pinyin input. An ambiguous user input may be classified as non-pinyin or pinyin by identifying an ambiguous pinyin/non-pinyin ASCII word in the user input and analyzing the context to classify the user input.
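
As an illustration of the translation step described in the abstract, the following is a minimal sketch of how candidate character strings could be generated from a pinyin input and weighted by a language model. The names `PINYIN_TO_WORDS`, `BIGRAM`, `FLOOR`, and `translate` are hypothetical, the toy dictionary and probabilities are invented for the example, and scoring by a product of bigram probabilities is an assumption, not the method recited in the patent.

```python
from itertools import product

# Hypothetical toy dictionary: each pinyin word maps to candidate Chinese words.
PINYIN_TO_WORDS = {
    "wo": ["我", "窝"],
    "ai": ["爱", "矮"],
    "ni": ["你", "泥"],
}

# Hypothetical bigram probabilities P(current | previous); "<s>" marks the start.
BIGRAM = {
    ("<s>", "我"): 0.6,
    ("<s>", "窝"): 0.1,
    ("我", "爱"): 0.5,
    ("爱", "你"): 0.7,
}

FLOOR = 1e-4  # assumed fallback probability for word pairs not in BIGRAM


def translate(pinyin_words):
    """Return (character string, weight) pairs, most likely first."""
    candidates = []
    # Enumerate every combination of candidate words for the pinyin input.
    for words in product(*(PINYIN_TO_WORDS[p] for p in pinyin_words)):
        weight = 1.0
        # Multiply the conditional probability of each word given its predecessor.
        for prev, cur in zip(("<s>",) + words, words):
            weight *= BIGRAM.get((prev, cur), FLOOR)
        candidates.append(("".join(words), weight))
    return sorted(candidates, key=lambda c: c[1], reverse=True)


best, weight = translate(["wo", "ai", "ni"])[0]
print(best)  # 我爱你
```

Exhaustive enumeration is only practical for short inputs; a longer input would normally call for a dynamic-programming search over the same weights.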

88 Citations

13 Claims

1. A method for training a Chinese language model from Chinese character inputs, comprising:
- segmenting the Chinese character inputs into valid words and unknown character strings, wherein the valid words are entries in a Chinese dictionary and the unknown character strings are not entries in the Chinese dictionary, and wherein the unknown character strings comprise Chinese characters;
  
  for each unknown character string;
  
  determining a corresponding first frequency of occurrence for the unknown character string and a corresponding second frequency of occurrence for each of the Chinese characters in the unknown character string;
  
  comparing the first frequency of occurrence to the second frequency of occurrence to determine an information gain value;
  
  comparing the information gain value to a threshold;
  
  identifying the unknown character string as a new valid word when the information gain is greater than the threshold;
  
  adding the new valid word to the Chinese dictionary to create an updated Chinese dictionary;
  
  resegmenting the Chinese character inputs into Chinese words, wherein the Chinese words are entries in the updated Chinese dictionary; and
  
  generating a transition matrix of conditional probabilities for predicting a word given a context based on the resegmenting.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The method of claim 1, wherein the transition matrix of conditional probabilities is generated based on n-gram counts generated from the Chinese character inputs where n≧1.
  - 3. The method of claim 2, wherein the n-gram counts include the counts of n-tuples of adjacent and non-adjacent words in the set of Chinese character inputs.
  - 4. The method of claim 2, wherein the n-gram counts include the number of occurrences of each n-word sequence.
  - 5. The method of claim 1, wherein the set of Chinese character inputs includes at least one of user Chinese inputs and a set of Chinese input documents.
  - 6. The method of claim 5, wherein the set of Chinese character inputs includes a set of user Chinese character queries to a web search engine.
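
The training steps recited in claim 1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the claims do not give a formula for the information gain value, so the sketch assumes a log ratio of the string's relative frequency to the product of its characters' relative frequencies. Greedy longest-match segmentation, the bigram (n = 2) transition matrix, and the names `segment`, `information_gain`, and `train` are likewise hypothetical choices.

```python
import math
from collections import Counter, defaultdict


def segment(text, dictionary):
    """Greedy longest-match segmentation (an assumed strategy).

    Returns (token, is_known) pairs. Consecutive characters that match no
    dictionary entry are merged into a single unknown character string.
    """
    max_len = max(map(len, dictionary), default=1)
    out, unknown, i = [], "", 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in dictionary:
                if unknown:
                    out.append((unknown, False))
                    unknown = ""
                out.append((text[i:i + n], True))
                i += n
                break
        else:  # no dictionary entry starts at position i
            unknown += text[i]
            i += 1
    if unknown:
        out.append((unknown, False))
    return out


def information_gain(string, string_counts, char_counts, total_chars):
    """Assumed score: log(P(string) / product of P(char) for each char)."""
    p_string = string_counts[string] / total_chars
    p_chars = math.prod(char_counts[c] / total_chars for c in string)
    return math.log(p_string / p_chars)


def train(corpus, dictionary, threshold):
    dictionary = set(dictionary)
    char_counts = Counter(c for line in corpus for c in line)
    total = sum(char_counts.values())
    unknown_counts = Counter(
        tok for line in corpus
        for tok, known in segment(line, dictionary) if not known
    )
    # Promote unknown strings whose score exceeds the threshold to new words.
    for s in unknown_counts:
        if len(s) > 1 and information_gain(s, unknown_counts, char_counts, total) > threshold:
            dictionary.add(s)
    # Resegment with the updated dictionary and count word bigrams.
    # Simplification: any leftover unknown runs are kept as tokens.
    bigrams = defaultdict(Counter)
    for line in corpus:
        words = [tok for tok, _ in segment(line, dictionary)]
        for prev, cur in zip(["<s>"] + words, words):
            bigrams[prev][cur] += 1
    # Normalize each row into conditional probabilities P(cur | prev).
    matrix = {
        prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
        for prev, nxt in bigrams.items()
    }
    return dictionary, matrix


corpus = ["我爱谷歌", "谷歌很好", "我用谷歌"]
new_dict, matrix = train(corpus, {"我", "爱", "很", "好", "用"}, threshold=1.0)
print("谷歌" in new_dict)  # True
```

In this toy corpus the string 谷歌 occurs three times among twelve characters, which gives a score of log 4 under the assumed formula, so it clears the threshold of 1.0 and is added before resegmentation.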

7. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium on which are stored instructions executable on a computer processor for training a Chinese model from Chinese character inputs, the instructions including:
- segmenting the Chinese character inputs into valid words and unknown character strings, wherein the valid words are entries in a Chinese dictionary and the unknown character strings are not entries in the Chinese dictionary, and wherein the unknown character strings comprise Chinese characters;
  
  for each unknown character string;
  
  determining a corresponding first frequency of occurrence for the unknown character string and a corresponding second frequency of occurrence for each of the Chinese characters in the unknown character string;
  
  comparing the first frequency of occurrence to the second frequency of occurrence to determine an information gain value;
  
  comparing the information gain value to a threshold;
  
  identifying the unknown character string as a new valid word when the information gain is greater than the threshold;
  
  adding the new valid word to the Chinese dictionary to create an updated Chinese dictionary;
  
  resegmenting the Chinese character inputs into Chinese words, wherein the Chinese words are entries in the updated Chinese dictionary; and
  
  generating a transition matrix of conditional probabilities for predicting a word given a context based on the resegmenting.

8. A system for training a Chinese language model from Chinese character inputs, comprising:
- a segmenter configured to segment the Chinese character inputs into valid words and unknown character strings, wherein the valid words are entries in a Chinese dictionary, and the unknown character strings are not entries in the Chinese dictionary and comprise Chinese characters, and the segmenter is further configured to resegment the Chinese character inputs into Chinese words in response to new valid words being identified from the unknown character strings, wherein the Chinese words are entries in an updated Chinese dictionary that includes the new valid words;
  
  a new word analyzer configured to determine a corresponding first frequency of occurrence for the unknown character string and a corresponding second frequency of occurrence for each of the Chinese characters in the unknown character string, compare the first frequency of occurrence to the second frequency of occurrence to determine an information gain value, compare the information gain value to a threshold, identify the unknown character string as a new valid word when the information gain value is greater than the threshold, and add the new valid word to the Chinese dictionary to create the updated Chinese dictionary; and
  
  a Chinese language model training module configured to generate a transition matrix of conditional probabilities for predicting a word string given a context based on the resegmenting.
- Dependent claims (9, 10, 11, 12, 13):
  - 9. The system of claim 8, wherein the new word analyzer is further configured to generate n-gram counts from the Chinese character inputs where n≧1 and to generate the transition matrix of conditional probabilities based on the n-gram counts.
  - 10. The system of claim 9, wherein the n-gram counts include the counts of n-tuples of adjacent and non-adjacent words in the set of Chinese character inputs.
  - 11. The system of claim 9, wherein the n-gram counts include the number of occurrences of each n-word sequence.
  - 12. The system of claim 8, wherein the set of Chinese character inputs includes at least one of user Chinese inputs and a set of Chinese documents.
  - 13. The system of claim 12, wherein the set of Chinese character inputs includes a set of user Chinese character queries to a web search engine.
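
Claims 3 and 10 refer to counts of n-tuples of adjacent and non-adjacent words. One possible reading, shown here for n = 2 only, is to count ordered word pairs separated by a bounded number of intervening words. The function name `pair_counts` and the `max_gap` parameter are hypothetical, and the claims do not specify any particular window size.

```python
from collections import Counter


def pair_counts(words, max_gap=1):
    """Count ordered word pairs with up to max_gap words between them.

    A gap of 0 means the two words are adjacent; a gap greater than 0
    means they are non-adjacent. Keys are (left, right, gap) tuples.
    """
    counts = Counter()
    for i, left in enumerate(words):
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j < len(words):
                counts[(left, words[j], gap)] += 1
    return counts


counts = pair_counts(["我", "爱", "谷歌"], max_gap=1)
print(counts[("我", "爱", 0)])    # 1, adjacent pair
print(counts[("我", "谷歌", 1)])  # 1, non-adjacent pair with one word between
```

Keeping the gap in the key lets adjacent and non-adjacent counts be used separately or summed, depending on how the transition matrix is estimated.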


Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Zhu, Hongjun; Zhu, Huican; Wu, Jun
Primary Examiner(s): Han, Qi
Application Number: US10/802,479
Publication Number: US 20050209844A1
Time in Patent Office: 1,764 Days
Field of Search: 704/2, 704/1, 704/9, 704/10
US Class Current: 704/2
CPC Class Codes: G06F 40/129 Handling non-Latin characte...
