Systems and methods for translating Chinese pinyin to Chinese characters
First Claim
1. A method for training a Chinese language model from Chinese character inputs, comprising:
segmenting the Chinese character inputs into valid words and unknown character strings, wherein the valid words are entries in a Chinese dictionary and the unknown character strings are not entries in the Chinese dictionary, and wherein the unknown character strings comprise Chinese characters;
for each unknown character string:
determining a corresponding first frequency of occurrence for the unknown character string and a corresponding second frequency of occurrence for each of the Chinese characters in the unknown character string;
comparing the first frequency of occurrence to the second frequency of occurrence to determine an information gain value;
comparing the information gain value to a threshold;
identifying the unknown character string as a new valid word when the information gain is greater than the threshold;
adding the new valid word to the Chinese dictionary to create an updated Chinese dictionary;
resegmenting the Chinese character inputs into Chinese words, wherein the Chinese words are entries in the updated Chinese dictionary; and
generating a transition matrix of conditional probabilities for predicting a word given a context based on the resegmenting.
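The new-word discovery steps of this claim can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the claim does not give a formula for the information gain value, so the sketch assumes a pointwise-mutual-information style score (the log ratio of the string's observed frequency to the product of its characters' frequencies). The greedy longest-match segmenter, the function names `segment`, `information_gain`, and `learn_new_words`, and the toy corpus in the usage note are likewise hypothetical.

```python
import math
from collections import Counter


def segment(text, dictionary, max_len=4):
    """Greedy longest-match segmentation (an assumed strategy).

    Returns (valid_words, unknown_strings). Consecutive characters that
    match no dictionary entry are merged into one unknown string.
    """
    words, unknown, buf = [], [], ""
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in dictionary:
                if buf:  # close the pending unknown run
                    unknown.append(buf)
                    buf = ""
                words.append(text[i:i + n])
                i += n
                break
        else:  # no dictionary entry starts at position i
            buf += text[i]
            i += 1
    if buf:
        unknown.append(buf)
    return words, unknown


def information_gain(string, string_counts, char_counts, total_chars):
    """Assumed PMI-style score: log(P(string) / product of P(char))."""
    p_string = string_counts[string] / total_chars
    p_independent = math.prod(char_counts[c] / total_chars for c in string)
    return math.log(p_string / p_independent)


def learn_new_words(corpus, dictionary, threshold):
    """Return an updated dictionary (a set) that includes accepted new words."""
    string_counts, char_counts = Counter(), Counter()
    for text in corpus:
        _, unknown = segment(text, dictionary)
        string_counts.update(unknown)  # first frequency: whole strings
        char_counts.update(text)       # second frequency: single characters
    total_chars = sum(char_counts.values())
    new_words = {
        s for s in string_counts
        if information_gain(s, string_counts, char_counts, total_chars) > threshold
    }
    return dictionary | new_words
```

For example, with the toy corpus `["我喜欢谷歌", "谷歌很好", "我用谷歌"]`, a base dictionary `{"我", "喜欢", "很", "好", "用"}`, and a threshold of 1.0, the string 谷歌 is accepted as a new word, and resegmenting the corpus with the updated dictionary leaves no unknown strings.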
Abstract
Systems and methods to process and translate pinyin to Chinese characters and words are disclosed. A Chinese language model is trained by extracting unknown character strings from Chinese inputs, e.g., documents and/or user inputs/queries, determining valid words from the unknown character strings, and generating a transition matrix based on the Chinese inputs for predicting a word string given the context. A method for translating a pinyin input generally includes generating a set of Chinese character strings from the pinyin input using a Chinese dictionary including words derived from the Chinese inputs and a language model trained based on the Chinese inputs, each character string having a weight indicating the likelihood that the character string corresponds to the pinyin input. An ambiguous user input may be classified as non-pinyin or pinyin by identifying an ambiguous pinyin/non-pinyin ASCII word in the user input and analyzing the context to classify the user input.
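The translation step described in the abstract, generating weighted Chinese character strings from a pinyin input, might look roughly like the Python sketch below. Everything concrete here is an assumption made for illustration: the toy `PINYIN_DICT` and `BIGRAM` tables, the floor probability for unseen word pairs, the beam width, and the use of a summed log probability as the weight. The abstract only states that each candidate string carries a weight indicating how likely it is to correspond to the pinyin input.

```python
import math

# Hypothetical toy data: pinyin syllable sequences mapped to dictionary words.
PINYIN_DICT = {
    ("gu", "ge"): ["谷歌", "古格"],
    ("gu",): ["古", "谷"],
    ("ge",): ["歌", "个"],
    ("di", "tu"): ["地图"],
    ("di",): ["地"],
    ("tu",): ["图"],
}

# Hypothetical bigram probabilities P(word | previous word); "<s>" marks the start.
BIGRAM = {
    ("<s>", "谷歌"): 0.2,
    ("谷歌", "地图"): 0.5,
}
FLOOR = 1e-4  # assumed probability for word pairs missing from BIGRAM


def translate(syllables, beam_width=5):
    """Return candidate (character_string, weight) pairs, best first.

    The weight is a log probability under the toy bigram model.
    """
    n = len(syllables)
    # beams[i] holds the best partial hypotheses covering the first i syllables.
    beams = [[] for _ in range(n + 1)]
    beams[0] = [(0.0, [])]
    for i in range(1, n + 1):
        hypotheses = []
        for j in range(i):
            key = tuple(syllables[j:i])
            for word in PINYIN_DICT.get(key, []):
                for logp, words in beams[j]:
                    prev = words[-1] if words else "<s>"
                    step = math.log(BIGRAM.get((prev, word), FLOOR))
                    hypotheses.append((logp + step, words + [word]))
        hypotheses.sort(key=lambda h: h[0], reverse=True)
        beams[i] = hypotheses[:beam_width]
    return [("".join(words), logp) for logp, words in beams[n]]
```

Calling `translate(["gu", "ge", "di", "tu"])` ranks 谷歌地图 first under these toy tables, because the two-word path is the only one that avoids the floor probability.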
88 Citations
13 Claims
1. A method for training a Chinese language model from Chinese character inputs, comprising:
segmenting the Chinese character inputs into valid words and unknown character strings, wherein the valid words are entries in a Chinese dictionary and the unknown character strings are not entries in the Chinese dictionary, and wherein the unknown character strings comprise Chinese characters;
for each unknown character string:
determining a corresponding first frequency of occurrence for the unknown character string and a corresponding second frequency of occurrence for each of the Chinese characters in the unknown character string;
comparing the first frequency of occurrence to the second frequency of occurrence to determine an information gain value;
comparing the information gain value to a threshold;
identifying the unknown character string as a new valid word when the information gain is greater than the threshold;
adding the new valid word to the Chinese dictionary to create an updated Chinese dictionary;
resegmenting the Chinese character inputs into Chinese words, wherein the Chinese words are entries in the updated Chinese dictionary; and
generating a transition matrix of conditional probabilities for predicting a word given a context based on the resegmenting.
Dependent claims: 2, 3, 4, 5, 6
7. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium on which are stored instructions executable on a computer processor for training a Chinese model from Chinese character inputs, the instructions including:
segmenting the Chinese character inputs into valid words and unknown character strings, wherein the valid words are entries in a Chinese dictionary and the unknown character strings are not entries in the Chinese dictionary, and wherein the unknown character strings comprise Chinese characters;
for each unknown character string:
determining a corresponding first frequency of occurrence for the unknown character string and a corresponding second frequency of occurrence for each of the Chinese characters in the unknown character string;
comparing the first frequency of occurrence to the second frequency of occurrence to determine an information gain value;
comparing the information gain value to a threshold;
identifying the unknown character string as a new valid word when the information gain is greater than the threshold;
adding the new valid word to the Chinese dictionary to create an updated Chinese dictionary;
resegmenting the Chinese character inputs into Chinese words, wherein the Chinese words are entries in the updated Chinese dictionary; and
generating a transition matrix of conditional probabilities for predicting a word given a context based on the resegmenting.
8. A system for training a Chinese language model from Chinese character inputs, comprising:
a segmenter configured to segment the Chinese character inputs into valid words and unknown character strings, wherein the valid words are entries in a Chinese dictionary, and the unknown character strings are not entries in the Chinese dictionary and comprise Chinese characters, and the segmenter is further configured to resegment the Chinese character inputs into Chinese words in response to new valid words being identified from the unknown character strings, wherein the Chinese words are entries in an updated Chinese dictionary that includes the new valid words;
a new word analyzer configured to determine a corresponding first frequency of occurrence for the unknown character string and a corresponding second frequency of occurrence for each of the Chinese characters in the unknown character string, compare the first frequency of occurrence to the second frequency of occurrence to determine an information gain value, compare the information gain value to a threshold, identify the unknown character string as a new valid word when the information gain value is greater than the threshold, and add the new valid word to the Chinese dictionary to create the updated Chinese dictionary; and
a Chinese language model training module configured to generate a transition matrix of conditional probabilities for predicting a word string given a context based on the resegmenting.
Dependent claims: 9, 10, 11, 12, 13
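The transition matrix that the training module generates can be illustrated with a short Python sketch. It assumes the simplest reading of "context", namely the single preceding word (a bigram model), and estimates each conditional probability by relative frequency without smoothing. The function name `transition_matrix`, the `<s>` start marker, and the nested-dictionary representation are hypothetical choices for this example; the claims do not specify the context length or the estimation method.

```python
from collections import Counter, defaultdict


def transition_matrix(segmented_sentences):
    """Estimate P(word | previous word) from resegmented sentences.

    segmented_sentences: iterable of word lists, as produced by resegmenting
    the inputs with the updated dictionary.
    Returns a nested dict: matrix[previous_word][word] -> probability.
    """
    pair_counts = defaultdict(Counter)
    for words in segmented_sentences:
        prev = "<s>"  # assumed sentence-start marker
        for word in words:
            pair_counts[prev][word] += 1
            prev = word
    matrix = {}
    for prev, following in pair_counts.items():
        total = sum(following.values())
        # Relative-frequency estimate; each row sums to 1.
        matrix[prev] = {word: count / total for word, count in following.items()}
    return matrix
```

Given the two segmented sentences `["我", "喜欢", "谷歌"]` and `["我", "用", "谷歌"]`, the row for 我 assigns probability 0.5 to each of 喜欢 and 用. A production model would normally add smoothing so that unseen word pairs do not receive zero probability.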
Specification