Systems and methods for translating Chinese pinyin to Chinese characters
Abstract
Systems and methods to process and translate pinyin to Chinese characters and words are disclosed. A Chinese language model is trained by extracting unknown character strings from Chinese inputs, e.g., documents and/or user inputs/queries, determining valid words from the unknown character strings, and generating a transition matrix based on the Chinese inputs for predicting a word string given the context. A method for translating a pinyin input generally includes generating a set of Chinese character strings from the pinyin input using a Chinese dictionary including words derived from the Chinese inputs and a language model trained based on the Chinese inputs, each character string having a weight indicating the likelihood that the character string corresponds to the pinyin input. An ambiguous user input may be classified as non-pinyin or pinyin by identifying an ambiguous pinyin/non-pinyin ASCII word in the user input and analyzing the context to classify the user input.
232 Citations
47 Claims
1. A method for training a Chinese language model from Chinese inputs, comprising:
extracting unknown character strings from a set of Chinese inputs;
determining valid words from the unknown character strings by comparing frequencies of occurrence of the unknown character strings with frequencies of occurrence of individual characters of the unknown character string; and
generating a transition matrix of conditional probabilities for predicting a word given a context.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
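The frequency comparison recited in claim 1 can be illustrated with a minimal Python sketch. The claim does not state which statistic is compared, so the ratio used here (occurrences of the candidate string divided by occurrences of its rarest character) and the names `find_valid_words` and `min_ratio` are assumptions made for illustration only, not the patented method.

```python
from collections import Counter


def find_valid_words(candidates, corpus, min_ratio=0.8):
    """Keep candidates that occur almost as often as their rarest character.

    Assumption: a string whose characters rarely appear outside the string
    is likely a real word. The ratio test and threshold are hypothetical.
    """
    # Frequency of every individual character across all inputs.
    char_counts = Counter(ch for text in corpus for ch in text)
    valid = []
    for cand in candidates:
        if not cand:
            continue  # nothing to compare for an empty string
        # Frequency of the whole candidate string across all inputs.
        string_count = sum(text.count(cand) for text in corpus)
        rarest_char_count = min(char_counts[ch] for ch in cand)
        if rarest_char_count and string_count / rarest_char_count >= min_ratio:
            valid.append(cand)
    return valid


corpus = ["我喜欢谷歌搜索", "谷歌地图很好", "唱歌很好"]
print(find_valid_words(["谷歌", "歌很"], corpus))
```

In this toy corpus the string 谷歌 appears every time its rarest character does, so it passes the test, while 歌很 appears only once against two occurrences of 很 and is rejected.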
9. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium on which are stored instructions executable on a computer processor, the instructions including:
extracting unknown character strings from a set of Chinese inputs;
determining valid words from the unknown character strings by comparing frequencies of occurrence of the unknown character strings with frequencies of occurrence of individual characters of the unknown character string; and
generating a transition matrix of conditional probabilities for predicting a word string given a context.
10. A system for training a Chinese language model, comprising:
a segmenter configured to segment unknown character strings from a set of Chinese inputs;
a new word analyzer configured to determine valid words from the unknown character strings by comparing frequencies of occurrence of the unknown character strings with frequencies of occurrence of individual characters of the unknown character string; and
a Chinese language model training module configured to generate a transition matrix of conditional probabilities for predicting a word string given a context.
Dependent claims: 11, 12, 13, 14, 15, 16, 17
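A transition matrix of conditional probabilities, as recited in claims 1, 9 and 10, can be sketched as follows. The claims do not fix what the "context" is; this sketch assumes the context is the single preceding word (a bigram model) and that the inputs are already segmented into words. The function name `build_transition_matrix` is hypothetical.

```python
from collections import Counter, defaultdict


def build_transition_matrix(segmented_inputs):
    """Estimate P(next word | previous word) from segmented inputs.

    Assumption: the context is one preceding word. The claims leave the
    context open, so this is only one possible reading.
    """
    pair_counts = defaultdict(Counter)
    for words in segmented_inputs:
        # Count each adjacent (previous, next) word pair.
        for prev, nxt in zip(words, words[1:]):
            pair_counts[prev][nxt] += 1
    matrix = {}
    for prev, followers in pair_counts.items():
        total = sum(followers.values())
        # Normalize each row so its probabilities sum to 1.
        matrix[prev] = {word: count / total for word, count in followers.items()}
    return matrix


inputs = [["我", "喜欢", "音乐"], ["我", "喜欢", "电影"], ["我", "看", "电影"]]
matrix = build_transition_matrix(inputs)
print(matrix["我"])
```

The matrix is stored as a nested dictionary rather than a dense array because most word pairs never co-occur, so a sparse representation keeps the sketch small.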
18. A method for translating a pinyin input to at least one Chinese character string, comprising:
generating a set of character strings from the pinyin input, each character string having a weight associated therewith indicating a likelihood that the character string corresponds to the pinyin input, the generating includes utilizing a Chinese dictionary including words extracted from a set of Chinese inputs and a language model trained based on the set of Chinese inputs.
Dependent claims: 19, 20, 21, 22, 23, 24, 25, 26, 27, 28
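The weighted candidate generation of claim 18 can be sketched in Python under several assumptions: the pinyin input is already split into word-sized units, the dictionary maps each unit to candidate Chinese words, and the language model is a bigram table of the kind described for the training claims. All names (`decode_pinyin`, `dictionary`, `unigram`, `bigram`, `floor`) and all probability values are hypothetical. The exhaustive enumeration is chosen for brevity; it is not stated in the claims, and a practical decoder would more likely use a dynamic-programming or beam search.

```python
from itertools import product


def decode_pinyin(pinyin_words, dictionary, unigram, bigram, floor=1e-6):
    """Return (character string, weight) pairs, highest weight first.

    Assumption: weight = P(first word) * product of bigram probabilities,
    with `floor` standing in for unseen words and transitions.
    """
    if not pinyin_words:
        return []
    # Candidate Chinese words for each pinyin unit.
    options = [dictionary.get(p, []) for p in pinyin_words]
    results = []
    for words in product(*options):
        weight = unigram.get(words[0], floor)
        for prev, nxt in zip(words, words[1:]):
            weight *= bigram.get(prev, {}).get(nxt, floor)
        results.append(("".join(words), weight))
    return sorted(results, key=lambda item: item[1], reverse=True)


dictionary = {"xihuan": ["喜欢", "稀罕"], "yinyue": ["音乐"]}
unigram = {"喜欢": 0.6, "稀罕": 0.1}
bigram = {"喜欢": {"音乐": 0.5}}
for text, weight in decode_pinyin(["xihuan", "yinyue"], dictionary, unigram, bigram):
    print(text, weight)
```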
29. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium on which are stored instructions executable on a computer processor, the instructions including:
generating a set of character strings from the pinyin input, each character string having a weight associated therewith indicating the likelihood that the character string corresponds to the pinyin input, the generating includes utilizing a Chinese dictionary including words extracted from a set of Chinese inputs and a language model trained based on the set of Chinese inputs.
30. A system for translating a pinyin input to at least one Chinese character string, comprising:
a pinyin-word decoder configured to generate a set of character strings from the pinyin input, each character string having a weight associated therewith indicating the likelihood that the character string corresponds to the pinyin input, the pinyin-word decoder being further configured to utilize a Chinese dictionary that includes words extracted from a set of Chinese inputs and a language model trained based on the set of Chinese inputs.
Dependent claims: 31, 32, 33, 34, 35
36. A pinyin classifier for classifying a user input, comprising:
a database of words that are valid both in non-pinyin and in pinyin; and
a classification engine configured to identify an ambiguous word in the user input selected from the database of words and to analyze context words of the user input to selectively classify the user input as non-pinyin or as pinyin.
Dependent claims: 37, 38, 39
40. A method for pinyin classification of a user input, comprising:
identifying an ambiguous word in the user input, the ambiguous word being selected from a database of n-grams that are valid both in non-pinyin and in pinyin; and
analyzing context words of the user input to selectively classify the user input as non-pinyin or as pinyin.
Dependent claims: 41, 42, 43
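The classification of claims 36 and 40 can be illustrated with a small Python sketch. The claims do not say how the context words are analyzed; the simple vote between known pinyin-only and non-pinyin-only context words, the three tiny word sets, the function name `classify_input`, and the extra "undecided" outcome are all assumptions introduced for illustration.

```python
# Hypothetical toy word lists; a real database would be far larger.
AMBIGUOUS = {"women", "can", "long", "song", "hang"}   # valid in both
PINYIN_ONLY = {"zhongguo", "xihuan", "nihao", "yinyue"}
ENGLISH_ONLY = {"the", "for", "shoes", "lyrics", "and", "rights"}


def classify_input(user_input):
    """Classify an input containing an ambiguous word by voting on context.

    Returns None when no ambiguous word is present, otherwise "pinyin",
    "non-pinyin", or "undecided" (an assumption for tied votes).
    """
    tokens = user_input.lower().split()
    if not any(token in AMBIGUOUS for token in tokens):
        return None
    # Context words are the tokens other than the ambiguous ones.
    context = [token for token in tokens if token not in AMBIGUOUS]
    pinyin_votes = sum(token in PINYIN_ONLY for token in context)
    english_votes = sum(token in ENGLISH_ONLY for token in context)
    if pinyin_votes > english_votes:
        return "pinyin"
    if english_votes > pinyin_votes:
        return "non-pinyin"
    return "undecided"


print(classify_input("women shoes"))
print(classify_input("women xihuan yinyue"))
```

Here "women" is ambiguous because it is both an English word and the pinyin spelling of 我们, so the surrounding words decide the outcome.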
44. A method for presenting possible translations of a user input, comprising:
providing a hyperlink for each possible translation of the user input, the user input and each possible translation of the user input being in different languages or language formats.
Dependent claims: 45, 46, 47
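One way to picture the presentation step of claim 44 is a function that renders each candidate translation as an HTML link. The claim does not specify where the links point; the search URL `https://example.com/search`, the `q` parameter, and the function name `translation_links` are placeholders assumed for this sketch.

```python
from html import escape
from urllib.parse import urlencode


def translation_links(translations, base_url="https://example.com/search"):
    """Build one HTML anchor per possible translation.

    Assumption: each link issues a query for the translated text at a
    placeholder URL; the real target is not specified in the claim.
    """
    links = []
    for text in translations:
        # Percent-encode the translation so it is safe inside a URL.
        href = f"{base_url}?{urlencode({'q': text})}"
        links.append(f'<a href="{escape(href)}">{escape(text)}</a>')
    return links


for link in translation_links(["音乐", "银月"]):
    print(link)
```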
Specification