Systems and methods for translating Chinese pinyin to Chinese characters

  • US 7,478,033 B2
  • Filed: 03/16/2004
  • Issued: 01/13/2009
  • Est. Priority Date: 03/16/2004
  • Status: Active Grant
First Claim

1. A method for training a Chinese language model from Chinese character inputs, comprising:

    segmenting the Chinese character inputs into valid words and unknown character strings, wherein the valid words are entries in a Chinese dictionary and the unknown character strings are not entries in the Chinese dictionary, and wherein the unknown character strings comprise Chinese characters;

    for each unknown character string:

    determining a corresponding first frequency of occurrence for the unknown character string and a corresponding second frequency of occurrence for each of the Chinese characters in the unknown character string;

    comparing the first frequency of occurrence to the second frequency of occurrence to determine an information gain value;

    comparing the information gain value to a threshold;

    identifying the unknown character string as a new valid word when the information gain is greater than the threshold;

    adding the new valid word to the Chinese dictionary to create an updated Chinese dictionary;

    resegmenting the Chinese character inputs into Chinese words, wherein the Chinese words are entries in the updated Chinese dictionary; and

    generating a transition matrix of conditional probabilities for predicting a word given a context based on the resegmenting.
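
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made only for illustration: segmentation uses greedy forward maximum matching, the information gain value is taken to be the log ratio of the string's relative frequency to the product of its characters' relative frequencies, the "context" is a single preceding word (a bigram model), and the names `segment`, `information_gain`, `train`, and the start marker `<s>` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def segment(text, dictionary):
    """Greedy forward maximum matching (an assumed segmentation strategy).

    Returns (token, is_valid) pairs; runs of characters that match no
    dictionary entry are merged into one unknown character string.
    """
    max_len = max(map(len, dictionary), default=1)
    tokens, unknown, i = [], "", 0
    while i < len(text):
        match = None
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in dictionary:
                match = text[i:i + n]
                break
        if match is None:
            unknown += text[i]          # extend the current unknown string
            i += 1
        else:
            if unknown:
                tokens.append((unknown, False))
                unknown = ""
            tokens.append((match, True))
            i += len(match)
    if unknown:
        tokens.append((unknown, False))
    return tokens

def information_gain(string, string_count, char_counts, total_chars):
    """Assumed measure: log of P(string) over the product of P(char)."""
    p_string = string_count / total_chars
    p_independent = math.prod(char_counts[c] / total_chars for c in string)
    return math.log(p_string / p_independent)

def train(corpus, dictionary, threshold):
    # First and second frequencies of occurrence.
    char_counts = Counter(c for line in corpus for c in line)
    total_chars = sum(char_counts.values())
    unknown_counts = Counter(
        tok for line in corpus
        for tok, valid in segment(line, dictionary) if not valid
    )
    # Promote unknown strings whose gain exceeds the threshold.
    updated = set(dictionary)
    for s, count in unknown_counts.items():
        if information_gain(s, count, char_counts, total_chars) > threshold:
            updated.add(s)
    # Resegment with the updated dictionary and count word bigrams.
    # Strings that remain unknown are kept as tokens in this sketch.
    bigrams = defaultdict(Counter)
    for line in corpus:
        words = [tok for tok, _ in segment(line, updated)]
        for prev, cur in zip(["<s>"] + words, words):
            bigrams[prev][cur] += 1
    # Normalize counts into conditional probabilities P(word | previous word).
    transition = {
        prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
        for prev, nxt in bigrams.items()
    }
    return updated, transition

corpus = ["我喜欢北京", "我学习北京", "北京我喜欢"]
updated, transition = train(corpus, {"我", "喜欢", "学习"}, threshold=1.0)
```

In this toy corpus the string 北京 is absent from the starting dictionary, so it first appears as an unknown character string; its gain under the assumed measure exceeds the threshold, it is added to the dictionary, and the resegmented text then treats it as a single word when the bigram probabilities are computed. A production system would likely need smoothing and a larger context, neither of which is attempted here.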
