Method and apparatus for creating a language model and kana-kanji conversion

US 8,744,833 B2
Filed: 06/23/2006
Issued: 06/03/2014
Est. Priority Date: 06/24/2005
Status: Active Grant

First Claim

1. A computer-readable storage medium having computer-executable instructions stored thereon for creating a language model and performing Kana-Kanji conversion, wherein the computer-executable instructions cause a computer that executes the instructions to:

receive a Kana character string;

divide the Kana character string into substrings and generate Kanji candidates for each substring;

obtain a plurality of trigram probabilities of Kanji candidates;

determine whether any of the plurality of trigram probabilities is above a first threshold and select a Kanji candidate if a trigram probability is above the first threshold;

when it is determined none of the plurality of trigram probabilities is above the first threshold;

obtain a plurality of bigram probabilities of Kanji candidates; and

determine whether any of the plurality of bigram probabilities is above a second threshold and select a Kanji candidate if a bigram probability is above the second threshold;

when it is determined none of the plurality of bigram probabilities is above the second threshold;

select a Kanji candidate based on cluster bigram probabilities of the Kanji candidates, wherein at least one cluster in the cluster bigram probabilities includes combining the same Kanji-Kana pairs with different parts-of-speech; and

display the selected Kanji candidates based on an order of precedence.
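
The back-off order recited in the claim (trigram first, then bigram, then cluster bigram) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary-based probability tables, and the choice to pick the highest-probability entry above each threshold are assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical probability tables keyed by tuples of Kanji candidates.
Probs = Dict[Tuple[str, ...], float]


def best_above(probs: Probs, threshold: float) -> Optional[Tuple[str, ...]]:
    """Return the highest-probability key strictly above threshold, or None."""
    qualifying = {key: p for key, p in probs.items() if p > threshold}
    if not qualifying:
        return None
    return max(qualifying, key=lambda key: qualifying[key])


def select_candidates(
    trigrams: Probs,
    bigrams: Probs,
    cluster_bigrams: Probs,
    first_threshold: float,
    second_threshold: float,
) -> Tuple[str, Tuple[str, ...]]:
    """Try trigrams, then bigrams, then fall back to cluster bigrams."""
    choice = best_above(trigrams, first_threshold)
    if choice is not None:
        return "trigram", choice
    choice = best_above(bigrams, second_threshold)
    if choice is not None:
        return "bigram", choice
    # Final fallback: no threshold is applied to the cluster bigram scores.
    best = max(cluster_bigrams, key=lambda key: cluster_bigrams[key])
    return "cluster_bigram", best


# Illustrative numbers only; they are not taken from the patent.
level, picked = select_candidates(
    trigrams={("今日", "は", "晴れ"): 0.02},
    bigrams={("今日", "は"): 0.30, ("京", "は"): 0.05},
    cluster_bigrams={("今日", "は"): 0.40},
    first_threshold=0.10,
    second_threshold=0.20,
)
print(level, picked)
```

Returning the level alongside the selection makes it easy to see which stage of the back-off produced a given candidate.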

Abstract

A method for creating a language model that prevents the deterioration in quality caused by the conventional back-off to unigram. Parts-of-speech with the same display and reading are obtained from a storage device (206). A cluster (204) is created by combining the obtained parts-of-speech. The created cluster (204) is stored in the storage device (206). In addition, when an instruction (214) for dividing the cluster is inputted, the cluster stored in the storage device (206) is divided (210) in accordance with the inputted instruction (212). Two of the clusters stored in the storage device are combined (218), and a probability of occurrence of the combined clusters in the text corpus is calculated (222). The combined cluster is associated with the bigram indicating the calculated probability and stored into the storage device.
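
The cluster-creation step described in the abstract (combining the parts-of-speech of entries that share a display and a reading) might look like the sketch below. The `Entry` type, the `build_clusters` function, and the part-of-speech labels are hypothetical names introduced for illustration; the patent does not prescribe this data layout.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, NamedTuple, Tuple


class Entry(NamedTuple):
    display: str  # written form of the word
    reading: str  # Kana reading of the word
    pos: str      # part-of-speech label (labels here are illustrative)


def build_clusters(
    entries: Iterable[Entry],
) -> Dict[Tuple[str, str], FrozenSet[str]]:
    """Group entries by (display, reading) and combine their parts-of-speech."""
    grouped = defaultdict(set)
    for entry in entries:
        grouped[(entry.display, entry.reading)].add(entry.pos)
    return {key: frozenset(pos_set) for key, pos_set in grouped.items()}


dictionary = [
    Entry("勉強", "べんきょう", "noun"),
    Entry("勉強", "べんきょう", "sa-hen noun"),
    Entry("今日", "きょう", "noun"),
]
clusters = build_clusters(dictionary)
for (display, reading), pos_set in clusters.items():
    print(display, reading, sorted(pos_set))
```

Using a frozen set for each cluster keeps the combined parts-of-speech unordered and hashable, which is convenient if clusters are later used as dictionary keys.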

20 Claims

1. A computer-readable storage medium having computer-executable instructions stored thereon for creating a language model and performing Kana-Kanji conversion, wherein the computer-executable instructions cause a computer that executes the instructions to:
- receive a Kana character string;
  
  divide the Kana character string into substrings and generate Kanji candidates for each substring;
  
  obtain a plurality of trigram probabilities of Kanji candidates;
  
  determine whether any of the plurality of trigram probabilities is above a first threshold and select a Kanji candidate if a trigram probability is above the first threshold;
  
  when it is determined none of the plurality of trigram probabilities is above the first threshold;
  
  obtain a plurality of bigram probabilities of Kanji candidates; and
  
  determine whether any of the plurality of bigram probabilities is above a second threshold and select a Kanji candidate if a bigram probability is above the second threshold;
  
  when it is determined none of the plurality of bigram probabilities is above the second threshold;
  
  select a Kanji candidate based on cluster bigram probabilities of the Kanji candidates, wherein at least one cluster in the cluster bigram probabilities includes combining the same Kanji-Kana pairs with different parts-of-speech; and
  
  display the selected Kanji candidates based on an order of precedence.
  - 2. The computer-readable storage medium of claim 1 wherein the computer-executable instructions further cause the computer that executes the instructions to store a dictionary of words in the Japanese language.
  - 3. The computer-readable storage medium of claim 2 wherein the computer-executable instructions further cause the computer that executes the instructions to:
    - identify subsets of words having the same display and reading but are associated with different parts-of-speech in the dictionary; and
      
      create a cluster for each identified subset of words indicating the different parts-of-speech associated with each identified subset of words.
  - 4. The computer-readable storage medium of claim 3 wherein the computer-executable instructions further cause the computer that executes the instructions to:
    - calculate the plurality of trigram probabilities from the number of occurrences of each word trigram in a training corpus; and
      
      calculate the plurality of bigram probabilities from the number of occurrences of each word bigram in the training corpus.
  - 5. The computer-readable storage medium of claim 3 wherein the computer-executable instructions further cause the computer that executes the instructions to calculate the plurality of cluster bigram probabilities based on the number of occurrences in a training corpus of a word associated with a second cluster and the number of occurrences in the training corpus of a selected part of speech preceding the word associated with the second cluster.
  - 6. The computer-readable storage medium of claim 2 wherein the computer-executable instructions further cause the computer that executes the instructions to:
    - divide the cluster to produce at least one divided cluster containing one or more parts-of-speech from the cluster, when the cluster includes a plurality of parts-of-speech;
      
      convert the cluster and each divided cluster into strings of Kanji characters or Kana characters;
      
      determine if the divided cluster is more reliable than the cluster based on a comparison of the corresponding character strings against a reference character string; and
      
      store the more reliable of the cluster and the divided cluster in the language model.
  - 7. The computer-readable storage medium of claim 6 wherein the computer-executable instructions further cause the computer that executes the instructions to:
    - calculate a character error rate for the cluster and each divided cluster based on a comparison of the character strings against the reference character string representing a known proper character string; and
      
      determine that the divided cluster is more reliable than the cluster if the character error rate for the divided cluster is less than the character error rate for the cluster.
  - 8. The computer-readable storage medium of claim 7 wherein the computer-executable instructions further cause the computer that executes the instructions to discard the divided cluster when the divided cluster has a character error rate greater than or equal to the character error rate for the cluster.
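
The reliability test in claims 6 to 8 compares character error rates against a reference string. The claims do not define how the character error rate is computed, so the sketch below assumes the common definition of Levenshtein edit distance divided by the reference length. All function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(hypothesis: str, reference: str) -> float:
    """Assumed definition: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(hypothesis, reference) / len(reference)


def keep_divided_cluster(
    cluster_output: str, divided_output: str, reference: str
) -> bool:
    """True only if the divided cluster has a strictly lower error rate.

    A divided cluster whose rate is greater than or equal to the
    original cluster's rate is discarded, as in claim 8.
    """
    divided_cer = character_error_rate(divided_output, reference)
    cluster_cer = character_error_rate(cluster_output, reference)
    return divided_cer < cluster_cer


reference = "今日は晴れ"
print(keep_divided_cluster("京は晴れ", "今日は晴れ", reference))
```

The strict inequality matters: a tie does not justify keeping the divided cluster, so the original cluster is retained in that case.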

9. A system for converting Kana to Kanji, the system comprising:
- a reading inputting unit that receives a reading associated with Kana character string;
  
  a reading dividing unit that divides the Kana character string into substrings;
  
  a candidate generating unit that generates Kanji candidates for each substring;
  
  a trigram obtaining unit that obtains trigram probabilities representing the probabilities of occurrence of Kanji candidate trigrams, each Kanji candidate trigram being a combination of the Kanji candidates generated for three substrings;
  
  a bigram obtaining unit that obtains bigram probabilities representing the probabilities of occurrence of Kanji candidate bigrams, each Kanji candidate bigram being a combination of the Kanji candidates generated for two substrings;
  
  a cluster bigram obtaining unit that obtains cluster bigram probabilities representing the probabilities of occurrence of combinations of parts-of-speech of the Kanji candidate bigrams where the Kanji candidate for at least one of the substrings is associated with different parts-of-speech;
  
  a decision unit selecting the Kanji candidates based on the Kanji candidate trigram with the highest trigram probability that exceeds a trigram probability threshold, selecting the Kanji candidates based on the Kanji candidate bigram with the highest bigram probability that exceeds a bigram probability threshold if none of the trigram probabilities exceeds the trigram probability threshold, and selecting the Kanji candidates based on the cluster bigram with the highest cluster bigram probability if none of the bigram probabilities exceeds the bigram probability threshold; and
  
  a presentation unit that presents the selected Kanji candidates based on an order of precedence.
  - 10. The system of claim 9 further comprising:
    - a dictionary storage unit that stores a plurality of words, each word having an associated display, reading, and part-of-speech;
      
      a word obtaining unit that identifies the part-of-speech of the words having the same display and reading from the dictionary storage unit;
      
      a cluster creating unit that creates a cluster for the words having the same display and reading by combining parts-of-speech of the words having the same display and reading; and
      
      a cluster storing control unit storing the cluster in a storage unit.
  - 11. The system of claim 10 further comprising:
    - a cluster dividing unit that divides the clusters stored in the storage unit according to the parts-of-speech;
      
      a combining unit that combines two clusters stored in the storage unit to create a cluster bigram;
      
      the calculation unit that calculates a probability of occurrence of the cluster bigram;
      
      a cluster bigram storing control unit that associates the cluster bigram with the cluster bigram indicating the probability calculated by a calculation unit.
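
Claim 11 recites calculating a probability of occurrence for each cluster bigram but does not give a formula. One plausible reading, shown here purely as an assumption, is a maximum-likelihood estimate over a corpus that has already been tagged as sequences of cluster identifiers: the count of a cluster pair divided by the count of its first cluster in the leading position. The identifiers `C1`, `C2`, `C3` and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def cluster_bigram_probs(
    sentences: Sequence[Sequence[str]],
) -> Dict[Tuple[str, str], float]:
    """Estimate P(second | first) for adjacent cluster IDs by relative frequency."""
    pair_counts: Counter = Counter()
    first_counts: Counter = Counter()
    for sentence in sentences:
        for first, second in zip(sentence, sentence[1:]):
            pair_counts[(first, second)] += 1
            first_counts[first] += 1
    return {
        pair: count / first_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Each inner list is one corpus sentence, already mapped to cluster IDs.
corpus = [["C1", "C2", "C3"], ["C1", "C2"], ["C1", "C3"]]
for pair, prob in sorted(cluster_bigram_probs(corpus).items()):
    print(pair, round(prob, 3))
```

No smoothing is applied in this sketch, so cluster pairs that never occur in the corpus are simply absent from the result.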

12. A method for converting a string of Kana characters into a string of Kanji characters in a Japanese writing system, the method comprising the acts of:
- receiving the string of Kana characters comprising a number of substrings;
  
  generating Kanji candidates for each of the substrings;
  
  selecting the Kanji candidates if a trigram probability generated for a trigram of Kanji candidates is above a first threshold;
  
  when no trigram probability is above the first threshold, selecting the Kanji candidates if a bigram probability generated for a bigram of Kanji candidates is above a second threshold;
  
  when no bigram probability is above the second threshold, selecting the Kanji candidates based on cluster bigram probabilities calculated for combinations of clusters associated with different display and reading pairs, at least one cluster generated from different parts-of-speech associated with a single display and reading pair; and
  
  displaying the Kanji candidates associated with the corresponding substrings based on an order of precedence.
  - 13. The method of claim 12 further comprising the act of dividing the string of Kana characters into substrings.
  - 14. The method of claim 12 further comprising the act of storing a Japanese language dictionary comprising a plurality of entries, each entry comprising a display, a reading, and a part-of-speech.
  - 15. The method of claim 14 further comprising the act of:
    - forming clusters from the entries in the Japanese language dictionary having the same displays and readings together with the associated parts-of-speech for those entries; and
      
      storing the clusters.
  - 16. The method of claim 15 further comprising the act of assigning a unique identifier to the cluster.
  - 17. The method of claim 15 wherein the act of forming clusters from the entries in the Japanese language dictionary having the same displays and readings together with the associated parts-of-speech for those entries further comprises the act of:
    - forming a cluster for each unique pair of displays and readings from the entries in the Japanese language dictionary;
      
      joining parts-of-speech associated with each entry matching the unique pair of displays and readings to form an expanded part-of-speech for the unique pair of displays and readings; and
      
      adding the expanded part-of-speech to the cluster for the unique pair of displays and readings.
  - 18. The method of claim 15 further comprising the act of building a language model based on character error rates calculated for each cluster using assumptive division.
  - 19. The method of claim 18 wherein the act of building the language model based on the character error rates calculated for each cluster using assumptive division further comprises the acts of:
    - dividing the cluster to produce at least one divided cluster containing one or more parts-of-speech from the cluster, when the cluster includes a plurality of parts-of-speech;
      
      converting the cluster and each divided cluster into character strings of Kanji characters, Kana characters, or both;
      
      calculating the character error rate for the cluster and each divided cluster based on a comparison of the character strings against a reference character string from a training corpus, the reference character string being a known proper character string;
      
      discarding each divided cluster having a character error rate greater than or equal to the character error rate for the cluster;
      
      storing each divided cluster having a character error rate lower than the character error rate for the cluster in the language model; and
      
      storing the cluster in the language model when no divided cluster has a character error rate lower than the character error rate for the cluster.
  - 20. The method of claim 19 wherein the act of dividing the cluster to produce the at least one divided cluster containing the one or more parts-of-speech from the cluster, when the cluster includes the plurality of parts-of-speech further comprises dividing the cluster according to a set of instructions for dividing the cluster into at least two clusters having non-overlapping sets of the parts-of-speech associated with the cluster.
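
Claim 20 describes dividing a cluster into at least two clusters with non-overlapping sets of parts-of-speech, according to a set of instructions. As a sketch of what the candidate divisions look like, the code below enumerates every two-way split of a cluster's part-of-speech set. Exhaustive enumeration is an assumption made for illustration; in the claim, the division is driven by the supplied instructions. The function name and labels are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Iterator, Tuple


def two_way_divisions(
    pos_set: FrozenSet[str],
) -> Iterator[Tuple[FrozenSet[str], FrozenSet[str]]]:
    """Yield each split of pos_set into two non-empty, non-overlapping parts."""
    items = sorted(pos_set)
    if len(items) < 2:
        return  # a single part-of-speech cannot be divided
    first, rest = items[0], items[1:]
    # Pin `first` to the left part so mirrored splits are not produced twice.
    # r stops short of len(rest) so the right part is never empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = frozenset((first, *extra))
            yield left, pos_set - left


cluster = frozenset({"noun", "sa-hen noun", "suffix"})
for left, right in two_way_divisions(cluster):
    print(sorted(left), "|", sorted(right))
```

A cluster with n parts-of-speech has 2^(n-1) - 1 such splits, so this enumeration is only practical for the small sets typical of a single display and reading pair.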

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Maeda, Rie; Sato, Yoshiharu; Seki, Miyuki
Primary Examiner(s): He, Jialong
Application Number: US11/917,657
Publication Number: US 20110106523A1
Time in Patent Office: 2,902 Days
Field of Search: 704/1, 704/2, 704/4
US Class Current: 704/1
CPC Class Codes:
G06F 40/129   Handling non-Latin characte...
G06F 40/284   Lexical analysis, e.g. toke...
G06F 40/53   Processing of non-Latin tex...
G10L 15/06   Creation of reference templ...