Method and apparatus for creating a language model and kana-kanji conversion
First Claim
1. A computer-readable storage medium having computer-executable instructions stored thereon for creating a language model and performing Kana-Kanji conversion, wherein the computer-executable instructions cause a computer that executes the instructions to:
receive a Kana character string;
divide the Kana character string into substrings and generate Kanji candidates for each substring;
obtain a plurality of trigram probabilities of Kanji candidates;
determine whether any of the plurality of trigram probabilities is above a first threshold and select a Kanji candidate if a trigram probability is above the first threshold;
when it is determined none of the plurality of trigram probabilities is above the first threshold;
obtain a plurality of bigram probabilities of Kanji candidates; and
determine whether any of the plurality of bigram probabilities is above a second threshold and select a Kanji candidate if a bigram probability is above the second threshold;
when it is determined none of the plurality of bigram probabilities is above the second threshold;
select a Kanji candidate based on cluster bigram probabilities of the Kanji candidates, wherein at least one cluster in the cluster bigram probabilities includes combining the same Kanji-Kana pairs with different parts-of-speech; and
display the selected Kanji candidates based on an order of precedence.
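The back-off order recited in claim 1 (trigram, then bigram, then cluster bigram) can be illustrated with a minimal sketch. The function names, the dictionary-based probability tables, and the default threshold values are all hypothetical; the claim does not specify them. Picking the highest-probability n-gram above the threshold follows the wording of the decision unit in claim 9 rather than claim 1 itself.

```python
from typing import Dict, Optional, Tuple

Ngram = Tuple[str, ...]


def best_above(probs: Dict[Ngram, float], threshold: float) -> Optional[Ngram]:
    """Return the most probable n-gram whose probability exceeds the threshold, else None."""
    if not probs:
        return None
    ngram, p = max(probs.items(), key=lambda kv: kv[1])
    return ngram if p > threshold else None


def select_candidates(trigram_probs: Dict[Ngram, float],
                      bigram_probs: Dict[Ngram, float],
                      cluster_bigram_probs: Dict[Ngram, float],
                      first_threshold: float = 0.01,    # assumed value
                      second_threshold: float = 0.01):  # assumed value
    """Back off from trigram to bigram to cluster bigram; return (level, n-gram)."""
    choice = best_above(trigram_probs, first_threshold)
    if choice is not None:
        return "trigram", choice
    choice = best_above(bigram_probs, second_threshold)
    if choice is not None:
        return "bigram", choice
    # Last level: no threshold is applied, the best cluster bigram wins.
    if not cluster_bigram_probs:
        return "none", None
    best = max(cluster_bigram_probs, key=cluster_bigram_probs.get)
    return "cluster_bigram", best
```

In this sketch the final level never falls through to a unigram, which mirrors the stated aim of avoiding the conventional back-off to unigram.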
Abstract
A method for creating a language model that prevents the deterioration of quality caused by the conventional back-off to unigram. Parts-of-speech with the same display and reading are obtained from a storage device (206). A cluster (204) is created by combining the obtained parts-of-speech. The created cluster (204) is stored in the storage device (206). In addition, when an instruction (214) for dividing the cluster is inputted, the cluster stored in the storage device (206) is divided (210) in accordance with the inputted instruction (212). Two of the clusters stored in the storage device are combined (218), and a probability of occurrence of the combined clusters in the text corpus is calculated (222). The combined cluster is associated with the bigram indicating the calculated probability and stored in the storage device.
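The cluster creation and probability steps of the abstract can be sketched as follows. This is only an illustration under stated assumptions: the lexicon and corpus layouts, the function names, and the use of a conditional relative-frequency estimate are all hypothetical, since the abstract says only that a probability of occurrence of the combined clusters in the text corpus is calculated. The cluster-dividing step is not shown.

```python
from collections import Counter, defaultdict


def build_clusters(lexicon):
    """Combine the parts-of-speech that share one (display, reading) pair into a cluster.

    lexicon: iterable of (display, reading, part_of_speech) tuples (assumed layout).
    """
    clusters = defaultdict(set)
    for display, reading, pos in lexicon:
        clusters[(display, reading)].add(pos)
    return {key: frozenset(tags) for key, tags in clusters.items()}


def cluster_bigram_probs(corpus, clusters):
    """Estimate P(c2 | c1) for adjacent clusters by relative frequency (an assumption).

    corpus: list of sentences, each a list of (display, reading, part_of_speech).
    Every (display, reading) pair in the corpus must be present in clusters.
    """
    pair_counts = Counter()
    left_counts = Counter()
    for sentence in corpus:
        ids = [clusters[(display, reading)] for display, reading, _pos in sentence]
        for c1, c2 in zip(ids, ids[1:]):
            pair_counts[(c1, c2)] += 1
            left_counts[c1] += 1
    return {pair: n / left_counts[pair[0]] for pair, n in pair_counts.items()}
```

Representing a cluster as a frozen set of part-of-speech tags keeps it hashable, so a pair of clusters can serve directly as the key under which the bigram probability is stored.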
20 Claims
9. A system for converting Kana to Kanji, the system comprising:
a reading inputting unit that receives a reading associated with Kana character string;
a reading dividing unit that divides the Kana character string into substrings;
a candidate generating unit that generates Kanji candidates for each substring;
a trigram obtaining unit that obtains trigram probabilities representing the probabilities of occurrence of Kanji candidate trigrams, each Kanji candidate trigram being a combination of the Kanji candidates generated for three substrings;
a bigram obtaining unit that obtains bigram probabilities representing the probabilities of occurrence of Kanji candidate bigrams, each Kanji candidate bigram being a combination of the Kanji candidates generated for two substrings;
a cluster bigram obtaining unit that obtains cluster bigram probabilities representing the probabilities of occurrence of combinations of parts-of-speech of the Kanji candidate bigrams where the Kanji candidate for at least one of the substrings is associated with different parts-of-speech;
a decision unit selecting the Kanji candidates based on the Kanji candidate trigram with the highest trigram probability that exceeds a trigram probability threshold, selecting the Kanji candidates based on the Kanji candidate bigram with the highest bigram probability that exceeds a bigram probability threshold if none of the trigram probabilities exceeds the trigram probability threshold, and selecting the Kanji candidates based on the cluster bigram with the highest cluster bigram probability if none of the bigram probabilities exceeds the bigram probability threshold; and
a presentation unit that presents the selected Kanji candidates based on an order of precedence.
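The reading dividing unit, the candidate generating unit, and the formation of candidate trigrams and bigrams in claim 9 might look roughly like the sketch below. The dictionary layout (reading mapped to a list of candidates), the exhaustive recursive segmentation, and both function names are assumptions made for illustration; the claim does not say how the division or the lookup is performed.

```python
from itertools import product


def segmentations(kana, dictionary):
    """Yield every split of kana into substrings that are all dictionary readings."""
    if not kana:
        yield []
        return
    for end in range(1, len(kana) + 1):
        head = kana[:end]
        if head in dictionary:
            for rest in segmentations(kana[end:], dictionary):
                yield [head] + rest


def candidate_ngrams(substrings, dictionary, n):
    """Yield each combination of candidates over every window of n adjacent substrings."""
    for i in range(len(substrings) - n + 1):
        window = substrings[i:i + n]
        yield from product(*(dictionary[s] for s in window))
```

With `n=3` the second function produces the candidate trigrams whose probabilities the trigram obtaining unit would look up, and with `n=2` the candidate bigrams.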
12. A method for converting a string of Kana characters into a string of Kanji characters in a Japanese writing system, the method comprising the acts of:
receiving the string of Kana characters comprising a number of substrings;
generating Kanji candidates for each of the substrings;
selecting the Kanji candidates if a trigram probability generated for a trigram of Kanji candidates is above a first threshold;
when no trigram probability is above the first threshold, selecting the Kanji candidates if a bigram probability generated for a bigram of Kanji candidates is above a second threshold;
when no bigram probability is above the second threshold, selecting the Kanji candidates based on cluster bigram probabilities calculated for combinations of clusters associated with different display and reading pairs, at least one cluster generated from different parts-of-speech associated with a single display and reading pair; and
displaying the Kanji candidates associated with the corresponding substrings based on an order of precedence.
Specification