Method and Apparatus for Creating a Language Model and Kana-Kanji Conversion
Abstract
A method for creating a language model that prevents the deterioration in quality caused by the conventional back-off to unigram. Parts-of-speech with the same display and reading are obtained from a storage device (206). A cluster (204) is created by combining the obtained parts-of-speech. The created cluster (204) is stored in the storage device (206). In addition, when an instruction (214) for dividing the cluster is inputted, the cluster stored in the storage device (206) is divided (210) in accordance with the inputted instruction (212). Two of the clusters stored in the storage device are combined (218), and a probability of occurrence of the combined clusters in the text corpus is calculated (222). The combined cluster is associated with the bigram indicating the calculated probability and stored in the storage device.
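The last two sentences of the abstract (combining two clusters and calculating their probability of occurrence in a text corpus) can be illustrated with a minimal sketch. The abstract does not say how the probability is estimated, so the relative-frequency estimate below is an assumption. The function name `cluster_bigram_probs` and the cluster labels in the toy corpus are hypothetical.

```python
from collections import Counter


def cluster_bigram_probs(tagged_sentences):
    """Estimate P(next_cluster | prev_cluster) by relative frequency.

    tagged_sentences: iterable of lists of cluster labels, one list per
    sentence. The estimator is an assumption; the abstract only says that
    a probability of occurrence is calculated.
    """
    pair_counts = Counter()
    left_counts = Counter()
    for clusters in tagged_sentences:
        # Count every adjacent (previous, next) pair of clusters.
        for prev, nxt in zip(clusters, clusters[1:]):
            pair_counts[(prev, nxt)] += 1
            left_counts[prev] += 1
    return {pair: n / left_counts[pair[0]] for pair, n in pair_counts.items()}


# Toy corpus with hypothetical cluster labels.
corpus = [
    ["noun", "particle", "verb"],
    ["noun", "particle", "noun|suru-verb-stem"],
]
probs = cluster_bigram_probs(corpus)
```

Each key of `probs` is an ordered pair of clusters, which corresponds to the cluster bigram that the abstract describes as being stored with its probability.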
20 Claims
1. A method for converting a string of characters in an initial writing system into a converted string of characters in a Japanese writing system, the method comprising the steps of:
  identifying a set of words in the Japanese language that are associated with a shared display in the Japanese writing system and are associated with a shared reading in the initial writing system, but are associated with different parts-of-speech;
  creating a cluster associated with the words in the set of words, the cluster indicating the parts-of-speech associated with the words in the set of words;
  using the cluster in a process to convert the string of characters in the initial writing system into the converted string of characters in the Japanese writing system.
(Dependent claims: 2, 3, 8, 9, 10, 11, 12, 13, 14)
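The identifying and creating steps of claim 1 can be sketched as a grouping over dictionary entries. This is an illustration under stated assumptions, not the patented implementation: the entry layout `(display, reading, part_of_speech)`, the function name `build_clusters`, and the sample entries are all hypothetical.

```python
from collections import defaultdict


def build_clusters(dictionary):
    """Group parts-of-speech of entries that share a display and a reading.

    dictionary: iterable of (display, reading, part_of_speech) tuples
    (a hypothetical layout chosen for this sketch).
    Returns {(display, reading): frozenset of parts-of-speech}, keeping
    only keys that carry more than one part-of-speech.
    """
    pos_by_key = defaultdict(set)
    for display, reading, pos in dictionary:
        pos_by_key[(display, reading)].add(pos)
    return {
        key: frozenset(pos_set)
        for key, pos_set in pos_by_key.items()
        if len(pos_set) > 1
    }


# Illustrative entries only; the part-of-speech labels are assumptions.
entries = [
    ("勉強", "べんきょう", "noun"),
    ("勉強", "べんきょう", "suru-verb stem"),
    ("猫", "ねこ", "noun"),
]
clusters = build_clusters(entries)
```

Here the two entries for 勉強 share a display and a reading but differ in part-of-speech, so they form one cluster, while 猫 has a single part-of-speech and forms none.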
4. A method for converting a string of characters in an initial writing system into characters in a Japanese writing system, the method comprising:
  receiving the string of characters in the initial writing system, wherein the string of characters in the initial writing system represents a reading of at least one word in the Japanese language;
  dividing the reading into a set of substrings;
  converting the substrings into a set of candidate strings in the Japanese writing system;
  obtaining, for a given one of the candidate strings, an Ngram indicating a probability of occurrence of a combination of N words included in the given one of the candidate strings;
  obtaining, for the given one of the candidate strings, a cluster bigram indicating a probability of occurrence of a combination of words associated with a first cluster and a second cluster in the candidate string, the first cluster indicating parts-of-speech associated with words in the Japanese language that are associated with a first shared display in the Japanese writing system and a first shared reading in the initial writing system, the second cluster indicating parts-of-speech associated with words in the Japanese language that are associated with a second shared display in the Japanese writing system and a second shared reading in the initial writing system;
  determining an order of precedence of the given one of the candidate strings relative to other ones of the candidate strings in accordance with the probabilities indicated by the obtained Ngram and the obtained cluster bigram; and
  presenting the given one of the candidate strings in accordance with the determined order.
(Dependent claims: 15, 16, 17, 18, 19)
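The ordering step of claim 4 can be sketched as follows. The claim says only that the order is determined "in accordance with" the Ngram and cluster bigram probabilities. Combining them as a sum of log-probabilities is an assumption made for this sketch, and the function name `rank_candidates` and all probability values are toy placeholders.

```python
import math


def rank_candidates(candidates, ngram_prob, cluster_bigram_prob):
    """Order candidates from most to least likely.

    The score is log P_ngram + log P_cluster_bigram; this combination
    rule is an assumption, since the claim does not specify one.
    """
    def score(candidate):
        return math.log(ngram_prob[candidate]) + math.log(cluster_bigram_prob[candidate])

    return sorted(candidates, key=score, reverse=True)


# Three homophones read "きしゃ"; the probabilities are made-up toy values.
candidates = ["記者", "汽車", "帰社"]
ngram_prob = {"記者": 0.02, "汽車": 0.01, "帰社": 0.01}
cluster_bigram_prob = {"記者": 0.3, "汽車": 0.3, "帰社": 0.7}

ranked = rank_candidates(candidates, ngram_prob, cluster_bigram_prob)
```

With these toy values the cluster bigram outweighs the Ngram for 帰社, which shows how the second probability source can reorder candidates that the Ngram alone would rank differently.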
5. An apparatus for converting a string of characters in an initial writing system into a converted string of characters in a Japanese writing system, comprising:
  storage means for storing, for each word in a set of words in the Japanese language, a display, a reading, and a part-of-speech;
  word obtaining means for identifying a subset of words in the storage means that are associated with a shared display in a Japanese writing system and a shared reading in the initial writing system, but are associated with different parts-of-speech;
  cluster creating means for creating a cluster that indicates the parts-of-speech associated with the words in the subset of words;
  cluster storage controlling means for storing the cluster into the storage means; and
  conversion means for using the cluster in a process to convert the string of characters in the initial writing system into the converted string of characters in the Japanese writing system.
6. A writing system conversion apparatus, comprising:
  a storage unit that stores a set of Ngrams, each Ngram indicating a probability of occurrence of a combination of N words, and that stores a set of cluster bigrams, each cluster bigram indicating a probability of occurrence of a combination of two clusters, at least one of the clusters including at least two parts-of-speech, wherein each cluster is associated with a different set of words that consists of words that are associated with a shared display and a shared reading, but are associated with different parts-of-speech;
  a reading inputting unit that receives a reading of a character string in an initial writing system;
  a reading dividing unit that divides the reading into a set of substrings;
  a candidate generating unit that converts the set of substrings into a set of candidate strings in a Japanese writing system;
  an Ngram obtaining unit that obtains Ngrams indicating probabilities of occurrences of combinations of N words included in the candidate strings;
  a cluster bigram obtaining unit that obtains cluster bigrams indicating probabilities of occurrences of combinations of words associated with two clusters included in the candidate strings;
  a determining unit that determines an order of precedence of the candidate strings in accordance with the probabilities indicated by the obtained Ngrams and the obtained cluster bigrams; and
  a presentation unit that presents each of the candidate strings in accordance with the determined order.
7. A computer-readable storage medium having computer-executable instructions stored thereon, wherein the computer-executable instructions cause a computer that executes the instructions to:
  store, in a storage device, a dictionary of words in the Japanese language;
  store a set of word trigrams, each of which indicates an approximate probability of a combination of three words occurring in the Japanese language;
  store a set of word bigrams, each of which indicates an approximate probability of a combination of two words occurring in the Japanese language;
  identify subsets of words in the dictionary, wherein the words in each subset of words are associated with a common display in a Japanese writing system and a common reading in an initial writing system, but are associated with different parts-of-speech;
  create, for each identified subset of words, a cluster that indicates the parts-of-speech associated with the words in the subset of words;
  receive a first character string;
  receive a reading for each word in the first character string;
  receive a part-of-speech identifier for each word in the first character string;
  generate a set of cluster bigrams for each ordered pair of clusters in the set of clusters;
  associate each of the cluster bigrams with an approximate probability of a first word in the first character string being associated with a first cluster in one of the ordered pairs and a second word that immediately follows the first word in the first character string being associated with a second cluster in the ordered pair;
  receive a second character string in the initial writing system that represents three words in the Japanese language;
  divide the second character string into a set of three substrings, each of which represents one of the words represented by the second character string;
  convert each of the substrings into a set of candidate strings in the Japanese writing system;
  perform the following for each of the candidate strings;
    retrieve one of the word trigrams that indicates an approximate probability of the three words included in the candidate string;
    determine whether the approximate probability indicated by the retrieved word trigram is above a first threshold;
    associate, when it is determined that the approximate probability indicated by retrieved word trigram is above the first threshold, the candidate string with the approximate probability indicated by the retrieved word trigram;
    retrieve, when it is determined that the approximate probability indicated by the retrieved word trigram is not above the first threshold, one of the word bigrams that indicates approximate probabilities of two of the words in the candidate string;
    determine whether the approximate probability indicated by the retrieved word bigram is above a second threshold;
    associate, when it is determined that the approximate probability indicated by the retrieved word bigram is above the second threshold, the candidate string with the approximate probability indicated by the retrieved word bigram;
    retrieve, when it is determined that the approximate probability indicated by the retrieved word bigram is not above the second threshold, a cluster bigram that indicates an approximate probability of a first cluster following a second cluster, wherein the first cluster is associated with a first word in the candidate string and the second cluster is associated with a word in the candidate string that follows the first word in the candidate string; and
    associate, after retrieving the applicable cluster bigram, the approximate probability indicated by the retrieved cluster bigram with the candidate string;
  after associating an approximate probability with each of the candidate strings, determine an order of the candidate strings according to the approximate probabilities associated with the candidate strings; and
  display the candidate strings in accordance with the determined order.
(Dependent claims: 20)
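The per-candidate steps of claim 7 amount to a back-off cascade: word trigram, then word bigram, then cluster bigram. The sketch below is one possible reading, with several assumptions that the claim leaves open: the bigram is taken over the last two words, the cluster bigram is looked up for the clusters of those same two words, and a missing entry counts as probability 0.0. All names (`backoff_probability`, `order_candidates`, `cluster_of`) and all values are hypothetical.

```python
def backoff_probability(words, cluster_of, trigrams, bigrams, cluster_bigrams,
                        trigram_threshold, bigram_threshold):
    """Return the probability associated with a three-word candidate.

    Falls back from word trigram to word bigram to cluster bigram.
    Which two words feed the bigram lookups is an assumption (the last two).
    """
    w1, w2, w3 = words
    p = trigrams.get((w1, w2, w3), 0.0)
    if p > trigram_threshold:
        return p
    p = bigrams.get((w2, w3), 0.0)
    if p > bigram_threshold:
        return p
    # Final fallback: the cluster bigram, instead of backing off to a unigram.
    return cluster_bigrams.get((cluster_of[w2], cluster_of[w3]), 0.0)


def order_candidates(candidates, **model):
    """Sort candidates by their associated probability, highest first."""
    return sorted(candidates,
                  key=lambda words: backoff_probability(words, **model),
                  reverse=True)


# Toy model with placeholder words and clusters.
model = dict(
    cluster_of={"a": "C1", "b": "C2", "c": "C3", "d": "C3", "e": "C2"},
    trigrams={("a", "b", "c"): 0.2},
    bigrams={("b", "d"): 0.05},
    cluster_bigrams={("C2", "C3"): 0.01},
    trigram_threshold=0.1,
    bigram_threshold=0.01,
)
ordered = order_candidates([("a", "e", "c"), ("a", "b", "d"), ("a", "b", "c")], **model)
```

In this toy run the first candidate falls through to the cluster bigram, the second stops at the word bigram, and the third is resolved by the word trigram, so each branch of the cascade is exercised once.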
Specification