Method and Apparatus for Creating a Language Model and Kana-Kanji Conversion

US 20110106523A1
Filed: 06/23/2006
Published: 05/05/2011
Est. Priority Date: 06/24/2005
Status: Active Grant

Abstract

Method for creating a language model capable of preventing deterioration of quality caused by the conventional back-off to unigram. Parts-of-speech with the same display and reading are obtained from a storage device (206). A cluster (204) is created by combining the obtained parts-of-speech. The created cluster (204) is stored in the storage device (206). In addition, when an instruction (214) for dividing the cluster is inputted, the cluster stored in the storage device (206) is divided (210) in accordance with the inputted instruction (212). Two of the clusters stored in the storage device are combined (218), and a probability of occurrence of the combined clusters in the text corpus is calculated (222). The combined cluster is associated with the bigram indicating the calculated probability and stored into the storage device.
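The cluster-creation step described in the abstract can be illustrated with a minimal Python sketch. It assumes a dictionary whose entries carry a display, a reading, and a part-of-speech; the `Entry` type, the `build_clusters` function, and the sample words and tags are hypothetical and are not taken from the patent.

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    """One dictionary entry (hypothetical layout)."""
    display: str   # surface form in the Japanese writing system
    reading: str   # reading in the initial writing system (kana here)
    pos: str       # part-of-speech tag


def build_clusters(dictionary):
    """Group parts-of-speech that share the same display and reading."""
    grouped = defaultdict(set)
    for entry in dictionary:
        grouped[(entry.display, entry.reading)].add(entry.pos)
    # Freeze each set so a cluster can later be used as a dictionary key.
    return {key: frozenset(tags) for key, tags in grouped.items()}


# Illustrative entries only; the tags are assumptions.
dictionary = [
    Entry("秋", "あき", "noun"),
    Entry("空き", "あき", "noun"),
    Entry("空き", "あき", "verb-stem"),
]
clusters = build_clusters(dictionary)
```

In this sketch, the two entries for 空き share a display and a reading, so their tags are merged into one cluster, while 秋 ends up in a cluster with a single part-of-speech.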

20 Claims

1. A method for converting a string of characters in an initial writing system into a converted string of characters in a Japanese writing system, the method comprising the steps of:
- identifying a set of words in the Japanese language that are associated with a shared display in the Japanese writing system and are associated with a shared reading in the initial writing system, but are associated with different parts-of-speech;
  
  creating a cluster associated with the words in the set of words, the cluster indicating the parts-of-speech associated with the words in the set of words;
  
  using the cluster in a process to convert the string of characters in the initial writing system into the converted string of characters in the Japanese writing system.
- Dependent claims: 2, 3, 8, 9, 10, 11, 12, 13, 14
  - 2. The method as claimed in claim 1, further comprising:
    - receiving an instruction for dividing the cluster; and
      
      dividing the cluster in accordance with the instruction.
  - 3. The method for creating the language model as claimed in claim 1, wherein the set of words is a first set of words and the cluster is a first cluster;
    - wherein the method further comprises:
      
      identifying a second set of words in the Japanese language that are associated with a shared display in the Japanese writing system and are associated with a shared reading in the initial writing system, but are associated with different parts-of-speech;
      
      creating a second cluster associated with the words in the second set of words, the second cluster indicating the parts-of-speech associated with the words in the second set of words;
      
      receiving a test character string;
      
      obtaining a text corpus by assigning a part-of-speech to each word included in the test character string;
      
      generating a combined cluster that is associated with combinations of words associated with the first cluster and words associated with the second cluster;
      
      calculating a probability of occurrence of words associated with the combined cluster in the text corpus; and
      
      associating the combined cluster with a cluster bigram indicating the calculated probability; and
      
      wherein using the cluster comprises using the cluster bigram in the process to convert the string of characters in the initial writing system into the converted string of characters in the Japanese writing system.
  - 8. The method of claim 1, wherein the initial writing system is a kana writing system and the Japanese writing system is the kanji writing system.
  - 9. The method of claim 1, wherein the initial writing system is a Latin alphabet writing system.
  - 10. The method of claim 2, wherein the instruction specifies how to divide the words associated with the cluster into at least two clusters associated with non-overlapping sets of the words associated with the cluster.
  - 11. The method of claim 2, wherein dividing the cluster comprises identifying a division of the words associated with the cluster into a set of at least two clusters that are associated with a lowest error rate among possible divisions of the cluster.
  - 12. The method of claim 11, wherein identifying a division of the words associated with the cluster comprises:
    - removing the cluster from the set of clusters;
      
      dividing the parts-of-speech associated with the words in the set of words into a first set of parts-of-speech and a second set of parts-of-speech;
      
      creating a first divided cluster that indicates the parts-of-speech in the first set of parts-of-speech and a second divided cluster that indicates the parts-of-speech in the second set of parts-of-speech;
      
      converting the first divided cluster into a test string in the Japanese writing system;
      
      comparing the test string with a pre-stored correct string;
      
      calculating an error rate for the first divided cluster based on the comparison of the test string with the pre-stored correct string;
      
      determining whether the error rate for the first divided cluster is less than the error rate for the cluster;
      
      discarding the first divided cluster when the error rate for the first divided cluster is not less than the error rate for the cluster; and
      
      using the first divided cluster in the process to convert the string of characters into the converted string of characters in the Japanese writing system.
  - 13. The method of claim 1, wherein using the cluster comprises:
    - receiving the string of characters in the initial writing system, wherein the string of characters in the initial writing system represents a reading of a character string;
      
      dividing the reading into a set of substrings;
      
      converting the substrings into a set of candidate strings in the Japanese writing system;
      
      obtaining, for a given one of the candidate strings, an Ngram indicating a probability of occurrence of a combination of N words included in the given one of the candidate strings;
      
      obtaining, for the given one of the candidate strings, a cluster bigram indicating a probability of occurrence of a combination of a first cluster and a second cluster in the candidate string, the first cluster indicating parts-of-speech associated with words in the Japanese language that are associated with a first shared display in the Japanese writing system and a first shared reading in the initial writing system, the second cluster indicating parts-of-speech associated with words in the Japanese language that are associated with a second shared display in the Japanese writing system and a second shared reading in the initial writing system;
      
      determining an order of precedence of the given one of the candidate strings relative to other ones of the candidate strings in accordance with the probabilities indicated by the obtained Ngram and the obtained cluster bigram; and
      
      presenting the given one of the candidate strings in accordance with the determined order.
  - 14. The method of claim 1, further comprising storing the cluster in a persistent storage medium.
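The division step of claims 11 and 12 can be sketched as a search over two-way splits of a cluster's parts-of-speech, keeping a split only when it lowers the error rate. This is a simplified sketch: the `error_rate` callable is an assumption standing in for the claimed procedure of converting a test string and comparing it with a pre-stored correct string, and `two_way_splits`, `best_division`, and `toy_error_rate` are hypothetical names.

```python
from itertools import combinations


def two_way_splits(cluster):
    """Yield every division of a cluster into two non-empty, non-overlapping parts."""
    items = sorted(cluster)
    first, rest = items[0], items[1:]
    # Pin the first item to the left part so mirrored splits are not repeated.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = frozenset((first, *extra))
            yield left, frozenset(items) - left


def best_division(cluster, error_rate):
    """Return the lowest-error split, or None if no split beats the undivided cluster."""
    best, best_error = None, error_rate([cluster])
    for left, right in two_way_splits(cluster):
        error = error_rate([left, right])
        if error < best_error:  # discard splits that do not reduce the error rate
            best, best_error = (left, right), error
    return best


def toy_error_rate(clusters):
    """Stand-in evaluator (assumption): isolating 'particle' is treated as helpful."""
    return 0.10 if frozenset({"particle"}) in clusters else 0.25


cluster = frozenset({"noun", "verb", "particle"})
division = best_division(cluster, toy_error_rate)
```

A cluster with n parts-of-speech has 2^(n-1) - 1 such splits, so exhaustive search is only practical for small clusters; a real system would presumably bound or guide the search.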

4. A method for converting a string of characters in an initial writing system into characters in a Japanese writing system, the method comprising:
- receiving the string of characters in the initial writing system, wherein the string of characters in the initial writing system represents a reading of at least one word in the Japanese language;
  
  dividing the reading into a set of substrings;
  
  converting the substrings into a set of candidate strings in the Japanese writing system;
  
  obtaining, for a given one of the candidate strings, an Ngram indicating a probability of occurrence of a combination of N words included in the given one of the candidate strings;
  
  obtaining, for the given one of the candidate strings, a cluster bigram indicating a probability of occurrence of a combination of words associated with a first cluster and a second cluster in the candidate string, the first cluster indicating parts-of-speech associated with words in the Japanese language that are associated with a first shared display in the Japanese writing system and a first shared reading in the initial writing system, the second cluster indicating parts-of-speech associated with words in the Japanese language that are associated with a second shared display in the Japanese writing system and a second shared reading in the initial writing system;
  
  determining an order of precedence of the given one of the candidate strings relative to other ones of the candidate strings in accordance with the probabilities indicated by the obtained Ngram and the obtained cluster bigram; and
  
  presenting the given one of the candidate strings in accordance with the determined order.
- Dependent claims: 15, 16, 17, 18, 19
  - 15. The method of claim 4, further comprising storing a dictionary that associates, for each word in a set of words in the Japanese language, a display of the word with a reading of the word and a part-of-speech of the word.
  - 16. The method of claim 4, wherein the reading includes a plurality of words.
  - 17. The method of claim 4, further comprising:
    - identifying a set of words in the Japanese language that are associated with a shared display in the Japanese writing system and a shared reading in the initial writing system, but are associated with different parts-of-speech; and
      
      creating a cluster that indicates the parts-of-speech associated with the words in the set of words.
  - 18. The method of claim 4, wherein the cluster bigram probability of the occurrence of the combination of the first cluster and the second cluster in the candidate string is equal to a multiplication of:
    - a number of occurrences in a text corpus of words associated with the first cluster followed by words associated with the second cluster divided by a number of occurrences in the text corpus of words associated with the first cluster; and
      
      a number of occurrences in the text corpus of a word associated with the second cluster divided by a number of occurrences in a text corpus of the words associated with the second cluster.
  - 19. The method of claim 4, wherein the Japanese writing system is selected from a group of Japanese writing systems that consists of:
    - kana and kanji.
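The cluster bigram probability recited in claim 18 is a product of two count ratios: the number of times words of the first cluster are followed by words of the second cluster, divided by the number of first-cluster occurrences, multiplied by the number of occurrences of a given second-cluster word divided by the number of second-cluster occurrences. The Python sketch below computes it over a tagged corpus; the corpus layout (a list of `(word, cluster_id)` pairs), the function name, and the toy data are assumptions made for illustration.

```python
from collections import Counter


def cluster_bigram_probability(corpus, first_cluster, second_cluster, word):
    """Compute the product of the two count ratios described in claim 18.

    corpus: list of (word, cluster_id) pairs in running-text order (assumed layout).
    """
    cluster_counts = Counter(cluster for _, cluster in corpus)
    if cluster_counts[first_cluster] == 0 or cluster_counts[second_cluster] == 0:
        return 0.0  # undefined ratios are treated as zero in this sketch

    # Occurrences of a first-cluster word immediately followed by a second-cluster word.
    pair_count = sum(
        1
        for (_, a), (_, b) in zip(corpus, corpus[1:])
        if a == first_cluster and b == second_cluster
    )
    # Occurrences of the specific word within the second cluster.
    word_count = sum(1 for w, c in corpus if w == word and c == second_cluster)

    transition = pair_count / cluster_counts[first_cluster]
    emission = word_count / cluster_counts[second_cluster]
    return transition * emission


# Toy corpus with placeholder words and cluster ids.
corpus = [("a", "X"), ("b", "Y"), ("a", "X"), ("c", "Y"),
          ("a", "X"), ("b", "Y"), ("d", "X")]
probability = cluster_bigram_probability(corpus, "X", "Y", "b")
```

For this toy corpus the first ratio is 3/4 (three X-then-Y transitions out of four X tokens) and the second is 2/3 (two occurrences of "b" among three Y tokens), so the product is 0.5 up to floating-point rounding.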

5. An apparatus for converting a string of characters in an initial writing system into a converted string of characters in a Japanese writing system, comprising:
- storage means for storing, for each word in a set of words in the Japanese language, a display, a reading, and a part-of-speech;
  
  word obtaining means for identifying a subset of words in the storage means that are associated with a shared display in a Japanese writing system and a shared reading in the initial writing system, but are associated with different parts-of-speech;
  
  cluster creating means for creating a cluster that indicates the parts-of-speech associated with the words in the subset of words;
  
  cluster storage controlling means for storing the cluster into the storage means; and
  
  conversion means for using the cluster in a process to convert the string of characters in the initial writing system into the converted string of characters in the Japanese writing system.

6. A writing system conversion apparatus, comprising:
- a storage unit that stores a set of Ngrams, each Ngram indicating a probability of occurrence of a combination of N words, and that stores a set of cluster bigrams, each cluster bigram indicating a probability of occurrence of a combination of two clusters, at least one of the clusters including at least two parts-of-speech, wherein each cluster is associated with a different set of words that consists of words that are associated with a shared display and a shared reading, but are associated with different parts-of-speech;
  
  a reading inputting unit that receives a reading of a character string in an initial writing system;
  
  a reading dividing unit that divides the reading into a set of substrings;
  
  a candidate generating unit that converts the set of substrings into a set of candidate strings in a Japanese writing system;
  
  an Ngram obtaining unit that obtains Ngrams indicating probabilities of occurrences of combinations of N words included in the candidate strings;
  
  a cluster bigram obtaining unit that obtains cluster bigrams indicating probabilities of occurrences of combinations of words associated with two clusters included in the candidate strings;
  
  a determining unit that determines an order of precedence of the candidate strings in accordance with the probabilities indicated by the obtained Ngrams and the obtained cluster bigrams; and
  
  a presentation unit that presents each of the candidate strings in accordance with the determined order.
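The reading dividing unit and candidate generating unit of claim 6 can be pictured as enumerating the ways a reading splits into known substrings, then substituting each substring with its possible displays. The sketch below is one straightforward way to do that, not the patent's stated implementation; the `lexicon` mapping, the function names, and the sample entries are hypothetical.

```python
from itertools import product


def segmentations(reading, known_readings):
    """Return every split of `reading` into substrings found in `known_readings`."""
    if not reading:
        return [[]]
    results = []
    for end in range(1, len(reading) + 1):
        head = reading[:end]
        if head in known_readings:
            for tail in segmentations(reading[end:], known_readings):
                results.append([head] + tail)
    return results


def candidate_strings(reading, lexicon):
    """Expand each segmentation into candidate strings in the target writing system."""
    candidates = []
    for segments in segmentations(reading, lexicon.keys()):
        # Try every combination of displays for the segments, in order.
        for displays in product(*(lexicon[s] for s in segments)):
            candidates.append("".join(displays))
    return candidates


# Illustrative reading-to-display lexicon (assumption).
lexicon = {
    "あき": ["秋", "空き"],
    "の": ["の"],
    "そら": ["空"],
    "あ": ["亜"],
    "きの": ["木の"],
}
splits = segmentations("あきのそら", lexicon.keys())
candidates = candidate_strings("あきのそら", lexicon)
```

The resulting candidates are unranked; ordering them by N-gram and cluster bigram probabilities is the job of the determining unit.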

7. A computer-readable storage medium having computer-executable instructions stored thereon, wherein the computer-executable instructions cause a computer that executes the instructions to:
- store, in a storage device, a dictionary of words in the Japanese language;
  
  store a set of word trigrams, each of which indicates an approximate probability of a combination of three words occurring in the Japanese language;
  
  store a set of word bigrams, each of which indicates an approximate probability of a combination of two words occurring in the Japanese language;
  
  identify subsets of words in the dictionary, wherein the words in each subset of words are associated with a common display in a Japanese writing system and a common reading in an initial writing system, but are associated with different parts-of-speech;
  
  create, for each identified subset of words, a cluster that indicates the parts-of-speech associated with the words in the subset of words;
  
  receive a first character string;
  
  receive a reading for each word in the first character string;
  
  receive a part-of-speech identifier for each word in the first character string;
  
  generate a set of cluster bigrams for each ordered pair of clusters in the set of clusters;
  
  associate each of the cluster bigrams with an approximate probability of a first word in the first character string being associated with a first cluster in one of the ordered pairs and a second word that immediately follows the first word in the first character string being associated with a second cluster in the ordered pair;
  
  receive a second character string in the initial writing system that represents three words in the Japanese language;
  
  divide the second character string into a set of three substrings, each of which represents one of the words represented by the second character string;
  
  convert each of the substrings into a set of candidate strings in the Japanese writing system;
  
  perform the following for each of the candidate strings:
  
  retrieve one of the word trigrams that indicates an approximate probability of the three words included in the candidate string;
  
  determine whether the approximate probability indicated by the retrieved word trigram is above a first threshold;
  
  associate, when it is determined that the approximate probability indicated by the retrieved word trigram is above the first threshold, the candidate string with the approximate probability indicated by the retrieved word trigram;
  
  retrieve, when it is determined that the approximate probability indicated by the retrieved word trigram is not above the first threshold, one of the word bigrams that indicates approximate probabilities of two of the words in the candidate string;
  
  determine whether the approximate probability indicated by the retrieved word bigram is above a second threshold;
  
  associate, when it is determined that the approximate probability indicated by the retrieved word bigram is above the second threshold, the candidate string with the approximate probability indicated by the retrieved word bigram;
  
  retrieve, when it is determined that the approximate probability indicated by the retrieved word bigram is not above the second threshold, a cluster bigram that indicates an approximate probability of a first cluster following a second cluster, wherein the first cluster is associated with a first word in the candidate string and the second cluster is associated with a word in the candidate string that follows the first word in the candidate string; and
  
  associate, after retrieving the applicable cluster bigram, the approximate probability indicated by the retrieved cluster bigram with the candidate string;
  
  after associating an approximate probability with each of the candidate strings, determine an order of the candidate strings according to the approximate probabilities associated with the candidate strings; and
  
  display the candidate strings in accordance with the determined order.
- Dependent claims: 20
  - 20. The computer-readable medium of claim 7, wherein the Japanese writing system is kanji.
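The scoring cascade in claim 7 (word trigram, then word bigram, then cluster bigram) can be sketched as a chain of threshold checks followed by a sort. Several details here are assumptions, because the claim leaves them open: the sketch uses the last two words for both the word bigram and the cluster bigram, treats a missing table entry as probability 0.0, and uses made-up probabilities, thresholds, and cluster ids.

```python
def score_candidate(words, cluster_of, trigrams, bigrams, cluster_bigrams,
                    trigram_threshold, bigram_threshold):
    """Score a three-word candidate, backing off to a cluster bigram rather than a unigram."""
    w1, w2, w3 = words
    p = trigrams.get((w1, w2, w3), 0.0)
    if p > trigram_threshold:
        return p
    p = bigrams.get((w2, w3), 0.0)  # assumption: back off to the last two words
    if p > bigram_threshold:
        return p
    # Final back-off: probability of the two words' clusters occurring in sequence.
    return cluster_bigrams.get((cluster_of[w2], cluster_of[w3]), 0.0)


def rank_candidates(candidates, **tables):
    """Order candidates from most to least probable."""
    return sorted(candidates, key=lambda c: score_candidate(c, **tables), reverse=True)


# Hypothetical model tables for illustration only.
tables = dict(
    cluster_of={"秋": "C1", "空き": "C2", "亜": "C3", "の": "C4", "木の": "C5", "空": "C6"},
    trigrams={("秋", "の", "空"): 0.02},
    bigrams={("の", "空"): 0.005},
    cluster_bigrams={("C5", "C6"): 0.0004},
    trigram_threshold=0.01,
    bigram_threshold=0.001,
)
candidates = [("亜", "木の", "空"), ("空き", "の", "空"), ("秋", "の", "空")]
ranked = rank_candidates(candidates, **tables)
```

With these toy tables, the first candidate in `ranked` is scored by its trigram, the second by the word bigram, and the third only by the cluster bigram, which mirrors the order of the checks in the claim.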

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors
Maeda, Rie; Seki, Miyuki; Sato, Yoshiharu

Granted Patent

US 8,744,833 B2
Field of Search
US Class Current

704/2
CPC Class Codes

G06F 40/129   Handling non-Latin characte...

G06F 40/284   Lexical analysis, e.g. toke...

G06F 40/53   Processing of non-Latin tex...

G10L 15/06   Creation of reference templ...
