Word boundary probability estimating, probabilistic language model building, kana-kanji converting, and unknown word model building

US 20080228463A1
Filed: 05/26/2008
Published: 09/18/2008
Est. Priority Date: 07/14/2004
Status: Active Grant

Abstract

Calculates a word n-gram probability with high accuracy in a situation where a first corpus, which is a relatively small corpus containing manually segmented word information, and a second corpus, which is a relatively large corpus, are given as a training corpus, that is, storage containing vast quantities of sample sentences. Vocabulary including contextual information is expanded from words occurring in the first corpus of relatively small size to words occurring in the second corpus of relatively large size by using a word n-gram probability estimated from an unknown word model and the raw corpus. The first corpus (word-segmented) is used for calculating n-grams and the probability that the boundary between two adjacent characters will be the boundary of two words (segmentation probability). The second corpus (word-unsegmented), in which probabilistic word boundaries are assigned based on information in the first corpus (word-segmented), is used for calculating word n-grams.
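As a rough illustration of how probabilistic word boundaries in an unsegmented corpus can feed word n-gram counts, the sketch below computes an expected unigram count for a candidate word. The counting rule used here (boundary probability on the left, times the probability of no boundary at each inner position, times boundary probability on the right) is an assumption made for this example and is not quoted from the abstract; the function name `expected_word_count` and its parameters are likewise hypothetical.

```python
from math import prod


def expected_word_count(text, boundary_probs, word):
    """Expected number of times `word` occurs as a word in `text`.

    boundary_probs[i] is the probability of a word boundary between
    text[i] and text[i + 1]. The two ends of the string are treated as
    certain boundaries. The counting rule is an assumption of this sketch.
    """
    if len(boundary_probs) != len(text) - 1:
        raise ValueError("need one boundary probability per adjacent character pair")
    # padded[i] is the probability of a boundary immediately before text[i];
    # padded[len(text)] is the boundary after the last character.
    padded = [1.0] + list(boundary_probs) + [1.0]
    total = 0.0
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        left = padded[start]
        right = padded[end]
        # No boundary may fall strictly inside the candidate word.
        inner = prod(1.0 - p for p in padded[start + 1:end])
        total += left * inner * right
        start = text.find(word, start + 1)
    return total


# Two overlapping-free occurrences of "ab", each contributing 0.4.
count = expected_word_count("abab", [0.2, 0.5, 0.2], "ab")  # about 0.8
```

With boundary probabilities restricted to 0 and 1 the same function reduces to ordinary counting over a fully segmented string, which is one reason a single routine can serve both corpora in a sketch like this.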

244 Citations

20 Claims

7. An unknown word model building method comprising the steps of:
- selecting one of the words occurring in a character string stored in a storage device;
  
  assigning a pronunciation to the selected word;
  
  storing the word and the pronunciation in a vocabulary dictionary;
  
  assigning pronunciations to each of the characters constituting the word; and
  
  associating and storing a plurality of different occurrence probabilities with the associated pronunciations into a character dictionary.
- View Dependent Claims (1, 2, 3, 4, 5, 6, 15, 17, 19, 20)
- - 1. A word boundary probability estimating device comprising:
    - means for calculating a probability that a word boundary will exist between a first plurality of characters constituting a first character string stored in a first corpus by invoking information as to whether a word boundary already exists between the first plurality of characters from the first corpus containing the first character string comprising the first plurality of characters or setting up the preliminary information as to whether the word boundary exists, and
      
      means for estimating the probability that the word boundary will exist in a second plurality of characters constituting a second character string stored in a second corpus by referring to a calculated probability between the second plurality of characters constituting the second character string stored in the second corpus.
  - 2. The word boundary probability estimating device according to claim 1, wherein the probability that a word boundary will exist between the plurality of characters is calculated on the basis of relations in the sequence of character types of successive characters in the character string in the first corpus by using the information as to whether a word boundary already exists between characters of those character types.
  - 3. The word boundary probability estimating device according to claim 1, wherein the calculated probability that a word boundary will exist includes values between 0 and 1 capable of probabilistically indicating the degree of segmentation as well as values of 0 and 1 indicating whether or not segmentation is already determined, in order to discriminate whether the character string including the plurality of characters is segmented or not.
  - 4. The word boundary probability estimating device according to claim 1,
    - further comprising:
      
      word n-gram probability calculating means for calculating the probability that a word boundary will exist on the basis of the relation between the preceding character and the subsequent character at the position of each occurrence of a character string including one or more characters constituting a word n-gram, the probability being a word n-gram probability, and thereby forming a probabilistic language model building device.
  - 5. The word boundary probability estimating device according to claim 1,
    - further comprising:
      
      word n-gram probability calculating means for calculating the probability that a word boundary will exist on the basis of a decision tree in which relations between a plurality of characters constituting a word n-gram are described or on the basis of PPM (Prediction by Partial Match), the probability being a word n-gram probability, and thereby forming a probabilistic language model building device.
  - 6. The word boundary probability estimating device according to claim 4,
    - further comprising:
      
      a first corpus storing a character string which includes a plurality of characters and is segmented into at least two words as a word including one or more characters, and storing an occurrence probability indicating the likelihood of occurrence of each of segmented words in association with the segmented word;
      
      a second corpus storing a character string including a plurality of characters and a word n-gram probability calculated by the probabilistic language model building device;
      
      a vocabulary dictionary storing, correspondingly to the first corpus, pronunciations associated with each of words segmented as known words;
      
      a character dictionary storing, correspondingly to the first corpus, a pronunciation and spelling of each character in a plurality of conversion candidates that can be converted from an unknown word in association with each other so that the plurality of conversion candidates can be listed for the unknown word, and storing the occurrence probability of a pronunciation of each character; and
      
      a language decoding section converting an input spelling into a conversion candidate by referring to the occurrence probabilities associated with each word stored in the first corpus and the occurrence probabilities of pronunciations of each character stored in the character dictionary, and the word n-gram probability stored in the second corpus for an input pronunciation, and thereby forming a kana-kanji converting device.
  - 15. The word boundary probability estimating device according to claim 5,
    - further comprising:
      
      a first corpus storing a character string which includes a plurality of characters and is segmented into at least two words as a word including at least one character, and storing an occurrence probability indicating the likelihood of occurrence of each of segmented words in association with the segmented word;
      
      a second corpus storing a character string including a plurality of characters and a word n-gram probability calculated by the probabilistic language model building device;
      
      a vocabulary dictionary storing, correspondingly to the first corpus, pronunciations associated with each of words segmented as known words;
      
      a character dictionary storing, correspondingly to the first corpus, a pronunciation and spelling of each character in a plurality of conversion candidates that can be converted from an unknown word in association with each other so that the plurality of conversion candidates can be listed for the unknown word, and storing the occurrence probability of a pronunciation of each character; and
      
      a language decoding section converting an input spelling into a conversion candidate by referring to the occurrence probabilities associated with each word stored in the first corpus and the occurrence probabilities of pronunciations of each character stored in the character dictionary, and the word n-gram probability stored in the second corpus for an input pronunciation, and thereby forming a kana-kanji converting device.
  - 17. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for unknown word model building, said method steps comprising the steps of claim 7.
  - 19. A computer program product comprising a computer usable medium having computer readable program code means embodied therein for causing functions of a word boundary probability estimating device, the computer readable program code means in said computer program product comprising computer readable program code means for causing a computer to effect the functions of claim 1.
  - 20. A computer program product comprising a computer usable medium having computer readable program code means embodied therein for causing functions of a kana-kanji converting device, the computer readable program code means in said computer program product comprising computer readable program code means for causing a computer to effect the functions of claim 6.

8. A word boundary probability estimating method comprising the steps of:
- calculating a probability that a word boundary will exist between a plurality of characters constituting a character string stored in a first corpus by invoking information as to whether a word boundary already exists between the plurality of characters from the first corpus containing the character string including the plurality of characters or setting up the preliminary information as to whether the word boundary exists; and
  
  estimating the probability that a word boundary will exist between a plurality of characters constituting a character string stored in a second corpus by referring to the calculated probability between the plurality of characters constituting the character string stored in the second corpus.
- View Dependent Claims (9, 10, 11, 12, 13, 14, 16, 18)
- - 9. The word boundary probability estimating method according to claim 8, wherein the step of calculating the probability that a word boundary will exist between a plurality of characters calculates the probability on the basis of relations in the sequence of character types of successive characters in the character string in the first corpus by using the information as to whether a word boundary already exists between characters of those character types.
  - 10. The word boundary probability estimating method according to claim 8, wherein the calculated probability that a word boundary will exist includes values between 0 and 1 capable of probabilistically indicating the degree of segmentation as well as values of 0 and 1 indicating whether or not segmentation is already determined, in order to discriminate whether the character string including the plurality of characters is segmented or not.
  - 11. The word boundary probability estimating method according to claim 8, further comprising the step of calculating the probability that a word boundary will exist on the basis of the relation between the preceding character and the subsequent character at the position of each occurrence of a character string including at least one character constituting a word n-gram, the probability being a word n-gram probability, thereby constituting a probabilistic language model building method.
  - 12. The word boundary probability estimating method according to claim 8, further comprising the step of calculating the probability that a word boundary will exist on the basis of a decision tree in which relations between a plurality of characters constituting a word n-gram are described or on the basis of PPM (Prediction by Partial Match), the probability being a word n-gram probability, thereby constituting a probabilistic language model building method.
  - 13. The word boundary probability estimating method according to claim 11, further comprising the steps of:
    - referring to a first corpus storing a character string which includes a plurality of characters and is segmented into at least two words as the word including at least one character, and storing an occurrence probability indicating the likelihood of occurrence of each of segmented words in association with the segmented word;
      
      referring to a second corpus storing a character string including a plurality of characters and a word n-gram probability calculated by the probabilistic language model building device according to the probabilistic language model building method;
      
      referring to a vocabulary dictionary storing, correspondingly to the first corpus, pronunciations associated with each of words segmented as known words;
      
      referring to a character dictionary storing, correspondingly to the first corpus, pronunciations and spelling of each character in a plurality of conversion candidates that can be converted from an unknown word in association with each other so that the plurality of conversion candidates can be listed for the unknown word and storing the occurrence probability of a pronunciation of each character; and
      
      converting an input spelling into a conversion candidate by referring to the occurrence probabilities associated with each word stored in the first corpus and the occurrence probabilities of pronunciations of each character stored in the character dictionary, and the word n-gram probability stored in the second corpus for an input pronunciation, thereby forming a kana-kanji conversion method.
  - 14. The word boundary probability estimating method according to claim 12, further comprising the steps of:
    - referring to a first corpus storing a character string which includes a plurality of characters and is segmented into at least two words as the word including at least one character, and storing an occurrence probability indicating the likelihood of occurrence of each of segmented words in association with the segmented word;
      
      referring to a second corpus storing a character string including a plurality of characters and a word n-gram probability calculated by the probabilistic language model building device according to the probabilistic language model building method;
      
      referring to a vocabulary dictionary storing, correspondingly to the first corpus, pronunciations associated with each of words segmented as known words;
      
      referring to a character dictionary storing, correspondingly to the first corpus, pronunciations and spelling of each character in a plurality of conversion candidates that can be converted from an unknown word in association with each other so that the plurality of conversion candidates can be listed for the unknown word and storing the occurrence probability of a pronunciation of each character; and
      
      converting an input spelling into a conversion candidate by referring to the occurrence probabilities associated with each word stored in the first corpus and the occurrence probabilities of pronunciations of each character stored in the character dictionary, and the word n-gram probability stored in the second corpus for an input pronunciation, thereby forming a kana-kanji conversion method.
  - 16. An article of manufacture comprising a computer usable medium having computer readable program code means embodied therein for causing word boundary probability estimating, the computer readable program code means in said article of manufacture comprising computer readable program code means for causing a computer to effect the steps of claim 8.
  - 18. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for word boundary probability estimating, said method steps comprising the steps of claim 8.
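The estimating steps of claims 8 to 10 can be pictured with a minimal sketch: boundary statistics are gathered per pair of adjacent character types from a word-segmented first corpus, then looked up for each adjacent pair in an unsegmented second corpus. The character classes in `char_type`, the relative-frequency estimate, the `default` fallback of 0.5 for unseen type pairs, and all function names are assumptions of this example, not details stated in the claims.

```python
from collections import Counter


def char_type(ch):
    """Coarse character class; the classes chosen here are illustrative."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "other"


def train_boundary_model(segmented_sentences):
    """Estimate P(boundary | type of left char, type of right char).

    segmented_sentences is a list of sentences, each a list of words.
    """
    boundary = Counter()
    total = Counter()
    for words in segmented_sentences:
        text = "".join(words)
        # Character offsets at which a word boundary is known to exist.
        cuts = set()
        pos = 0
        for word in words[:-1]:
            pos += len(word)
            cuts.add(pos)
        for i in range(1, len(text)):
            key = (char_type(text[i - 1]), char_type(text[i]))
            total[key] += 1
            if i in cuts:
                boundary[key] += 1
    return {key: boundary[key] / total[key] for key in total}


def estimate_boundaries(text, model, default=0.5):
    """Boundary probability for each adjacent character pair of raw text."""
    return [
        model.get((char_type(a), char_type(b)), default)
        for a, b in zip(text, text[1:])
    ]


first_corpus = [["私", "は", "学生", "です"], ["食べ", "る"]]
model = train_boundary_model(first_corpus)
probs = estimate_boundaries("本を", model)
```

In this toy corpus some type pairs come out at exactly 0 or 1 and others in between, which mirrors the distinction in claim 10 between positions whose segmentation is already determined and positions that are only probabilistically segmented.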

15-1. An article of manufacture comprising a computer usable medium having computer readable program code means embodied therein for causing unknown word model building, the computer readable program code means in said article of manufacture comprising computer readable program code means for causing a computer to effect the steps of claim 7.
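The steps of claim 7 can be read as building two tables: a vocabulary dictionary keyed by word and a character dictionary that holds several pronunciations per character, each with its own occurrence probability. The sketch below is one possible reading under stated assumptions: per-character pronunciations are supplied by hand as input, the occurrence probability is taken to be a relative frequency, and the names `build_unknown_word_model`, `vocabulary`, and `character_dictionary` are hypothetical.

```python
from collections import Counter, defaultdict


def build_unknown_word_model(entries):
    """Build a vocabulary dictionary and a character dictionary.

    entries is an iterable of (word, per_character_pronunciations) pairs,
    where the second element has one pronunciation per character of the word.
    """
    vocabulary = {}
    counts = defaultdict(Counter)
    for word, char_readings in entries:
        if len(word) != len(char_readings):
            raise ValueError("need one pronunciation per character")
        # Word-level pronunciation, stored in the vocabulary dictionary.
        vocabulary[word] = "".join(char_readings)
        # Character-level pronunciations, counted for the character dictionary.
        for ch, reading in zip(word, char_readings):
            counts[ch][reading] += 1
    # Assumption: occurrence probability = relative frequency per character.
    character_dictionary = {
        ch: {reading: n / sum(c.values()) for reading, n in c.items()}
        for ch, c in counts.items()
    }
    return vocabulary, character_dictionary


entries = [
    ("学生", ["がく", "せい"]),
    ("先生", ["せん", "せい"]),
    ("生水", ["なま", "みず"]),
]
vocabulary, character_dictionary = build_unknown_word_model(entries)
```

Here the character 生 ends up with two pronunciations and two different probabilities, which is the kind of entry the last step of claim 7 describes.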

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Mori, Shinsuke; Takuma, Daisuke

Granted Patent

US 7,917,350 B2
US Class Current

704/2
CPC Class Codes

G06F 40/216 using statistical methods

G06F 40/53 Processing of non-Latin tex...
