Word boundary probability estimating, probabilistic language model building, kana-kanji converting, and unknown word model building
First Claim
7. An unknown word model building method comprising the steps of:
selecting one of the words occurring in a character string stored in a storage device;
assigning a pronunciation to the selected word;
storing the word and the pronunciation in a vocabulary dictionary;
assigning pronunciations to each of the characters constituting the word; and
associating and storing a plurality of different occurrence probabilities with the associated pronunciations into a character dictionary.
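The steps of this claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `build_unknown_word_model`, the input layout (a word, its pronunciation, and one pronunciation per character), and the use of relative frequencies as the occurrence probabilities are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_unknown_word_model(entries):
    """Build a vocabulary dictionary and a character dictionary.

    entries: iterable of (word, word_pronunciation, char_pronunciations),
    where char_pronunciations holds one pronunciation per character.
    This input layout is hypothetical; the claim does not specify one.
    """
    vocabulary = defaultdict(set)   # word -> pronunciations seen for it
    counts = defaultdict(Counter)   # character -> pronunciation counts

    for word, word_pron, char_prons in entries:
        if len(word) != len(char_prons):
            raise ValueError(f"need one pronunciation per character: {word!r}")
        # Store the word and its pronunciation in the vocabulary dictionary.
        vocabulary[word].add(word_pron)
        # Assign a pronunciation to each character constituting the word.
        for ch, pron in zip(word, char_prons):
            counts[ch][pron] += 1

    # Assumption: occurrence probabilities are relative frequencies.
    character_dictionary = {}
    for ch, pron_counts in counts.items():
        total = sum(pron_counts.values())
        character_dictionary[ch] = {
            pron: n / total for pron, n in pron_counts.items()
        }
    return dict(vocabulary), character_dictionary


if __name__ == "__main__":
    sample = [
        ("東京", "とうきょう", ["とう", "きょう"]),
        ("京都", "きょうと", ["きょう", "と"]),
        ("東", "ひがし", ["ひがし"]),
    ]
    vocab, chars = build_unknown_word_model(sample)
    print(vocab["東京"])   # {'とうきょう'}
    print(chars["東"])     # {'とう': 0.5, 'ひがし': 0.5}
```

In this sketch a character such as 東 ends up with several pronunciations, each carrying its own probability, which corresponds to the "plurality of different occurrence probabilities" stored in the character dictionary.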
Abstract
Calculates a word n-gram probability with high accuracy in a situation where a first corpus, which is a relatively small corpus containing manually segmented word information, and a second corpus, which is a relatively large corpus, are given as a training corpus, that is, storage containing vast quantities of sample sentences. Vocabulary including contextual information is expanded from the words occurring in the relatively small first corpus to the words occurring in the relatively large second corpus by using a word n-gram probability estimated from an unknown word model and the raw corpus. The first corpus (word-segmented) is used for calculating n-grams and the probability that the boundary between two adjacent characters is a boundary between two words (segmentation probability). The second corpus (word-unsegmented), to which probabilistic word boundaries are assigned based on information in the first corpus (word-segmented), is used for calculating word n-grams.
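One way to count words in a corpus that carries only probabilistic word boundaries is to sum, over every occurrence of the word's character sequence, the probability that it is bounded on both sides and unbroken inside. The Python sketch below shows that calculation for unigram counts. The abstract does not give this formula; treating boundaries as independent, treating the start and end of the text as certain boundaries, and the name `expected_word_count` are assumptions.

```python
def expected_word_count(text, boundary_probs, word):
    """Expected number of times `word` occurs as a word in `text`.

    boundary_probs[i] is the probability of a word boundary between
    text[i] and text[i + 1], so it has len(text) - 1 entries.
    Assumption: boundaries are independent of one another.
    """
    if not word:
        raise ValueError("word must be non-empty")
    if len(boundary_probs) != len(text) - 1:
        raise ValueError("need one probability per adjacent character pair")

    # p[i] is the boundary probability just before text[i];
    # the start and end of the text are treated as certain boundaries.
    p = [1.0] + list(boundary_probs) + [1.0]

    n = len(word)
    total = 0.0
    for start in range(len(text) - n + 1):
        if text[start:start + n] != word:
            continue
        # Boundary on the left and on the right of the candidate ...
        prob = p[start] * p[start + n]
        # ... and no boundary at any position inside it.
        for j in range(start + 1, start + n):
            prob *= 1.0 - p[j]
        total += prob
    return total


if __name__ == "__main__":
    probs = [0.2, 0.9, 0.2]
    print(expected_word_count("abab", probs, "ab"))  # approximately 1.44
    print(expected_word_count("abab", probs, "a"))   # approximately 0.38
```

Dividing such expected counts by the expected total number of words would give a unigram probability; longer n-grams would need the same product extended across the boundaries between consecutive words.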
244 Citations
20 Claims
8. A word boundary probability estimating method comprising the steps of:
calculating a probability that a word boundary will exist between a plurality of characters constituting a character string stored in a first corpus by invoking information as to whether a word boundary already exists between the plurality of characters from the first corpus containing the character string including the plurality of characters or setting up the preliminary information as to whether the word boundary exists; and
estimating the probability that a word boundary will exist between a plurality of characters constituting a character string stored in a second corpus by referring to the calculated probability between the plurality of characters constituting the character string stored in the second corpus.
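The two steps of this claim can be sketched in Python as a training pass over the segmented first corpus followed by a lookup pass over the unsegmented second corpus. Conditioning the probability on the adjacent character pair, falling back to the overall boundary rate for unseen pairs, and the names `train_boundary_probs` and `estimate_boundaries` are assumptions for illustration; the claim does not fix any of these choices.

```python
from collections import Counter


def train_boundary_probs(segmented_sentences):
    """Estimate boundary probabilities from a word-segmented corpus.

    segmented_sentences: list of sentences, each a list of words.
    Returns (table, overall): table maps an adjacent character pair to
    the fraction of its occurrences that straddle a word boundary, and
    overall is the boundary rate across all pairs.
    """
    boundary = Counter()  # pair -> times a boundary fell between the pair
    total = Counter()     # pair -> times the pair was seen at all

    for words in segmented_sentences:
        chars = "".join(words)
        # Character offsets at which one word ends and the next begins.
        cut_points = set()
        pos = 0
        for w in words[:-1]:
            pos += len(w)
            cut_points.add(pos)
        for i in range(1, len(chars)):
            pair = (chars[i - 1], chars[i])
            total[pair] += 1
            if i in cut_points:
                boundary[pair] += 1

    seen = sum(total.values())
    overall = sum(boundary.values()) / seen if seen else 0.5
    table = {pair: boundary[pair] / n for pair, n in total.items()}
    return table, overall


def estimate_boundaries(text, table, default):
    """Assign a boundary probability to each gap in unsegmented text."""
    return [
        table.get((text[i - 1], text[i]), default)
        for i in range(1, len(text))
    ]


if __name__ == "__main__":
    first_corpus = [["ab", "c"], ["ab", "cd"]]
    table, overall = train_boundary_probs(first_corpus)
    print(estimate_boundaries("abcda", table, overall))
    # [0.0, 1.0, 0.0, 0.4]
```

The last value in the example comes from the fallback: the pair ("d", "a") never occurs in the first corpus, so the overall boundary rate of 2 boundaries in 5 gaps is used.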
15-1. An article of manufacture comprising a computer usable medium having computer readable program code means embodied therein for causing unknown word model building, the computer readable program code means in said article of manufacture comprising computer readable program code means for causing a computer to effect the steps of claim 7.
Specification