Technique for searching out new words that should be registered in dictionary for speech processing

  • US 8,140,332 B2
  • Filed: 12/14/2007
  • Issued: 03/20/2012
  • Est. Priority Date: 12/15/2006
  • Status: Active Grant
First Claim

1. A system for automatically searching out a new word to be newly registered in a dictionary included in a segmentation device for segmenting an inputted text into a plurality of words, the system comprising:

  • a computer processing device for executing units, including at least a segmentation candidate generating unit, a sum calculating unit and a searching unit;

    the segmentation candidate generating unit for generating a plurality of segmentation candidates by inputting a training text into the segmentation device to cause the segmentation device to segment the training text into words, the segmentation candidates containing mutually different combinations of words resulting from the segmentation of the training text, and being associated with certainty factors of the results of the segmentation;

    the sum calculating unit for computing a likelihood that each word is a new word by summing up the certainty factors associated with the plurality of segmentation candidates containing the word; and

    the searching unit for searching combinations of words contained in at least any one of the segmentation candidates and containing words with which the entire training text can be written, in order to find out a combination that minimizes an information entropy of words assuming that each word belonging to the combinations appears in the training text at a frequency according to the likelihood corresponding to the word, and thereafter for outputting the found-out combination as the combination of words including the new word;

    wherein the searching unit comprises:

    an information amount calculating unit which calculates an information amount of each word assuming that the word appears at a frequency according to the likelihood corresponding to the word;

    a word joining unit which excludes a first word from, and adds a second word to, the combination of words containing the new word, the first word contained in at least any one of the segmentation candidates, and the second word being a character string containing the first word, on condition that a second information amount calculated for the second word is smaller than a first information amount calculated for the first word; and

    a word segmenting unit which excludes a fifth word from, and adds third and fourth words to, the combination of words containing the new word, the third and fourth words contained in at least any one of the segmentation candidates, and the fifth word being a character string obtained by joining a character string of the third word and a character string of the fourth word, on condition that a sum of a third information amount calculated for the third word and a fourth information amount calculated for the fourth word is smaller than a fifth information amount calculated for the fifth word,

    wherein

    the searching unit causes the word joining unit and the word segmenting unit to perform the processing repeatedly and alternately until any more word that should be excluded from or added to the combination of words containing the new word cannot be searched out among the words contained in at least any one of the segmentation candidates, and

    the searching unit outputs the combination of words on condition that any word that should be excluded or added cannot be found thereamong,

    wherein

    the information amount calculating unit stores in the memory an information amount calculated for each word,

    on condition that the word joining unit excludes the first word and adds the second word, the information amount calculating unit calculates an information amount of the second word assuming that a new likelihood of the second word is a sum of the likelihood of the first word and the current likelihood of the second word, and updates the second information amount stored in the memory,

    on condition that the word segmenting unit excludes the fifth word and adds the third and fourth words, the information amount calculating unit calculates information amounts of the third word and the fourth word, assuming that a new likelihood of the third word is a sum of the current likelihood of the third word and the likelihood of the fifth word, and that a new likelihood of the fourth word is a sum of the current likelihood of the fourth word and the likelihood of the fifth word, and updates the third and fourth information amounts, and

    the word joining unit and the word segmenting unit judge whether a word should be excluded or added by using the updated information amounts having been stored in the memory.
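
The likelihood summation and the join and segment conditions recited in the claim can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation. The claim does not give a formula for the "information amount", so the sketch assumes -log2(likelihood / total), where total is a normalizer fixed at the sum of the initial likelihoods. The names `word_likelihoods`, `information_amount`, `SearchState`, `try_join` and `try_segment` are hypothetical. The sketch does not enforce the requirement that the entire training text can be written with the combination, and it does not run the alternating loop to convergence.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field


def word_likelihoods(candidates):
    """Sum the certainty factors of all candidates that contain each word.

    `candidates` is a list of (words, certainty_factor) pairs.
    """
    likelihood = defaultdict(float)
    for words, certainty in candidates:
        for word in set(words):  # a candidate contributes once per distinct word
            likelihood[word] += certainty
    return dict(likelihood)


def information_amount(likelihood, total):
    """Assumed definition: -log2 of the relative frequency likelihood / total."""
    return -math.log2(likelihood / total)


@dataclass
class SearchState:
    likelihood: dict   # word -> current likelihood
    total: float       # fixed normalizer (an assumption of this sketch)
    combination: set   # current combination of words
    info: dict = field(init=False)  # stored information amount per word

    def __post_init__(self):
        self.info = {w: information_amount(l, self.total)
                     for w, l in self.likelihood.items()}

    def refresh(self, word):
        """Recompute and store the information amount of one word."""
        self.info[word] = information_amount(self.likelihood[word], self.total)


def try_join(state, first, second):
    """Exclude `first` and add `second` (a string containing `first`)
    if the stored information amount of `second` is smaller."""
    if first == second or first not in second:
        return False
    if first not in state.combination or second not in state.info:
        return False
    if state.info[second] >= state.info[first]:
        return False
    # New likelihood of the second word: its own plus that of the first word.
    state.likelihood[second] += state.likelihood[first]
    state.refresh(second)
    state.combination.discard(first)
    state.combination.add(second)
    return True


def try_segment(state, third, fourth):
    """Exclude the joined string third + fourth and add both parts if the
    sum of their stored information amounts is smaller."""
    fifth = third + fourth
    if fifth not in state.combination:
        return False
    if third not in state.info or fourth not in state.info:
        return False
    if state.info[third] + state.info[fourth] >= state.info[fifth]:
        return False
    # Each part gains the likelihood of the excluded fifth word.
    for part in (third, fourth):
        state.likelihood[part] += state.likelihood[fifth]
        state.refresh(part)
    state.combination.discard(fifth)
    state.combination.update((third, fourth))
    return True
```

A caller would build a `SearchState` from the output of `word_likelihoods` and then apply `try_segment` and `try_join` alternately until both return False for every pair of words, which corresponds to the stopping condition recited in the claim. Because only the words affected by a step have their stored information amounts refreshed, later decisions use the updated values, as the final wherein clause describes.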
