Technique for searching out new words that should be registered in dictionary for speech processing
Abstract
A technique for searching out a new word that should be newly registered in a dictionary contained in a segmentation device that segments a text into words. The system inputs a training text into the segmentation device to cause the segmentation device to segment the training text into words, and thereby generates a plurality of segmentation candidates, each associated with a certainty factor of its segmentation result. The segmentation candidates contain mutually different combinations of words resulting from the segmentation of the training text. The system then computes a likelihood that each word is a new word by summing the certainty factors associated with those segmentation candidates that contain the word. Finally, from among combinations of words that are each contained in at least one of the segmentation candidates and with which the entire training text can be written, the system searches for a combination that minimizes an information entropy of words, assuming that each word belonging to the combination appears in the training text at a frequency according to the likelihood corresponding to the word, and outputs the found combination as the combination of words including the new word.
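The likelihood computation described in the abstract can be illustrated with a minimal Python sketch. The function name `word_likelihoods`, the `(words, certainty)` pair layout, and the sample candidates are assumptions made for illustration only; the source does not specify a data format or how the segmentation device reports certainty factors.

```python
from collections import defaultdict


def word_likelihoods(candidates):
    """Sum the certainty factors of every candidate that contains a word.

    `candidates` is a list of (words, certainty) pairs; this layout is a
    hypothetical stand-in for the output of the segmentation device.
    """
    likelihood = defaultdict(float)
    for words, certainty in candidates:
        # A candidate contributes its certainty once per distinct word,
        # even if the word occurs several times in that candidate.
        for word in set(words):
            likelihood[word] += certainty
    return dict(likelihood)


# Three hypothetical segmentations of the same training text.
candidates = [
    (["data", "base"], 0.6),
    (["database"], 0.2),
    (["data", "bas", "e"], 0.2),
]
scores = word_likelihoods(candidates)
```

Here `data` appears in two candidates, so its likelihood is the sum of their two certainty factors, while `database` appears in only one.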
5 Claims
-
1. A system for automatically searching out a new word to be newly registered in a dictionary included in a segmentation device for segmenting an inputted text into a plurality of words, the system comprising:
-
a computer processing device for executing units including at least a segmentation candidate generating unit, a sum calculating unit and a searching unit;
the segmentation candidate generating unit for generating a plurality of segmentation candidates by inputting a training text into the segmentation device to cause the segmentation device to segment the training text into words, the segmentation candidates containing mutually different combinations of words resulting from the segmentation of the training text, and being associated with certainty factors of the results of the segmentation;
the sum calculating unit for computing a likelihood that each word is a new word by summing up the certainty factors associated with the plurality of segmentation candidates containing the word; and
the searching unit for searching combinations of words contained in at least any one of the segmentation candidates and containing words with which the entire training text can be written, in order to find out a combination that minimizes an information entropy of words assuming that each word belonging to the combinations appears in the training text at a frequency according to the likelihood corresponding to the word, and thereafter for outputting the found-out combination as the combination of words including the new word;
wherein the searching unit comprises:
an information amount calculating unit which calculates an information amount of each word assuming that the word appears at a frequency according to the likelihood corresponding to the word;
a word joining unit which excludes a first word from, and adds a second word to, the combination of words containing the new word, the first word contained in at least any one of the segmentation candidates, and the second word being a character string containing the first word, on condition that a second information amount calculated for the second word is smaller than a first information amount calculated for the first word; and
a word segmenting unit which excludes a fifth word from, and adds third and fourth words to, the combination of words containing the new word, the third and fourth words contained in at least any one of the segmentation candidates, and the fifth word being a character string obtained by joining a character string of the third word and a character string of the fourth word, on condition that a sum of a third information amount calculated for the third word and a fourth information amount calculated for the fourth word is smaller than a fifth information amount calculated for the fifth word,
wherein the searching unit causes the word joining unit and the word segmenting unit to perform the processing repeatedly and alternately until no further word that should be excluded from or added to the combination of words containing the new word can be found among the words contained in at least any one of the segmentation candidates, and
the searching unit outputs the combination of words on condition that no word that should be excluded or added can be found thereamong,
wherein the information amount calculating unit stores in a memory an information amount calculated for each word,
on condition that the word joining unit excludes the first word and adds the second word, the information amount calculating unit calculates an information amount of the second word assuming that a new likelihood of the second word is a sum of the likelihood of the first word and the current likelihood of the second word, and updates the second information amount stored in the memory,
on condition that the word segmenting unit excludes the fifth word and adds the third and fourth words, the information amount calculating unit calculates information amounts of the third word and the fourth word, assuming that a new likelihood of the third word is a sum of the current likelihood of the third word and the likelihood of the fifth word, and that a new likelihood of the fourth word is a sum of the current likelihood of the fourth word and the likelihood of the fifth word, and updates the third and fourth information amounts, and
the word joining unit and the word segmenting unit judge whether a word should be excluded or added by using the updated information amounts having been stored in the memory.
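The joining and segmenting conditions of claim 1 can be written compactly as follows. The claim does not define the information amount explicitly, so the definitions of p(w), I(w) and the normalizer Z below are assumptions (the standard self-information), and the symbols L, C, w_1 to w_5 are introduced here only for illustration.

```latex
% Assumed definitions: L(w) is the likelihood of word w, Z a normalizer,
% and C the current combination of words.
p(w) = \frac{L(w)}{Z}, \qquad
I(w) = -\log_2 p(w), \qquad
H(C) = -\sum_{w \in C} p(w)\,\log_2 p(w)

% Word joining: w_2 is a character string containing w_1.
I(w_2) < I(w_1)
\;\Rightarrow\;
C \leftarrow (C \setminus \{w_1\}) \cup \{w_2\}, \quad
L(w_2) \leftarrow L(w_1) + L(w_2)

% Word segmenting: w_5 is the concatenation of w_3 and w_4.
I(w_3) + I(w_4) < I(w_5)
\;\Rightarrow\;
C \leftarrow (C \setminus \{w_5\}) \cup \{w_3, w_4\}, \quad
L(w_3) \leftarrow L(w_3) + L(w_5), \quad
L(w_4) \leftarrow L(w_4) + L(w_5)
```

After each update, the stored values of I for the affected words are recomputed from the new likelihoods, and later comparisons use those stored values.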
-
-
4. A method of searching out a new word that should be newly registered in a dictionary included in a segmentation device for segmenting an inputted text into a plurality of words, the method comprising the steps of:
-
generating a plurality of segmentation candidates by inputting a training text into the segmentation device to cause the segmentation device to segment the training text into words, the segmentation candidates containing mutually different combinations of words resulting from the segmentation of the training text, and being associated with certainty factors of the results of the segmentation;
computing a likelihood that each word is a new word by summing up some of the certainty factors associated with some of the plurality of segmentation candidates containing the word;
searching combinations of words each contained in at least any one of the segmentation candidates and containing words with which the entire training text can be written, in order to find out a combination that minimizes an information entropy of words, assuming that each word belonging to the combinations appears in the training text at a frequency according to the likelihood corresponding to the word; and
outputting the found-out combination as the combination of words including the new word,
wherein the searching comprises:
calculating an information amount of each word assuming that the word appears at a frequency according to the likelihood corresponding to the word;
excluding a first word from, and adding a second word to, the combination of words containing the new word, the first word contained in at least any one of the segmentation candidates, and the second word being a character string containing the first word, on condition that a second information amount calculated for the second word is smaller than a first information amount calculated for the first word; and
excluding a fifth word from, and adding third and fourth words to, the combination of words containing the new word, the third and fourth words contained in at least any one of the segmentation candidates, and the fifth word being a character string obtained by joining a character string of the third word and a character string of the fourth word, on condition that a sum of a third information amount calculated for the third word and a fourth information amount calculated for the fourth word is smaller than a fifth information amount calculated for the fifth word,
wherein the excluding and adding are performed repeatedly and alternately until no further word that should be excluded from or added to the combination of words containing the new word can be found among the words contained in at least any one of the segmentation candidates, and the combination of words is output on condition that no word that should be excluded or added can be found thereamong,
the method further comprising storing in a memory an information amount calculated for each word,
on condition that the first word is excluded and the second word is added, calculating an information amount of the second word assuming that a new likelihood of the second word is a sum of the likelihood of the first word and the current likelihood of the second word, and updating the second information amount stored in the memory,
on condition that the fifth word is excluded and the third and fourth words are added, calculating information amounts of the third word and the fourth word, assuming that a new likelihood of the third word is a sum of the current likelihood of the third word and the likelihood of the fifth word, and that a new likelihood of the fourth word is a sum of the current likelihood of the fourth word and the likelihood of the fifth word, and updating the third and fourth information amounts, and
judging whether a word should be excluded or added by using the updated information amounts having been stored in the memory.
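The alternating join and segment steps of claim 4 can be sketched in Python as below. This is only one possible reading of the claim: the information amount is assumed to be -log2(likelihood / total) with a fixed normalizer, the starting combination is chosen by the caller, the `max_rounds` guard is added for safety, and the check that the whole training text remains writable is omitted. The names `refine`, `info`, `memory`, and the sample likelihoods are hypothetical.

```python
import math


def info(word, likelihood, total):
    # Assumed information amount: self-information of the word.
    return -math.log2(likelihood[word] / total)


def refine(initial, likelihood, max_rounds=100):
    """Alternate joining and segmenting until nothing changes."""
    likelihood = dict(likelihood)      # work on a copy
    total = sum(likelihood.values())   # fixed normalizer (assumption)
    memory = {w: info(w, likelihood, total) for w in likelihood}
    combo = set(initial)
    vocab = sorted(likelihood)         # words seen in any candidate

    def join_pass():
        changed = False
        for first in sorted(combo):
            for second in vocab:
                if (second != first and first in second
                        and memory[second] < memory[first]):
                    combo.discard(first)
                    combo.add(second)
                    likelihood[second] += likelihood[first]
                    memory[second] = info(second, likelihood, total)
                    changed = True
                    break
        return changed

    def split_pass():
        changed = False
        for fifth in sorted(combo):
            for cut in range(1, len(fifth)):
                third, fourth = fifth[:cut], fifth[cut:]
                if (third in memory and fourth in memory
                        and memory[third] + memory[fourth] < memory[fifth]):
                    combo.discard(fifth)
                    combo.update((third, fourth))
                    likelihood[third] += likelihood[fifth]
                    likelihood[fourth] += likelihood[fifth]
                    memory[third] = info(third, likelihood, total)
                    memory[fourth] = info(fourth, likelihood, total)
                    changed = True
                    break
        return changed

    for _ in range(max_rounds):
        joined = join_pass()
        split = split_pass()
        if not (joined or split):
            break
    return sorted(combo)


likelihood = {"data": 0.8, "base": 0.6, "database": 0.2, "bas": 0.2, "e": 0.2}
from_whole = refine(["database"], likelihood)
from_pieces = refine(["data", "bas", "e"], likelihood)
```

With these sample likelihoods, both starting combinations settle on the same pair of words, `base` and `data`: the rare long word is segmented, and the rare fragments are joined into the more likely word that contains them.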
-
-
5. A non-transitory computer storage medium storing a program for enabling an information processing system to function as a system for searching out a new word that should be newly registered in a dictionary included in a segmentation device for segmenting an inputted text into a plurality of words, the program causing the information processing system to function as:
-
a segmentation candidate generating unit for generating a plurality of segmentation candidates by inputting a training text into the segmentation device to cause the segmentation device to segment the training text into words, the segmentation candidates containing mutually different combinations of words resulting from the segmentation of the training text, and being associated with certainty factors of the results of the segmentation;
a sum calculating unit for computing a likelihood that each word is a new word by summing up some of the certainty factors associated with some of the plurality of segmentation candidates containing the word; and
a searching unit for searching combinations of words contained in at least any one of the segmentation candidates and containing words with which the entire training text can be written, in order to find out a combination that minimizes an information entropy of words assuming that each word belonging to the combinations appears in the training text at a frequency according to the likelihood corresponding to the word, and thereafter for outputting the found-out combination as the combination of words including the new word,
wherein the searching unit comprises:
an information amount calculating unit which calculates an information amount of each word assuming that the word appears at a frequency according to the likelihood corresponding to the word;
a word joining unit which excludes a first word from, and adds a second word to, the combination of words containing the new word, the first word contained in at least any one of the segmentation candidates, and the second word being a character string containing the first word, on condition that a second information amount calculated for the second word is smaller than a first information amount calculated for the first word; and
a word segmenting unit which excludes a fifth word from, and adds third and fourth words to, the combination of words containing the new word, the third and fourth words contained in at least any one of the segmentation candidates, and the fifth word being a character string obtained by joining a character string of the third word and a character string of the fourth word, on condition that a sum of a third information amount calculated for the third word and a fourth information amount calculated for the fourth word is smaller than a fifth information amount calculated for the fifth word,
wherein the searching unit causes the word joining unit and the word segmenting unit to perform the processing repeatedly and alternately until no further word that should be excluded from or added to the combination of words containing the new word can be found among the words contained in at least any one of the segmentation candidates, and
the searching unit outputs the combination of words on condition that no word that should be excluded or added can be found thereamong,
wherein the information amount calculating unit stores in a memory an information amount calculated for each word,
on condition that the word joining unit excludes the first word and adds the second word, the information amount calculating unit calculates an information amount of the second word assuming that a new likelihood of the second word is a sum of the likelihood of the first word and the current likelihood of the second word, and updates the second information amount stored in the memory,
on condition that the word segmenting unit excludes the fifth word and adds the third and fourth words, the information amount calculating unit calculates information amounts of the third word and the fourth word, assuming that a new likelihood of the third word is a sum of the current likelihood of the third word and the likelihood of the fifth word, and that a new likelihood of the fourth word is a sum of the current likelihood of the fourth word and the likelihood of the fifth word, and updates the third and fourth information amounts, and
the word joining unit and the word segmenting unit judge whether a word should be excluded or added by using the updated information amounts having been stored in the memory.
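The objective that the searching unit minimizes can be illustrated with a short Python sketch. The claims state only that each word appears at a frequency according to its likelihood; normalizing the likelihoods over the words of the combination, as done here, is an assumption, and the function name `combination_entropy` and the sample likelihoods are hypothetical.

```python
import math


def combination_entropy(combo, likelihood):
    """Shannon entropy (in bits) of the words in one combination.

    Word frequencies are taken to be the likelihoods normalized over
    the combination, which is an assumption of this sketch.
    """
    total = sum(likelihood[w] for w in combo)
    return -sum(
        (likelihood[w] / total) * math.log2(likelihood[w] / total)
        for w in combo
    )


likelihood = {"data": 0.8, "base": 0.6, "bas": 0.2, "e": 0.2}
h_two_words = combination_entropy(["data", "base"], likelihood)
h_three_words = combination_entropy(["data", "bas", "e"], likelihood)
```

In this example the two-word combination has the lower entropy, so under these assumptions it would be preferred over the combination that keeps the rare fragments as separate words.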
-
Specification