Technique for searching out new words that should be registered in dictionary for speech processing

US 8,140,332 B2
Filed: 12/14/2007
Issued: 03/20/2012
Est. Priority Date: 12/15/2006
Status: Active Grant

Abstract

A system searches out new words that should be newly registered in a dictionary contained in a segmentation device for segmenting a text into words. The system inputs a training text into the segmentation device, causing it to segment the training text into words, and thereby generates a plurality of segmentation candidates, each associated with a certainty factor for its segmentation result. The segmentation candidates contain mutually different combinations of words resulting from the segmentation of the training text. The system then computes, for each word, a likelihood that the word is a new word by summing the certainty factors associated with those segmentation candidates that contain the word. Finally, from among combinations of words that are each contained in at least one of the segmentation candidates and with which the entire training text can be written, the system searches for the combination that minimizes the information entropy of words, assuming that each word in the combination appears in the training text at a frequency according to its likelihood, and outputs the found combination as the combination of words including the new word.
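
The two quantities the abstract relies on, the per-word likelihood and the information entropy of a word combination, can be illustrated with a minimal Python sketch. The function names, the `(words, certainty)` input layout, and the normalization of likelihoods into relative frequencies are assumptions made for illustration; the record itself does not specify them.

```python
import math
from collections import defaultdict


def word_likelihoods(candidates):
    """Sum the certainty factors of every candidate that contains a word.

    `candidates` is a list of (words, certainty) pairs (assumed layout).
    """
    likelihood = defaultdict(float)
    for words, certainty in candidates:
        for word in set(words):  # count each candidate once per word
            likelihood[word] += certainty
    return dict(likelihood)


def entropy(words, likelihood):
    """Entropy in bits of a word combination.

    Assumption: each word's frequency is its likelihood divided by the
    total likelihood of the combination.
    """
    total = sum(likelihood[w] for w in words)
    return -sum(
        (likelihood[w] / total) * math.log2(likelihood[w] / total)
        for w in words
    )


# Hypothetical segmentation candidates for the training text "abc".
candidates = [
    (["a", "b", "c"], 0.5),
    (["ab", "c"], 0.3),
    (["a", "bc"], 0.2),
]
likelihood = word_likelihoods(candidates)
```

Here "a" appears in the first and third candidates, so its likelihood is the sum of their certainty factors. Comparing `entropy(["ab", "c"], likelihood)` with `entropy(["a", "bc"], likelihood)` shows how two combinations that both spell the training text can be ranked.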

5 Claims

1. A system for automatically searching out a new word to be newly registered in a dictionary included in a segmentation device for segmenting an inputted text into a plurality of words, the system comprising:
- a computer processing device for executing unit, including at least a segmentation candidate generating unit, a sum calculating unit and a search unit;
  
  the segmentation candidate generating unit for generating a plurality of segmentation candidates by inputting a training text into the segmentation device to cause the segmentation device to segment the training text into words, the segmentation candidates containing mutually different combinations of words resulting from the segmentation of the training text, and being associated with certainty factors of the results of the segmentation;

  the sum calculating unit for computing a likelihood that each word is a new word by summing up the certainty factors associated with the plurality of segmentation candidates containing the word; and

  the searching unit for searching combinations of words contained in at least any one of the segmentation candidates and containing words with which the entire training text can be written, in order to find out a combination that minimizes an information entropy of words assuming that each word belonging to the combinations appears in the training text at a frequency according to the likelihood corresponding to the word, and thereafter for outputting the found-out combination as the combination of words including the new word;

  wherein the searching unit comprises:

  an information amount calculating unit which calculates an information amount of each word assuming that the word appears at a frequency according to the likelihood corresponding to the word;

  a word joining unit which excludes a first word from, and adds a second word, to the combination of words containing the new word, the first word contained in at least any one of the segmentation candidates, and the second word being a character string containing the first word, on condition that a second information amount calculated for the second word is smaller than a first information amount calculated for the first word; and

  a word segmenting unit which excludes a fifth word from and adds third and fourth words to the combination of words containing the new word, the third and fourth words contained in at least any one of the segmentation candidates, and the fifth word being a character string obtained by joining a character string of the third word and a character string of the fourth word, on condition that a sum of a third information amount calculated for the third word and a fourth information amount calculated for the fourth word is smaller than a fifth information amount calculated for the fifth word,

  wherein the searching unit causes the word joining unit and the word segmenting unit to perform the processing repeatedly and alternately until any more word that should be excluded from or added to the combination of words containing the new word cannot be searched out among the words contained in at least any one of the segmentation candidates, and

  the searching unit outputs the combination of words on condition that any word that should be excluded or added cannot be found thereamong,

  wherein the information amount calculating unit stores in the memory an information amount calculated for each word,

  on conditions that the word joining unit excludes the first word and adds the second word, the information amount calculating unit calculates an information amount of the second word assuming that a new likelihood of the second word is a sum of the likelihood of the first word and the current likelihood of the second word, and updates the second information amount stored in the memory,

  on conditions that the word segmenting unit excludes the fifth word and adds the third and fourth words, the information amount calculating unit calculates information amounts of the third word and the fourth word, assuming that a new likelihood of the third word is a sum of the current likelihood of the third word and the likelihood of the fifth word, and that a new likelihood of the fourth word is a sum of the current likelihood of the fourth word and the likelihood of the fifth word, and updates the third and fourth information amounts, and

  the word joining unit and the word segmenting unit judge whether a word should be excluded or added by using the updated information amounts having been stored in the memory.

2. The system according to claim 1, further comprising a segmentation training unit, wherein:
- the segmentation device includes a memory unit in which an index value for each word and another word written in a manner continuous with the former word is stored, the index value indicating a frequency at which the former word and the latter word are written in a manner continuous with each other, and generates the plurality of segmentation candidates according to the frequencies; and

  the segmentation training unit increases an index value corresponding to a word contained in the combination of words containing the new word and searched out by the searching unit when this searched-out word is stored in the memory unit, and newly registers this searched-out word in the memory unit when this searched-out word is not stored in the memory unit.

3. The system according to claim 1, wherein the searching unit searches combinations of words contained in at least any one of the segmentation candidates to find out a combination of words minimizing a sum of a value of the information entropy and an index value of a predetermined index which increases as the number of words belonging to the combination increases.
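
The alternating joining and segmenting search recited in claim 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the information amount is taken to be `-log2(likelihood / norm)` with a fixed normalizer, candidate words are limited to the keys of `likelihood`, the check that the selection still spells the whole training text is omitted, and `max_rounds` is a hypothetical safety bound. All names are invented for the example.

```python
import math


def info(word, likelihood, norm):
    # Assumed definition: information amount in bits of a word whose
    # frequency is likelihood / norm.
    return -math.log2(likelihood[word] / norm)


def join_step(selected, likelihood, cache, norm):
    """Replace a first word by a longer second word that contains it."""
    changed = False
    for first in sorted(selected):
        if first not in selected:
            continue
        for second in sorted(likelihood):
            if second != first and first in second and cache[second] < cache[first]:
                selected.discard(first)
                selected.add(second)
                # New likelihood of the second word: current value plus first's.
                likelihood[second] += likelihood[first]
                cache[second] = info(second, likelihood, norm)
                changed = True
                break
    return changed


def split_step(selected, likelihood, cache, norm):
    """Replace a fifth word by a third and fourth word that concatenate to it."""
    changed = False
    for fifth in sorted(selected):
        for i in range(1, len(fifth)):
            third, fourth = fifth[:i], fifth[i:]
            if third not in likelihood or fourth not in likelihood:
                continue
            if cache[third] + cache[fourth] < cache[fifth]:
                selected.discard(fifth)
                selected.update((third, fourth))
                for part in (third, fourth):
                    likelihood[part] += likelihood[fifth]
                    cache[part] = info(part, likelihood, norm)
                changed = True
                break
    return changed


def search(initial, likelihood, max_rounds=100):
    likelihood = dict(likelihood)       # work on a copy
    norm = sum(likelihood.values())     # fixed normalizer (assumption)
    cache = {w: info(w, likelihood, norm) for w in likelihood}  # stored amounts
    selected = set(initial)
    for _ in range(max_rounds):
        joined = join_step(selected, likelihood, cache, norm)
        split = split_step(selected, likelihood, cache, norm)
        if not (joined or split):       # nothing left to exclude or add
            break
    return selected
```

With hypothetical likelihoods `{"new": 0.2, "york": 0.2, "newyork": 0.9}`, starting from `{"new", "york"}` the joining step moves both short words into `"newyork"`, and later comparisons use the cached, updated information amount rather than recomputing from the original likelihoods.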

4. A method of searching out a new word that should be newly registered in a dictionary included in a segmentation device for segmenting an inputted text into a plurality of words, the method comprising the steps of:
- generating a plurality of segmentation candidates by inputting a training text into the segmentation device to cause the segmentation device to segment the training text into words, the segmentation candidates containing mutually different combinations of words resulting from the segmentation of the training text, and being associated with certainty factors of the results of the segmentation;
  
  computing a likelihood that each word is a new word by summing up some of the certainty factors associated with some of the plurality of segmentation candidates containing the word;
  
  searching combinations of words each contained in at least any one of the segmentation candidates and containing words with which the entire training text can be written, in order to find out a combination that minimizes an information entropy of words, assuming that each word belonging to the combinations appears in the training text at a frequency according to the likelihood corresponding to the word; and
  
  outputting the found-out combination as the combination of words including the new word,

  wherein the searching comprises:

  calculating an information amount of each word assuming that the word appears at a frequency according to the likelihood corresponding to the word;

  excluding a first word from, and adding a second word, to the combination of words containing the new word, the first word contained in at least any one of the segmentation candidates, and the second word being a character string containing the first word, on condition that a second information amount calculated for the second word is smaller than a first information amount calculated for the first word; and

  excluding a fifth word from and adding third and fourth words to the combination of words containing the new word, the third and fourth words contained in at least any one of the segmentation candidates, and the fifth word being a character string obtained by joining a character string of the third word and a character string of the fourth word, on condition that a sum of a third information amount calculated for the third word and a fourth information amount calculated for the fourth word is smaller than a fifth information amount calculated for the fifth word,

  wherein the searching causes the excluding and adding repeatedly and alternately until any more word that should be excluded from or added to the combination of words containing the new word cannot be searched out among the words contained in at least any one of the segmentation candidates, and

  outputting the combination of words on condition that any word that should be excluded or added cannot be found thereamong,

  further comprising storing in a memory an information amount calculated for each word,

  on conditions that the first word is excluded and the second word is added, calculating an information amount of the second word assuming that a new likelihood of the second word is a sum of the likelihood of the first word and the current likelihood of the second word, and updates the second information amount stored in the memory,

  on conditions that the fifth word is excluded and the third and fourth words are added, calculating information amounts of the third word and the fourth word, assuming that a new likelihood of the third word is a sum of the current likelihood of the third word and the likelihood of the fifth word, and that a new likelihood of the fourth word is a sum of the current likelihood of the fourth word and the likelihood of the fifth word, and updates the third and fourth information amounts, and

  judging whether a word should be excluded or added by using the updated information amounts having been stored in the memory.

5. A non-transitory computer storage medium storing a program for enabling an information processing system to function as a system for searching out a new word that should be newly registered in a dictionary included in a segmentation device for segmenting an inputted text into a plurality of words, the program causing the information system to function as:
- a segmentation candidate generating unit for generating a plurality of segmentation candidates by inputting a training text into the segmentation device to cause the segmentation device to segment the training text into words, the segmentation candidates containing mutually different combinations of words resulting from the segmentation of the training text, and being associated with certainty factors of the results of the segmentation;
  
  a sum calculating unit for computing a likelihood that each word is a new word by summing up some of the certainty factors associated with some of the plurality of segmentation candidates containing the word; and
  
  a searching unit for searching combinations of words contained in at least any one of the segmentation candidates and containing words with which the entire training text can be written, in order to find out a combination that minimizes an information entropy of words assuming that each word belonging to the combinations appears in the training text at a frequency according to the likelihood corresponding to the word, and thereafter for outputting the found-out combination as the combination of words including the new word,

  wherein the searching unit comprises:

  an information amount calculating unit which calculates an information amount of each word assuming that the word appears at a frequency according to the likelihood corresponding to the word;

  a word joining unit which excludes a first word from, and adds a second word, to the combination of words containing the new word, the first word contained in at least any one of the segmentation candidates, and the second word being a character string containing the first word, on condition that a second information amount calculated for the second word is smaller than a first information amount calculated for the first word; and

  a word segmenting unit which excludes a fifth word from and adds third and fourth words to the combination of words containing the new word, the third and fourth words contained in at least any one of the segmentation candidates, and the fifth word being a character string obtained by joining a character string of the third word and a character string of the fourth word, on condition that a sum of a third information amount calculated for the third word and a fourth information amount calculated for the fourth word is smaller than a fifth information amount calculated for the fifth word,

  wherein the searching unit causes the word joining unit and the word segmenting unit to perform the processing repeatedly and alternately until any more word that should be excluded from or added to the combination of words containing the new word cannot be searched out among the words contained in at least any one of the segmentation candidates, and

  the searching unit outputs the combination of words on condition that any word that should be excluded or added cannot be found thereamong,

  wherein the information amount calculating unit stores in the memory an information amount calculated for each word,

  on conditions that the word joining unit excludes the first word and adds the second word, the information amount calculating unit calculates an information amount of the second word assuming that a new likelihood of the second word is a sum of the likelihood of the first word and the current likelihood of the second word, and updates the second information amount stored in the memory,

  on conditions that the word segmenting unit excludes the fifth word and adds the third and fourth words, the information amount calculating unit calculates information amounts of the third word and the fourth word, assuming that a new likelihood of the third word is a sum of the current likelihood of the third word and the likelihood of the fifth word, and that a new likelihood of the fourth word is a sum of the current likelihood of the fourth word and the likelihood of the fifth word, and updates the third and fourth information amounts, and

  the word joining unit and the word segmenting unit judge whether a word should be excluded or added by using the updated information amounts having been stored in the memory.
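
The combined objective of claim 3, the information entropy plus an index that grows with the number of words, can be sketched as below. The linear penalty `weight * len(words)`, the default weight, and the function name are assumptions; the claim only requires that the index increase as the number of words in the combination increases.

```python
import math


def penalized_objective(words, likelihood, weight=0.1):
    """Entropy of the combination plus a size penalty (assumed linear)."""
    total = sum(likelihood[w] for w in words)
    h = -sum(
        (likelihood[w] / total) * math.log2(likelihood[w] / total)
        for w in words
    )
    return h + weight * len(words)
```

Under this sketch, of two combinations with similar entropy the one with fewer distinct words scores lower, which discourages the search from splitting the text into many short words.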

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Itoh, Nobuyasu; Mori, Shinsuke
Primary Examiner(s): Wozniak, James S.
Assistant Examiner(s): Serrou, Abdelali
Application Number: US11/956,574
Publication Number: US 20080162118A1
Time in Patent Office: 1,558 Days
Field of Search: 704/1-10, 704/251
US Class Current: 704/251
CPC Class Codes:
G06F 40/242 Dictionaries
G06F 40/284 Lexical analysis, e.g. toke...
