System and method for word-sense disambiguation by recursive partitioning

US 20060277045A1
Filed: 06/06/2005
Published: 12/07/2006
Est. Priority Date: 06/06/2005
Status: Active Grant

Abstract

A device and related methods for word-sense disambiguation during a text-to-speech conversion are provided. The device, for use with a computer-based system capable of converting text data to synthesized speech, includes an identification module for identifying a homograph contained in the text data. The device also includes an assignment module for assigning a pronunciation to the homograph using a statistical test constructed from a recursive partitioning of training samples, each training sample being a word string containing the homograph. The recursive partitioning is based on determining for each training sample an order and a distance of each word indicator relative to the homograph in the training sample. An absence of one of the word indicators in a training sample is treated as equivalent to the absent word indicator being more than a predefined distance from the homograph.
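
The feature described in the abstract can be illustrated with a minimal sketch. It assumes whitespace-tokenized word strings and records, for each word indicator, the distance to its nearest occurrence before and after the homograph. The names `MAX_DISTANCE`, `ABSENT`, and `indicator_features`, and the value chosen for the predefined distance, are hypothetical and do not come from the patent.

```python
from typing import Dict, List

MAX_DISTANCE = 5           # hypothetical "predefined distance"; the source gives no value
ABSENT = MAX_DISTANCE + 1  # code shared by "absent" and "farther than MAX_DISTANCE"


def indicator_features(
    words: List[str], homograph: str, indicators: List[str]
) -> Dict[str, Dict[str, int]]:
    """For each indicator, record its order (before/after) and distance to the homograph."""
    center = words.index(homograph)  # position of the first occurrence of the homograph
    features: Dict[str, Dict[str, int]] = {}
    for indicator in indicators:
        before = after = ABSENT
        for position, word in enumerate(words):
            if word != indicator or position == center:
                continue
            if position < center:
                # min() keeps the nearest occurrence and clamps far ones to ABSENT
                before = min(before, center - position)
            else:
                after = min(after, position - center)
        features[indicator] = {"before": before, "after": after}
    return features


sample = "he caught a large bass in the lake".split()
features = indicator_features(sample, "bass", ["lake", "guitar", "caught"])
```

In this sketch an indicator that never occurs and an indicator that occurs more than `MAX_DISTANCE` words away receive the same code, which mirrors the equivalence stated in the abstract.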

18 Claims

1. A device for use with a computer-based system capable of converting text data to synthesized speech, the device comprising:

an identification module for identifying a homograph contained in the text data; and

an assignment module for assigning a pronunciation to the homograph using a statistical test constructed from a recursive partitioning of a plurality of training samples, each training sample comprising a word string containing the homograph;

the recursive partitioning being based on determining for each of a plurality of word indicators an order and a distance of each word indicator relative to the homograph in each training sample, wherein an absence of one of the plurality of word indicators in a training sample is treated as an equivalent to the absent word indicator being more than a predefined distance from the homograph.

2. The device of claim 1, wherein the statistical test comprises determining for each of a plurality of partitioned sets generated by the recursive partitioning a respective likelihood that the homograph belongs to a particular one of the plurality of partitioned sets.

3. The device of claim 2, wherein the statistical test is constructed by iteratively constructing different partitions of the plurality of training samples each partition being based upon a different partitioning test and evaluating the different partitioning tests to determine which partitioning test effects a best separation of different pronunciations of the homograph.

4. The device of claim 3, wherein the evaluating of the different partitioning tests is based upon an entropy measure.

5. The device of claim 4, wherein the entropy measure comprises a Shannon entropy.

6. The device of claim 4, wherein the entropy measure comprises a Gini entropy.
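
Claims 4 to 6 name two entropy measures for comparing partitioning tests. The sketch below shows one common way such measures are computed over the pronunciation labels of a partitioned set, and how a two-way partition might be scored by the size-weighted average of its parts. The function names and the labels `"fish"` and `"music"` are hypothetical, and the patent may define or weight these measures differently.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def shannon_entropy(labels: Sequence[str]) -> float:
    """Shannon entropy, in bits, of the pronunciation labels in one partitioned set."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini_index(labels: Sequence[str]) -> float:
    """Gini measure: one minus the sum of squared label proportions."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_score(
    left: Sequence[str],
    right: Sequence[str],
    measure: Callable[[Sequence[str]], float],
) -> float:
    """Size-weighted average of the measure over both parts; lower means a cleaner separation."""
    n = len(left) + len(right)
    return (len(left) / n) * measure(left) + (len(right) / n) * measure(right)


mixed = ["fish", "fish", "music", "music"]
clean = split_score(["fish", "fish"], ["music", "music"], shannon_entropy)
poor = split_score(["fish", "music"], ["fish", "music"], shannon_entropy)
```

Under either measure, a partitioning test that sends each pronunciation to its own part scores 0, while one that leaves both parts evenly mixed scores the same as the unpartitioned set.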

7. A method of electronically disambiguating homographs during a computer-based text-to-speech event, the method comprising:

identifying a homograph contained in a text; and

determining a pronunciation for the homograph using a statistical test constructed from a recursive partitioning of a plurality of training samples, each training sample comprising a word string containing the homograph;

the recursive partitioning being based on determining for each of a plurality of word indicators an order and a distance of each word indicator relative to the homograph in each training sample, wherein an absence of one of the plurality of word indicators in a training sample is treated as an equivalent to the absent word indicator being more than a predefined distance from the homograph.

8. The method of claim 7, wherein the statistical test comprises determining for each of a plurality of partitioned sets generated by the recursive partitioning a respective likelihood that the homograph belongs to a particular one of the plurality of partitioned sets.

9. The method of claim 8, wherein the statistical test is constructed by iteratively constructing different partitions of the plurality of training samples each partition being based upon a different partitioning test and evaluating the different partitioning tests to determine which partitioning test effects a best separation of different pronunciations of the homograph.

10. The method of claim 9, wherein evaluating the different partitioning tests comprises determining an entropy measure.

11. The method of claim 10, wherein the entropy measure comprises a Shannon entropy.

12. The method of claim 10, wherein the entropy measure comprises a Gini entropy.
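
One way to picture the method of claims 7 and 8 is as a walk down an already constructed tree whose leaves hold pronunciation likelihoods. The sketch below is an illustration under assumptions, not the patented implementation: the `Leaf` and `Node` types, the `(indicator, side)` feature keys, the `FAR` constant, and the example likelihood values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union

# (indicator word, "before" or "after") -> distance in words from the homograph
Features = Dict[Tuple[str, str], int]

FAR = 10**6  # hypothetical stand-in for "more than the predefined distance"


@dataclass
class Leaf:
    likelihoods: Dict[str, float]  # pronunciation label -> likelihood in this partitioned set


@dataclass
class Node:
    key: Tuple[str, str]           # which indicator, and on which side of the homograph
    max_distance: int              # the test passes if the indicator is within this distance
    near: Union["Node", Leaf]      # branch taken when the test passes
    far: Union["Node", Leaf]       # branch taken otherwise, including when the indicator is absent


def pronounce(tree: Union[Node, Leaf], features: Features) -> Tuple[str, float]:
    """Follow the partitioning tests to a leaf and return its most likely pronunciation."""
    node = tree
    while isinstance(node, Node):
        distance = features.get(node.key, FAR)  # a missing indicator counts as far away
        node = node.near if distance <= node.max_distance else node.far
    best = max(node.likelihoods, key=node.likelihoods.get)
    return best, node.likelihoods[best]


tree = Node(
    ("lake", "after"), 4,
    Leaf({"fish": 0.9, "music": 0.1}),
    Node(
        ("guitar", "before"), 3,
        Leaf({"music": 0.95, "fish": 0.05}),
        Leaf({"fish": 0.5, "music": 0.5}),
    ),
)
result = pronounce(tree, {("lake", "after"): 3})
```

Treating a missing key as `FAR` means a single comparison handles both the absent indicator and the indicator that lies beyond the test distance.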

13. A computer-implemented method of constructing a statistical test for determining a pronunciation of a homograph encountered during an electronic text-to-speech conversion event, the method comprising:

selecting a set of training samples, each training sample comprising a word string containing the homograph; and

recursively partitioning the set of training samples, the recursive partitioning producing a decision tree for determining the pronunciation and being based on determining for each of a plurality of word indicators an order and a distance of each word indicator relative to the homograph in each training sample, wherein an absence of one of the plurality of word indicators in a training sample is treated as an equivalent to the absent word indicator being more than a predefined distance from the homograph.

14. The method of claim 13, wherein the statistical test comprises determining for each of a plurality of partitioned sets generated by the recursive partitioning a respective likelihood that the homograph belongs to a particular one of the plurality of partitioned sets.

15. The method of claim 14, wherein the statistical test is constructed by iteratively constructing different partitions of the plurality of training samples each partition being based upon a different partitioning test and evaluating the different partitioning tests to determine which partitioning test effects a best separation of different pronunciations of the homograph.

16. The method of claim 15, wherein evaluating the different partitioning tests comprises determining an entropy measure.

17. The method of claim 15, wherein the entropy measure comprises a Shannon entropy.

18. The method of claim 15, wherein the entropy measure comprises a Gini entropy.
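
The construction recited in claims 13 to 16 can be sketched as a greedy recursive partitioning: try each candidate test of the form "indicator within k words", keep the one with the lowest weighted Shannon entropy, and recurse on both parts. This is a simplified illustration under assumptions; the feature keys such as `"lake:after"`, the `FAR` code, the stopping rule, and the dictionary-based tree layout are hypothetical choices rather than details taken from the patent.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

# (indicator-with-order key -> distance from the homograph, pronunciation label)
Sample = Tuple[Dict[str, int], str]

FAR = 99  # hypothetical code for "absent or beyond the predefined distance"


def entropy(samples: List[Sample]) -> float:
    """Shannon entropy, in bits, of the pronunciation labels in a set of samples."""
    n = len(samples)
    counts = Counter(label for _, label in samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0


def build_tree(samples: List[Sample], max_distance: int = 3) -> dict:
    """Recursively partition a non-empty sample list; leaves hold pronunciation likelihoods."""
    counts = Counter(label for _, label in samples)
    leaf = {"likelihoods": {label: c / len(samples) for label, c in counts.items()}}
    if len(counts) == 1:
        return leaf  # already a clean separation
    best = None
    for key in sorted({k for feats, _ in samples for k in feats}):
        for limit in range(1, max_distance + 1):
            # An absent indicator defaults to FAR, so it always lands in the "far" part
            near = [s for s in samples if s[0].get(key, FAR) <= limit]
            far = [s for s in samples if s[0].get(key, FAR) > limit]
            if not near or not far:
                continue
            score = (len(near) * entropy(near) + len(far) * entropy(far)) / len(samples)
            if best is None or score < best[0]:
                best = (score, key, limit, near, far)
    if best is None or best[0] >= entropy(samples):
        return leaf  # no candidate test improves the separation
    _, key, limit, near, far = best
    return {
        "test": (key, limit),
        "near": build_tree(near, max_distance),
        "far": build_tree(far, max_distance),
    }


training = [
    ({"lake:after": 3}, "fish"),
    ({"lake:after": 2, "caught:before": 3}, "fish"),
    ({"guitar:before": 1}, "music"),
    ({"played:before": 2}, "music"),
]
tree = build_tree(training)
```

On this toy training set the sketch selects the test "lake within 3 words after the homograph", because it is the only candidate that separates the two pronunciations completely.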

Current Assignee: Cerence Inc., Cerence Operating Company (Cerence Inc.)
Original Assignee: International Business Machines Corporation
Inventors: Gleason, Philip
Granted Patent: US 8,099,281 B2
US Class Current: 704/260
CPC Class Codes: G10L 13/08 Text analysis or generation...
