Methods and apparatus for acoustic disambiguation by insertion of disambiguating textual information

US 8,954,329 B2
Filed: 05/23/2012
Issued: 02/10/2015
Est. Priority Date: 05/23/2011
Status: Expired due to Fees

Abstract

Techniques for disambiguating at least one text segment from at least one acoustically similar word and/or phrase. The techniques include identifying at least one text segment, in a textual representation having a plurality of text segments, having at least one acoustically similar word and/or phrase which has a different spelling, annotating the textual representation with disambiguating information to help disambiguate the at least one text segment from the at least one acoustically similar word and/or phrase, and synthesizing a speech signal, at least in part, by performing text-to-speech synthesis on at least a portion of the textual representation that includes the at least one text segment, wherein the speech signal includes speech corresponding to the disambiguating information located proximate the portion of the speech signal corresponding to the at least one text segment.
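
The annotation step described in the abstract can be illustrated with a minimal sketch. It assumes a small hand-written lookup table (`AMBIGUOUS`) and a helper named `annotate`, both hypothetical and not taken from the patent. The annotated string is what would then be handed to a text-to-speech engine, which is not shown here.

```python
import re

# Hypothetical lookup table: each acoustically ambiguous word maps to a short
# disambiguating hint. The entries are illustrative, not taken from the patent.
AMBIGUOUS = {
    "two": "the number",
    "too": "as in also",
    "right": "as in correct",
    "write": "as in writing a letter",
}


def annotate(text: str) -> str:
    """Insert a parenthetical hint right after each ambiguous word."""

    def add_hint(match: re.Match) -> str:
        word = match.group(0)
        hint = AMBIGUOUS.get(word.lower())
        # Leave words without a table entry untouched.
        return f"{word} ({hint})" if hint else word

    return re.sub(r"[A-Za-z']+", add_hint, text)


annotated = annotate("Please write two emails")
print(annotated)
# Please write (as in writing a letter) two (the number) emails
```

Placing the hint directly after the word keeps the disambiguating speech proximate to the segment it explains, which is the property the abstract emphasizes.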

18 Citations

View as Search Results

24 Claims

1. A method comprising:
- identifying at least one text segment, in a textual representation having a plurality of text segments, having at least one acoustically similar word and/or phrase, wherein the at least one text segment and the at least one acoustically similar word and/or phrase have different spellings;
  
  automatically annotating, using at least one processor, the textual representation with disambiguating information to help disambiguate the at least one text segment from the at least one acoustically similar word and/or phrase;
  
  and synthesizing a speech signal, at least in part by performing text-to-speech synthesis on at least a portion of the textual representation that includes the at least one text segment, wherein the speech signal includes speech corresponding to the disambiguating information located proximate the portion of the speech signal corresponding to the at least one text segment;
  
  wherein the disambiguating information includes text that helps disambiguate the at least one text segment from the at least one acoustically similar word and/or phrase, and wherein:
  
  annotating the textual representation includes inserting the disambiguating information into the textual representation proximate the at least one text segment to form an annotated textual representation;
  
  and synthesizing the speech signal includes synthesizing a speech signal, at least in part, by performing text-to-speech synthesis on at least a portion of the annotated textual representation that includes the at least one text segment and the disambiguating information.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1, wherein the disambiguating information includes at least one prerecorded utterance that helps disambiguate the at least one text segment from the at least one acoustically similar word and/or phrase, and wherein:
    - annotating the textual representation includes associating the at least one prerecorded utterance with the at least one text segment; and
      
      synthesizing the speech signal includes inserting the at least one prerecorded utterance into the speech signal proximate the portion of the speech signal corresponding to the at least one text segment.
  - 3. The method of claim 1, wherein the disambiguating information includes an indication of a meaning of the at least one text segment.
  - 4. The method of claim 1, wherein the disambiguating information includes a spelling of the at least one text segment.
  - 5. The method of claim 1, wherein the disambiguating information is represented in the speech signal using a different voice font than at least the at least one text segment.
  - 6. The method of claim 1, further comprising audibly rendering the speech signal to the user.
  - 7. The method of claim 1, wherein identifying at least one text segment having at least one acoustically similar word or phrase includes checking whether any text segment in the textual representation is included in a list comprising acoustically ambiguous words and/or phrases.
  - 8. The method of claim 1, wherein the textual representation corresponds to text converted from speech input from the user by performing automatic speech recognition on the speech input, and wherein automatically identifying at least one text segment having at least one acoustically similar word and/or phrase comprises identifying the at least one text segment based, at least in part, on an N-best list generated during automatic speech recognition.
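
Claim 8 identifies ambiguous segments from the recognizer's N-best list. A rough sketch of one way this could work is given here, under strong simplifying assumptions: each hypothesis is already tokenized, only hypotheses with the same number of words as the top one are compared, and any position where the spelling differs is treated as acoustically ambiguous. The function name `ambiguous_positions` and the sample data are hypothetical.

```python
def ambiguous_positions(n_best: list[list[str]]) -> dict[int, set[str]]:
    """Map positions in the top hypothesis to competing spellings.

    n_best[0] is assumed to be the top-ranked hypothesis.
    """
    top = n_best[0]
    result: dict[int, set[str]] = {}
    for hyp in n_best[1:]:
        if len(hyp) != len(top):
            continue  # skip hypotheses that cannot be aligned word-for-word
        for i, (best_word, other_word) in enumerate(zip(top, hyp)):
            if best_word.lower() != other_word.lower():
                result.setdefault(i, set()).add(other_word)
    return result


# Hypothetical N-best list from a speech recognizer.
n_best = [
    ["send", "it", "to", "Marc"],
    ["send", "it", "two", "Marc"],
    ["send", "it", "to", "Mark"],
]
print(ambiguous_positions(n_best))
```

A production system would more likely align hypotheses of different lengths and weigh recognizer confidence scores; this sketch omits both.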

9. At least one non-transitory computer readable medium storing instructions that, when executed on at least one processor, perform a method comprising:
- identifying at least one text segment, in a textual representation having a plurality of text segments, having at least one acoustically similar word and/or phrase, wherein the at least one text segment and the at least one acoustically similar word and/or phrase have different spellings;
  
  and automatically annotating the textual representation with disambiguating information to help disambiguate the at least one text segment from the at least one acoustically similar word and/or phrase;
  
  synthesizing a speech signal, at least in part, by performing text-to-speech synthesis on at least a portion of the textual representation that includes the at least one text segment, wherein the speech signal includes speech corresponding to the disambiguating information located proximate the portion of the speech signal corresponding to the at least one text segment;
  
  wherein the disambiguating information includes text that helps disambiguate the at least one text segment from the at least one acoustically similar word and/or phrase, and wherein:
  
  annotating the textual representation includes inserting the disambiguating information into the textual representation proximate the at least one text segment to form an annotated textual representation;
  
  and synthesizing the speech signal includes synthesizing a speech signal, at least in part, by performing text-to-speech synthesis on at least a portion of the annotated textual representation that includes the at least one text segment and the disambiguating information.
- Dependent claims (10, 11, 12, 13, 14, 15, 16):
  - 10. The at least one non-transitory computer readable medium of claim 9, wherein the disambiguating information includes at least one prerecorded utterance that helps disambiguate the at least one text segment from the at least one acoustically similar word and/or phrase, and wherein:
    - annotating the textual representation includes associating the at least one prerecorded utterance with the at least one text segment; and
      
      synthesizing the speech signal includes inserting the at least one prerecorded utterance into the speech signal proximate the portion of the speech signal corresponding to the at least one text segment.
  - 11. The at least one non-transitory computer readable medium of claim 9, wherein the disambiguating information includes an indication of a meaning of the at least one text segment.
  - 12. The at least one non-transitory computer readable medium of claim 9, wherein the disambiguating information includes a spelling of the at least one text segment.
  - 13. The at least one non-transitory computer readable medium of claim 9, wherein the disambiguating information is represented in the speech signal using a different voice font than at least the at least one text segment.
  - 14. The at least one non-transitory computer readable medium of claim 9, further comprising audibly rendering the speech signal to the user.
  - 15. The at least one non-transitory computer readable medium of claim 9, wherein identifying at least one text segment having at least one acoustically similar word or phrase includes checking whether any text segment in the textual representation is included in a list comprising acoustically ambiguous words and/or phrases.
  - 16. The at least one non-transitory computer readable medium of claim 9, wherein the textual representation corresponds to text converted from speech input from the user by performing automatic speech recognition on the speech input, and wherein automatically identifying at least one text segment having at least one acoustically similar word and/or phrase comprises identifying the at least one text segment based, at least in part, on an N-best list generated during automatic speech recognition.

17. A system comprising:
- at least one input interface for receiving data from the user;
  
  a conversion component configured to convert the data into a textual representation;
  
  and a presentation component configured to provide an audio presentation of at least a portion of the textual representation by performing:
  
  identifying at least one text segment, in a textual representation having a plurality of text segments, having at least one acoustically similar word and/or phrase, wherein the at least one text segment and the at least one acoustically similar word and/or phrase have different spellings;
  
  automatically annotating the textual representation with disambiguating information to help disambiguate the at least one text segment from the at least one acoustically similar word and/or phrase;
  
  synthesizing a speech signal, at least in part, by performing text-to-speech synthesis on at least a portion of the textual representation that includes the at least one text segment, wherein the speech signal includes speech corresponding to the disambiguating information located proximate the portion of the speech signal corresponding to the at least one text segment;
  
  wherein the disambiguating information includes text that helps disambiguate the at least one text segment from the at least one acoustically similar word and/or phrase, and wherein the presentation component is configured to insert the disambiguating information into the textual representation proximate the at least one text segment to form an annotated textual representation, and synthesize the speech signal, at least in part, by performing text-to-speech synthesis on at least a portion of the annotated textual representation that includes the at least one text segment and the disambiguating information.
- Dependent claims (18, 19, 20, 21, 22, 23, 24):
  - 18. The system of claim 17, wherein the disambiguating information includes at least one prerecorded utterance that helps disambiguate the at least one text segment from the at least one acoustically similar word and/or phrase, and wherein the presentation component is configured to associate the at least one prerecorded utterance with the at least one text segment, and insert the at least one prerecorded utterance into the speech signal proximate the portion of the speech signal corresponding to the at least one text segment.
  - 19. The system of claim 17, wherein the disambiguating information includes an indication of a meaning of the at least one text segment.
  - 20. The system of claim 17, wherein the disambiguating information includes a spelling of the at least one text segment.
  - 21. The system of claim 17, wherein the disambiguating information is represented in the speech signal using a different voice font than at least the at least one text segment.
  - 22. The system of claim 17, further comprising at least one speaker for audibly rendering the speech signal to the user.
  - 23. The system of claim 17, wherein the presentation component is configured to identify at least one text segment having at least one acoustically similar word or phrase, at least in part, by checking whether any text segment in the textual representation is included in a list comprising acoustically ambiguous words and/or phrases.
  - 24. The system of claim 17, wherein the input from the user includes speech, wherein the conversion component includes at least one automatic speech recognition engine to convert the data to the textual representation, and wherein the presentation component is configured to identify at least one text segment having at least one acoustically similar word or phrase based, at least in part, on an N-best list generated by the at least one automatic speech recognition engine.
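
Claims 20 and 21 combine naturally: the hint can be a spelling of the segment, rendered in a different voice font. The sketch below shows how a presentation component might plan this before synthesis. The `Segment` type, the voice labels `"main"` and `"hint"`, and the function `plan_speech` are all hypothetical, and the actual synthesis call is left out because it depends on the text-to-speech engine in use.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    voice: str  # hypothetical voice-font label understood by the TTS engine


def spell_out(word: str) -> str:
    """Render a word as spaced capital letters, e.g. 'Marc' -> 'M A R C'."""
    return " ".join(word.upper())


def plan_speech(words, ambiguous, main_voice="main", hint_voice="hint"):
    """Return the sequence of segments to synthesize, hints included."""
    segments = []
    for word in words:
        segments.append(Segment(word, main_voice))
        if word.lower() in ambiguous:
            # The spelled-out hint follows its word and uses a second voice.
            segments.append(Segment(spell_out(word), hint_voice))
    return segments


for seg in plan_speech(["call", "Marc"], {"marc"}):
    print(seg.voice, seg.text)
```

Keeping the voice label on each segment lets the listener tell inserted hints apart from the original message content.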

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Labsky, Martin; Kleindienst, Jan; Macek, Tomas; Nahamoo, David; Curin, Jan; Ganong, William F. III
Primary Examiner(s): Kazeminezhad, Farzad
Application Number: US13/478,978
Publication Number: US 20120303371A1
Time in Patent Office: 993 Days
Field of Search: 704/260
US Class Current: 704/260
CPC Class Codes

G06F 40/10   Text processing natural lan...
G06F 40/30   Semantic analysis
G10L 13/00   Speech synthesis; Text to s...
G10L 13/08   Text analysis or generation...
G10L 15/01   Assessment or evaluation of...
G10L 15/02   Feature extraction for spee...
G10L 15/06   Creation of reference templ...
G10L 15/14   using statistical models, e...
G10L 15/1822   Parsing for meaning underst...
G10L 15/26   Speech to text systems G10L...
G10L 15/28   Constructional details of s...
G10L 15/30   Distributed recognition, e....
G10L 15/32   Multiple recognisers used i...
G10L 17/00   Speaker identification or v...
G10L 21/06   Transformation of speech in...
