Indexing digitized speech with words represented in the digitized speech

US 8,706,490 B2
Filed: 08/07/2013
Issued: 04/22/2014
Est. Priority Date: 03/20/2007
Status: Expired due to Fees

Abstract

Indexing digitized speech with words represented in the digitized speech, with a multimodal digital audio editor operating on a multimodal device supporting modes of user interaction, the modes of user interaction including a voice mode and one or more non-voice modes, the multimodal digital audio editor operatively coupled to an ASR engine, including providing by the multimodal digital audio editor to the ASR engine digitized speech for recognition; receiving in the multimodal digital audio editor from the ASR engine recognized user speech including a recognized word, also including information indicating where, in the digitized speech, representation of the recognized word begins; and inserting by the multimodal digital audio editor the recognized word, in association with the information indicating where, in the digitized speech, representation of the recognized word begins, into a speech recognition grammar, the speech recognition grammar voice enabling user interface commands of the multimodal digital audio editor.
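
The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `RecognizedWord` and `IndexGrammar` names, the dictionary-backed "grammar", the command words, and the sample offsets are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class RecognizedWord:
    """One ASR result (hypothetical shape): a word and where it begins in the audio."""
    text: str
    start_sample: int


@dataclass
class IndexGrammar:
    """Toy stand-in for a speech recognition grammar that also serves as an index."""
    # Assumed command vocabulary; the source does not list specific commands.
    commands: Set[str] = field(default_factory=lambda: {"play", "show", "zoom"})
    words: Dict[str, List[int]] = field(default_factory=dict)

    def insert(self, word: RecognizedWord) -> None:
        # Store the word in association with its position in the digitized speech.
        self.words.setdefault(word.text.lower(), []).append(word.start_sample)

    def match(self, utterance: str) -> Optional[Tuple[str, int]]:
        # Accept exactly "<command> <word>"; return the command and the
        # position of the word's first recorded occurrence.
        parts = utterance.lower().split()
        if len(parts) != 2:
            return None
        command, word = parts
        if command in self.commands and word in self.words:
            return command, self.words[word][0]
        return None


# Made-up recognition results standing in for what an ASR engine would return.
grammar = IndexGrammar()
for result in [RecognizedWord("airplane", 16000), RecognizedWord("bomb", 48000)]:
    grammar.insert(result)
```

With this sketch, a spoken command such as "play airplane" would resolve to the pair `("play", 16000)`, which an editor could use to start playback at that sample offset.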

154 Citations

18 Claims

1. A method for use with a multimodal digital audio editor operating on a multimodal device supporting multiple modes of user interaction with the multimodal digital audio editor, the modes of user interaction including a voice mode and one or more non-voice modes, the multimodal digital audio editor operatively coupled to an automatic speech recognition (ASR) engine, the method comprising:
- receiving in the multimodal digital audio editor, recognized speech that the ASR engine generated from digitized speech, that includes a recognized word and information indicating where, in the digitized speech, representation of the recognized word appears;
  
  inserting, by the multimodal digital audio editor, the recognized word into a speech recognition grammar; and
  
  inserting, by the multimodal digital audio editor, into the speech recognition grammar in association with the recognized word, the information indicating where, in the digitized speech, representation of the recognized word appears.
  - 2. The method of claim 1, further comprising visually displaying the digitized speech with the recognized word as an index of where in the digitized speech the representation of the recognized word appears.
  - 3. The method of claim 1, further comprising recognizing, by the ASR engine, the recognized word in the digitized speech and identifying information indicating where, in the digitized speech, representation of the recognized word appears.
  - 4. The method of claim 3, wherein identifying the information indicating where, in the digitized speech, representation of the recognized word appears further comprises:
    - organizing the digitized speech in sequential sets of time domain amplitude samples grouped in frames, each frame characterized in sequence by a unique and cardinal frame identification number, each frame containing the same number of time domain amplitude samples;
      
      converting the digitized speech containing the recognized word to the frequency domain beginning with one of the frames of time domain amplitude samples; and
      
      deriving an index value indicating where, in the digitized speech, representation of the recognized word appears by multiplying the one of the frame identification numbers by the number of amplitude samples in each frame.
  - 5. The method of claim 1, wherein inserting the recognized word into a speech recognition grammar further comprises associating the recognized word, as a non-optional terminal element in the speech recognition grammar, with a word representing a user interface command of the digital audio editor.
  - 6. The method of claim 1, wherein the inserting the information indicating where, in the digitized speech, representation of the recognized word appears comprises inserting the information as part of a non-optional terminal grammar element.
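
The index derivation in claim 4 amounts to multiplying a frame identification number by the fixed number of samples per frame. The sketch below illustrates that arithmetic under stated assumptions: frames are numbered from zero (so the result is the offset of the frame's first sample), and the 8 kHz sample rate and 160-sample frame size are example values, not figures taken from the claims.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[int], samples_per_frame: int) -> List[Sequence[int]]:
    """Group time domain samples into equal-sized frames, numbered 0, 1, 2, ... in order.

    Trailing samples that do not fill a whole frame are dropped in this sketch.
    """
    usable = len(samples) - len(samples) % samples_per_frame
    return [samples[i:i + samples_per_frame] for i in range(0, usable, samples_per_frame)]


def index_from_frame(frame_id: int, samples_per_frame: int) -> int:
    """Index value = frame identification number * samples per frame."""
    return frame_id * samples_per_frame


# Assumed example values: 8 kHz audio in 20 ms frames gives 160 samples per frame.
SAMPLE_RATE_HZ = 8000
SAMPLES_PER_FRAME = 160

# A word whose recognition began at frame 25 starts at sample 4000,
# which is 0.5 seconds into the recording at this sample rate.
word_index = index_from_frame(25, SAMPLES_PER_FRAME)
word_offset_seconds = word_index / SAMPLE_RATE_HZ
```

Because every frame holds the same number of samples, the frame number alone is enough to recover the sample position; no per-frame offset table is needed.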

7. Apparatus for use with a multimodal digital audio editor operating on a multimodal device supporting multiple modes of user interaction with the multimodal digital audio editor, the modes of user interaction including a voice mode and one or more non-voice modes, the multimodal digital audio editor being operatively coupled to an automatic speech recognition (ASR) engine, the apparatus comprising:
- at least one computer processor; and
  
  a computer memory operatively coupled to the at least one computer processor, the at least one computer processor being programmed to:
  
  receive, in the multimodal digital audio editor, recognized speech that the ASR engine generated from digitized speech, that includes a recognized word and information indicating where, in the digitized speech, representation of the recognized word appears;
  
  insert, by the multimodal digital audio editor, the recognized word into a speech recognition grammar; and
  
  insert, by the multimodal digital audio editor, into the speech recognition grammar in association with the recognized word, the information indicating where, in the digitized speech, representation of the recognized word appears.
  - 8. The apparatus of claim 7, wherein the at least one computer processor is programmed to visually display the digitized speech with the recognized word as an index of where in the digitized speech the representation of the recognized word appears.
  - 9. The apparatus of claim 7, wherein the at least one computer processor is programmed to recognize, by the ASR engine, the recognized word in the digitized speech and identify information indicating where, in the digitized speech, representation of the recognized word appears.
  - 10. The apparatus of claim 9, wherein identifying the information indicating where, in the digitized speech, representation of the recognized word appears further comprises:
    - organizing the digitized speech in sequential sets of time domain amplitude samples grouped in frames, each frame characterized in sequence by a unique and cardinal frame identification number, each frame containing the same number of time domain amplitude samples;
      
      converting the digitized speech containing the recognized word to the frequency domain beginning with one of the frames of time domain amplitude samples; and
      
      deriving an index value indicating where, in the digitized speech, representation of the recognized word appears by multiplying the one of the frame identification numbers by the number of amplitude samples in each frame.
  - 11. The apparatus of claim 7, wherein the at least one computer processor is programmed to insert the word into a speech recognition grammar by associating the recognized word, as a non-optional terminal element in the speech recognition grammar, with a word representing a user interface command of the digital audio editor.
  - 12. The apparatus of claim 7, wherein the at least one computer processor is programmed to insert, into the grammar as part of a non-optional terminal grammar element, the information indicating where, in the digitized speech, representation of the recognized word appears.

13. A computer-readable, recordable medium having instructions encoded thereon which, when executed in a system comprising a multimodal digital audio editor operating on a multimodal device supporting multiple modes of user interaction with the multimodal digital audio editor, the modes of user interaction including a voice mode and one or more non-voice modes, the multimodal digital audio editor being operatively coupled to an automatic speech recognition (ASR) engine, perform a method comprising:
- receiving, in the multimodal digital audio editor, recognized speech that the ASR engine generated from digitized speech, that includes a recognized word and information indicating where, in the digitized speech, representation of the recognized word appears;
  
  inserting, by the multimodal digital audio editor, the recognized word into a speech recognition grammar; and
  
  inserting, by the multimodal digital audio editor, into the speech recognition grammar in association with the recognized word, the information indicating where, in the digitized speech, representation of the recognized word appears.
  - 14. The computer-readable, recordable medium of claim 13, wherein the method further comprises visually displaying the digitized speech with the recognized word as an index of where in the digitized speech the representation of the recognized word appears.
  - 15. The computer-readable, recordable medium of claim 13, wherein the method further comprises recognizing, by the ASR engine, the recognized word in the digitized speech and identifying information indicating where, in the digitized speech, representation of the recognized word appears.
  - 16. The computer-readable, recordable medium of claim 15, wherein identifying the information indicating where, in the digitized speech, representation of the recognized word appears further comprises:
    - organizing the digitized speech in sequential sets of time domain amplitude samples grouped in frames, each frame characterized in sequence by a unique and cardinal frame identification number, each frame containing the same number of time domain amplitude samples;
      
      converting the digitized speech containing the recognized word to the frequency domain beginning with one of the frames of time domain amplitude samples; and
      
      deriving an index value indicating where, in the digitized speech, representation of the recognized word appears by multiplying the one of the frame identification numbers by the number of amplitude samples in each frame.
  - 17. The computer-readable, recordable medium of claim 13, wherein inserting the recognized word into a speech recognition grammar further comprises associating the recognized word, as a non-optional terminal element in the speech recognition grammar, with a word representing a user interface command of the digital audio editor.
  - 18. The computer-readable, recordable medium of claim 13, wherein the inserting the information indicating where, in the digitized speech, representation of the recognized word appears comprises inserting the information as part of a non-optional terminal grammar element.
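
Claims 5 and 6 (and their counterparts 11, 12, 17, and 18) describe the recognized word and its position as non-optional terminal elements associated with a user interface command. One way to picture this is as generated grammar text. The rule layout below is purely illustrative: it borrows the common convention that square brackets mark optional elements and braces carry attached data, but it is not claimed to conform to any grammar standard, and the `render_grammar` helper, rule names, and values are hypothetical.

```python
from typing import Iterable, Tuple


def render_grammar(commands: Iterable[str], indexed_words: Iterable[Tuple[str, int]]) -> str:
    """Render a toy grammar in which each word carries its index as attached data.

    No square brackets are emitted, so the command and the word are both
    required (non-optional) parts of every accepted instruction.
    """
    command_rule = " | ".join(commands)
    word_rule = " | ".join(f"{word} {{index={index}}}" for word, index in indexed_words)
    return "\n".join([
        "instruction = $command $word",
        f"command = {command_rule}",
        f"word = {word_rule}",
    ])


# Made-up commands and indexed words for illustration.
grammar_text = render_grammar(
    ["play", "show"],
    [("airplane", 16000), ("bomb", 48000)],
)
```

In this sketch the position travels with the word inside the same grammar element, so a successful match on the word yields its index without a separate lookup.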

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Cross, Charles W.; Jania, Frank L.
Primary Examiner(s): He, Jialong
Application Number: US13/961,792
Publication Number: US 20140039899A1
Time in Patent Office: 258 Days
Field of Search: 704/248, 704/251, 704/253, 704/270
US Class Current: 704/251
CPC Class Codes:
G10L 15/183   using context dependencies,...
G10L 15/19   Grammatical context, e.g. d...
G10L 15/193   Formal grammars, e.g. finit...
G10L 15/197   Probabilistic grammars, e.g...
G10L 15/22   Procedures used during a sp...
G10L 2015/228   of application context
G10L 21/06   Transformation of speech in...
