Spoken word spotting queries
First Claim
1. A method comprising:
- receiving input from a user identifying at least a first portion and a second portion of a first set of audio signals as being of interest to the user, wherein the first portion corresponds to a first instance of an entire spoken event of interest in the first set of audio signals and the second portion corresponds to a second instance of the entire spoken event of interest in the first set of audio signals;
- processing, by a query recognizer of a word spotting system, each identified portion of the first set of audio signals to generate a corresponding subword unit representation of the identified portion;
- forming, by the query recognizer of the word spotting system, a representation of the entire spoken event of interest, wherein the forming includes combining the subword unit representations of the respective identified portions of the first set of audio signals;
- accepting, by a word spotting engine of the word spotting system, data representing unknown speech in a second audio signal; and
- locating, by the word spotting engine of the word spotting system, putative instances of the entire spoken event of interest in the second audio signal using the representation of the spoken event of interest, wherein the locating includes identifying time locations of the second audio signal at which the entire spoken event of interest is likely to have occurred based on a comparison of the data representing the unknown speech with the representation of the entire spoken event of interest, wherein the first instance of the entire spoken event of interest and the second instance of the entire spoken event of interest include a common set of words, and wherein the subword unit representation corresponding to the first portion and the subword unit representation corresponding to the second portion are different.
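As a concrete illustration of the claimed forming step, the sketch below (hypothetical code, not from the patent; the function name and phone symbols are invented for illustration) combines the per-instance subword decodings into one query representation that keeps each distinct decoding as an alternative, which is how two different decodings of the same words can both represent the event:

```python
# Hypothetical sketch of the forming step: each user-marked portion is
# decoded into a subword (phone) sequence, and the query representation
# is the collection of distinct alternative sequences.

def form_query_representation(subword_sequences):
    """Combine per-instance subword decodings into one query representation.

    Two instances of the same spoken words may decode to different phone
    strings; keeping both as alternatives makes the query robust to
    pronunciation variation, matching the claim's requirement that the
    two subword unit representations are different.
    """
    # Deduplicate while preserving order of first appearance.
    alternatives = []
    for seq in subword_sequences:
        if seq not in alternatives:
            alternatives.append(seq)
    return alternatives

# Two instances of the same spoken event, decoded differently
# (phone symbols here are illustrative, not from the patent):
instance_1 = ("ah", "k", "aw", "n", "t", "n", "ah", "m", "b", "er")
instance_2 = ("ax", "k", "aw", "n", "n", "ah", "m", "b", "axr")

query = form_query_representation([instance_1, instance_2])
```

Representing the query as a set of alternatives, rather than a single averaged sequence, is one plausible reading of "combining the subword unit representations".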
Abstract
An approach to wordspotting (180) uses query data from one or more spoken instances of a query (140). The query data is processed to determine a representation of the query (160) that defines multiple sequences of subword units (130), each sequence representing the query. Putative instances of the query (190) are then located in input data from an audio signal using the determined representation of the query.
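One simple way to picture the locating step is a sliding-window comparison between each alternative subword sequence of the query and a phone stream decoded from the audio. The sketch below is an illustrative stand-in (edit-distance matching over a decoded phone stream), not the patent's actual scoring algorithm; all names, symbols, and the threshold are invented:

```python
# Illustrative sketch of locating putative query instances: slide each
# alternative subword sequence over a decoded phone stream and keep
# windows whose edit distance falls under a threshold.

def edit_distance(a, b):
    """Classic Levenshtein distance between two subword sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def locate_putative_instances(query_sequences, phone_stream, frame_times,
                              max_dist=2):
    """Return (start_time, end_time, score) for windows that match well."""
    hits = []
    for seq in query_sequences:
        n = len(seq)
        for start in range(len(phone_stream) - n + 1):
            window = tuple(phone_stream[start:start + n])
            d = edit_distance(seq, window)
            if d <= max_dist:
                hits.append((frame_times[start],
                             frame_times[start + n - 1], d))
    return hits

# Phone stream decoded from the "unknown speech" audio (illustrative):
phones = ["sil", "ah", "k", "aw", "n", "t", "n", "ah", "m", "b", "er", "sil"]
times = [0.05 * i for i in range(len(phones))]
query = [("ah", "k", "aw", "n", "t", "n", "ah", "m", "b", "er")]
hits = locate_putative_instances(query, phones, times)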
19 Claims
1. (Recited in full above as the First Claim.) - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
17. A tangible computer-readable medium storing instructions for causing a processing system to:
- receive input from a user identifying at least a first portion and a second portion of a first set of audio signals as being of interest to the user, wherein the first portion corresponds to a first instance of an entire spoken event of interest in the first set of audio signals and the second portion corresponds to a second instance of the entire spoken event of interest in the first set of audio signals;
- process each identified portion of the first set of audio signals to generate a corresponding subword unit representation of the identified portion;
- form a representation of the entire spoken event of interest, wherein the instructions for causing the processing system to form the representation include instructions for combining the subword unit representations of the respective identified portions of the first set of audio signals;
- accept data representing unknown speech in a second audio signal; and
- locate putative instances of the entire spoken event of interest in the second audio signal using the representation of the entire spoken event of interest, wherein the instructions for causing the processing system to locate the putative instances include instructions for identifying time locations of the second audio signal at which the entire spoken event of interest is likely to have occurred based on a comparison of the data representing the unknown speech with the representation of the entire spoken event of interest.
18. A system comprising:
- a speech recognizer for: receiving input from a user identifying at least a first portion and a second portion of a first set of audio signals as being of interest to the user, wherein the first portion corresponds to a first instance of an entire spoken event of interest in the first set of audio signals and the second portion corresponds to a second instance of the entire spoken event of interest in the first set of audio signals; processing each identified portion of the first set of audio signals to generate a corresponding subword unit representation of the identified portion; and forming a representation of the entire spoken event of interest, wherein the forming includes combining the subword unit representations of the respective identified portions of the first set of audio signals;
- a data storage for receiving the representation of the entire spoken event of interest from the speech recognizer; and
- a word spotter configured to use the representation of the entire spoken event of interest to locate putative instances of the entire spoken event of interest in data representing unknown speech in a second audio signal. - View Dependent Claims (19)
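The system of claim 18 decomposes into three components: a recognizer that forms the query representation, a data storage that holds it, and a word spotter that consumes it. The sketch below is a hypothetical wiring of those components; all class and method names are invented, and the recognizer's acoustic decoding is stubbed out with precomputed phone sequences:

```python
# Hypothetical wiring of the three claimed components. The phone symbols
# and the exact-match spotting logic are illustrative stand-ins only.

class SpeechRecognizer:
    def form_representation(self, decoded_portions):
        """Combine per-portion subword decodings, keeping distinct alternatives."""
        rep = []
        for seq in decoded_portions:
            if seq not in rep:
                rep.append(seq)
        return rep

class DataStorage:
    """Holds the query representation produced by the recognizer."""
    def __init__(self):
        self._representation = None
    def put(self, representation):
        self._representation = representation
    def get(self):
        return self._representation

class WordSpotter:
    """Finds putative instances of the query in a decoded phone stream."""
    def locate(self, representation, phone_stream):
        hits = []
        for seq in representation:
            n = len(seq)
            for i in range(len(phone_stream) - n + 1):
                if tuple(phone_stream[i:i + n]) == seq:
                    hits.append(i)  # index of a putative instance
        return hits

# Wire the pipeline together (phone symbols are illustrative only):
recognizer = SpeechRecognizer()
storage = DataStorage()
spotter = WordSpotter()

storage.put(recognizer.form_representation([("h", "eh", "l", "ow"),
                                            ("hh", "ax", "l", "ow")]))
stream = ["sil", "h", "eh", "l", "ow", "sil"]
hits = spotter.locate(storage.get(), stream)
```

Routing the representation through a storage component, rather than passing it directly, matches the claim's separation of the recognizer (query-time) from the spotter (search-time).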
Specification