Spoken word spotting queries
First Claim
1. A method comprising:
- receiving input from a user identifying at least a first portion and a second portion of a first set of audio signals as being of interest to the user, wherein the first portion corresponds to a first instance of an entire spoken event of interest in the first set of audio signals and the second portion corresponds to a second instance of the entire spoken event of interest in the first set of audio signals;
- processing, by a query recognizer of a word spotting system, each identified portion of the first set of audio signals to generate a corresponding subword unit representation of the identified portion;
- forming, by the query recognizer of the word spotting system, a representation of the entire spoken event of interest, wherein the forming includes combining the subword unit representations of the respective identified portions of the first set of audio signals;
- accepting, by a word spotting engine of the word spotting system, data representing unknown speech in a second audio signal; and
- locating, by the word spotting engine of the word spotting system, putative instances of the entire spoken event of interest in the second audio signal using the representation of the spoken event of interest, wherein the locating includes identifying time locations of the second audio signal at which the entire spoken event of interest is likely to have occurred based on a comparison of the data representing the unknown speech with the representation of the entire spoken event of interest, wherein the first instance of the entire spoken event of interest and the second instance of the entire spoken event of interest include a common set of words, and wherein the subword unit representation corresponding to the first portion and the subword unit representation corresponding to the second portion are different.
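As a concrete illustration of the claimed forming step, the sketch below (hypothetical code, not from the patent; the function name and phone symbols are invented for illustration) combines the per-instance subword decodings into one query representation that keeps each distinct decoding as an alternative, which is how two different decodings of the same words can both represent the event:

```python
# Hypothetical sketch of the forming step: each user-marked portion is
# decoded into a subword (phone) sequence, and the query representation
# is the collection of distinct alternative sequences.

def form_query_representation(subword_sequences):
    """Combine per-instance subword decodings into one query representation.

    Two instances of the same spoken words may decode to different phone
    strings; keeping both as alternatives makes the query robust to
    pronunciation variation, matching the claim's requirement that the
    two subword unit representations are different.
    """
    # Deduplicate while preserving order of first appearance.
    alternatives = []
    for seq in subword_sequences:
        if seq not in alternatives:
            alternatives.append(seq)
    return alternatives

# Two instances of the same spoken event, decoded differently
# (phone symbols here are illustrative, not from the patent):
instance_1 = ("ah", "k", "aw", "n", "t", "n", "ah", "m", "b", "er")
instance_2 = ("ax", "k", "aw", "n", "n", "ah", "m", "b", "axr")

query = form_query_representation([instance_1, instance_2])
```

Representing the query as a set of alternatives, rather than a single averaged sequence, is one plausible reading of "combining the subword unit representations".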
Abstract
An approach to wordspotting (180) uses query data from one or more spoken instances of a query (140). The query data is processed to determine a representation of the query (160) that defines multiple sequences of subword units (130), each sequence representing the query. Putative instances of the query (190) are then located in input data from an audio signal using the determined representation of the query.
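One simple way to picture the locating step is a sliding-window comparison between each alternative subword sequence of the query and a phone stream decoded from the audio. The sketch below is an illustrative stand-in (edit-distance matching over a decoded phone stream), not the patent's actual scoring algorithm; all names, symbols, and the threshold are invented:

```python
# Illustrative sketch of locating putative query instances: slide each
# alternative subword sequence over a decoded phone stream and keep
# windows whose edit distance falls under a threshold.

def edit_distance(a, b):
    """Classic Levenshtein distance between two subword sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def locate_putative_instances(query_sequences, phone_stream, frame_times,
                              max_dist=2):
    """Return (start_time, end_time, score) for windows that match well."""
    hits = []
    for seq in query_sequences:
        n = len(seq)
        for start in range(len(phone_stream) - n + 1):
            window = tuple(phone_stream[start:start + n])
            d = edit_distance(seq, window)
            if d <= max_dist:
                hits.append((frame_times[start],
                             frame_times[start + n - 1], d))
    return hits

# Phone stream decoded from the "unknown speech" audio (illustrative):
phones = ["sil", "ah", "k", "aw", "n", "t", "n", "ah", "m", "b", "er", "sil"]
times = [0.05 * i for i in range(len(phones))]
query = [("ah", "k", "aw", "n", "t", "n", "ah", "m", "b", "er")]
hits = locate_putative_instances(query, phones, times)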
19 Claims
1. (Recited in full above as the First Claim.) - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
17. A tangible computer-readable medium storing instructions for causing a processing system to:
- receive input from a user identifying at least a first portion and a second portion of a first set of audio signals as being of interest to the user, wherein the first portion corresponds to a first instance of an entire spoken event of interest in the first set of audio signals and the second portion corresponds to a second instance of the entire spoken event of interest in the first set of audio signals;
- process each identified portion of the first set of audio signals to generate a corresponding subword unit representation of the identified portion;
- form a representation of the entire spoken event of interest, wherein the instructions for causing the processing system to form the representation include instructions for combining the subword unit representations of the respective identified portions of the first set of audio signals;
- accept data representing unknown speech in a second audio signal; and
- locate putative instances of the entire spoken event of interest in the second audio signal using the representation of the entire spoken event of interest, wherein the instructions for causing the processing system to locate the putative instances include instructions for identifying time locations of the second audio signal at which the entire spoken event of interest is likely to have occurred based on a comparison of the data representing the unknown speech with the representation of the entire spoken event of interest.
18. A system comprising:
- a speech recognizer for: receiving input from a user identifying at least a first portion and a second portion of a first set of audio signals as being of interest to the user, wherein the first portion corresponds to a first instance of an entire spoken event of interest in the first set of audio signals and the second portion corresponds to a second instance of the entire spoken event of interest in the first set of audio signals; processing each identified portion of the first set of audio signals to generate a corresponding subword unit representation of the identified portion; and forming a representation of the entire spoken event of interest, wherein the forming includes combining the subword unit representations of the respective identified portions of the first set of audio signals;
- a data storage for receiving the representation of the entire spoken event of interest from the speech recognizer; and
- a word spotter configured to use the representation of the entire spoken event of interest to locate putative instances of the entire spoken event of interest in data representing unknown speech in a second audio signal. - View Dependent Claims (19)
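The system of claim 18 decomposes into three components: a recognizer that forms the query representation, a data storage that holds it, and a word spotter that consumes it. The sketch below is a hypothetical wiring of those components; all class and method names are invented, and the recognizer's acoustic decoding is stubbed out with precomputed phone sequences:

```python
# Hypothetical wiring of the three claimed components. The phone symbols
# and the exact-match spotting logic are illustrative stand-ins only.

class SpeechRecognizer:
    def form_representation(self, decoded_portions):
        """Combine per-portion subword decodings, keeping distinct alternatives."""
        rep = []
        for seq in decoded_portions:
            if seq not in rep:
                rep.append(seq)
        return rep

class DataStorage:
    """Holds the query representation produced by the recognizer."""
    def __init__(self):
        self._representation = None
    def put(self, representation):
        self._representation = representation
    def get(self):
        return self._representation

class WordSpotter:
    """Finds putative instances of the query in a decoded phone stream."""
    def locate(self, representation, phone_stream):
        hits = []
        for seq in representation:
            n = len(seq)
            for i in range(len(phone_stream) - n + 1):
                if tuple(phone_stream[i:i + n]) == seq:
                    hits.append(i)  # index of a putative instance
        return hits

# Wire the pipeline together (phone symbols are illustrative only):
recognizer = SpeechRecognizer()
storage = DataStorage()
spotter = WordSpotter()

storage.put(recognizer.form_representation([("h", "eh", "l", "ow"),
                                            ("hh", "ax", "l", "ow")]))
stream = ["sil", "h", "eh", "l", "ow", "sil"]
hits = spotter.locate(storage.get(), stream)
```

Routing the representation through a storage component, rather than passing it directly, matches the claim's separation of the recognizer (query-time) from the spotter (search-time).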
Specification