System and method for indexing and querying audio archives
Abstract
A system and method for indexing segments of audio/multimedia files and data streams for storage in a database according to audio information such as speaker identity, the background environment and channel (music, street noise, car noise, telephone, studio noise, speech plus music, speech plus noise, speech over speech), and/or the transcription of the spoken utterances. The content or topic of the transcribed text can also be determined using natural language understanding to index based on the context of the transcription. A user can then retrieve desired segments of the audio file from the database by generating a query having one or more desired parameters based on the indexed information.
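The abstract's core idea, tagging audio segments with speaker, environment, and topic information and retrieving them by multi-parameter query, can be sketched as follows. This is a minimal illustration only; the names `Segment` and `AudioIndex` are not from the patent.

```python
# Minimal sketch of the indexing-and-query idea from the abstract.
# Segment and AudioIndex are illustrative names, not from the patent.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Segment:
    start: float                       # segment start time, seconds
    end: float                         # segment end time, seconds
    speaker: Optional[str] = None      # verified speaker tag
    environment: Optional[str] = None  # e.g. "studio", "street noise"
    topics: List[str] = field(default_factory=list)

class AudioIndex:
    """Stores tagged segments and answers multi-parameter queries."""

    def __init__(self) -> None:
        self.segments: List[Segment] = []

    def add(self, seg: Segment) -> None:
        self.segments.append(seg)

    def query(self, speaker=None, environment=None, topic=None):
        # A segment matches only if it satisfies every supplied parameter.
        hits = []
        for s in self.segments:
            if speaker is not None and s.speaker != speaker:
                continue
            if environment is not None and s.environment != environment:
                continue
            if topic is not None and topic not in s.topics:
                continue
            hits.append(s)
        return hits
```

A query with several parameters narrows the result set, matching the abstract's "one or more desired parameters" retrieval model.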
34 Claims
1. A method for processing an audio data file, comprising the steps of:
segmenting the audio data file into segments based on detected speaker changes;
performing speaker identification for each segment and assigning at least one speaker identification tag to each segment based on an identified speaker;
verifying the identity of the speaker associated with the at least one identification tag for each segment; and
indexing the segments of the audio data file for storage in a database in accordance with the identification tags of verified speakers. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
generating a voiceprint for each segment of the audio data file; and
storing each voiceprint with its corresponding segment in the database.
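The four steps of claim 1 (segment on speaker change, identify, verify, index only verified tags) can be sketched as below. The helper callables `detect_change`, `identify`, and `verify` are stand-ins for real detectors, which the claim does not specify.

```python
# Hypothetical sketch of claim 1's pipeline. detect_change, identify,
# and verify are stand-ins for real speaker-change detection,
# identification, and verification components.
def segment_on_speaker_change(frames, detect_change):
    """Split a frame sequence wherever detect_change flags a boundary."""
    segments, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        if detect_change(prev, cur):
            segments.append(current)
            current = []
        current.append(cur)
    segments.append(current)
    return segments

def index_verified(segments, identify, verify):
    """Map verified speaker tags to the segments they label."""
    index = {}
    for seg in segments:
        tag = identify(seg)
        if verify(seg, tag):        # only verified tags reach the index
            index.setdefault(tag, []).append(seg)
    return index
```

Note that verification acts as a gate: a segment whose identified speaker fails verification is simply not indexed under that tag.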
6. The method of claim 5, wherein the user query is a voiceprint associated with the desired speaker, and wherein the retrieving step includes the steps of:
comparing the input speaker voiceprint with each of the stored voiceprints of the segments; and
selecting at least one segment having a corresponding voiceprint stored therewith that matches the input voiceprint.
7. The method of claim 5, wherein the user query is an audio segment of the desired speaker, and wherein the retrieving step includes the steps of:
generating a voiceprint from the input audio segment;
comparing the generated voiceprint with each of the stored voiceprints of the segments; and
selecting at least one segment having a corresponding voiceprint stored therewith that matches the generated voiceprint.
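Claims 6 and 7 both reduce to comparing a query voiceprint against stored voiceprints. A common way to do this, assumed here since the patent does not name a metric, is cosine similarity against a threshold:

```python
# Sketch of the voiceprint-matching step of claims 6-7. Cosine
# similarity with a fixed threshold is an assumption; the patent
# does not specify the comparison metric.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def match_voiceprint(query, stored, threshold=0.9):
    """Return ids of segments whose stored voiceprint matches the query."""
    return [sid for sid, vp in stored.items()
            if cosine(query, vp) >= threshold]
```

For claim 7, the only extra step is generating the query voiceprint from an input audio segment before calling the matcher.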
8. The method of claim 5, further including the step of storing for each segment one of a corresponding waveform, acoustic features, and both.
9. The method of claim 8, wherein the user query is an audio segment of the desired speaker, and wherein the retrieving step includes the steps of:
comparing the audio segment with one of the stored waveforms and stored acoustic features of each of the segments in the database; and
selecting at least one segment having a corresponding one of a waveform and acoustic features stored therewith that match the audio segment of the speaker of interest.
10. The method of claim 9, wherein the audio segment of the speaker of interest is one of input by the user and selected from the database.
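Claims 8-10 match a query audio segment against stored waveforms or acoustic features rather than voiceprints. As a toy stand-in, assuming per-frame feature vectors (the patent does not fix a representation), one can average frames into a summary vector and compare by Euclidean distance:

```python
# Toy sketch of the feature-matching retrieval of claims 8-10.
# Frame-averaged vectors and a Euclidean-distance cutoff are
# assumptions for illustration only.
def mean_feature(frames):
    """Average per-frame feature vectors into one summary vector."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]

def nearest_segments(query_frames, stored, max_dist=1.0):
    """Return ids of stored segments whose features lie near the query."""
    q = mean_feature(query_frames)
    hits = []
    for sid, frames in stored.items():
        s = mean_feature(frames)
        dist = sum((a - b) ** 2 for a, b in zip(q, s)) ** 0.5
        if dist <= max_dist:
            hits.append(sid)
    return hits
```

A production system would more likely use frame-level alignment (e.g. dynamic time warping) or a learned embedding, but the retrieval shape is the same.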
11. The method of claim 1, further including the steps of:
segmenting the audio data file into segments based on detected changes in environment; and
identifying at least one environment of each segment and assigning at least one environment tag to each segment corresponding to the at least one identified environment;
wherein the indexing step further includes indexing the segments of the audio data file for storage in the database in accordance with the environment tags of the segments.
12. The method of claim 11, wherein the step of detecting changes in environment includes detecting changes in one of a background noise, a channel, and a combination thereof.
13. The method of claim 11, including the step of retrieving at least one segment from the database in accordance with a user query based on one of an identity of a desired speaker, the identity of a desired environment, and a combination thereof.
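Claims 11-13 add a second tagging axis: each segment carries an environment tag alongside its speaker tag, and a query may constrain either or both. A minimal sketch, with the index layout (tuples of id, speaker, environment) assumed for illustration:

```python
# Toy illustration of claims 11-13. The (segment_id, speaker_tag,
# environment_tag) index layout and the classify callable are
# assumptions, not taken from the patent.
def tag_environments(segments, classify):
    """Attach an environment tag to each segment via a classifier."""
    return [(seg, classify(seg)) for seg in segments]

def query_index(index, speaker=None, environment=None):
    """index: iterable of (segment_id, speaker_tag, environment_tag)."""
    return [seg for seg, spk, env in index
            if (speaker is None or spk == speaker)
            and (environment is None or env == environment)]
```

Combining both parameters corresponds to claim 13's "combination thereof" query.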
14. The method of claim 1, further including the steps of:
recognizing spoken words of each segment; and
storing the recognized words for each corresponding segment in the database.
15. The method of claim 14, wherein the recognizing step includes the steps of:
identifying one of channel acoustic components, background acoustic components, and a combination thereof, for each segment; and
decoding the spoken words of each segment using trained models based on the identified acoustic components.
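The decoding step of claim 15 amounts to model selection: identify the segment's channel or background condition, then decode with a model trained for that condition. A sketch, where the `models` mapping and fallback key are assumptions:

```python
# Sketch of claim 15's condition-dependent decoding. The models dict
# (condition -> recognizer callable) and the "clean" fallback are
# hypothetical; the patent only says trained models are selected
# based on the identified acoustic components.
def decode_segment(segment, acoustic_condition, models, default="clean"):
    """Choose the recognizer matching the condition, else fall back."""
    recognize = models.get(acoustic_condition, models[default])
    return recognize(segment)
```

Matching the acoustic model to the channel (telephone vs. studio, for example) is a standard way to reduce recognition errors from channel mismatch.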
16. The method of claim 14, further including the steps of:
performing natural language understanding (NLU) of the recognized words of each segment to determine at least one NLU topic of each segment;
wherein the indexing step further includes indexing the segments of the audio data file for storage in the database in accordance with the determined NLU topics.
17. The method of claim 16, including the step of retrieving at least one segment from the database in accordance with a user query based on one of an identity of a speaker of interest, at least one user-selected keyword, context of the recognized text, at least one NLU topic, and a combination thereof.
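The NLU topic assignment of claims 16-17 can be crudely approximated by keyword lookup, shown below as a stand-in; a real system would use a trained topic model or language-understanding engine, and the lexicon here is purely illustrative.

```python
# Crude keyword lookup standing in for the NLU topic step of
# claims 16-17. TOPIC_KEYWORDS is an illustrative lexicon, not
# from the patent.
TOPIC_KEYWORDS = {
    "finance": {"stock", "market", "earnings"},
    "weather": {"rain", "forecast", "storm"},
}

def topics_of(words, lexicon=TOPIC_KEYWORDS):
    """Return sorted topics whose keywords appear among the words."""
    found = {topic for topic, kws in lexicon.items() if kws & set(words)}
    return sorted(found)
```

Segments would then be indexed under each returned topic, enabling the topic-based queries of claim 17.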
18. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform method steps for processing an audio data file, the method steps comprising:
segmenting the audio data file into segments based on detected speaker changes;
performing speaker identification for each segment and assigning at least one speaker identification tag to each segment based on an identified speaker;
verifying the identity of the speaker associated with the at least one identification tag for each segment; and
indexing the segments of the audio data file for storage in a database in accordance with the identification tags of verified speakers. - View Dependent Claims (19, 20, 21, 22, 23, 24, 25, 26)
segmenting the audio data file into segments based on detected changes in environment; and
identifying at least one environment of each segment to assign at least one environment tag to each segment corresponding to the at least one identified environment;
wherein the instructions for performing the indexing step further include instructions for indexing the segments of the audio data file for storage in the database in accordance with the environment tags.
21. The program storage device of claim 20, wherein the step of detecting changes in environment includes detecting changes in one of a background, a channel, and a combination thereof.
22. The program storage device of claim 20, further including instructions for performing the step of retrieving at least one segment from the database in accordance with a user query based on one of an identity of a desired speaker, the identity of a desired environment, and a combination thereof.
23. The program storage device of claim 18, further including instructions for performing the steps of:
recognizing spoken words of each segment; and
storing the recognized words for each corresponding segment in the database.
24. The program storage device of claim 23, wherein the instructions for performing the recognizing step include instructions for performing the steps of:
identifying one of channel acoustic components, background acoustic components, and a combination thereof, for each segment; and
decoding the spoken words of each segment using trained models based on the identified acoustic components.
25. The program storage device of claim 23, further including instructions for performing the steps of:
performing natural language understanding (NLU) of the recognized words of each segment to determine at least one NLU topic of each segment;
wherein the instructions for performing the indexing step include instructions for indexing the segments of the audio data file for storage in the database in accordance with the determined NLU topics.
26. The program storage device of claim 25, further including instructions for performing the step of retrieving at least one segment from the database in accordance with a user query based on one of an identity of a speaker of interest, at least one user-selected keyword, context of the recognized text, at least one NLU topic, and a combination thereof.
27. A system for managing a database of audio data files, comprising:
a segmenter for dividing an input audio data file into segments by detecting speaker changes in the input audio data file;
a speaker identifier for identifying a speaker of each segment and assigning at least one identity tag to each segment;
a speaker verifier for verifying the at least one identity tag of each segment; and
an indexer for indexing the segments of the input audio data file for storage in the database in accordance with the identity tags of verified speakers. - View Dependent Claims (28, 29, 30, 31, 32, 33, 34)
a speech recognizer for recognizing spoken words of each segment, wherein the recognized words for each segment are stored in the database and indexed to the corresponding segment.
33. The system of claim 32, further comprising means for performing natural language understanding (NLU) of the recognized words of each segment to determine at least one NLU topic of each segment, wherein the indexer indexes the segments of the audio data file for storage in the database in accordance with the determined NLU topics.
34. The system of claim 33, further comprising a search engine for retrieving at least one segment from the database by processing a user query based on one of an identity of a speaker of interest, at least one user-selected keyword, context of the recognized text, at least one NLU topic, and a combination thereof.
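The system claims name distinct components: a segmenter, a speaker identifier, a speaker verifier, and an indexer, with a search engine over the resulting index. One illustrative wiring, with every component passed in as a callable since the patent does not specify implementations:

```python
# Illustrative composition of the components named in claims 27-34.
# AudioArchive and all callables are hypothetical names; the patent
# defines only the component roles, not their implementations.
class AudioArchive:
    def __init__(self, segmenter, identifier, verifier):
        self.segmenter = segmenter    # audio -> list of segments
        self.identifier = identifier  # segment -> speaker tag
        self.verifier = verifier      # (segment, tag) -> bool
        self.index = {}               # speaker tag -> list of segments

    def ingest(self, audio):
        """Segment, identify, verify, and index an audio data file."""
        for seg in self.segmenter(audio):
            tag = self.identifier(seg)
            if self.verifier(seg, tag):
                self.index.setdefault(tag, []).append(seg)

    def search(self, speaker):
        """Search-engine role: retrieve segments by verified speaker tag."""
        return self.index.get(speaker, [])
```

Keeping each role behind a callable mirrors the claims' separation of concerns: any one component can be swapped without touching the others.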
Specification