Systems and methods for automatic acoustic speaker adaptation in computer-assisted transcription systems
Abstract
The invention is a system and method for automatic acoustic speaker adaptation in an automatic-speech-recognition-assisted transcription system. A transcriptionist generates partial transcripts of audio files. A topic language model is built from the partial transcripts and interpolated with a general language model. A speech recognition engine then performs automatic speech recognition on the audio files, using a speaker-independent acoustic model and the interpolated language model, to generate semi-literal transcripts of the audio files. Finally, an acoustic adaptation engine uses the semi-literal transcripts and the corresponding audio files to generate a speaker-dependent acoustic model.
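The data flow described in the abstract can be sketched end to end. The following is a toy illustration, not the patented implementation: the recognition and adaptation engines are replaced by trivial stand-ins, the topic model is reduced to unigram counts, and all function names are invented for the sketch.

```python
from collections import Counter

def build_topic_lm(partial_transcript: str) -> Counter:
    """Build a topic language model (here: toy unigram counts) from a partial transcript."""
    return Counter(partial_transcript.lower().split())

def interpolate_lm(topic_lm: Counter, general_lm: Counter, weight: float = 0.5) -> dict:
    """Linearly interpolate topic and general models into one probability table."""
    t_total, g_total = sum(topic_lm.values()), sum(general_lm.values())
    vocab = set(topic_lm) | set(general_lm)
    return {w: weight * topic_lm[w] / t_total + (1 - weight) * general_lm[w] / g_total
            for w in vocab}

def recognize(audio: bytes, interpolated_lm: dict) -> str:
    """Stand-in recognizer: emits the most probable words as a 'semi-literal' transcript."""
    return " ".join(sorted(interpolated_lm, key=interpolated_lm.get, reverse=True)[:3])

def adapt(audio: bytes, semi_literal: str) -> dict:
    """Stand-in adaptation engine: a speaker-dependent 'model' keyed by transcript words."""
    return {"adapted_on": semi_literal.split()}

def adaptation_pipeline(audio: bytes, partial_transcript: str, general_lm: Counter) -> dict:
    """Build topic LM, interpolate, recognize, then adapt -- the flow of the abstract."""
    topic_lm = build_topic_lm(partial_transcript)
    mixed = interpolate_lm(topic_lm, general_lm)
    semi_literal = recognize(audio, mixed)
    return adapt(audio, semi_literal)
```

The key point the sketch preserves is that the partial transcript only shapes the language model; the semi-literal transcript that feeds acoustic adaptation is produced by recognition, not typed by the transcriptionist.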
33 Claims
1. A method for acoustic adaptation comprising the steps of:
collecting at least one audio file associated with a partial transcript of the audio file;
building a topic language model from the partial transcript;
interpolating the topic language model with a general language model;
using a speaker-independent acoustic model and the interpolated language model in a speech recognition engine on the audio file to generate a semi-literal transcript; and
generating a speaker-dependent acoustic model using the semi-literal transcript and the audio file in an acoustic adaptation engine.
(Dependent claims: 2-10)
11. A system for acoustic adaptation comprising:
a voice server for storing at least one audio file, wherein the audio file is stored according to the identity of the speaker;
a text server for storing at least one transcription associated with the at least one audio file;
a speech recognition engine for receiving audio files, acoustic models, and language models, and outputting text files;
an acoustic adaptation engine for receiving audio files and associated text files and outputting acoustic model files; and
a speech recognition server for sending audio files to the speech recognition engine and the acoustic adaptation engine and for sending text files to the acoustic adaptation engine;
wherein the speech recognition server receives an audio file and an associated partial transcript of the audio file, builds a topic language model from the partial transcript, and interpolates the topic language model with a general language model to generate an interpolated language model;
wherein the speech recognition engine uses the interpolated language model and a speaker-independent acoustic model to generate a semi-literal transcript from an audio file; and
wherein the acoustic adaptation engine uses the semi-literal transcript and the audio file to generate a speaker-dependent acoustic model.
(Dependent claims: 12-16)
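The division of labor among the servers in claim 11 can be sketched with in-memory dictionaries standing in for the voice and text servers, and trivial callables standing in for the recognition and adaptation engines. All class and method names here are illustrative, not from the patent.

```python
class VoiceServer:
    """Stores audio files according to the identity of the speaker."""
    def __init__(self):
        self.files = {}  # speaker identity -> list of audio files

    def store(self, speaker: str, audio: bytes):
        self.files.setdefault(speaker, []).append(audio)

class TextServer:
    """Stores transcriptions associated with audio files."""
    def __init__(self):
        self.transcripts = {}  # audio id -> partial transcript

    def store(self, audio_id: str, text: str):
        self.transcripts[audio_id] = text

class SpeechRecognitionServer:
    """Routes audio to the recognition engine, and audio plus the resulting
    semi-literal transcript to the adaptation engine."""
    def __init__(self, recognize, adapt):
        self.recognize = recognize  # (audio, partial_transcript) -> semi-literal text
        self.adapt = adapt          # (audio, semi_literal) -> acoustic model

    def run(self, audio: bytes, partial_transcript: str):
        semi_literal = self.recognize(audio, partial_transcript)
        return self.adapt(audio, semi_literal)
```

The design point the claim makes is that only the speech recognition server touches both engines; the voice and text servers are pure storage.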
17. A system for acoustic adaptation comprising:
means for collecting at least one audio file associated with a partial transcript of the audio file;
means for building a topic language model from the partial transcript;
means for interpolating the topic language model with a general language model;
means for generating a semi-literal transcript using a speaker-independent acoustic model and the interpolated language model; and
means for generating a speaker-dependent acoustic model using the semi-literal transcript and the audio file.
(Dependent claims: 18-23)
24. A method for creating an interpolated language model for speech recognition, the method comprising the steps of:
collecting at least one audio file associated with a partial transcript of that audio file;
filtering out predetermined sections of the partial transcript;
normalizing the text of the partial transcript;
creating a first and a second copy of the partial transcript;
removing punctuation from the first copy of the partial transcript;
adding punctuation as words to the second copy of the partial transcript;
merging the first and second copies of the partial transcript to create a semi-literal transcript, wherein the first and second copies of the partial transcript are selectively weighted according to at least one predetermined probability factor;
building a topic language model from the semi-literal transcript; and
interpolating the topic language model with a general language model to create an interpolated language model.
(Dependent claims: 25-31)
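The two-copy step of claim 24 exists because a dictating speaker may either speak punctuation ("comma", "period") or omit it, so the language model should see both forms. A minimal sketch of that text preparation follows; the punctuation-word mapping and the duplication-based weighting are assumptions for the sketch, not the patent's method.

```python
import string

# Assumed spoken-punctuation tokens (illustrative only).
PUNCT_WORDS = {".": "PERIOD", ",": "COMMA"}

def strip_punct(text: str) -> str:
    """First copy: remove punctuation marks entirely."""
    return text.translate(str.maketrans("", "", string.punctuation))

def punct_as_words(text: str) -> str:
    """Second copy: replace punctuation marks with spoken-word tokens."""
    for mark, word in PUNCT_WORDS.items():
        text = text.replace(mark, " " + word)
    return " ".join(text.split())

def merged_training_text(transcript: str, punct_weight: float = 0.3) -> list:
    """Merge the two copies in a predetermined proportion, realized here by
    duplication counts out of 10 (a stand-in for probability weighting)."""
    no_punct = strip_punct(transcript)
    with_punct = punct_as_words(transcript)
    n_with = max(1, round(punct_weight * 10))
    n_without = max(1, 10 - n_with)
    return [with_punct] * n_with + [no_punct] * n_without
```

A language-model toolkit would then count n-grams over this merged corpus, so the relative weights directly shape the n-gram probabilities.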
32. A method for acoustic adaptation comprising the steps of:
collecting at least one audio file associated with a partial transcript of the audio file;
counting a number of audio files and associated partial transcripts;
filtering out predetermined sections of the partial transcript;
tokenizing the text of the partial transcript;
removing punctuation from a first copy of the partial transcript;
adding punctuation as words to a second copy of the partial transcript;
building a topic language model from the first and second copies of the partial transcript selectively weighted according to a predetermined probability factor, wherein the topic language model comprises trigram word statistics;
interpolating the topic language model with a general language model, wherein the general language model comprises trigram word statistics;
using a speaker-independent acoustic model and the interpolated language model in a speech recognition engine on the audio file to generate a semi-literal transcript; and
generating a speaker-dependent acoustic model using the semi-literal transcript and the audio file in an acoustic adaptation engine, wherein the steps of building, interpolating, using, and generating are performed after a predetermined number of audio files and associated partial transcripts have been counted in the counting step.
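Claim 32 gates the whole adaptation pass behind a counting step: nothing is built or adapted until enough audio files with partial transcripts have accumulated for a speaker. The gating behavior can be sketched as follows; the class name, threshold, and callback are illustrative.

```python
class AdaptationCounter:
    """Collects (audio, partial transcript) pairs and triggers the
    build/interpolate/recognize/adapt steps once a predetermined
    number of files has been counted."""

    def __init__(self, threshold: int, run_adaptation):
        self.threshold = threshold
        self.run_adaptation = run_adaptation  # callable over the collected batch
        self.pending = []                     # (audio, partial_transcript) pairs

    def collect(self, audio: bytes, partial_transcript: str):
        """Count one more audio file; run adaptation only at the threshold."""
        self.pending.append((audio, partial_transcript))
        if len(self.pending) >= self.threshold:
            batch, self.pending = self.pending, []
            return self.run_adaptation(batch)
        return None  # not enough data yet
```

Batching like this is a plausible reason for the counting step: a speaker-dependent acoustic model estimated from a single short file would be unreliable, so adaptation waits for a minimum amount of speaker data.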
33. A system for acoustic adaptation comprising:
a voice server for storing at least one audio file, wherein the audio file is stored according to the identity of the speaker;
a text server for storing at least one transcription associated with the at least one audio file;
a counter for counting a number of audio files for a particular speaker;
a speech recognition engine for receiving audio files, acoustic models, and language models, and outputting text files;
an acoustic adaptation engine for receiving audio files and associated text files and outputting acoustic model files; and
a speech recognition server for sending audio files to the speech recognition engine and the acoustic adaptation engine and for sending text files to the acoustic adaptation engine;
wherein the speech recognition server receives an audio file and an associated partial transcript of the audio file, builds a topic language model comprising trigram word statistics from copies of a punctuation text and a no-punctuation text in a predetermined proportion after the counter has counted a predetermined number of audio files for the particular speaker, and interpolates the topic language model with a general language model comprising trigram word statistics to generate an interpolated language model;
wherein the speech recognition engine uses the interpolated language model and a speaker-independent acoustic model to generate a semi-literal transcript from an audio file; and
wherein the acoustic adaptation engine uses the semi-literal transcript and the audio file to generate a speaker-dependent acoustic model.