Systems and methods for automatic acoustic speaker adaptation in computer-assisted transcription systems
Abstract
The invention is a system and method for automatic acoustic speaker adaptation in an automatic-speech-recognition-assisted transcription system. A transcriptionist generates partial transcripts of audio files. A topic language model is built from the partial transcripts and interpolated with a general language model. A speech recognition engine then performs automatic speech recognition on the audio files, using a speaker-independent acoustic model and the interpolated language model, to generate semi-literal transcripts of the audio files. Finally, an acoustic adaptation engine uses the semi-literal transcripts and the corresponding audio files to generate a speaker-dependent acoustic model.
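The data flow described in the abstract can be sketched end to end. The following is a toy illustration, not the patented implementation: the recognition and adaptation engines are replaced by trivial stand-ins, the topic model is reduced to unigram counts, and all function names are invented for the sketch.

```python
from collections import Counter

def build_topic_lm(partial_transcript: str) -> Counter:
    """Build a topic language model (here: toy unigram counts) from a partial transcript."""
    return Counter(partial_transcript.lower().split())

def interpolate_lm(topic_lm: Counter, general_lm: Counter, weight: float = 0.5) -> dict:
    """Linearly interpolate topic and general models into one probability table."""
    t_total, g_total = sum(topic_lm.values()), sum(general_lm.values())
    vocab = set(topic_lm) | set(general_lm)
    return {w: weight * topic_lm[w] / t_total + (1 - weight) * general_lm[w] / g_total
            for w in vocab}

def recognize(audio: bytes, interpolated_lm: dict) -> str:
    """Stand-in recognizer: emits the most probable words as a 'semi-literal' transcript."""
    return " ".join(sorted(interpolated_lm, key=interpolated_lm.get, reverse=True)[:3])

def adapt(audio: bytes, semi_literal: str) -> dict:
    """Stand-in adaptation engine: a speaker-dependent 'model' keyed by transcript words."""
    return {"adapted_on": semi_literal.split()}

def adaptation_pipeline(audio: bytes, partial_transcript: str, general_lm: Counter) -> dict:
    """Build topic LM, interpolate, recognize, then adapt -- the flow of the abstract."""
    topic_lm = build_topic_lm(partial_transcript)
    mixed = interpolate_lm(topic_lm, general_lm)
    semi_literal = recognize(audio, mixed)
    return adapt(audio, semi_literal)
```

The key point the sketch preserves is that the partial transcript only shapes the language model; the semi-literal transcript that feeds acoustic adaptation is produced by recognition, not typed by the transcriptionist.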
33 Claims
1. A method for acoustic adaptation comprising the steps of:
collecting at least one audio file associated with a partial transcript of the audio file;
building a topic language model from the partial transcript;
interpolating the topic language model with a general language model;
using a speaker-independent acoustic model and the interpolated language model in a speech recognition engine on the audio file to generate a semi-literal transcript; and
generating a speaker-dependent acoustic model using the semi-literal transcript and the audio file in an acoustic adaptation engine.
(Dependent claims: 2-10)
11. A system for acoustic adaptation comprising:
a voice server for storing at least one audio file, wherein the audio file is stored according to the identity of the speaker;
a text server for storing at least one transcription associated with the at least one audio file;
a speech recognition engine for receiving audio files, acoustic models, and language models, and outputting text files;
an acoustic adaptation engine for receiving audio files and associated text files and outputting acoustic model files; and
a speech recognition server for sending audio files to the speech recognition engine and the acoustic adaptation engine and for sending text files to the acoustic adaptation engine;
wherein the speech recognition server receives an audio file and an associated partial transcript of the audio file, builds a topic language model from the partial transcript, and interpolates the topic language model with a general language model to generate an interpolated language model;
wherein the speech recognition engine uses the interpolated language model and a speaker-independent acoustic model to generate a semi-literal transcript from an audio file; and
wherein the acoustic adaptation engine uses the semi-literal transcript and the audio file to generate a speaker-dependent acoustic model.
(Dependent claims: 12-16)
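The division of labor among the servers in claim 11 can be sketched with in-memory dictionaries standing in for the voice and text servers, and trivial callables standing in for the recognition and adaptation engines. All class and method names here are illustrative, not from the patent.

```python
class VoiceServer:
    """Stores audio files according to the identity of the speaker."""
    def __init__(self):
        self.files = {}  # speaker identity -> list of audio files

    def store(self, speaker: str, audio: bytes):
        self.files.setdefault(speaker, []).append(audio)

class TextServer:
    """Stores transcriptions associated with audio files."""
    def __init__(self):
        self.transcripts = {}  # audio id -> partial transcript

    def store(self, audio_id: str, text: str):
        self.transcripts[audio_id] = text

class SpeechRecognitionServer:
    """Routes audio to the recognition engine, and audio plus the resulting
    semi-literal transcript to the adaptation engine."""
    def __init__(self, recognize, adapt):
        self.recognize = recognize  # (audio, partial_transcript) -> semi-literal text
        self.adapt = adapt          # (audio, semi_literal) -> acoustic model

    def run(self, audio: bytes, partial_transcript: str):
        semi_literal = self.recognize(audio, partial_transcript)
        return self.adapt(audio, semi_literal)
```

The design point the claim makes is that only the speech recognition server touches both engines; the voice and text servers are pure storage.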
17. A system for acoustic adaptation comprising:
means for collecting at least one audio file associated with a partial transcript of the audio file;
means for building a topic language model from the partial transcript;
means for interpolating the topic language model with a general language model;
means for generating a semi-literal transcript using a speaker-independent acoustic model and the interpolated language model; and
means for generating a speaker-dependent acoustic model using the semi-literal transcript and the audio file.
(Dependent claims: 18-23)
24. A method for creating an interpolated language model for speech recognition, the method comprising the steps of:
collecting at least one audio file associated with a partial transcript of that audio file;
filtering out predetermined sections of the partial transcript;
normalizing the text of the partial transcript;
creating a first and a second copy of the partial transcript;
removing punctuation from the first copy of the partial transcript;
adding punctuation as words to the second copy of the partial transcript;
merging the first and second copies of the partial transcript to create a semi-literal transcript, wherein the first and second copies of the partial transcript are selectively weighted according to at least one predetermined probability factor;
building a topic language model from the semi-literal transcript; and
interpolating the topic language model with a general language model to create an interpolated language model.
(Dependent claims: 25-31)
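The two-copy step of claim 24 exists because a dictating speaker may either speak punctuation ("comma", "period") or omit it, so the language model should see both forms. A minimal sketch of that text preparation follows; the punctuation-word mapping and the duplication-based weighting are assumptions for the sketch, not the patent's method.

```python
import string

# Assumed spoken-punctuation tokens (illustrative only).
PUNCT_WORDS = {".": "PERIOD", ",": "COMMA"}

def strip_punct(text: str) -> str:
    """First copy: remove punctuation marks entirely."""
    return text.translate(str.maketrans("", "", string.punctuation))

def punct_as_words(text: str) -> str:
    """Second copy: replace punctuation marks with spoken-word tokens."""
    for mark, word in PUNCT_WORDS.items():
        text = text.replace(mark, " " + word)
    return " ".join(text.split())

def merged_training_text(transcript: str, punct_weight: float = 0.3) -> list:
    """Merge the two copies in a predetermined proportion, realized here by
    duplication counts out of 10 (a stand-in for probability weighting)."""
    no_punct = strip_punct(transcript)
    with_punct = punct_as_words(transcript)
    n_with = max(1, round(punct_weight * 10))
    n_without = max(1, 10 - n_with)
    return [with_punct] * n_with + [no_punct] * n_without
```

A language-model toolkit would then count n-grams over this merged corpus, so the relative weights directly shape the n-gram probabilities.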
32. A method for acoustic adaptation comprising the steps of:
collecting at least one audio file associated with a partial transcript of the audio file;
counting a number of audio files and associated partial transcripts;
filtering out predetermined sections of the partial transcript;
tokenizing the text of the partial transcript;
removing punctuation from a first copy of the partial transcript;
adding punctuation as words to a second copy of the partial transcript;
building a topic language model from the first and second copies of the partial transcript selectively weighted according to a predetermined probability factor, wherein the topic language model comprises trigram word statistics;
interpolating the topic language model with a general language model, wherein the general language model comprises trigram word statistics;
using a speaker-independent acoustic model and the interpolated language model in a speech recognition engine on the audio file to generate a semi-literal transcript; and
generating a speaker-dependent acoustic model using the semi-literal transcript and the audio file in an acoustic adaptation engine, wherein the steps of building, interpolating, using, and generating are performed after a predetermined number of audio files and associated partial transcripts have been counted in the counting step.
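Claim 32 gates the whole adaptation pass behind a counting step: nothing is built or adapted until enough audio files with partial transcripts have accumulated for a speaker. The gating behavior can be sketched as follows; the class name, threshold, and callback are illustrative.

```python
class AdaptationCounter:
    """Collects (audio, partial transcript) pairs and triggers the
    build/interpolate/recognize/adapt steps once a predetermined
    number of files has been counted."""

    def __init__(self, threshold: int, run_adaptation):
        self.threshold = threshold
        self.run_adaptation = run_adaptation  # callable over the collected batch
        self.pending = []                     # (audio, partial_transcript) pairs

    def collect(self, audio: bytes, partial_transcript: str):
        """Count one more audio file; run adaptation only at the threshold."""
        self.pending.append((audio, partial_transcript))
        if len(self.pending) >= self.threshold:
            batch, self.pending = self.pending, []
            return self.run_adaptation(batch)
        return None  # not enough data yet
```

Batching like this is a plausible reason for the counting step: a speaker-dependent acoustic model estimated from a single short file would be unreliable, so adaptation waits for a minimum amount of speaker data.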
33. A system for acoustic adaptation comprising:
a voice server for storing at least one audio file, wherein the audio file is stored according to the identity of the speaker;
a text server for storing at least one transcription associated with the at least one audio file;
a counter for counting a number of audio files for a particular speaker;
a speech recognition engine for receiving audio files, acoustic models, and language models, and outputting text files;
an acoustic adaptation engine for receiving audio files and associated text files and outputting acoustic model files; and
a speech recognition server for sending audio files to the speech recognition engine and the acoustic adaptation engine and for sending text files to the acoustic adaptation engine;
wherein the speech recognition server receives an audio file and an associated partial transcript of the audio file, builds a topic language model comprising trigram word statistics from copies of a punctuation text and a no-punctuation text in a predetermined proportion after the counter has counted a predetermined number of audio files for the particular speaker, and interpolates the topic language model with a general language model comprising trigram word statistics to generate an interpolated language model;
wherein the speech recognition engine uses the interpolated language model and a speaker-independent acoustic model to generate a semi-literal transcript from an audio file; and
wherein the acoustic adaptation engine uses the semi-literal transcript and the audio file to generate a speaker-dependent acoustic model.