Discriminative training of document transcription system
Abstract
A system is provided for training an acoustic model for use in speech recognition. In particular, such a system may be used to perform training based on a spoken audio stream and a non-literal transcript of the spoken audio stream. Such a system may identify text in the non-literal transcript which represents concepts having multiple spoken forms. The system may attempt to identify the actual spoken form in the audio stream which produced the corresponding text in the non-literal transcript, and thereby produce a revised transcript which more accurately represents the spoken audio stream. The revised, and more accurate, transcript may be used to train the acoustic model using discriminative training techniques, thereby producing a better acoustic model than that which would be produced using conventional techniques, which perform training based directly on the original non-literal transcript.
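The idea of text that "represents concepts having multiple spoken forms" can be illustrated with a minimal sketch. It assumes a small hand-written table of spoken forms (the `SPOKEN_FORMS` entries and the function names `normalize` and `expand` are hypothetical, not taken from the patent) and uses a finite list of alternatives, which is only the simplest special case of a context-free grammar.

```python
# Hypothetical table mapping written forms to possible spoken forms.
# A real system would likely derive these from grammars, not a fixed table.
SPOKEN_FORMS = {
    "mg": ["milligrams", "milligram", "m g"],
    "Dr.": ["doctor"],
    "b.i.d.": ["twice a day", "b i d", "twice daily"],
}


def normalize(text):
    """Replace each known written form with a list of its spoken alternatives.

    Plain words are kept as lowercase strings; concepts become lists, which
    act as alternation nodes of a very small grammar.
    """
    tokens = []
    for word in text.split():
        if word in SPOKEN_FORMS:
            tokens.append(list(SPOKEN_FORMS[word]))
        else:
            tokens.append(word.lower())
    return tokens


def expand(tokens):
    """Enumerate every word sequence permitted by the normalized tokens."""
    sequences = [[]]
    for tok in tokens:
        alternatives = tok if isinstance(tok, list) else [tok]
        sequences = [seq + alt.split() for seq in sequences for alt in alternatives]
    return [" ".join(seq) for seq in sequences]
```

Calling `expand(normalize("take 5 mg b.i.d."))` lists the candidate spoken renderings a recognizer could choose between; picking the one that best matches the audio is what would yield the revised transcript described above.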
16 Claims
1. In a system including a first document containing at least some information in common with a spoken audio stream, a method comprising steps of:
(A) identifying text in the first document representing a concept having a plurality of spoken forms;
(B) replacing the identified text with a context-free grammar specifying the plurality of spoken forms of the concept to produce a second document;
(C) generating a first language model based on the second document;
(D) using the first language model in a speech recognition process to recognize the spoken audio stream and thereby to produce a third document;
(E) filtering text from the third document by reference to the second document to produce a filtered document in which text filtered from the third document is marked as unreliable; and
(F) using the filtered document and the spoken audio stream to train an acoustic model by performing steps of:
(F) (1) applying a first speech recognition process to the spoken audio stream using a set of base acoustic models and a grammar network based on the filtered document to produce a first set of recognition structures;
(F) (2) applying a second speech recognition process to the spoken audio stream using the set of base acoustic models and a second language model to produce a second set of recognition structures; and
(F) (3) performing discriminative training of the acoustic model using the first set of recognition structures, the second set of recognition structures, the filtered document, and only those portions of the spoken audio stream corresponding to text not marked as unreliable in the filtered document.
Dependent claims: 2, 3, 4
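Step (E) of claim 1 can be sketched as a word-level alignment. The sketch below is an illustration under stated assumptions, not the patent's filter: it treats both documents as plain word lists (ignoring the embedded grammar), aligns them with Python's standard `difflib.SequenceMatcher`, and marks recognized words with no match in the reference as unreliable. The function name `filter_transcript` is hypothetical.

```python
from difflib import SequenceMatcher


def filter_transcript(recognized, reference):
    """Return (word, reliable) pairs for the recognized word sequence.

    A recognized word is flagged reliable only if the alignment pairs it
    with an identical word in the reference document.
    """
    matcher = SequenceMatcher(a=recognized, b=reference, autojunk=False)
    reliable = [False] * len(recognized)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            reliable[i] = True
    return list(zip(recognized, reliable))


recognized = "the patient um takes five milligrams".split()
reference = "the patient takes five milligrams daily".split()
filtered = filter_transcript(recognized, reference)
```

In this example only the filler word `um` has no counterpart in the reference, so it is the only word marked unreliable; the audio under it would then be excluded from training in step (F)(3).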
5. A system comprising:
a first document containing at least some information in common with a spoken audio stream;
means for identifying text in the first document representing a concept having a plurality of spoken forms;
means for replacing the identified text with a context-free grammar specifying the plurality of spoken forms of the concept to produce a second document;
means for generating a first language model based on the second document;
means for using the first language model in a speech recognition process to recognize the spoken audio stream and thereby to produce a third document;
means for filtering text from the third document by reference to the second document to produce a filtered document in which text filtered from the third document is marked as unreliable; and
means for using the filtered document and the spoken audio stream to train an acoustic model, comprising:
means for applying a first speech recognition process to the spoken audio stream using a set of base acoustic models and a grammar network based on the filtered document to produce a first set of recognition structures;
means for applying a second speech recognition process to the spoken audio stream using the set of base acoustic models and a second language model to produce a second set of recognition structures; and
means for performing discriminative training of the acoustic model using the first set of recognition structures, the second set of recognition structures, the filtered document, and only those portions of the spoken audio stream corresponding to text not marked as unreliable in the filtered document.
Dependent claims: 6, 7, 8
9. A method comprising steps of:
(A) identifying a normalized document containing at least some information in common with a spoken audio stream, the normalized document including a context-free grammar specifying a plurality of spoken forms of a concept;
(B) identifying a language model based on the normalized document;
(C) using the language model in a speech recognition process to recognize the spoken audio stream and thereby to produce a second document;
(D) filtering text from the second document by reference to the normalized document to produce a filtered document in which text filtered from the second document is marked as unreliable; and
(E) using the filtered document and the spoken audio stream to train an acoustic model by performing steps of:
(E) (1) applying a first speech recognition process to the spoken audio stream using a set of base acoustic models and a grammar network based on the filtered document to produce a first set of recognition structures;
(E) (2) applying a second speech recognition process to the spoken audio stream using the set of base acoustic models and a second language model to produce a second set of recognition structures; and
(E) (3) performing discriminative training of the acoustic model using the first set of recognition structures, the second set of recognition structures, the filtered document, and only those portions of the spoken audio stream corresponding to text not marked as unreliable in the filtered document.
Dependent claims: 10, 11, 12
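The restriction in step (E)(3) to "only those portions of the spoken audio stream corresponding to text not marked as unreliable" can be pictured as a span-selection pass. The sketch below assumes each word already carries start and end times (in seconds) from a forced alignment plus the reliability flag from the filtering step; the tuple layout and the name `reliable_spans` are hypothetical.

```python
def reliable_spans(words):
    """Merge consecutive reliable words into (start, end) audio spans.

    `words` is a list of (start_sec, end_sec, reliable) tuples in time order.
    Unreliable words are skipped and also break the current span.
    """
    spans = []
    previous_ok = False
    for start, end, ok in words:
        if ok and previous_ok:
            # Extend the span opened by the preceding reliable word.
            spans[-1] = (spans[-1][0], end)
        elif ok:
            spans.append((start, end))
        previous_ok = ok
    return spans


aligned = [
    (0.0, 0.3, True),
    (0.3, 0.8, True),
    (0.8, 1.0, False),  # e.g. a word marked unreliable by the filter
    (1.0, 1.4, True),
]
training_spans = reliable_spans(aligned)
```

Here the unreliable word between 0.8 s and 1.0 s splits the audio into two training spans, so the discriminative training pass would never see the frames under it.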
13. A system comprising:
means for identifying a normalized document containing at least some information in common with a spoken audio stream, the normalized document including a context-free grammar specifying a plurality of spoken forms of a concept;
means for identifying a language model based on the normalized document;
means for using the language model in a speech recognition process to recognize the spoken audio stream and thereby to produce a second document;
means for filtering text from the second document by reference to the normalized document to produce a filtered document; and
means for using the filtered document and the spoken audio stream to train an acoustic model, comprising:
means for applying a first speech recognition process to the spoken audio stream using a set of base acoustic models and a grammar network based on the filtered document to produce a first set of recognition structures;
means for applying a second speech recognition process to the spoken audio stream using the set of base acoustic models and a second language model to produce a second set of recognition structures; and
means for performing discriminative training of the acoustic model using the first set of recognition structures, the second set of recognition structures, the filtered document, and only those portions of the spoken audio stream corresponding to text not marked as unreliable in the filtered document.
Dependent claims: 14, 15, 16
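The claims do not name a specific discriminative criterion. As one common possibility, and purely as an assumption for illustration, the maximum mutual information (MMI) objective shows how two sets of recognition structures could enter the training: the first set (constrained by the filtered document) supplies the numerator, and the second set (from the more general second language model) supplies the competing hypotheses in the denominator.

```latex
\mathcal{F}_{\mathrm{MMI}}(\lambda)
  = \sum_{r=1}^{R}
    \log \frac{p_{\lambda}(O_r \mid W_r)\, P(W_r)}
              {\sum_{W} p_{\lambda}(O_r \mid W)\, P(W)}
```

Here $\lambda$ denotes the acoustic model parameters, $O_r$ the $r$-th audio segment whose text is not marked as unreliable, $W_r$ the corresponding word sequence from the filtered document, $P(\cdot)$ the language model probability, and the sum in the denominator runs over the hypotheses $W$ in the second set of recognition structures. All of these symbols are introduced here for illustration and do not appear in the claims.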
Specification