Discriminative training of document transcription system

US 8,412,521 B2
Filed: 09/16/2005
Issued: 04/02/2013
Est. Priority Date: 08/20/2004
Status: Active Grant

14 Assignments

0 Petitions

Abstract

A system is provided for training an acoustic model for use in speech recognition. In particular, such a system may be used to perform training based on a spoken audio stream and a non-literal transcript of the spoken audio stream. Such a system may identify text in the non-literal transcript which represents concepts having multiple spoken forms. The system may attempt to identify the actual spoken form in the audio stream which produced the corresponding text in the non-literal transcript, and thereby produce a revised transcript which more accurately represents the spoken audio stream. The revised, and more accurate, transcript may be used to train the acoustic model using discriminative training techniques, thereby producing a better acoustic model than that which would be produced using conventional techniques, which perform training based directly on the original non-literal transcript.
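
The abstract's idea of recovering which spoken form actually produced a piece of transcript text can be illustrated with a minimal sketch. It assumes a hypothetical recognizer output given as plain text and replaces the grammar-constrained recognition described in the claims with simple word-sequence matching; the function name `pick_spoken_form` and the example date forms are illustrative assumptions, not taken from the patent.

```python
from typing import List, Optional


def pick_spoken_form(candidates: List[str], recognized: str) -> Optional[str]:
    """Return the longest candidate spoken form found in the recognized text.

    Hypothetical helper: a real system would let the recognizer choose among
    grammar alternatives; here we only match word sequences in its output.
    """
    rec_words = recognized.lower().split()
    best: Optional[str] = None
    for cand in candidates:
        cand_words = cand.lower().split()
        n = len(cand_words)
        # Slide a window of n words over the recognized text.
        found = any(
            rec_words[i:i + n] == cand_words
            for i in range(len(rec_words) - n + 1)
        )
        if found and (best is None or n > len(best.split())):
            best = cand
    return best


# Assumed spoken forms for the transcript text "10/01/1993".
candidates = [
    "october first nineteen ninety three",
    "ten one ninety three",
    "the first of october ninety three",
]
recognized = "the patient was seen on ten one ninety three for follow up"
spoken = pick_spoken_form(candidates, recognized)
```

In this sketch the revised transcript would use the value of `spoken` in place of the written date, which is closer to what was actually said than the original non-literal text.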

74 Citations

18 Claims

1. In a computer system including a first document tangibly stored in a first computer-readable medium and containing at least some information in common with a spoken audio stream, a method performed by at least one computer processor executing computer program instructions tangibly stored in a second computer-readable medium, the method comprising steps of:
- (A) identifying first text tangibly stored in the first document on a third computer-readable medium, wherein the first text represents a first instance of a concept;
  
  (B) identifying, based on the identified first text, a first plurality of at least three spoken forms of the first instance of the concept, including at least one spoken form not contained in the first document;
  
  (C) replacing the identified first text with a first context-free grammar specifying the first plurality of spoken forms of the first instance of the concept to produce a second document tangibly stored in a fourth computer-readable medium;
  
  (D) identifying second text tangibly stored in the first document on the third computer-readable medium, wherein the second text represents a second instance of the concept;
  
  (E) identifying, based on the identified second text, a second plurality of at least three spoken forms of the second instance of the concept, wherein the first plurality of spoken forms differs from the second plurality of spoken forms;
  
  (F) replacing the identified second text with a second context-free grammar specifying the second plurality of spoken forms of the second instance of the concept within the second document;
  
  (G) generating a first language model, tangibly stored in a fifth computer-readable medium, based on the second document;
  
  (H) using the first language model in a speech recognition process to recognize the spoken audio stream and thereby to produce a third document tangibly stored in a sixth computer-readable medium;
  
  (I) filtering text from the third document by reference to the second document to produce a filtered document, tangibly stored in a seventh computer-readable medium, in which text filtered from the third document is marked as unreliable; and
  
  (J) using the filtered document and the spoken audio stream to train an acoustic model, tangibly stored in an eighth computer-readable medium, by performing steps of:
  
  (J)(1) applying a first speech recognition process to the spoken audio stream using a set of base acoustic models and a grammar network based on the filtered document to produce a first set of recognition structures tangibly stored in a ninth computer-readable medium;
  
  (J)(2) applying a second speech recognition process to the spoken audio stream using the set of base acoustic models and a second language model to produce a second set of recognition structures tangibly stored in a tenth computer-readable medium; and
  
  (J)(3) performing discriminative training of the acoustic model using the first set of recognition structures, the second set of recognition structures, the filtered document, and only those portions of the spoken audio stream corresponding to text not marked as unreliable in the filtered document.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The method of claim 1, wherein (J) further comprises a step of:
    - (J)(4) prior to (J)(1), training the set of base acoustic models using the spoken audio stream and the filtered document.
  - 3. The method of claim 2, wherein step (J)(4) comprises training the set of base acoustic models using maximum likelihood optimization training.
  - 4. The method of claim 1, wherein the discriminative training comprises maximum mutual information estimation training, wherein the first set of recognition structures comprises a “correct” lattice, and wherein the second set of recognition structures comprises a “general” lattice.
  - 5. The method of claim 1, wherein (F) comprises:
    - (F)(1) identifying a generic context-free grammar having at least one parameter;
      
      (F)(2) specifying the first instance of the concept as a value of the at least one parameter to produce a parameterized context-free grammar that specifies alternative spoken forms of the first instance of the concept; and
      
      (F)(3) replacing the identified first text with the parameterized context-free grammar.
  - 6. The method of claim 5, wherein the at least one parameter comprises a plurality of parameters, and wherein (F)(2) comprises specifying the first instance of the concept as a plurality of values of the plurality of parameters to produce the parameterized context-free grammar.
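
Claims 5 and 6 describe binding one instance of a concept to the parameters of a generic context-free grammar. A minimal sketch of that idea follows, assuming a date concept; the grammar layout, the nonterminal names, and the helpers `parameterize` and `expand` are hypothetical and chosen only for illustration.

```python
from itertools import product
from typing import Dict, List

# A grammar maps a nonterminal to its alternative right-hand sides.
Grammar = Dict[str, List[List[str]]]

# Hypothetical generic date grammar. The nonterminals <month_name>,
# <month_num>, <day_ord>, <day_num> and <year> are its parameters:
# they have no productions until an instance is bound to them.
GENERIC_DATE: Grammar = {
    "<date>": [
        ["<month_name>", "<day_ord>", "<year>"],
        ["<month_num>", "<day_num>", "<year>"],
        ["the", "<day_ord>", "of", "<month_name>", "<year>"],
    ],
}


def parameterize(generic: Grammar, values: Dict[str, List[str]]) -> Grammar:
    """Bind each parameter to the spoken alternatives of one instance."""
    grammar = {lhs: [list(rhs) for rhs in alts] for lhs, alts in generic.items()}
    for name, alternatives in values.items():
        grammar[name] = [alt.split() for alt in alternatives]
    return grammar


def expand(grammar: Grammar, symbol: str) -> List[str]:
    """List every word string derivable from `symbol` (no recursion assumed)."""
    if symbol not in grammar:  # a terminal word
        return [symbol]
    forms: List[str] = []
    for rhs in grammar[symbol]:
        parts = [expand(grammar, s) for s in rhs]
        forms.extend(" ".join(combo) for combo in product(*parts))
    return forms


# Assumed parameter values for the written text "10/01/1993".
instance = {
    "<month_name>": ["october"],
    "<month_num>": ["ten"],
    "<day_ord>": ["first"],
    "<day_num>": ["one"],
    "<year>": ["nineteen ninety three", "ninety three"],
}
forms = expand(parameterize(GENERIC_DATE, instance), "<date>")
```

Binding a different date to the same generic grammar yields a different set of spoken forms, which matches the claim language requiring that the two pluralities of spoken forms differ.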

7. A computer-implemented system comprising:
- a first document, tangibly stored in a first non-transitory computer-readable medium, containing at least some information in common with a spoken audio stream; and
  
  computer-readable program instructions tangibly stored in a second non-transitory computer-readable medium, said computer-readable program instructions adapted to be executed by a computer processor to perform a method comprising:
  
  identifying first text tangibly stored in the first document on a third non-transitory computer-readable medium, wherein the first text represents a first instance of a concept;
  
  identifying, based on the identified first text, a first plurality of at least three spoken forms of the first instance of the concept, including at least one spoken form not contained in the first document;
  
  replacing the identified first text with a first context-free grammar specifying the first plurality of spoken forms of the first instance of the concept to produce a second document tangibly stored in a fourth non-transitory computer-readable medium;
  
  identifying second text tangibly stored in the first document on the third non-transitory computer-readable medium, wherein the second text represents a second instance of the concept;
  
  identifying, based on the identified second text, a second plurality of at least three spoken forms of the second instance of the concept, wherein the first plurality of spoken forms differs from the second plurality of spoken forms;
  
  replacing the identified second text with a second context-free grammar specifying the second plurality of spoken forms of the second instance of the concept within the second document;
  
  generating a first language model, tangibly stored in a fifth non-transitory computer-readable medium, based on the second document;
  
  using the first language model in a speech recognition process to recognize the spoken audio stream and thereby to produce a third document tangibly stored in a sixth non-transitory computer-readable medium;
  
  filtering text from the third document by reference to the second document to produce a filtered document, tangibly stored in a seventh non-transitory computer-readable medium, in which text filtered from the third document is marked as unreliable; and
  
  using the filtered document and the spoken audio stream to train an acoustic model, tangibly stored in an eighth non-transitory computer-readable medium, by:
  
  applying a first speech recognition process to the spoken audio stream using a set of base acoustic models and a grammar network based on the filtered document to produce a first set of recognition structures tangibly stored in a ninth non-transitory computer-readable medium;
  
  applying a second speech recognition process to the spoken audio stream using the set of base acoustic models and a second language model to produce a second set of recognition structures tangibly stored in a tenth non-transitory computer-readable medium; and
  
  performing discriminative training of the acoustic model using the first set of recognition structures, the second set of recognition structures, the filtered document, and only those portions of the spoken audio stream corresponding to text not marked as unreliable in the filtered document.
- Dependent claims (8, 9, 10):
  - 8. The system of claim 7, wherein the discriminative training comprises maximum mutual information estimation training, wherein the first set of recognition structures comprises a “correct” lattice, and wherein the second set of recognition structures comprises a “general” lattice.
  - 9. The system of claim 7, wherein the computer program instructions further comprise instructions for:
    - training the set of base acoustic models using the spoken audio stream and the filtered document.
  - 10. The system of claim 9, wherein the instructions for training the set of base acoustic models comprise instructions for training the set of base acoustic models using maximum likelihood optimization training.
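
The filtering element of claim 7, in which recognized text is compared against the grammar-bearing document and unmatched text is marked as unreliable, might be sketched as below. This is an assumption-laden simplification: the reference is treated as a flat word list (as if each grammar had already been resolved to one spoken form), alignment uses Python's standard `difflib.SequenceMatcher`, and the function name `filter_recognized` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def filter_recognized(
    recognized: List[str], reference: List[str]
) -> List[Tuple[str, bool]]:
    """Pair each recognized word with True (reliable) or False (unreliable).

    A word counts as reliable only if it lies inside a block that aligns
    with the reference word sequence.
    """
    matcher = SequenceMatcher(a=recognized, b=reference, autojunk=False)
    reliable = [False] * len(recognized)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            reliable[i] = True
    return list(zip(recognized, reliable))


recognized = "the patient um was seen on ten one ninety three".split()
reference = "the patient was seen on ten one ninety three".split()
filtered = filter_recognized(recognized, reference)
unreliable_words = [word for word, ok in filtered if not ok]
```

Here the filler word that has no counterpart in the reference is the only word flagged, so it can later be excluded from acoustic model training.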

11. A method performed by at least one computer processor executing computer program instructions tangibly stored in a first computer-readable medium, the method comprising steps of:
- (A) identifying a normalized document tangibly stored in a second computer-readable medium and containing at least some information in common with a spoken audio stream, the normalized document including:
  
  (1) a first context-free grammar specifying a first plurality of at least three spoken forms of a first instance of a concept; and
  
  (2) a second context-free grammar specifying a second plurality of at least three spoken forms of a second instance of the concept, wherein the first plurality of spoken forms differs from the second plurality of spoken forms;
  
  (B) identifying a language model, tangibly stored in a third computer-readable medium, based on the normalized document;
  
  (C) using the language model in a speech recognition process to recognize the spoken audio stream and thereby to produce a second document tangibly stored in a fourth computer-readable medium;
  
  (D) filtering text from the second document by reference to the normalized document to produce a filtered document tangibly stored in a fifth computer-readable medium and in which text filtered from the second document is marked as unreliable; and
  
  (E) using the filtered document and the spoken audio stream to train an acoustic model, tangibly stored in a sixth computer-readable medium, by performing steps of:
  
  (E)(1) applying a first speech recognition process to the spoken audio stream using a set of base acoustic models and a grammar network based on the filtered document to produce a first set of recognition structures, tangibly stored in a seventh computer-readable medium;
  
  (E)(2) applying a second speech recognition process to the spoken audio stream using the set of base acoustic models and a second language model to produce a second set of recognition structures tangibly stored in an eighth computer-readable medium; and
  
  (E)(3) performing discriminative training of the acoustic model using the first set of recognition structures, the second set of recognition structures, the filtered document, and only those portions of the spoken audio stream corresponding to text not marked as unreliable in the filtered document.
- Dependent claims (12, 13, 14):
  - 12. The method of claim 11, wherein the discriminative training comprises maximum mutual information estimation training, wherein the first set of recognition structures comprises a “correct” lattice, and wherein the second set of recognition structures comprises a “general” lattice.
  - 13. The method of claim 11, further comprising a step of:
    - (E)(4) prior to (E)(1), training the set of base acoustic models using the spoken audio stream and the filtered document.
  - 14. The method of claim 13, wherein step (E)(4) comprises training the set of base acoustic models using maximum likelihood optimization training.
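
For readers unfamiliar with the maximum mutual information estimation (MMIE) training named in claim 12, the criterion is commonly written in the textbook form below. This formula is not quoted from the patent, and all symbols are assumptions introduced here: $\lambda$ denotes the acoustic model parameters, $O_r$ the acoustic observations of training utterance $r$, $w_r$ its reference word sequence, $M_w$ the composite acoustic model for word sequence $w$, and $P(w)$ the language model probability.

```latex
\mathcal{F}_{\mathrm{MMIE}}(\lambda)
  = \sum_{r=1}^{R} \log
    \frac{p_{\lambda}(O_r \mid M_{w_r})\, P(w_r)}
         {\sum_{\hat{w}} p_{\lambda}(O_r \mid M_{\hat{w}})\, P(\hat{w})}
```

Under this reading, the numerator statistics would typically be gathered from the “correct” lattice and the denominator statistics from the “general” lattice, which approximates the sum over all competing word sequences $\hat{w}$.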

15. A computer program product comprising computer-readable computer program instructions, tangibly stored in a first non-transitory computer-readable medium, said computer-readable program instructions adapted to be executed by a computer processor to perform a method comprising:
- identifying a normalized document, tangibly stored in a second non-transitory computer-readable medium, containing at least some information in common with a spoken audio stream, the normalized document including:
  
  (1) a first context-free grammar specifying a first plurality of at least three spoken forms of a first instance of a concept; and
  
  (2) a second context-free grammar specifying a second plurality of at least three spoken forms of a second instance of the concept, wherein the first plurality of spoken forms differs from the second plurality of spoken forms;
  
  identifying a language model, tangibly stored in a third non-transitory computer-readable medium, based on the normalized document;
  
  using the language model in a speech recognition process to recognize the spoken audio stream and thereby to produce a second document tangibly stored in a fourth non-transitory computer-readable medium;
  
  filtering text from the second document by reference to the normalized document to produce a filtered document tangibly stored in a fifth non-transitory computer-readable medium and in which text filtered from the second document is marked as unreliable; and
  
  using the filtered document and the spoken audio stream to train an acoustic model, tangibly stored in a sixth non-transitory computer-readable medium, comprising:
  
  applying a first speech recognition process to the spoken audio stream using a set of base acoustic models and a grammar network based on the filtered document to produce a first set of recognition structures, tangibly stored in a seventh non-transitory computer-readable medium;
  
  applying a second speech recognition process to the spoken audio stream using the set of base acoustic models and a second language model to produce a second set of recognition structures tangibly stored in an eighth non-transitory computer-readable medium; and
  
  performing discriminative training of the acoustic model using the first set of recognition structures, the second set of recognition structures, the filtered document, and only those portions of the spoken audio stream corresponding to text not marked as unreliable in the filtered document.
- Dependent claims (16, 17, 18):
  - 16. The computer program product of claim 15, wherein the discriminative training comprises maximum mutual information estimation training, wherein the first set of recognition structures comprises a “correct” lattice, and wherein the second set of recognition structures comprises a “general” lattice.
  - 17. The computer program product of claim 15, wherein the instructions further comprise instructions for:
    - training the set of base acoustic models using the spoken audio stream and the filtered document.
  - 18. The computer program product of claim 17, wherein the instructions for training the set of base acoustic models comprise instructions for training the set of base acoustic models using maximum likelihood optimization training.
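
The final training element of claim 15 uses only those portions of the audio that correspond to text not marked as unreliable. One way to picture that selection is the sketch below, which assumes each word of the filtered document carries start and end times in seconds; the `TimedWord` type and `reliable_segments` function are hypothetical names, and the timings are invented for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class TimedWord(NamedTuple):
    text: str
    start: float      # seconds from the beginning of the audio stream
    end: float
    unreliable: bool  # True if the filtering step flagged this word


def reliable_segments(words: List[TimedWord]) -> List[Tuple[float, float]]:
    """Merge runs of consecutive reliable words into (start, end) spans."""
    segments: List[Tuple[float, float]] = []
    current: Optional[Tuple[float, float]] = None
    for word in words:
        if word.unreliable:
            # An unreliable word closes the span that was being built.
            if current is not None:
                segments.append(current)
                current = None
            continue
        if current is None:
            current = (word.start, word.end)
        else:
            current = (current[0], word.end)
    if current is not None:
        segments.append(current)
    return segments


words = [
    TimedWord("the", 0.0, 0.2, False),
    TimedWord("patient", 0.2, 0.7, False),
    TimedWord("um", 0.7, 0.9, True),
    TimedWord("was", 0.9, 1.1, False),
    TimedWord("seen", 1.1, 1.5, False),
]
segments = reliable_segments(words)
```

Only the audio inside the returned spans would then be passed to the discriminative training step, so the flagged word contributes nothing to the model update.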

Current Assignee: Solventum Intellectual Properties Company (Solventum Corp.)
Original Assignee: Multimodal Technologies Incorporated (3M Company)
Inventors: Mathias, Lambert; Yegnanarayanan, Girija; Fritsch, Juergen
Primary Examiner(s): Armstrong, Angela A
Application Number: US11/228,607
Publication Number: US 20060074656A1
Time in Patent Office: 2,755 Days
Field of Search: 704/235, 704/251, 704/257
US Class Current: 704/235
CPC Class Codes:
G06F 40/211   Syntactic parsing, e.g. bas...
G06F 40/289   Phrasal analysis, e.g. fini...
G06F 40/40   Processing or translation o...
G10L 15/02   Feature extraction for spee...
G10L 15/063   Training
G10L 15/183   using context dependencies,...
G10L 15/193   Formal grammars, e.g. finit...
G10L 15/26   Speech to text systems G10L...
G10L 2015/0631   Creating reference template...
G10L 2015/0633   using lexical or orthograph...
G16H 15/00   ICT specially adapted for m...
