Discriminative training of document transcription system

US 8,694,312 B2
Filed: 02/22/2013
Issued: 04/08/2014
Est. Priority Date: 08/20/2004
Status: Active Grant

Abstract

A system is provided for training an acoustic model for use in speech recognition. In particular, such a system may be used to perform training based on a spoken audio stream and a non-literal transcript of the spoken audio stream. Such a system may identify text in the non-literal transcript which represents concepts having multiple spoken forms. The system may attempt to identify the actual spoken form in the audio stream which produced the corresponding text in the non-literal transcript, and thereby produce a revised transcript which more accurately represents the spoken audio stream. The revised, and more accurate, transcript may be used to train the acoustic model using discriminative training techniques, thereby producing a better acoustic model than that which would be produced using conventional techniques, which perform training based directly on the original non-literal transcript.
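The idea of text that "represents concepts having multiple spoken forms" can be illustrated with a minimal sketch. It assumes a toy repository that pairs a concept pattern with a list of spoken forms; the names `GRAMMAR_REPOSITORY` and `to_grammar_document`, and the example concepts, are hypothetical and do not come from the patent. Real finite state grammars would be far richer than flat lists of alternatives.

```python
import re

# Hypothetical stand-in for a repository of finite state grammars: each entry
# pairs a pattern that detects a concept with spoken forms the concept may take.
GRAMMAR_REPOSITORY = [
    (re.compile(r"\bBP\b"), ["blood pressure", "b p"]),
    (re.compile(r"\b5 mg\b"), ["five milligrams", "five m g"]),
]


def to_grammar_document(text):
    """Return a token list in which each recognized concept is replaced by
    a list of alternative spoken forms; other words are kept as strings.

    Assumes concept matches do not overlap."""
    matches = []
    for pattern, forms in GRAMMAR_REPOSITORY:
        for m in pattern.finditer(text):
            matches.append((m.start(), m.end(), forms))
    matches.sort(key=lambda item: item[0])

    tokens = []
    pos = 0
    for start, end, forms in matches:
        tokens.extend(text[pos:start].split())  # plain words before the concept
        tokens.append(list(forms))              # the concept as alternatives
        pos = end
    tokens.extend(text[pos:].split())           # plain words after the last concept
    return tokens


second_document = to_grammar_document("BP stable on 5 mg daily")
```

Here the nested lists play the role of the grammar slots: a recognizer constrained by such a structure could pick whichever alternative best matches the audio.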

8 Claims

1. A method for use with a system including a first document containing at least some information in common with a spoken audio stream, the method performed by at least one computer processor executing computer program instructions to perform steps of:
  (A) identifying text in the first document, wherein the text represents a concept;
  
  (B) identifying, based on the identified text and a repository of finite state grammars, a plurality of spoken forms of the concept, including at least one spoken form not contained in the first document, wherein all of the plurality of spoken forms have the same content as each other;
  
  (C) replacing the identified text with a finite state grammar specifying the plurality of spoken forms of the concept to produce a second document;
  
  (D) generating a document-specific language model based on the second document, comprising generating at least some of the document-specific language model based on the finite state grammar;
  
  (E) using the document-specific language model in a speech recognition process to recognize the spoken audio stream and thereby to produce a third document;
  
  (F) filtering text from the third document by reference to the second document to produce a filtered document in which text filtered from the third document is marked as unreliable; and
  
  (G) using the filtered document and the spoken audio stream to train an acoustic model by performing steps of:
  
  (G)(1) applying a first speech recognition process to the spoken audio stream using a set of base acoustic models and a grammar network based on the filtered document to produce a first set of recognition structures;
  
  (G)(2) applying a second speech recognition process to the spoken audio stream using the set of base acoustic models and a second language model to produce a second set of recognition structures; and
  
  (G)(3) performing discriminative training of the acoustic model using the first set of recognition structures, the second set of recognition structures, the filtered document, and only those portions of the spoken audio stream corresponding to text not marked as unreliable in the filtered document.
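Step (F) can be sketched as a word-level alignment, under stated assumptions. The claim does not say how the filtering is done; the sketch below swaps in a generic sequence alignment from Python's standard `difflib` module and, for simplicity, treats the second document as a flat word list rather than a document containing grammars. The function name `filter_transcript` is hypothetical.

```python
from difflib import SequenceMatcher


def filter_transcript(recognized, reference):
    """Label each recognized word 'reliable' if it falls inside a block that
    aligns with the reference word sequence, otherwise 'unreliable'.

    Illustrative only: a generic alignment, not the patent's own procedure."""
    matcher = SequenceMatcher(a=recognized, b=reference, autojunk=False)
    reliable = [False] * len(recognized)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            reliable[i] = True
    return [
        (word, "reliable" if ok else "unreliable")
        for word, ok in zip(recognized, reliable)
    ]


recognized = "the patient takes um five milligrams daily".split()
reference = "the patient takes five milligrams daily".split()
filtered = filter_transcript(recognized, reference)
```

In this example only the filler word "um", which has no counterpart in the reference, ends up marked as unreliable.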

2. The method of claim 1, wherein (G) further comprises a step of:
  (G)(4) prior to (G)(1), training the set of base acoustic models using the spoken audio stream and the filtered document.

3. The method of claim 2, wherein step (G)(4) comprises training the set of base acoustic models using maximum likelihood optimization training.

4. The method of claim 1, wherein the discriminative training comprises maximum mutual information estimation training, wherein the first set of recognition structures comprises a “correct” lattice, and wherein the second set of recognition structures comprises a “general” lattice.
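For orientation, the maximum mutual information criterion is commonly written in the following textbook form. The notation is an assumption and does not appear in the claims: $O_r$ is the $r$-th training utterance, $w_r$ its reference word sequence, $M_w$ the acoustic model composed for word sequence $w$, $P(w)$ the language model probability, and $\lambda$ the acoustic model parameters.

```latex
\mathcal{F}_{\mathrm{MMI}}(\lambda)
  = \sum_{r=1}^{R} \log
    \frac{p_\lambda(O_r \mid M_{w_r})\, P(w_r)}
         {\sum_{w} p_\lambda(O_r \mid M_{w})\, P(w)}
```

Under this reading, the numerator is accumulated over the "correct" lattice, and the sum in the denominator, which cannot be taken over all word sequences in practice, is approximated by the paths of the "general" lattice.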

5. A non-transitory computer-readable medium comprising computer program instructions executable by at least one computer processor to perform a method for use with a system, the system including a first document containing at least some information in common with a spoken audio stream, the method comprising:
  (A) identifying text in the first document, wherein the text represents a concept;
  
  (B) identifying, based on the identified text and a repository of finite state grammars, a plurality of spoken forms of the concept, including at least one spoken form not contained in the first document, wherein all of the plurality of spoken forms have the same content as each other;
  
  (C) replacing the identified text with a finite state grammar specifying the plurality of spoken forms of the concept to produce a second document;
  
  (D) generating a document-specific language model based on the second document, comprising generating at least some of the document-specific language model based on the finite state grammar;
  
  (E) using the document-specific language model in a speech recognition process to recognize the spoken audio stream and thereby to produce a third document;
  
  (F) filtering text from the third document by reference to the second document to produce a filtered document in which text filtered from the third document is marked as unreliable; and
  
  (G) using the filtered document and the spoken audio stream to train an acoustic model by performing steps of:
  
  (G)(1) applying a first speech recognition process to the spoken audio stream using a set of base acoustic models and a grammar network based on the filtered document to produce a first set of recognition structures;
  
  (G)(2) applying a second speech recognition process to the spoken audio stream using the set of base acoustic models and a second language model to produce a second set of recognition structures; and
  
  (G)(3) means for performing discriminative training of the acoustic model using the first set of recognition structures, the second set of recognition structures, the filtered document, and only those portions of the spoken audio stream corresponding to text not marked as unreliable in the filtered document.
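The restriction in (G)(3) to "only those portions of the spoken audio stream corresponding to text not marked as unreliable" can be sketched as a selection of time segments. The sketch assumes each recognized word carries start and end times in seconds plus a reliability flag; that data layout and the function name `reliable_segments` are hypothetical, not taken from the patent.

```python
def reliable_segments(words, tolerance=1e-9):
    """Merge time-contiguous reliable words into (start, end) segments.

    `words` is a time-ordered list of (start_sec, end_sec, is_reliable).
    Unreliable words are skipped, so no segment spans them."""
    segments = []
    for start, end, is_reliable in words:
        if not is_reliable:
            continue
        if segments and abs(segments[-1][1] - start) <= tolerance:
            # This word starts where the previous reliable word ended: extend.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments


timed_words = [
    (0.0, 0.5, True),
    (0.5, 0.9, True),
    (0.9, 1.4, False),  # marked as unreliable, excluded from training
    (1.4, 2.0, True),
]
training_segments = reliable_segments(timed_words)
```

Only the audio inside the returned segments would then be passed to the training stage; the span from 0.9 s to 1.4 s is left out.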

6. The non-transitory computer-readable medium of claim 5, wherein (G) further comprises a step of:
  (G)(4) prior to (G)(1), training the set of base acoustic models using the spoken audio stream and the filtered document.

7. The non-transitory computer-readable medium of claim 6, wherein the step (G)(4) comprises training the set of base acoustic models using maximum likelihood optimization training.

8. The non-transitory computer-readable medium of claim 5, wherein the discriminative training comprises maximum mutual information estimation training, wherein the first set of recognition structures comprises a “correct” lattice, and wherein the second set of recognition structures comprises a “general” lattice.

Current Assignee: Solventum Intellectual Properties Company (Solventum Corp.)
Original Assignee: MModal IP LLC (3M Company)
Inventors: Mathias, Lambert; Yegnanarayanan, Girija; Fritsch, Juergen
Primary Examiner(s): Armstrong, Angela A
Application Number: US13/773,928
Publication Number: US 20130166297A1
Time in Patent Office: 410 Days
Field of Search: 704/235, 704/243, 704/251, 704/257
US Class Current: 704/243

CPC Class Codes
G06F 40/211   Syntactic parsing, e.g. bas...
G06F 40/289   Phrasal analysis, e.g. fini...
G06F 40/40   Processing or translation o...
G10L 15/02   Feature extraction for spee...
G10L 15/063   Training
G10L 15/183   using context dependencies,...
G10L 15/193   Formal grammars, e.g. finit...
G10L 15/26   Speech to text systems G10L...
G10L 2015/0631   Creating reference template...
G10L 2015/0633   using lexical or orthograph...
G16H 15/00   ICT specially adapted for m...
