Document transcription system training

US 20060041427A1
Filed: 08/20/2004
Published: 02/23/2006
Est. Priority Date: 08/20/2004
Status: Active Grant

Abstract

A system is provided for training an acoustic model for use in speech recognition. In particular, such a system may be used to perform training based on a spoken audio stream and a non-literal transcript of the spoken audio stream. Such a system may identify text in the non-literal transcript which represents concepts having multiple spoken forms. The system may attempt to identify the actual spoken form in the audio stream which produced the corresponding text in the non-literal transcript, and thereby produce a revised transcript which more accurately represents the spoken audio stream. The revised, and more accurate, transcript may be used to train the acoustic model, thereby producing a better acoustic model than that which would be produced using conventional techniques, which perform training based directly on the original non-literal transcript.
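
As a rough illustration of the idea of text "which represents concepts having multiple spoken forms", the following minimal Python sketch finds month/day dates in a transcript and replaces each with a list of alternative spoken forms. It is not the patented implementation: the `M/D` date pattern, the three spoken forms, and the names `ordinal`, `date_spoken_forms`, and `replace_concepts` are all assumptions made for this example, and years are deliberately ignored.

```python
import re

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]
ORDINALS = ["first", "second", "third", "fourth", "fifth", "sixth", "seventh",
            "eighth", "ninth", "tenth", "eleventh", "twelfth", "thirteenth",
            "fourteenth", "fifteenth", "sixteenth", "seventeenth",
            "eighteenth", "nineteenth"]

# Hypothetical concept detector: a month/day date such as "10/1" (no year).
DATE = re.compile(r"\b(1[0-2]|0?[1-9])/(3[01]|[12][0-9]|0?[1-9])\b")


def ordinal(day):
    """Spoken ordinal for a day of the month (1..31)."""
    if day < 20:
        return ORDINALS[day - 1]
    tens, ones = divmod(day, 10)
    stem = "twent" if tens == 2 else "thirt"
    return stem + "ieth" if ones == 0 else f"{stem}y {ORDINALS[ones - 1]}"


def date_spoken_forms(month, day):
    """A few assumed ways a speaker might say the date."""
    name, nth = MONTHS[month - 1], ordinal(day)
    return [f"{name} {nth}", f"{name} the {nth}", f"the {nth} of {name}"]


def replace_concepts(text):
    """Return a token list in which each date is a list of spoken forms."""
    out, pos = [], 0
    for match in DATE.finditer(text):
        out.extend(text[pos:match.start()].split())
        out.append(date_spoken_forms(int(match.group(1)), int(match.group(2))))
        pos = match.end()
    out.extend(text[pos:].split())
    return out


second_document = replace_concepts("seen on 10/1 for follow up")
# ['seen', 'on', ['october first', 'october the first',
#  'the first of october'], 'for', 'follow', 'up']
```

A plain list of alternatives is the simplest possible stand-in for a grammar; a real system would presumably use a richer grammar formalism.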

83 Claims

1. In a system including a first document containing at least some information in common with a spoken audio stream, a method comprising steps of:
- (A) identifying text in the first document representing a concept having a plurality of spoken forms;
  
  (B) replacing the identified text with a context-free grammar specifying the plurality of spoken forms of the concept to produce a second document; and
  
  (C) generating a first language model based on the second document.
  - 2. The method of claim 1, wherein the concept comprises a semantic concept.
  - 3. The method of claim 2, wherein the concept comprises a date.
  - 4. The method of claim 1, wherein the concept comprises a syntactic concept.
  - 5. The method of claim 4, wherein the concept comprises a sentence.
  - 6. The method of claim 4, wherein the concept comprises the entire second document.
  - 7. The method of claim 1, further comprising a step of:
    - (D) using the first language model in a speech recognition process to recognize the spoken audio stream and thereby to produce a third document.
  - 8. The method of claim 7, further comprising a step of:
    - (E) using the third document and the spoken audio stream to train an acoustic model.
  - 9. The method of claim 8, wherein the step (E) comprises steps of:
    - (E)(1) filtering text from the third document by reference to the second document to produce a filtered document; and
      
      (E)(2) using the filtered document and the spoken audio stream to train the acoustic model.
  - 10. The method of claim 9, wherein the step (E)(1) comprises applying a robust parser to the second and third documents to produce the filtered document.
  - 11. The method of claim 7, wherein the step (D) comprises steps of:
    - (D)(1) interpolating the first language model with a second language model to produce a third language model; and
      
      (D)(2) using the third language model in the speech recognition process to recognize the spoken audio stream and thereby to produce the third document.
  - 12. The method of claim 1, wherein the first document comprises a document generated based on the spoken audio stream.
  - 13. The method of claim 1, further comprising a step of:
    - (D) prior to step (C), normalizing the second document.
  - 14. The method of claim 1, further comprising a step of:
    - (D) prior to step (C), repeating steps (A) and (B) for each of a plurality of texts in the first document.
  - 15. The method of claim 1, wherein the step (B) comprises steps of:
    - (B)(1) generating probabilities for the plurality of spoken forms specified by the context-free grammar; and
      
      (B)(2) including the probabilities in the context-free grammar.
  - 16. The method of claim 15, wherein the step (B) further comprises a step of:
    - (B)(3) including the plurality of spoken forms in the context-free grammar.
  - 17. The method of claim 1, wherein the context-free grammar comprises a finite state grammar.
  - 18. The method of claim 1, wherein the first document comprises one of a first plurality of documents, and wherein the method further comprises a step of:
    - (D) repeating steps (A) and (B) for each of the plurality of documents to produce a second plurality of documents including the second document; and
      
      wherein the step (C) comprises a step of generating the first language model based on the second plurality of documents.
  - 27. The method of claim 1, wherein the first document comprises one of a first plurality of documents, and wherein the method further comprises a step of:
    - (E) repeating steps (A) and (B) for each of the plurality of documents to produce a second plurality of documents including the second document; and
      
      wherein the step (C) comprises a step of generating the first language model based on the second plurality of documents.
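
One way to picture steps (B)(1), (B)(2), and (C) together is to attach a probability to each spoken form and then estimate n-gram probabilities from every expansion of the document, weighted by those probabilities. The sketch below does this for bigrams. It is only an assumed approach for illustration: the claims do not say how the language model is estimated, and the dict-based grammar node, the `<s>`/`</s>` padding symbols, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def expansions(doc):
    """Yield (words, probability) for each choice of one spoken form per node.

    `doc` is a list whose items are plain words (str) or grammar nodes,
    here represented as dicts mapping a spoken form to its probability.
    """
    slots = [[(item, 1.0)] if isinstance(item, str) else list(item.items())
             for item in doc]
    for combo in product(*slots):
        words, prob = [], 1.0
        for form, p in combo:
            words.extend(form.split())
            prob *= p
        yield words, prob


def bigram_model(doc):
    """Estimate P(cur | prev) from fractional counts over all expansions."""
    counts, context = Counter(), Counter()
    for words, prob in expansions(doc):
        padded = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            counts[(prev, cur)] += prob
            context[prev] += prob
    return {bigram: c / context[bigram[0]] for bigram, c in counts.items()}


doc = ["seen", "on", {"october first": 0.6, "the first of october": 0.4}]
model = bigram_model(doc)
# model[("on", "october")] is about 0.6 and model[("on", "the")] about 0.4
```

Enumerating every expansion grows exponentially with the number of grammar nodes, so this only suits short documents or per-sentence processing.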

19. A system comprising:
- a first document containing at least some information in common with a spoken audio stream;
  
  means for identifying text in the first document representing a concept having a plurality of spoken forms;
  
  means for replacing the identified text with a context-free grammar specifying the plurality of spoken forms of the concept to produce a second document; and
  
  means for generating a first language model based on the second document.
  - 20. The system of claim 19, wherein the concept comprises a semantic concept.
  - 21. The system of claim 19, wherein the concept comprises a syntactic concept.
  - 22. The system of claim 19, further comprising:
    - means for using the first language model in a speech recognition process to recognize the spoken audio stream and thereby to produce a third document.
  - 23. The system of claim 22, further comprising:
    - means for using the third document and the spoken audio stream to train an acoustic model.
  - 24. The system of claim 22, wherein the means for using the first language model comprises:
    - means for interpolating the first language model with a second language model to produce a third language model; and
      
      means for using the third language model in the speech recognition process to recognize the spoken audio stream and thereby to produce the third document.
  - 25. The system of claim 19, wherein the first document comprises a document generated based on the spoken audio stream.
  - 26. The system of claim 19, further comprising:
    - means for normalizing the second document.

28. A method comprising steps of:
- (A) generating a first language model based on a first document including text and a context-free grammar specifying a plurality of spoken forms of a concept; and
  
  (B) using the first language model in a speech recognition process to recognize a spoken audio stream and thereby to produce a transcript of the spoken audio stream.
  - 29. The method of claim 28, further comprising a step of:
    - (C) repeating steps (A) and (B) for a plurality of documents.
  - 30. The method of claim 28, further comprising a step of:
    - (C) filtering text from the transcript by reference to the first document to produce a filtered document.
  - 31. The method of claim 30, further comprising a step of:
    - (D) using the filtered document and the spoken audio stream to train an acoustic model.
  - 32. The method of claim 28, wherein the concept comprises a semantic concept.
  - 33. The method of claim 28, wherein the concept comprises a syntactic concept.
  - 34. The method of claim 28, wherein the step (B) comprises steps of:
    - (B)(1) interpolating the first language model with a second language model to produce a third language model; and
      
      (B)(2) using the third language model in the speech recognition process to recognize the spoken audio stream and thereby to produce the third document.
  - 35. The method of claim 28, wherein the context-free grammar comprises a finite state grammar.
  - 40. The device of claim 28, wherein the concept comprises a syntactic concept.
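
The interpolation step recited in claim 34 can be illustrated with plain linear interpolation, a common way to combine a small document-specific model with a broader background model. The claims do not specify the interpolation scheme, so the sketch below is an assumption: the function name `interpolate`, the `weight` parameter, and the representation of each model as a dict from `(previous word, word)` to a conditional probability are all hypothetical.

```python
def interpolate(first_lm, second_lm, weight):
    """Linearly interpolate two bigram models.

    Each model maps (prev, cur) -> P(cur | prev). `weight` is the share
    given to `first_lm`; the remainder goes to `second_lm`. A bigram that
    is missing from one model contributes probability 0.0 from that model.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    bigrams = set(first_lm) | set(second_lm)
    return {
        bigram: weight * first_lm.get(bigram, 0.0)
        + (1.0 - weight) * second_lm.get(bigram, 0.0)
        for bigram in bigrams
    }


# Document-specific model (first) and background model (second).
document_lm = {("on", "october"): 0.6, ("on", "the"): 0.4}
background_lm = {("on", "october"): 0.1, ("on", "the"): 0.9}
third_lm = interpolate(document_lm, background_lm, weight=0.5)
# third_lm[("on", "october")] is about 0.35, third_lm[("on", "the")] about 0.65
```

The result stays normalized for a given context only when both input models are normalized over that context; a context that appears in just one model ends up with total probability below 1 and would need smoothing or back-off in practice.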

36. A device comprising:
- means for generating a first language model based on a first document including text and a context-free grammar specifying a plurality of spoken forms of a concept; and
  
  means for using the first language model in a speech recognition process to recognize a spoken audio stream and thereby to produce a transcript of the spoken audio stream.
  - 37. The device of claim 36, further comprising:
    - means for filtering text from the transcript by reference to the first document to produce a filtered document.
  - 38. The device of claim 37, further comprising:
    - means for using the filtered document and the spoken audio stream to train an acoustic model.
  - 39. The device of claim 36, wherein the concept comprises a semantic concept.

41. A method comprising steps of:
- (A) applying a speech recognition process to recognize a spoken audio stream and thereby to produce a first document using a first language model based on a second document including text and a context-free grammar specifying a plurality of spoken forms of a concept; and
  
  (B) using the first document and the audio stream to train an acoustic model.
  - 42. The method of claim 41, wherein the concept comprises a semantic concept.
  - 43. The method of claim 41, wherein the concept comprises a syntactic concept.
  - 44. The method of claim 41, further comprising a step of:
    - (C) prior to step (B), filtering text from the first document by reference to the second document to produce a filtered document; and
      
      wherein the step (B) comprises a step of using the filtered document and the spoken audio stream to train the acoustic model.
  - 45. The method of claim 44, wherein the step (C) comprises applying a robust parser to the first and second documents to produce the filtered document.
  - 46. The method of claim 41, further comprising a step of:
    - (C) prior to the step (A), interpolating the first language model with a second language model to produce a third language model; and
      
      wherein the step (A) comprises a step of using the third language model in the speech recognition process to recognize the spoken audio stream and thereby to produce the third document.
  - 47. The method of claim 41, wherein the context-free grammar comprises a finite state grammar.

48. A device comprising:
- means for applying a speech recognition process to recognize a spoken audio stream and thereby to produce a first document using a first language model based on a second document including text and a context-free grammar specifying a plurality of spoken forms of a concept; and
  
  means for using the first document and the audio stream to train an acoustic model.
  - 49. The device of claim 48, wherein the concept comprises a semantic concept.
  - 50. The device of claim 48, wherein the concept comprises a syntactic concept.
  - 51. The device of claim 48, further comprising:
    - means for filtering text from the first document by reference to the second document to produce a filtered document; and
      
      wherein the means for using comprises means for using the filtered document and the spoken audio stream to train the acoustic model.
  - 52. The device of claim 48, further comprising:
    - means for interpolating the first language model with a second language model to produce a third language model; and
      
      wherein the means for applying comprises means for using the third language model in the speech recognition process to recognize the spoken audio stream and thereby to produce the third document.

53. A method comprising steps of:
- (A) identifying a first document containing at least some information in common with a spoken audio stream;
  
  (B) identifying text in the first document representing a concept;
  
  (C) replacing the identified text with a context-free grammar specifying a plurality of spoken forms of the concept to produce a second document; and
  
  (D) replacing text in the second document with normalized text to produce a normalized document including samples of text belonging to spoken forms in the finite state grammar.
  - 54. The method of claim 53, wherein the concept comprises a semantic concept.
  - 55. The method of claim 53, wherein the concept comprises a syntactic concept.
  - 56. The method of claim 53, wherein the context-free grammar comprises a finite state grammar.

57. A device comprising:
- means for identifying a first document containing at least some information in common with a spoken audio stream;
  
  means for identifying text in the first document representing a concept;
  
  means for replacing the identified text with a context-free grammar specifying a plurality of spoken forms of the concept to produce a second document; and
  
  means for replacing text in the second document with normalized text to produce a normalized document including samples of text belonging to spoken forms in the finite state grammar.
  - 58. The device of claim 57, wherein the concept comprises a semantic concept.
  - 59. The device of claim 57, wherein the concept comprises a syntactic concept.

60. A method comprising steps of:
- (A) identifying a normalized document of a spoken audio stream, the normalized document including a context-free grammar specifying a plurality of spoken forms of a concept;
  
  (B) generating a language model based on the normalized document;
  
  (C) using the language model in a speech recognition process to recognize the spoken audio stream and thereby to produce a second document; and
  
  (D) filtering text from the second document by reference to the normalized document to produce a filtered document.
  - 61. The method of claim 60, further comprising a step of:
    - (E) using the filtered document and the audio stream to train an acoustic model.
  - 62. The method of claim 60, further comprising a step of:
    - (E) using the filtered document to train a language model.
  - 63. The method of claim 60, wherein the concept comprises a semantic concept.
  - 64. The method of claim 60, wherein the concept comprises a syntactic concept.
  - 65. The method of claim 60, wherein the context-free grammar comprises a finite state grammar.
  - 66. The method of claim 60, wherein the step (D) comprises steps of:
    - (D)(1) determining whether a portion of the second document matches any of the plurality of spoken forms in the context-free grammar; and
      
      (D)(2) providing an indication of a possible defect in the context-free grammar if it is determined that the portion of the second document does not match any of the plurality of spoken forms in the context-free grammar.
  - 67. The method of claim 60, wherein the step (D) comprises steps of:
    - (D)(1) determining whether a phrase in the second document matches any of the plurality of spoken forms in the context-free grammar; and
      
      (D)(2) providing an indication of a possible defect in a dictionary entry for the phrase if it is determined that the portion of the second document does not match any of the plurality of spoken forms in the context-free grammar.
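
The matching check in steps (D)(1) and (D)(2) of claim 66 can be pictured as follows: for each grammar node in the normalized document, look for one of its spoken forms in the recognized transcript, and flag the node when none is found. The sketch below uses a plain contiguous-phrase search and is not the robust parser mentioned elsewhere in the claims; the list-based grammar node and the names `find_phrase` and `check_grammar_nodes` are assumptions made for this example.

```python
def find_phrase(words, phrase):
    """Index of the first contiguous occurrence of `phrase` in `words`, or -1."""
    n = len(phrase)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == phrase:
            return i
    return -1


def check_grammar_nodes(normalized_doc, transcript):
    """Match each grammar node against the recognized transcript.

    `normalized_doc` is a list of plain words (str) and grammar nodes
    (lists of alternative spoken forms). Returns the spoken forms found
    in the transcript and the nodes for which no form was found; the
    latter indicate a possible defect in the grammar.
    """
    words = transcript.split()
    matched, possible_defects = [], []
    for item in normalized_doc:
        if isinstance(item, str):
            continue  # only grammar nodes are checked here
        hits = [form for form in item if find_phrase(words, form.split()) >= 0]
        if hits:
            matched.append(max(hits, key=len))  # prefer the longest match
        else:
            possible_defects.append(item)
    return matched, possible_defects


node = ["october first", "october the first", "the first of october"]
doc = ["seen", "on", node, "today"]
matched, defects = check_grammar_nodes(doc, "seen on october the first today")
# matched == ['october the first'], defects == []
```

A transcript such as "seen on ten one today" would leave `matched` empty and return the date node in `defects`, which is the situation in which an indication of a possible grammar defect would be provided.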

68. A device comprising:
- means for identifying a normalized document of a spoken audio stream, the normalized document including a context-free grammar specifying a plurality of spoken forms of a concept;
  
  means for generating a language model based on the normalized document;
  
  means for using the language model in a speech recognition process to recognize the spoken audio stream and thereby to produce a second document; and
  
  means for filtering text from the second document by reference to the normalized document to produce a filtered document.
  - 69. The device of claim 68, further comprising:
    - means for using the filtered document and the audio stream to train an acoustic model.
  - 70. The device of claim 68, wherein the concept comprises a semantic concept.
  - 71. The device of claim 68, wherein the concept comprises a syntactic concept.

72. A method comprising steps of:
- (A) identifying a normalized document of a spoken audio stream, the normalized document including a context-free grammar specifying a plurality of spoken forms of a concept;
  
  (B) using the language model in a speech recognition process to recognize the spoken audio stream and thereby to produce a second document; and
  
  (C) using a robust parser to filter text from the second document by reference to the normalized document to produce a filtered document.
  - 73. The method of claim 72, further comprising a step of:
    - (D) using the filtered document and the audio stream to train an acoustic model.
  - 74. The method of claim 72, further comprising a step of:
    - (D) using the filtered document to train a language model.
  - 75. The method of claim 72, wherein the concept comprises a semantic concept.
  - 76. The method of claim 72, wherein the concept comprises a syntactic concept.
  - 77. The method of claim 72, wherein the context-free grammar comprises a finite state grammar.
  - 78. The method of claim 72, wherein the step (C) comprises steps of:
    - (C)(1) determining whether a portion of the second document matches any of the plurality of spoken forms in the context-free grammar; and
      
      (C)(2) providing an indication of a possible defect in the context-free grammar if it is determined that the portion of the second document does not match any of the plurality of spoken forms in the context-free grammar.
  - 79. The method of claim 72, wherein the step (C) comprises steps of:
    - (C)(1) determining whether a phrase in the second document matches any of the plurality of spoken forms in the context-free grammar; and
      
      (C)(2) providing an indication of a possible defect in a dictionary entry for the phrase if it is determined that the portion of the second document does not match any of the plurality of spoken forms in the context-free grammar.

80. A device comprising:
- means for identifying a normalized document of a spoken audio stream, the normalized document including a context-free grammar specifying a plurality of spoken forms of a concept;
  
  means for using the language model in a speech recognition process to recognize the spoken audio stream and thereby to produce a second document; and
  
  means for using a robust parser to filter text from the second document by reference to the normalized document to produce a filtered document.
  - 81. The device of claim 80, further comprising:
    - means for using the filtered document and the audio stream to train an acoustic model.
  - 82. The device of claim 80, wherein the concept comprises a semantic concept.
  - 83. The device of claim 80, wherein the concept comprises a syntactic concept.

Current Assignee: Solventum Intellectual Properties Company (Solventum Corp.)
Original Assignee: Multimodal Technologies Incorporated (3M Company)
Inventors: Koll, Detlef; Finke, Michael; Fritsch, Juergen; Yegnanarayanan, Girija; Woszczyna, Monika
Granted Patent: US 8,335,688 B2
US Class Current: 704/235
CPC Class Codes:
G10L 15/063   Training
G10L 15/193   Formal grammars, e.g. finit...
G10L 15/26   Speech to text systems G10L...
