Automated extraction of semantic content and generation of a structured document from speech

US 20060041428A1
Filed: 08/20/2004
Published: 02/23/2006
Est. Priority Date: 08/20/2004
Status: Active Grant

Abstract

Techniques are disclosed for automatically generating structured documents based on speech, including identification of relevant concepts and their interpretation. In one embodiment, a structured document generator uses an integrated process to generate a structured textual document (such as a structured textual medical report) based on a spoken audio stream. The spoken audio stream may be recognized using a language model which includes a plurality of sub-models arranged in a hierarchical structure. Each of the sub-models may correspond to a concept that is expected to appear in the spoken audio stream. Different portions of the spoken audio stream may be recognized using different sub-models. The resulting structured textual document may have a hierarchical structure that corresponds to the hierarchical structure of the language sub-models that were used to generate the structured textual document.

95 Claims

1. A method comprising steps of:
- (A) identifying a probabilistic language model including a plurality of probabilistic language models associated with a plurality of sub-structures of a document; and
  
  (B) using a speech recognition decoder to apply the probabilistic language model to a spoken audio stream to produce a document including content organized into the plurality of sub-structures, wherein the content in each of the plurality of sub-structures is produced by recognizing speech using the probabilistic language model associated with the sub-structure.
- - 2. The method of claim 1, wherein the step (B) comprises a step of:
    - (B)(1) for each of a plurality of segments S of a spoken audio stream, performing steps of:
      
      (a) recognizing segment S with at least two of the plurality of probabilistic language models to identify at least two candidate contents for segment S;
      
      (b) selecting one of the at least two candidate contents as final content corresponding to segment S; and
      
      (c) inserting the final content for segment S into the document sub-structure associated with the probabilistic language model which produced the candidate content selected in step (B)(1)(b).
  - 3. The method of claim 2, wherein step (B)(1)(b) comprises steps of:
    - (i) applying a metric to the at least two candidate contents to produce fitness scores for the at least two candidate contents, the fitness scores representing probabilities that the candidate contents represent the spoken audio stream; and
      
      (ii) selecting the candidate content having the highest fitness score.
  - 4. The method of claim 3, wherein the step (B)(1)(b) further comprises a step of:
    - (iii) selecting the probabilistic language model that was used to generate the candidate content selected in step (B)(1)(b)(ii).
  - 5. The method of claim 1, wherein the plurality of probabilistic language models includes at least one n-gram language model.
  - 6. The method of claim 1, wherein the plurality of probabilistic language models includes at least one finite state language model.
  - 7. The method of claim 6, wherein the plurality of probabilistic language models includes at least one n-gram language model.
  - 8. The method of claim 1, wherein the plurality of sub-structures of the document comprises a plurality of sections of the document.
  - 9. The method of claim 1, wherein the plurality of sections comprises a plurality of paragraphs.
  - 10. The method of claim 1, wherein the plurality of sub-structures includes a sub-structure representing a semantic concept.
  - 11. The method of claim 10, wherein the semantic concept comprises a date.
  - 12. The method of claim 10, wherein the semantic concept comprises a medication.
  - 13. The method of claim 10, wherein the semantic concept is represented in the document in a computer-readable form.
  - 14. The method of claim 1, further comprising a step of:
    - (C) rendering the document to produce a rendition indicating the structure of the document.
  - 15. The method of claim 1, wherein the plurality of probabilistic language models are organized in a hierarchy, and wherein the step (B) comprises steps of:
    - (B)(1) identifying a path through the hierarchy;
      
      (B)(2) generating a document having a structure corresponding to the path identified in step (B)(1).
  - 16. The method of claim 15, wherein the step (B)(1) comprises a step of identifying a path through the hierarchy which, when applied by the speech recognition decoder to recognize the spoken audio stream, produces an optimal recognition result with respect to the hierarchy of the plurality of probabilistic language models.
  - 17. The method of claim 15, wherein step (B)(1) comprises steps of:
    - (B)(1)(a) identifying a plurality of paths through the hierarchy;
      
      (B)(1)(b) for each of the plurality of paths P, producing a candidate structured document for the spoken audio stream by using the speech recognition decoder to recognize the spoken audio stream using the language models on path P;
      
      (B)(1)(c) applying a metric to the plurality of candidate structured documents produced in step (B)(1)(b) to produce a plurality of fitness scores for the plurality of candidate structured documents; and
      
      (B)(1)(d) selecting the path which produces the candidate structured document having the highest fitness score.
  - 18. The method of claim 1, wherein the speech recognition decoder includes a plurality of speech recognition decoders, and wherein the step (B) includes steps of:
    - (B)(1) identifying a segment of the spoken audio stream;
      
      (B)(2) identifying one of the plurality of probabilistic language models;
      
      (B)(3) identifying one of the plurality of speech recognition decoders having an association with the identified one of the plurality of probabilistic language models; and
      
      (B)(4) using the identified speech recognition decoder to apply the identified probabilistic language model to the identified segment to produce content.
  - 19. The method of claim 18, wherein the identified one of the plurality of probabilistic language models comprises an n-gram language model, and wherein the identified speech recognition decoder comprises an n-gram speech recognition decoder.
  - 20. The method of claim 18, wherein the identified one of the plurality of probabilistic language models comprises a context-free grammar, and wherein the identified speech recognition decoder comprises a context-free grammar speech recognition decoder.
  - 21. The method of claim 1, wherein step (B) comprises steps of:
    - (B)(1) identifying a mapping between the plurality of probabilistic language models and a plurality of segments in the audio stream;
      
      (B)(2) for each of the plurality of segments, performing steps of:
      
      (B)(2)(a) identifying a corresponding one of the plurality of probabilistic language models using the mapping;
      
      (B)(2)(b) identifying one of the plurality of sub-structures associated with the identified probabilistic language model;
      
      (B)(3)(b) using the speech recognition decoder to recognize the segment using the identified one of the probabilistic language models thereby to produce content in the identified sub-structure.
  - 22. The method of claim 21, wherein the steps (B)(1) and (B)(2) are performed at least in part concurrently.
  - 23. The method of claim 1, wherein the step (B) comprises steps of:
    - (B)(1) identifying a portion of the spoken audio stream representing semantic information; and
      
      (B)(2) storing a representation of the semantic information in the document in a machine-readable form.
  - 47. A data structure comprising the language model identified in step (A) of claim 1.
  - 48. The data structure of claim 47, wherein the plurality of probabilistic language models includes at least one n-gram language model.
  - 49. The data structure of claim 47, wherein the plurality of probabilistic language models includes at least one context-free grammar.
  - 50. The data structure of claim 49, wherein the plurality of probabilistic language models includes at least one n-gram language model.
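Claims 2 to 4 describe recognizing each segment with several language models, scoring the candidates, and placing the winner in the sub-structure tied to the winning model. The following is a minimal sketch of that selection loop, not the patented implementation. It assumes the segments are already text rather than audio, and it uses a hypothetical `SectionModel` class (an add-one smoothed unigram model) whose log-probability stands in for the fitness score; the section names and training phrases are invented for illustration.

```python
import math
from collections import Counter


class SectionModel:
    """Hypothetical stand-in for one probabilistic language model.

    A smoothed unigram model over text, tied to one document sub-structure.
    """

    def __init__(self, section, training_text):
        self.section = section
        self.counts = Counter(training_text.lower().split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen words

    def log_prob(self, words):
        # Add-one smoothed unigram log-probability, used here as the fitness score.
        return sum(
            math.log((self.counts[w] + 1) / (self.total + self.vocab))
            for w in words
        )


def build_document(segments, models):
    """Route each segment to the sub-structure of its best-scoring model."""
    document = {m.section: [] for m in models}
    for segment in segments:
        words = segment.lower().split()
        # (a) score the segment with every model to get candidate contents
        candidates = [(m.log_prob(words), m) for m in models]
        # (b) select the candidate with the highest fitness score
        _, best_model = max(candidates, key=lambda c: c[0])
        # (c) insert the content into the winning model's sub-structure
        document[best_model.section].append(segment)
    return document


meds = SectionModel("medications", "patient takes aspirin daily lisinopril ten milligrams")
allergies = SectionModel("allergies", "patient is allergic to penicillin and latex allergy")
report = build_document(
    ["aspirin ten milligrams daily", "allergic to penicillin"],
    [meds, allergies],
)
```

In a real recognizer the fitness score would presumably combine acoustic and language-model evidence; only the language-model part is imitated here.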

24. An apparatus comprising:
- first identification means for identifying a probabilistic language model including a plurality of probabilistic language models associated with a plurality of sub-structures of a document;
  
  document production means for using a speech recognition decoder to apply the probabilistic language model to a spoken audio stream to produce a document including content organized into the plurality of sub-structures, wherein the content in each of the plurality of sub-structures is produced by recognizing speech using the probabilistic language model associated with the sub-structure.
- - 25. The apparatus of claim 24, wherein the document production means comprises:
    - iteration means comprising, for each of a plurality of segments S of a spoken audio stream:
      
      recognition means for recognizing segment S with at least two of the plurality of probabilistic language models to identify at least two candidate contents for segment S;
      
      first selection means for selecting one of the at least two candidate contents as final content corresponding to segment S; and
      
      insertion means for inserting the final content for segment S into the document sub-structure associated with the probabilistic language model which produced the candidate content selected by the selection means.
  - 26. The apparatus of claim 25, wherein the first selection means comprises:
    - means for applying a metric to the at least two candidate contents to produce fitness scores for the at least two candidate contents, the fitness scores representing probabilities that the candidate contents represent the spoken audio stream; and
      
      second selection means for selecting the candidate content having the highest fitness score.
  - 27. The apparatus of claim 26, wherein the first selection means further comprises:
    - third selection means for selecting the probabilistic language model that was used to generate the candidate content selected by the second selection means.
  - 28. The apparatus of claim 24, wherein the plurality of probabilistic language models includes at least one n-gram language model.
  - 29. The apparatus of claim 24, wherein the plurality of probabilistic language models includes at least one finite state language model.
  - 30. The apparatus of claim 29, wherein the plurality of probabilistic language models includes at least one n-gram language model.
  - 31. The apparatus of claim 24, wherein the plurality of sub-structures of the document comprises a plurality of sections of the document.
  - 32. The apparatus of claim 24, wherein the plurality of sections comprises a plurality of paragraphs.
  - 33. The apparatus of claim 24, wherein the plurality of sub-structures includes a sub-structure representing a semantic concept.
  - 34. The apparatus of claim 33, wherein the semantic concept comprises a date.
  - 35. The apparatus of claim 33, wherein the semantic concept comprises a medication.
  - 36. The apparatus of claim 33, wherein the semantic concept is represented in the document in a computer-readable form.
  - 37. The apparatus of claim 24, further comprising:
    - means for rendering the document to produce a rendition indicating the structure of the document.
  - 38. The apparatus of claim 24, wherein the plurality of probabilistic language models are organized in a hierarchy, and wherein the document production means comprises:
    - second identification means for identifying a path through the hierarchy; and
      
      means for generating a document having a structure corresponding to the path identified by the second identification means.
  - 39. The apparatus of claim 38, wherein the second identification means comprises means for identifying a path through the hierarchy which, when applied by the speech recognition decoder to recognize the spoken audio stream, produces an optimal recognition result with respect to the hierarchy of the plurality of probabilistic language models.
  - 40. The apparatus of claim 38, wherein the second identification means comprises:
    - means for identifying a plurality of paths through the hierarchy;
      
      candidate production means for producing, for each of the plurality of paths P, a candidate structured document for the spoken audio stream by using the speech recognition decoder to recognize the spoken audio stream using the language models on path P;
      
      means for applying a metric to the plurality of candidate structured documents produced by the candidate production means to produce a plurality of fitness scores for the plurality of candidate structured documents; and
      
      means for selecting the path which produces the candidate structured document having the highest fitness score.
  - 41. The apparatus of claim 24, wherein the speech recognition decoder includes a plurality of speech recognition decoders, and wherein the document production means includes:
    - means for identifying a segment of the spoken audio stream;
      
      means for identifying one of the plurality of probabilistic language models;
      
      means for identifying one of the plurality of speech recognition decoders having an association with the identified one of the plurality of probabilistic language models; and
      
      means for using the identified speech recognition decoder to apply the identified probabilistic language model to the identified segment to produce content.
  - 42. The apparatus of claim 41, wherein the identified one of the plurality of probabilistic language models comprises an n-gram language model, and wherein the identified speech recognition decoder comprises an n-gram speech recognition decoder.
  - 43. The apparatus of claim 41, wherein the identified one of the plurality of probabilistic language models comprises a context-free grammar, and wherein the identified speech recognition decoder comprises a context-free grammar speech recognition decoder.
  - 44. The apparatus of claim 24, wherein the document production means comprises:
    - second identification means for identifying a mapping between the plurality of probabilistic language models and a plurality of segments in the audio stream;
      
      iteration means comprising, for each of the plurality of segments:
      
      means for identifying a corresponding one of the plurality of probabilistic language models using the mapping;
      
      means for identifying one of the plurality of sub-structures associated with the identified probabilistic language model; and
      
      means for using the speech recognition decoder to recognize the segment using the identified one of the probabilistic language models thereby to produce content in the identified sub-structure.
  - 45. The apparatus of claim 44, wherein the second identification means and the iteration means are configured to operate at least in part concurrently.
  - 46. The apparatus of claim 24, wherein the document production means comprises:
    - means for identifying a portion of the spoken audio stream representing semantic information; and
      
      means for storing a representation of the semantic information in the document in a machine-readable form.
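Claims 41 to 44 pair each language model with a decoder of a matching type and use a mapping from segments to models to fill the document. The sketch below illustrates only that dispatch pattern under loud assumptions: the `LanguageModel` record, the `DECODERS` registry, and both decoder functions are hypothetical placeholders that tag pre-tokenized text instead of decoding audio.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class LanguageModel:
    name: str  # the document sub-structure this model serves
    kind: str  # "ngram" or "cfg"


# Placeholder decoders (assumptions): they label text rather than decode audio.
def ngram_decoder(model: LanguageModel, segment: List[str]) -> str:
    return f"[{model.name}/ngram] " + " ".join(segment)


def cfg_decoder(model: LanguageModel, segment: List[str]) -> str:
    return f"[{model.name}/cfg] " + " ".join(segment)


# Association between model type and decoder type.
DECODERS: Dict[str, Callable[[LanguageModel, List[str]], str]] = {
    "ngram": ngram_decoder,
    "cfg": cfg_decoder,
}


def recognize(segment: List[str], model: LanguageModel) -> str:
    """Pick the decoder associated with the model's type and apply it."""
    return DECODERS[model.kind](model, segment)


def produce_document(segments, mapping, models):
    """Use a segment-index -> model-name mapping to fill sub-structures."""
    document: Dict[str, List[str]] = {}
    for index, segment in enumerate(segments):
        model = models[mapping[index]]
        document.setdefault(model.name, []).append(recognize(segment, model))
    return document


models = {
    "history": LanguageModel("history", "ngram"),
    "dosage": LanguageModel("dosage", "cfg"),
}
segments = [["patient", "reports", "headache"], ["ten", "milligrams", "daily"]]
document = produce_document(segments, {0: "history", 1: "dosage"}, models)
```

Keeping the decoder lookup in a registry means a new model type only needs a new entry, which is one plausible way to read "having an association with" in the claim language.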

51. A data structure comprising:
- a plurality of language models logically organized in a hierarchy, the plurality of language models including a first language model and a second language model;
  
  wherein the first language model is a parent of the second language model in the hierarchy;
  
  wherein the first language model is suitable for recognizing speech representing a first concept associated with a substructure of a document; and
  
  wherein the second language model is suitable for recognizing speech representing a second concept associated with a subset of the substructure of the document.
- - 52. The data structure of claim 51, wherein the first language model comprises an n-gram language model.
  - 53. The data structure of claim 51, wherein the first language model comprises a context-free grammar.
  - 54. The data structure of claim 53, wherein the second language model comprises an n-gram language model.
  - 55. The data structure of claim 51, wherein the substructure of the document comprises a section of the document.
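The parent/child arrangement in claims 51 to 55 can be pictured as a small tree of language-model nodes. This is an illustrative sketch only: the `LanguageModelNode` class, its fields, and the "medications"/"dosage" concepts are assumptions, and the `kind` string is a label rather than a working n-gram model or grammar.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LanguageModelNode:
    concept: str  # concept or document sub-structure the model recognizes
    kind: str     # "ngram" or "cfg" (label only in this sketch)
    children: List["LanguageModelNode"] = field(default_factory=list)
    parent: Optional["LanguageModelNode"] = field(
        default=None, repr=False, compare=False
    )

    def add_child(self, child: "LanguageModelNode") -> "LanguageModelNode":
        # The child covers a subset of the parent's sub-structure.
        child.parent = self
        self.children.append(child)
        return child

    def path(self) -> List[str]:
        # Concepts from the root of the hierarchy down to this node.
        names, node = [], self
        while node is not None:
            names.append(node.concept)
            node = node.parent
        return list(reversed(names))


# First (parent) model: a section. Second (child) model: a concept inside it.
section = LanguageModelNode("medications", "ngram")
dosage = section.add_child(LanguageModelNode("dosage", "cfg"))
```

Here `dosage.path()` yields `["medications", "dosage"]`, mirroring how a document sub-structure would nest inside its section.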

56. A method comprising steps of:
- (A) identifying a probabilistic language model including a plurality of probabilistic language models associated with a plurality of concepts logically organized in a first hierarchy;
  
  (B) using a speech recognition decoder to apply the probabilistic language model to a spoken audio stream to produce a document including content organized into a plurality of sub-structures logically organized in a second hierarchy having a logical structure defined by a path through the first hierarchy.
- - 57. The method of claim 56, wherein the step (B) comprises a step of traversing the path through the first hierarchy to produce the document.
  - 58. The method of claim 56, wherein the step (B) comprises a step of:
    - (B)(1) for each of a plurality of segments S of a spoken audio stream, performing steps of:
      
      (a) recognizing segment S with at least two of the plurality of probabilistic language models to identify at least two candidate contents for segment S;
      
      (b) selecting one of the at least two candidate contents as final content corresponding to segment S; and
      
      (c) inserting the final content for segment S into the document sub-structure associated with the probabilistic language model which produced the candidate content selected in step (B)(1)(b).
  - 59. The method of claim 56, wherein the plurality of probabilistic language models includes at least one n-gram language model.
  - 60. The method of claim 56, wherein the plurality of probabilistic language models includes at least one finite state language model.
  - 61. The method of claim 60, wherein the plurality of probabilistic language models includes at least one n-gram language model.
  - 62. The method of claim 56, wherein the plurality of sub-structures includes a sub-structure representing a semantic concept.
  - 63. The method of claim 62, wherein the semantic concept comprises a date.
  - 64. The method of claim 62, wherein the semantic concept comprises a medication.
  - 65. The method of claim 62, wherein the semantic concept is represented in the document in a computer-readable form.
  - 66. The method of claim 56, further comprising a step of:
    - (C) rendering the document to produce a rendition indicating the structure of the document.
  - 67. The method of claim 56, wherein the step (B) comprises steps of:
    - (B)(1) identifying a path through the hierarchy;
      
      (B)(2) generating a document having a structure corresponding to the path identified in step (B)(1).
  - 68. The method of claim 67, wherein the step (B)(1) comprises a step of identifying a path through the hierarchy which, when applied by a speech recognition decoder to recognize the spoken audio stream, produces an optimal recognition result with respect to the hierarchy of the plurality of probabilistic language models.
  - 69. The method of claim 67, wherein step (B)(1) comprises steps of:
    - (B)(1)(a) identifying a plurality of paths through the hierarchy;
      
      (B)(1)(b) for each of the plurality of paths P, producing a candidate structured document for the spoken audio stream by using the speech recognition decoder to recognize the spoken audio stream using the language models on path P;
      
      (B)(1)(c) applying a metric to the plurality of candidate structured documents produced in step (B)(1)(b) to produce a plurality of fitness scores for the plurality of candidate structured documents; and
      
      (B)(1)(d) selecting the path which produces the candidate structured document having the highest fitness score.
  - 70. The method of claim 56, wherein the speech recognition decoder includes a plurality of speech recognition decoders, and wherein the step (B) includes steps of:
    - (B)(1) identifying a segment of the spoken audio stream;
      
      (B)(2) identifying one of the plurality of probabilistic language models;
      
      (B)(3) identifying one of the plurality of speech recognition decoders having an association with the identified one of the plurality of probabilistic language models; and
      
      (B)(4) using the identified speech recognition decoder to apply the identified probabilistic language model to the identified segment to produce content.
  - 71. The method of claim 70, wherein the identified one of the plurality of probabilistic language models comprises an n-gram language model, and wherein the identified speech recognition decoder comprises an n-gram speech recognition decoder.
  - 72. The method of claim 70, wherein the identified one of the plurality of probabilistic language models comprises a context-free grammar, and wherein the identified speech recognition decoder comprises a context-free grammar speech recognition decoder.
  - 73. The method of claim 56, wherein step (B) comprises steps of:
    - (B)(1) identifying a mapping between the plurality of probabilistic language models and a plurality of segments in the audio stream;
      
      (B)(2) for each of the plurality of segments, performing steps of:
      
      (B)(2)(a) identifying a corresponding one of the plurality of probabilistic language models using the mapping;
      
      (B)(2)(b) identifying one of the plurality of sub-structures associated with the identified probabilistic language model;
      
      (B)(3)(b) using the speech recognition decoder to recognize the segment using the identified one of the probabilistic language models thereby to produce content in the identified sub-structure.
  - 74. The method of claim 73, wherein the steps (B)(1) and (B)(2) are performed at least in part concurrently.
  - 75. The method of claim 56, wherein the step (B) comprises steps of:
    - (B)(1) identifying a portion of the spoken audio stream representing semantic information; and
      
      (B)(2) storing a representation of the semantic information in the document in a machine-readable form.
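Claims 67 to 69 select a path through the model hierarchy by producing a candidate document per path and keeping the one with the highest fitness score. The sketch below shows that enumerate-score-select pattern under stated assumptions: `HIERARCHY` and `KEYWORDS` are invented, keyword sets stand in for language models, input is text rather than audio, and the fitness metric (fraction of words covered) is a placeholder. Exhaustive enumeration is used for clarity; the source does not say how paths are searched.

```python
# Hypothetical hierarchy: each concept maps to its child concepts.
HIERARCHY = {
    "report": ["history", "plan"],
    "history": ["allergies", "medications"],
    "plan": ["follow-up"],
    "allergies": [],
    "medications": [],
    "follow-up": [],
}

# Hypothetical keyword sets standing in for per-concept language models.
KEYWORDS = {
    "report": set(),
    "history": {"previously"},
    "plan": {"return"},
    "allergies": {"allergic", "penicillin"},
    "medications": {"took", "aspirin"},
    "follow-up": {"weeks"},
}


def enumerate_paths(node, hierarchy):
    """All root-to-leaf paths starting at node."""
    children = hierarchy[node]
    if not children:
        return [[node]]
    return [
        [node] + rest
        for child in children
        for rest in enumerate_paths(child, hierarchy)
    ]


def candidate_document(path, words):
    """Content each model on the path would claim from the word sequence."""
    return {c: [w for w in words if w in KEYWORDS[c]] for c in path}


def fitness(document, words):
    """Placeholder metric: fraction of words claimed by some model on the path."""
    covered = sum(len(content) for content in document.values())
    return covered / max(len(words), 1)


def best_path(words, root="report"):
    scored = []
    for path in enumerate_paths(root, HIERARCHY):
        document = candidate_document(path, words)
        scored.append((fitness(document, words), path, document))
    # Keep the path whose candidate document has the highest fitness score.
    return max(scored, key=lambda item: item[0])


score, path, document = best_path("patient previously took aspirin".split())
```

The returned `document` has one entry per concept on the winning path, so its structure follows the path through the hierarchy.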

76. An apparatus comprising:
- identification means for identifying a probabilistic language model including a plurality of probabilistic language models associated with a plurality of concepts logically organized in a first hierarchy; and
  
  document production means for using a speech recognition decoder to apply the probabilistic language model to a spoken audio stream to produce a document including content organized into a plurality of sub-structures logically organized in a second hierarchy having a logical structure defined by a path through the first hierarchy.
- - 77. The apparatus of claim 76, wherein the document production means comprises means for traversing the path through the first hierarchy to produce the document.
  - 78. The apparatus of claim 76, wherein the document production means comprises:
    - iteration means comprising, for each of a plurality of segments S of a spoken audio stream:
      
      means for recognizing segment S with at least two of the plurality of probabilistic language models to identify at least two candidate contents for segment S;
      
      first selection means for selecting one of the at least two candidate contents as final content corresponding to segment S; and
      
      insertion means for inserting the final content for segment S into the document sub-structure associated with the probabilistic language model which produced the candidate content selected by the first selection means.
  - 79. The apparatus of claim 76, wherein the plurality of probabilistic language models includes at least one n-gram language model.
  - 80. The apparatus of claim 76, wherein the plurality of probabilistic language models includes at least one finite state language model.
  - 81. The apparatus of claim 80, wherein the plurality of probabilistic language models includes at least one n-gram language model.
  - 82. The apparatus of claim 76, wherein the plurality of sub-structures includes a sub-structure representing a semantic concept.
  - 83. The apparatus of claim 82, wherein the semantic concept comprises a date.
  - 84. The apparatus of claim 82, wherein the semantic concept comprises a medication.
  - 85. The apparatus of claim 82, wherein the semantic concept is represented in the document in a computer-readable form.
  - 86. The apparatus of claim 76, further comprising:
    - means for rendering the document to produce a rendition indicating the structure of the document.
  - 87. The apparatus of claim 76, wherein the document production means comprises:
    - second identification means for identifying a path through the hierarchy; and
      
      means for generating a document having a structure corresponding to the path identified by the second identification means.
  - 88. The apparatus of claim 87, wherein the second identification means comprises means for identifying a path through the hierarchy which, when applied by a speech recognition decoder to recognize the spoken audio stream, produces an optimal recognition result with respect to the hierarchy of the plurality of probabilistic language models.
  - 89. The apparatus of claim 87, wherein the second identification means comprises:
    - means for identifying a plurality of paths through the hierarchy;
      
      candidate production means for producing, for each of the plurality of paths P, a candidate structured document for the spoken audio stream by using the speech recognition decoder to recognize the spoken audio stream using the language models on path P;
      
      means for applying a metric to the plurality of candidate structured documents produced by the candidate production means to produce a plurality of fitness scores for the plurality of candidate structured documents; and
      
      means for selecting the path which produces the candidate structured document having the highest fitness score.
  - 90. The apparatus of claim 76, wherein the speech recognition decoder includes a plurality of speech recognition decoders, and wherein the document production means includes:
    - means for identifying a segment of the spoken audio stream;
      
      means for identifying one of the plurality of probabilistic language models;
      
      means for identifying one of the plurality of speech recognition decoders having an association with the identified one of the plurality of probabilistic language models; and
      
      means for using the identified speech recognition decoder to apply the identified probabilistic language model to the identified segment to produce content.
  - 91. The apparatus of claim 90, wherein the identified one of the plurality of probabilistic language models comprises an n-gram language model, and wherein the identified speech recognition decoder comprises an n-gram speech recognition decoder.
  - 92. The apparatus of claim 90, wherein the identified one of the plurality of probabilistic language models comprises a context-free grammar, and wherein the identified speech recognition decoder comprises a context-free grammar speech recognition decoder.
  - 93. The apparatus of claim 76, wherein the document production means comprises:
    - second identification means for identifying a mapping between the plurality of probabilistic language models and a plurality of segments in the audio stream;
      
      iteration means comprising, for each of the plurality of segments:
      
      means for identifying a corresponding one of the plurality of probabilistic language models using the mapping;
      
      means for identifying one of the plurality of sub-structures associated with the identified probabilistic language model; and
      
      means for using the speech recognition decoder to recognize the segment using the identified one of the probabilistic language models thereby to produce content in the identified sub-structure.
  - 94. The apparatus of claim 93, wherein the second identification means and the iteration means are configured to operate at least in part concurrently.
  - 95. The apparatus of claim 76, wherein the document production means comprises:
    - means for identifying a portion of the spoken audio stream representing semantic information; and
      
      means for storing a representation of the semantic information in the document in a machine-readable form.
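Several claims (for example 13, 14, 23, 85, 86 and 95) refer to storing a semantic concept such as a date in machine-readable form and rendering the document so its structure is visible. One way to picture this, as a sketch only: the XML element names, the `value` attribute, the ISO 8601 date format, and the narrow "Month day, year" input pattern are all assumptions made for illustration, not details taken from the patent.

```python
import xml.etree.ElementTree as ET
from datetime import date

MONTHS = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]


def parse_date(text):
    """Parse a 'Month day, year' phrase (assumed format) into a date object."""
    month_word, day, year = text.lower().replace(",", "").split()
    return date(int(year), MONTHS.index(month_word) + 1, int(day))


def build_report(section_title, text, date_phrase):
    """Build a one-section report whose date carries a machine-readable value."""
    report = ET.Element("report")
    section = ET.SubElement(report, "section", {"title": section_title})
    section.text = text
    when = ET.SubElement(
        section, "date", {"value": parse_date(date_phrase).isoformat()}
    )
    when.text = date_phrase  # human-readable form is kept alongside the value
    return report


# Rendering: serialize the tree so the structure is explicit.
rendered = ET.tostring(
    build_report("History", "Patient seen on ", "August 20, 2004"),
    encoding="unicode",
)
```

The rendered string keeps the dictated wording as element text while the `value` attribute holds the normalized form a downstream program could read.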

Current Assignee: Multimodal Technologies Incorporated (3M Company)
Original Assignee: Multimodal Technologies Incorporated (3M Company)
Inventors: Koll, Detlef; Finke, Michael; Fritsch, Juergen; Yegnanarayanan, Girija; Woszczyna, Monika
Granted Patent: US 7,584,103 B2
US Class Current: 704/257
CPC Class Codes:
G10L 15/1815 Semantic context, e.g. disa...
G16H 15/00 ICT specially adapted for m...