Data shredding for speech recognition acoustic model training under data retention restrictions
Abstract
Training speech recognizers, e.g., their language or acoustic models, using actual user data is useful, but retaining personally identifiable information may be restricted in certain environments due to regulations. Accordingly, a method or system is provided for enabling training of an acoustic model which includes dynamically shredding a speech corpus to produce text segments and depersonalized audio features corresponding to the text segments. The method further includes enabling a system to train an acoustic model using the text segments and the depersonalized audio features. Because the data is depersonalized, actual data may be used, enabling speech recognizers to keep up-to-date with user trends in speech and usage, among other benefits.
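The pipeline the abstract describes — split each message into strips pairing a text segment with depersonalized audio features, then randomize strip order so utterances cannot be trivially reassembled — can be sketched as follows. This is a minimal illustration, not the patent's implementation: per-segment mean subtraction stands in for the unspecified removal of speaker vocal characteristics, and all names are assumed.

```python
import random

def depersonalize(frames):
    """Crude stand-in for removing speaker vocal characteristics:
    subtract the per-segment mean from the feature frames."""
    mean = sum(frames) / len(frames)
    return [f - mean for f in frames]

def shred_and_mix(messages, seed=0):
    """Split messages into (text segment, depersonalized features)
    strips and return them in randomized order."""
    strips = [(text, depersonalize(frames))
              for segments in messages      # one message = aligned segments
              for text, frames in segments]
    random.Random(seed).shuffle(strips)     # break speaker-level ordering
    return strips
```

An acoustic-model trainer would then consume the mixed strips without ever seeing a complete, speaker-attributable message.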
Claims
1. A method of enabling training of an acoustic model, the method comprising:
dynamically shredding a speech corpus to produce text segments and depersonalized audio features corresponding to the text segments, the depersonalized audio features including filtered audio data remaining after speaker vocal characteristics and other audio characteristics have been removed, the speech corpus comprising a plurality of messages that each contain audio and corresponding text content, the shredding splitting each of the plurality of messages into strips, each strip comprising text segments and corresponding depersonalized audio features;

mixing up the strips of the text segments and corresponding depersonalized audio features to produce strips mixed up in randomized order; and

enabling a system to train an acoustic model using the strips mixed up in randomized order.

Dependent claims: 2-13.
14. A system for enabling training of an acoustic model, the system comprising:
a shredding module configured to shred a speech corpus dynamically to produce text segments and depersonalized audio features corresponding to the text segments, the depersonalized audio features including filtered audio data remaining after speaker vocal characteristics and other audio characteristics have been removed, the speech corpus comprising a plurality of messages that each contain audio and corresponding text content, the shredding splitting each of the plurality of messages into strips, each strip comprising text segments and corresponding depersonalized audio features;

the shredding module further configured to mix up the strips of the text segments and corresponding depersonalized audio features to produce strips mixed up in randomized order; and

an enabling module configured to enable a system to train an acoustic model using the strips mixed up in randomized order.

Dependent claims: 15-19.
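The module structure of claim 14 can be sketched in the same spirit. The class names, the mean-subtraction depersonalization, and the trainer callback are all assumptions for illustration, not the claimed system's actual design.

```python
import random

class ShreddingModule:
    """Shreds a speech corpus into depersonalized strips and mixes
    them up (hypothetical API mirroring claim 14)."""
    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def _depersonalize(self, frames):
        # Stand-in for removing speaker vocal characteristics.
        mean = sum(frames) / len(frames)
        return [f - mean for f in frames]

    def shred_and_mix(self, messages):
        strips = [(text, self._depersonalize(frames))
                  for segments in messages
                  for text, frames in segments]
        self._rng.shuffle(strips)
        return strips

class EnablingModule:
    """Hands the mixed strips to an acoustic-model trainer callback."""
    def enable_training(self, trainer, strips):
        return trainer(strips)
```

Splitting shredding and enabling into separate modules mirrors the claim's division of labor: the trainer only ever receives anonymized, reordered strips.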
20. A computer program product comprising a non-transitory computer-readable medium storing instructions for performing a method of enabling training of an acoustic model, wherein the instructions, when loaded and executed by a processor, cause the processor to:
dynamically shred a speech corpus to produce text segments and depersonalized audio features corresponding to the text segments, the depersonalized audio features including filtered audio data remaining after speaker vocal characteristics and other audio characteristics have been removed, the speech corpus comprising a plurality of messages that each contain audio and corresponding text content, the shredding splitting each of the plurality of messages into strips, each strip comprising text segments and corresponding depersonalized audio features;

mix up the strips of the text segments and corresponding depersonalized audio features to produce strips mixed up in randomized order; and

enable a system to train an acoustic model using the strips mixed up in randomized order.