Data shredding for speech recognition language model training under data retention restrictions

US 9,514,740 B2
Filed: 03/13/2013
Issued: 12/06/2016
Est. Priority Date: 03/13/2013
Status: Active Grant

Abstract

Training speech recognizers, e.g., their language or acoustic models, using actual user data is useful, but retaining personally identifiable information may be restricted in certain environments due to regulations. Accordingly, a method or system is provided for enabling training of a language model which includes producing segments of text in a text corpus and counts corresponding to the segments of text, the text corpus being in a depersonalized state. The method further includes enabling a system to train a language model using the segments of text in the depersonalized state and the counts. Because the data is depersonalized, actual data may be used, enabling speech recognizers to keep up-to-date with user trends in speech and usage, among other benefits.
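The segment-and-count step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that "shredding" means cutting each message into fixed-length word n-grams, and the function name `shred`, the parameter `n`, and the `<PHONE>` class label are hypothetical.

```python
from collections import Counter

def shred(messages, n=3):
    """Cut each message into word n-grams and count how often each occurs.

    Assumption: segments are overlapping n-grams; the source does not fix
    the segment length or the segmentation scheme.
    """
    counts = Counter()
    for message in messages:
        words = message.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# The corpus is assumed to be depersonalized already (PII replaced by labels).
corpus = [
    "call me at <PHONE> tomorrow",
    "call me at <PHONE> tonight",
]
counts = shred(corpus, n=3)
```

Only the segments and their counts would need to be kept for training; the original messages can be discarded once they have been shredded.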

22 Claims

1. A method for training a language model of an automatic speech recognition system, the method comprising:
- producing segments of text in a text corpus and counts corresponding to the segments of text, the text corpus being in a depersonalized state, the producing including dynamically shredding the text corpus into the segments of text in the depersonalized state;
  
  further depersonalizing the segments of text based on the corresponding counts, each count representing a number of occurrences of a respective segment of text in the text corpus; and
  
  enabling an automatic speech recognition system to train a language model using the segments of text in the depersonalized state and the counts.
  - 2. The method according to claim 1, wherein the segments of text are non-overlapping.
  - 3. The method according to claim 1, further comprising maintaining a store of the segments of text and the counts.
  - 4. The method according to claim 3, wherein maintaining the store includes removing all segments of text whose corresponding counts are less than N, and maintaining only the remaining segments of text and the counts.
  - 5. The method according to claim 1, further comprising depersonalizing the corpus to change it from a personalized state to the depersonalized state.
  - 6. The method according to claim 5, wherein depersonalizing the corpus includes replacing personally identifiable information in the corpus with class labels, wherein the personally identifiable information being replaced is personally identifiable information whose type can be identified by the class labels.
  - 7. The method according to claim 6, wherein the personally identifiable information being replaced includes at least one of the following:
    - a phone number, credit card number, name of a person, name of a business, or location.
  - 8. The method according to claim 6, further comprising maintaining a list of the class labels and counts corresponding to the class labels, the list not being linked to the corpus.
  - 9. The method according to claim 1, further comprising filtering the segments of text by removing from the segments of text those segments that contain personally identifiable information.
  - 10. The method according to claim 1, further comprising labeling the text segments and counts with metadata.
  - 11. The method according to claim 10, wherein the metadata include at least one of the following:
    - time of day of the message, area code of the sender, area code of the recipient, call duration, device type, or message type.
  - 12. The method according to claim 1, further comprising:
    - replacing one or more words of the corpus with corresponding one or more word indices; and
      
      generating each word index through use of a random hash.
  - 13. The method according to claim 12, further comprising keeping a map to the random hashes secure.
  - 14. The method according to claim 1, further comprising further depersonalizing the segments of text, used by the system to train the language model, as a function of the counts.
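Claims 4 and 6 describe two depersonalization steps: replacing personally identifiable information with class labels, and dropping segments whose counts are less than N. A rough sketch of both follows. The regular expressions, the label strings `<PHONE>` and `<CREDIT_CARD>`, and the function names are assumptions made for illustration; the claims do not specify how PII is detected.

```python
import re
from collections import Counter

# Hypothetical patterns; real PII detection would need to be far more robust.
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
CARD = re.compile(r"\b(?:\d{4}[-\s]?){3}\d{4}\b")

def replace_pii(text):
    """Replace PII with a class label that identifies its type (claim 6)."""
    text = CARD.sub("<CREDIT_CARD>", text)
    text = PHONE.sub("<PHONE>", text)
    return text

def prune(counts, min_count):
    """Keep only segments whose count is at least min_count (claim 4)."""
    return Counter({seg: c for seg, c in counts.items() if c >= min_count})

clean = replace_pii("call 555-123-4567 now")
kept = prune(Counter({"call me at": 3, "my dog rex": 1}), min_count=2)
```

The rationale for pruning is that a segment seen only once or twice is more likely to be specific to one person, while frequent segments reflect general usage.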

15. A system for training a language model of an automated speech recognition system, the system comprising:
- at least one processor configured to implement:
  
  a segmentation module configured to produce segments of text in a text corpus and counts corresponding to the segments of text, the text corpus being in a depersonalized state, the segments of text produced by dynamically shredding the text corpus into the segments of text in the depersonalized state;
  
  a depersonalization module configured to further depersonalize the segments of text based on the corresponding counts, each count representing a number of occurrences of a respective segment of text in the text corpus; and
  
  an enabling module configured to enable an automated speech recognition system to train a language model using the segments of text in the depersonalized state and the counts.
  - 16. The system according to claim 15, wherein the at least one processor is operatively coupled to associated memory, and the at least one processor is further configured to implement a storage module configured to maintain a store of the segments of text and the counts in the associated memory.
  - 17. The system according to claim 15, wherein the depersonalization module is further configured to depersonalize the corpus to change it from a personalized state to the depersonalized state.
  - 18. The system according to claim 15, wherein the at least one processor is further configured to implement a filtering module configured to filter the segments of text by removing from the segments of text those segments that contain personally identifiable information.
  - 19. The system according to claim 15, wherein the at least one processor is further configured to implement a labeling module configured to label the text segments and counts with metadata.
  - 20. The system according to claim 15, wherein the at least one processor is further configured to implement an indexing module configured to:
    - replace one or more words of the corpus with corresponding one or more word indices; and
      
      generate each word index through use of a random hash.
  - 21. The system according to claim 15, further comprising a depersonalization module configured to further depersonalize the segments of text, used by the system to train the language model, as a function of the counts.
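The word-index step in claims 12, 13 and 20 could look something like the sketch below. It is one possible reading only: it assumes the "random hash" is a keyed hash (HMAC-SHA256) with a randomly generated secret key, and that keeping the key private plays the role of keeping the map secure. The name `make_indexer` and the 64-bit index width are hypothetical.

```python
import hashlib
import hmac
import secrets

def make_indexer(key=None):
    """Return a word-to-index function and the secret key behind it.

    Assumption: a random key makes the hash unpredictable to anyone who
    does not hold the key, so indices cannot be mapped back to words.
    """
    key = key or secrets.token_bytes(32)

    def index(word):
        digest = hmac.new(key, word.encode("utf-8"), hashlib.sha256).digest()
        # Truncate to 64 bits to get a compact integer index.
        return int.from_bytes(digest[:8], "big")

    return index, key

index, key = make_indexer()
indexed = [index(w) for w in "see you at noon".split()]
```

The same word always maps to the same index under a given key, so segment counts remain meaningful, while the stored segments no longer contain readable words.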

22. A computer program product comprising a non-transitory computer-readable medium storing instructions for performing a method for training a language model of an automatic speech recognition system, the instructions, when loaded and executed by a processor, cause the processor to:
- produce segments of text in a text corpus and counts corresponding to the segments of text, the text corpus being in a depersonalized state, the segments of text produced by dynamically shredding the text corpus into the segments of text in the depersonalized state;
  
  further depersonalize the segments of text based on the corresponding counts, each count representing a number of occurrences of a respective segment of text in the text corpus; and
  
  enable an automated speech recognition system to train a language model using the segments of text in the depersonalized state and the counts.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Jost, Uwe Helmut; Woodland, Philip Charles; Katz, Marcel; Shahid, Syed Raza; Vozila, Paul J.; Ganong, William F. III
Primary Examiner(s): Yang, Qian
Application Number: US13/800,738
Publication Number: US 20140278425A1
Time in Patent Office: 1,364 Days
Field of Search: 704/257, 704/9
US Class Current: 1/1
CPC Class Codes:
G10L 15/063 Training
G10L 15/183 using context dependencies,...
