Sampling training data for an automatic speech recognition system based on a benchmark classification distribution

US 9,202,461 B2
Filed: 01/18/2013
Issued: 12/01/2015
Est. Priority Date: 04/26/2012
Status: Active Grant



Abstract

A set of benchmark text strings may be classified to provide a set of benchmark classifications. The benchmark text strings in the set may correspond to a benchmark corpus of benchmark utterances in a particular language. A benchmark classification distribution of the set of benchmark classifications may be determined. A respective classification for each text string in a corpus of text strings may also be determined. Text strings from the corpus of text strings may be sampled to form a training corpus of training text strings such that the classifications of the training text strings have a training text string classification distribution that is based on the benchmark classification distribution. The training corpus of training text strings may be used to train an automatic speech recognition (ASR) system.
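The sampling described in the abstract, combined with the per-string selection probability t×B(i)/N(i) recited in claim 1, can be sketched as follows. This is a minimal illustration under stated assumptions, not the patentees' implementation: the function names `benchmark_distribution` and `sample_training_strings` are hypothetical, topic labels are assumed to be already assigned, each string is assumed to be selected by an independent random draw, and the probability is clamped to 1 (the claim does not say what happens when t×B(i) exceeds N(i)).

```python
import random
from collections import Counter


def benchmark_distribution(benchmark_topics):
    """Return B(i): the relative frequency of each topic among the benchmark strings."""
    counts = Counter(benchmark_topics)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}


def sample_training_strings(corpus, B, t, rng=None):
    """Select training strings from `corpus`, a list of (text, topic) pairs.

    Each string of topic i is kept with probability t * B(i) / N(i), where
    N(i) is the number of corpus strings classified with topic i.
    Assumption: the probability is clamped to 1.0, and topics absent from
    the benchmark distribution are treated as B(i) = 0.
    """
    rng = rng or random.Random()
    N = Counter(topic for _, topic in corpus)
    selected = []
    for text, topic in corpus:
        p = min(1.0, t * B.get(topic, 0.0) / N[topic])
        if rng.random() < p:
            selected.append((text, topic))
    return selected


# Toy usage: the benchmark is 75% "maps" and 25% "music", while the corpus
# is dominated by "music" strings.
B = benchmark_distribution(["maps", "maps", "maps", "music"])
corpus = [(f"maps query {k}", "maps") for k in range(1000)]
corpus += [(f"music query {k}", "music") for k in range(9000)]
training = sample_training_strings(corpus, B, t=400, rng=random.Random(0))
print(Counter(topic for _, topic in training))
```

With t = 400, roughly 300 "maps" strings and 100 "music" strings are expected, so the training sample follows the benchmark mix rather than the corpus mix. The exact counts vary with the random seed.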

17 Claims

1. A method comprising:
- obtaining a benchmark classification distribution of topic classifications for benchmark text strings;
- selecting, by a computing device, training text strings from a corpus of text strings, wherein the training text strings are associated with respective topic classifications, and wherein selecting the training text strings includes (a) determining to select t training text strings, (b) determining that a frequency of topic i in the benchmark classification distribution is B(i), wherein B(i) is inclusively between 0 and 1, (c) determining that a number of text strings classified with topic i in the corpus of text strings is N(i), and (d) selecting a training text string of topic i from the corpus of text strings based on probability t×B(i)/N(i); and
- training a language model of an automatic speech recognition (ASR) system using the training text strings.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10):
  - 2. The method of claim 1, wherein obtaining the benchmark classification distribution comprises:
    - transcribing benchmark utterances to respective benchmark text strings; and
    - determining the benchmark classification distribution from the benchmark text strings.
  - 3. The method of claim 2, wherein the benchmark utterances were made by users in a category of users, and wherein the ASR system is configured to transcribe new utterances made by users in the category of users.
  - 4. The method of claim 3, wherein the benchmark utterances were made by a single user, and wherein the ASR system is configured to transcribe new utterances made by the single user.
  - 5. The method of claim 3, wherein the benchmark utterances were made by users with a particular dialect, and wherein the ASR system is configured to transcribe new utterances made by users with the particular dialect.
  - 6. The method of claim 3, wherein the benchmark utterances were made by users from a particular geographic location, and wherein the ASR system is configured to transcribe new utterances made by users from the particular geographic location.
  - 7. The method of claim 1, wherein there are fewer benchmark text strings than training text strings.
  - 8. The method of claim 1, wherein the training text strings were transcribed from respective utterances.
  - 9. The method of claim 1, wherein the topic classifications of the training text strings have a training text string classification distribution, and wherein the training text strings are selected such that the training text string classification distribution is substantially similar to the benchmark classification distribution.
  - 10. The method of claim 1, wherein training the language model of the ASR system comprises:
    - training the language model of the ASR system with a combination of the training text strings and an additional corpus of text strings that were transcribed from utterances made by users of the ASR system.

11. An article of manufacture including a non-transitory computer-readable medium, having stored thereon program instructions that, upon execution by a computing device, cause the computing device to perform operations comprising:
- obtaining a benchmark classification distribution of topic classifications for benchmark text strings;
- selecting training text strings from a corpus of text strings, wherein the training text strings are associated with respective topic classifications, and wherein selecting the training text strings includes (a) determining to select t training text strings, (b) determining that a frequency of topic i in the benchmark classification distribution is B(i), wherein B(i) is inclusively between 0 and 1, (c) determining that a number of text strings classified with topic i in the corpus of text strings is N(i), and (d) selecting a training text string of topic i from the corpus of text strings based on probability t×B(i)/N(i); and
- training a language model of an automatic speech recognition (ASR) system using the training text strings.
- Dependent claims (12, 13, 14):
  - 12. The article of manufacture of claim 11, wherein obtaining the benchmark classification distribution comprises:
    - transcribing benchmark utterances to respective benchmark text strings; and
    - determining the benchmark classification distribution from the benchmark text strings.
  - 13. The article of manufacture of claim 11, wherein the topic classifications of the training text strings have a training text string classification distribution, and wherein the training text strings are selected such that the training text string classification distribution is substantially similar to the benchmark classification distribution.
  - 14. The article of manufacture of claim 11, wherein training the language model of the ASR system comprises:
    - training the language model of the ASR system with a combination of the training text strings and an additional corpus of text strings that were transcribed from utterances made by users of the ASR system.

15. A computing system comprising:
- at least one processor;
- data storage; and
- program instructions in the data storage that, upon execution by the at least one processor, cause the computing system to perform operations comprising:
  - obtaining a benchmark classification distribution of topic classifications for benchmark text strings;
  - selecting training text strings from a corpus of text strings, wherein the training text strings are associated with respective topic classifications, and wherein selecting the training text strings includes (a) determining to select t training text strings, (b) determining that a frequency of topic i in the benchmark classification distribution is B(i), wherein B(i) is inclusively between 0 and 1, (c) determining that a number of text strings classified with topic i in the corpus of text strings is N(i), and (d) selecting a training text string of topic i from the corpus of text strings based on probability t×B(i)/N(i); and
  - training a language model of an automatic speech recognition (ASR) system using the training text strings.
- Dependent claims (16, 17):
  - 16. The computing system of claim 15, wherein obtaining the benchmark classification distribution comprises:
    - transcribing benchmark utterances to respective benchmark text strings; and
    - determining the benchmark classification distribution from the benchmark text strings.
  - 17. The computing system of claim 15, wherein the topic classifications of the training text strings have a training text string classification distribution, and wherein the training text strings are selected such that the training text string classification distribution is substantially similar to the benchmark classification distribution.
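
As a reading aid (not claim text), a short expectation calculation suggests why the selection probability in claims 1, 11, and 15 tends to produce the distribution match recited in claims 9, 13, and 17. It assumes that each of the N(i) corpus strings of topic i is selected independently with probability p_i = t×B(i)/N(i), that p_i ≤ 1 for every topic, and that the benchmark frequencies B(i) sum to 1; the symbol p_i is introduced here for illustration only.

```latex
\[
p_i = \frac{t\,B(i)}{N(i)}, \qquad
\mathbb{E}\bigl[\text{number of selected strings of topic } i\bigr]
  = N(i)\,p_i = t\,B(i),
\]
\[
\mathbb{E}\bigl[\text{total number of selected strings}\bigr]
  = \sum_i t\,B(i) = t \sum_i B(i) = t .
\]
```

Under these assumptions the expected share of topic i in the training set is t×B(i)/t = B(i), regardless of how over- or under-represented the topic is in the corpus.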


Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Biadsy, Fadi; Moreno Mengibar, Pedro J.; Nakajima, Kaisuke; Bikel, Daniel Martin
Primary Examiner(s): Lerner, Martin
Application Number: US13/745,295
Publication Number: US 20130289989A1
Time in Patent Office: 1,047 Days
Field of Search: 704/243, 704/244, 704/245, 704/246, 704/255, 704/250
US Class Current: 1/1
CPC Class Codes: G10L 15/063 Training; G10L 15/183 using context dependencies,...
