Sampling training data for an automatic speech recognition system based on a benchmark classification distribution
First Claim
1. A method comprising:
obtaining a benchmark classification distribution of topic classifications for benchmark text strings;
selecting, by a computing device, training text strings from a corpus of text strings, wherein the training text strings are associated with respective topic classifications, and wherein selecting the training text strings includes (a) determining to select t training text strings, (b) determining that a frequency of topic i in the benchmark classification distribution is B(i), wherein B(i) is inclusively between 0 and 1, (c) determining that a number of text strings classified with topic i in the corpus of text strings is N(i), and (d) selecting a training text string of topic i from the corpus of text strings based on probability t × B(i)/N(i); and
training a language model of an automatic speech recognition (ASR) system using the training text strings.
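As an illustration of step (d), the following Python sketch reads the probability t × B(i)/N(i) as an independent draw for each text string in the corpus. The function name, the (text, topic) pair representation, and the cap of the probability at 1 are assumptions made for this example; the claim itself does not specify them.

```python
import random
from collections import Counter


def sample_training_strings(corpus, benchmark_dist, t, rng=None):
    """Select training strings so that topic i is drawn with probability t * B(i) / N(i).

    corpus:         list of (text, topic) pairs (hypothetical representation)
    benchmark_dist: dict mapping topic i to its benchmark frequency B(i)
    t:              target number of training strings
    """
    rng = rng or random.Random()
    # N(i): number of corpus strings classified with topic i.
    counts = Counter(topic for _, topic in corpus)
    selected = []
    for text, topic in corpus:
        b = benchmark_dist.get(topic, 0.0)
        # Assumption: cap at 1 so the value is a valid probability
        # even when t * B(i) exceeds N(i).
        p = min(1.0, t * b / counts[topic])
        if rng.random() < p:
            selected.append((text, topic))
    return selected
```

Under this reading, topic i contributes N(i) × t × B(i)/N(i) = t × B(i) strings in expectation, so the sample holds about t strings in total when the B(i) sum to 1 and no probability is capped.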
Abstract
A set of benchmark text strings may be classified to provide a set of benchmark classifications. The benchmark text strings in the set may correspond to a benchmark corpus of benchmark utterances in a particular language. A benchmark classification distribution of the set of benchmark classifications may be determined. A respective classification for each text string in a corpus of text strings may also be determined. Text strings from the corpus of text strings may be sampled to form a training corpus of training text strings such that the classifications of the training text strings have a training text string classification distribution that is based on the benchmark classification distribution. The training corpus of training text strings may be used to train an automatic speech recognition (ASR) system.
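The benchmark classification distribution described in the abstract can be pictured as the normalized frequency of each classification among the benchmark text strings. The sketch below assumes the strings have already been classified and that each carries a single label; the function name and the list-of-labels input are hypothetical.

```python
from collections import Counter


def classification_distribution(classifications):
    """Return the fraction of items assigned to each classification.

    classifications: iterable of labels, one per classified text string.
    """
    counts = Counter(classifications)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Each frequency lies between 0 and 1, and the frequencies sum to 1.
    return {label: n / total for label, n in counts.items()}
```

Applying the same function to the labels of a sampled training corpus gives a distribution that can be compared against the benchmark one.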
17 Claims
1. A method comprising:
obtaining a benchmark classification distribution of topic classifications for benchmark text strings;
selecting, by a computing device, training text strings from a corpus of text strings, wherein the training text strings are associated with respective topic classifications, and wherein selecting the training text strings includes (a) determining to select t training text strings, (b) determining that a frequency of topic i in the benchmark classification distribution is B(i), wherein B(i) is inclusively between 0 and 1, (c) determining that a number of text strings classified with topic i in the corpus of text strings is N(i), and (d) selecting a training text string of topic i from the corpus of text strings based on probability t × B(i)/N(i); and
training a language model of an automatic speech recognition (ASR) system using the training text strings.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10.
11. An article of manufacture including a non-transitory computer-readable medium, having stored thereon program instructions that, upon execution by a computing device, cause the computing device to perform operations comprising:
obtaining a benchmark classification distribution of topic classifications for benchmark text strings;
selecting training text strings from a corpus of text strings, wherein the training text strings are associated with respective topic classifications, and wherein selecting the training text strings includes (a) determining to select t training text strings, (b) determining that a frequency of topic i in the benchmark classification distribution is B(i), wherein B(i) is inclusively between 0 and 1, (c) determining that a number of text strings classified with topic i in the corpus of text strings is N(i), and (d) selecting a training text string of topic i from the corpus of text strings based on probability t × B(i)/N(i); and
training a language model of an automatic speech recognition (ASR) system using the training text strings.
Dependent claims: 12, 13, 14.
15. A computing system comprising:
at least one processor;
data storage; and
program instructions in the data storage that, upon execution by the at least one processor, cause the computing system to perform operations comprising:
obtaining a benchmark classification distribution of topic classifications for benchmark text strings;
selecting training text strings from a corpus of text strings, wherein the training text strings are associated with respective topic classifications, and wherein selecting the training text strings includes (a) determining to select t training text strings, (b) determining that a frequency of topic i in the benchmark classification distribution is B(i), wherein B(i) is inclusively between 0 and 1, (c) determining that a number of text strings classified with topic i in the corpus of text strings is N(i), and (d) selecting a training text string of topic i from the corpus of text strings based on probability t × B(i)/N(i); and
training a language model of an automatic speech recognition (ASR) system using the training text strings.
Dependent claims: 16, 17.
Specification