Generating speech data collection prompts
First Claim
1. A computer-implemented method comprising:
receiving, at a computer system, a request to generate a textual prompt to provide to a user for generating speech data in a particular language;
in response to receiving the request, determining frequencies of occurrence of linguistic features of the particular language in one or more corpora that are associated with the particular language, wherein the one or more corpora include content that was generated by people who use the particular language and that reflects current use of the particular language;
identifying, by the computer system, quantities of speech samples that include the linguistic features from a repository of previously recorded speech samples;
weighting the frequencies of occurrence of the linguistic features based on the quantities of speech samples that include the linguistic features, wherein the weighting generates weighted frequencies for the linguistic features, wherein a first linguistic feature is determined to have a weighted frequency that is greater than a weighted frequency for a second linguistic feature as a result of the computer system executing computer code that includes both of the following conditions and determining that one or more of the following conditions are satisfied;
(i) the first linguistic feature has a same or greater frequency of occurrence in the one or more corpora and has fewer speech samples in the repository of previously recorded speech samples than the second linguistic feature, and
(ii) the first linguistic feature has a greater frequency of occurrence in the one or more corpora and has the same or fewer speech samples in the repository of previously recorded speech samples than the second linguistic feature;
generating, by the computer system, one or more textual prompts based on the weighted frequencies for the linguistic features, wherein each of the one or more textual prompts comprises a combination of two or more of the linguistic features; and
providing, by the computer system, the generated one or more textual prompts.
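Conditions (i) and (ii) constrain the ordering of the weighted frequencies but do not fix a formula. The sketch below encodes both conditions as a predicate and checks one candidate weighting against it. The names `outranks` and `weighted_frequency`, the formula `freq / (1 + samples)`, and the sample figures are illustrative assumptions, not taken from the claim.

```python
def outranks(freq_a, samples_a, freq_b, samples_b):
    """Return True when condition (i) or condition (ii) holds for feature A over feature B."""
    cond_i = freq_a >= freq_b and samples_a < samples_b    # same or greater frequency, fewer samples
    cond_ii = freq_a > freq_b and samples_a <= samples_b   # greater frequency, same or fewer samples
    return cond_i or cond_ii


def weighted_frequency(freq, samples):
    # Assumed formula: increases with corpus frequency, decreases with the
    # number of recorded samples. Requires freq > 0 so the ordering is strict.
    return freq / (1 + samples)


# Hypothetical features: name -> (corpus frequency, recorded sample count)
features = {
    "now": (0.020, 0),
    "call": (0.020, 5),
    "taxi": (0.008, 0),
    "play": (0.008, 3),
}
weights = {name: weighted_frequency(f, n) for name, (f, n) in features.items()}

# Whenever either condition holds for a pair, the weights must be ordered accordingly.
for a, (fa, na) in features.items():
    for b, (fb, nb) in features.items():
        if outranks(fa, na, fb, nb):
            assert weights[a] > weights[b], (a, b)
```

Any function that strictly increases with frequency and strictly decreases with sample count would pass the same check; the ratio above is only one such choice.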
Abstract
This document generally describes computer technologies relating to generating speech data collection prompts, such as textual scripts and/or textual scenarios. Speech data collection prompts for a particular language can be generated based on a variety of factors, including the frequency with which linguistic elements (e.g., phonemes, syllables, words, phrases) in the particular language occur in one or more corpora of textual information associated with the particular language. Textual prompts can additionally or alternatively be generated based on statistics for previously recorded speech data.
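As a rough illustration of the frequency counting described above, the following sketch computes relative frequencies for words (n = 1) and two-word phrases (n = 2) over a small set of texts. The function names, the regular-expression tokenizer, and the sample sentences are assumptions made for the example; phonemes and syllables would additionally need a pronunciation lexicon, which is not shown.

```python
from collections import Counter
import re


def tokenize(text):
    # Simplistic tokenizer (assumption): lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def element_frequencies(corpora, n):
    """Relative frequency of each n-token element across all corpus texts."""
    counts = Counter()
    for text in corpora:
        tokens = tokenize(text)
        # Elements never span two texts.
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    return {element: count / total for element, count in counts.items()} if total else {}


corpora = ["Call me a taxi.", "Call me later, please."]  # hypothetical corpus texts
words = element_frequencies(corpora, 1)
phrases = element_frequencies(corpora, 2)
print(words["call"], phrases["call me"])
```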
128 Citations
19 Claims
18. A computer system comprising:
one or more computing devices;
an interface of the one or more computing devices that is programmed to receive requests to generate a textual prompt to provide to a user for generating speech data in a particular language;
one or more corpora that are accessible to the one or more computing devices and that include content that was generated by people who use the particular language and that reflects current use of the particular language;
a frequency module that is installed on the one or more computing devices and that is programmed to determine frequencies of occurrence of linguistic features of the particular language in the one or more corpora;
a repository of previously recorded speech samples that are accessible to the one or more computing devices and that is separate from the one or more corpora;
a quantity module that is installed on the one or more computing devices and that is programmed to identify quantities of speech samples that include the linguistic features from the repository of previously recorded speech samples;
a weighting module that is installed on the one or more computing devices and that is programmed to weight the frequencies of occurrence of the linguistic features based on the quantities of speech samples that include the linguistic features, wherein the weighting generates weighted frequencies for the linguistic features; and
a textual prompt generator that is installed on the one or more computing devices and that is programmed to generate one or more textual prompts based on the weighted frequencies for the linguistic features, wherein each of the one or more textual prompts comprises a combination of two or more of the linguistic features,
wherein the weighting module is further programmed to generate a weighted frequency for a first linguistic feature that is greater than a weighted frequency for a second linguistic feature as a result of executing computer code that includes both of the following conditions and determining that one or more of the following conditions are satisfied;
(i) the first linguistic feature has a same or greater frequency of occurrence in the one or more corpora and has fewer speech samples in the repository of previously recorded speech samples than the second linguistic feature, and
(ii) the first linguistic feature has a greater frequency of occurrence in the one or more corpora and has the same or fewer speech samples in the repository of previously recorded speech samples than the second linguistic feature.
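One way to picture the module arrangement recited in claim 18 is as four small components wired in sequence. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, linguistic features are reduced to whitespace-separated words, the repository is represented by transcripts of the recorded samples, and the prompt is simply the top-weighted features joined together rather than a natural script or scenario.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FrequencyModule:
    corpora: list  # texts reflecting current use of the language

    def frequencies(self):
        counts = Counter(word for text in self.corpora for word in text.lower().split())
        total = sum(counts.values())
        return {word: count / total for word, count in counts.items()}


@dataclass
class QuantityModule:
    repository: list  # transcripts of recorded samples, held separately from the corpora

    def quantities(self, features):
        transcripts = [set(t.lower().split()) for t in self.repository]
        return {f: sum(f in words for words in transcripts) for f in features}


class WeightingModule:
    def weigh(self, frequencies, quantities):
        # Assumed formula: frequency discounted by the number of existing samples.
        return {f: freq / (1 + quantities.get(f, 0)) for f, freq in frequencies.items()}


class TextualPromptGenerator:
    def generate(self, weights, features_per_prompt=2):
        # Combine the highest-weighted features into one prompt (placeholder strategy).
        ranked = sorted(weights, key=weights.get, reverse=True)
        return " ".join(ranked[:features_per_prompt])


corpora = ["call taxi now", "call mom now", "play music now"]   # hypothetical
repository = ["call mom", "call dad now", "play music"]         # hypothetical

freqs = FrequencyModule(corpora).frequencies()
quants = QuantityModule(repository).quantities(freqs)
weights = WeightingModule().weigh(freqs, quants)
prompt = TextualPromptGenerator().generate(weights)
print(prompt)
```

In this toy data, "taxi" has no recorded samples, so it outranks the more frequent but already well-covered "call".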
19. A computer program product embodied in a non-transitory computer-readable storage device storing instructions that, when executed, cause a computer system with one or more processors to perform operations comprising:
receiving a request to generate a textual prompt to provide to a user for generating speech data in a particular language;
in response to receiving the request, determining frequencies of occurrence of linguistic features of the particular language in one or more corpora that are associated with the particular language, wherein the one or more corpora include content that was generated by people who use the particular language and that reflects current use of the particular language;
identifying quantities of speech samples from a repository of previously recorded speech samples that include the linguistic features;
weighting the frequencies of occurrence of the linguistic features based on the quantities of speech samples that include the linguistic features, wherein the weighting generates weighted frequencies for the linguistic features, wherein a first linguistic feature is determined to have a weighted frequency that is greater than a weighted frequency for a second linguistic feature as a result of executing computer code that includes both of the following conditions and determining that one or more of the following conditions are satisfied;
(i) the first linguistic feature has a same or greater frequency of occurrence in the one or more corpora and has fewer speech samples in the repository of previously recorded speech samples than the second linguistic feature, and
(ii) the first linguistic feature has a greater frequency of occurrence in the one or more corpora and has the same or fewer speech samples in the repository of previously recorded speech samples than the second linguistic feature;
generating one or more textual prompts based on the weighted frequencies for the linguistic features, wherein each of the one or more textual prompts comprises a combination of two or more of the linguistic features; and
providing the generated one or more textual prompts.
Specification