Machine learning dialect identification
Abstract
Technology is disclosed for creating and tuning classifiers for language dialects and for generating dialect-specific language modules. A computing device can receive an initial training data set as a current training data set. The selection process for the initial training data set can be achieved by receiving one or more initial content items, establishing dialect parameters of each of the initial content items, and sorting each of the initial content items into one or more dialect groups based on the established dialect parameters. The computing device can generate, based on the initial training data set, a dialect classifier configured to detect language dialects of content items to be classified. The computing device can augment the current training data set with additional training data by applying the dialect classifier to candidate content items. The computing device can then update the dialect classifier based on the augmented current training data set.
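The selection step described in the abstract (establish dialect parameters, then sort items into dialect groups) can be sketched as follows. This is only an illustrative reading: the `ContentItem` type, the `LOCATION_TO_DIALECT` table, and both function names are hypothetical and do not come from the patent, and a real system would likely use richer signals than a single location string.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical table of locations assumed to correlate with a dialect.
LOCATION_TO_DIALECT = {
    "Lisbon": "pt-PT",
    "Porto": "pt-PT",
    "Sao Paulo": "pt-BR",
    "Rio de Janeiro": "pt-BR",
}


@dataclass
class ContentItem:
    text: str
    location: Optional[str] = None  # e.g. the author's stated location
    dialect_params: dict = field(default_factory=dict)


def establish_dialect_params(item: ContentItem) -> ContentItem:
    """Record the location as a dialect parameter when it maps to a dialect."""
    dialect = LOCATION_TO_DIALECT.get(item.location)
    if dialect is not None:
        item.dialect_params = {"locations": [item.location], "dialect": dialect}
    return item


def sort_into_dialect_groups(items: List[ContentItem]) -> Dict[str, List[ContentItem]]:
    """Group items by established dialect; items without a dialect are skipped."""
    groups = defaultdict(list)
    for item in map(establish_dialect_params, items):
        dialect = item.dialect_params.get("dialect")
        if dialect is not None:
            groups[dialect].append(item)
    return dict(groups)
```

Items whose location is missing or not in the table are left out of the initial training data set in this sketch, so the seed set stays small but comparatively clean.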
25 Claims
1. A method, comprising:
- selecting, by a computing device, an initial training data set as a current training data set, wherein the initial training data set is selected by:
  - receiving one or more initial content items; and
  - establishing dialect parameters of one or more of the initial content items, the establishing comprising:
    - identifying the one or more of the initial content items associated with one or more specified geographic locations identified as correlated to a dialect;
    - establishing the one or more specified geographic locations as part of the dialect parameters;
- generating, by the computing device and based on the initial training data set, a dialect classifier configured to detect language dialects of content items to be classified;
- augmenting, by the computing device, the current training data set with additional training data by applying the dialect classifier to candidate content items;
- updating the dialect classifier based on the augmented current training data set; and
- applying the dialect classifier to transform an input in a source language to an output in a target language, an output in the source language, or an output in a dialect of the source language.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
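The generate, augment, and update steps of claim 1 resemble a self-training loop. The sketch below is one possible way to realize them, not the patented implementation: the unigram model, the confidence `threshold`, the `rounds` limit, and all function names are assumptions introduced for illustration.

```python
import math
from collections import Counter


def train(labeled):
    """Build per-dialect word counts from (text, dialect) pairs."""
    counts = {}
    for text, dialect in labeled:
        counts.setdefault(dialect, Counter()).update(text.lower().split())
    return counts


def classify(model, text):
    """Return (dialect, confidence) from add-one-smoothed unigram likelihoods."""
    vocab = set().union(*model.values())
    log_scores = {}
    for dialect, counts in model.items():
        # One extra slot in the denominator accounts for unseen words.
        total = sum(counts.values()) + len(vocab) + 1
        log_scores[dialect] = sum(
            math.log((counts[w] + 1) / total) for w in text.lower().split()
        )
    # Normalize to a posterior under a uniform prior over dialects.
    top = max(log_scores.values())
    weights = {d: math.exp(s - top) for d, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


def self_train(initial, candidates, threshold=0.8, rounds=3):
    """Generate a classifier, augment the training set with it, then update it."""
    current = list(initial)       # the current training data set
    model = train(current)        # generate the initial classifier
    remaining = list(candidates)
    for _ in range(rounds):
        accepted, kept = [], []
        for text in remaining:
            dialect, confidence = classify(model, text)
            if confidence >= threshold:
                accepted.append((text, dialect))
            else:
                kept.append(text)
        if not accepted:
            break
        current.extend(accepted)  # augment the current training data set
        remaining = kept
        model = train(current)    # update the classifier
    return model, current
```

Only candidates classified with high confidence are added in each round, which limits how quickly labeling mistakes can feed back into the updated classifier.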
17. A non-transitory computer-readable storage medium storing instructions that, when executed by a computing system, cause the computing system to perform operations for creating a dialect-specific training data set, the operations comprising:
- selecting an initial training data set as a current training data set, wherein the initial training data set is selected by:
  - receiving one or more initial content items;
  - establishing dialect parameters of each of the initial content items, the establishing comprising:
    - identifying the one or more of the initial content items associated with one or more specified geographic locations identified as correlated to a dialect;
    - establishing the one or more specified geographic locations as part of the dialect parameters; and
  - sorting each of the initial content items into one or more dialect groups based on the established dialect parameters;
- generating, based on the initial training data set, a dialect classifier configured to detect language dialects of content items to be classified;
- augmenting the current training data set with additional training data by applying the dialect classifier to candidate content items;
- updating the dialect classifier based on the augmented current training data set; and
- applying the dialect classifier to transform an input in a source language to an output in a target language, an output in the source language, or an output in a dialect of the source language.
Dependent claims: 18, 19, 20, 21, 22
23. A computing device, comprising:
- an interface configured to receive one or more initial content items;
- a data bootstrapping module configured to select an initial training data set as a current training data set, wherein the data bootstrapping module selects the initial training data set by establishing dialect parameters of each of the initial content items and sorting each of the initial content items into one or more dialect groups based on the established dialect parameters, the establishing comprising:
  - identifying the initial content items associated with one or more specified geographic locations identified as correlated to a dialect;
  - establishing the one or more specified geographic locations as part of the dialect parameters;
- a dialect classifier generation module configured to generate a dialect classifier based on the initial training data set, the dialect classifier configured to detect language dialects of content items to be classified;
- a dialect classifier application module configured to augment the current training data set with additional training data by applying the dialect classifier to candidate content items;
- wherein the dialect classifier generation module is further configured to update the dialect classifier based on the augmented current training data set; and
- a language module configured to apply the dialect classifier to transform an input in a source language to an output in a target language, an output in the source language, or an output in a dialect of the source language.
Dependent claims: 24, 25
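The claims do not spell out how a language module uses the classifier's output. One plausible reading is that the detected dialect selects a dialect-specific resource before the input is transformed. The sketch below illustrates that reading only: `DIALECT_LEXICONS`, `make_language_module`, and the keyword-based stand-in classifier are all hypothetical, and word-for-word substitution stands in for a real translation engine.

```python
from typing import Callable, Dict, Tuple

# Hypothetical dialect-specific English-to-French lexicons (toy data).
DIALECT_LEXICONS: Dict[str, Dict[str, str]] = {
    "en-GB": {"lorry": "camion", "lift": "ascenseur"},
    "en-US": {"truck": "camion", "elevator": "ascenseur"},
}


def keyword_classifier(text: str) -> str:
    """Stand-in for a trained dialect classifier (assumption for the demo)."""
    british_markers = {"lorry", "lift"}
    return "en-GB" if british_markers & set(text.lower().split()) else "en-US"


def make_language_module(
    classify: Callable[[str], str],
    lexicons: Dict[str, Dict[str, str]],
) -> Callable[[str], Tuple[str, str]]:
    """Return a transform that picks a lexicon by detected dialect."""

    def transform(text: str) -> Tuple[str, str]:
        dialect = classify(text)
        lexicon = lexicons.get(dialect, {})
        # Substitute known words; leave everything else unchanged.
        words = [lexicon.get(word, word) for word in text.lower().split()]
        return dialect, " ".join(words)

    return transform
```

Passing the classifier in as a plain callable keeps the language module independent of how the classifier was generated or updated, which mirrors the separation of modules in claim 23.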
Specification