Machine learning dialect identification

US 9,899,020 B2
Filed: 09/23/2016
Issued: 02/20/2018
Est. Priority Date: 02/13/2015
Status: Active Grant

2 Assignments

0 Petitions

Abstract

Technology is disclosed for creating and tuning classifiers for language dialects and for generating dialect-specific language modules. A computing device can receive an initial training data set as a current training data set. The selection process for the initial training data set can be achieved by receiving one or more initial content items, establishing dialect parameters of each of the initial content items, and sorting each of the initial content items into one or more dialect groups based on the established dialect parameters. The computing device can generate, based on the initial training data set, a dialect classifier configured to detect language dialects of content items to be classified. The computing device can augment the current training data set with additional training data by applying the dialect classifier to candidate content items. The computing device can then update the dialect classifier based on the augmented current training data set.
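The train, augment, and update cycle described in the abstract can be sketched as a simple self-training loop. This is a minimal illustration only, not the patented implementation: the word-count classifier, the `threshold` and `rounds` parameters, and the sample sentences are all assumptions, since the abstract does not specify a model type or an acceptance rule.

```python
from collections import Counter, defaultdict
import math


def train(labeled):
    """Count words per dialect from (text, dialect) pairs."""
    counts = defaultdict(Counter)
    for text, dialect in labeled:
        counts[dialect].update(text.lower().split())
    return counts


def classify(model, text):
    """Return (dialect, confidence) from add-one smoothed word likelihoods."""
    vocab = {word for counter in model.values() for word in counter}
    log_scores = {}
    for dialect, counter in model.items():
        total = sum(counter.values()) + len(vocab) + 1
        log_scores[dialect] = sum(
            math.log((counter[word] + 1) / total) for word in text.lower().split()
        )
    # Normalize the scores into a confidence, assuming a uniform prior.
    top = max(log_scores.values())
    weights = {d: math.exp(s - top) for d, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


def bootstrap(initial, candidates, threshold=0.6, rounds=3):
    """Grow the training set with confidently classified candidates."""
    current = list(initial)      # initial set becomes the current set
    model = train(current)       # generate the classifier
    pool = list(candidates)
    for _ in range(rounds):
        accepted, remaining = [], []
        for text in pool:
            dialect, confidence = classify(model, text)
            if confidence >= threshold:
                accepted.append((text, dialect))
            else:
                remaining.append(text)
        if not accepted:
            break                # nothing new to learn from
        current.extend(accepted)  # augment the current training set
        model = train(current)    # update the classifier
        pool = remaining
    return model, current


# Hypothetical sample data, for illustration only.
initial = [("the colour of the lorry", "en-GB"), ("the color of the truck", "en-US")]
candidates = ["my favourite colour", "a red truck", "hello"]
model, training_set = bootstrap(initial, candidates)
```

In this sketch, candidates the classifier cannot place confidently (here, "hello") stay out of the training set, so the loop stops once a round accepts nothing new.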

207 Citations

25 Claims

1. A method, comprising:
- selecting, by a computing device, an initial training data set as a current training data set, wherein the initial training data set is selected by:
  
  receiving one or more initial content items; and
  
  establishing dialect parameters of one or more of the initial content items, the establishing comprising:
  
  identifying the one or more of the initial content items associated with one or more specified geographic locations identified as correlated to a dialect;
  
  establishing the one or more specified geographic locations as part of the dialect parameters;
  
  generating, by the computing device and based on the initial training data set, a dialect classifier configured to detect language dialects of content items to be classified;
  
  augmenting, by the computing device, the current training data set with additional training data by applying the dialect classifier to candidate content items;
  
  updating the dialect classifier based on the augmented current training data set; and
  
  applying the dialect classifier to transform an input in a source language to an output in a target language, an output in the source language, or an output in a dialect of the source language.
- - 2. The method of claim 1, wherein the establishing includes:
    - identifying content items associated with one or more specified locations identified as correlated to a dialect;
      
      identifying content items authored by one or more users identified as correlated to the dialect;
      
      identifying content items that use one or more n-grams, n-gram types, or word endings correlated to the dialect;
      
      identifying content items that use punctuation or grammar in a manner correlated to the dialect;
      
      or identifying content items that are correlated to the dialect based on user interaction with the content items; and
      
      sorting the initial content items into one or more dialect groups based on the established dialect parameters.
  - 3. The method of claim 1, further comprising:
    - evaluating the current training data set using the dialect classifier;
      
      identifying incorrectly classified training data items within the current training data set;
      
      updating the current training data set by removing the incorrectly classified training data items from the current training data set; and
      
      updating the dialect classifier based on the updated current training data set.
  - 4. The method of claim 1, further comprising:
    - determining that the dialect classifier has not been completed;
      
      identifying additional training data by applying the dialect classifier to additional candidate content items; and
      
      updating the dialect classifier based on an updated version of the current training data set that includes the additional training data.
  - 5. The method of claim 1, further comprising:
    - determining a language dialect of a content item by applying the dialect classifier on the content item.
  - 6. The method of claim 1, further comprising:
    - generating a language model for a language dialect based on the current training data set.
  - 7. The method of claim 6, further comprising:
    - translating a content item into the language dialect by using the language model.
  - 8. The method of claim 6, further comprising:
    - recognizing, using the language model, a dialect of an audio portion of a content item to convert into text.
  - 9. The method of claim 1, wherein sorting the initial content items comprises:
    - receiving content items with the dialect parameters;
      
      selecting a content item from the content items as a current content item;
      
      computing a value for dialect identification for the current content item; and
      
      classifying the current content item as in the dialect in the event that the value for dialect identification exceeds a threshold value.
  - 10. The method of claim 9, wherein the value for dialect identification does not exceed the threshold value and the sorting further comprises:
    - identifying a distinctive dialect parameter from the dialect parameters; and
      
      determining whether an existing dialect cluster of content items matches the distinctive dialect parameter.
  - 11. The method of claim 10, wherein sorting the initial content items further comprises:
    - in the event that no existing dialect cluster matches the distinctive dialect parameter, creating a new dialect cluster as a selected cluster;
      
      in the event that the existing dialect cluster matches the distinctive dialect parameter, setting the existing dialect cluster as the selected cluster; and
      
      adding the content item to the selected cluster.
  - 12. The method of claim 9, wherein sorting the initial content items further comprises:
    - determining whether there are additional content items;
      
      in the event that there are additional content items, setting a next content item from the received content items as the current content item; and
      
      in the event that there are no additional content items, returning the existing dialect clusters.
  - 13. The method of claim 1, wherein the dialect parameters identify a first initial content item of the one or more initial content items as being composed in a first dialect and identify a second initial content item of the one or more initial content items as being composed in a second dialect.
  - 14. The method of claim 13, wherein the generating comprises:
    - generating, by the computing device and based on the initial training data set and corresponding dialect parameters, the dialect classifier configured to detect language dialects of content items to be classified as being in one of two or more dialects, the two or more dialects including at least the first dialect and the second dialect.
  - 15. The method of claim 13, wherein the generating comprises:
    - augmenting, by the computing device, the current training data set with the additional training data by applying the dialect classifier to the candidate content items, wherein at least one of the candidate content items that is in the augmented current training data set was not included in the initial training data set.
  - 16. The method of claim 13, further comprising:
    - returning the updated dialect classifier, wherein the updated dialect classifier is configured to identify additional content items that are not in the initial training data set and are not in the augmented current training data set as being in one of the two or more dialects.
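The sorting procedure recited in claims 9 through 12 can be sketched as follows. This is a hedged illustration under stated assumptions: the function names (`sort_items`, `marker_score`, `location_key`), the marker word list, the choice of location as the distinctive dialect parameter, and the threshold of 0.3 are all hypothetical. The claims do not say how the identification value is computed or how the distinctive parameter is chosen.

```python
def sort_items(items, score, distinctive, threshold):
    """Split items into those classified as in the dialect and fallback clusters."""
    in_dialect, clusters = [], {}
    for item in items:                  # claim 12: loop until no items remain
        if score(item) > threshold:     # claim 9: value exceeds the threshold
            in_dialect.append(item)
            continue
        key = distinctive(item)         # claim 10: pick a distinctive parameter
        if key not in clusters:         # claim 11: no matching cluster, create one
            clusters[key] = []
        clusters[key].append(item)      # claim 11: add item to the selected cluster
    return in_dialect, clusters         # claim 12: return the clusters


# Hypothetical marker words and scoring rule, for illustration only.
MARKERS = {"aye", "wee", "cannae"}


def marker_score(item):
    """Fraction of words in the text that are marker words."""
    words = item["text"].lower().split()
    return sum(w in MARKERS for w in words) / len(words) if words else 0.0


def location_key(item):
    """Use the location parameter as the distinctive dialect parameter."""
    return item["params"].get("location", "unknown")


items = [
    {"text": "aye a wee bit", "params": {"location": "Glasgow"}},
    {"text": "good morning everyone", "params": {"location": "Leeds"}},
    {"text": "see you at lunch", "params": {"location": "Leeds"}},
    {"text": "nice weather today", "params": {}},
]
in_dialect, clusters = sort_items(items, marker_score, location_key, threshold=0.3)
```

With this sample data, the first item is classified as in the dialect, the two Leeds items share one cluster, and the item without a location falls into an "unknown" cluster.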

17. A non-transitory computer-readable storage medium storing instructions that, when executed by a computing system, cause the computing system to perform operations for creating a dialect-specific training data set, the operations comprising:
- selecting an initial training data set as a current training data set, wherein the initial training data set is selected by:
  
  receiving one or more initial content items;
  
  establishing dialect parameters of each of the initial content items, the establishing comprising:
  
  identifying the one or more of the initial content items associated with one or more specified geographic locations identified as correlated to a dialect;
  
  establishing the one or more specified geographic locations as part of the dialect parameters; and
  
  sorting each of the initial content items into one or more dialect groups based on the established dialect parameters;
  
  generating, based on the initial training data set, a dialect classifier configured to detect language dialects of content items to be classified;
  
  augmenting the current training data set with additional training data by applying the dialect classifier to candidate content items;
  
  updating the dialect classifier based on the augmented current training data set; and
  
  applying the dialect classifier to transform an input in a source language to an output in a target language, an output in the source language, or an output in a dialect of the source language.
- - 18. The non-transitory computer-readable storage medium of claim 17, wherein establishing the dialect parameters comprises:
    - identifying content items associated with one or more specified locations identified as correlated to a dialect.
  - 19. The non-transitory computer-readable storage medium of claim 17, wherein establishing the dialect parameters comprises:
    - identifying content items authored by one or more users identified as correlated to the dialect.
  - 20. The non-transitory computer-readable storage medium of claim 17, wherein establishing the dialect parameters comprises:
    - identifying content items that use one or more n-grams, n-gram types, or word endings correlated to the dialect.
  - 21. The non-transitory computer-readable storage medium of claim 17, wherein establishing the dialect parameters comprises:
    - identifying content items that use punctuation or grammar in a manner correlated to the dialect.
  - 22. The non-transitory computer-readable storage medium of claim 17, wherein establishing the dialect parameters comprises:
    - identifying content items that are correlated to the dialect based on user interaction with the content items.
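Claims 18 through 22 each name one kind of signal that can serve as a dialect parameter: location, author, n-grams or word endings, punctuation, and user interaction. One way to picture them together is a function that checks a content item against a per-dialect profile. The sketch below is an assumption-laden illustration: the `ContentItem` and `DialectProfile` types, their field names, and every profile value are made up for the example and do not come from the patent.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    text: str
    location: str = ""
    author: str = ""
    liked_by: set = field(default_factory=set)


@dataclass
class DialectProfile:
    locations: set     # claim 18: locations correlated to the dialect
    authors: set       # claim 19: users correlated to the dialect
    ngrams: set        # claim 20: words or word pairs
    endings: tuple     # claim 20: word endings
    punctuation: set   # claim 21: punctuation patterns
    audience: set      # claim 22: users whose interactions are a signal


def dialect_parameters(item, profile):
    """Return one boolean per signal type for the given item."""
    words = re.findall(r"[a-z']+", item.text.lower())
    # Unigrams plus bigrams joined by a space.
    grams = set(words) | {" ".join(pair) for pair in zip(words, words[1:])}
    return {
        "location": item.location in profile.locations,
        "author": item.author in profile.authors,
        "ngram": bool(grams & profile.ngrams),
        "ending": any(w.endswith(profile.endings) for w in words),
        "punctuation": any(mark in item.text for mark in profile.punctuation),
        "interaction": bool(item.liked_by & profile.audience),
    }


# Made-up profile values, for illustration only.
profile = DialectProfile(
    locations={"London", "Leeds"},
    authors={"user_17"},
    ngrams={"lorry", "car park"},
    endings=("ise", "our"),
    punctuation={"!!"},
    audience={"user_3"},
)
item = ContentItem(
    text="I'll organise the car park today!!",
    location="London",
    author="user_9",
    liked_by={"user_3"},
)
params = dialect_parameters(item, profile)
```

Here every signal except the author check fires for the sample item. How such signals would be weighted or combined is left open, as the claims treat each one separately.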

23. A computing device, comprising:
- an interface configured to receive one or more initial content items;
  
  a data bootstrapping module configured to select an initial training data set as a current training data set, wherein the data bootstrapping module selects the initial training data set by establishing dialect parameters of each of the initial content items and sorting each of the initial content items into one or more dialect groups based on the established dialect parameters, the establishing comprising:
  
  identifying the initial content items associated with one or more specified geographic locations identified as correlated to a dialect;
  
  establishing the one or more specified geographic locations as part of the dialect parameters;
  
  a dialect classifier generation module configured to generate a dialect classifier based on the initial training data set, the dialect classifier configured to detect language dialects of content items to be classified;
  
  a dialect classifier application module configured to augment the current training data set with additional training data by applying the dialect classifier to candidate content items;
  
  wherein the dialect classifier generation module is further configured to update the dialect classifier based on the augmented current training data set; and
  
  a language module configured to apply the dialect classifier to transform an input in a source language to an output in a target language, an output in the source language, or an output in a dialect of the source language.
- - 24. The computing device of claim 23, further comprising:
    - a crowd sourcing module configured to generate user inquiries regarding the dialect of the content items and to augment the current training data set based on results of the user inquiries.
  - 25. The computing device of claim 23, wherein the data bootstrapping module is configured to establish dialect parameters by identifying spelling of words in the content items that are distinctive for a language dialect.
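The crowd sourcing module of claim 24 generates user inquiries about the dialect of content items and augments the training set from the results. A minimal sketch of that idea, assuming a majority-vote rule, might look like the following. The function names, the inquiry wording, and the `min_votes` and `min_agreement` parameters are hypothetical; the claim does not describe how responses are aggregated.

```python
from collections import Counter


def make_inquiry(text, dialects):
    """Build a question asking a user which dialect a content item is in."""
    return {
        "text": text,
        "question": "Which dialect is this written in?",
        "options": sorted(dialects),
    }


def augment_from_responses(training_set, responses, min_votes=3, min_agreement=0.75):
    """Add items to the training set when enough users agree on a dialect.

    `responses` maps each content item's text to the list of dialect
    labels chosen by users.
    """
    augmented = list(training_set)
    for text, votes in responses.items():
        if len(votes) < min_votes:
            continue  # too few answers to trust
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= min_agreement:
            augmented.append((text, label))
    return augmented


# Hypothetical responses, for illustration only.
training_set = [("item 0", "pt-PT")]
responses = {
    "item 1": ["pt-BR", "pt-BR", "pt-BR", "pt-PT"],  # 75% agreement, accepted
    "item 2": ["pt-BR", "pt-PT"],                    # too few votes
    "item 3": ["pt-BR", "pt-PT", "pt-PT", "pt-BR"],  # split vote, rejected
}
augmented = augment_from_responses(training_set, responses)
```

Requiring both a minimum number of votes and a minimum agreement ratio is one design choice among many; it keeps ambiguous items out of the training data at the cost of discarding some user input.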

Current Assignee: Meta Platforms, Inc. (f/k/a Facebook, Inc.)
Original Assignee: Meta Platforms, Inc. (f/k/a Facebook, Inc.)
Inventors: Huang, Fei
Primary Examiner(s): SINGH, SATWANT K
Application Number: US15/275,235
Publication Number: US 20170011739A1
Time in Patent Office: 515 Days
Field of Search: None
US Class Current
CPC Class Codes

G06F 40/253   Grammatical analysis; Style...

G06F 40/263   Language identification

G06F 40/35   Discourse or dialogue repre...

G06F 40/40   Processing or translation o...

G10L 15/005   Language recognition

G10L 15/063   Training

G10L 15/26   Speech to text systems G10L...

G10L 2015/0633   using lexical or orthograph...

G10L 2015/0636   Threshold criteria for the ...
