Automatically Creating Training Data For Language Identifiers

US 20150006148A1
Filed: 07/17/2013
Published: 01/01/2015
Est. Priority Date: 06/27/2013
Status: Abandoned Application

Abstract

Example apparatus and methods concern automatically creating labeled training data for automatic language identifiers. One embodiment includes logic to produce a predicted language classification for a post from geographic data associated with the post. The post may be associated with a micro-blog, a social media site, or other electronic communication service that traffics in short messages having frequent colloquialisms, non-standard spelling, emoticons, and unique usages of characters to convey meaning. The embodiment includes logic to produce an actual language classification for the post using a base language classifier. The embodiment includes logic to selectively add the post and a language label for the post to automatically generated labeled training data upon determining that the predicted language classification matches the actual language classification. The automatically generated labeled training data may then be used to build target language models, which may include a target language classifier.
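The agreement filter described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the applicants' implementation: the `COUNTRY_TO_LANGUAGE` table, the `predict_from_geo` and `build_labeled_data` names, and the dictionary layout of a post are all assumptions made for the example, and the base classifier is passed in as any callable that maps message text to a language code.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical lookup from a post's country code to its most likely language.
COUNTRY_TO_LANGUAGE = {"FR": "fr", "DE": "de", "JP": "ja"}


def predict_from_geo(post: dict) -> Optional[str]:
    """Predicted classification: uses only geographic metadata, never the text."""
    return COUNTRY_TO_LANGUAGE.get(post.get("country"))


def build_labeled_data(
    posts: Iterable[dict],
    base_classifier: Callable[[str], str],
) -> List[Tuple[dict, str]]:
    """Keep a post only when the geographic prediction and the base classifier agree."""
    labeled = []
    for post in posts:
        predicted = predict_from_geo(post)
        if predicted is None:
            continue  # no usable attribute, so no prediction can be made
        actual = base_classifier(post["text"])  # looks only at the message text
        if predicted == actual:
            labeled.append((post, actual))  # the member plus its label
    return labeled
```

Posts whose two classifications disagree are simply dropped, so the resulting training data trades coverage for label reliability.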

235 Citations

20 Claims

1. A method, comprising:
- accessing a target corpus of electronic communications associated with an electronic communication service;
  
  identifying a member of the target corpus that includes an attribute from which a predicted classification of the member can be made, the attribute being separate from a message portion of the member;
  
  accessing the predicted classification of the member, where the predicted classification is a function of the attribute and where the predicted classification is made without reference to a base classifier;
  
  accessing an actual classification of the member, where the actual classification is made by the base classifier, the base classifier being configured to classify communications associated with the electronic communication service; and
  
  upon determining that the predicted classification matches the actual classification:
  
  adding a labeled member to a target training corpus stored in a data store, the labeled member comprising the member and data representing the actual classification.
  - 2. The method of claim 1, where the electronic communication service is an online social networking service or micro-blogging service.
  - 3. The method of claim 1, where the communications in the target corpus are characterized by non-standard spellings, by non-standard spellings intended to convey emphasis, by the use of vernacular expressions, or by a length shorter than a threshold length.
  - 4. The method of claim 1, where the attribute from which the predicted classification of the member can be made comprises geographic information.
  - 5. The method of claim 1, where the predicted classification and the actual classification concern a language in which the member is written.
  - 6. The method of claim 1, where the predicted classification and the actual classification concern a demographic associated with a writer of the member.
  - 7. The method of claim 1, where the base classifier is a statistical language classifier that relies on base language models built from a base corpus of labeled documents.
  - 8. The method of claim 7, comprising building the base corpus of labeled documents from a publicly available online source that includes labeled user-generated content.
  - 9. The method of claim 8, comprising:
    - upon determining that the target training corpus is sufficient for training a target classifier in target training models derived from the target training corpus, where the target classifier is to classify communications associated with the electronic communication service;
      
      deriving the target training models using the target training corpus, and storing the target training models in the data store.
  - 10. The method of claim 9, comprising controlling the target classifier to classify an electronic communication associated with the electronic communication service.
  - 11. The method of claim 10, the base classifier being configured to classify electronic communications from the electronic communication service as belonging to one of at least fifty different languages with an accuracy of at least ninety percent, the target classifier being configured to classify communications from the electronic communication service as belonging to one of the at least fifty different languages with an accuracy of at least ninety-five percent.
  - 12. The method of claim 1, comprising selectively updating the target training corpus or the target training models upon detecting an update event, the update event being a change in a language in which communications can be written in the electronic communication service, the appearance of a new hash tag in a language in which communications can be written in the electronic communication service, the passage of a threshold amount of time, or the processing of a threshold number of members of the target corpus.
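The four update events listed in claim 12 could be checked with something like the sketch below. Every name here (`UpdateState`, `detect_update_event`, the field names, and the returned event strings) is hypothetical, and "a change in a language" is interpreted, as an assumption, to mean that the set of languages supported by the service has changed.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set


@dataclass
class UpdateState:
    """Bookkeeping kept since the training corpus was last rebuilt (illustrative)."""
    known_languages: Set[str]
    max_age_seconds: float
    max_members: int
    known_hashtags: Set[str] = field(default_factory=set)
    last_update: float = 0.0
    members_processed: int = 0


def detect_update_event(
    state: UpdateState,
    now: float,
    service_languages: Iterable[str],
    hashtag: Optional[str] = None,
) -> Optional[str]:
    """Return the name of the first update event that fired, or None."""
    if set(service_languages) != state.known_languages:
        return "language_change"
    if hashtag is not None and hashtag not in state.known_hashtags:
        return "new_hashtag"
    if now - state.last_update >= state.max_age_seconds:
        return "time_elapsed"
    if state.members_processed >= state.max_members:
        return "member_count"
    return None
```

A caller would run this check as members are processed and rebuild the target training corpus or models whenever it returns something other than `None`.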

13. A computer-readable storage medium storing computer-executable instructions that when executed by a computer control the computer to perform a method, the method comprising:
- constructing a base language corpus from a publicly available source of labeled documents that include user-generated content;
  
  deriving base language models for a plurality of languages from the base language corpus, where the base language models include base language classifiers configured to be able to identify documents from a communication service in the plurality of languages;
  
  identifying a possible classification of a document in a target language corpus, where the possible classification is a function of supporting evidence associated with the document, and where the possible classification does not rely on a base language classifier, where the target language corpus comprises documents from the communication service;
  
  producing an actual classification of the document, where the actual classification relies on a base language classifier;
  
  upon determining that the actual classification does not match the possible classification, discarding the document;
  
  upon determining that the actual classification does match the possible classification, adding the document and a label for the document to a filtered language corpus; and
  
  upon determining that the filtered language corpus has reached a threshold size;
  
  deriving target language models for the plurality of languages from the filtered language corpus, where the target language models include target language classifiers configured to identify documents from the target corpus in the plurality of languages.
  - 14. The computer-readable storage medium of claim 13, the method comprising:
    - iterating, until a termination condition is reached;
      
      establishing the base language classifier for an iteration I+1 as the target language classifier of iteration I, I being an integer greater than zero;
      
      establishing the base language corpus for iteration I+1 as the filtered language corpus of iteration I;
      
      rebuilding a filtered language corpus for iteration I+1; and
      
      rebuilding the target language classifier for iteration I+1.
  - 15. The computer-readable storage medium of claim 14, the termination condition being reaching a threshold number of iterations, spending a threshold amount of time training, reaching a desired accuracy, or detecting a lower than desired rate of convergence in classifier accuracy.
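The iteration in claims 14 and 15 amounts to a bootstrapping loop in which each round's target classifier and filtered corpus become the next round's base classifier and base corpus. The following is a hedged sketch only: `bootstrap`, its parameters, and the three callables it takes are hypothetical, and the time-budget termination condition from claim 15 is omitted for brevity.

```python
from typing import Any, Callable, Tuple


def bootstrap(
    base_classifier: Any,
    base_corpus: Any,
    build_filtered_corpus: Callable[[Any], Any],
    train: Callable[[Any], Any],
    evaluate: Callable[[Any], float],
    max_iterations: int = 5,
    target_accuracy: float = 0.95,
    min_gain: float = 0.001,
) -> Tuple[Any, Any, int]:
    """Repeat filter-then-train until one of the termination conditions holds."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    classifier, corpus = base_classifier, base_corpus
    previous_accuracy = None
    for iteration in range(1, max_iterations + 1):
        filtered = build_filtered_corpus(classifier)  # agreement-filtered corpus
        target = train(filtered)                      # target classifier for this round
        accuracy = evaluate(target)
        # This round's outputs become the base classifier and corpus of the next.
        classifier, corpus = target, filtered
        if accuracy >= target_accuracy:
            break  # desired accuracy reached
        if previous_accuracy is not None and accuracy - previous_accuracy < min_gain:
            break  # accuracy is converging more slowly than desired
        previous_accuracy = accuracy
    return classifier, corpus, iteration
```

The loop also stops on its own once `max_iterations` rounds have run, which corresponds to the threshold-number-of-iterations condition.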

16. An apparatus configured to automatically produce and store, without supervision, labeled training data for automated language identification, comprising:
- a processor;
  
  a memory;
  
  a set of logics configured to produce the labeled training data; and
  
  an interface to connect the processor, the memory, and the set of logics;
  
  the set of logics comprising:
  
  a first logic configured to produce a predicted language classification for a post to a micro-blog or social media site, the post being less than a threshold number of characters, where the predicted language classification is produced without using a base language classifier and where the predicted language classification depends, at least in part, on supporting evidence associated with the post;
  
  a second logic configured to produce an actual language classification for the post, where the actual language classification is produced by the base language classifier without reference to the supporting evidence; and
  
  a third logic configured to selectively add the post and a language label for the post to the labeled training data upon determining that the predicted language classification matches the actual language classification, the labeled training data being electronic data stored in a data store.
  - 17. The apparatus of claim 16, comprising a fourth logic configured to:
    - assemble a set of base language documents from online, publicly available labeled documents having user-generated content; and
      
      derive a plurality of base language models from the set of base language documents, where the base language models include base language classifiers configured to identify, with a first accuracy, the language of posts to the micro-blog or social media site.
  - 18. The apparatus of claim 17, the fourth logic being configured to:
    - derive a plurality of target language models from the labeled training data, where a target language model includes a target language classifier configured to identify, with a second accuracy greater than the first accuracy, the language of posts to the micro-blog or social media site.
  - 19. The apparatus of claim 18, the fourth logic being configured:
    - to selectively control the apparatus to produce and store additional labeled training data for automated language identification and to derive a new target language model as a function of the additional labeled training data until a training termination condition for the new target language model is satisfied, where the additional labeled training data is produced after substituting the target language classifier for the base language classifier; and
      
      to selectively control the apparatus to produce and store new labeled training data upon determining that an update threshold has been met, the update threshold being associated with a change to one of the languages associated with the plurality of base language models, a time period, or a number of posts classified by the target language classifier.
  - 20. The apparatus of claim 16, where the supporting evidence is geographic data associated with the post or profile information associated with the author of the post.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Najork, Marc; Paparizos, Stelios; Goldszmit, Moises
Application Number: US13/943,788
Publication Number: US 20150006148A1
US Class Current: 704/8
CPC Class Codes: G06F 40/263 Language identification
