Automatically Creating Training Data For Language Identifiers
Abstract
Example apparatus and methods concern automatically creating labeled training data for automatic language identifiers. One embodiment includes logic to produce a predicted language classification for a post from geographic data associated with the post. The post may be associated with a micro-blog, a social media site, or another electronic communication service that traffics in short messages having frequent colloquialisms, non-standard spelling, emoticons, and unique usages of characters to convey meaning. The embodiment includes logic to produce an actual language classification for the post using a base language classifier. The embodiment also includes logic to selectively add the post and a language label for the post to an automatically generated set of labeled training data upon determining that the predicted language classification matches the actual language classification. The automatically generated labeled training data may then be used to build target language models, which may include a target language classifier.
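The agreement check described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `Post` type, the `COUNTRY_TO_LANGUAGE` table, and the keyword-based `toy_base_classifier`, which merely stands in for a real base language classifier.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Post:
    text: str                # message portion of the post
    country: Optional[str]   # geographic attribute, separate from the message


# Hypothetical mapping from a country code to its predominant language.
COUNTRY_TO_LANGUAGE = {"FR": "fr", "DE": "de", "BR": "pt", "JP": "ja"}


def predict_from_geo(post: Post) -> Optional[str]:
    """Predicted classification: uses only the attribute, never the classifier."""
    if post.country is None:
        return None
    return COUNTRY_TO_LANGUAGE.get(post.country)


def toy_base_classifier(text: str) -> str:
    """Stand-in for a base language classifier (assumption: keyword lookup)."""
    lowered = text.lower()
    if "bonjour" in lowered:
        return "fr"
    if "hallo" in lowered:
        return "de"
    return "en"


def build_training_data(
    posts: Iterable[Post],
    base_classifier: Callable[[str], str],
) -> List[Tuple[Post, str]]:
    """Keep a post only when the geo prediction and the base classifier agree."""
    labeled: List[Tuple[Post, str]] = []
    for post in posts:
        predicted = predict_from_geo(post)
        if predicted is None:
            continue  # no usable attribute, so no prediction can be made
        actual = base_classifier(post.text)
        if predicted == actual:
            labeled.append((post, actual))
    return labeled
```

In this sketch, a post geotagged in France is labeled only when the classifier also reports French. A disagreement, or a missing geotag, leaves the post out of the training data.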
20 Claims
1. A method, comprising:
accessing a target corpus of electronic communications associated with an electronic communication service;
identifying a member of the target corpus that includes an attribute from which a predicted classification of the member can be made, the attribute being separate from a message portion of the member;
accessing the predicted classification of the member, where the predicted classification is a function of the attribute and where the predicted classification is made without reference to a base classifier;
accessing an actual classification of the member, where the actual classification is made by the base classifier, the base classifier being configured to classify communications associated with the electronic communication service; and
upon determining that the predicted classification matches the actual classification, adding a labeled member to a target training corpus stored in a data store, the labeled member comprising the member and data representing the actual classification.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A computer-readable storage medium storing computer-executable instructions that when executed by a computer control the computer to perform a method, the method comprising:
constructing a base language corpus from a publicly available source of labeled documents that include user-generated content;
deriving base language models for a plurality of languages from the base language corpus, where the base language models include base language classifiers configured to be able to identify documents from a communication service in the plurality of languages;
identifying a possible classification of a document in a target language corpus, where the possible classification is a function of supporting evidence associated with the document, and where the possible classification does not rely on a base language classifier, where the target language corpus comprises documents from the communication service;
producing an actual classification of the document, where the actual classification relies on a base language classifier;
upon determining that the actual classification does not match the possible classification, discarding the document;
upon determining that the actual classification does match the possible classification, adding the document and a label for the document to a filtered language corpus; and
upon determining that the filtered language corpus has reached a threshold size, deriving target language models for the plurality of languages from the filtered language corpus, where the target language models include target language classifiers configured to identify documents from the target corpus in the plurality of languages.
Dependent claims: 14, 15
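The discard, accumulate, and retrain steps of claim 13 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the per-language character-trigram counts in `derive_models` are a toy stand-in for whatever language models an embodiment would derive, and the names `filter_and_train`, `possible_from_evidence`, and `threshold` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def trigrams(text: str) -> List[str]:
    """Character trigrams of the lowercased text, with space padding."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def derive_models(labeled: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Toy model derivation (assumption): trigram counts per language label."""
    models: Dict[str, Counter] = defaultdict(Counter)
    for text, language in labeled:
        models[language].update(trigrams(text))
    return dict(models)


def classify(models: Dict[str, Counter], text: str) -> str:
    """Pick the language whose counts best cover the text's trigrams."""
    grams = trigrams(text)
    # sorted() makes tie-breaking deterministic (alphabetical order).
    return max(sorted(models), key=lambda lang: sum(models[lang][g] for g in grams))


def filter_and_train(
    documents: Iterable[Tuple[str, str]],               # (text, supporting evidence)
    possible_from_evidence: Callable[[str], Optional[str]],
    base_classifier: Callable[[str], str],
    threshold: int,
) -> Optional[Dict[str, Counter]]:
    """Return target models once the filtered corpus reaches the threshold size."""
    filtered: List[Tuple[str, str]] = []
    for text, evidence in documents:
        possible = possible_from_evidence(evidence)   # does not use the classifier
        actual = base_classifier(text)                # relies on the base classifier
        if possible != actual:
            continue                                  # mismatch: discard the document
        filtered.append((text, actual))               # match: keep document and label
        if len(filtered) >= threshold:
            return derive_models(filtered)
    return None  # not enough agreeing documents yet
```

In this sketch the base classifier is passed in as a callable, so it could itself be built from `derive_models` over a base corpus via `lambda text: classify(base_models, text)`. The function returns `None` until enough agreeing documents have accumulated.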
16. An apparatus configured to automatically produce and store, without supervision, labeled training data for automated language identification, comprising:
a processor;
a memory;
a set of logics configured to produce the labeled training data; and
an interface to connect the processor, the memory, and the set of logics;
the set of logics comprising:
a first logic configured to produce a predicted language classification for a post to a micro-blog or social media site, the post being less than a threshold number of characters, where the predicted language classification is produced without using a base language classifier and where the predicted language classification depends, at least in part, on supporting evidence associated with the post;
a second logic configured to produce an actual language classification for the post, where the actual language classification is produced by the base language classifier without reference to the supporting evidence; and
a third logic configured to selectively add the post and a language label for the post to the labeled training data upon determining that the predicted language classification matches the actual language classification, the labeled training data being electronic data stored in a data store.
Dependent claims: 17, 18, 19, 20
Specification