Classifying languages for objects and entities
First Claim
1. A system for providing a language classification of a media item, comprising:
one or more processors; and
a memory storing instructions that, when executed by the system, cause the system to perform operations for providing a language classification of a media item, the operations implementing:
a context classifier to determine a context characteristic indicating one or more users who have interacted with the media item;
wherein the context characteristic corresponds to a computed likelihood that the media item is in one or more languages based on determined language abilities of the users who have interacted with the media item; and
wherein the context classifier is further to compute, based on the determined context characteristic and corresponding computed likelihood, a context prediction that the media item is in one or more first languages;
a trained classifier to compute a trained prediction that the media item is in one or more second languages;
wherein computing the trained prediction comprises an n-gram analysis of the media item, for one or more n-grams in the media item having a particular length, which analyzes a specified probability distribution that the n-gram is in a specific language; and
a language classifier to combine the context prediction with the trained prediction.
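As a rough illustration of the system claim 1 recites, a context prediction derived from the determined language abilities of interacting users can be merged with a trained n-gram prediction. The sketch below is a minimal assumed implementation: the function names, the additive n-gram scoring, and the weighted mixture are all illustrative choices, not the patent's actual method.

```python
from collections import Counter

def context_prediction(interacting_users, user_language_models):
    """Likelihood the item is in each language, based on the determined
    language abilities of users who interacted with it (claim 1's
    context characteristic). Abilities are assumed to be 0..1 weights."""
    counts = Counter()
    for user in interacting_users:
        for lang, ability in user_language_models.get(user, {}).items():
            counts[lang] += ability
    total = sum(counts.values()) or 1.0
    return {lang: c / total for lang, c in counts.items()}

def trained_prediction(text, ngram_distributions, n=3):
    """Score each language by summing, over the item's n-grams of length n,
    the probability distribution that each n-gram is in that language."""
    scores = Counter()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        for lang, p in ngram_distributions.get(gram, {}).items():
            scores[lang] += p
    total = sum(scores.values()) or 1.0
    return {lang: s / total for lang, s in scores.items()}

def language_classification(text, interacting_users, user_language_models,
                            ngram_distributions, weight=0.5):
    """Combine the context prediction with the trained prediction;
    a simple weighted mixture stands in for the claimed combination."""
    ctx = context_prediction(interacting_users, user_language_models)
    trn = trained_prediction(text, ngram_distributions)
    langs = set(ctx) | set(trn)
    return {lang: weight * ctx.get(lang, 0.0)
                  + (1 - weight) * trn.get(lang, 0.0)
            for lang in langs}
```

Here the first languages (context) and second languages (trained) need not coincide; the mixture covers their union.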
1 Assignment
0 Petitions
Abstract
Technology for media item and user language classification is disclosed. Media item classification may use models for associating language identifiers or probability distributions for multiple languages with linguistic content. User language classification may define user language models for attributing to users indications of languages they speak, read, and/or write. The text classifications and user classifications may interact because the probability that given text is in a particular language may depend on a determined likelihood that the user who produced the text speaks that language; conversely, a user interacting with text in a particular language may increase the likelihood they understand that language. Some embodiments use language-tagged social media content to train n-gram classifiers for use with other social media content.
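The interaction the abstract describes runs in both directions: the author's language model nudges a text's language distribution, and classified text in turn updates the user's model. The sketch below assumes simple multiplicative reweighting and exponential-style updates; the names and update rules are illustrative assumptions, not the disclosed embodiments.

```python
def adjust_text_prediction(text_probs, author_langs, alpha=0.5):
    """Reweight a text's language distribution by the author's determined
    language abilities (0..1 weights), then renormalize. A language the
    author likely speaks gains probability mass."""
    adjusted = {lang: p * (1.0 + alpha * author_langs.get(lang, 0.0))
                for lang, p in text_probs.items()}
    total = sum(adjusted.values())
    return {lang: p / total for lang, p in adjusted.items()}

def update_user_model(user_langs, text_probs, rate=0.1):
    """Move the user's language model toward the language distribution of
    content they interacted with, at a small learning rate."""
    langs = set(user_langs) | set(text_probs)
    return {lang: (1 - rate) * user_langs.get(lang, 0.0)
                  + rate * text_probs.get(lang, 0.0)
            for lang in langs}
```

For example, an ambiguous text from a known English speaker shifts toward English, and a user who repeatedly interacts with French content accumulates weight for French.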
212 Citations
13 Claims
1. (Claim 1 as set forth above under First Claim.) - View Dependent Claims (2, 3, 4, 5)
6. A method for building an n-gram classifier trained for analysis of social media content items, comprising:
storing training media items gathered from a social media source, wherein at least some of the training media items are associated with a language identifier indicating a language the media item is in, wherein each language identifier that indicates the language the media item is in is assigned based on one or more of:
a language model, associated with a user who created the media item, indicating that the user who created the media item is mono-linguistic; or
a common language identified by both a first language model associated with the user who created the media item and by a second language model associated with a user who received the media item; and
generating, for each of multiple selected n-grams from the training media items, a corresponding probability distribution that a particular media item is in a particular language given that the n-gram is in the particular media item;
wherein at least one probability distribution corresponding to one of the multiple selected n-grams is based on an analysis of a frequency with which that n-gram occurs in a subset of the training media items having the same language identifier. - View Dependent Claims (7, 8, 9)
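The generating step of claim 6 can be pictured as counting, for each n-gram, how often it occurs in training items sharing each language identifier, then normalizing those frequencies into a per-n-gram distribution over languages. The sketch below is an assumed minimal implementation; the function name, the `(text, language_identifier)` input shape, and the plain frequency normalization are illustrative, not the claimed method's specifics.

```python
from collections import Counter, defaultdict

def ngram_distributions(training_items, n=3):
    """training_items: iterable of (text, language_identifier) pairs,
    where the identifier was assigned per claim 6 (mono-linguistic
    creator, or a language common to creator and recipient models).

    Returns, for each n-gram of length n, an estimate of
    P(language | n-gram appears in the item), based on the frequency
    with which the n-gram occurs in items sharing each language tag."""
    lang_counts = defaultdict(Counter)  # n-gram -> language -> frequency
    for text, lang in training_items:
        for i in range(len(text) - n + 1):
            lang_counts[text[i:i + n]][lang] += 1
    dists = {}
    for gram, counts in lang_counts.items():
        total = sum(counts.values())
        dists[gram] = {lang: c / total for lang, c in counts.items()}
    return dists
```

The resulting table is exactly the kind of per-n-gram probability distribution a trained classifier such as the one in claim 1 could consult at inference time. Claim 10 recites the same operations as a non-transitory computer-readable medium.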
10. A non-transitory computer readable storage medium storing instructions that, in response to being executed by a computing device, cause the computing device to perform operations for building an n-gram classifier trained for analysis of social media content items, the operations comprising:
storing training media items gathered from a social media source, wherein at least some of the training media items are associated with a language identifier indicating a language the media item is in, wherein each language identifier that indicates the language the media item is in is assigned based on one or more of:
a language model, associated with a user who created the media item, indicating that the user who created the media item is mono-linguistic; or
a common language identified by both a first language model associated with the user who created the media item and by a second language model associated with a user who received the media item; and
generating, for each of multiple selected n-grams from the training media items, a corresponding probability distribution that a particular media item is in a particular language given that the n-gram is in the particular media item;
wherein at least one probability distribution corresponding to one of the multiple selected n-grams is based on an analysis of a frequency with which that n-gram occurs in a subset of the training media items having the same language identifier. - View Dependent Claims (11, 12, 13)
Specification