Classifying languages for objects and entities
First Claim
1. A system for providing a language classification of a media item, comprising:
one or more processors; and
a memory storing instructions that, when executed by the system, cause the system to perform operations for providing a language classification of a media item, the operations implementing:
a context classifier to determine a context characteristic indicating one or more users who have interacted with the media item;
wherein the context characteristic corresponds to a computed likelihood that the media item is in one or more languages based on determined language abilities of the users who have interacted with the media item; and
wherein the context classifier is further to compute, based on the determined context characteristic and corresponding computed likelihood, a context prediction that the media item is in one or more first languages;
a trained classifier to compute a trained prediction that the media item is in one or more second languages;
wherein computing the trained prediction comprises an n-gram analysis of the media item, for one or more n-grams in the media item having a particular length, which analyzes a specified probability distribution that the n-gram is in a specific language; and
a language classifier to combine the context prediction with the trained prediction.
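As a rough illustration of the system claim 1 recites, a context prediction derived from the determined language abilities of interacting users can be merged with a trained n-gram prediction. The sketch below is a minimal assumed implementation: the function names, the additive n-gram scoring, and the weighted mixture are all illustrative choices, not the patent's actual method.

```python
from collections import Counter

def context_prediction(interacting_users, user_language_models):
    """Likelihood the item is in each language, based on the determined
    language abilities of users who interacted with it (claim 1's
    context characteristic). Abilities are assumed to be 0..1 weights."""
    counts = Counter()
    for user in interacting_users:
        for lang, ability in user_language_models.get(user, {}).items():
            counts[lang] += ability
    total = sum(counts.values()) or 1.0
    return {lang: c / total for lang, c in counts.items()}

def trained_prediction(text, ngram_distributions, n=3):
    """Score each language by summing, over the item's n-grams of length n,
    the probability distribution that each n-gram is in that language."""
    scores = Counter()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        for lang, p in ngram_distributions.get(gram, {}).items():
            scores[lang] += p
    total = sum(scores.values()) or 1.0
    return {lang: s / total for lang, s in scores.items()}

def language_classification(text, interacting_users, user_language_models,
                            ngram_distributions, weight=0.5):
    """Combine the context prediction with the trained prediction;
    a simple weighted mixture stands in for the claimed combination."""
    ctx = context_prediction(interacting_users, user_language_models)
    trn = trained_prediction(text, ngram_distributions)
    langs = set(ctx) | set(trn)
    return {lang: weight * ctx.get(lang, 0.0)
                  + (1 - weight) * trn.get(lang, 0.0)
            for lang in langs}
```

Here the first languages (context) and second languages (trained) need not coincide; the mixture covers their union.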
1 Assignment
0 Petitions
Abstract
Technology for media item and user language classification is disclosed. Media item classification may use models for associating language identifiers or probability distributions for multiple languages with linguistic content. User language classification may define user language models for attributing to users indications of languages they speak, read, and/or write. The text classifications and user classifications may interact because the probability that given text is in a particular language may depend on a determined likelihood that the user who produced the text speaks that language; conversely, a user interacting with text in a particular language may increase the likelihood they understand that language. Some embodiments use language-tagged social media content to train n-gram classifiers for use with other social media content.
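The interaction the abstract describes runs in both directions: the author's language model nudges a text's language distribution, and classified text in turn updates the user's model. The sketch below assumes simple multiplicative reweighting and exponential-style updates; the names and update rules are illustrative assumptions, not the disclosed embodiments.

```python
def adjust_text_prediction(text_probs, author_langs, alpha=0.5):
    """Reweight a text's language distribution by the author's determined
    language abilities (0..1 weights), then renormalize. A language the
    author likely speaks gains probability mass."""
    adjusted = {lang: p * (1.0 + alpha * author_langs.get(lang, 0.0))
                for lang, p in text_probs.items()}
    total = sum(adjusted.values())
    return {lang: p / total for lang, p in adjusted.items()}

def update_user_model(user_langs, text_probs, rate=0.1):
    """Move the user's language model toward the language distribution of
    content they interacted with, at a small learning rate."""
    langs = set(user_langs) | set(text_probs)
    return {lang: (1 - rate) * user_langs.get(lang, 0.0)
                  + rate * text_probs.get(lang, 0.0)
            for lang in langs}
```

For example, an ambiguous text from a known English speaker shifts toward English, and a user who repeatedly interacts with French content accumulates weight for French.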
212 Citations
13 Claims
1. (Claim 1 as set forth above under First Claim.) - View Dependent Claims (2, 3, 4, 5)
6. A method for building an n-gram classifier trained for analysis of social media content items, comprising:
storing training media items gathered from a social media source, wherein at least some of the training media items are associated with a language identifier indicating a language the media item is in, wherein each language identifier that indicates the language the media item is in is assigned based on one or more of:
a language model, associated with a user who created the media item, indicating that the user who created the media item is mono-linguistic; or
a common language identified by both a first language model associated with the user who created the media item and by a second language model associated with a user who received the media item; and
generating, for each of multiple selected n-grams from the training media items, a corresponding probability distribution that a particular media item is in a particular language given that the n-gram is in the particular media item;
wherein at least one probability distribution corresponding to one of the multiple selected n-grams is based on an analysis of a frequency with which that n-gram occurs in a subset of the training media items having the same language identifier. - View Dependent Claims (7, 8, 9)
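The generating step of claim 6 can be pictured as counting, for each n-gram, how often it occurs in training items sharing each language identifier, then normalizing those frequencies into a per-n-gram distribution over languages. The sketch below is an assumed minimal implementation; the function name, the `(text, language_identifier)` input shape, and the plain frequency normalization are illustrative, not the claimed method's specifics.

```python
from collections import Counter, defaultdict

def ngram_distributions(training_items, n=3):
    """training_items: iterable of (text, language_identifier) pairs,
    where the identifier was assigned per claim 6 (mono-linguistic
    creator, or a language common to creator and recipient models).

    Returns, for each n-gram of length n, an estimate of
    P(language | n-gram appears in the item), based on the frequency
    with which the n-gram occurs in items sharing each language tag."""
    lang_counts = defaultdict(Counter)  # n-gram -> language -> frequency
    for text, lang in training_items:
        for i in range(len(text) - n + 1):
            lang_counts[text[i:i + n]][lang] += 1
    dists = {}
    for gram, counts in lang_counts.items():
        total = sum(counts.values())
        dists[gram] = {lang: c / total for lang, c in counts.items()}
    return dists
```

The resulting table is exactly the kind of per-n-gram probability distribution a trained classifier such as the one in claim 1 could consult at inference time. Claim 10 recites the same operations as a non-transitory computer-readable medium.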
10. A non-transitory computer readable storage medium storing instructions that, in response to being executed by a computing device, cause the computing device to perform operations for building an n-gram classifier trained for analysis of social media content items, the operations comprising:
storing training media items gathered from a social media source, wherein at least some of the training media items are associated with a language identifier indicating a language the media item is in, wherein each language identifier that indicates the language the media item is in is assigned based on one or more of:
a language model, associated with a user who created the media item, indicating that the user who created the media item is mono-linguistic; or
a common language identified by both a first language model associated with the user who created the media item and by a second language model associated with a user who received the media item; and
generating, for each of multiple selected n-grams from the training media items, a corresponding probability distribution that a particular media item is in a particular language given that the n-gram is in the particular media item;
wherein at least one probability distribution corresponding to one of the multiple selected n-grams is based on an analysis of a frequency with which that n-gram occurs in a subset of the training media items having the same language identifier. - View Dependent Claims (11, 12, 13)
Specification