Identifying multiple languages in a content item

US 10,180,935 B2
Filed: 02/02/2017
Issued: 01/15/2019
Est. Priority Date: 12/30/2016
Status: Active Grant

Abstract

A system for identifying language(s) for content items is disclosed. The system can identify different languages for word segments of a content item by identifying segment languages that maximize a probability across the segments. The probability can be a combination of: an author's likelihood for the language identified for the first word; a combination of transition frequencies for selected languages identified for words, the transition frequencies indicating likelihoods that a transition occurred to the selected language from the previous word's language; and a combination of observation probabilities indicating, for a given word in the content item, a likelihood the given word is in the identified language. For an in-vocabulary word, the observation probabilities can be based on a learned probability for that word. For an out-of-vocabulary word, the probability can be computed by breaking the word into overlapping n-grams and computing combined learned probabilities that each n-gram is in the given language.
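
The three factors named in the abstract can be written as a single objective. The notation below is an assumption introduced only for illustration (the source defines no symbols): w_1..w_n are the words, l_1..l_n the languages assigned to them, P_user the author's language likelihood, T the transition probability, and O the observation probability. It also assumes each "combination" is a product, which is the form claims 2, 5, and 19 recite.

```latex
\hat{l}_{1:n}
  = \arg\max_{l_{1:n}} \;
    P_{\text{user}}(l_1)
    \prod_{i=2}^{n} T(l_{i-1} \to l_i)
    \prod_{i=1}^{n} O(w_i, l_i)
```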

96 Citations

19 Claims

1. A method for improving language processing technologies by determining language segments of a content item, comprising:
- receiving a content item derived from a social network item, the content item comprising two or more words, wherein at least a first portion of the two or more words were composed in a first language and at least a second portion of the two or more words were composed in a second language different from the first language;
- tokenizing the content item into an ordered set of tokens comprising one or more tokens;
- identifying:
  - the first language for a first set of the one or more tokens by a machine learning model, and
  - the second language for a second set of the one or more tokens by the machine learning model,

  wherein the identifying is performed by maximizing a probability computed for the ordered set of tokens based on a combination of transition probabilities, a respective transition probability corresponding to each token after the first token in the ordered set of tokens, wherein each respective transition probability indicates a likelihood of switching from a language of a previous token to a language of a current token in the ordered set of tokens; and
- grouping consecutive ones of the one or more tokens into the language segments based on the identifying, wherein a first of the language segments corresponds to the first language and a second of the language segments corresponds to the second language.
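
The final grouping step of claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: `group_segments` is a hypothetical helper, and the per-token language labels are assumed to have already been produced by the identifying step.

```python
from itertools import groupby


def group_segments(tokens, languages):
    """Merge runs of consecutive tokens that share a language label."""
    if len(tokens) != len(languages):
        raise ValueError("tokens and languages must have the same length")
    segments = []
    # groupby only merges adjacent items, which is exactly the
    # "consecutive ones of the tokens" behavior the claim describes.
    for lang, run in groupby(zip(tokens, languages), key=lambda pair: pair[1]):
        segments.append((lang, [tok for tok, _ in run]))
    return segments


# Hypothetical labels for a five-token content item.
example = group_segments(
    ["good", "morning", "friends", "buenos", "dias"],
    ["en", "en", "en", "es", "es"],
)
```

Here `example` holds one English segment of three tokens followed by one Spanish segment of two tokens.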

2. The method of claim 1, wherein the combination of transition probabilities is a product of the transition probabilities.

3. The method of claim 1, wherein the probability computed for the ordered set of tokens is based on a combination of observation probabilities, one observation probability corresponding to each token in the ordered set of tokens, wherein each observation probability indicates a probability for a corresponding token of the ordered set of tokens that the corresponding token is in the language.

4. The method of claim 3, wherein an observation probability for a corresponding token is determined such that:
- where the corresponding token corresponds to a known word, the observation probability for the corresponding token is computed using an in-vocabulary distribution based on observed occurrences of that word appearing in various languages; and
- where the corresponding token does not correspond to a known word, the observation probability for the corresponding token is computed by dividing the token into one or more n-grams and computing a combination of probabilities for the one or more n-grams using an out-of-vocabulary distribution based on observed occurrences of the one or more n-grams appearing in various languages.
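
The two branches of claim 4 might look like the sketch below. Everything named here is an assumption for illustration: the lookup tables `vocab_probs` and `ngram_probs`, the character trigram size, the `floor` value for unseen entries, and the use of a geometric mean as the "combination" (the claim does not say which combination is used).

```python
import math


def char_ngrams(word, n=3):
    """Overlapping character n-grams; a word no longer than n yields itself."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def observation_prob(token, lang, vocab_probs, ngram_probs, n=3, floor=1e-6):
    """Probability that `token` is in `lang`.

    vocab_probs[word][lang]  : in-vocabulary distribution (assumed table)
    ngram_probs[ngram][lang] : out-of-vocabulary distribution (assumed table)
    """
    if token in vocab_probs:
        return vocab_probs[token].get(lang, floor)
    grams = char_ngrams(token, n)
    log_sum = sum(math.log(ngram_probs.get(g, {}).get(lang, floor)) for g in grams)
    # Geometric mean, so long and short words stay on a comparable scale.
    return math.exp(log_sum / len(grams))


# Hypothetical learned tables.
vocab_probs = {"hola": {"es": 0.98, "en": 0.02}}
ngram_probs = {
    "qui": {"es": 0.6, "en": 0.3},
    "uie": {"es": 0.5, "en": 0.1},
    "ier": {"es": 0.7, "en": 0.2},
    "ero": {"es": 0.8, "en": 0.3},
}
p_es = observation_prob("quiero", "es", vocab_probs, ngram_probs)
p_en = observation_prob("quiero", "en", vocab_probs, ngram_probs)
```

With these made-up tables, the out-of-vocabulary word "quiero" scores higher for "es" than for "en", while "hola" is read directly from the in-vocabulary table.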

5. The method of claim 3, wherein the combination of observation probabilities is a product of the observation probabilities.

6. The method of claim 1, wherein maximizing the probability computed for the ordered set of tokens is further based on a user language probability, wherein the user language probability indicates a probability that an author of the content item is facile with the language corresponding to the first token of the one or more tokens.
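
Claims 1 through 6 together describe scoring a candidate labeling with a user language probability, transition probabilities, and observation probabilities, then choosing the labeling with the highest score. One standard way to carry out such a maximization is Viterbi decoding over a hidden Markov model. The claims do not name an algorithm, so treating it as Viterbi is an assumption, as are the function name `viterbi`, its parameters, and the toy probabilities. The sketch works in log space so that the products in claims 2 and 5 become sums.

```python
import math


def viterbi(tokens, languages, prior, transition, observe):
    """Return the language sequence maximizing prior * transitions * observations.

    prior[lang]            : user language probability for the first token
    transition[prev][lang] : probability of switching from prev to lang
    observe(token, lang)   : probability that token is in lang
    All probabilities must be strictly positive, since logs are taken.
    """
    if not tokens:
        return []
    # score[lang] = best log-probability of any labeling ending in lang
    score = {
        lang: math.log(prior[lang]) + math.log(observe(tokens[0], lang))
        for lang in languages
    }
    back = []  # one dict of back-pointers per token after the first
    for tok in tokens[1:]:
        new_score, pointers = {}, {}
        for lang in languages:
            best_prev = max(
                languages,
                key=lambda prev: score[prev] + math.log(transition[prev][lang]),
            )
            new_score[lang] = (
                score[best_prev]
                + math.log(transition[best_prev][lang])
                + math.log(observe(tok, lang))
            )
            pointers[lang] = best_prev
        score = new_score
        back.append(pointers)
    # Follow the back-pointers from the best final language.
    path = [max(languages, key=lambda lang: score[lang])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical model parameters.
languages = ["en", "es"]
prior = {"en": 0.7, "es": 0.3}
transition = {"en": {"en": 0.9, "es": 0.1}, "es": {"en": 0.1, "es": 0.9}}
word_probs = {
    "good": {"en": 0.95, "es": 0.05},
    "morning": {"en": 0.95, "es": 0.05},
    "buenos": {"en": 0.05, "es": 0.95},
    "dias": {"en": 0.05, "es": 0.95},
}
labels = viterbi(
    ["good", "morning", "buenos", "dias"],
    languages,
    prior,
    transition,
    lambda tok, lang: word_probs[tok][lang],
)
```

With these toy numbers the decoder labels the first two tokens "en" and the last two "es": the strong observation probabilities outweigh the single low-probability switch.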

7. The method of claim 1, wherein the method further comprises, prior to tokenizing the content item into an ordered set of tokens, using pattern matching to:
- remove established patterns from the content item; or
- replace established patterns in the content item with whitespace.

8. The method of claim 7, wherein the established patterns include dates, times, email addresses, URLs, hashtags, emoji, emoticons, mentions, symbols, or non-words.

9. The method of claim 7, wherein the established patterns are replaced with an equivalent amount of whitespace such that the text boundaries of the content item are preserved.
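
Claims 7 through 9 can be illustrated with a regular-expression pass that overwrites each match with the same number of spaces, so character offsets into the original text stay valid. The patterns below are deliberately simple, hypothetical stand-ins for only a few of the categories listed in claim 8 (URLs, email addresses, hashtags, mentions); dates, times, emoji, and emoticons are not covered.

```python
import re

# Hypothetical patterns; real detectors would be considerably more careful.
PATTERNS = re.compile(
    r"https?://\S+"                 # URLs
    r"|[\w.+-]+@[\w-]+\.[\w.-]+"    # email addresses
    r"|[#@]\w+"                     # hashtags and mentions
)


def mask_patterns(text):
    """Replace each match with an equal number of spaces."""
    return PATTERNS.sub(lambda m: " " * len(m.group()), text)


text = "hola @ana check https://example.com #viernes"
masked = mask_patterns(text)
```

Because `masked` has the same length as `text`, a token found at some offset in the masked string sits at the same offset in the original content item.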

10. The method of claim 1, wherein tokenizing the content item into an ordered set of tokens comprises identifying each of the one or more tokens as a word from the content item.

11. The method of claim 10, wherein tokenizing the content item into an ordered set of tokens comprises:
- splitting the content item into the tokens by using whitespaces as boundaries; and
- running the tokens through a computer character library for locales that are not whitespace delimited.
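
The whitespace-splitting half of claim 11 can be sketched in a few lines. The hypothetical `tokenize` helper below also records character offsets, which is an addition for illustration rather than something the claim requires. The second step, passing tokens through a character library for locales that are not whitespace delimited, is not shown.

```python
import re


def tokenize(text):
    """Split on whitespace, keeping each token's (start, end) offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


tokens = tokenize("good morning   buenos dias")
```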

12. The method of claim 1, further comprising, prior to identifying languages for the one or more tokens:
- converting the tokens to all lower case letters;
- removing tokens from the ordered set of tokens that contain only numbers; and
- removing tokens comprising patterns of letters that repeat above a threshold amount.
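
The three clean-up steps of claim 12 could be sketched as below. The helper name `normalize_tokens`, the threshold default, and the choice to look only at repeating units of one to three letters are assumptions; the claim does not specify how repetition is measured.

```python
import re


def normalize_tokens(tokens, max_repeats=3):
    """Lowercase tokens, then drop numeric-only and heavily repeated ones."""
    # A unit of 1-3 letters followed by at least max_repeats more copies,
    # i.e. the unit occurs more than max_repeats times in a row.
    repeated = re.compile(r"([^\W\d_]{1,3})\1{%d,}" % max_repeats)
    kept = []
    for tok in tokens:
        tok = tok.lower()
        if tok.isdigit():
            continue  # contains only numbers
        if repeated.search(tok):
            continue  # e.g. "hahahaha" or "loooool"
        kept.append(tok)
    return kept


cleaned = normalize_tokens(["Hello", "2017", "hahahaha", "Banana", "loooool"])
```

Ordinary words with short internal repeats, such as "banana", stay below the threshold and are kept.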

13. The method of claim 1, wherein the probability computed for the ordered set of tokens is further based on a combination of observation probabilities with one observation probability corresponding to each token in the ordered set of tokens, wherein each observation probability indicates a probability for a corresponding token of the ordered set of tokens that the corresponding token is in the language.

14. The method of claim 1, further comprising:
- identifying a social media object where a first user who produced content of the social media object and a second user who received the social media object share a common language preference of a language;
- including the social media object as part of a training dataset labeled as in the language; and
- training a machine learning model for identifying the languages using the training dataset.
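
The labeling heuristic in claim 14 can be sketched as a filter over social media objects. The tuple layout and the simplification that each user has exactly one preferred language are assumptions for illustration; the training step itself is not shown.

```python
def build_training_set(objects):
    """Keep objects whose author and recipient share a language preference.

    `objects` is an iterable of (text, author_lang, recipient_lang) tuples,
    a hypothetical layout. Each kept text is labeled with the shared language.
    """
    examples = []
    for text, author_lang, recipient_lang in objects:
        if author_lang == recipient_lang:
            examples.append((text, author_lang))
    return examples


training = build_training_set([
    ("nos vemos manana", "es", "es"),
    ("see you tomorrow", "en", "es"),  # preferences differ, so it is skipped
    ("happy birthday", "en", "en"),
])
```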

15. A system for improving language processing technologies by determining language segments of a content item, comprising:
- an interface configured to receive a content item comprising two or more words, wherein at least a first portion of the two or more words were composed in a first language and at least a second portion of the two or more words were composed in a second language different from the first language;
- a tokenization module configured to tokenize the content item into an ordered set of tokens comprising one or more tokens;
- an inference engine configured to identify the first language for a first set of the one or more tokens and the second language for a second set of the one or more tokens, wherein the identifying is performed by maximizing a probability computed for the ordered set of tokens based on:
  - a combination of transition probabilities, a respective transition probability corresponding to each token after the first token in the ordered set of tokens, wherein each respective transition probability indicates a likelihood of switching from a language of a previous token to a language of a current token in the ordered set of tokens; and
- a segmentation module configured to group consecutive ones of the one or more tokens into the language segments based on the identifying, wherein a first of the language segments corresponds to the first language and a second of the language segments corresponds to the second language;
- wherein the language segments with corresponding language identifications are used in one or more language processing technologies including one or more of: machine translation, part-of-speech tagging, topic labeling, spell checking, or any combination thereof, thereby providing the improvement to the one or more language processing technologies.

16. The system of claim 15, wherein the inference engine is flexible to detect transition points at a token level and a sentence level, wherein the transition points are places where content changes from one language to another language.

17. The system of claim 15, wherein the inference engine is further configured to generate confidence levels associated with the identified languages for the one or more tokens.

18. A non-transitory computer-readable storage medium storing instructions that, when executed by a computing system, cause the computing system to perform operations for determining language segments of a content item, the operations comprising:
- receiving a content item comprising two or more words, wherein at least a first portion of the two or more words were composed in a first language and at least a second portion of the two or more words were composed in a second language different from the first language;
- tokenizing the content item into an ordered set of tokens comprising one or more tokens;
- identifying the first language for a first set of the one or more tokens and the second language for a second set of the one or more tokens, wherein the identifying is performed by maximizing a probability computed for the ordered set of tokens based on both:
  - a combination of transition probabilities, a respective transition probability corresponding to each token after the first token in the ordered set of tokens, wherein each respective transition probability indicates a likelihood of switching from a language of a previous token to a language of a current token in the ordered set of tokens, and
  - a combination of observation probabilities, one observation probability corresponding to each token in the ordered set of tokens, wherein each observation probability indicates a probability, for a selected token of the ordered set of tokens, that the selected token is in the language corresponding to the selected token; and
- grouping consecutive ones of the one or more tokens into the language segments based on the identifying, wherein a first of the language segments corresponds to the first language and a second of the language segments corresponds to the second language;
- wherein the language segments with the corresponding language identifications are used in one or more language processing technologies including one or more of: machine translation, part-of-speech tagging, topic labeling, spell checking, or any combination thereof, thereby providing an improvement to the one or more language processing technologies.

19. The non-transitory computer-readable storage medium of claim 18, wherein the combination of transition probabilities is a product of the transition probabilities, and wherein the combination of observation probabilities is a product of the observation probabilities.

Current Assignee: Meta Platforms, Inc. (f/k/a Facebook, Inc.)
Original Assignee: Meta Platforms, Inc. (f/k/a Facebook, Inc.)
Inventors: Merl, Daniel Matthew; Pal, Aditya; Funiak, Stanislav; Park, Seyoung; Huang, Fei; Herdagdelen, Amac
Primary Examiner(s): Roberts, Shaun
Application Number: US15/422,463
Publication Number: US 20180189259A1
Time in Patent Office: 712 Days
Field of Search: 704/1, 704/2, 704/8, 704/9
CPC Class Codes:
G06F 40/163   Handling of whitespace
G06F 40/263   Language identification
G06F 40/289   Phrasal analysis, e.g. fini...
