Identifying cultural background from text

  • US 9,158,761 B2
  • Filed: 03/28/2013
  • Issued: 10/13/2015
  • Est. Priority Date: 03/28/2012
  • Status: Active Grant
First Claim

1. A method for determining a diaculture of text, comprising:

  • tokenizing words of the text with one or more processors according to a rule set to generate tokenized text, the rule set defining:

    a first set of grammatical types of words, which are words that are replaced, in the tokenizing, with tokens that respectively indicate a grammatical type of a respective word, and

    a second set of grammatical types of words, which are words that are passed, in the tokenizing, as tokens without changing;

    constructing grams from the tokenized text, each gram including one or more of consecutive tokens from the tokenized text;

    comparing the grams to a training data set that corresponds to a known diaculture to obtain a comparison result that indicates how well the text matches the training data set for the known diaculture, wherein the training data set includes a plurality of comments written by authors of the known diaculture;

    tokenizing words of the comments with one or more processors according to the rule set to generate tokenized comments;

    constructing grams from the tokenized comments, each gram including one or more of consecutive tokens from the tokenized comments;

    assigning scores to each of the grams of the tokenized comments according to one or more scoring methods that each define a relationship between a score of a gram, and a number of times the gram appears in the training data set and a baseline data set; and

    assigning scores to the grams of the tokenized text based on the scores assigned to the grams of the tokenized comments.
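
The steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. The rule set contents (`REPLACED`), the whitespace tokenizer, the smoothed log-ratio scoring method, the averaging in `score_text`, and all function names are assumptions chosen for illustration; the claim only requires that a gram's score relate to how often the gram appears in the training data set and in a baseline data set.

```python
import math
from collections import Counter

# Hypothetical rule set. Words listed here belong to the "first set" and are
# replaced by a token naming their grammatical type; every other word is
# treated as the "second set" and passed through unchanged.
REPLACED = {
    "dog": "<NOUN>", "cat": "<NOUN>", "truck": "<NOUN>",
    "runs": "<VERB>", "drives": "<VERB>",
}

def tokenize(text, replaced=REPLACED):
    """Lowercase, split on whitespace, and apply the rule set."""
    return [replaced.get(word, word) for word in text.lower().split()]

def grams(tokens, max_n=2):
    """Build grams of 1..max_n consecutive tokens, shortest grams first."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def gram_counts(comments, max_n=2):
    """Count grams over a collection of comments."""
    counts = Counter()
    for comment in comments:
        counts.update(grams(tokenize(comment), max_n))
    return counts

def score_grams(training, baseline):
    """Assumed scoring method: add-one smoothed log ratio of a gram's
    relative frequency in the training set versus the baseline set."""
    t_total = sum(training.values())
    b_total = sum(baseline.values())
    vocab = len(set(training) | set(baseline))
    scores = {}
    for gram, t_count in training.items():
        p_train = (t_count + 1) / (t_total + vocab)
        p_base = (baseline.get(gram, 0) + 1) / (b_total + vocab)
        scores[gram] = math.log(p_train / p_base)
    return scores

def score_text(text, scores, max_n=2):
    """Average the training-derived scores over the grams of the text;
    grams never seen in the training set contribute zero."""
    text_grams = grams(tokenize(text), max_n)
    if not text_grams:
        return 0.0
    return sum(scores.get(g, 0.0) for g in text_grams) / len(text_grams)
```

To use the sketch, build `gram_counts` for comments from the known diaculture and for a baseline corpus, pass both to `score_grams`, and call `score_text` on the text of interest. A higher result indicates a closer match to the training data set under this particular scoring choice.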

  • 7 Assignments