Identifying cultural background from text
First Claim
1. A method for determining a diaculture of text, comprising:
- tokenizing words of the text with one or more processors according to a rule set to generate tokenized text, the rule set defining:
a first set of grammatical types of words, which are words that are replaced, in the tokenizing, with tokens that respectively indicate a grammatical type of a respective word, and
a second set of grammatical types of words, which are words that are passed, in the tokenizing, as tokens without changing;
constructing grams from the tokenized text, each gram including one or more of consecutive tokens from the tokenized text;
comparing the grams to a training data set that corresponds to a known diaculture to obtain a comparison result that indicates how well the text matches the training data set for the known diaculture, wherein the training data set includes a plurality of comments written by authors of the known diaculture;
tokenizing words of the comments with one or more processors according to the rule set to generate tokenized comments;
constructing grams from the tokenized comments, each gram including one or more of consecutive tokens from the tokenized comments;
assigning scores to each of the grams of the tokenized comments according to one or more scoring methods that each define a relationship between a score of a gram, and a number of times the gram appears in the training data set and a baseline data set; and
assigning scores to the grams of the tokenized text based on the scores assigned to the grams of the tokenized comments.
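The scoring steps recited in the claim can be illustrated with a short sketch. This is not the patented scoring method: it assumes one possible "relationship" between a gram's score and its counts in the training and baseline data sets, namely a smoothed log ratio of relative frequencies. The function names `score_grams` and `score_text`, the add-one smoothing, and the averaging step are all assumptions introduced for illustration.

```python
import math
from collections import Counter


def score_grams(training_grams, baseline_grams):
    """Score each training gram by comparing its frequency in the
    training data set with its frequency in the baseline data set.

    Assumption: the score is an add-one smoothed log ratio. The claim
    only requires that the score depend on both counts.
    """
    train_counts = Counter(training_grams)
    base_counts = Counter(baseline_grams)
    train_total = sum(train_counts.values())
    base_total = sum(base_counts.values())
    vocab = len(set(train_counts) | set(base_counts))

    scores = {}
    for gram, n_train in train_counts.items():
        p_train = (n_train + 1) / (train_total + vocab)
        p_base = (base_counts[gram] + 1) / (base_total + vocab)
        scores[gram] = math.log(p_train / p_base)
    return scores


def score_text(text_grams, gram_scores):
    """Assign each text gram the score of the matching training gram
    and return the average. Grams unseen in training contribute 0.0
    (an assumption of this sketch)."""
    if not text_grams:
        return 0.0
    total = sum(gram_scores.get(gram, 0.0) for gram in text_grams)
    return total / len(text_grams)


# Toy data: grams are tuples of tokens.
training = [("y'all",), ("y'all",), ("the",)]
baseline = [("the",), ("the",), ("you",)]
scores = score_grams(training, baseline)
result = score_text([("y'all",), ("unseen",)], scores)
```

In this toy data, a gram that is more frequent in the training comments than in the baseline receives a positive score, and a gram that is more frequent in the baseline receives a negative one, so a higher average suggests a closer match to the known diaculture.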
Abstract
Diaculture of text can be determined or analyzed by tokenizing words of the text according to a rule set to generate tokenized text, the rule set defining: a first set of grammatical types of words, which are words that are replaced with tokens that respectively indicate a grammatical type of a respective word, and a second set of grammatical types of words, which are words that are passed as tokens without changing. Grams can be constructed from the tokenized text, each gram including one or more of consecutive tokens from the tokenized text. The grams can be compared to a training data set that corresponds to a known diaculture to obtain a comparison result that indicates how well the text matches the training data set for the known diaculture.
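The tokenizing and gram-construction steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the toy `LEXICON` stands in for a real part-of-speech tagger, the choice of nouns and verbs as the replaced types is hypothetical, and words missing from the lexicon are simply passed through unchanged. None of these names or choices come from the patent.

```python
# Hypothetical toy lexicon mapping words to grammatical types.
# A real system would use a part-of-speech tagger instead.
LEXICON = {
    "dog": "NOUN",
    "cat": "NOUN",
    "chased": "VERB",
    "the": "DET",
    "a": "DET",
    "i": "PRON",
}

# First set of the rule set: grammatical types whose words are replaced
# by a token naming the type. Every other word belongs to the second
# set (in this sketch) and is passed through unchanged.
REPLACED_TYPES = {"NOUN", "VERB"}


def tokenize(text):
    """Replace words of the first set with a type token and pass
    all other words as tokens without changing them."""
    tokens = []
    for word in text.lower().split():
        pos = LEXICON.get(word)
        if pos in REPLACED_TYPES:
            tokens.append(f"<{pos}>")
        else:
            tokens.append(word)
    return tokens


def build_grams(tokens, max_n=2):
    """Build grams of 1 to max_n consecutive tokens, as tuples."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams


tokens = tokenize("The dog chased a cat")
grams = build_grams(tokens, max_n=2)
```

Replacing content words with their grammatical type while keeping function words intact means the grams capture phrasing patterns rather than topic vocabulary, which is the intuition the abstract describes.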
18 Claims
1. A method for determining a diaculture of text, as recited in full under First Claim above. Dependent claims: 2 to 16.
17. A system for determining a diaculture of text, comprising computer hardware, including a central processor and memory, which is configured to:
- tokenize words of the text with one or more processors according to a rule set to generate tokenized text, the rule set defining:
a first set of grammatical types of words, which are words that are replaced with tokens that respectively indicate a grammatical type of a respective word, and
a second set of grammatical types of words, which are words that are passed as tokens without changing;
construct grams from the tokenized text, each gram including one or more of consecutive tokens from the tokenized text; and
compare the grams to a training data set that corresponds to a known diaculture to obtain a comparison result that indicates how well the text matches the training data set for the known diaculture, wherein the training data set includes a plurality of comments written by authors of the known diaculture;
tokenizing words of the comments with one or more processors according to the rule set to generate tokenized comments;
constructing grams from the tokenized comments, each gram including one or more of consecutive tokens from the tokenized comments;
assigning scores to each of the grams of the tokenized comments according to one or more scoring methods that each define a relationship between a score of a gram, and a number of times the gram appears in the training data set and a baseline data set; and
assigning scores to the grams of the tokenized text based on the scores assigned to the grams of the tokenized comments.
18. A processing machine for determining a diaculture of text, comprising:
- tokenizing circuitry to tokenize words of the text with one or more processors according to a rule set to generate tokenized text, the rule set defining:
a first set of grammatical types of words, which are words that are replaced, by the tokenizing circuitry, with tokens that respectively indicate a grammatical type of a respective word, and
a second set of grammatical types of words, which are words that are passed, by the tokenizing circuitry, as tokens without changing;
constructing circuitry to construct grams from the tokenized text, each gram including one or more of consecutive tokens from the tokenized text; and
comparing circuitry to compare the grams to a training data set that corresponds to a known diaculture to obtain a comparison result that indicates how well the text matches the training data set for the known diaculture, wherein the training data set includes a plurality of comments written by authors of the known diaculture;
tokenizing words of the comments with one or more processors according to the rule set to generate tokenized comments;
constructing grams from the tokenized comments, each gram including one or more of consecutive tokens from the tokenized comments;
assigning scores to each of the grams of the tokenized comments according to one or more scoring methods that each define a relationship between a score of a gram, and a number of times the gram appears in the training data set and a baseline data set; and
assigning scores to the grams of the tokenized text based on the scores assigned to the grams of the tokenized comments.
Specification