IDENTIFYING CULTURAL BACKGROUND FROM TEXT
First Claim
1. A method for determining a diaculture of text, comprising:
- tokenizing words of the text with one or more processors according to a rule set to generate tokenized text, the rule set defining:
  a first set of grammatical types of words, which are words that are replaced, in the tokenizing, with tokens that respectively indicate a grammatical type of a respective word, and
  a second set of grammatical types of words, which are words that are passed, in the tokenizing, as tokens without changing;
- constructing grams from the tokenized text, each gram including one or more of consecutive tokens from the tokenized text; and
- comparing the grams to a training data set that corresponds to a known diaculture to obtain a comparison result that indicates how well the text matches the training data set for the known diaculture.
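The tokenizing step of the claim can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the toy `LEXICON`, the choice of which grammatical types fall into the first (replaced) and second (passed) sets, the `<TYPE>` token spelling, and the `<UNK>` fallback for unknown words are not taken from the claim, and a practical system would presumably use a part-of-speech tagger rather than a lookup table.

```python
import re

# Hypothetical toy lexicon mapping words to grammatical types (illustrative only).
LEXICON = {
    "the": "DET", "a": "DET",
    "i": "PRON", "you": "PRON",
    "on": "PREP", "in": "PREP",
    "cat": "NOUN", "mat": "NOUN", "dog": "NOUN",
    "sat": "VERB", "ran": "VERB",
    "lazy": "ADJ",
}

# Assumed rule set; which types belong to which set is a guess, not from the claim.
REPLACED_TYPES = {"NOUN", "VERB", "ADJ"}  # first set: word is replaced by a type token
PASSED_TYPES = {"DET", "PRON", "PREP"}    # second set: word is passed unchanged


def tokenize(text, lexicon=LEXICON, replaced=REPLACED_TYPES, passed=PASSED_TYPES):
    """Return a token list: type tokens for the first set, raw words for the second."""
    tokens = []
    for word in re.findall(r"[a-z']+", text.lower()):
        gtype = lexicon.get(word)
        if gtype in replaced:
            tokens.append(f"<{gtype}>")
        elif gtype in passed:
            tokens.append(word)
        else:
            # Assumption: words outside both sets get a placeholder token.
            tokens.append("<UNK>")
    return tokens
```

With this toy rule set, `tokenize("The lazy cat sat on the mat")` returns `['the', '<ADJ>', '<NOUN>', '<VERB>', 'on', 'the', '<NOUN>']`: the words in the passed set survive as written, while the other words are reduced to their grammatical type.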
Abstract
Diaculture of text can be determined or analyzed by tokenizing words of the text according to a rule set to generate tokenized text, the rule set defining: a first set of grammatical types of words, which are words that are replaced with tokens that respectively indicate a grammatical type of a respective word, and a second set of grammatical types of words, which are words that are passed as tokens without changing. Grams can be constructed from the tokenized text, each gram including one or more of consecutive tokens from the tokenized text. The grams can be compared to a training data set that corresponds to a known diaculture to obtain a comparison result that indicates how well the text matches the training data set for the known diaculture.
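The gram construction and comparison steps described in the abstract can be sketched as follows. This is only one possible reading: the maximum gram length, the use of gram counts, and cosine similarity as the "comparison result" are assumptions chosen for illustration, and the function names `build_grams` and `match_score` are hypothetical. The input is an already tokenized list in which some words have been replaced by grammatical-type tokens.

```python
from collections import Counter
from math import sqrt


def build_grams(tokens, max_n=2):
    """Count every run of 1..max_n consecutive tokens (assumed gram lengths)."""
    grams = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams[tuple(tokens[i:i + n])] += 1
    return grams


def match_score(text_grams, training_grams):
    """Cosine similarity of gram counts; 1.0 means identical gram profiles.

    Cosine similarity is an assumed stand-in for the comparison result.
    """
    dot = sum(count * training_grams.get(gram, 0)
              for gram, count in text_grams.items())
    norm_text = sqrt(sum(c * c for c in text_grams.values()))
    norm_train = sqrt(sum(c * c for c in training_grams.values()))
    if norm_text == 0 or norm_train == 0:
        return 0.0
    return dot / (norm_text * norm_train)


# Hypothetical tokenized inputs: a text to classify and a training sample
# standing in for a known diaculture.
text_tokens = ["the", "<NOUN>", "<VERB>", "on", "the", "<NOUN>"]
training_tokens = ["the", "<NOUN>", "<VERB>", "in", "a", "<NOUN>"]

score = match_score(build_grams(text_tokens), build_grams(training_tokens))
```

Under these assumptions, repeating the comparison against one training set per known diaculture and taking the highest score would give a candidate diaculture for the text.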
20 Claims
1. A method for determining a diaculture of text, comprising:
- tokenizing words of the text with one or more processors according to a rule set to generate tokenized text, the rule set defining:
  a first set of grammatical types of words, which are words that are replaced, in the tokenizing, with tokens that respectively indicate a grammatical type of a respective word, and
  a second set of grammatical types of words, which are words that are passed, in the tokenizing, as tokens without changing;
- constructing grams from the tokenized text, each gram including one or more of consecutive tokens from the tokenized text; and
- comparing the grams to a training data set that corresponds to a known diaculture to obtain a comparison result that indicates how well the text matches the training data set for the known diaculture.
19. A system for determining a diaculture of text, comprising computer hardware, including a central processor and memory, which is configured to:
- tokenize words of the text with one or more processors according to a rule set to generate tokenized text, the rule set defining:
  a first set of grammatical types of words, which are words that are replaced with tokens that respectively indicate a grammatical type of a respective word, and
  a second set of grammatical types of words, which are words that are passed as tokens without changing;
- construct grams from the tokenized text, each gram including one or more of consecutive tokens from the tokenized text; and
- compare the grams to a training data set that corresponds to a known diaculture to obtain a comparison result that indicates how well the text matches the training data set for the known diaculture.
20. A processing machine for determining a diaculture of text, comprising:
- tokenizing circuitry to tokenize words of the text with one or more processors according to a rule set to generate tokenized text, the rule set defining:
  a first set of grammatical types of words, which are words that are replaced, by the tokenizing circuitry, with tokens that respectively indicate a grammatical type of a respective word, and
  a second set of grammatical types of words, which are words that are passed, by the tokenizing circuitry, as tokens without changing;
- constructing circuitry to construct grams from the tokenized text, each gram including one or more of consecutive tokens from the tokenized text; and
- comparing circuitry to compare the grams to a training data set that corresponds to a known diaculture to obtain a comparison result that indicates how well the text matches the training data set for the known diaculture.
Specification