Identifying cultural background from text

US 9,158,761 B2
Filed: 03/28/2013
Issued: 10/13/2015
Est. Priority Date: 03/28/2012
Status: Active Grant

Abstract

Diaculture of text can be determined or analyzed by tokenizing words of the text according to a rule set to generate tokenized text, the rule set defining: a first set of grammatical types of words, which are words that are replaced with tokens that respectively indicate a grammatical type of a respective word, and a second set of grammatical types of words, which are words that are passed as tokens without changing. Grams can be constructed from the tokenized text, each gram including one or more of consecutive tokens from the tokenized text. The grams can be compared to a training data set that corresponds to a known diaculture to obtain a comparison result that indicates how well the text matches the training data set for the known diaculture.
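
The tokenizing and gram-construction steps summarized above can be illustrated with a minimal sketch. It is not the patented implementation: the tag names, the `RULE_SET` layout, the toy `LEXICON` (standing in for a real part-of-speech tagger), and the choice to treat unknown words as nouns are all assumptions made for illustration.

```python
# Hypothetical rule set; the tag names and groupings are illustrative only.
RULE_SET = {
    "replace": {"NOUN", "VERB", "ADJ", "ADV"},  # first set: replaced by a type token
    "pass": {"PRON", "DET", "ADP"},             # second set: passed through unchanged
}

# Toy lexicon standing in for a real part-of-speech tagger (assumption).
LEXICON = {
    "the": "DET", "a": "DET", "my": "PRON", "i": "PRON", "we": "PRON",
    "in": "ADP", "on": "ADP", "of": "ADP",
    "dog": "NOUN", "park": "NOUN", "ran": "VERB",
    "quickly": "ADV", "big": "ADJ",
}


def tokenize(text):
    """Replace first-set words with a type token; pass second-set words as-is."""
    tokens = []
    for word in text.lower().split():
        pos = LEXICON.get(word, "NOUN")  # assumption: unknown words count as nouns
        if pos in RULE_SET["replace"]:
            tokens.append(f"<{pos}>")
        else:  # grammatical types listed in RULE_SET["pass"]
            tokens.append(word)
    return tokens


def build_grams(tokens, max_n=3):
    """Return every run of 1..max_n consecutive tokens as a tuple."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams


tokens = tokenize("The big dog ran in the park")
grams = build_grams(tokens)
```

Replacing topic-bearing words with type tokens while keeping function words intact means the grams capture phrasing habits rather than subject matter, which is the distinction the two sets in the rule set draw.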

18 Claims

1. A method for determining a diaculture of text, comprising:
- tokenizing words of the text with one or more processors according to a rule set to generate tokenized text, the rule set defining:
  
  a first set of grammatical types of words, which are words that are replaced, in the tokenizing, with tokens that respectively indicate a grammatical type of a respective word, and
  
  a second set of grammatical types of words, which are words that are passed, in the tokenizing, as tokens without changing;
  
  constructing grams from the tokenized text, each gram including one or more of consecutive tokens from the tokenized text;
  
  comparing the grams to a training data set that corresponds to a known diaculture to obtain a comparison result that indicates how well the text matches the training data set for the known diaculture, wherein the training data set includes a plurality of comments written by authors of the known diaculture;
  
  tokenizing words of the comments with one or more processors according to the rule set to generate tokenized comments;
  
  constructing grams from the tokenized comments, each gram including one or more of consecutive tokens from the tokenized comments;
  
  assigning scores to each of the grams of the tokenized comments according to one or more scoring methods that each define a relationship between a score of a gram, and a number of times the gram appears in the training data set and a baseline data set; and
  
  assigning scores to the grams of the tokenized text based on the scores assigned to the grams of the tokenized comments.
- Dependent claims (2 to 16):
  - 2. The method according to claim 1, wherein the comparing includes:
    - assigning scores to the grams based on a comparison between the training data set and a baseline data set;
      
      windowing a fixed number of the consecutive tokens in the tokenized text to form a first window, and repeatedly advancing the first window by one token to form a plurality of windows of tokens from the tokenized text;
      
      assigning a score to each of the windows based on the scores assigned to the grams; and
      
      obtaining the comparison result based on the scores assigned to the windows.
  - 3. The method according to claim 2, wherein the constructing grams includes constructing a plurality of 1, 2, and 3-grams from the tokenized text, the 1, 2, and 3 grams respectively including 1, 2, and 3 consecutive tokens from the tokenized text, such that a 1-gram includes a first token, a 2-gram includes the first token and a second token that consecutively follows the first token, and a 3-gram includes the first and second tokens and a third token that consecutively follows the first token.
  - 4. The method according to claim 3, wherein the comparing includes:
    - assigning scores to the grams based on the training data set, including assigning a composite score for one gram that is calculated based on neighboring grams, such that the composite score for the 1-gram is calculated based on scores assigned to the first, second and third tokens.
  - 5. The method according to claim 4, wherein the composite score for the 1-gram is an average of the scores assigned to the first, second and third tokens.
  - 6. The method according to claim 1, wherein the first set of grammatical types of words includes words indicative of a topic of the text.
  - 7. The method according to claim 6, wherein the second set of grammatical types of words does not include words that are indicative of the topic of the text.
  - 8. The method according to claim 1, wherein the first set of grammatical types of words includes verbs, nouns, adverbs, and adjectives.
  - 9. The method according to claim 8, wherein each tense of each grammatical type in the first set is tokenized with a different token.
  - 10. The method according to claim 9, wherein the second set of grammatical types of words includes possessive pronouns, pronouns, articles, and prepositions.
  - 11. The method according to claim 1, wherein the training data set includes a plurality of data sets that respectively correspond to a plurality of different diacultures, and the comparing includes comparing the grams to the data sets to obtain comparison results that indicate how well the text matches the data sets.
  - 12. The method according to claim 11, further comprising displaying a result of the comparing on a display.
  - 13. The method according to claim 1, wherein the comments include comments of a posting, and the training data set does not include the posting.
  - 14. The method according to claim 1, wherein the comparing includes:
    - windowing a fixed number of the consecutive tokens in the tokenized text to form a first window, and repeatedly advancing the first window by one token to form a plurality of windows of tokens from the tokenized text;
      
      assigning scores to the windows based on the scores assigned to the grams; and
      
      obtaining the comparison result based on the scores assigned to the windows.
  - 15. The method according to claim 14, wherein:
    - the training data set includes a plurality of data sets that respectively correspond to a plurality of different diacultures;
      
      the comparing includes comparing the grams to the data sets to obtain comparison results that indicate how well the text matches the data sets; and
      
      the method further comprises:
      
      displaying results of the comparing for each combination of the one or more scoring methods and the different diacultures.
  - 16. A non-transitory computer readable medium including computer-executable instructions that, when executed by a computer processor, cause the computer processor to execute the method according to claim 1.
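
The windowing recited in claims 2 and 14 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the window size of 5, scoring a window as the sum of the scores of the grams it contains, and averaging window scores into a single comparison result are all choices made here for the example, and the function names are hypothetical. Gram scores are assumed to arrive as a dictionary keyed by token tuples.

```python
def gram_score_sum(window, gram_scores, max_n=3):
    """Sum the scores of all 1..max_n-grams found inside one window."""
    total = 0.0
    for n in range(1, max_n + 1):
        for i in range(len(window) - n + 1):
            # Grams never seen in training contribute nothing (assumption).
            total += gram_scores.get(tuple(window[i:i + n]), 0.0)
    return total


def window_scores(tokens, gram_scores, size=5):
    """Score each fixed-size window, advancing the window one token at a time."""
    if len(tokens) <= size:
        # Text shorter than one window: treat the whole text as a single window.
        return [gram_score_sum(tokens, gram_scores)]
    return [
        gram_score_sum(tokens[i:i + size], gram_scores)
        for i in range(len(tokens) - size + 1)
    ]


def comparison_result(tokens, gram_scores, size=5):
    """Collapse the window scores into one number (mean is an assumption)."""
    scores = window_scores(tokens, gram_scores, size)
    return sum(scores) / len(scores)


example_scores = {("a",): 1.0, ("b", "c"): 2.0}
per_window = window_scores(["a", "b", "c", "d"], example_scores, size=2)
overall = comparison_result(["a", "b", "c", "d"], example_scores, size=2)
```

For the multi-diaculture case of claims 11 and 15, the same function could be called once per diaculture-specific score table and the results shown side by side.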

17. A system for determining a diaculture of text, comprising computer hardware, including a central processor and memory, which is configured to:
- tokenize words of the text with one or more processors according to a rule set to generate tokenized text, the rule set defining:
  
  a first set of grammatical types of words, which are words that are replaced with tokens that respectively indicate a grammatical type of a respective word, and
  
  a second set of grammatical types of words, which are words that are passed as tokens without changing;
  
  construct grams from the tokenized text, each gram including one or more of consecutive tokens from the tokenized text; and
  
  compare the grams to a training data set that corresponds to a known diaculture to obtain a comparison result that indicates how well the text matches the training data set for the known diaculture, wherein the training data set includes a plurality of comments written by authors of the known diaculture;
  
  tokenizing words of the comments with one or more processors according to the rule set to generate tokenized comments;
  
  constructing grams from the tokenized comments, each gram including one or more of consecutive tokens from the tokenized comments;
  
  assigning scores to each of the grams of the tokenized comments according to one or more scoring methods that each define a relationship between a score of a gram, and a number of times the gram appears in the training data set and a baseline data set; and
  
  assigning scores to the grams of the tokenized text based on the scores assigned to the grams of the tokenized comments.

18. A processing machine for determining a diaculture of text, comprising:
- tokenizing circuitry to tokenize words of the text with one or more processors according to a rule set to generate tokenized text, the rule set defining:
  
  a first set of grammatical types of words, which are words that are replaced, by the tokenizing circuitry, with tokens that respectively indicate a grammatical type of a respective word, and
  
  a second set of grammatical types of words, which are words that are passed, by the tokenizing circuitry, as tokens without changing;
  
  constructing circuitry to construct grams from the tokenized text, each gram including one or more of consecutive tokens from the tokenized text; and
  
  comparing circuitry to compare the grams to a training data set that corresponds to a known diaculture to obtain a comparison result that indicates how well the text matches the training data set for the known diaculture, wherein the training data set includes a plurality of comments written by authors of the known diaculture;
  
  tokenizing words of the comments with one or more processors according to the rule set to generate tokenized comments;
  
  constructing grams from the tokenized comments, each gram including one or more of consecutive tokens from the tokenized comments;
  
  assigning scores to each of the grams of the tokenized comments according to one or more scoring methods that each define a relationship between a score of a gram, and a number of times the gram appears in the training data set and a baseline data set; and
  
  assigning scores to the grams of the tokenized text based on the scores assigned to the grams of the tokenized comments.
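
The independent claims require only that a scoring method relate a gram's score to how often the gram appears in the training data set and in a baseline data set; they do not fix a formula. The sketch below uses a smoothed log-ratio of relative frequencies as one plausible choice. The formula, the smoothing constant, the zero score for unseen grams, and the function names are assumptions, not taken from the patent.

```python
import math
from collections import Counter


def score_grams(training_grams, baseline_grams, smoothing=1.0):
    """Score each training gram by its smoothed log frequency ratio to the baseline."""
    train = Counter(training_grams)
    base = Counter(baseline_grams)
    train_total = sum(train.values())
    base_total = sum(base.values())
    vocab = len(set(train) | set(base))
    scores = {}
    for gram, count in train.items():
        # Additive smoothing keeps the ratio finite when a gram is absent
        # from the baseline data set.
        p_train = (count + smoothing) / (train_total + smoothing * vocab)
        p_base = (base[gram] + smoothing) / (base_total + smoothing * vocab)
        scores[gram] = math.log(p_train / p_base)
    return scores


def score_text_grams(text_grams, scores):
    """Carry training-derived scores over to the grams of the text under test."""
    return [scores.get(gram, 0.0) for gram in text_grams]


training = [("a",), ("a",), ("b",)]
baseline = [("a",), ("b",), ("b",)]
scores = score_grams(training, baseline)
text_scores = score_text_grams([("a",), ("z",)], scores)
```

Under this choice, a positive score marks a gram that is more typical of the known diaculture than of the baseline, and a negative score marks the opposite.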

Current Assignee: Leidos Innovations Technology, Inc.
Original Assignee: Lockheed Martin Corporation (Martin Marietta Corporation)
Inventors: Davenport, Daniel M.; Menaker, David; Paradis, Rosemary D.; Taylor, Sarah M.
Primary Examiner(s): COLUCCI, MICHAEL C
Application Number: US13/852,620
Publication Number: US 20130282362A1
Time in Patent Office: 929 Days
Field of Search: 704/3, 704/9, 704/8, 704/4, 715/752, 715/236, 709/224, 709/206, 709/204, 707/797, 707/758, 707/737, 706/52, 705/7.29, 705/14.53
US Class Current: 1/1
CPC Class Codes:
G06F 40/253   Grammatical analysis; Style...
G06F 40/284   Lexical analysis, e.g. toke...
G06F 40/40   Processing or translation o...
