System and method for providing robust topic identification in social indexes

US 8,549,016 B2
Filed: 10/29/2009
Issued: 10/01/2013
Est. Priority Date: 11/14/2008
Status: Expired due to Fees

Abstract

A computer-implemented method for providing robust topic identification in social indexes is described. Electronically-stored articles and one or more indexes are maintained. Each index includes topics that each relate to one or more of the articles. A random sampling and a selective sampling of the articles are both selected. For each topic, characteristic words included in the articles in each of the random sampling and the selective sampling are identified. Frequencies of occurrence of the characteristic words in each of the random sampling and the selective sampling are determined. A ratio of the frequencies of occurrence for the characteristic words included in the random sampling and the selective sampling is identified. Finally, for each topic, a coarse-grained topic model is built, which includes the characteristic words included in the articles relating to the topic and scores assigned to those characteristic words.
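
The scoring step in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the tokenizer, the use of relative frequencies, the `floor` guard for words absent from the random sample, and the function names `word_frequencies` and `coarse_topic_model` are all assumptions made for the example. Every token is treated as a candidate characteristic word.

```python
from collections import Counter
import re

def word_frequencies(articles):
    """Relative frequency of each word across a list of article texts."""
    counts = Counter()
    for text in articles:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def coarse_topic_model(on_topic, random_sample, floor=1e-6):
    """Score each on-topic word as on-topic frequency / random-sample frequency.

    `floor` (an assumption) avoids division by zero for words that never
    occur in the random sample.
    """
    topic_freq = word_frequencies(on_topic)
    base_freq = word_frequencies(random_sample)
    return {w: f / max(base_freq.get(w, 0.0), floor) for w, f in topic_freq.items()}
```

A word that is common in the on-topic articles but rare in the random sample receives a high score, which is what lets the model describe the center of the topic.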

22 Claims

1. A computer-implemented system for providing topic narrowing in interactive building of electronically-stored social indexes, comprising:
- electronically-stored data, comprising;
  
  a corpus of articles each comprised of online textual materials; and
  
  a hierarchically-structured tree of topics; and
  
  a social indexing system, comprising;
  
  a finite state modeler comprising;
  
  a selection module configured to designate, for each of the topics, a set of the articles in the corpus as on-topic positive training examples; and
  
  a pattern evaluator configured to find a fine-grained topic model comprising a finite state pattern that matches the on-topic positive training examples, each finite state pattern comprising a pattern evaluable against the articles, wherein the pattern identifies such articles matching the on-topic positive training examples for the corresponding topic;
  
  a characteristic word modeler configured to generate a coarse-grained topic model for each of the topics corresponding to a center of the topic, comprising;
  
  a random sampling module configured to randomly select a set of the articles in the corpus, to identify a set of characteristic words in each of the randomly-selected articles, and to determine a frequency of occurrence of each of the characteristic words identified in the set of randomly-selected articles;
  
  a selective sampling module configured to identify a set of characteristic words in each of the articles in the on-topic positive training examples, and to determine a frequency of occurrence of each of the characteristic words identified in the articles in the on-topic training examples; and
  
  a scoring module configured to assign a score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the articles in the on-topic training examples and in the set of randomly-selected articles;
  
  a filter module configured to filter new articles received into the corpus, comprising;
  
  a matching module configured to match the finite state patterns to each new article;
  
  a characteristic word evaluator configured to identify a set of characteristic words in each new article, and to determine a frequency of occurrence of each of the characteristic words identified in the each article; and
  
  a similarity scoring module configured to assign a similarity score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the new article and in the set of randomly-selected articles; and
  
  a display module configured to order the new articles for each of the topics, comprising;
  
  a new article matching module configured to match the new articles to the finite state pattern of the fine-grained topic model for the topic;
  
  a new article comparison module configured to compare, for each new article that matches the fine-grained topic model for the topic, similarity scores for each of the characteristic words identified in the new article to the scores of the corresponding characteristic words in the coarse-grained topic model for the topic; and
  
  a display configured to display each of the new articles that was matched by the topic's fine-grained topic model and which has similarity scores close to the topic's coarse-grained topic model's characteristic word scores as candidate articles for negative training examples.
- - 2. A computer-implemented system according to claim 1, further comprising:
    - the selective sampling module further configured to designate a further subset of the articles in the corpus as additions to the negative training examples; and
      
      the pattern evaluator further configured to redefine the finite state patterns to match the on-topic positive training examples and to not match the negative training examples prior to the filtering.
  - 3. A computer-implemented system according to claim 1, further comprising:
    - the selective sampling module further configured to randomly select a set of the articles in the corpus, which match the finite state patterns as a further fine-grained topic model in lieu of designating a set of the articles in the corpus as the on-topic positive training examples; and
      
      a term vector module configured to form term vectors for the characteristic words in each of the articles in the further fine-grained topic model comprising frequencies of occurrence within the further fine-grained topic model, and to average the term vectors.
  - 4. A computer-implemented system according to claim 3, further comprising:
    - the scoring module further configured to adjust the weight of each of the characteristic words comprising at least one of;
      
      a sample reduction module configured to reduce the weight for each such characteristic word appearing fewer than a minimum number of times in the sampling of the articles;
      
      a character reduction module configured to reduce the weight for each characteristic word comprising a length of less than a minimum number of characters;
      
      a label increase module configured to increase the weight for each characteristic word appearing in an index label of one or more of the articles in the sampling of the articles; and
      
      a neighbor increase module configured to increase the weight of each characteristic word either neighboring or appearing adjacent to another characteristic word appearing in an index label of one or more of the articles in the sampling of the articles.
  - 5. A computer-implemented system according to claim 1, further comprising:
    - the characteristic word modeler further configured at least one of to take cosines of the highest scores of the characteristic words, and to find the highest score within the scores of the characteristic words and to normalize the scores of the remaining characteristic words against the highest score.
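
As an illustration of the filtering and comparison steps in claim 1, the sketch below computes per-word similarity scores for a new article and measures how close they are to a topic's coarse-grained scores. The claims do not define "close"; using a cosine between the two score vectors is an assumption here, prompted by the mention of cosines in claim 5. The names `similarity_scores` and `closeness`, the tokenizer, and the `floor` guard are hypothetical.

```python
import math
import re
from collections import Counter

def similarity_scores(article, base_freq, floor=1e-6):
    """Per-word ratio: frequency in the new article / frequency in the random sample."""
    words = re.findall(r"[a-z']+", article.lower())
    counts = Counter(words)
    total = len(words)
    return {w: (c / total) / max(base_freq.get(w, 0.0), floor)
            for w, c in counts.items()}

def closeness(article_scores, topic_scores):
    """Cosine between the two score vectors (assumed closeness measure).

    Returns 0.0 when the vectors share no words or either one is empty.
    """
    dot = sum(s * topic_scores.get(w, 0.0) for w, s in article_scores.items())
    norm_a = math.sqrt(sum(s * s for s in article_scores.values()))
    norm_t = math.sqrt(sum(s * s for s in topic_scores.values()))
    return dot / (norm_a * norm_t) if norm_a and norm_t else 0.0
```

Here `base_freq` stands for the word frequencies measured on the randomly selected articles, and `topic_scores` for the scores held in the coarse-grained topic model.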

6. A computer-implemented method for providing topic narrowing in interactive building of electronically-stored social indexes, comprising:
- accessing a corpus of articles each comprised of online textual materials;
  
  specifying a hierarchically-structured tree of topics;
  
  for each of the topics, designating a set of the articles in the corpus as on-topic positive training examples;
  
  finding a fine-grained topic model comprising a finite state pattern that matches the on-topic positive training examples, each finite state pattern comprising a pattern evaluable against the articles, wherein the pattern identifies such articles matching the on-topic positive training examples for the corresponding topic;
  
  for each of the topics, generating a coarse-grained topic model corresponding to a center of the topic comprising;
  
  randomly selecting a set of the articles in the corpus;
  
  identifying a set of characteristic words in each of the randomly-selected articles;
  
  determining a frequency of occurrence of each of the characteristic words identified in the set of randomly-selected articles;
  
  identifying a set of characteristic words in each of the articles in the on-topic positive training examples;
  
  determining a frequency of occurrence of each of the characteristic words identified in the articles in the on-topic training examples; and
  
  assigning a score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the articles in the on-topic training examples and in the set of randomly-selected articles;
  
  filtering new articles received into the corpus, comprising;
  
  matching the finite state patterns to each new article;
  
  identifying a set of characteristic words in each new article;
  
  determining a frequency of occurrence of each of the characteristic words identified in the each article; and
  
  assigning a similarity score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the new article and in the set of randomly-selected articles; and
  
  for each of the topics, ordering the new articles comprising;
  
  matching the new articles to the finite state pattern of the fine-grained topic model for the topic;
  
  for each new article that matches the fine-grained topic model for the topic, comparing similarity scores for each of the characteristic words identified in the new article to the scores of the corresponding characteristic words in the coarse-grained topic model for the topic; and
  
  displaying each of the new articles that was matched by the topic's fine-grained topic model and which has similarity scores close to the topic's coarse-grained topic model's characteristic word scores as candidate articles for negative training examples.
- - 7. A computer-implemented method according to claim 6, further comprising:
    - designating a further subset of the articles in the corpus as additions to the negative training examples; and
      
      redefining the finite state patterns to match the on-topic positive training examples and to not match the negative training examples prior to the filtering.
  - 8. A computer-implemented method according to claim 6, further comprising:
    - randomly selecting a set of the articles in the corpus, which match the finite state patterns as a further fine-grained topic model in lieu of designating a set of the articles in the corpus as the on-topic positive training examples;
      
      forming term vectors for the characteristic words in each of the articles in the further fine-grained topic model comprising frequencies of occurrence within the further fine-grained topic model; and
      
      averaging the term vectors.
  - 9. A computer-implemented method according to claim 8, further comprising:
    - adjusting the weight of each of the characteristic words, comprising at least one of;
      
      reducing the weight for each such characteristic word appearing fewer than a minimum number of times in the sampling of the articles;
      
      reducing the weight for each characteristic word comprising a length of less than a minimum number of characters;
      
      increasing the weight for each characteristic word appearing in an index label of one or more of the articles in the sampling of the articles; and
      
      increasing the weight of each characteristic word either neighboring or appearing adjacent to another characteristic word appearing in an index label of one or more of the articles in the sampling of the articles.
  - 10. A computer-implemented method according to claim 6, further comprising at least one of:
    - taking cosines of the highest scores of the characteristic words; and
      
      finding the highest score within the scores of the characteristic words and normalizing the scores of the remaining characteristic words against the highest score.
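
The adjustments in claim 9 and the normalization in claim 10 might look like the sketch below. The thresholds `min_count` and `min_length`, the multipliers `penalty` and `boost`, and both function names are hypothetical, since the claims give no values. The neighbor-based increase is left out to keep the example short.

```python
def adjust_scores(scores, counts, label_words, min_count=3, min_length=4,
                  penalty=0.5, boost=2.0):
    """Apply the count, length and index-label adjustments to word scores."""
    adjusted = {}
    for word, score in scores.items():
        if counts.get(word, 0) < min_count:
            score *= penalty   # appears too few times in the sampling
        if len(word) < min_length:
            score *= penalty   # shorter than the minimum number of characters
        if word in label_words:
            score *= boost     # appears in an index label
        adjusted[word] = score
    return adjusted

def normalize_to_max(scores):
    """Divide every score by the highest score, so the top word scores 1.0."""
    top = max(scores.values(), default=0.0)
    return {w: s / top for w, s in scores.items()} if top > 0 else dict(scores)
```

Normalizing against the highest score keeps models for different topics on a comparable scale, whatever the raw frequency ratios were.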

11. A computer-implemented system for providing topic broadening in interactive building of electronically-stored social indexes, comprising:
- electronically-stored data, comprising;
  
  a corpus of articles each comprised of online textual materials; and
  
  a hierarchically-structured tree of topics; and
  
  a social indexing system, comprising;
  
  a finite state modeler comprising;
  
  a selection module configured to designate, for each of the topics, a set of the articles in the corpus as on-topic positive training examples; and
  
  a pattern evaluator configured to find a fine-grained topic model comprising a finite state pattern that matches the on-topic positive training examples, each finite state pattern comprising a pattern evaluable against the articles, wherein the pattern identifies such articles matching the on-topic positive training examples for the corresponding topic;
  
  a characteristic word modeler configured to generate a coarse-grained topic model for each of the topics corresponding to a center of the topic, comprising;
  
  a random sampling module configured to randomly select a set of the articles in the corpus, to identify a set of characteristic words in each of the randomly-selected articles, and to determine a frequency of occurrence of each of the characteristic words identified in the set of randomly-selected articles;
  
  a selective sampling module configured to identify a set of characteristic words in each of the articles in the on-topic positive training examples, and to determine a frequency of occurrence of each of the characteristic words identified in the articles in the on-topic training examples; and
  
  a scoring module configured to assign a score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the articles in the on-topic training examples and in the set of randomly-selected articles;
  
  a filter module configured to filter new articles received into the corpus, comprising;
  
  a matching module configured to match the finite state patterns to each new article;
  
  a characteristic word evaluator configured to identify a set of characteristic words in each new article, and to determine a frequency of occurrence of each of the characteristic words identified in the each article; and
  
  a similarity scoring module configured to assign a similarity score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the new article and in the set of randomly-selected articles; and
  
  a display module configured to order the new articles for each of the topics, comprising;
  
  a new article matching module configured to match the new articles to the finite state pattern of the fine-grained topic model for the topic;
  
  a new article comparison module configured to compare, for each new article that does not match the fine-grained topic model for the topic, similarity scores for each of the characteristic words identified in the new article to the scores of the corresponding characteristic words in the coarse-grained topic model for the topic; and
  
  a display configured to display each of the new articles that was not matched by the topic's fine-grained topic model and which has similarity scores close to the topic's coarse-grained topic model's characteristic word scores as candidate articles for additional positive training examples.
- - 12. A computer-implemented system according to claim 11, further comprising:
    - the selective sampling module further configured to designate a further subset of the non-matching articles in the corpus as topic-broadening positive training examples; and
      
      the pattern evaluator further configured to redefine the finite state patterns to match the on-topic positive training examples and the topic-broadening positive training examples and to not match the negative training examples prior to the filtering.
  - 13. A computer-implemented system according to claim 11, further comprising:
    - the selective sampling module further configured to randomly select a set of the articles in the corpus, which match the finite state patterns as a further fine-grained topic model in lieu of designating a set of the articles in the corpus as the on-topic positive training examples; and
      
      a term vector module configured to form term vectors for the characteristic words in each of the articles in the further fine-grained topic model comprising frequencies of occurrence within the further fine-grained topic model, and to average the term vectors.
  - 14. A computer-implemented system according to claim 13, further comprising:
    - the scoring module further configured to adjust the weight of each of the characteristic words comprising at least one of;
      
      a sample reduction module configured to reduce the weight for each such characteristic word appearing fewer than a minimum number of times in the sampling of the articles;
      
      a character reduction module configured to reduce the weight for each characteristic word comprising a length of less than a minimum number of characters;
      
      a label increase module configured to increase the weight for each characteristic word appearing in an index label of one or more of the articles in the sampling of the articles; and
      
      a neighbor increase module configured to increase the weight of each characteristic word either neighboring or appearing adjacent to another characteristic word appearing in an index label of one or more of the articles in the sampling of the articles.
  - 15. A computer-implemented system according to claim 11, further comprising:
    - the characteristic word modeler further configured at least one of to take cosines of the highest scores of the characteristic words, and to find the highest score within the scores of the characteristic words and to normalize the scores of the remaining characteristic words against the highest score.
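
The term vector step in claim 13 can be illustrated with a short sketch. It assumes the articles passed in have already been randomly selected from those matching the finite state patterns, and it represents each term vector as a dictionary of relative word frequencies. The tokenizer and the names `term_vector` and `average_term_vectors` are hypothetical.

```python
import re
from collections import Counter

def term_vector(article):
    """Relative word frequencies for one article."""
    words = re.findall(r"[a-z']+", article.lower())
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

def average_term_vectors(articles):
    """Component-wise mean of the per-article term vectors."""
    vectors = [term_vector(a) for a in articles]
    if not vectors:
        return {}
    vocab = set().union(*vectors)  # every word seen in any article
    return {w: sum(v.get(w, 0.0) for v in vectors) / len(vectors) for w in vocab}
```

A word missing from an article contributes zero to the mean, so words that appear in only a few of the sampled articles end up with low averaged weights.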

16. A computer-implemented method for providing topic broadening in interactive building of electronically-stored social indexes, comprising:
- accessing a corpus of articles each comprised of online textual materials;
  
  specifying a hierarchically-structured tree of topics;
  
  for each of the topics, designating a set of the articles in the corpus as on-topic positive training examples;
  
  finding a fine-grained topic model comprising a finite state pattern that matches the on-topic positive training examples, each finite state pattern comprising a pattern evaluable against the articles, wherein the pattern identifies such articles matching the on-topic positive training examples for the corresponding topic;
  
  for each of the topics, generating a coarse-grained topic model corresponding to a center of the topic comprising;
  
  randomly selecting a set of the articles in the corpus;
  
  identifying a set of characteristic words in each of the randomly-selected articles;
  
  determining a frequency of occurrence of each of the characteristic words identified in the set of randomly-selected articles;
  
  identifying a set of characteristic words in each of the articles in the on-topic positive training examples;
  
  determining a frequency of occurrence of each of the characteristic words identified in the articles in the on-topic training examples; and
  
  assigning a score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the articles in the on-topic training examples and in the set of randomly-selected articles;
  
  filtering new articles received into the corpus, comprising;
  
  matching the finite state patterns to each new article;
  
  identifying a set of characteristic words in each new article;
  
  determining a frequency of occurrence of each of the characteristic words identified in the each article; and
  
  assigning a similarity score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the new article and in the set of randomly-selected articles; and
  
  for each of the topics, ordering the new articles comprising;
  
  matching the new articles to the finite state pattern of the fine-grained topic model for the topic;
  
  for each new article that does not match the fine-grained topic model for the topic, comparing similarity scores for each of the characteristic words identified in the new article to the scores of the corresponding characteristic words in the coarse-grained topic model for the topic; and
  
  displaying each of the new articles that was not matched by the topic's fine-grained topic model and which has similarity scores close to the topic's coarse-grained topic model's characteristic word scores as candidate articles for additional positive training examples.
- - 17. A computer-implemented method according to claim 16, further comprising:
    - accepting designations by a user of a further subset of the non-matching articles in the corpus as topic-broadening positive training examples; and
      
      redefining the finite state patterns to match the on-topic positive training examples and the topic-broadening positive training examples and to not match the negative training examples prior to the filtering.
  - 18. A computer-implemented method according to claim 16, further comprising:
    - randomly selecting a set of the articles in the corpus, which match the finite state patterns as a further fine-grained topic model in lieu of designating a set of the articles in the corpus as the on-topic positive training examples;
      
      forming term vectors for the characteristic words in each of the articles in the further fine-grained topic model comprising frequencies of occurrence within the further fine-grained topic model; and
      
      averaging the term vectors.
  - 19. A computer-implemented method according to claim 18, further comprising:
    - adjusting the score of each of the characteristic words, comprising at least one of;
      
      reducing the score for each such characteristic word appearing fewer than a minimum number of times in the sampling of the articles;
      
      reducing the score for each characteristic word comprising a length of less than a minimum number of characters;
      
      increasing the score for each characteristic word appearing in an index label of one or more of the articles in the sampling of the articles; and
      
      increasing the score of each characteristic word either neighboring or appearing adjacent to another characteristic word appearing in an index label of one or more of the articles in the sampling of the articles.
  - 20. A computer-implemented method according to claim 16, further comprising:
    - finding the highest score within the scores of the characteristic words; and
      
      normalizing the scores of the remaining characteristic words against the highest score.
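
Claims 7, 12 and 17 describe redefining the finite state patterns so that they match the positive training examples and do not match the negative ones. The sketch below shows one way to check that condition, using Python regular expressions as a stand-in for finite state patterns. The list of candidate patterns and the function name `consistent_patterns` are assumptions; the claims do not say how candidate patterns are generated.

```python
import re

def consistent_patterns(candidates, positives, negatives):
    """Keep candidate patterns that match every positive example and no negative one."""
    kept = []
    for source in candidates:
        pattern = re.compile(source, re.IGNORECASE)
        matches_all_pos = all(pattern.search(a) for a in positives)
        matches_no_neg = not any(pattern.search(a) for a in negatives)
        if matches_all_pos and matches_no_neg:
            kept.append(source)
    return kept
```

For topic broadening, the newly accepted positive examples would simply be appended to `positives` before the check is run again.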

21. A computer-implemented system for providing robustness against noise during interactive building of electronically-stored social indexes, comprising:
- electronically-stored data, comprising;
  
  a corpus of articles each comprised of online textual materials; and
  
  a hierarchically-structured tree of topics; and
  
  a social indexing system, comprising;
  
  a finite state modeler comprising;
  
  a selection module configured to designate, for each of the topics, a set of the articles in the corpus as on-topic positive training examples; and
  
  a pattern evaluator configured to find a fine-grained topic model comprising a finite state pattern that matches the on-topic positive training examples, each finite state pattern comprising a pattern evaluable against the articles, wherein the pattern identifies such articles matching the on-topic positive training examples for the corresponding topic;
  
  a characteristic word modeler configured to generate a coarse-grained topic model for each of the topics corresponding to a center of the topic, comprising;
  
  a random sampling module configured to randomly select a set of the articles in the corpus, to identify a set of characteristic words in each of the randomly-selected articles, and to determine a frequency of occurrence of each of the characteristic words identified in the set of randomly-selected articles;
  
  a selective sampling module configured to identify a set of characteristic words in each of the articles in the on-topic positive training examples, and to determine a frequency of occurrence of each of the characteristic words identified in the articles in the on-topic training examples; and
  
  a scoring module configured to assign a score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the articles in the on-topic training examples and in the set of randomly-selected articles;
  
  a filter module configured to filter new articles received into the corpus, comprising;
  
  a matching module configured to match the finite state patterns to each new article;
  
  a characteristic word evaluator configured to identify a set of characteristic words in each new article, and to determine a frequency of occurrence of each of the characteristic words identified in the each article; and
  
  a similarity scoring module configured to assign a similarity score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the new article and in the set of randomly-selected articles; and
  
  a display module configured to order the new articles for each of the topics, comprising;
  
  a new article matching module configured to match the new articles to the finite state pattern of the fine-grained topic model for the topic;
  
  a new article comparison module configured to compare, for each new article that matches the fine-grained topic model for the topic, similarity scores for each of the characteristic words identified in the new article to the scores of the corresponding characteristic words in the coarse-grained topic model for the topic; and
  
  a display configured to display each of the new articles that was matched by the topic's fine-grained topic model and which has similarity scores far from the topic's coarse-grained topic model's characteristic word scores as candidate noise articles.

22. A computer-implemented method for providing robustness against noise during interactive building of electronically-stored social indexes, comprising:
- accessing a corpus of articles each comprised of online textual materials;
  
  specifying a hierarchically-structured tree of topics;
  
  for each of the topics, designating a set of the articles in the corpus as on-topic positive training examples;
  
  finding a fine-grained topic model comprising a finite state pattern that matches the on-topic positive training examples, each finite state pattern comprising a pattern evaluable against the articles, wherein the pattern identifies such articles matching the on-topic positive training examples for the corresponding topic;
  
  for each of the topics, generating a coarse-grained topic model corresponding to a center of the topic comprising;
  
  randomly selecting a set of the articles in the corpus;
  
  identifying a set of characteristic words in each of the randomly-selected articles;
  
  determining a frequency of occurrence of each of the characteristic words identified in the set of randomly-selected articles;
  
  identifying a set of characteristic words in each of the articles in the on-topic positive training examples;
  
  determining a frequency of occurrence of each of the characteristic words identified in the articles in the on-topic training examples; and
  
  assigning a score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the articles in the on-topic training examples and in the set of randomly-selected articles;
  
  filtering new articles received into the corpus, comprising;
  
  matching the finite state patterns to each new article;
  
  identifying a set of characteristic words in each new article;
  
  determining a frequency of occurrence of each of the characteristic words identified in the each article; and
  
  assigning a similarity score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the new article and in the set of randomly-selected articles; and
  
  for each of the topics, ordering the new articles comprising;
  
  matching the new articles to the finite state pattern of the fine-grained topic model for the topic;
  
  for each new article that matches the fine-grained topic model for the topic, comparing similarity scores for each of the characteristic words identified in the new article to the scores of the corresponding characteristic words in the coarse-grained topic model for the topic; and
  
  displaying each of the new articles that was matched by the topic's fine-grained topic model and which has similarity scores far from the topic's coarse-grained topic model's characteristic word scores as candidate noise articles.
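
The three families of independent claims differ only in which combination of pattern match and closeness they display. The sketch below restates that decision as worded in the claims. The thresholds `near` and `far`, the closeness scale from 0 to 1, and the function name `triage` are hypothetical, since the claims leave "close" and "far" undefined.

```python
def triage(matched, closeness, near=0.7, far=0.3):
    """Map a new article to the candidate list named in the independent claims.

    matched   -- True if the fine-grained finite state pattern matches the article
    closeness -- agreement with the coarse-grained model, assumed to lie in [0, 1]
    """
    if matched and closeness >= near:
        return "candidate negative training example"   # narrowing, claims 1 and 6
    if not matched and closeness >= near:
        return "candidate positive training example"   # broadening, claims 11 and 16
    if matched and closeness <= far:
        return "candidate noise article"               # noise, claims 21 and 22
    return None  # not displayed under any of the three claim families
```

In each case the result is only a candidate shown to the user, who decides whether to add the article to the training examples.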

Current Assignee: Palo Alto Research Center, Inc. (Xerox Holdings Corp.)
Original Assignee: Palo Alto Research Center, Inc. (Xerox Holdings Corp.)
Inventors: Stefik, Mark J.; Good, Lance E.; Mittal, Sanjay
Primary Examiner(s): SYED, FARHAN M
Application Number: US12/608,929
Publication Number: US 20100125540A1
Time in Patent Office: 1,433 Days
US Class Current: 707/749
CPC Class Codes: G06F 16/00 Information retrieval; Data...
