System and method for providing robust topic identification in social indexes

  • US 8,549,016 B2
  • Filed: 10/29/2009
  • Issued: 10/01/2013
  • Est. Priority Date: 11/14/2008
  • Status: Expired due to Fees
First Claim

1. A computer-implemented system for providing topic narrowing in interactive building of electronically-stored social indexes, comprising:

  • electronically-stored data, comprising:

    a corpus of articles each comprised of online textual materials; and

    a hierarchically-structured tree of topics; and

    a social indexing system, comprising:

    a finite state modeler comprising:

    a selection module configured to designate, for each of the topics, a set of the articles in the corpus as on-topic positive training examples; and

    a pattern evaluator configured to find a fine-grained topic model comprising a finite state pattern that matches the on-topic positive training examples, each finite state pattern comprising a pattern evaluable against the articles, wherein the pattern identifies such articles matching the on-topic positive training examples for the corresponding topic;

    a characteristic word modeler configured to generate a coarse-grained topic model for each of the topics corresponding to a center of the topic, comprising:

    a random sampling module configured to randomly select a set of the articles in the corpus, to identify a set of characteristic words in each of the randomly-selected articles, and to determine a frequency of occurrence of each of the characteristic words identified in the set of randomly-selected articles;

    a selective sampling module configured to identify a set of characteristic words in each of the articles in the on-topic positive training examples, and to determine a frequency of occurrence of each of the characteristic words identified in the articles in the on-topic training examples; and

    a scoring module configured to assign a score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the articles in the on-topic training examples and in the set of randomly-selected articles;

    a filter module configured to filter new articles received into the corpus, comprising:

    a matching module configured to match the finite state patterns to each new article;

    a characteristic word evaluator configured to identify a set of characteristic words in each new article, and to determine a frequency of occurrence of each of the characteristic words identified in the each article; and

    a similarity scoring module configured to assign a similarity score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the new article and in the set of randomly-selected articles; and

    a display module configured to order the new articles for each of the topics, comprising:

    a new article matching module configured to match the new articles to the finite state pattern of the fine-grained topic model for the topic;

    a new article comparison module configured to compare, for each new article that matches the fine-grained topic model for the topic, similarity scores for each of the characteristic words identified in the new article to the scores of the corresponding characteristic words in the coarse-grained topic model for the topic; and

    a display configured to display each of the new articles that was matched by the topic's fine-grained topic model and which has similarity scores close to the topic's coarse-grained topic model's characteristic word scores as candidate articles for negative training examples.
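
The scoring and filtering steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions the claim does not state: characteristic words are taken to be all lowercase alphabetic tokens, a compiled regular expression stands in for the finite state pattern, a `floor` value guards against division by zero for words absent from the random sample, and "close" is measured by cosine similarity against a hypothetical `min_similarity` threshold. All function and parameter names are illustrative.

```python
import re
from collections import Counter
from math import sqrt


def relative_frequencies(articles):
    """Frequency of each word across a set of articles, as a share of all tokens.

    Assumption: every lowercase alphabetic token counts as a characteristic word.
    """
    counts = Counter()
    for text in articles:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def ratio_scores(target_freq, baseline_freq, floor=1e-6):
    """Score each word as (frequency in target) / (frequency in random sample).

    `floor` is an assumption: it replaces a zero baseline frequency so the
    ratio stays defined for words the random sample never contained.
    """
    return {
        word: freq / max(baseline_freq.get(word, 0.0), floor)
        for word, freq in target_freq.items()
    }


def cosine(a, b):
    """Cosine similarity of two sparse score vectors (an assumed closeness measure)."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_articles(new_articles, pattern, topic_scores, baseline_freq,
                       min_similarity=0.5):
    """Return new articles that match the fine-grained pattern and whose word
    scores are close to the coarse-grained topic model, most similar first."""
    scored = []
    for text in new_articles:
        if not pattern.search(text):
            continue  # not matched by the fine-grained topic model
        article_scores = ratio_scores(relative_frequencies([text]), baseline_freq)
        similarity = cosine(article_scores, topic_scores)
        if similarity >= min_similarity:
            scored.append((similarity, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]


# Illustrative data only; none of it comes from the patent.
random_sample = ["the cat sat on the mat", "stocks fell on the market today"]
on_topic = ["the cat chased the cat toy"]

baseline_freq = relative_frequencies(random_sample)
topic_scores = ratio_scores(relative_frequencies(on_topic), baseline_freq)
fine_grained = re.compile(r"\bcat\b")  # stand-in for a finite state pattern

candidates = candidate_articles(
    ["the cat toy is on the mat", "cat stocks fell today", "dogs bark"],
    fine_grained, topic_scores, baseline_freq, min_similarity=0.1,
)
```

Under these assumptions, words missing from the random sample receive very large scores and dominate the comparison. A real system would likely need a more careful smoothing choice than the fixed `floor` used here.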
