System and method for providing robust topic identification in social indexes
First Claim
1. A computer-implemented system for providing topic narrowing in interactive building of electronically-stored social indexes, comprising:
- electronically-stored data, comprising;
a corpus of articles each comprised of online textual materials; and
a hierarchically-structured tree of topics; and
a social indexing system, comprising;
a finite state modeler comprising;
a selection module configured to designate, for each of the topics, a set of the articles in the corpus as on-topic positive training examples; and
a pattern evaluator configured to find a fine-grained topic model comprising a finite state pattern that matches the on-topic positive training examples, each finite state pattern comprising a pattern evaluable against the articles, wherein the pattern identifies such articles matching the on-topic positive training examples for the corresponding topic;
a characteristic word modeler configured to generate a coarse-grained topic model for each of the topics corresponding to a center of the topic, comprising;
a random sampling module configured to randomly select a set of the articles in the corpus, to identify a set of characteristic words in each of the randomly-selected articles, and to determine a frequency of occurrence of each of the characteristic words identified in the set of randomly-selected articles;
a selective sampling module configured to identify a set of characteristic words in each of the articles in the on-topic positive training examples, and to determine a frequency of occurrence of each of the characteristic words identified in the articles in the on-topic training examples; and
a scoring module configured to assign a score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the articles in the on-topic training examples and in the set of randomly-selected articles;
a filter module configured to filter new articles received into the corpus, comprising;
a matching module configured to match the finite state patterns to each new article;
a characteristic word evaluator configured to identify a set of characteristic words in each new article, and to determine a frequency of occurrence of each of the characteristic words identified in the each article; and
a similarity scoring module configured to assign a similarity score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the new article and in the set of randomly-selected articles; and
a display module configured to order the new articles for each of the topics, comprising;
a new article matching module configured to match the new articles to the finite state pattern of the fine-grained topic model for the topic;
a new article comparison module configured to compare, for each new article that matches the fine-grained topic model for the topic, similarity scores for each of the characteristic words identified in the new article to the scores of the corresponding characteristic words in the coarse-grained topic model for the topic; and
a display configured to display each of the new articles that was matched by the topic's fine-grained topic model and which has similarity scores close to the topic's coarse-grained topic model's characteristic word scores as candidate articles for negative training examples.
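The pattern evaluator in the claim above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a "finite state pattern" is a conjunction of words, and that candidate patterns are ranked by how few other articles they match; neither assumption comes from the claim, and the names `tokenize`, `matches`, and `find_pattern` are hypothetical.

```python
from itertools import combinations


def tokenize(text):
    """Lowercased word set for an article (a deliberately crude tokenizer)."""
    return set(text.lower().split())


def matches(pattern, text):
    """A pattern here is a tuple of words that must all occur in the article."""
    return set(pattern) <= tokenize(text)


def find_pattern(positive_examples, other_articles, max_terms=2):
    """Return the word conjunction that matches every positive example
    and the fewest of the other articles (assumed selection criterion)."""
    # Only words shared by all positive examples can match all of them.
    shared = sorted(set.intersection(*(tokenize(t) for t in positive_examples)))
    best, best_false_hits = None, None
    for size in range(1, max_terms + 1):
        for pattern in combinations(shared, size):
            false_hits = sum(matches(pattern, t) for t in other_articles)
            if best is None or false_hits < best_false_hits:
                best, best_false_hits = pattern, false_hits
    return best  # None when the positive examples share no words


positives = ["solar panel prices fall", "new solar panel factory opens"]
others = ["solar eclipse tonight", "wood panel flooring guide"]
pattern = find_pattern(positives, others)
```

The function expects at least one positive example. A real finite state pattern language would likely be richer (disjunctions, sequences, n-grams); the sketch only shows the "generate candidates, keep the one that fits the training examples" idea.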
Abstract
A computer-implemented method for providing robust topic identification in social indexes is described. Electronically-stored articles and one or more indexes are maintained. Each index includes topics that each relate to one or more of the articles. A random sampling and a selective sampling of the articles are both selected. For each topic, characteristic words included in the articles in each of the random sampling and the selective sampling are identified. Frequencies of occurrence of the characteristic words in each of the random sampling and the selective sampling are determined. A ratio of the frequencies of occurrence for the characteristic words included in the random sampling and the selective sampling is identified. Finally, for each topic, a coarse-grained topic model is built, which includes the characteristic words included in the articles relating to the topic and scores assigned to those characteristic words.
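The scoring step in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it treats every word as a candidate characteristic word, uses relative frequency as the "frequency of occurrence", takes the ratio in the direction stated in the claims (on-topic over random sample), and adds a hypothetical `floor` constant so that words missing from the random sample do not cause a division by zero.

```python
from collections import Counter


def word_frequencies(articles):
    """Relative frequency of each word across a list of article texts."""
    counts = Counter(word for text in articles for word in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def coarse_topic_model(on_topic_articles, random_articles, floor=1e-6):
    """Score each on-topic word as on-topic frequency / random-sample frequency."""
    topic_freq = word_frequencies(on_topic_articles)
    baseline_freq = word_frequencies(random_articles)
    # `floor` is an assumed smoothing constant for words absent from the random sample.
    return {
        word: freq / max(baseline_freq.get(word, 0.0), floor)
        for word, freq in topic_freq.items()
    }


model = coarse_topic_model(["solar panel solar"], ["solar news today panel"])
```

Words that are much more frequent in the on-topic articles than in the random sample receive high scores, so the resulting dictionary describes the "center" of the topic.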
22 Claims
-
1. A computer-implemented system for providing topic narrowing in interactive building of electronically-stored social indexes, comprising:
-
electronically-stored data, comprising; a corpus of articles each comprised of online textual materials; and a hierarchically-structured tree of topics; and a social indexing system, comprising; a finite state modeler comprising; a selection module configured to designate, for each of the topics, a set of the articles in the corpus as on-topic positive training examples; and a pattern evaluator configured to find a fine-grained topic model comprising a finite state pattern that matches the on-topic positive training examples, each finite state pattern comprising a pattern evaluable against the articles, wherein the pattern identifies such articles matching the on-topic positive training examples for the corresponding topic; a characteristic word modeler configured to generate a coarse-grained topic model for each of the topics corresponding to a center of the topic, comprising; a random sampling module configured to randomly select a set of the articles in the corpus, to identify a set of characteristic words in each of the randomly-selected articles, and to determine a frequency of occurrence of each of the characteristic words identified in the set of randomly-selected articles; a selective sampling module configured to identify a set of characteristic words in each of the articles in the on-topic positive training examples, and to determine a frequency of occurrence of each of the characteristic words identified in the articles in the on-topic training examples; and a scoring module configured to assign a score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the articles in the on-topic training examples and in the set of randomly-selected articles; a filter module configured to filter new articles received into the corpus, comprising; a matching module configured to match the finite state patterns to each new article; a characteristic word evaluator configured to identify a set of characteristic 
words in each new article, and to determine a frequency of occurrence of each of the characteristic words identified in the each article; and a similarity scoring module configured to assign a similarity score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the new article and in the set of randomly-selected articles; and a display module configured to order the new articles for each of the topics, comprising; a new article matching module configured to match the new articles to the finite state pattern of the fine-grained topic model for the topic; a new article comparison module configured to compare, for each new article that matches the fine-grained topic model for the topic, similarity scores for each of the characteristic words identified in the new article to the scores of the corresponding characteristic words in the coarse-grained topic model for the topic; and a display configured to display each of the new articles that was matched by the topic's fine-grained topic model and which has similarity scores close to the topic's coarse-grained topic model's characteristic word scores as candidate articles for negative training examples. (Dependent claims: 2, 3, 4, 5)
-
-
6. A computer-implemented method for providing topic narrowing in interactive building of electronically-stored social indexes, comprising:
-
accessing a corpus of articles each comprised of online textual materials; specifying a hierarchically-structured tree of topics; for each of the topics, designating a set of the articles in the corpus as on-topic positive training examples; finding a fine-grained topic model comprising a finite state pattern that matches the on-topic positive training examples, each finite state pattern comprising a pattern evaluable against the articles, wherein the pattern identifies such articles matching the on-topic positive training examples for the corresponding topic; for each of the topics, generating a coarse-grained topic model corresponding to a center of the topic comprising; randomly selecting a set of the articles in the corpus; identifying a set of characteristic words in each of the randomly-selected articles; determining a frequency of occurrence of each of the characteristic words identified in the set of randomly-selected articles; identifying a set of characteristic words in each of the articles in the on-topic positive training examples; determining a frequency of occurrence of each of the characteristic words identified in the articles in the on-topic training examples; and assigning a score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the articles in the on-topic training examples and in the set of randomly-selected articles; filtering new articles received into the corpus, comprising; matching the finite state patterns to each new article; identifying a set of characteristic words in each new article; determining a frequency of occurrence of each of the characteristic words identified in the each article; and assigning a similarity score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the new article and in the set of randomly-selected articles; and for each of the topics, ordering the new articles comprising; matching the 
new articles to the finite state pattern of the fine-grained topic model for the topic; for each new article that matches the fine-grained topic model for the topic, comparing similarity scores for each of the characteristic words identified in the new article to the scores of the corresponding characteristic words in the coarse-grained topic model for the topic; and displaying each of the new articles that was matched by the topic's fine-grained topic model and which has similarity scores close to the topic's coarse-grained topic model's characteristic word scores as candidate articles for negative training examples. (Dependent claims: 7, 8, 9, 10)
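The filtering and ordering steps of the method claim above can be sketched as below. The claim does not say how "close" is measured, so cosine similarity between the article's per-word scores and the topic model's scores is used here as an assumed metric; `threshold`, `floor`, and the `pattern_matches` callback are likewise hypothetical stand-ins.

```python
import math
from collections import Counter


def similarity_scores(article, baseline_freq, floor=1e-6):
    """Per-word ratio of the word's frequency in the article to its
    frequency in the random sample (`floor` guards unseen words)."""
    words = article.lower().split()
    counts = Counter(words)
    return {
        w: (n / len(words)) / max(baseline_freq.get(w, 0.0), floor)
        for w, n in counts.items()
    }


def closeness(article_scores, topic_model):
    """Cosine similarity between article scores and topic-model scores (assumed metric)."""
    dot = sum(s * topic_model[w] for w, s in article_scores.items() if w in topic_model)
    norm_a = math.sqrt(sum(s * s for s in article_scores.values()))
    norm_t = math.sqrt(sum(s * s for s in topic_model.values()))
    return dot / (norm_a * norm_t) if norm_a and norm_t else 0.0


def narrowing_candidates(new_articles, pattern_matches, baseline_freq, topic_model, threshold=0.5):
    """Articles matched by the pattern whose scores are close to the coarse-grained model."""
    return [
        a for a in new_articles
        if pattern_matches(a)
        and closeness(similarity_scores(a, baseline_freq), topic_model) >= threshold
    ]


baseline = {"solar": 0.25, "panel": 0.25, "news": 0.25, "today": 0.25}
topic_model = {"solar": 8 / 3, "panel": 4 / 3}
candidates = narrowing_candidates(
    ["solar panel", "solar news today"],
    lambda a: "solar" in a.lower().split(),
    baseline,
    topic_model,
    threshold=0.8,
)
```

Both sample articles match the stand-in pattern, but only the first scores close enough to the topic model to be listed as a candidate.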
-
-
11. A computer-implemented system for providing topic broadening in interactive building of electronically-stored social indexes, comprising:
-
electronically-stored data, comprising; a corpus of articles each comprised of online textual materials; and a hierarchically-structured tree of topics; and a social indexing system, comprising; a finite state modeler comprising; a selection module configured to designate, for each of the topics, a set of the articles in the corpus as on-topic positive training examples; and a pattern evaluator configured to find a fine-grained topic model comprising a finite state pattern that matches the on-topic positive training examples, each finite state pattern comprising a pattern evaluable against the articles, wherein the pattern identifies such articles matching the on-topic positive training examples for the corresponding topic; a characteristic word modeler configured to generate a coarse-grained topic model for each of the topics corresponding to a center of the topic, comprising; a random sampling module configured to randomly select a set of the articles in the corpus, to identify a set of characteristic words in each of the randomly-selected articles, and to determine a frequency of occurrence of each of the characteristic words identified in the set of randomly-selected articles; a selective sampling module configured to identify a set of characteristic words in each of the articles in the on-topic positive training examples, and to determine a frequency of occurrence of each of the characteristic words identified in the articles in the on-topic training examples; and a scoring module configured to assign a score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the articles in the on-topic training examples and in the set of randomly-selected articles; a filter module configured to filter new articles received into the corpus, comprising; a matching module configured to match the finite state patterns to each new article; a characteristic word evaluator configured to identify a set of characteristic 
words in each new article, and to determine a frequency of occurrence of each of the characteristic words identified in the each article; and a similarity scoring module configured to assign a similarity score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the new article and in the set of randomly-selected articles; and a display module configured to order the new articles for each of the topics, comprising; a new article matching module configured to match the new articles to the finite state pattern of the fine-grained topic model for the topic; a new article comparison module configured to compare, for each new article that does not match the fine-grained topic model for the topic, similarity scores for each of the characteristic words identified in the new article to the scores of the corresponding characteristic words in the coarse-grained topic model for the topic; and a display configured to display each of the new articles that was not matched by the topic's fine-grained topic model and which has similarity scores close to the topic's coarse-grained topic model's characteristic word scores as candidate articles for additional positive training examples. (Dependent claims: 12, 13, 14, 15)
-
-
16. A computer-implemented method for providing topic broadening in interactive building of electronically-stored social indexes, comprising:
-
accessing a corpus of articles each comprised of online textual materials; specifying a hierarchically-structured tree of topics; for each of the topics, designating a set of the articles in the corpus as on-topic positive training examples; finding a fine-grained topic model comprising a finite state pattern that matches the on-topic positive training examples, each finite state pattern comprising a pattern evaluable against the articles, wherein the pattern identifies such articles matching the on-topic positive training examples for the corresponding topic; for each of the topics, generating a coarse-grained topic model corresponding to a center of the topic comprising; randomly selecting a set of the articles in the corpus; identifying a set of characteristic words in each of the randomly-selected articles; determining a frequency of occurrence of each of the characteristic words identified in the set of randomly-selected articles; identifying a set of characteristic words in each of the articles in the on-topic positive training examples; determining a frequency of occurrence of each of the characteristic words identified in the articles in the on-topic training examples; and assigning a score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the articles in the on-topic training examples and in the set of randomly-selected articles; filtering new articles received into the corpus, comprising; matching the finite state patterns to each new article; identifying a set of characteristic words in each new article; determining a frequency of occurrence of each of the characteristic words identified in the each article; and assigning a similarity score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the new article and in the set of randomly-selected articles; and for each of the topics, ordering the new articles comprising; matching the 
new articles to the finite state pattern of the fine-grained topic model for the topic; for each new article that does not match the fine-grained topic model for the topic, comparing similarity scores for each of the characteristic words identified in the new article to the scores of the corresponding characteristic words in the coarse-grained topic model for the topic; and displaying each of the new articles that was not matched by the topic's fine-grained topic model and which has similarity scores close to the topic's coarse-grained topic model's characteristic word scores as candidate articles for additional positive training examples. (Dependent claims: 17, 18, 19, 20)
-
-
21. A computer-implemented system for providing robustness against noise during interactive building of electronically-stored social indexes, comprising:
-
electronically-stored data, comprising; a corpus of articles each comprised of online textual materials; and a hierarchically-structured tree of topics; and a social indexing system, comprising; a finite state modeler comprising; a selection module configured to designate, for each of the topics, a set of the articles in the corpus as on-topic positive training examples; and a pattern evaluator configured to find a fine-grained topic model comprising a finite state pattern that matches the on-topic positive training examples, each finite state pattern comprising a pattern evaluable against the articles, wherein the pattern identifies such articles matching the on-topic positive training examples for the corresponding topic; a characteristic word modeler configured to generate a coarse-grained topic model for each of the topics corresponding to a center of the topic, comprising; a random sampling module configured to randomly select a set of the articles in the corpus, to identify a set of characteristic words in each of the randomly-selected articles, and to determine a frequency of occurrence of each of the characteristic words identified in the set of randomly-selected articles; a selective sampling module configured to identify a set of characteristic words in each of the articles in the on-topic positive training examples, and to determine a frequency of occurrence of each of the characteristic words identified in the articles in the on-topic training examples; and a scoring module configured to assign a score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the articles in the on-topic training examples and in the set of randomly-selected articles; a filter module configured to filter new articles received into the corpus, comprising; a matching module configured to match the finite state patterns to each new article; a characteristic word evaluator configured to identify a set of characteristic 
words in each new article, and to determine a frequency of occurrence of each of the characteristic words identified in the each article; and a similarity scoring module configured to assign a similarity score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the new article and in the set of randomly-selected articles; and a display module configured to order the new articles for each of the topics, comprising; a new article matching module configured to match the new articles to the finite state pattern of the fine-grained topic model for the topic; a new article comparison module configured to compare, for each new article that matches the fine-grained topic model for the topic, similarity scores for each of the characteristic words identified in the new article to the scores of the corresponding characteristic words in the coarse-grained topic model for the topic; and a display configured to display each of the new articles that was matched by the topic's fine-grained topic model and which has similarity scores far from the topic's coarse-grained topic model's characteristic word scores as candidate noise articles.
-
-
22. A computer-implemented method for providing robustness against noise during interactive building of electronically-stored social indexes, comprising:
-
accessing a corpus of articles each comprised of online textual materials; specifying a hierarchically-structured tree of topics; for each of the topics, designating a set of the articles in the corpus as on-topic positive training examples; finding a fine-grained topic model comprising a finite state pattern that matches the on-topic positive training examples, each finite state pattern comprising a pattern evaluable against the articles, wherein the pattern identifies such articles matching the on-topic positive training examples for the corresponding topic; for each of the topics, generating a coarse-grained topic model corresponding to a center of the topic comprising; randomly selecting a set of the articles in the corpus; identifying a set of characteristic words in each of the randomly-selected articles; determining a frequency of occurrence of each of the characteristic words identified in the set of randomly-selected articles; identifying a set of characteristic words in each of the articles in the on-topic positive training examples; determining a frequency of occurrence of each of the characteristic words identified in the articles in the on-topic training examples; and assigning a score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the articles in the on-topic training examples and in the set of randomly-selected articles; filtering new articles received into the corpus, comprising; matching the finite state patterns to each new article; identifying a set of characteristic words in each new article; determining a frequency of occurrence of each of the characteristic words identified in the each article; and assigning a similarity score to each characteristic word as a ratio of the respective frequencies of occurrence of the characteristic word in the new article and in the set of randomly-selected articles; and for each of the topics, ordering the new articles comprising; matching the new 
articles to the finite state pattern of the fine-grained topic model for the topic; for each new article that matches the fine-grained topic model for the topic, comparing similarity scores for each of the characteristic words identified in the new article to the scores of the corresponding characteristic words in the coarse-grained topic model for the topic; and displaying each of the new articles that was matched by the topic's fine-grained topic model and which has similarity scores far from the topic's coarse-grained topic model's characteristic word scores as candidate noise articles.
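The independent claims differ only in how the fine-grained match result and the coarse-grained closeness are combined: matched and close yields candidate negative training examples (topic narrowing), not matched and close yields candidate additional positive training examples (topic broadening), and matched and far yields candidate noise articles. The sketch below shows that triage under stated assumptions: closeness is a precomputed number between 0 and 1, and the `near` and `far` cut-offs are hypothetical values, since the claims do not quantify "close" or "far".

```python
from dataclasses import dataclass


@dataclass
class ScoredArticle:
    title: str
    matched: bool     # result of the fine-grained finite state pattern
    closeness: float  # assumed 0..1 similarity to the coarse-grained model


def triage(articles, near=0.8, far=0.2):
    """Sort new articles into the three candidate lists named in the claims."""
    buckets = {
        "negative_candidates": [],  # matched and close (topic narrowing)
        "positive_candidates": [],  # not matched and close (topic broadening)
        "noise_candidates": [],     # matched and far (noise)
    }
    for article in articles:
        if article.matched and article.closeness >= near:
            buckets["negative_candidates"].append(article.title)
        elif not article.matched and article.closeness >= near:
            buckets["positive_candidates"].append(article.title)
        elif article.matched and article.closeness <= far:
            buckets["noise_candidates"].append(article.title)
    return buckets


result = triage([
    ScoredArticle("A", matched=True, closeness=0.9),
    ScoredArticle("B", matched=False, closeness=0.85),
    ScoredArticle("C", matched=True, closeness=0.1),
    ScoredArticle("D", matched=False, closeness=0.1),
])
```

Articles that fall in none of the three cases, such as "D" or a matched article with a middling closeness, are simply left out of the candidate lists in this sketch.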
-
Specification