System and method for providing multi-core and multi-level topical organization in social indexes
First Claim
1. A computer-implemented system for providing multi-core topic indexing in electronically-stored social indexes, comprising:
a storage device comprising:
a corpus of articles each comprised of online textual materials and a tree of topics;
a finite state pattern for each topic, each finite state pattern defining a fine-grained topic model that is used to identify the articles that are potentially on-topic; and
on-topic training examples and off-topic training examples from the articles for each topic;
one or more of distinct core meanings for the topic by assigning at least one of the on-topic training examples and the off-topic training examples;
a set of average on-topic articles, comprising:
a training module configured to provide a set of random training examples from the corpus;
a match module configured to match the set of random training examples to the finite state pattern for the topic;
an off-topic elimination module configured to eliminate an article that is similar to the off-topic training examples; and
an on-topic addition module configured to add the on-topic training examples into the set of the random training examples; and
an average on-topic core meaning based on the set of the average on-topic articles;
a social indexing system, comprising:
a characteristic words selector configured to specify characteristic words for each of the on-topic training examples, the off-topic training examples, and the set of average on-topic articles, and to assign scores to the characteristic words that were specified for the on-topic training examples, off-topic training examples, and the set of average on-topic articles;
a characteristic words organizer configured to specify on-topic characteristic word term vectors, each on-topic characteristic word term vector comprising the scores of the characteristic words that were specified for each topic for each of the on-topic training examples;
a characteristic words scorer configured to specify off-topic characteristic word term vectors, each off-topic characteristic word term vector comprising the scores of the characteristic words that were specified for each topic for each of the off-topic training examples;
a characteristic words specifier configured to specify average on-topic characteristic word term vectors, each average on-topic characteristic word term vector comprising the scores of the characteristic words that were specified for each topic for the set of average on-topic articles;
an information collector configured to obtain a new article;
a finite state pattern matcher configured to match the new article to the finite state pattern of each of the topics to designate the new article as a candidate article for each topic to which the finite state pattern was matched;
a candidate article characteristic words selector configured to specify characteristic words extracted from the candidate article;
a candidate article characteristic words scorer configured to assign candidate article scores to the characteristic words of the candidate article;
a topic comparer configured to compare the candidate article scores to the off-topic characteristic word term vectors of each topic and to form an off-topic score for each topic, and to discard the candidate article as off-topic for each topic in which the off-topic score for that topic exceeds an off-topic threshold; and
a similarity score comparer configured to compare the candidate article scores to the on-topic characteristic word term vectors and the average on-topic characteristic word term vectors of each topic and to form an on-topic score for each topic and configured to select only the candidate articles as candidate on-topic articles which the on-topic score for that topic exceeds an on-topic threshold; and
a display configured to present the candidate on-topic articles.
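The two-stage comparison recited at the end of the claim (discard on a high off-topic score, then keep on a high on-topic score) can be sketched as follows. This is only an illustrative sketch: the claim does not name a scoring or similarity measure, so raw term frequency and cosine similarity are assumptions here, as are the function names `term_vector`, `cosine`, and `is_on_topic` and the default threshold values.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a characteristic-word term vector.

    Assumption: each word is scored by its raw term frequency.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term vectors (assumed similarity measure)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_on_topic(candidate, on_vectors, off_vectors, avg_vector,
                off_threshold=0.6, on_threshold=0.3):
    """Apply the off-topic discard step, then the on-topic selection step."""
    cand = term_vector(candidate)
    # Off-topic score: closest match among the off-topic term vectors.
    off_score = max((cosine(cand, v) for v in off_vectors), default=0.0)
    if off_score > off_threshold:
        return False  # discarded as off-topic
    # On-topic score: closest match among on-topic and average on-topic vectors.
    on_score = max(cosine(cand, v) for v in list(on_vectors) + [avg_vector])
    return on_score > on_threshold


on_vectors = [term_vector("housing market prices fall")]
off_vectors = [term_vector("toy doll house for kids")]
avg_vector = term_vector("home prices and mortgage rates")

print(is_on_topic("housing prices rise in the market",
                  on_vectors, off_vectors, avg_vector))
print(is_on_topic("doll house toy for kids sale",
                  on_vectors, off_vectors, avg_vector))
```

Taking the maximum over several on-topic vectors is one way to let a topic have more than one core meaning: a candidate only needs to resemble one of them.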
Abstract
A computer-implemented method affords multi-core and multi-level topical organization in social indexes. A corpus of articles is accessed. Each article includes online textual materials. A finite state pattern for a topic filters the articles as candidate articles, which are potentially on-topic. Similarity-based representations are formed for on-topic and off-topic core meanings of the topic. An aggregate score for each of the candidate articles is determined using the similarity-based representations to indicate whether the candidate article is sufficiently on-topic. The candidate articles are presented ordered by their aggregate scores. In a further embodiment, a hierarchy of topics is used to guide the presentation of articles from subtopics, with considerations of fairness of subtopic coverage, elimination of similarity-duplicates in articles, and article freshness.
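As a rough illustration of the flow the abstract describes (pattern filter, aggregate score, ordered presentation), the sketch below uses a Python regular expression as a stand-in for the finite state pattern and a simple word-overlap difference as a stand-in for the aggregate score. Both choices, along with the names `aggregate_score`, `rank_candidates`, and `min_score`, are assumptions made for the example and are not taken from the patent.

```python
import re


def aggregate_score(article, on_words, off_words):
    """Hypothetical aggregate score: on-topic word hits minus off-topic word hits."""
    words = set(re.findall(r"[a-z]+", article.lower()))
    return len(words & on_words) - len(words & off_words)


def rank_candidates(articles, pattern, on_words, off_words, min_score=1):
    """Filter articles with the pattern, score them, and order by score."""
    matcher = re.compile(pattern, re.IGNORECASE)
    # Step 1: the pattern selects candidate articles that are potentially on-topic.
    candidates = [a for a in articles if matcher.search(a)]
    # Step 2: score each candidate and keep those that are sufficiently on-topic.
    scored = [(aggregate_score(a, on_words, off_words), a) for a in candidates]
    kept = [(s, a) for s, a in scored if s >= min_score]
    # Step 3: present candidates ordered by aggregate score, highest first.
    return [a for s, a in sorted(kept, key=lambda p: p[0], reverse=True)]


articles = [
    "Jaguar car recall",
    "Jaguar spotted in rainforest",
    "Stock markets fall",
    "Jaguar unveils new electric car engine",
]
ranked = rank_candidates(
    articles,
    pattern=r"\bjaguar\b",
    on_words={"car", "engine", "electric", "recall"},
    off_words={"rainforest", "cat"},
)
print(ranked)
```

The pattern alone cannot separate the car maker from the animal; the scoring step is what removes the rainforest article.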
8 Claims
2. A computer-implemented method for providing multi-core topic indexing in electronically-stored social indexes, comprising:
accessing a corpus of articles each comprised of online textual materials and a tree of topics;
providing a finite state pattern for each topic, each finite state pattern defining a fine-grained topic model that is used to identify the articles that are potentially on-topic;
providing on-topic training examples and off-topic training examples from the articles for each topic;
defining one or more of distinct core meanings for the topic by assigning at least one of the on-topic training examples and the off-topic training examples;
obtaining a set of average on-topic articles, comprising:
providing a set of random training examples from the corpus;
matching the set of random training examples to the finite state pattern for the topic;
eliminating an article in the set of random training examples that is similar to the off-topic training examples; and
adding the on-topic training examples into the set of the random training examples;
defining an average on-topic core meaning based on the set of average on-topic articles;
specifying characteristic words for each of the on-topic training examples, off-topic training examples, and the set of average on-topic articles;
assigning scores to the characteristic words that were specified for the on-topic training examples, off-topic training examples, and the set of average on-topic articles;
specifying on-topic characteristic word term vectors, each on-topic characteristic word term vector comprising the scores of the characteristic words that were specified for each topic for each of the on-topic training examples;
specifying off-topic characteristic word term vectors, each off-topic characteristic word term vector comprising the scores of the characteristic words that were specified for each topic for each of the off-topic training examples;
specifying average on-topic characteristic word term vectors, each average on-topic characteristic word term vector comprising the scores of the characteristic words that were specified for each topic for the set of average on-topic articles;
obtaining a new article;
matching the new article to the finite state pattern of each of the topics and designating the new article as a candidate article for each topic to which the finite state pattern was matched;
specifying characteristic words extracted from the candidate article;
assigning candidate article scores to the characteristic words of the candidate article;
comparing the candidate article scores to the off-topic characteristic word term vectors of each topic and forming an off-topic score for each topic;
discarding the candidate article as off-topic for each topic in which the off-topic score for that topic exceeds an off-topic threshold;
comparing the candidate article scores to the on-topic characteristic word term vectors and the average on-topic characteristic word term vectors of each topic and forming an on-topic score for each topic and selecting only the candidate articles as candidate on-topic articles which the on-topic score for that topic exceeds an on-topic threshold; and
presenting the candidate on-topic articles.
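The sub-steps for obtaining the set of average on-topic articles (random sample, pattern match, elimination of articles similar to off-topic examples, addition of on-topic examples) might look like the sketch below. It is a minimal illustration under stated assumptions: a regular expression stands in for the finite state pattern, Jaccard word overlap stands in for the unspecified similarity test, and `build_average_on_topic_set`, `max_off_similarity`, and `seed` are hypothetical names introduced for the example.

```python
import random
import re


def words(text):
    """Lowercased word set of an article."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard word overlap (assumed similarity measure)."""
    wa, wb = words(a), words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def build_average_on_topic_set(corpus, pattern, on_examples, off_examples,
                               sample_size, max_off_similarity=0.5, seed=0):
    rng = random.Random(seed)
    # 1. Provide a set of random training examples from the corpus.
    sample = rng.sample(corpus, min(sample_size, len(corpus)))
    # 2. Match the random examples to the topic's pattern.
    matcher = re.compile(pattern, re.IGNORECASE)
    matched = [a for a in sample if matcher.search(a)]
    # 3. Eliminate articles that are similar to any off-topic training example.
    kept = [a for a in matched
            if all(jaccard(a, off) < max_off_similarity for off in off_examples)]
    # 4. Add the on-topic training examples into the set.
    return kept + [e for e in on_examples if e not in kept]


corpus = [
    "Jaguar car sales rise",
    "Jaguar cat seen in jungle",
    "Weather is sunny",
    "Jaguar engine recall announced",
]
average_set = build_average_on_topic_set(
    corpus,
    pattern=r"\bjaguar\b",
    on_examples=["Jaguar unveils electric car"],
    off_examples=["Jaguar cat seen in the jungle"],
    sample_size=4,
)
print(sorted(average_set))
```

A term vector built from the resulting set can then serve as the average on-topic core meaning used in the later comparison steps.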
Specification