Method and system for filtering content in a discovered topic
2 Assignments
0 Petitions
Abstract
A method of filtering content in a discovered topic. In one embodiment, the method comprises preprocessing querying data that has caused retrieval of a collection of documents. The collection includes documents containing subject matter related to the querying data as well as documents containing subject matter extraneous to it. The querying data is clustered, which enables the discovered topic to be identified. The collection of documents is then postfiltered to generate a collection of documents having the related subject matter, with the extraneous subject matter excluded.
45 Citations
22 Claims
1. A method of content filtering in a discovered topic comprising:
collecting querying data contained in a log, said querying data having caused a retrieval of a collection of documents;
preprocessing said querying data, wherein said preprocessing comprises:
cleaning said querying data;
transforming said querying data into a querying data vector; and
clustering said querying data based on said querying data vector; and
postfiltering a collection of documents, said postfiltering comprising:
collecting actual document content data, said actual document content data related to documents which were retrieved based on said querying data contained in said log;
preprocessing said actual document content data, wherein said preprocessing comprises:
cleaning said actual document content data;
transforming said actual document content data into a document content data vector; and
clustering said collection of actual documents content data based on said document content data vector;
wherein said postfiltering performs a similarity computation between said querying data cluster and said actual content data cluster to generate a collection of documents having content similar to said querying data wherein extraneous subject matter documents are excluded from said collection of documents.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
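The steps recited in claim 1 (cleaning, vector transformation, clustering, and a cluster-to-cluster similarity computation) can be illustrated with a minimal Python sketch. The claim names no particular vector representation, clustering algorithm, or similarity measure, so the term-frequency vectors, single-pass clustering, cosine similarity, the 0.3 threshold, the stopword list, and every function name below are assumptions chosen for illustration, not the patented implementation.

```python
import math
import re
from collections import Counter

# Illustrative stopword list (an assumption, not taken from the patent).
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "to"}


def clean(text):
    """Cleaning step: lowercase, keep alphanumeric tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def vectorize(text):
    """Transforming step: a sparse term-frequency vector."""
    return Counter(clean(text))


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.3):
    """Single-pass clustering; returns a list of (centroid, member_indices)."""
    clusters = []
    for i, vec in enumerate(vectors):
        best = max(clusters, key=lambda c: cosine(c[0], vec), default=None)
        if best is not None and cosine(best[0], vec) >= threshold:
            best[0].update(vec)  # centroid kept as summed term counts
            best[1].append(i)
        else:
            clusters.append((Counter(vec), [i]))
    return clusters


def postfilter(queries, documents, threshold=0.3):
    """Keep documents whose cluster is similar to the dominant query cluster."""
    query_clusters = cluster([vectorize(q) for q in queries], threshold)
    doc_clusters = cluster([vectorize(d) for d in documents], threshold)
    # Assumption: the largest query cluster stands in for the discovered topic.
    topic_centroid = max(query_clusters, key=lambda c: len(c[1]))[0]
    kept = []
    for centroid, members in doc_clusters:
        if cosine(topic_centroid, centroid) >= threshold:
            kept.extend(members)
    return [documents[i] for i in sorted(kept)]
```

With a log of car-related "jaguar" queries, a document about the animal falls into its own cluster, whose similarity to the query cluster is below the threshold, so it is excluded as extraneous subject matter.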
11. In a web based environment, a method of topic discovery and content filtering comprising:
receiving a query causing a retrieval of a collection of documents related to said query, said collection of documents comprising documents having content comprising extraneous subject matter and documents having content comprising desired subject matter;
storing said query in a log;
collecting query data contained in said log;
preprocessing query data, said query data information pertaining to said query, wherein said preprocessing comprises:
cleaning said query data;
transforming said query data into a query data vector; and
clustering said query data enabling discovery of a topic relative to said query; and
postfiltering said retrieved collection of documents, said postfiltering comprising:
collecting actual document content data, said actual document content data related to documents which were retrieved based on said querying data contained in said log;
preprocessing said actual document content data, wherein said preprocessing comprises:
cleaning said actual document content data;
transforming said actual document content data into a document content data vector; and
clustering said collection of actual documents content data based on said document content data vector;
wherein said postfiltering generates a collection of documents having content comprising said desired subject matter relative to a discovered topic.
Dependent claims: 12, 13, 14, 15, 16, 17, 18
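The query-side portion of claim 11 (storing queries in a log, collecting them, and clustering them so that a topic can be discovered) might look roughly like the following Python sketch. The `QueryLog` class, the use of binary term sets as a stand-in for the query data vector, Jaccard overlap, the 0.25 threshold, and labeling a topic by its most frequent terms are all hypothetical choices made for illustration; the claim does not specify any of them.

```python
import re
from collections import Counter


class QueryLog:
    """Hypothetical in-memory log; a real system would persist its entries."""

    def __init__(self):
        self._entries = []

    def store(self, query):
        self._entries.append(query)

    def collect(self):
        return list(self._entries)


def tokens(query):
    """Cleaned query as a set of lowercase terms (a binary term vector)."""
    return set(re.findall(r"[a-z0-9]+", query.lower()))


def jaccard(a, b):
    """Overlap between two term sets, in the range 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_queries(queries, threshold=0.25):
    """Greedy grouping: a query joins the first cluster whose seed it overlaps."""
    clusters = []  # each cluster is a list of queries; the first is its seed
    for q in queries:
        for members in clusters:
            if jaccard(tokens(members[0]), tokens(q)) >= threshold:
                members.append(q)
                break
        else:
            clusters.append([q])
    return clusters


def discover_topic(members, top_n=2):
    """Label a cluster with its most frequent terms (ties broken alphabetically)."""
    counts = Counter(t for q in members for t in tokens(q))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]
```

Each resulting cluster groups queries that share vocabulary, and the top terms of a cluster serve here as a simple label for the discovered topic.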
19. A computer system comprising:
a bus;
a display device coupled to said bus;
a storage device coupled to said bus; and
a processor coupled to said bus, said processor for:
collecting querying data contained in a log;
preprocessing said querying data, wherein said preprocessing comprises:
cleaning said querying data;
transforming said querying data into a querying data vector; and
clustering said querying data based on said querying data vector; and
postfiltering a collection of documents, said postfiltering comprising:
collecting actual document content data, said actual document content data related to documents which were retrieved based on said querying data contained in said log;
preprocessing said actual document content data, wherein said preprocessing comprises:
cleaning said actual document content data;
transforming said actual document content data into a document content data vector; and
clustering said collection of actual documents content data based on said document content data vector;
wherein said postfiltering performs a similarity computation between said querying data cluster and said actual content data cluster to generate a collection of documents having content similar to said querying data wherein extraneous subject matter documents are excluded from said collection of documents; and
labeling said collection of documents in accordance with a metric based upon document similarity, said metric used to measure cohesion between said documents in said collection of documents, wherein a high measure of cohesion indicates a document containing subject matter relative to a topic, said topic displayed to a user via said display device.
Dependent claims: 20, 21, 22
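The labeling step of claim 19 relies on a cohesion metric based on document similarity. One plausible reading, shown below as a Python sketch, is the mean pairwise cosine similarity of the documents in a collection. The claim does not define the metric, so this formula, the 0.5 threshold, the label strings, and the function names are assumptions for illustration only.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tf_vector(text):
    """Sparse term-frequency vector over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def cohesion(documents):
    """Mean pairwise cosine similarity; 1.0 by convention for fewer than two documents."""
    vectors = [tf_vector(d) for d in documents]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def label(documents, threshold=0.5):
    """Return a (label, score) pair; a high score suggests a single shared topic."""
    score = cohesion(documents)
    return ("on-topic" if score >= threshold else "mixed"), score
```

For example, `label(["solar panel install cost", "solar panel cost guide"])` scores the pair at 0.75 and labels it on-topic under the assumed threshold, while two documents with no shared terms score 0.0.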
Specification