Method and system for filtering content in a discovered topic

US 7,146,359 B2
Filed: 05/03/2002
Issued: 12/05/2006
Est. Priority Date: 05/03/2002
Status: Expired due to Fees


2 Assignments

0 Petitions

Abstract

A method of filtering content in a discovered topic. In one embodiment, the method comprises preprocessing querying data that has caused retrieval of a collection of documents. The collection includes documents containing subject matter related to the querying data as well as documents containing subject matter extraneous to it. The querying data is clustered, which enables the discovered topic to be identified. The collection of documents is then postfiltered to generate a collection of documents having the related subject matter, with the extraneous subject matter excluded.
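The clustering step described in the abstract can be illustrated with a minimal sketch. The patent record here does not name a clustering algorithm, so the single-pass threshold clustering, the bag-of-words vectors, the `threshold` value, and the helper names (`to_vector`, `cluster_queries`, `topic_label`) are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter

def to_vector(text):
    """Lowercase the text, split on non-alphanumerics, and count terms."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster_queries(queries, threshold=0.3):
    """Single-pass clustering (an assumed stand-in for 'clustering software'):
    each query joins its most similar cluster, or starts a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [str]}
    for q in queries:
        vec = to_vector(q)
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [q]})
        else:
            best["centroid"].update(vec)  # summed counts act as an unnormalized centroid
            best["members"].append(q)
    return clusters

def topic_label(cluster, n=2):
    """Describe a cluster by its n most frequent terms (a hypothetical labeling rule)."""
    return [term for term, _ in cluster["centroid"].most_common(n)]

# Hypothetical search-log entries.
queries = [
    "printer driver install",
    "install printer driver windows",
    "laptop battery replacement",
    "replace laptop battery",
    "printer driver download",
]
clusters = cluster_queries(queries)
for c in clusters:
    print(topic_label(c), c["members"])
```

In this toy log the printer queries and the battery queries fall into separate clusters, and the most frequent terms of each cluster serve as a rough stand-in for the discovered topic.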

45 Citations

22 Claims

1. A method of content filtering in a discovered topic comprising:
- collecting querying data contained in a log said querying data having caused a retrieval of a collection of documents;
- preprocessing said querying data, wherein said preprocessing comprises;
  - cleaning said querying data;
  - transforming said querying data into a querying data vector; and
  - clustering said querying data based on said querying data vector; and
- postfiltering a collection of documents, said postfiltering comprising;
  - collecting actual document content data, said actual document content data related to documents which were retrieved based on said querying data contained in said log;
  - preprocessing said actual document content data, wherein said preprocessing comprises;
    - cleaning said actual document content data;
    - transforming said actual document content data into a document content data vector; and
    - clustering said collection of actual documents content data based on said document content data vector;
  - wherein said postfiltering performs a similarity computation between said querying data cluster and said actual content data cluster to generate a collection of documents having content similar to said querying data wherein extraneous subject matter documents are excluded from said collection of documents.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10):
  - 2. The method as recited in claim 1 further comprising labeling said collection of documents having subject matter similar to said querying data.
  - 3. The method as recited in claim 1 wherein said preprocessing said querying data comprises collecting information regarding said querying data for utilization in determining said discovered topic, said querying data retrieved from a plurality of search logs.
  - 4. The method as recited in claim 1 further comprising transforming each of said querying data into a matrix of query based numerical vectors.
  - 5. The method as recited in claim 1 wherein said clustering comprises application of clustering software to query based document vectors for the subsequent grouping thereof, and wherein said query based document vectors comprise the input of said clustering software, and wherein the output is a cluster and wherein said cluster is indicative of said collection of documents.
  - 6. The method as recited in claim 1 wherein said postfiltering comprises collecting said contents of said documents retrieved by said querying data from a database.
  - 7. The method as recited in claim 1 further comprising cleaning said contents of said documents.
  - 8. The method as recited in claim 7 further comprising transforming each of said cleaned documents into a matrix of content based document vectors.
  - 9. The method as recited in claim 8 further comprising clustering said content based document vectors and wherein the similarity of each of said documents within said cluster is determined.
  - 10. The method as recited in claim 9 wherein the determined similarity of each of said documents in said cluster is utilized as a metric to measure cohesion between said documents, and wherein a high measure of cohesion is indicative of a document containing said preferred subject matter, and wherein a low measure of cohesion is indicative of a document containing said extraneous subject matter.
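The cleaning and vector-transformation steps recited in claims 1, 4, 7 and 8 can be sketched as follows. The claims do not specify how cleaning or vectorization is done, so the stop-word list, the term-count weighting, and the helper names `clean` and `to_matrix` are assumptions chosen only to make the idea concrete. The same routine could be applied to query strings or to document contents.

```python
import re

# Hypothetical stop-word list; a real system would likely use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "to", "for", "in", "on", "and", "how"}

def clean(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def to_matrix(texts):
    """Return (vocabulary, matrix); each matrix row is one text's term-count vector."""
    cleaned = [clean(t) for t in texts]
    vocab = sorted({tok for toks in cleaned for tok in toks})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = []
    for toks in cleaned:
        row = [0] * len(vocab)
        for tok in toks:
            row[index[tok]] += 1
        matrix.append(row)
    return vocab, matrix

# Hypothetical entries from a search log.
log = ["How to install the printer driver?", "Printer driver for Windows"]
vocab, matrix = to_matrix(log)
print(vocab)
for row in matrix:
    print(row)
```

Each row of the resulting matrix is one numerical vector, and the rows together form the kind of matrix that a clustering step could take as input.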

11. In a web based environment, a method of topic discovery and content filtering comprising:
- receiving a query causing a retrieval of a collection of documents related to said query, said collection of documents comprising documents having content comprising extraneous subject matter and documents having content comprising desired subject matter;
- storing said query in a log;
- collecting query data contained in said log;
- preprocessing query data, said query data information pertaining to said query, wherein said preprocessing comprises;
  - cleaning said query data;
  - transforming said query data into a query data vector; and
  - clustering said query data enabling discovery of a topic relative to said query; and
- postfiltering said retrieved collection of documents, said postfiltering comprising;
  - collecting actual document content data, said actual document content data related to documents which were retrieved based on said querying data contained in said log;
  - preprocessing said actual document content data, wherein said preprocessing comprises;
    - cleaning said actual document content data;
    - transforming said actual document content data into a document content data vector; and
    - clustering said collection of actual documents content data based on said document content data vector;
  - wherein said postfiltering generates a collection of documents having content comprising said desired subject matter relative to a discovered topic.
- Dependent claims (12, 13, 14, 15, 16, 17, 18):
  - 12. The method as recited in claim 11 further comprising labeling said collection of documents in accordance with a metric based upon document similarity, said metric used to measure cohesion between said documents, wherein a high measure of cohesion indicates a document containing said desired subject matter related to said discovered topic.
  - 13. The method as recited in claim 11 wherein said preprocessing said query data comprises collecting data of said query, said data stored in a search log.
  - 14. The method as recited in claim 11 further comprising performing a vector transformation upon said query data, resulting in a matrix of query based document vectors.
  - 15. The method as recited in claim 11 wherein said clustering of said query data comprises applying clustering software to said query based document vectors.
  - 16. The method as recited in claim 11 wherein said postfiltering said retrieved documents comprises collecting the contents of said retrieved collection of documents.
  - 17. The method as recited in claim 11 wherein said postfiltering said retrieved collection of documents further comprises a vector transformation applied to the contents of said documents of said retrieved collection of documents.
  - 18. The method as recited in claim 11 wherein said postfiltering said retrieved collection of documents further comprises clustering said content based document vectors, wherein a cluster of said content based document vectors enables similarity determination of each document of said retrieved collection of documents within said cluster, and wherein a high measure of similarity is indicative of a document containing said preferred subject matter, and wherein a low measure of similarity is indicative of a document containing said extraneous subject matter.
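Claims 12 and 18 describe using a similarity-based cohesion measure to separate desired documents from extraneous ones within a cluster. The claims do not define the metric, so the sketch below assumes one possible choice: the cosine similarity of each document vector to the cluster centroid, compared against a hypothetical `threshold`. The function names and the sample vectors are illustrative only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def label_by_cohesion(doc_vectors, threshold=0.5):
    """Label each document by its similarity to the cluster centroid.
    High similarity is treated as desired, low similarity as extraneous."""
    center = centroid(list(doc_vectors.values()))
    labels = {}
    for doc_id, vec in doc_vectors.items():
        score = cosine(vec, center)
        labels[doc_id] = ("desired" if score >= threshold else "extraneous", score)
    return labels

# Hypothetical content-based document vectors for one cluster.
cluster = {
    "doc_a": [3, 2, 0, 0],
    "doc_b": [2, 3, 0, 0],
    "doc_c": [3, 3, 1, 0],
    "doc_d": [0, 0, 0, 4],  # shares no terms with the other documents
}
labels = label_by_cohesion(cluster)
for doc_id, (label, score) in labels.items():
    print(doc_id, label, round(score, 3))
```

Here `doc_d` shares no terms with the rest of the cluster, so its similarity to the centroid is low and it is labeled extraneous. A pairwise average similarity between documents would be another reasonable reading of "cohesion".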

19. A computer system comprising:
- a bus;
- a display device coupled to said bus;
- a storage device coupled to said bus; and
- a processor coupled to said bus, said processor for;
  - collecting querying data contained in a log;
  - preprocessing said querying data, wherein said preprocessing comprises;
    - cleaning said querying data;
    - transforming said querying data into a querying data vector; and
    - clustering said querying data based on said querying data vector; and
  - postfiltering a collection of documents, said postfiltering comprising;
    - collecting actual document content data, said actual document content data related to documents which were retrieved based on said querying data contained in said log;
    - preprocessing said actual document content data, wherein said preprocessing comprises;
      - cleaning said actual document content data;
      - transforming said actual document content data into a document content data vector; and
      - clustering said collection of actual documents content data based on said document content data vector;
    - wherein said postfiltering performs a similarity computation between said querying data cluster and said actual content data cluster to generate a collection of documents having content similar to said querying data wherein extraneous subject matter documents are excluded from said collection of documents; and
  - labeling said collection of documents in accordance with a metric based upon document similarity, said metric used to measure cohesion between said documents in said collection of documents, wherein a high measure of cohesion indicates a document containing subject matter relative to said a topic, said topic displayed to a user via said display device.
- Dependent claims (20, 21, 22):
  - 20. The computer system of claim 19 further comprises;
    - performing a vector transformation upon said data, resulting in a matrix of query based document vectors.
  - 21. The computer system of claim 20 wherein said clustering of said query data comprises applying clustering software to said query data vectors.
  - 22. The computer system of claim 19 wherein said postfiltering said retrieved documents comprises:
    - performing a vector transformation upon said contents of each of said documents of said retrieved collection of documents, wherein content based document vectors are used to determine content similarity of said retrieved documents; and
    - clustering said content based document vectors, wherein a cluster of content based document vectors enables determining the similarity of each of said retrieved document within said cluster, and wherein a high measure of similarity is indicative of a document containing said preferred subject matter, and wherein a low measure of similarity is indicative of a document containing said extraneous subject matter.
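The similarity computation between a querying data cluster and document content clusters, recited in claims 1 and 19, might look roughly like the sketch below. It assumes that each cluster is summarized by summed term counts, that similarity is cosine similarity, and that a hypothetical `threshold` decides which document clusters are kept. None of these choices comes from the claims, and the clusters are given as inputs instead of being computed.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def centroid(texts):
    """Sum term counts over all texts in a cluster (an unnormalized centroid)."""
    total = Counter()
    for text in texts:
        total.update(text.lower().split())
    return total

def postfilter(query_cluster, doc_clusters, threshold=0.3):
    """Keep the documents of every content cluster that is similar to the query cluster."""
    query_center = centroid(query_cluster)
    kept = []
    for docs in doc_clusters:
        if cosine(query_center, centroid(docs.values())) >= threshold:
            kept.extend(docs)  # extends with the document ids (dict keys)
    return kept

# Hypothetical, already-cleaned query cluster and document content clusters.
query_cluster = ["printer driver install", "printer driver download"]
doc_clusters = [
    {"d1": "download the latest printer driver",
     "d2": "install a printer driver on windows"},
    {"d3": "printer paper sale this week",
     "d4": "discount paper and ink sale"},
]
kept = postfilter(query_cluster, doc_clusters)
print(kept)
```

In this example the second content cluster overlaps the query cluster only on the word "printer", so its similarity falls below the threshold and its documents are excluded as extraneous.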

Current Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Castellanos, Maria G.
Primary Examiner(s): Gaffin, Jeffrey
Assistant Examiner(s): Mahmoudi, Hassan
Application Number: US10/138,950
Publication Number: US 20030208485A1
Time in Patent Office: 1,677 Days
Field of Search: 707/3, 707/4, 707/5, 707/6, 707/10, 707/104.1, 717/11, 715/500
US Class Current: 1/1
CPC Class Codes:
G06F 16/3326   using relevance feedback fr...
Y10S 707/99931   Database or file accessing
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99935   Query augmenting and refini...
