Document clustering method and system
Abstract
A document clustering method and system that use both log-based clustering and content-based clustering are disclosed. The method includes generating log-based document clusters and combining vectors from the log-based document clusters with vectors of individual documents for content-based clustering analysis. The log-based document clusters are generated by accessing the retrieval session logs, clustering the retrieval sessions into session clusters, and combining the documents opened during any session of each session cluster.
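As an illustration only, the hybrid-matrix construction described here can be sketched in a few lines of Python. The function names (`term_vector`, `build_hybrid_matrix`), the plain term-frequency weighting, and the rule that the first session cluster wins when a document appears in more than one cluster are assumptions made for this sketch; the patent text does not specify them.

```python
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts; a stand-in for any term-weighting scheme."""
    return Counter(text.lower().split())


def build_hybrid_matrix(documents, session_clusters):
    """Return {doc_id: term vector}, i.e. the rows of a hybrid matrix.

    documents: dict mapping doc_id -> document text.
    session_clusters: list of session clusters; each cluster is a list of
    sessions, and each session is the set of doc_ids opened during it.
    """
    rows = {}
    for cluster in session_clusters:
        # Log-based document cluster: every document opened in any session.
        opened = sorted(set().union(*cluster))
        # Concatenate the member documents and build one vector for the cluster.
        cluster_vector = term_vector(" ".join(documents[d] for d in opened))
        for d in opened:
            # Each member document is represented by the cluster vector.
            # Assumption: on overlap between clusters, the first cluster wins.
            rows.setdefault(d, cluster_vector)
    # Documents never opened in any session keep an individual vector.
    for d, text in documents.items():
        if d not in rows:
            rows[d] = term_vector(text)
    return rows
```

The resulting rows can then be passed to any content-based clustering routine; documents that were opened together in related sessions share one vector and therefore cluster together.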
4 Claims
1. A method for clustering documents comprising:
generating a hybrid matrix of vectors comprising a first vector representing a first document and a second vector representing a log-based document cluster; and
clustering the documents using the hybrid matrix, wherein the step of generating the hybrid matrix comprises:
accessing retrieval session logs;
clustering retrieval sessions into session clusters;
generating, for each session cluster, a log-based document cluster by combining all documents opened during any retrieval session of the session cluster;
generating a log-based document cluster vector for each of the log-based document clusters;
replacing each document in the log-based document cluster with the log-based document cluster vector;
generating an individual document vector for each document not opened during any retrieval session; and
combining the log-based document cluster vector and the individual document cluster vector, wherein the log-based document cluster is formed by concatenating the documents to be combined.
generating a Boolean session vector for each retrieval session;
forming a matrix of the Boolean session vectors; and
applying a clustering algorithm to the matrix of the Boolean session vectors.
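The three session-clustering steps above (a Boolean vector per session, a matrix of those vectors, and a clustering algorithm applied to the matrix) can be sketched as follows. The claim does not name a particular algorithm, so the single-link grouping by Jaccard similarity, the `threshold` parameter, and all function names here are assumptions chosen for illustration.

```python
def boolean_session_vectors(sessions, doc_ids):
    """One 0/1 row per session: 1 where the document was opened in it."""
    return [[1 if d in s else 0 for d in doc_ids] for s in sessions]


def jaccard(u, v):
    """Jaccard similarity of two 0/1 vectors of equal length."""
    both = sum(a & b for a, b in zip(u, v))
    either = sum(a | b for a, b in zip(u, v))
    return both / either if either else 0.0


def cluster_sessions(matrix, threshold=0.5):
    """Single-link grouping: sessions joined by similarity >= threshold
    end up with the same label (assumed algorithm, not from the patent)."""
    labels = list(range(len(matrix)))
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):
            if labels[i] != labels[j] and jaccard(matrix[i], matrix[j]) >= threshold:
                old, new = labels[j], labels[i]
                # Merge the two groups by relabelling every member of one.
                labels = [new if label == old else label for label in labels]
    return labels
```

Sessions that receive the same label form one session cluster, whose opened documents can then be combined into a log-based document cluster.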
Specification