Method and apparatus for information access employing overlapping clusters
Abstract
The present invention is a method and apparatus for document clustering-based browsing of a corpus of documents and, more particularly, relates to the use of overlapping clusters to improve recall. The present invention is directed to improving the performance of information access methods and apparatus through the use of non-disjoint (overlapped) clustering operations. The present invention is further described in terms of two possible methods for expanding document clusters so as to achieve the overlap, and a method for increasing precision through the use of the overlapped clusters.
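As a rough illustration of what non-disjoint clustering means, the sketch below lets a document join every cluster whose centroid is nearly as similar to it as the best-matching one. This is not one of the two expansion methods the abstract refers to, which are not detailed in this section; the `slack` threshold rule, the function names, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def expand_clusters(docs, centroids, slack=0.8):
    """Put each document in its best cluster and in any other cluster whose
    similarity is at least `slack` times the best (hypothetical rule)."""
    clusters = [set() for _ in centroids]
    for doc_id, text in docs.items():
        vec = Counter(text.lower().split())
        sims = [cosine(vec, c) for c in centroids]
        best = max(sims)
        for i, s in enumerate(sims):
            if best > 0 and s >= slack * best:
                clusters[i].add(doc_id)
    return clusters


# Toy corpus: d3 is about both topics, so it should land in both clusters.
docs = {
    "d1": "bank river water",
    "d2": "bank money loan",
    "d3": "river bank money",
}
centroids = [
    Counter("river water bank".split()),
    Counter("money loan bank".split()),
]
clusters = expand_clusters(docs, centroids)
```

With a disjoint partition, `d3` would have to be placed in only one of the two clusters, and a user browsing the other cluster would miss it. Allowing the overlap is what the abstract credits with improving recall.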
16 Claims
1. A method, operating in a digital computer, for searching a corpus of unclustered documents, comprising the steps of:
preparing, in response to a query, an initial structuring of the unclustered corpus into a plurality of primary overlapping clusters, wherein at least two of the plurality of primary overlapping clusters contain a document in common; and
determining a summary of the plurality of primary overlapping clusters prepared by said initial structuring of the corpus.
Dependent claims: 2, 3, 4, 5, 6, 7
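The "determining a summary" step of claim 1 can be pictured with a minimal sketch. The claim does not say what form the summary takes, so the choice here (cluster size plus the most frequent terms) is an assumption, as are the `summarize` function and the sample data.

```python
from collections import Counter


def summarize(clusters, docs, top_k=2):
    """Return, for each cluster, its size and its most frequent terms."""
    summaries = []
    for members in clusters:
        counts = Counter()
        for doc_id in sorted(members):
            counts.update(docs[doc_id].lower().split())
        # Sort by descending count, then alphabetically, for a stable result.
        terms = sorted(counts, key=lambda t: (-counts[t], t))[:top_k]
        summaries.append({"size": len(members), "terms": terms})
    return summaries


docs = {"d1": "river water river", "d2": "loan money loan", "d3": "river loan"}
clusters = [{"d1", "d3"}, {"d2", "d3"}]  # d3 is shared, so the clusters overlap
summaries = summarize(clusters, docs)
```

Because `d3` belongs to both clusters, its terms contribute to both summaries.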
8. A document browsing system for use with a corpus of unclustered documents stored in a computer system, the document browsing system comprising:
program memory for storing executable program code therein;
a processor, operating in response to the executable program stored in said program memory, for automatically preparing, in response to a query, an initial structuring of the corpus of unclustered documents into a plurality of document clusters, wherein at least two of the plurality of document clusters overlap and contain at least one common document therebetween;
data memory for storing data identifying the documents associated with each of the plurality of document clusters;
memory access means for accessing the data memory and said processor summarizing the plurality of document clusters and generating summary data for said document clusters; and
a user interface for displaying the summary data.
Dependent claims: 9
10. A document search and retrieval method, operating in a digital computer, for searching a corpus of unclustered documents, comprising the steps of:
identifying, in response to at least one user specified search term, a sub-corpus of unclustered documents containing the at least one user specified search term;
preparing an initial structuring of the sub-corpus of unclustered documents into a plurality of primary overlapping clusters, wherein at least two of the plurality of primary overlapping clusters contain a document in common; and
determining a summary of the plurality of primary overlapping clusters prepared by said initial structuring of the sub-corpus.
Dependent claims: 11, 12, 13, 14, 15
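The first step of claim 10, narrowing the corpus to a sub-corpus before any clustering takes place, might be sketched as follows. Matching on whole lowercase words and keeping a document when it contains any one of the terms are assumptions; the claim does not specify how a match is determined. `select_subcorpus` is a hypothetical name.

```python
def select_subcorpus(docs, terms):
    """Keep the documents that contain at least one of the search terms."""
    wanted = {t.lower() for t in terms}
    return {
        doc_id: text
        for doc_id, text in docs.items()
        if wanted & set(text.lower().split())
    }


docs = {
    "d1": "Overlapping clusters improve recall",
    "d2": "Disjoint partitions only",
    "d3": "Browsing document clusters",
}
subcorpus = select_subcorpus(docs, ["clusters"])
```

Only the resulting sub-corpus (`d1` and `d3` here) would then be structured into overlapping clusters and summarized.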
16. A method, operating in a digital computer, for searching a corpus of unclustered documents, comprising the steps of:
subdividing the unclustered corpus of documents into a hierarchical structure containing a plurality of levels of clusters, wherein at least two of the clusters on a particular level are overlapping clusters containing at least a document in common;
selecting, from the hierarchical structure, a plurality of clusters to form a subcorpus, wherein the subcorpus contains fewer documents than the corpus; and
identifying, in response to a search query, those documents in the subcorpus providing a positive response to the search query.
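The three steps of claim 16 can be walked through on a toy hierarchy. The `Cluster` class, the union-based way of forming the subcorpus, and the all-words-must-match query test are illustrative assumptions, not details taken from the claim.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """A node in a hypothetical cluster hierarchy."""
    name: str
    doc_ids: set
    children: list = field(default_factory=list)


def build_subcorpus(selected):
    """Union the documents of the selected clusters; shared documents appear once."""
    subcorpus = set()
    for cluster in selected:
        subcorpus |= cluster.doc_ids
    return subcorpus


def search(subcorpus, docs, query):
    """Return the subcorpus documents that contain every query word."""
    words = set(query.lower().split())
    return sorted(d for d in subcorpus if words <= set(docs[d].lower().split()))


docs = {
    "d1": "river bank erosion",
    "d2": "bank loan rates",
    "d3": "river bank loan",
    "d4": "mountain weather",
}
rivers = Cluster("rivers", {"d1", "d3"})
finance = Cluster("finance", {"d2", "d3"})  # overlaps with rivers on d3
weather = Cluster("weather", {"d4"})
root = Cluster("root", set(docs), [rivers, finance, weather])

subcorpus = build_subcorpus([rivers, finance])
hits = search(subcorpus, docs, "bank loan")
```

Selecting `rivers` and `finance` yields a three-document subcorpus out of a four-document corpus, and the query is evaluated only against those three documents.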
Specification