Scatter-gather: a cluster-based method and apparatus for browsing large document collections
First Claim
1. A document browsing method in a digital computer for a corpus of documents, comprising the steps of:
- preparing an initial ordering of the corpus into a first plurality of clusters by using a first method that automatically performs the initial ordering without external inputs based on contents of the documents using the digital computer;
determining a summary for each cluster of the first plurality of clusters prepared by said initial ordering of the corpus;
selecting by a user at least one cluster of the first plurality of clusters based on the summary of each cluster; and
automatically providing a further ordering of the user selected at least one cluster into a second plurality of clusters by automatically analyzing contents of documents of the selected at least one cluster using a second method comprising the steps of;
grouping together all of the documents from the selected at least one cluster based on the content of each document, and thenassigning each of the documents to one cluster of the second plurality of clusters.
5 Assignments
0 Petitions
Accused Products
Abstract
Scatter-Gather is a computer based document browsing method which operates in time proportional to a number of documents in a target corpus. The Scatter-Gather method includes: preparing an initial ordering of the corpus using, for example, an off-line computational method; determining a summary of the initial ordering of the corpus for interactive utility; and providing a further ordering of the corpus using, for example, an on-line non-deterministic method. The step of an off-line preparation of an initial ordering of a corpus is non-time-dependent, thus an accurate initial ordering is prepared. The step of determining a summary includes determining a summary for presentation to a user without scrolling on a CRT. The step of providing a further ordering includes truncated group average agglomerate clustering, merging disjointed document sets, center finding, assign-to-nearest and other refinement methods.
317 Citations
21 Claims
-
1. A document browsing method in a digital computer for a corpus of documents, comprising the steps of:
-
preparing an initial ordering of the corpus into a first plurality of clusters by using a first method that automatically performs the initial ordering without external inputs based on contents of the documents using the digital computer; determining a summary for each cluster of the first plurality of clusters prepared by said initial ordering of the corpus; selecting by a user at least one cluster of the first plurality of clusters based on the summary of each cluster; and automatically providing a further ordering of the user selected at least one cluster into a second plurality of clusters by automatically analyzing contents of documents of the selected at least one cluster using a second method comprising the steps of; grouping together all of the documents from the selected at least one cluster based on the content of each document, and then assigning each of the documents to one cluster of the second plurality of clusters. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
-
-
14. A document browsing system for use with a corpus of documents in a digital computer, the document browsing system comprising:
-
preparing means for preparing without external inputs an initial ordering of the corpus into a first plurality of document clusters using the digital computer; determining means for determining a summary for each cluster of the first plurality of document clusters prepared by said preparing means; selecting means for a user to select at least one of the first plurality of document clusters; and ordering means for automatically ordering the selected at least one of the first plurality of document clusters into a second plurality of clusters by analyzing contents of documents of the selected at least one of the first plurality of document clusters, grouping together all of the documents from the selected at least one of the first plurality of document clusters based on the contents of the documents of the selected at least one of the first plurality of document clusters, and then assigning each of the documents to one cluster of the second plurality of clusters. - View Dependent Claims (16, 17, 18, 19, 20, 21)
-
-
15. A document partitioning Fractionation method in a digital computer for non-hierarchical, linear-time partitioning of a corpus of documents, said Fractionation method comprising the steps of:
-
preparing an ordering of the corpus by sorting words in order of frequency, most frequent word first, by entry into a corpus countfile, labeling each document by a number of an earliest word in a sorted corpus countfile, adjoining the number of an earliest word in a sorted corpus countfile to a number of a first text-ordered word in the document to form a compound label, and sorting documents by the compound label; determining a partitioning of a desired size from the ordering to form a set of buckets, each document of the corpus of documents assigned to only one bucket of the set of buckets; and refining the partitioning by a predetermined number of iterations of creating a the set of modified buckets from the set of buckets based on contents and size of each bucket reassigning each document of the corpus of documents to the set of modified buckets.
-
Specification