Scatter-gather: a cluster-based method and apparatus for browsing large document collections

US 5,442,778 A
Filed: 11/12/1991
Issued: 08/15/1995
Est. Priority Date: 11/12/1991
Status: Expired due to Term

First Claim

1. A document browsing method in a digital computer for a corpus of documents, comprising the steps of:

preparing an initial ordering of the corpus into a first plurality of clusters by using a first method that automatically performs the initial ordering without external inputs based on contents of the documents using the digital computer;

determining a summary for each cluster of the first plurality of clusters prepared by said initial ordering of the corpus;

selecting by a user at least one cluster of the first plurality of clusters based on the summary of each cluster; and

automatically providing a further ordering of the user selected at least one cluster into a second plurality of clusters by automatically analyzing contents of documents of the selected at least one cluster using a second method comprising the steps of:

grouping together all of the documents from the selected at least one cluster based on the content of each document, and then assigning each of the documents to one cluster of the second plurality of clusters.

5 Assignments

0 Petitions

Abstract

Scatter-Gather is a computer-based document browsing method that operates in time proportional to the number of documents in a target corpus. The Scatter-Gather method includes: preparing an initial ordering of the corpus using, for example, an off-line computational method; determining a summary of the initial ordering of the corpus for interactive utility; and providing a further ordering of the corpus using, for example, an on-line non-deterministic method. The off-line preparation of the initial ordering is not time-constrained, so an accurate initial ordering can be prepared. The summary is determined so that it can be presented to a user on a CRT without scrolling. The step of providing a further ordering includes truncated group average agglomerative clustering, merging disjoint document sets, center finding, assign-to-nearest, and other refinement methods.
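
The summary step in the abstract is characterized in claim 10 as a Cluster Digest: a constant-size summary listing a fixed number of topical words plus the titles of a few typical documents near the cluster centroid. The following is a minimal Python sketch of that idea, not the patented implementation. The function name `cluster_digest`, the parameters `num_words` and `num_titles`, the use of raw term counts as document profiles, cosine similarity as the closeness measure, and the alphabetical tie-break are all assumptions made for illustration.

```python
import math
from collections import Counter


def cluster_digest(cluster, num_words=3, num_titles=2):
    """Summarize a cluster given as a list of (title, tokens) pairs.

    Returns (topical_words, typical_titles). Both lists have a fixed
    maximum length, so the digest size does not grow with the cluster.
    """
    # Term-count profile per document (an assumed representation).
    profiles = [Counter(tokens) for _, tokens in cluster]

    # Summed profile; it points in the same direction as the mean centroid.
    centroid = Counter()
    for profile in profiles:
        centroid.update(profile)

    # Topical words: the words that occur most often in the cluster.
    # Ties are broken alphabetically (an assumption).
    topical = sorted(centroid, key=lambda w: (-centroid[w], w))[:num_words]

    def cosine(p, q):
        dot = sum(p[w] * q[w] for w in p)
        norm_p = math.sqrt(sum(v * v for v in p.values()))
        norm_q = math.sqrt(sum(v * v for v in q.values()))
        return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

    # Typical documents: those whose profile is closest to the centroid.
    ranked = sorted(range(len(cluster)),
                    key=lambda i: -cosine(profiles[i], centroid))
    titles = [cluster[i][0] for i in ranked[:num_titles]]
    return topical, titles
```

Because both lists are capped, every cluster gets a digest of the same size, which fits the abstract's goal of showing all summaries on one screen without scrolling.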

317 Citations

21 Claims

1. A document browsing method in a digital computer for a corpus of documents, comprising the steps of:
- preparing an initial ordering of the corpus into a first plurality of clusters by using a first method that automatically performs the initial ordering without external inputs based on contents of the documents using the digital computer;
  
  determining a summary for each cluster of the first plurality of clusters prepared by said initial ordering of the corpus;
  
  selecting by a user at least one cluster of the first plurality of clusters based on the summary of each cluster; and
  
  automatically providing a further ordering of the user selected at least one cluster into a second plurality of clusters by automatically analyzing contents of documents of the selected at least one cluster using a second method comprising the steps of:
  
  grouping together all of the documents from the selected at least one cluster based on the content of each document, and then assigning each of the documents to one cluster of the second plurality of clusters.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
- - 2. The method of claim 1, wherein the preparing step includes a Fractionation method for partitioning the corpus of documents, said Fractionation method comprising the steps of:
    - preparing an ordering of the corpus;
      
      determining a partitioning of a desired size from the ordering; and
      
      refining the partitioning.
  - 3. The method of claim 2, wherein the preparing an ordering step includes:
    - sorting words in order of frequency, most frequent word first, by entry into a corpus countfile;
      
      labeling each document by a number of an earliest word in a sorted corpus countfile;
      
      adjoining the number of an earliest word in a sorted corpus countfile to a number of a first text-ordered word in the document to form a compound label; and
      
      sorting documents by the compound label.
  - 4. The method of claim 3, wherein the sorting words step further comprises:
    - segmenting the sorted corpus countfile according to frequency into a number of segments;
      
      rearranging the sorted corpus countfile according to segments; and
      
      renumbering words to reflect the rearranging.
  - 5. The method of claim 2, wherein the determining a partition step comprises truncated group averaging agglomerative clustering which includes limiting a growth of an agglomeration by terminating a group averaging agglomerative clustering before a single over-arching agglomeration is formed.
  - 6. The method of claim 2, wherein the refining step includes refining with an assign-to-nearest method for assigning a document to a nearest bucket.
  - 7. The method of claim 2, wherein the refining step includes merging similar buckets.
  - 8. The method of claim 2, wherein the refining step includes splitting non-similar buckets.
  - 9. The method of claim 2, wherein the refining step includes detecting at least one of weak similarity and small buckets and incoherent buckets by applying size and average similarity thresholds.
  - 10. The document partitioning method of claim 1, wherein the determining a summary step includes determining a summary using a Cluster Digest method, said Cluster Digest method comprising the steps of:
    - providing a summary of constant size for each cluster; and
      
      listing a fixed number of topical words plus document titles of a few typical documents within each cluster, wherein the topical words are words that occur often in the cluster and typical documents are documents close to a cluster centroid.
  - 11. The document partitioning method of claim 10, wherein the providing a further ordering step includes providing a further ordering using a Buckshot method, said Buckshot method comprising the steps of:
    - constructing a random sample from the corpus of documents of size $\sqrt{kN}$, where k is an integer number of desired clusters and N is a number of documents in the corpus of documents;
      
      partitioning into a partition G a random sample into k groups using truncated group average agglomerative clustering;
      
      constructing a partition P of the corpus of documents by assigning each document to a k center in partition G and applying an assign-to-nearest procedure over the corpus and the k centers in partition G;
      
      replacing partition G with partition P and repeating the step of constructing a partition; and
      
      returning a new partition P.
  - 12. The document partitioning method of claim 1, wherein the providing a further ordering step includes providing a further ordering step using a Buckshot method, said Buckshot method comprising the steps of:
    - constructing a random sample from the corpus of documents of size $\sqrt{kN}$, where k is an integer number of desired clusters and N is a number of documents in the corpus of documents;
      
      partitioning into a partition G a random sample into k groups using truncated group average agglomerative clustering;
      
      constructing a partition P of the corpus of documents by assigning each document to a k center in partition G and applying assign-to-nearest over the corpus and the k centers in partition G;
      
      replacing partition G with partition P and repeating the step of constructing a partition; and
      
      returning a new partition P.
  - 13. The document browsing method of claim 1, wherein the first method for preparing an initial ordering of the corpus is the same as the second method for providing a further ordering of a portion of the corpus.
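
The Buckshot steps recited in claims 11 and 12 can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the patent's implementation: documents are assumed to be dense numeric vectors, similarity is assumed to be cosine, "group average" is simplified to the mean cross-pair similarity between two groups, and the names `buckshot`, `truncated_group_average`, `assign_to_nearest`, `iterations`, and `seed` are hypothetical.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def truncated_group_average(vectors, k):
    """Agglomerate the most similar pair of groups, stopping at k groups."""
    groups = [[v] for v in vectors]
    while len(groups) > k:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                total = sum(cosine(a, b) for a in groups[i] for b in groups[j])
                sim = total / (len(groups[i]) * len(groups[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = groups.pop(j)  # j > i, so index i stays valid
        groups[i].extend(merged)
    return groups


def assign_to_nearest(vectors, centers):
    """Return, for each vector, the index of its most similar center."""
    return [max(range(len(centers)), key=lambda c: cosine(v, centers[c]))
            for v in vectors]


def buckshot(vectors, k, iterations=2, seed=0):
    n = len(vectors)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of documents")
    rng = random.Random(seed)
    # Random sample of size about sqrt(k * N), never smaller than k.
    sample = rng.sample(vectors, min(n, max(k, math.isqrt(k * n))))
    # Partition G: cluster only the sample into k groups and take centers.
    centers = [centroid(g) for g in truncated_group_average(sample, k)]
    # Partition P: assign every document to its nearest center.
    labels = assign_to_nearest(vectors, centers)
    for _ in range(iterations - 1):
        # Replace G with P: recompute centers from P, then reassign.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centers[c] = centroid(members)
        labels = assign_to_nearest(vectors, centers)
    return labels
```

The quadratic-cost agglomeration runs only on the roughly $\sqrt{kN}$ sampled documents, so its cost is on the order of $kN$ pair comparisons, while each assign-to-nearest pass also costs on the order of $kN$ comparisons over the full corpus.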

14. A document browsing system for use with a corpus of documents in a digital computer, the document browsing system comprising:
- preparing means for preparing without external inputs an initial ordering of the corpus into a first plurality of document clusters using the digital computer;
  
  determining means for determining a summary for each cluster of the first plurality of document clusters prepared by said preparing means;
  
  selecting means for a user to select at least one of the first plurality of document clusters; and
  
  ordering means for automatically ordering the selected at least one of the first plurality of document clusters into a second plurality of clusters by analyzing contents of documents of the selected at least one of the first plurality of document clusters, grouping together all of the documents from the selected at least one of the first plurality of document clusters based on the contents of the documents of the selected at least one of the first plurality of document clusters, and then assigning each of the documents to one cluster of the second plurality of clusters.
- View Dependent Claims (16, 17, 18, 19, 20, 21)
- - 16. The Fractionation method of claim 14, wherein the sorting words step further comprises:
    - segmenting the sorted corpus countfile according to frequency into a number of segments;
      
      rearranging the sorted corpus countfile according to segments; and
      
      renumbering words to reflect the rearranging.
  - 17. The Fractionation method of claim 14, wherein the determining step comprises truncated group averaging agglomerative clustering which includes limiting a growth of an agglomeration by terminating a group averaging agglomerative clustering before a single over-arching agglomeration is formed.
  - 18. The Fractionation method of claim 14, wherein the refining step includes refining with an assign-to-nearest method for assigning a document to a nearest bucket.
  - 19. The Fractionation method of claim 14, wherein the refining step includes merging similar buckets.
  - 20. The Fractionation method of claim 14, wherein the refining step includes splitting non-similar buckets.
  - 21. The method of claim 14, wherein the refining step includes detecting at least one of weak similarity and small buckets and incoherent buckets by applying size and average similarity thresholds.
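
The threshold test in claim 21 (flagging small or incoherent buckets by size and average similarity) could look roughly like the Python sketch below. The vector representation, the cosine measure, the default threshold values, and the names `average_similarity` and `flag_buckets` are assumptions chosen for illustration; the record does not specify them.

```python
import math
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def average_similarity(bucket):
    """Mean pairwise cosine similarity of the documents in a bucket."""
    pairs = list(combinations(bucket, 2))
    if not pairs:
        # Zero or one document: treated as trivially coherent (assumption).
        return 1.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def flag_buckets(buckets, min_size=2, min_avg_similarity=0.5):
    """Return (bucket index, reasons) for buckets failing either threshold."""
    flagged = []
    for index, bucket in enumerate(buckets):
        reasons = []
        if len(bucket) < min_size:
            reasons.append("small")
        if average_similarity(bucket) < min_avg_similarity:
            reasons.append("incoherent")
        if reasons:
            flagged.append((index, reasons))
    return flagged
```

Flagged buckets would then be candidates for the merging and splitting refinements named in the neighboring claims.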

15. A document partitioning Fractionation method in a digital computer for non-hierarchical, linear-time partitioning of a corpus of documents, said Fractionation method comprising the steps of:
- preparing an ordering of the corpus by sorting words in order of frequency, most frequent word first, by entry into a corpus countfile, labeling each document by a number of an earliest word in a sorted corpus countfile, adjoining the number of an earliest word in a sorted corpus countfile to a number of a first text-ordered word in the document to form a compound label, and sorting documents by the compound label;
  
  determining a partitioning of a desired size from the ordering to form a set of buckets, each document of the corpus of documents assigned to only one bucket of the set of buckets; and
  
  refining the partitioning by a predetermined number of iterations of creating a set of modified buckets from the set of buckets based on contents and size of each bucket, and reassigning each document of the corpus of documents to the set of modified buckets.
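
The ordering and partitioning steps of claim 15 can be illustrated with a short Python sketch. It is one reading of the claim language under stated assumptions: documents are non-empty token lists, the "corpus countfile" is modeled as a word-frequency table, ties in frequency are broken alphabetically, and the names `fractionation_ordering` and `initial_buckets` are hypothetical. The refinement iterations are not shown.

```python
from collections import Counter


def fractionation_ordering(docs):
    """Return document indices sorted by the compound label.

    docs is a list of non-empty token lists.
    """
    # Corpus countfile: total occurrences of each word over all documents.
    counts = Counter(word for doc in docs for word in doc)
    # Sort words most frequent first; alphabetical tie-break (assumption).
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    number = {word: i for i, word in enumerate(ranked)}

    def compound_label(doc):
        earliest = min(number[w] for w in doc)  # earliest countfile word in doc
        first = number[doc[0]]                  # first word in text order
        return (earliest, first)

    # sorted() is stable, so documents with equal labels keep input order.
    return sorted(range(len(docs)), key=lambda i: compound_label(docs[i]))


def initial_buckets(order, bucket_size):
    """Cut the ordering into consecutive buckets of the desired size."""
    return [order[i:i + bucket_size] for i in range(0, len(order), bucket_size)]
```

For example, with `docs = [["cat", "sat", "mat"], ["dog", "ate", "cat"], ["sun", "rose", "east"], ["cat", "cat", "dog"]]`, the ordering is `[0, 3, 1, 2]` and `initial_buckets(order, 2)` gives `[[0, 3], [1, 2]]`. Each document lands in exactly one bucket, as the claim requires, and the work is dominated by sorting rather than by pairwise document comparisons.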

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Karger, David; Pedersen, Jan O.; Cutting, Douglass R.; Tukey, John W.
Primary Examiner(s): Black, Thomas G.
Assistant Examiner(s): ELLCESSOR, LARR
Application Number: US07/790,316
Time in Patent Office: 1,372 Days
Field of Search: 395/600, 395/144, 382/39, 340/146.2, 364/419.19, 364/419.13
US Class Current: 1/1
CPC Class Codes:
G06F 16/355   Class or cluster creation o...
G06F 16/93   Document management systems
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99937   Sorting
