Method for clustering closely resembling data objects

US 6,119,124 A
Filed: 03/26/1998
Issued: 09/12/2000
Est. Priority Date: 03/26/1998
Status: Expired due to Term

Abstract

A computer-implemented method determines the resemblance of data objects such as Web pages. Each data object is partitioned into a sequence of tokens. The tokens are grouped into overlapping sets to form shingles. Each shingle is represented by a unique identification element encoded as a fingerprint. To generate a sketch of each data object, a minimum element is selected from each of the images of the set of fingerprints associated with the document under each of a plurality of pseudo-random permutations of the set of all fingerprints. The sketches characterize the resemblance of the data objects. The sketches can be further partitioned into a plurality of groups. Each group is fingerprinted to form a feature. Data objects that share more than a certain number of features are estimated to be nearly identical.
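The pipeline described in the abstract (tokens, shingles, fingerprints, minima under several pseudo-random permutations) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the patent: whitespace tokenization, a shingle width of 4, BLAKE2b as the fingerprint function, and linear maps modulo a prime standing in for pseudo-random permutations. The helper names (`shingles`, `fingerprint`, `make_permutations`, `sketch`, `estimated_resemblance`) are hypothetical.

```python
import hashlib
import random

# Assumption: a Mersenne prime modulus for the illustrative linear "permutations".
MERSENNE_PRIME = (1 << 61) - 1


def shingles(text, w=4):
    """Return the set of contiguous w-token shingles of a whitespace-tokenized text."""
    tokens = text.lower().split()
    if len(tokens) < w:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def fingerprint(shingle):
    """Map a shingle to a 64-bit integer (a stand-in for the patent's fingerprint)."""
    data = " ".join(shingle).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def make_permutations(k, seed=0):
    """Draw k (a, b) pairs; x -> (a*x + b) mod p is a bijection on [0, p)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MERSENNE_PRIME), rng.randrange(MERSENNE_PRIME))
            for _ in range(k)]


def sketch(text, perms, w=4):
    """Keep the minimum permuted fingerprint under each permutation."""
    fps = [fingerprint(s) % MERSENNE_PRIME for s in shingles(text, w)]
    if not fps:
        raise ValueError("text has no tokens, so no sketch can be formed")
    return [min((a * x + b) % MERSENNE_PRIME for x in fps) for a, b in perms]


def estimated_resemblance(sketch_a, sketch_b):
    """Fraction of permutations whose minima agree between the two sketches."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)
```

Two documents are compared by building both sketches with the same list of permutations and calling `estimated_resemblance` on them. The linear maps are only an approximation of truly random permutations, so the estimate should be read as a rough indicator rather than an exact resemblance value.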

24 Claims

1. A computer-implemented method of determining the resemblance of a plurality of data objects, comprising the steps of:
- parsing each data object into a canonical sequence of tokens;
  
  grouping overlapping sequences of the tokens of each data object into shingles;
  
  assigning a unique identification element to each shingle;
  
  permuting the elements of the data objects to form image sets;
  
  selecting a predetermined number of minimum elements from each image to form a sketch;
  
  partitioning the selected elements of each sketch into a plurality of groups; and
  
  assigning another unique identification to each group to generate the features of each data object to determine a level of resemblance of the plurality of data objects.
- Dependent claims (2-21):
  - 2. The method of claim 1 wherein the sketches are a fixed size and the permuting is a single random permutation of the universe of all of the elements.
  - 3. The method of claim 1 wherein the size of a particular sketch is proportional to the size of the image set and the permutation is a single random permutation.
  - 4. The method of claim 3 further including the step of determining a containment of the data objects.
  - 5. The method of claim 1 wherein the permuting is a predetermined number of independent random permutations of the universe of all of the elements and selecting the minimum element of each image set under each of the predetermined number of independent random permutations.
  - 6. The method of claim 1 further including the step of tracking a level of resemblance of each data object, and clustering data objects having a predetermined level of resemblance into clusters.
  - 7. The method of claim 6 wherein the data objects are Web pages indexed by a search engine and in response to a search query returning all resembling Web pages in a particular cluster if one Web page of the particular cluster satisfies the query.
  - 8. The method of claim 1 further including the step of representing each data object by a super-shingle, and determining levels of resemblance based on the super-shingles.
  - 9. The method of claim 1 further including the step of representing each data object by a plurality of features, and clustering data objects sharing predetermined number of features.
  - 10. The method of claim 1 wherein frequently occurring shingles are eliminated.
  - 11. The method of claim 1 wherein the parsing, permuting, and selecting are performed with first parameters.
  - 12. The method of claim 11 wherein the parsing, grouping, and selecting are repeated with second parameters to perform variable threshold filtering of the data objects.
  - 13. The method of claim 1 wherein the data objects are Web pages to be indexed by a search engine, and wherein frequently occurring shingles are eliminated before assigning the unique identifications.
  - 14. The method of claim 13 wherein the frequently occurring shingles include HTML comment tags that identify a program that generated the HTML comment tags of the Web pages, shared headers or footers, and text sequences including a sequence of numbers.
  - 15. The method of claim 13 and wherein only different Web pages are indexed.
  - 16. The method of claim 15 wherein the different Web pages include lexically different Web pages.
  - 17. The method of claim 1 further including the step of determining the resemblance of the data objects in real time.
  - 18. The method of claim 1 further including the step of tracking multiple versions of the data objects.
  - 19. The method of claim 1 further comprising the step of identifying illegal copies of a particular data object.
  - 20. The method of claim 1 wherein the data objects encode audio and video signals.
  - 21. The method of claim 1 wherein the unique identifications are fingerprints.
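Claims 2 through 4 mention two sketch variants built from a single permutation: a fixed-size sketch, and a sketch whose size is proportional to the size of the image set, the latter being used to determine containment. The following is a minimal sketch of how such variants might look, assuming the inputs are sets of integer fingerprints smaller than the modulus. The constants `A`, `B`, and `P`, the choice of "elements divisible by m" as the proportional-size rule, and all function names are illustrative assumptions, not details stated in the claims.

```python
# Assumption: one fixed linear map modulo a prime stands in for the single
# random permutation of the universe of fingerprints.
P = (1 << 61) - 1
A = 1_234_567_890_123_456_789  # arbitrary nonzero constant below P
B = 987_654_321_987_654_321    # arbitrary constant below P


def permute(x):
    """Apply the illustrative permutation to one fingerprint in [0, P)."""
    return (A * x + B) % P


def fixed_size_sketch(fps, s):
    """Fixed-size variant: the s smallest elements of the permuted set."""
    return sorted({permute(x) for x in fps})[:s]


def resemblance_from_fixed(sketch_a, sketch_b, s):
    """Estimate resemblance from two fixed-size sketches.

    Takes the s smallest elements of the combined sketches and counts how
    many of them appear in both sketches.
    """
    smallest = sorted(set(sketch_a) | set(sketch_b))[:s]
    common = set(sketch_a) & set(sketch_b)
    return sum(1 for x in smallest if x in common) / len(smallest)


def modular_sketch(fps, m):
    """Proportional-size variant: keep permuted elements divisible by m."""
    return {y for y in (permute(x) for x in fps) if y % m == 0}


def containment_from_modular(sketch_a, sketch_b):
    """Estimate the fraction of A's elements that also occur in B."""
    if not sketch_a:
        return 0.0
    return len(sketch_a & sketch_b) / len(sketch_a)
```

The fixed-size variant keeps memory per document constant, while the modular variant grows with the document and is the one that supports a containment estimate in this illustration, since every sampled element of a subset is also sampled from its superset.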

22. A computer-implemented method of determining the resemblance of a plurality of data objects, comprising the steps of:
- parsing each data object into a canonical sequence of tokens;
  
  grouping overlapping sequences of the tokens of each data object into shingles;
  
  assigning a unique identification element to each shingle;
  
  permuting the elements of the data objects to form image sets;
  
  selecting a predetermined number of minimum elements from each image to form a sketch;
  
  wherein a first and a second data object are designated as fungible when the first and the second data objects share at least one common feature, and collecting fungible data objects into clusters of closely resembling data objects.
- Dependent claims (23):
  - 23. The method of claim 22 further comprising the steps of storing a first list of pairs in a memory, each pair including a data object identification and a particular feature of the identified data object, sorting the first list to produce a second list, each entry in the second list identifying a unique feature and all data objects that include the unique feature, and processing the second list to identify data objects sharing the at least one common feature.
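The list-based procedure of claims 22 and 23 (store pairs of data object identification and feature, sort them, group by feature, then collect objects that share a feature) can be sketched roughly as follows. The use of a union-find structure to merge objects into clusters is an assumption chosen for this illustration; the claims only say that fungible objects are collected into clusters. The function name `cluster_by_shared_feature` and the input layout are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def cluster_by_shared_feature(doc_features):
    """Cluster documents that share at least one feature.

    doc_features maps a document id to an iterable of its features.
    Features must be mutually comparable, as must document ids.
    """
    # First list: one (feature, doc_id) pair per feature of each document.
    pairs = [(f, d) for d, feats in doc_features.items() for f in set(feats)]
    # Sorting brings all documents with the same feature next to each other.
    pairs.sort()

    # Assumption: union-find merges documents transitively into clusters.
    parent = {d: d for d in doc_features}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    # Second list: each group is one unique feature and all documents having it.
    for _, group in groupby(pairs, key=itemgetter(0)):
        docs = [d for _, d in group]
        for other in docs[1:]:
            union(docs[0], other)

    clusters = {}
    for d in doc_features:
        clusters.setdefault(find(d), set()).add(d)
    return sorted(sorted(c) for c in clusters.values())
```

For example, with `{"a": [1, 2], "b": [2, 3], "c": [3], "d": [9]}` the documents `a`, `b`, and `c` end up in one cluster because `a` and `b` share feature 2 and `b` and `c` share feature 3, while `d` stays alone. Sorting the pair list avoids comparing every pair of documents directly.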

24. An information processing system for determining the resemblance of a plurality of data objects, comprising program instructions for:
- parsing each data object into a canonical sequence of tokens;
  
  grouping overlapping sequences of the tokens of each data object into shingles;
  
  assigning a unique identification element to each shingle;
  
  permuting the elements of the data objects to form image sets;
  
  selecting a predetermined number of minimum elements from each image to form a sketch;
  
  partitioning the selected elements of each sketch into a plurality of groups; and
  
  assigning another unique identification to each group to generate the features of each data object.
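The last two steps recited above (partitioning the selected elements of a sketch into groups and assigning another unique identification to each group) might be sketched as below. This is a sketch under stated assumptions: the sketch is a list of non-negative integers below 2^64, groups are contiguous and equally sized, BLAKE2b serves as the second fingerprint, and the group index is mixed into the hash so that only features at the same position can match. The names `features_from_sketch` and `shared_feature_count` are hypothetical.

```python
import hashlib


def features_from_sketch(sketch, num_groups):
    """Split a sketch into contiguous groups and fingerprint each group."""
    if num_groups <= 0 or len(sketch) % num_groups != 0:
        raise ValueError("sketch length must be a positive multiple of num_groups")
    size = len(sketch) // num_groups
    features = []
    for g in range(num_groups):
        group = sketch[g * size:(g + 1) * size]
        # Assumption: prefix the group index, then append each element as 8 bytes.
        payload = g.to_bytes(2, "big") + b"".join(x.to_bytes(8, "big") for x in group)
        digest = hashlib.blake2b(payload, digest_size=8).digest()
        features.append(int.from_bytes(digest, "big"))
    return features


def shared_feature_count(features_a, features_b):
    """Count positions at which two feature lists agree."""
    return sum(x == y for x, y in zip(features_a, features_b))
```

A feature matches only when every sketch element in its group matches, so two objects that share several features are very likely to have closely matching sketches. The threshold on the shared-feature count is a tuning choice left open here.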

Current Assignee: R2 Solutions LLC (Acacia Research Corporation)
Original Assignee: Digital Equipment Corporation (HP Inc.)
Inventors: Zweig, Geoffrey G.; Broder, Andrei Z.; Manasse, Mark S.; Glassman, Steven C.; Nelson, Charles G.
Primary Examiner(s): Amsbury, Wayne
Assistant Examiner(s): Terry, Mark
Application Number: US09/048,653
Time in Patent Office: 901 Days
Field of Search: 707/3, 707/2, 707/5, 707/103
US Class Current: 1/1
CPC Class Codes:
G06F 16/355   Class or cluster creation o...
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99944   Object-oriented database st...
