Method for clustering closely resembling data objects

US 6,349,296 B1
Filed: 08/21/2000
Issued: 02/19/2002
Est. Priority Date: 03/26/1998
Status: Expired due to Term

First Claim

Patent Images

1. A method for determining the resemblance of a plurality of data objects, the method comprising the steps of:

parsing each data object into a canonical sequence of tokens;

grouping contiguous sequences in the canonical sequence of tokens of each data object into shingles;

assigning a unique identification element to each shingle;

subjecting the unique identification elements to a plurality of permutations to provide a corresponding plurality of images;

selecting a smallest unique identification element from each of the plurality of images to provide a sketch of each corresponding data object; and

comparing data object sketches so as to determine common smallest unique identification elements between data object sketches and thereby determine data object resemblance.

View all claims

7 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A computer-implemented method determines the resemblance of data objects such as Web pages. Each data object is partitioned into a sequence of tokens. The tokens are grouped into overlapping sets of the tokens to form shingles. Each shingle is represented by a unique identification element encoded as a fingerprint. A minimum element from each of the images of the set of fingerprints associated with a document under each of a plurality of pseudo random permutations of the set of all fingerprints are selected to generate a sketch of each data object. The sketches characterize the resemblance of the data objects. The sketches can be further partitioned into a plurality of groups. Each group is fingerprinted to form a feature. Data objects that share more than a certain numbers of features are estimated to be nearly identical.

Citations

20 Claims

1. A method for determining the resemblance of a plurality of data objects, the method comprising the steps of:
- parsing each data object into a canonical sequence of tokens;
  
  grouping contiguous sequences in the canonical sequence of tokens of each data object into shingles;
  
  assigning a unique identification element to each shingle;
  
  subjecting the unique identification elements to a plurality of permutations to provide a corresponding plurality of images;
  
  selecting a smallest unique identification element from each of the plurality of images to provide a sketch of each corresponding data object; and
  
  comparing data object sketches so as to determine common smallest unique identification elements between data object sketches and thereby determine data object resemblance.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
- - 2. The method of claim 1, further comprising the steps of:
3. The method of claim 2, wherein the data objects are Web pages indexed by a search engine, further comprising the step of:
- in response to a search query, returning all resembling Web pages in a particular cluster if one Web page of the particular cluster satisfies the query.
4. The method of claim 1, further comprising the steps of:
- representing each data object by a super-shingle; and
  
  determining levels of resemblance based on the super-shingles.
5. The method of claim 1, further comprising the steps of:
- representing each data object by a plurality of features; and
  
  clustering data objects sharing predetermined number of features.
6. The method of claim 1, wherein frequently occurring shingles are eliminated.
7. The method of claim 1, wherein the parsing, permuting, and selecting steps are performed with first parameters.
8. The method of claim 7, wherein the parsing, grouping, and selecting steps are repeated with second parameters to perform variable threshold filtering of the data objects.
9. The method of claim 1, wherein the data objects are Web pages to be indexed by a search engine, and wherein frequently occurring shingles are eliminated before assigning the unique identifications.
10. The method of claim 9, wherein the frequently occurring shingles include HTML comment tags that identify a program that generated the HTML comment tags of the Web pages, shared headers or footers, and text sequences including a sequence of numbers.
11. The method of claim 9, wherein only different Web pages are indexed.
12. The method of claim 11, wherein the different Web pages include lexically different Web pages.
13. The method of claim 1, further comprising the step of:
- determining the resemblance of the data objects in real time.
14. The method of claim 1, further comprising the step of:
- tracking multiple versions of the data objects.
15. The method of claim 1, further comprising the step of:
- identifying illegal copies of a particular data object.
16. The method of claim 1, wherein the data objects encode audio and video signals.
17. The method of claim 1, wherein the unique identifications are fingerprints.
18. A computer data signal embodied in a carrier wave readable by a computing system and encoding a computer program of instructions for executing a computer process performing the method recited in claim 1.

19. An apparatus for determining the resemblance of a plurality of data objects, the apparatus comprising:
- means for parsing each data object into a canonical sequence of tokens;
  
  means for grouping contiguous sequences in the canonical sequence of tokens of each data object into shingles;
  
  means for assigning a unique identification element to each shingle;
  
  means for subjecting the unique identification elements to a plurality of permutations to provide a corresponding plurality of images;
  
  means for selecting a smallest unique identification element from each of the plurality of images to provide a sketch of each corresponding data object; and
  
  means for comparing data object sketches so as to determine common smallest unique identification elements between data object sketches and thereby determine data object resemblance.

20. An article of manufacture for determining the resemblance of a plurality of data objects, the article of manufacture comprising:
- at least one processor readable carrier; and
  
  instructions carried on the at least one carrier;
  
  wherein the instructions are configured to be readable from the at least one carrier by at least one processor and thereby cause the at least one processor to operate so as to;
  
  parse each data object into a canonical sequence of tokens;
  
  group contiguous sequences in the canonical sequence of tokens of each data object into shingles;
  
  assign a unique identification element to each shingle;
  
  subject the unique identification elements to a plurality of permutations to provide a corresponding plurality of images of the unique identification elements;
  
  select a smallest unique identification element from each of the plurality of images to provide a sketch of each corresponding data object; and
  
  compare data object sketches so as to determine common smallest unique identification elements between data object sketches and thereby determine data object resemblance.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
R2 Solutions LLC (Acacia Research Corporation)
Original Assignee
AltaVista Company (Apollo Global Management, Inc.)
Inventors
Zweig, Geoffrey G., Nelson, Charles G., Broder, Andrei Z., Manasse, Mark S., Glassman, Steven C.
Primary Examiner(s)
Breene, John
Assistant Examiner(s)
Ali, Mohammad

Application Number

US09/642,017
Time in Patent Office

547 Days
Field of Search

707/1-3, 707/6, 707/5, 707/10, 707/103-104, 707/102, 707/500.1, 707/541, 382/137
US Class Current

1/1
CPC Class Codes

G06F 16/355   Class or cluster creation o...

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99935   Query augmenting and refini...

Y10S 707/99944   Object-oriented database st...

Method for clustering closely resembling data objects

First Claim

7 Assignments

0 Petitions

Accused Products

Abstract

Citations

20 Claims

Specification

Solutions

Use Cases

Quick Links

Method for clustering closely resembling data objects

First Claim

7 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

20 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links