Method for clustering closely resembling data objects
First Claim
1. A method for determining the resemblance of a plurality of data objects, the method comprising the steps of:
- parsing each data object into a canonical sequence of tokens;
grouping contiguous sequences in the canonical sequence of tokens of each data object into shingles;
assigning a unique identification element to each shingle;
subjecting the unique identification elements to a plurality of permutations to provide a corresponding plurality of images;
selecting a smallest unique identification element from each of the plurality of images to provide a sketch of each corresponding data object; and
comparing data object sketches so as to determine common smallest unique identification elements between data object sketches and thereby determine data object resemblance.
7 Assignments
0 Petitions
Accused Products
Abstract
A computer-implemented method determines the resemblance of data objects such as Web pages. Each data object is partitioned into a sequence of tokens. The tokens are grouped into overlapping sets of the tokens to form shingles. Each shingle is represented by a unique identification element encoded as a fingerprint. A minimum element from each of the images of the set of fingerprints associated with a document under each of a plurality of pseudo random permutations of the set of all fingerprints are selected to generate a sketch of each data object. The sketches characterize the resemblance of the data objects. The sketches can be further partitioned into a plurality of groups. Each group is fingerprinted to form a feature. Data objects that share more than a certain numbers of features are estimated to be nearly identical.
-
Citations
20 Claims
-
1. A method for determining the resemblance of a plurality of data objects, the method comprising the steps of:
-
parsing each data object into a canonical sequence of tokens;
grouping contiguous sequences in the canonical sequence of tokens of each data object into shingles;
assigning a unique identification element to each shingle;
subjecting the unique identification elements to a plurality of permutations to provide a corresponding plurality of images;
selecting a smallest unique identification element from each of the plurality of images to provide a sketch of each corresponding data object; and
comparing data object sketches so as to determine common smallest unique identification elements between data object sketches and thereby determine data object resemblance. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
tracking a level of resemblance of each data object; and
clustering data objects having a predetermined level of resemblance into clusters.
-
-
3. The method of claim 2, wherein the data objects are Web pages indexed by a search engine, further comprising the step of:
in response to a search query, returning all resembling Web pages in a particular cluster if one Web page of the particular cluster satisfies the query.
-
4. The method of claim 1, further comprising the steps of:
-
representing each data object by a super-shingle; and
determining levels of resemblance based on the super-shingles.
-
-
5. The method of claim 1, further comprising the steps of:
-
representing each data object by a plurality of features; and
clustering data objects sharing predetermined number of features.
-
-
6. The method of claim 1, wherein frequently occurring shingles are eliminated.
-
7. The method of claim 1, wherein the parsing, permuting, and selecting steps are performed with first parameters.
-
8. The method of claim 7, wherein the parsing, grouping, and selecting steps are repeated with second parameters to perform variable threshold filtering of the data objects.
-
9. The method of claim 1, wherein the data objects are Web pages to be indexed by a search engine, and wherein frequently occurring shingles are eliminated before assigning the unique identifications.
-
10. The method of claim 9, wherein the frequently occurring shingles include HTML comment tags that identify a program that generated the HTML comment tags of the Web pages, shared headers or footers, and text sequences including a sequence of numbers.
-
11. The method of claim 9, wherein only different Web pages are indexed.
-
12. The method of claim 11, wherein the different Web pages include lexically different Web pages.
-
13. The method of claim 1, further comprising the step of:
determining the resemblance of the data objects in real time.
-
14. The method of claim 1, further comprising the step of:
tracking multiple versions of the data objects.
-
15. The method of claim 1, further comprising the step of:
identifying illegal copies of a particular data object.
-
16. The method of claim 1, wherein the data objects encode audio and video signals.
-
17. The method of claim 1, wherein the unique identifications are fingerprints.
-
18. A computer data signal embodied in a carrier wave readable by a computing system and encoding a computer program of instructions for executing a computer process performing the method recited in claim 1.
-
19. An apparatus for determining the resemblance of a plurality of data objects, the apparatus comprising:
-
means for parsing each data object into a canonical sequence of tokens;
means for grouping contiguous sequences in the canonical sequence of tokens of each data object into shingles;
means for assigning a unique identification element to each shingle;
means for subjecting the unique identification elements to a plurality of permutations to provide a corresponding plurality of images;
means for selecting a smallest unique identification element from each of the plurality of images to provide a sketch of each corresponding data object; and
means for comparing data object sketches so as to determine common smallest unique identification elements between data object sketches and thereby determine data object resemblance.
-
-
20. An article of manufacture for determining the resemblance of a plurality of data objects, the article of manufacture comprising:
-
at least one processor readable carrier; and
instructions carried on the at least one carrier;
wherein the instructions are configured to be readable from the at least one carrier by at least one processor and thereby cause the at least one processor to operate so as to;
parse each data object into a canonical sequence of tokens;
group contiguous sequences in the canonical sequence of tokens of each data object into shingles;
assign a unique identification element to each shingle;
subject the unique identification elements to a plurality of permutations to provide a corresponding plurality of images of the unique identification elements;
select a smallest unique identification element from each of the plurality of images to provide a sketch of each corresponding data object; and
compare data object sketches so as to determine common smallest unique identification elements between data object sketches and thereby determine data object resemblance.
-
Specification