Method for clustering closely resembling data objects
First Claim
1. A computer-implemented method of determining the resemblance of a plurality of data objects, comprising the steps of:
- parsing each data object into a canonical sequence of tokens;
grouping overlapping sequences of the tokens of each data object into shingles;
assigning a unique identification element to each shingle;
permuting the elements of the data objects to form image sets;
selecting a predetermined number of minimum elements from each image to form a sketch;
partitioning the selected elements of each sketch into a plurality of groups; and
assigning another unique identification to each group to generate the features of each data object to determine a level of resemblance of the plurality of data objects.
10 Assignments
0 Petitions
Accused Products
Abstract
A computer-implemented method determines the resemblance of data objects such as Web pages. Each data object is partitioned into a sequence of tokens. The tokens are grouped into overlapping sets of the tokens to form shingles. Each shingle is represented by a unique identification element encoded as a fingerprint. A minimum element from each of the images of the set of fingerprints associated with a document under each of a plurality of pseudo random permutations of the set of all fingerprints are selected to generate a sketch of each data object. The sketches characterize the resemblance of the data objects. The sketches can be further partitioned into a plurality of groups. Each group is fingerprinted to form a feature. Data objects that share more than a certain numbers of features are estimated to be nearly identical.
-
Citations
24 Claims
-
1. A computer-implemented method of determining the resemblance of a plurality of data objects, comprising the steps of:
-
parsing each data object into a canonical sequence of tokens; grouping overlapping sequences of the tokens of each data object into shingles; assigning a unique identification element to each shingle; permuting the elements of the data objects to form image sets; selecting a predetermined number of minimum elements from each image to form a sketch; partitioning the selected elements of each sketch into a plurality of groups; and assigning another unique identification to each group to generate the features of each data object to determine a level of resemblance of the plurality of data objects. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21)
-
-
22. A computer-implemented method of determining the resemblance of a plurality of data objects, comprising the steps of:
-
parsing each data object into a canonical sequence of tokens; grouping overlapping sequences of the tokens of each data object into shingles; assigning a unique identification element to each shingle; permuting the elements of the data objects to form image sets; selecting a predetermined number of minimum elements from each image to form a sketch; wherein a first and a second data object are designated as fungible when the first and the second data objects share at least one common feature, and collecting fungible data objects into clusters of closely resembling data objects. - View Dependent Claims (23)
-
-
24. An information processing system for determining the resemblance of a plurality of data objects, comprising program instructions for:
-
parsing each data object into a canonical sequence of tokens; grouping overlapping sequences of the tokens of each data object into shingles; assigning a unique identification element to each shingle; permuting the elements of the data objects to form image sets; selecting a predetermined number of minimum elements from each image to form a sketch; partitioning the selected elements of each sketch into a plurality of groups; and assigning another unique identification to each group to generate the features of each data object.
-
Specification