Control of document similarity determinations by respective nodes of a plurality of computing devices
Abstract
Techniques and systems are described to control a determination of document similarity. In one example, dimensionality of the documents is reduced through computation of a signature, e.g., via a hashing technique such as “minhashing” which is also known as min-wise independent permutations locality sensitive hashing. From these signatures, another hashing technique (e.g., locality sensitive hashing) is used to determine similarity of the signatures to each other. Identification of disjoint sets is then used as a basis to partition the documents for determination of document similarity by respective nodes of a plurality of computing devices. In this way, an amount of data shuffling between the nodes as part of the determination of document similarity may be reduced. In another example, a weighting is applied to attributes of documents as part of the determination of document similarity.
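As a minimal sketch of the first two stages described in the abstract, the following Python example computes a minhash signature for each document and then groups signature bands into buckets in the manner of locality sensitive hashing. The function names, the seeded `blake2b` hash, the 24-value signature length, and the 8-band split are illustrative assumptions and are not taken from the specification.

```python
import hashlib


def seeded_hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token under a given seed (assumed hash family)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=24):
    """Reduce a document to num_hashes integers: the minimum hash of its token set per seed."""
    token_set = set(tokens)
    if not token_set:
        raise ValueError("document has no tokens")
    return [min(seeded_hash(t, seed) for t in token_set) for seed in range(num_hashes)]


def lsh_buckets(signature, bands=8):
    """Split the signature into bands; each band's values form one bucket key."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


# Hypothetical toy documents, already tokenized into words.
docs = {
    "a": "the quick brown fox jumps over the lazy dog".split(),
    "b": "the quick brown fox jumps over the lazy dog".split(),
    "c": "completely unrelated words appear in this text".split(),
}
buckets = {name: lsh_buckets(minhash_signature(toks)) for name, toks in docs.items()}

# Number of buckets each pair of documents has in common.
shared_ab = len(buckets["a"] & buckets["b"])
shared_ac = len(buckets["a"] & buckets["c"])
```

Documents with identical token sets produce identical signatures and therefore land in every bucket together, whereas documents with no tokens in common are not expected to share any bucket. Each signature has a fixed length regardless of document size, which is the reduction of dimensionality referred to above.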
20 Claims
1. In a digital medium environment to determine document similarity of a plurality of documents, a method implemented by at least one computing device, the method comprising:
receiving, by the at least one computing device, an input specifying a similarity threshold, the similarity threshold defining a minimum number of a plurality of buckets that two said documents are both to be included in to be considered similar;
generating, by the at least one computing device, a plurality of signature data from the plurality of documents using hashing, the plurality of signature data resulting in a reduction of dimensionality of the plurality of documents;
hashing, by the at least one computing device, the plurality of documents into respective ones of the plurality of buckets based on the plurality of signature data;
generating, by the at least one computing device, a filtered set of documents by removing first and second said documents from the plurality of documents that are not considered similar based on the similarity threshold;
generating, by the at least one computing device, a plurality of partitions from the filtered set of documents based on inclusion in a respective disjoint set of data; and
assigning, by the at least one computing device, the plurality of partitions to respective nodes of a plurality of computing devices to determine document similarity of respective ones of the filtered set of documents, to each other, within respective said partitions.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
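The filtering, partitioning, and assigning steps of claim 1 can be sketched as follows. This is one possible reading, offered under stated assumptions: a pair of documents is kept when it shares at least `threshold` buckets, documents with no qualifying partner are dropped, disjoint sets are found with a union-find structure, and whole partitions are assigned to nodes round-robin. The function names, the bucket labels, and the round-robin policy are hypothetical and are not drawn from the claims.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(doc_buckets, threshold):
    """Return the document pairs that share at least `threshold` buckets."""
    members = defaultdict(set)
    for doc, keys in doc_buckets.items():
        for key in keys:
            members[key].add(doc)
    shared = defaultdict(int)
    for docs_in_bucket in members.values():
        for pair in combinations(sorted(docs_in_bucket), 2):
            shared[pair] += 1
    return {pair for pair, count in shared.items() if count >= threshold}


def disjoint_sets(pairs):
    """Union-find over the surviving pairs; each resulting group is one partition."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    groups = defaultdict(set)
    for doc in list(parent):
        groups[find(doc)].add(doc)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


def assign_partitions(partitions, num_nodes):
    """Assign whole partitions to nodes round-robin so no partition is split."""
    assignment = defaultdict(list)
    for i, part in enumerate(partitions):
        assignment[i % num_nodes].append(part)
    return dict(assignment)


# Hypothetical bucket membership produced by an earlier hashing step.
doc_buckets = {
    "d1": {"b1", "b2", "b3"},
    "d2": {"b1", "b2", "b4"},
    "d3": {"b4", "b5"},
    "d4": {"b6", "b7"},
    "d5": {"b6", "b7"},
    "d6": {"b8"},
}
pairs = candidate_pairs(doc_buckets, threshold=2)
partitions = disjoint_sets(pairs)
assignment = assign_partitions(partitions, num_nodes=2)
```

In this toy input, `d3` shares only one bucket with `d2` and `d6` shares none with any document, so both fall out of the filtered set. Because each partition is a disjoint set, every pairwise comparison can be completed on the node that holds the partition without moving documents between nodes.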
11. In a digital medium environment to determine document similarity, a method implemented by at least one computing device, the method comprising:
receiving, by the at least one computing device, an input specifying a similarity threshold and a weight of an attribute, the similarity threshold defining a minimum number of a plurality of buckets that two documents of a plurality of documents are both to be included in to be considered similar;
applying, by the at least one computing device, the weight to a word that corresponds to the attribute, the applying including adding additional instances of the word to the respective ones of the plurality of documents based on the weight;
generating, by the at least one computing device using hashing, a plurality of signature data from the plurality of documents having the applied weighting;
hashing, by the at least one computing device, the plurality of documents into respective ones of the plurality of buckets based on the plurality of signature data;
generating, by the at least one computing device, a filtered set of documents by removing first and second said documents from the plurality of documents that are not considered similar based on the similarity threshold;
generating, by the at least one computing device, a plurality of partitions from the filtered set of documents based on inclusion in a respective disjoint set of data; and
assigning, by the at least one computing device, the plurality of partitions to respective nodes of a plurality of computing devices to determine document similarity of the filtered set of documents, to each other, within respective said partitions.
Dependent claims: 12, 19
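The weighting step of claim 11 can be illustrated with a short sketch. It assumes that a weight of `n` means the attribute word appears `n` times in place of each original occurrence; the claim itself only states that additional instances are added based on the weight. The occurrence-tagging helper is a further assumption, included because a set-based signature would otherwise collapse repeated words into one token. All names are hypothetical.

```python
from collections import Counter


def apply_weight(tokens, attribute_words, weight):
    """Return a new token list in which each attribute word is repeated `weight` times."""
    weighted = []
    for tok in tokens:
        copies = weight if tok in attribute_words else 1
        weighted.extend([tok] * copies)
    return weighted


def occurrence_tokens(tokens):
    """Tag each token with its occurrence count so repeats stay distinct when hashed as a set."""
    seen = Counter()
    tagged = []
    for tok in tokens:
        seen[tok] += 1
        tagged.append(f"{tok}#{seen[tok]}")
    return tagged


# Hypothetical document in which "red" corresponds to a weighted attribute (e.g., color).
doc = ["red", "leather", "jacket"]
weighted = apply_weight(doc, attribute_words={"red"}, weight=3)
tagged = occurrence_tokens(weighted)
```

Here `tagged` contains three distinct tokens for the weighted word and one for each other word, so two documents that agree on the weighted attribute share a larger fraction of their tokens and are more likely to fall into the same buckets.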
13. In a digital medium environment to determine document similarity, a system comprising:
a processing system; and
a computer-readable storage medium having instructions stored thereon that, responsive to execution by the processing system, causes the processing system to perform operations comprising:
receiving an input specifying a similarity threshold, the similarity threshold defining a minimum number of a plurality of buckets that two documents, of a plurality of documents, are to be included in to be considered similar;
generating a plurality of signature data from the plurality of documents using hashing;
hashing the plurality of documents into respective ones of the plurality of buckets based on the plurality of signature data;
generating a filtered set of documents by removing first and second said documents from the plurality of documents that are not considered similar based on the similarity threshold;
generating a plurality of partitions from the filtered set of documents based on inclusion in a respective disjoint set of data of a plurality of disjoint sets of data; and
assigning the plurality of partitions to respective nodes of a plurality of computing devices to determine document similarity of respective ones of the filtered set of documents, to each other, within respective said partitions.
Dependent claims: 14, 15, 16, 17, 18, 20
Specification