Control of document similarity determinations by respective nodes of a plurality of computing devices

  • US 10,642,912 B2
  • Filed: 08/17/2016
  • Issued: 05/05/2020
  • Est. Priority Date: 08/17/2016
  • Status: Active Grant
First Claim

1. In a digital medium environment to determine document similarity of a plurality of documents, a method implemented by at least one computing device, the method comprising:

    receiving, by the at least one computing device, an input specifying a similarity threshold, the similarity threshold defining a minimum number of a plurality of buckets that two said documents are both to be included in to be considered similar;

    generating, by the at least one computing device, a plurality of signature data from the plurality of documents using hashing, the plurality of signature data resulting in a reduction of dimensionality of the plurality of documents;

    hashing, by the at least one computing device, the plurality of documents into respective ones of the plurality of buckets based on the plurality of signature data;

    generating, by the at least one computing device, a filtered set of documents by removing first and second said documents from the plurality of documents that are not considered similar based on the similarity threshold;

    generating, by the at least one computing device, a plurality of partitions from the filtered set of documents based on inclusion in a respective disjoint set of data; and

    assigning, by the at least one computing device, the plurality of partitions to respective nodes of a plurality of computing devices to determine document similarity of respective ones of the filtered set of documents, to each other, within respective said partitions.
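
The steps recited in claim 1 can be illustrated with a minimal sketch. The claim does not name specific algorithms, so everything concrete here is an assumption made for illustration: MinHash over word shingles for the signature data, banding of the signature to form bucket keys, a union-find structure for the disjoint sets, and a greedy least-loaded rule for assigning partitions to nodes. All function and parameter names (`shingles`, `minhash_signature`, `bucket_keys`, `similar_pairs`, `partitions`, `assign_to_nodes`, `num_hashes`, `bands`) are hypothetical and do not come from the patent.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (assumed document model)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _h(seed, value):
    """Seeded 64-bit hash built on BLAKE2b, so results are stable across runs."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=32):
    """Signature data: one minimum hash per seed (assumed MinHash scheme)."""
    return tuple(min(_h(seed, it) for it in items) for seed in range(num_hashes))


def bucket_keys(signature, bands=8):
    """Split the signature into bands; each (band index, band values) is a bucket."""
    rows = len(signature) // bands
    return [(b, signature[b * rows:(b + 1) * rows]) for b in range(bands)]


def similar_pairs(docs, threshold, num_hashes=32, bands=8):
    """Return pairs of document ids that share at least `threshold` buckets."""
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for key in bucket_keys(sig, bands):
            buckets[key].add(doc_id)
    shared = defaultdict(int)
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            shared[(a, b)] += 1
    # Documents that appear in no surviving pair are thereby filtered out.
    return {pair for pair, n in shared.items() if n >= threshold}


def partitions(pairs):
    """Group the filtered documents into disjoint sets with union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
    groups = defaultdict(set)
    for x in parent:
        groups[find(x)].add(x)
    # Largest partitions first; ties broken by member ids for determinism.
    return sorted(groups.values(), key=lambda g: (-len(g), sorted(g)))


def assign_to_nodes(parts, num_nodes):
    """Greedy assignment: each partition goes to the currently least-loaded node."""
    nodes = [[] for _ in range(num_nodes)]
    loads = [0] * num_nodes
    for part in parts:
        i = loads.index(min(loads))
        nodes[i].append(part)
        loads[i] += len(part)
    return nodes
```

Calling `similar_pairs(docs, threshold)` on a dictionary of document ids to text, then `partitions(...)` and `assign_to_nodes(..., num_nodes)`, yields one list of partitions per node. Each node would then compare documents only within its own partitions. Because partitions are disjoint, no pairwise comparison has to cross a node boundary, which is the point of the final assigning step.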
