Control of document similarity determinations by respective nodes of a plurality of computing devices

US 10,642,912 B2
Filed: 08/17/2016
Issued: 05/05/2020
Est. Priority Date: 08/17/2016
Status: Active Grant

Abstract

Techniques and systems are described to control a determination of document similarity. In one example, dimensionality of the documents is reduced through computation of a signature, e.g., via a hashing technique such as “minhashing” which is also known as min-wise independent permutations locality sensitive hashing. From these signatures, another hashing technique (e.g., locality sensitive hashing) is used to determine similarity of the signatures to each other. Identification of disjoint sets is then used as a basis to partition the documents for determination of document similarity by respective nodes of a plurality of computing devices. In this way, an amount of data shuffling between the nodes as part of the determination of document similarity may be reduced. In another example, a weighting is applied to attributes of documents as part of the determination of document similarity.
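As a rough illustration of the two hashing stages named in the abstract, the sketch below computes a minhash signature for a document's word set and then hashes bands of that signature into buckets, which is one common way to realize locality sensitive hashing. The function names, the use of BLAKE2b as the hash family, and the parameters `num_hashes` and `rows_per_band` are assumptions made for this example; they are not taken from the patent.

```python
import hashlib


def _hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token; each seed acts as one hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(words: set[str], num_hashes: int = 16) -> tuple[int, ...]:
    """Reduce a non-empty word set to a fixed-length signature of per-seed minima."""
    return tuple(min(_hash(seed, w) for w in words) for seed in range(num_hashes))


def lsh_buckets(signature: tuple[int, ...], rows_per_band: int = 4) -> set:
    """Split the signature into bands; each (band index, band values) pair is a bucket key."""
    return {
        (start // rows_per_band, signature[start:start + rows_per_band])
        for start in range(0, len(signature), rows_per_band)
    }
```

Under these assumptions, two documents land in the same bucket when every row of one band agrees, so documents with heavily overlapping word sets tend to share more buckets than unrelated ones.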

20 Claims

1. In a digital medium environment to determine document similarity of a plurality of documents, a method implemented by at least one computing device, the method comprising:
- receiving, by the at least one computing device, an input specifying a similarity threshold, the similarity threshold defining a minimum number of a plurality of buckets that two said documents are both to be included in to be considered similar;
  
  generating, by the at least one computing device, a plurality of signature data from the plurality of documents using hashing, the plurality of signature data resulting in a reduction of dimensionality of the plurality of documents;
  
  hashing, by the at least one computing device, the plurality of documents into respective ones of the plurality of buckets based on the plurality of signature data;
  
  generating, by the at least one computing device, a filtered set of documents by removing first and second said documents from the plurality of documents that are not considered similar based on the similarity threshold;
  
  generating, by the at least one computing device, a plurality of partitions from the filtered set of documents based on inclusion in a respective disjoint set of data; and
  
  assigning, by the at least one computing device, the plurality of partitions to respective nodes of a plurality of computing devices to determine document similarity of respective ones of the filtered set of documents, to each other, within respective said partitions.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10):
  - 2. The method as described in claim 1, wherein the plurality of documents is configured as webpages, product descriptions, or social network communications.
  - 3. The method as described in claim 1, further comprising extracting word data from each document of the plurality of documents and filtering the extracted word data to locate meaningful word data.
  - 4. The method as described in claim 3, wherein the filtering of the extracted word data to locate meaningful word data includes removing word data included in a listing of rare or common word data.
  - 5. The method as described in claim 1, wherein the input specifying the similarity threshold is a user input.
  - 6. The method as described in claim 1, further comprising generating a recommendation by the at least one computing device based on the determination of similarity.
  - 7. The method as described in claim 1, wherein the generating of the plurality of signatures is based at least in part on locality-sensitive hashing (LSH).
  - 8. The method as described in claim 1, wherein the assigning includes applying a weight to an attribute described by words in respective ones of the plurality of documents.
  - 9. The method as described in claim 8, wherein the applying of the weight includes adding additional instances of the words that describe the attribute to the respective ones of the plurality of documents based on the weight.
  - 10. The method as described in claim 8, wherein the attribute and the weight are user specified via one or more inputs.
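
One possible reading of the filtering, partitioning, and assigning steps of claim 1 is sketched below. It assumes each document has already been mapped to a set of bucket keys, treats the similarity threshold as a minimum count of shared buckets, finds disjoint sets with a union-find structure, and hands out partitions round-robin. The names `similar_pairs`, `partition`, and `assign`, and the round-robin policy, are hypothetical choices for this example and are not stated in the claims.

```python
from collections import defaultdict
from itertools import combinations


def similar_pairs(doc_buckets: dict, threshold: int) -> list:
    """Return document pairs that co-occur in at least `threshold` buckets."""
    members = defaultdict(list)
    for doc, buckets in doc_buckets.items():
        for bucket in buckets:
            members[bucket].append(doc)
    shared = defaultdict(int)
    for docs in members.values():
        for a, b in combinations(sorted(docs), 2):
            shared[(a, b)] += 1
    return sorted(pair for pair, count in shared.items() if count >= threshold)


def partition(pairs: list) -> list:
    """Group documents into disjoint sets connected by similar pairs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for doc in list(parent):
        groups[find(doc)].add(doc)
    return sorted(groups.values(), key=sorted)


def assign(partitions: list, num_nodes: int) -> dict:
    """Assign whole partitions to node indices round-robin (an assumed policy)."""
    assignment = defaultdict(list)
    for i, part in enumerate(partitions):
        assignment[i % num_nodes].append(part)
    return dict(assignment)
```

Documents that never reach the threshold with any other document do not appear in any pair, so they drop out of the filtered set. Because the partitions are disjoint, every pair that still needs a full comparison sits inside a single partition, which is consistent with the abstract's point about reducing data shuffling between nodes.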

11. In a digital medium environment to determine document similarity, a method implemented by at least one computing device, the method comprising:
- receiving, by the at least one computing device, an input specifying a similarity threshold and a weight of an attribute, the similarity threshold defining a minimum number of a plurality of buckets that two documents of a plurality of documents are both to be included in to be considered similar;
  
  applying, by the at least one computing device, the weight to a word that corresponds to the attribute, the applying including adding additional instances of the word to the respective ones of the plurality of documents based on the weight;
  
  generating, by the at least one computing device using hashing, a plurality of signature data from the plurality of documents having the applied weighting;
  
  hashing, by the at least one computing device, the plurality of documents into respective ones of the plurality of buckets based on the plurality of signature data;
  
  generating, by the at least one computing device, a filtered set of documents by removing first and second said documents from the plurality of documents that are not considered similar based on the similarity threshold;
  
  generating, by the at least one computing device, a plurality of partitions from the filtered set of documents based on inclusion in a respective disjoint set of data; and
  
  assigning, by the at least one computing device, the plurality of partitions to respective nodes of a plurality of computing devices to determine document similarity of the filtered set of documents, to each other, within respective said partitions.
- Dependent claims (12, 19):
  - 12. The method as described in claim 11, wherein the indication of the similarity threshold data and the weight are user specified.
  - 19. The method as described in claim 11, further comprising generating a recommendation based on the determination of document similarity.
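
The weighting step of claim 11, which adds additional instances of a word based on a weight, can be illustrated with a small sketch. It assumes the weight is the total number of times each attribute word should appear, and it uses a multiset Jaccard measure only to show the effect of the repetition; both the interpretation of the weight and the names `apply_weight` and `jaccard_multiset` are assumptions made for this example.

```python
from collections import Counter


def apply_weight(words: list, attribute_words: set, weight: int) -> list:
    """Copy `words`, repeating each attribute word so it appears `weight` times per occurrence."""
    if weight < 1:
        raise ValueError("weight must be at least 1")
    weighted = []
    for word in words:
        copies = weight if word in attribute_words else 1
        weighted.extend([word] * copies)
    return weighted


def jaccard_multiset(a: list, b: list) -> float:
    """Jaccard similarity over word counts, so repeated instances matter."""
    count_a, count_b = Counter(a), Counter(b)
    intersection = sum((count_a & count_b).values())
    union = sum((count_a | count_b).values())
    return intersection / union if union else 0.0
```

For example, `["red", "cotton", "shirt"]` and `["red", "wool", "scarf"]` score 1/5 unweighted, and 3/7 once the color word `"red"` is given a weight of 3 in both documents, so agreement on the weighted attribute counts for more. A signature scheme that operates on plain sets would collapse the repeats, so the added instances would have to stay distinguishable there; how the patent handles that detail is not stated in the claims.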

13. In a digital medium environment to determine document similarity, a system comprising:
- a processing system; and
  
  a computer-readable storage medium having instructions stored thereon that, responsive to execution by the processing system, causes the processing system to perform operations comprising;
  
  receiving an input specifying a similarity threshold, the similarity threshold defining a minimum number of a plurality of buckets that two documents, of a plurality of documents, are to be included in to be considered similar;
  
  generating a plurality of signature data from the plurality of documents using hashing;
  
  hashing the plurality of documents into respective ones of the plurality of buckets based on the plurality of signature data;
  
  generating a filtered set of documents by removing first and second said documents from the plurality of documents that are not considered similar based on the similarity threshold;
  
  generating a plurality of partitions from the filtered set of documents based on inclusion in a respective disjoint set of data of a plurality of disjoint sets of data; and
  
  assigning the plurality of partitions to respective nodes of a plurality of computing devices to determine document similarity of respective ones of the filtered set of documents, to each other, within respective said partitions.
- Dependent claims (14, 15, 16, 17, 18, 20):
  - 14. The system as described in claim 13, wherein the operations further comprising:
    - extracting word data from each document of the plurality of documents; and
      
      filtering the extracted word data to locate meaningful word data.
  - 15. The system as described in claim 14, wherein the filtering of the extracted word data to locate meaningful words includes removing word data included in a listing of rare or common word data.
  - 16. The system as described in claim 14, wherein the generating includes applying a weight to an attribute described by word data in respective ones of the plurality of documents to control the determination of document similarity.
  - 17. The system as described in claim 16, wherein the application of the weight includes adding additional instances of the word data that describe the attribute to the respective ones of the plurality of documents based on the weight.
  - 18. The system as described in claim 16, wherein the attribute and the weight are user specified via one or more inputs.
  - 20. The system as described in claim 16, the operations further comprising generating a recommendation based on the determination of document similarity.
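
The word extraction and filtering described in claims 14 and 15 (and in claims 3 and 4) might look like the following sketch. It assumes the listing of rare or common word data is derived from document frequency, with the cutoffs `min_docs` and `max_fraction` chosen arbitrarily for illustration; the claims do not say how the listing is built, and all names here are hypothetical.

```python
import re
from collections import Counter


def extract_words(text: str) -> list:
    """Lowercase the text and pull out alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_removal_list(docs: list, min_docs: int = 2, max_fraction: float = 0.8) -> set:
    """List words found in fewer than `min_docs` documents (rare)
    or in more than `max_fraction` of all documents (common)."""
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(extract_words(text)))
    total = len(docs)
    return {
        word
        for word, freq in doc_freq.items()
        if freq < min_docs or freq / total > max_fraction
    }


def meaningful_words(text: str, removal_list: set) -> list:
    """Keep only the words that are not on the removal list."""
    return [w for w in extract_words(text) if w not in removal_list]
```

With the three documents "the red shirt", "the red scarf", and "the blue hat", the word "the" is removed as common, the single-document words are removed as rare, and only "red" survives for the first document.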

Current Assignee: Adobe Inc.
Original Assignee: Adobe Inc.
Inventors: Verma, Anshul; Russell, Kenneth G.
Primary Examiner(s): Gorney, Boris
Assistant Examiner(s): Shah, Vaishali
Application Number: US15/239,521
Publication Number: US 20180052933A1
Time in Patent Office: 1,357 Days
CPC Class Codes:
G06F 16/9535 Search customisation based ...
G06Q 50/01 Social networking
