Identifying potential duplicates of a document in a document corpus

US 9,195,714 B1
Filed: 02/17/2011
Issued: 11/24/2015
Est. Priority Date: 12/06/2007
Status: Active Grant

Abstract

According to aspects of the disclosed subject matter, a method is provided for identifying a set of documents from a document corpus that are potential duplicates of a source document. A source document is obtained. A list of queries corresponding to the source document is identified. Each query in the identified list of queries is executed on the document corpus, wherein the execution of each query yields a corresponding results set identifying an ordered set of documents in the document corpus. For each document identified in each results set, a document score is generated for the identified document based on the identified document's ordinal position in its results set. A subset of the identified documents of the results set is selected according to the generated document scores that satisfy predetermined selection criteria. The selected subset of identified documents is stored or displayed.
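
The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `search` callable, the linear position-to-score formula, and the choice to keep each document's best score across results sets are all assumptions made for the example.

```python
from typing import Callable, Dict, List


def score_by_position(results: List[str]) -> Dict[str, float]:
    """Score documents by ordinal position (assumed formula).

    The first result scores 1.0 and later results score progressively less.
    """
    n = len(results)
    return {doc_id: (n - rank) / n for rank, doc_id in enumerate(results)}


def find_potential_duplicates(
    queries: List[str],
    search: Callable[[str], List[str]],  # hypothetical: returns ordered doc ids
    threshold: float,
) -> List[str]:
    """Run every query, score results by position, keep documents at or above threshold."""
    best: Dict[str, float] = {}
    for query in queries:
        for doc_id, score in score_by_position(search(query)).items():
            # Assumption: a document found by several queries keeps its best score.
            best[doc_id] = max(score, best.get(doc_id, 0.0))
    return sorted(doc_id for doc_id, score in best.items() if score >= threshold)


# Toy corpus standing in for a real search index (hypothetical data).
FAKE_INDEX = {
    "red widget": ["A", "B", "C", "D"],
    "widget 500": ["B", "E"],
}

duplicates = find_potential_duplicates(
    ["red widget", "widget 500"],
    lambda q: FAKE_INDEX.get(q, []),
    threshold=0.75,
)
```

Here `threshold` plays the role of the "predetermined selection criteria"; a real system would presumably tune both the scoring formula and the cutoff.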

20 Claims

1. A method, comprising:
- performing, by one or more computers, initiating, based on receiving a source document, a routine for identifying one or more candidate duplicate documents of the source document from a document corpus, said identifying comprising:
  
  receiving the source document, wherein a document type of the source document is the same as a document type of at least some of a plurality of reference documents in the document corpus;
  
  determining a plurality of different queries from content of the source document;
  
  executing the plurality of different queries on the document corpus in response to receiving the source document, wherein the plurality of different queries differ from one another by at least one search term, wherein individual queries of the plurality of different queries produce a respective list of documents that identifies at least some of the plurality of reference documents of the document corpus, wherein the respective reference documents of the respective lists of documents are scored at least in part with respect to the source document;
  
  based, at least in part, on scores for the reference documents from at least two of the respective lists, selecting one or more of the reference documents having a respective score that meets a score threshold as potential duplicates of the same received source document that initiated the routine; and
  
  storing an identification of the one or more potential duplicate documents.
  - 2. The method of claim 1, further comprising:
    - performing, by said one or more computers, prior to said executing the different queries, determining one or more of the different queries based on at least one of a term, word, or phrase within the source document.
  - 3. The method of claim 1, further comprising:
    - performing, by said one or more computers, prior to said executing the different queries, determining one or more of the different queries based on a set of stored queries.
  - 4. The method of claim 1, wherein at least one of the different queries is configured to score a reference document based on the reference document's relevance with respect to the source document.
  - 5. The method of claim 4, wherein the at least one of the different queries is further configured to score the reference document based on a number of reference documents identified in the at least one query's respective list of documents.
  - 6. The method of claim 1, further comprising:
    - performing, by said one or more computers, prior to said determining that one or more reference documents are potential duplicates of the received source document, determining that a reference document is identified in two or more of the lists of documents, the two or more lists of documents having a respective different score for the reference document; and
      
      assigning the highest of the different scores to the reference document.
  - 7. The method of claim 1, wherein the score threshold is an absolute score threshold applied to a collection of the identified reference documents.
  - 8. The method of claim 1, wherein the score threshold is a relative score threshold applied to a collection of the identified reference documents.
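
Claims 6 to 8 can be illustrated with a short sketch: merging per-query score lists so that each reference document keeps its highest score, then applying either an absolute or a relative threshold. The claims do not define how a relative threshold is computed; treating it as a fraction of the top score in the collection is an assumption made here, and all function names are hypothetical.

```python
from typing import Dict, List


def merge_highest(score_lists: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-query scores; a document in several lists keeps its highest score."""
    merged: Dict[str, float] = {}
    for scores in score_lists:
        for doc_id, score in scores.items():
            if doc_id not in merged or score > merged[doc_id]:
                merged[doc_id] = score
    return merged


def select_absolute(scores: Dict[str, float], minimum: float) -> List[str]:
    """Absolute threshold: keep documents whose score is at least `minimum`."""
    return sorted(d for d, s in scores.items() if s >= minimum)


def select_relative(scores: Dict[str, float], fraction: float) -> List[str]:
    """Relative threshold (assumed meaning): keep documents scoring at least
    `fraction` of the best score in the collection."""
    if not scores:
        return []
    cutoff = fraction * max(scores.values())
    return sorted(d for d, s in scores.items() if s >= cutoff)


merged = merge_highest([{"A": 0.9, "B": 0.4}, {"B": 0.7, "C": 0.2}])
```

With these toy scores, document B appears in both lists and keeps 0.7 rather than 0.4. An absolute cutoff of 0.5 selects A and B, while a relative cutoff of 0.8 (that is, 0.72 given the top score of 0.9) selects only A.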

9. A non-transitory computer-readable storage medium having program instructions stored thereon that, in response to execution by a computer system, cause the computer system to perform operations comprising:
- initiating, based on receiving a source document, a routine for identifying one or more candidate duplicate documents of the source document from a document corpus, said identifying comprising:
  
  receiving the source document, wherein a document type of the source document is the same as a document type of at least some of a plurality of reference documents in the document corpus;
  
  determining two or more different queries, from content of the source document;
  
  in response to receiving the source document, executing the two or more different queries on the document corpus, wherein the two or more different queries differ from one another by at least one search term, wherein individual ones of the two or more different queries return a respective list of reference documents that identifies at least some of the plurality of reference documents of the document corpus, and wherein the respective reference documents identified in the respective lists are associated with a respective score representing, at least in part, a relevance of that reference document with respect to the source document;
  
  based, at least in part, on the scores for the reference documents from at least two of the respective lists, selecting one or more of the reference documents having a respective score that meets a score threshold as potential duplicates of the same received source document that initiated the routine; and
  
  storing an identification of the one or more potential duplicate documents.
  - 10. The non-transitory computer-readable storage medium of claim 9, said operations further comprising:
    - prior to said executing the two or more different queries, determining one or more of the different queries based on at least one of a term, word, or phrase within the source document.
  - 11. The non-transitory computer-readable storage medium of claim 9, said operations further comprising:
    - prior to said executing the two or more different queries, determining one or more of the different queries based on a set of stored queries.
  - 12. The non-transitory computer-readable storage medium of claim 9, wherein at least one of the different queries is configured to score a reference document based on the reference document's relevance with respect to the source document and on a number of reference documents identified by the at least one query's respective list of documents.
  - 13. The non-transitory computer-readable storage medium of claim 9, the operations further comprising:
    - prior to said determining that one or more reference documents are potential duplicates of the received source document, determining that a reference document is identified in two or more of the lists of documents, the two or more lists of documents having a respective different score for the reference document; and
      
      assigning the highest of the different scores to the reference document.
  - 14. The non-transitory computer-readable storage medium of claim 9, wherein the score threshold is a relative score threshold applied to a collection of the identified reference documents.
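
One way to picture how "two or more different queries" that "differ from one another by at least one search term" might be derived from the content of a source document is a leave-one-out scheme over its distinct terms. The claims do not prescribe any particular scheme; the tokenization rule and the function names below are assumptions for illustration only.

```python
import re
from typing import List


def extract_terms(text: str) -> List[str]:
    """Lowercase alphanumeric tokens, de-duplicated in order of first appearance."""
    terms: List[str] = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token not in terms:
            terms.append(token)
    return terms


def leave_one_out_queries(text: str) -> List[str]:
    """Build one query per term by omitting that term, so any two queries
    differ by at least one search term."""
    terms = extract_terms(text)
    if len(terms) < 2:
        return [" ".join(terms)] if terms else []
    return [
        " ".join(t for j, t in enumerate(terms) if j != i)
        for i in range(len(terms))
    ]


queries = leave_one_out_queries("Acme Red Widget")
```

For the toy title above this yields three queries, each missing one of the three terms. Queries drawn from a set of stored queries, as in claims 3, 11, and 17, would replace or supplement this step.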

15. A computer system, comprising:
- a memory that, during operation, stores instructions; and
  
  a processor that, during operation, retrieves instructions from the memory and executes at least some of the instructions to cause the computer system to:
  
  initiate, based on receipt of a source document, a routine for identification of one or more candidate duplicate documents of the source document from a document corpus, said identification comprising:
  
  receive the source document, wherein a document type of the source document is the same as a document type of at least some of a plurality of reference documents in the document corpus;
  
  determine a plurality of queries from content of the source document;
  
  execute the plurality of queries on the document corpus, wherein the plurality of queries includes:
  
  a first query configured to return a first list of reference documents that identifies at least some of the plurality of reference documents of the document corpus, wherein reference documents in the first list are associated with scores representing at least in part relevance with respect to the source document;
  
  a second query different from the first query by at least one search term from the first query and configured to return a second list of reference documents that also identifies at least some of the plurality of reference documents of the document corpus, wherein reference documents in the second list are associated with scores representing at least in part relevance with respect to the source document;
  
  based, at least in part, on scores for the reference documents for the first and second list, select one or more of the reference documents having a respective score that meets a score threshold as potential duplicates of the same received source document that initiated the routine; and
  
  store an identification of the one or more potential duplicate documents.
  - 16. The system of claim 15, wherein at least some of the instructions further cause the computer system to:
    - prior to said execution of the plurality of queries, determine at least one of the plurality of queries based on at least one of a term, word, or phrase within the source document.
  - 17. The system of claim 15, wherein at least some of the instructions further cause the computer system to:
    - prior to said execution of the plurality of queries, determine at least one of the plurality of queries based on a set of stored queries.
  - 18. The system of claim 15, wherein at least one of the plurality of queries is configured to score a reference document based on the reference document's relevance with respect to the source document and on a number of reference documents identified by the at least one query's respective list of documents.
  - 19. The system of claim 15, wherein at least some of the instructions further cause the computer system to:
    - prior to said determination that one or more reference documents are potential duplicates of the received source document, determine that a reference document is identified in the first and second lists of documents, the first and second lists of documents having a respective different score for the reference document; and
      
      assign the highest of the different scores to the reference document.
  - 20. The system of claim 15, wherein the score threshold is a relative score threshold applied to reference documents in the first and second lists of documents.
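
Claims 5, 12, and 18 describe a score that depends both on relevance and on how many reference documents a query's list identifies, without stating how the two are combined. The sketch below assumes one plausible reading: a query that returns fewer documents is treated as more discriminating, so relevance is damped logarithmically as the list grows. The formula and the function name are hypothetical.

```python
import math
from typing import Dict


def adjust_for_list_size(relevance: Dict[str, float]) -> Dict[str, float]:
    """Scale each relevance score by the size of the list it came from.

    Assumed formula: score = relevance / (1 + ln(n)), where n is the number
    of documents in the list. A single-result list leaves relevance unchanged.
    """
    n = len(relevance)
    if n == 0:
        return {}
    damping = 1.0 + math.log(n)
    return {doc_id: r / damping for doc_id, r in relevance.items()}


narrow = adjust_for_list_size({"A": 0.8})
broad = adjust_for_list_size({"A": 0.8, "B": 0.6, "C": 0.3})
```

Under this assumption the same raw relevance of 0.8 counts for more when it comes from the one-document list than from the three-document list, while the ordering within each list is preserved.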

Current Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors
Thirumalai, Srikanth; Manoharan, Aswath; Tomko, Mark J.; Emery, Grant M.; Mohan, Vijai
Primary Examiner(s)
Nguyen, Loan T

Application Number

US13/030,114
Time in Patent Office

1,741 Days
Field of Search
US Class Current

1/1
CPC Class Codes

G06F 16/24553   of query operations

G06F 16/24575   using context

G06F 16/355   Class or cluster creation o...
