Identifying potential duplicates of a document in a document corpus
First Claim
1. A method, comprising:
performing, by one or more computers, initiating, based on receiving a source document, a routine for identifying one or more candidate duplicate documents of the source document from a document corpus, said identifying comprising:
receiving the source document, wherein a document type of the source document is the same as a document type of at least some of a plurality of reference documents in the document corpus;
determining a plurality of different queries from content of the source document;
executing the plurality of different queries on the document corpus in response to receiving the source document, wherein the plurality of different queries differ from one another by at least one search term, wherein individual queries of the plurality of different queries produce a respective list of documents that identifies at least some of the plurality of reference documents of the document corpus, wherein the respective reference documents of the respective lists of documents are scored at least in part with respect to the source document;
based, at least in part, on scores for the reference documents from at least two of the respective lists, selecting one or more of the reference documents having a respective score that meets a score threshold as potential duplicates of the same received source document that initiated the routine; and
storing an identification of the one or more potential duplicate documents.
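The "determining a plurality of different queries from content of the source document" step can be illustrated with a minimal sketch. The claim does not say how the queries are derived, so everything here is an assumption made for illustration: the hypothetical `derive_queries` helper, the small stop-word list, the frequency-based term ranking, and the sliding-window construction. The only claimed property the sketch is meant to show is that every query differs from every other query by at least one search term.

```python
import re
from collections import Counter

# Hypothetical stop-word list, chosen only for this illustration.
STOPWORDS = {"the", "and", "when", "with", "for", "that", "this", "from"}


def derive_queries(text, terms_per_query=3, max_queries=4):
    """Build several term-set queries from the content of a source document.

    Assumed approach (not taken from the claim): rank the distinct terms by
    frequency, then slide a fixed-size window over the ranking. Because the
    ranked terms are distinct, two windows that start at different positions
    always differ by at least one term.
    """
    words = [
        w
        for w in re.findall(r"[a-z0-9]+", text.lower())
        if len(w) > 2 and w not in STOPWORDS
    ]
    counts = Counter(words)
    # Most frequent first; ties broken alphabetically so output is stable.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))

    queries = []
    for start in range(len(ranked) - terms_per_query + 1):
        queries.append(tuple(ranked[start:start + terms_per_query]))
        if len(queries) == max_queries:
            break
    return queries


source_text = (
    "printer jams when printing; printer driver crash on printing large pdf"
)
for query in derive_queries(source_text):
    print(" ".join(query))
```

Each resulting tuple would then be submitted as a separate search against the document corpus; how the search engine interprets the terms is outside the scope of this sketch.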
Abstract
According to aspects of the disclosed subject matter, a method for identifying a set of documents from a document corpus that are potential duplicates of a source document is provided. A source document is obtained. A list of queries corresponding to the source document is identified. Each query in the identified list of queries is executed on the document corpus, wherein the execution of each query yields a corresponding results set identifying an ordered set of documents in the document corpus. For each document identified in each results set, a document score is generated for the identified document based on the identified document's ordinal position in its results set. A subset of the identified documents of the results sets is selected according to the generated document scores that satisfy predetermined selection criteria. The selected subset of identified documents is stored or displayed.
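The scoring and selection steps described in the abstract can be sketched as follows. The abstract states only that a document's score is based on its ordinal position in a results set and that documents satisfying selection criteria are kept. The linear position-to-points formula, the summing of points across results sets, the `list_depth` cutoff, and the function names `score_by_rank` and `select_candidates` are all assumptions made for illustration, not the method of the disclosure.

```python
from collections import defaultdict


def score_by_rank(result_lists, list_depth=10):
    """Score documents from their ordinal position in each results set.

    Assumed formula: the document at position 0 earns `list_depth` points,
    position 1 earns `list_depth - 1`, and so on. Points are summed across
    all results sets in which the document appears.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for position, doc_id in enumerate(results[:list_depth]):
            scores[doc_id] += list_depth - position
    return dict(scores)


def select_candidates(scores, threshold):
    """Keep documents whose score meets the threshold, highest score first."""
    picked = [doc_id for doc_id, score in scores.items() if score >= threshold]
    return sorted(picked, key=lambda doc_id: (-scores[doc_id], doc_id))


# Three ordered results sets, one per executed query (made-up document IDs).
result_lists = [
    ["d7", "d2", "d9"],
    ["d2", "d7", "d4"],
    ["d2", "d5"],
]
scores = score_by_rank(result_lists, list_depth=3)
print(select_candidates(scores, threshold=5))  # ['d2', 'd7']
```

In this example `d2` and `d7` rank highly in more than one results set, so their summed scores (8 and 5) meet the threshold, while documents that appear once near the bottom of a single list do not.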
20 Claims
1. A method, comprising:
performing, by one or more computers, initiating, based on receiving a source document, a routine for identifying one or more candidate duplicate documents of the source document from a document corpus, said identifying comprising:
receiving the source document, wherein a document type of the source document is the same as a document type of at least some of a plurality of reference documents in the document corpus;
determining a plurality of different queries from content of the source document;
executing the plurality of different queries on the document corpus in response to receiving the source document, wherein the plurality of different queries differ from one another by at least one search term, wherein individual queries of the plurality of different queries produce a respective list of documents that identifies at least some of the plurality of reference documents of the document corpus, wherein the respective reference documents of the respective lists of documents are scored at least in part with respect to the source document;
based, at least in part, on scores for the reference documents from at least two of the respective lists, selecting one or more of the reference documents having a respective score that meets a score threshold as potential duplicates of the same received source document that initiated the routine; and
storing an identification of the one or more potential duplicate documents.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A non-transitory computer-readable storage medium having program instructions stored thereon that, in response to execution by a computer system, cause the computer system to perform operations comprising:
initiating, based on receiving a source document, a routine for identifying one or more candidate duplicate documents of the source document from a document corpus, said identifying comprising:
receiving the source document, wherein a document type of the source document is the same as a document type of at least some of a plurality of reference documents in the document corpus;
determining two or more different queries, from content of the source document;
in response to receiving the source document, executing the two or more different queries on the document corpus, wherein the two or more different queries differ from one another by at least one search term, wherein individual ones of the two or more different queries return a respective list of reference documents that identifies at least some of the plurality of reference documents of the document corpus, and wherein the respective reference documents identified in the respective lists are associated with a respective score representing, at least in part, a relevance of that reference document with respect to the source document;
based, at least in part, on the scores for the reference documents from at least two of the respective lists, selecting one or more of the reference documents having a respective score that meets a score threshold as potential duplicates of the same received source document that initiated the routine; and
storing an identification of the one or more potential duplicate documents.
Dependent claims: 10, 11, 12, 13, 14
15. A computer system, comprising:
a memory that, during operation, stores instructions; and
a processor that, during operation, retrieves instructions from the memory and executes at least some of the instructions to cause the computer system to:
initiate, based on receipt of a source document, a routine for identification of one or more candidate duplicate documents of the source document from a document corpus, said identification comprising:
receive the source document, wherein a document type of the source document is the same as a document type of at least some of a plurality of reference documents in the document corpus;
determine a plurality of queries from content of the source document;
execute the plurality of queries on the document corpus, wherein the plurality of queries includes:
a first query configured to return a first list of reference documents that identifies at least some of the plurality of reference documents of the document corpus, wherein reference documents in the first list are associated with scores representing at least in part relevance with respect to the source document;
a second query different from the first query by at least one search term from the first query and configured to return a second list of reference documents that also identifies at least some of the plurality of reference documents of the document corpus, wherein reference documents in the second list are associated with scores representing at least in part relevance with respect to the source document;
based, at least in part, on scores for the reference documents for the first and second list, select one or more of the reference documents having a respective score that meets a score threshold as potential duplicates of the same received source document that initiated the routine; and
store an identification of the one or more potential duplicate documents.
Dependent claims: 16, 17, 18, 19, 20
Specification