Information retrieval systems with duplicate document detection and presentation functions
Abstract
Many companies provide online search facilities that enable users to conduct computerized searches for documents. Unfortunately, these searches frequently provide results that include duplicate documents—that is, documents that are completely or substantially identical to each other. This problem is particularly vexing when searching news stories, for example. Moreover, the duplicate documents are intermixed in the search results, leaving users to manually manage the complexities of identifying and/or filtering them. Accordingly, the present inventors devised systems, methods, and software that facilitate the identification and/or grouping of duplicate documents in search results. One exemplary system includes a signature generation module which generates document signatures based on length, temporal, and/or content components; a real-time duplicate detection module which uses the document signatures to identify “exact” or “fuzzy” duplicate documents; and a user-interface or presentation module which controls how duplicate documents are presented or suppressed in search results.
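
As a rough illustration of the presentation module described in the abstract, the sketch below groups search results that share a signature and either annotates or suppresses the duplicates. It covers only the "exact" case; the `Result` class, the `signature` field, and the output format are hypothetical and are not taken from the patent.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Result:
    doc_id: str
    signature: str  # hypothetical precomputed document signature


def group_duplicates(results):
    """Group results that share a signature, keeping first-seen rank order."""
    groups = OrderedDict()
    for r in results:
        groups.setdefault(r.signature, []).append(r.doc_id)
    return list(groups.values())


def present(results, suppress=False):
    """Return one display line per group; duplicates are listed or hidden."""
    lines = []
    for group in group_duplicates(results):
        head, dups = group[0], group[1:]
        if suppress or not dups:
            lines.append(head)
        else:
            lines.append(f"{head} (+{len(dups)} duplicate(s): {', '.join(dups)})")
    return lines


results = [Result("a", "s1"), Result("b", "s2"), Result("c", "s1")]
print(present(results))
print(present(results, suppress=True))
```

"Fuzzy" duplicates would need a similarity test between signatures rather than the equality used here.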
39 Claims

1. An information-retrieval system comprising:
a plurality of databases; and
one or more servers for facilitating client access to the plurality of databases over a network, with the one or more servers collectively comprising:
signature-generation means for generating a plurality of document signatures, with each document signature based on a plurality of features from and their respective positions in a corresponding document in one or more of the databases, the signature-generation means comprising means for forming a document signature based on one or more of the group consisting of a document hash value and a document feature vector, wherein the hash value is based on features and positions of the features within a document;
query-definition means for defining a query and directing identification of search-result documents that include content duplicative of one or more other search-result documents;
duplicate-determination means for determining, based on a subset of the document signatures, whether one or more documents within results of the query include content duplicative of content in one or more other documents within the results;
means for controlling display of results of the query with at least one of the displayed results indicated as including content duplicative of content in one or more other documents within the results; and
means for controlling output of results of the query to a printer or email transmission device, based on user selected options related to output of documents that include content duplicative of content of one or more other documents within the results.
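
One way to picture a hash value "based on features and positions of the features within a document" is sketched below. The feature choice (the first few tokens of at least five characters), the `position:token` encoding, and the use of SHA-1 are all assumptions made for illustration; the claim does not specify any of them.

```python
import hashlib


def positional_signature(text, num_features=8, min_len=5):
    """Hash a handful of (position, token) pairs drawn from the text.

    Assumption: a "feature" is any token with at least min_len characters,
    and its "position" is its index in the whitespace-split token list.
    """
    tokens = text.lower().split()
    picked = [(pos, tok) for pos, tok in enumerate(tokens)
              if len(tok) >= min_len][:num_features]
    payload = "|".join(f"{pos}:{tok}" for pos, tok in picked)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


a = positional_signature("breaking report markets rallied sharply today")
b = positional_signature("Breaking   REPORT markets rallied sharply today")
c = positional_signature("update breaking report markets rallied sharply today")
print(a == b)  # case and spacing differences do not change the signature
print(a == c)  # a shifted copy changes every position, so the hash differs
```

Because positions feed into the hash, this kind of signature only matches documents whose selected features line up exactly, which suits the "exact" duplicate case.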

8. An information-retrieval system comprising:
a plurality of databases; and
a server for providing users access to one or more of the databases, the server including:
means for defining and processing a query to generate results comprising documents that include content duplicative of content within one or more other documents within the results;
duplicate-determination means for determining whether one or more documents within the results of the query include content duplicative of content within one or more other documents within the results, wherein the duplicate-determination means includes:
means for comparing a first document signature for a first one of the documents within the results to a second document signature for a second one of the documents within the results, with each signature based on a plurality of terms and with corresponding positions within the corresponding document;
means for comparing, respectively, first and second lengths and first and second temporal features of the first and second documents;
means for comparing first and second hash values for the respective first and second documents, with each hash value based on features and positions of the features within its respective document; and
wherein the duplicate-determination means is adapted to determine whether the first and second documents are duplicates in response to the results of the means for comparing hash values;
means for controlling display of results of the query with at least one of the displayed results indicated as including content duplicative of content in one or more other documents within the results; and
means for controlling output of results of the query to a printer or email transmission device, based on user selected options related to output of documents that include content duplicative of content of one or more other documents within the results.

16. A computer-implemented method of identifying whether first and second documents contain duplicate content, the method comprising:
determining whether the first and second documents have corresponding temporal traits that are within a first range of each other;
determining whether the first and second documents have corresponding length traits that are within a second range of each other;
determining, in response to an affirmative determination that the temporal traits of the first and second documents are within the first range of each other and an affirmative determination that the length traits of the first and second documents are within the second range of each other, whether the first and second documents have a significant number of features in common with each other; and
identifying the first and second documents as duplicates in response to determining that the temporal traits of the first and second documents are within the first range of each other, that the lengths of the first and second documents are within the second range of each other, and that the first and second documents have at least a significant number of features in common with each other.
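
The three-stage test in claim 16 can be read as a cascade in which the cheap temporal and length checks gate the feature comparison. The following is a minimal sketch under stated assumptions: the `Doc` fields, the two-day window, the 10% length tolerance, and the threshold of three shared features are illustrative values, not figures from the claim.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Doc:
    published: date       # assumed temporal trait
    length: int           # assumed length trait (e.g. word count)
    features: frozenset   # assumed set of extracted features


def is_duplicate(a, b, max_days=2, max_length_ratio=0.1, min_shared=3):
    """Apply temporal and length gates, then count shared features."""
    # First range: publication dates must be close to each other.
    if abs(a.published - b.published) > timedelta(days=max_days):
        return False
    # Second range: lengths must be within a fraction of the longer document.
    if abs(a.length - b.length) > max_length_ratio * max(a.length, b.length):
        return False
    # Only now run the feature comparison.
    return len(a.features & b.features) >= min_shared


a = Doc(date(2024, 1, 1), 1000, frozenset({"x", "y", "z", "w"}))
b = Doc(date(2024, 1, 2), 1040, frozenset({"x", "y", "z", "q"}))
print(is_duplicate(a, b))
```

Ordering the checks this way means most non-duplicate pairs are rejected before any feature sets are intersected.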

20. A method comprising:
determining whether respective length traits of first and second documents are within a range of each other and whether the first and second documents have at least a threshold number of selected features in common with each other, with each feature selected based on a corresponding inverse-document-frequency (idf) value, wherein determining whether the first and second documents have at least the threshold number of features in common with each other, comprises:
defining respective first and second term vectors for the first and second documents, with each term vector including a plurality of equal-length binary representations of idf terms for its respective document; and
comparing a number of the binary representations of the first term vector to a number of binary representations of the second term vector; and
identifying the first and second documents as duplicates in response to the determination being affirmative.
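
The term-vector comparison in claim 20 might look something like the sketch below. It assumes that each term is encoded as the 16-bit binary form of its row index in an idf table, that the `k` highest-idf terms are kept per document, and that length is measured in tokens; the claim fixes none of these choices, and the idf values shown are made up for the example.

```python
BITS = 16  # assumed fixed width of every binary representation


def term_vector(tokens, idf, k=3):
    """Encode the k highest-idf terms of a document as fixed-width bit strings."""
    # Assumption: a term's code is its row index in the sorted idf table.
    codes = {term: format(i, f"0{BITS}b") for i, term in enumerate(sorted(idf))}
    present = {t for t in tokens if t in idf}
    top = sorted(present, key=lambda t: (-idf[t], t))[:k]
    return {codes[t] for t in top}


def likely_duplicates(tokens_a, tokens_b, idf, min_shared=3, max_length_ratio=0.2):
    """Length gate followed by a count of shared binary representations."""
    la, lb = len(tokens_a), len(tokens_b)
    if abs(la - lb) > max_length_ratio * max(la, lb):
        return False
    shared = term_vector(tokens_a, idf) & term_vector(tokens_b, idf)
    return len(shared) >= min_shared


# Illustrative idf table; the values are invented for this example.
idf = {"merger": 5.0, "acquisition": 4.5, "regulator": 4.0,
       "tuesday": 1.0, "said": 0.5, "the": 0.1}
doc1 = "the regulator said the merger acquisition tuesday".split()
doc2 = "the regulator said merger acquisition on tuesday".split()
print(likely_duplicates(doc1, doc2, idf))
```

Equal-length codes keep the vectors compact and make the comparison a plain set intersection, regardless of how long the underlying terms are.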

21. A computer-implemented method comprising:
receiving a user query;
identifying at least first and second documents from a database in response to the user query, with the first and second documents associated with respective first and second feature vectors, each feature vector having a plurality of equal-length binary representations of terms or features within its respective document, with the terms or features selected based on relative magnitude of corresponding inverse-document-frequency (idf) values within a table of inverse-document-frequency values;
determining whether temporal traits of the first and second documents are within a first range of each other;
determining whether length traits of the first and second documents are within a second range of each other; and
in response to identifying the first and second documents and at least one determination being affirmative, comparing the first and second feature vectors.

25. A computer-based system for identifying whether first and second documents contain duplicate content, the system comprising a processor, a memory, a user interface, and code executable by the processor, the system further comprising:
means for receiving a user query;
means for identifying at least first and second documents from a database in response to the user query, with the first and second documents associated with respective first and second feature vectors, each feature vector having a plurality of equal-length binary representations of terms or features within its respective document, with the terms or features selected based on relative magnitude of corresponding inverse-document-frequency (idf) values within a table of inverse-document-frequency values;
first determining means for determining whether temporal traits of the first and second documents are within a first range of each other;
second determining means for determining whether length traits of the first and second documents are within a second range of each other; and
means for comparing the first and second feature vectors in response to having identified the first and second documents and at least one determination being affirmative.

34. An information-retrieval system comprising:
a plurality of databases; and
a server for providing users access to one or more of the databases, the server including:
means for defining and processing a query to generate results comprising documents that include content duplicative of content within one or more other documents within the results;
duplicate-determination means for determining whether one or more documents within the results of the query include content duplicative of content within one or more other documents within the results, wherein the duplicate-determination means includes:
first means for comparing a first document signature for a first one of the documents within the results to a second document signature for a second one of the documents within the results, with each signature based on a plurality of terms and their corresponding positions within its corresponding document;
second means for comparing, respectively, first and second lengths and first and second temporal features of the first and second documents; and
third means for comparing a set of features common to the first and second documents, the set of features comprising features selected based on a corresponding inverse-document-frequency (idf) value, wherein the first and second documents have at least a threshold number of features in common with each other;
wherein the duplicate-determination means is adapted to determine whether the first and second documents are duplicates in response to the results of the third comparing means;
means for controlling display of results of the query with at least one of the displayed results indicated as including content duplicative of content in one or more other documents within the results; and
means for controlling output of results of the query to a printer or email transmission device, based on user selected options related to output of documents that include content duplicative of content of one or more other documents within the results.

36. A method comprising:
determining whether temporal values respectively associated with first and second documents are within a range of each other and whether the first and second documents have at least a threshold number of selected features in common with each other, with each feature selected based on a corresponding inverse-document-frequency (idf) value, wherein determining whether the first and second documents have at least the threshold number of features in common with each other, comprises:
defining respective first and second term vectors for the first and second documents, with each term vector including a plurality of equal-length binary representations of idf terms for its respective document; and
comparing a number of the binary representations of the first term vector to a number of binary representations of the second term vector; and
identifying the first and second documents as duplicates in response to the determination being affirmative.

37. A computer-based system for identifying whether first and second documents contain duplicate content, the system comprising a processor, a memory, a user interface, and code executable by the processor, the system further comprising:
first determining means for determining whether the first and second documents have corresponding temporal traits that are within a first range of each other;
second determining means for determining whether the first and second documents have corresponding length traits that are within a second range of each other;
third determining means for determining, in response to an affirmative determination that the temporal traits of the first and second documents are within the first range of each other and an affirmative determination that the length traits of the first and second documents are within the second range of each other, whether the first and second documents have a significant number of features in common with each other; and
identifying means for identifying the first and second documents as duplicates in response to determining that the temporal traits of the first and second documents are within the first range of each other, that the lengths of the first and second documents are within the second range of each other, and that the first and second documents have at least a significant number of features in common with each other.

38. A computer-based system for identifying whether first and second documents contain duplicate content, the system comprising a processor, a memory, a user interface, and code executable by the processor, the system further comprising:
means for determining whether respective length traits of first and second documents are within a range of each other and whether the first and second documents have at least a threshold number of selected features in common with each other, with each feature selected based on a corresponding inverse-document-frequency (idf) value;
means for defining respective first and second term vectors for the first and second documents, with each term vector including a plurality of equal-length binary representations of idf terms for its respective document;
means for comparing a number of the binary representations of the first term vector to a number of binary representations of the second term vector; and
means for identifying the first and second documents as duplicates in response to the determination being affirmative.

39. A computer-based system for identifying whether first and second documents contain duplicate content, the system comprising a processor, a memory, a user interface, and code executable by the processor, the system further comprising:
means for determining whether temporal values respectively associated with first and second documents are within a range of each other and whether the first and second documents have at least a threshold number of selected features in common with each other, with each feature selected based on a corresponding inverse-document-frequency (idf) value;
means for defining respective first and second term vectors for the first and second documents, with each term vector including a plurality of equal-length binary representations of idf terms for its respective document;
means for comparing a number of the binary representations of the first term vector to a number of binary representations of the second term vector; and
means for identifying the first and second documents as duplicates in response to the determination being affirmative.
Specification