Information retrieval systems with duplicate document detection and presentation functions
Abstract
Many companies provide online search facilities that enable users to conduct computerized searches for documents. Unfortunately, these searches frequently provide results that include duplicate documents—that is, documents that are completely or substantially identical to each other. This problem is particularly vexing when searching news stories, for example. Moreover, the duplicate documents are intermixed in the search results, leaving users to manually manage the complexities of identifying and/or filtering them. Accordingly, the present inventors devised systems, methods, and software that facilitate the identification and/or grouping of duplicate documents in search results. One exemplary system includes a signature generation module which generates document signatures based on length, temporal, and/or content components; a real-time duplicate detection module which uses the document signatures to identify “exact” or “fuzzy” duplicate documents; and a user-interface or presentation module which controls how duplicate documents are presented or suppressed in search results.
39 Claims
1. An information-retrieval system comprising:
a plurality of databases; and
one or more servers for facilitating client access to the plurality of databases over a network, with each of the servers including at least one of:
signature-generation means for generating a plurality of document signatures, with each document signature based on a plurality of features from a corresponding document in one or more of the databases;
query-definition means for defining a query and selecting an option related to identification of search-result documents that include content duplicative of one or more other search-result documents;
duplicate-determination means for determining, based on a subset of the document signatures, whether one or more documents within results of the query include content duplicative of content in one or more other documents within the results;
means for controlling display of results of the query based on the selected option, with at least one of the displayed results indicated as including content duplicative of content in one or more other documents within the results; and
means for controlling output of results of the query to a printer or email transmission device, based on user selected options related to output of documents that include content duplicative of content of one or more other documents within the results.
Dependent claims: 2, 3, 4
5. An information-retrieval system comprising:
a plurality of databases; and
a server for providing users access to one or more of the databases, the server including:
query-definition means for defining a query and selecting an option related to identification of documents within results of the query that include content duplicative of content within one or more other documents within the results;
duplicate-determination means for determining whether one or more documents within the results of the query include content duplicative of content within one or more other documents within the results;
means for controlling display of results of the query based on the selected option, with at least one of the displayed results indicated as including content duplicative of content in one or more other documents within the results; and
means for controlling output of results of the query to a printer or email transmission device, based on user selected options related to output of documents that include content duplicative of content of one or more other documents within the results.
Dependent claims: 6, 7, 8
9. A method comprising:
comparing first and second lengths of respective first and second documents;
comparing first and second content sets for the respective first and second documents; and
determining whether the first and second documents are duplicates based on results of comparing the first and second lengths or results of comparing the first and second content sets.
Dependent claims: 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
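For illustration only, the length and content-set comparisons recited in claim 9 might be sketched as follows. The claim does not specify how lengths or content sets are compared, so the character-length tolerance, the word-set tokenization, the Jaccard overlap measure, and every function name and threshold here are assumptions rather than the patented method.

```python
def tokenize(text: str) -> set[str]:
    """Lowercased word set; an assumed stand-in for a document's content set."""
    return set(text.lower().split())


def lengths_close(a: str, b: str, tolerance: float = 0.1) -> bool:
    """True when the lengths differ by at most `tolerance` of the longer one."""
    longer = max(len(a), len(b))
    if longer == 0:
        return True
    return abs(len(a) - len(b)) / longer <= tolerance


def content_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the two word sets (an assumed similarity measure)."""
    set_a, set_b = tokenize(a), tokenize(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def are_duplicates(a: str, b: str, min_overlap: float = 0.8) -> bool:
    # A length mismatch settles the question on its own; otherwise the
    # content-set comparison decides.
    if not lengths_close(a, b):
        return False
    return content_overlap(a, b) >= min_overlap
```

Checking length first is a design choice in this sketch: it is cheap, so the costlier set comparison runs only for documents of compatible size.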
20. A method of identifying whether first and second documents are likely to contain duplicate content, the method comprising:
determining whether the first and second documents have corresponding temporal traits that are within a first range of each other;
determining whether the first and second documents have corresponding length traits that are within a second range of each other;
determining whether the first and second documents have a significant number of features in common with each other; and
identifying the first and second documents as duplicates in response to determining that the temporal traits of the first and second documents are within the first range of each other, that the lengths of the first and second documents are within the second range of each other, and that the first and second documents have at least a significant number of features in common with each other.
Dependent claims: 21, 22, 23, 24
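The three-part test of claim 20 can be read as a conjunction of a temporal check, a length check, and a shared-feature check. The following is a minimal sketch under stated assumptions: the `DocTraits` record, the use of a publication date as the temporal trait, word count as the length trait, and the default ranges and feature count are all hypothetical choices that the claim leaves open.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocTraits:
    published: date            # assumed temporal trait
    length: int                # assumed length trait, e.g. a word count
    features: frozenset[str]   # assumed content features, e.g. selected terms


def likely_duplicates(
    a: DocTraits,
    b: DocTraits,
    max_days: int = 3,          # hypothetical "first range"
    max_length_diff: int = 50,  # hypothetical "second range"
    min_shared: int = 5,        # hypothetical "significant number"
) -> bool:
    within_time = abs((a.published - b.published).days) <= max_days
    within_length = abs(a.length - b.length) <= max_length_diff
    enough_shared = len(a.features & b.features) >= min_shared
    # All three conditions must hold before the pair is flagged.
    return within_time and within_length and enough_shared
```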
25. A method comprising:
determining whether first and second documents have at least a threshold number of selected features in common with each other, with each feature selected based on a corresponding inverse-document-frequency (idf) value; and
identifying the first and second documents as duplicates in response to the determination being affirmative.
Dependent claims: 26, 27
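As a rough illustration of claim 25, features can be chosen per document by idf value and the two selections intersected against a threshold. The claim says only that each feature is selected "based on" its idf value; taking the `k` highest-idf terms, computing idf as `log(N / df)`, and the specific `k` and `threshold` defaults are assumptions of this sketch.

```python
import math
from collections import Counter


def idf_table(corpus: list[list[str]]) -> dict[str, float]:
    """idf(term) = log(N / df), where df counts the documents containing the term."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def select_features(doc: list[str], idf: dict[str, float], k: int = 5) -> set[str]:
    """Assumed rule: keep the k highest-idf terms, ties broken alphabetically."""
    ranked = sorted(set(doc), key=lambda term: (-idf.get(term, 0.0), term))
    return set(ranked[:k])


def share_enough_features(
    a: list[str], b: list[str], idf: dict[str, float], k: int = 5, threshold: int = 4
) -> bool:
    shared = select_features(a, idf, k) & select_features(b, idf, k)
    return len(shared) >= threshold
```

Because common words such as "the" have an idf near zero, they rarely survive selection, so the comparison rests on the rarer, more distinctive terms.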
28. A method comprising:
receiving a user query;
identifying at least first and second documents from a database in response to the user query, with the first and second documents associated with respective first and second feature vectors, each feature vector having a plurality of equal-length binary representations of terms or features within its respective document, with the terms or features selected based on relative magnitude of corresponding inverse-document-frequency (idf) values within a table of inverse-document-frequency values; and
in response to identifying the first and second documents, comparing the first and second feature vectors.
Dependent claims: 29, 30, 31
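One possible reading of the feature vectors in claim 28 is a list of fixed-width bit strings, one per selected term. The sketch below assumes a 16-bit width, a SHA-256 hash truncated to that width as the binary representation, selection of the `k` highest-idf terms found in the table, and a simple count of matching entries as the comparison. None of these choices comes from the claim itself.

```python
import hashlib

BITS = 16  # assumed width shared by every binary representation


def term_to_bits(term: str) -> str:
    """Map a term to a fixed-width bit string via a truncated hash (assumption)."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    value = int.from_bytes(digest[: BITS // 8], "big")
    return format(value, f"0{BITS}b")


def feature_vector(doc_terms: list[str], idf: dict[str, float], k: int = 4) -> list[str]:
    """Encode the k highest-idf terms that appear in the idf table."""
    known = [term for term in set(doc_terms) if term in idf]
    top = sorted(known, key=lambda term: (-idf[term], term))[:k]
    return [term_to_bits(term) for term in top]


def matching_entries(v1: list[str], v2: list[str]) -> int:
    """Compare two vectors by counting the representations they share."""
    return len(set(v1) & set(v2))
```

A truncated hash can map distinct terms to the same bit string, so a match count from this sketch is only an approximation of true term overlap.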
32. A graphical user interface comprising:
one or more interactive control features for facilitating user definition of a query; and
at least one interactive control feature for indicating whether search results provided in response to the query are to identify documents determined to have substantial duplicate content.
Dependent claims: 33, 34, 35
36. A graphical user interface comprising:
one or more interactive control features for submitting a query; and
a query results region for displaying search results based on the query, the region including at least one interactive control feature for identifying and invoking display or retrieval of a corresponding search-result document and at least one duplicate-indication feature for indicating whether the search results include any documents that are deemed to be duplicative of the corresponding search-result document.
Dependent claims: 37, 38, 39
Specification