Information retrieval systems with duplicate document detection and presentation functions

US 20060041597A1
Filed: 05/05/2005
Published: 02/23/2006
Est. Priority Date: 08/23/2004
Status: Active Grant

Abstract

Many companies provide online search facilities that enable users to conduct computerized searches for documents. Unfortunately, these searches frequently provide results that include duplicate documents—that is, documents that are completely or substantially identical to each other. This problem is particularly vexing when searching news stories, for example. Moreover, the duplicate documents are intermixed in the search results, leaving users to manually manage the complexities of identifying and/or filtering them. Accordingly, the present inventors devised systems, methods, and software that facilitate the identification and/or grouping of duplicate documents in search results. One exemplary system includes a signature generation module which generates document signatures based on length, temporal, and/or content components; a real-time duplicate detection module which uses the document signatures to identify “exact” or “fuzzy” duplicate documents; and a user-interface or presentation module which controls how duplicate documents are presented or suppressed in search results.

39 Claims

1. An information-retrieval system comprising:
- a plurality of databases; and
  
  one or more servers for facilitating client access to the plurality of databases over a network, with each of the servers including at least one of;
  
  signature-generation means for generating a plurality of document signatures, with each document signature based on a plurality of features from a corresponding document in one or more of the databases;
  
  query-definition means for defining a query and selecting an option related to identification of search-result documents that include content duplicative of one or more other search-result documents;
  
  duplicate-determination means for determining, based on a subset of the document signatures, whether one or more documents within results of the query include content duplicative of content in one or more other documents within the results;
  
  means for controlling display of results of the query based on the selected option, with at least one of the displayed results indicated as including content duplicative of content in one or more other documents within the results; and
  
  means for controlling output of results of the query to a printer or email transmission device, based on user selected options related to output of documents that include content duplicative of content of one or more other documents within the results.
- Dependent claims:
  - 2. The system of claim 1, wherein the signature-generation means comprises means for determining at least two of a temporal, a length, and a content component for each document.
  - 3. The system of claim 1, wherein each means comprises one or more sets of machine-executable instructions.
  - 4. The system of claim 1, wherein the query-definition means for defining a query provides an option to define the query using Boolean or natural language.
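
Claim 1 has the duplicate determination work from "a subset of the document signatures", which can be read as: signatures are generated ahead of time for every document, and at query time only the signatures of the documents in the result set are compared. The following is a minimal sketch of that reading. The names `signature_store`, `find_duplicate_pairs`, and `same_content` are hypothetical, and plain string equality stands in for whatever signature comparison a real system would use.

```python
from itertools import combinations
from operator import eq


def find_duplicate_pairs(result_ids, signature_store, same_content=eq):
    """Compare only the signatures of documents returned by the query.

    signature_store maps document id -> precomputed signature (assumed layout).
    same_content is a pluggable comparison; equality is only a placeholder.
    """
    # Pull the subset of signatures that belong to this result set.
    subset = {doc_id: signature_store[doc_id]
              for doc_id in result_ids if doc_id in signature_store}
    # Pairwise comparison, in result order.
    return [(a, b) for a, b in combinations(subset, 2)
            if same_content(subset[a], subset[b])]


# Hypothetical store: d1, d3 and d4 share a signature, but d4 is not in the results.
store = {"d1": "sig-a", "d2": "sig-b", "d3": "sig-a", "d4": "sig-a"}
pairs = find_duplicate_pairs(["d1", "d2", "d3"], store)
```

Pairwise comparison is quadratic in the number of results, which is acceptable for a single result page but would need bucketing for large result sets.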

5. An information-retrieval system comprising:
- a plurality of databases; and
  
  a server for providing users access to one or more of the databases, the server including;
  
  query-definition means for defining a query and selecting an option related to identification of documents within results of the query that include content duplicative of content within one or more other documents within the results;
  
  duplicate-determination means for determining whether one or more documents within the results of the query include content duplicative of content within one or more other documents within the results;
  
  means for controlling display of results of the query based on the selected option, with at least one of the displayed results indicated as including content duplicative of content in one or more other documents within the results; and
  
  means for controlling output of results of the query to a printer or email transmission device, based on user selected options related to output of documents that include content duplicative of content of one or more other documents within the results.
- Dependent claims:
  - 6. The system of claim 5, wherein the duplicate-determination means comprises means for comparing on a real-time basis a temporal, a length, and a content component for at least one document within the results of the query to a temporal, length, and a content component for at least one other document with the results.
  - 7. The system of claim 5, wherein each means comprises one or more sets of machine-executable instructions.
  - 8. The system of claim 5, wherein the query-definition means for defining a query provides an option to define the query using Boolean or natural language.
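
The last element of claim 5 ties printer or email output to a user-selected option about duplicates. One way to picture this, purely as an illustration, is a function that expands or collapses duplicate groups before the documents are handed to an output device. The `DuplicateOutput` enum, the `(lead, duplicates)` group layout, and the function name are assumptions made for this sketch, not terms from the patent.

```python
from enum import Enum


class DuplicateOutput(Enum):
    INCLUDE = "include"  # send every document, duplicates included
    OMIT = "omit"        # send one representative per duplicate group


def documents_for_output(groups, option):
    """Return document ids to print or email, in result order.

    groups is a list of (lead_id, [duplicate_ids]) tuples (assumed layout).
    """
    selected = []
    for lead, duplicates in groups:
        selected.append(lead)
        if option is DuplicateOutput.INCLUDE:
            selected.extend(duplicates)
    return selected


# Hypothetical result set: d3 was judged a duplicate of d1.
groups = [("d1", ["d3"]), ("d2", [])]
to_print = documents_for_output(groups, DuplicateOutput.OMIT)
```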

9. A method comprising:
- comparing first and second lengths of respective first and second documents;
  
  comparing first and second content sets for the respective first and second documents; and
  
  determining whether the first and second documents are duplicates based on results of comparing the first and second lengths or results of comparing the first and second content sets.
- Dependent claims:
  - 10. The method of claim 9, wherein determining whether the first and second documents are duplicates is based on results of comparing the first and second lengths and results of comparing the first and second content sets.
  - 11. The method of claim 9, further comprising retrieving first and second document signatures corresponding to the respective first and second documents, with the signatures having respective first and second length components and respective first and second content components.
  - 12. The method of claim 9, further comprising comparing first and second document signatures to each other, with the first and second document signatures associated respectively with the first and second documents, wherein the determination of whether the first and second documents are duplicates is contingent on comparison of the first and second document signatures.
  - 13. The method of claim 12, wherein each document signature comprises a hash value and wherein the comparison of the first and second signature is affirmative if and only if the first and second hash values are equal.
  - 14. The method of claim 13, further comprising:
    - identifying a set of features from each document, based on their corresponding inverse-document-frequency (idf) values;
      
      determining positions of the features in each document;
      
      rounding the determined positions of each feature;
      
      concatenating the set of features and their rounded determined positions to define a string; and
      
      hashing the string to define the hash value.
  - 15. The method of claim 12, wherein each document signature comprises a feature vector and wherein the comparison of the first and second signatures is affirmative if and only if the feature vector of the first signature has at least a threshold number of features in common with the feature vector of the second signature.
  - 16. The method of claim 15, wherein the threshold number is at least as great as 80 percent of the number of features in each term vector.
  - 17. The method of claim 15, wherein each feature vector comprises a set of two or more top inverse-document-frequency terms from its corresponding document.
  - 18. The method of claim 15, wherein the threshold number is variable.
  - 19. The method of claim 9, wherein the first and second documents are identified in response to a user defining a query for automatic future execution and in response to an automatic execution of the user defined query.
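
Claims 13 and 14 describe an exact-match signature: pick features by idf value, record their positions, round the positions, concatenate features and rounded positions into a string, and hash it. The sketch below is one possible reading and not the patented implementation. The tokenizer, the tiny idf table built from a sample corpus, the use of first-occurrence positions, rounding down to a multiple of `bucket`, the `|` separator, and SHA-1 are all assumptions chosen to keep the example short.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens (an assumed, deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(corpus):
    """Map each term to log(N / document frequency) over a small corpus."""
    n = len(corpus)
    df = Counter(term for doc in corpus for term in set(tokenize(doc)))
    return {term: math.log(n / count) for term, count in df.items()}


def exact_signature(text, idf, num_features=3, bucket=10):
    """Hash the top-idf terms together with their rounded first positions."""
    first_pos = {}
    for i, tok in enumerate(tokenize(text)):
        first_pos.setdefault(tok, i)
    # Highest idf first; ties broken alphabetically so the result is stable.
    ranked = sorted(first_pos, key=lambda t: (-idf.get(t, 0.0), t))
    features = sorted(ranked[:num_features])
    # Rounding down to a bucket absorbs small shifts in position.
    parts = [f"{t}:{(first_pos[t] // bucket) * bucket}" for t in features]
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()
```

Under claim 13, two documents match only when their hash values are equal, so any change to a selected feature or to its rounded position produces a non-match. That strictness is what separates this "exact" test from the feature-vector comparison of claim 15.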

20. A method of identifying whether first and second documents are likely to contain duplicate content, the method comprising:
- determining whether the first and second documents have corresponding temporal traits that are within a first range of each other;
  
  determining whether the first and second documents have corresponding length traits that are within a second range of each other;
  
  determining whether the first and second documents have a significant number of features in common with each other; and
  
  identifying the first and second documents as duplicates in response to determining that the temporal traits of the first and second documents are within the first range of each other, that the lengths of the first and second documents are within the second range of each other, and that the first and second documents have at least a significant number of features in common with each other.
- Dependent claims:
  - 21. The method of claim 20, wherein determining whether the first and second documents have a significant number of features in common with each other occurs in response to an affirmative determination that the temporal traits of the first and second documents are within the first range of each other and an affirmative determination that the length traits of the first and second documents are within the second range of each other.
  - 22. The method of claim 20, wherein the first range is predetermined to be 30 days;
    - the second range is predetermined to be ±20%; and

      the significant number of features is predetermined to be at least 80 percent of a number of representative terms in each document, with the terms selected based on corresponding inverse-document-frequency values.
  - 23. The method of claim 20, wherein the first and second documents have respective first and second document signature data structures, with each document signature data structure including:
    - a temporal component based on a date of publication associated with the document;
      
      a length component based on a word count associated with the document; and
      
      a term vector based on a fixed number of (top-ranked) inverse-document frequency terms for the document.
  - 24. The method of claim 23, wherein the term vector includes at least 10 terms.
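
Claims 20 through 24 can be read as a three-stage test over a small signature record: publication dates within 30 days, word counts within ±20%, and at least 80 percent of the top-idf terms in common. The sketch below follows that reading and runs the cheap date and length checks before the term comparison, as claim 21 suggests. The class and field names are hypothetical, and two details are assumptions because the claims leave them open: the length tolerance is measured against the longer document, and the overlap is measured against the larger term vector.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Signature:
    published: date       # temporal component (date of publication)
    word_count: int       # length component
    terms: frozenset      # term vector of top-ranked idf terms


def is_fuzzy_duplicate(a, b, max_days=30, length_pct=20, overlap_pct=80):
    """Return True when all three tests of claim 20 pass."""
    # Stage 1: temporal traits within the first range.
    if abs((a.published - b.published).days) > max_days:
        return False
    # Stage 2: length traits within the second range (relative to the longer doc).
    longer = max(a.word_count, b.word_count)
    if abs(a.word_count - b.word_count) * 100 > length_pct * longer:
        return False
    # Stage 3: enough shared terms; empty vectors never match (an assumption).
    if not a.terms or not b.terms:
        return False
    shared = len(a.terms & b.terms)
    return shared * 100 >= overlap_pct * max(len(a.terms), len(b.terms))
```

Integer percentages are used so the threshold comparisons avoid floating-point rounding at the boundary, for example exactly 8 shared terms out of 10.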

25. A method comprising:
- determining whether first and second documents have at least a threshold number of selected features in common with each other, with each feature selected based on a corresponding inverse-document-frequency (idf) value; and
  
  identifying the first and second documents as duplicates in response to the determination being affirmative.
- Dependent claims:
  - 26. The method of claim 25, wherein determining whether the first and second documents have at least a significant number of features in common with each other, comprises:
    - defining respective first and second term vectors for the first and second documents, with each term vector including a plurality of equal-length binary representations of idf terms for its respective document; and
      
      comparing a number of the binary representations of the first term vector to a number of binary representations of the second term vector.
  - 27. The method of claim 25, further comprising determining whether the first and second documents are sufficiently similar in length, wherein the first and second documents are identified as duplicates in response to determining that the first and second documents have at least the threshold number of features in common with each other and determining that the first and second documents are sufficiently similar in length.
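
Claims 25 and 26 compare documents through term vectors whose entries are equal-length binary representations of idf-selected terms. The claims do not say how those representations are produced, so the sketch below assumes a truncated MD5 digest as a stand-in fixed-width code. The function names, the 4-byte width, and the hand-written idf values are all hypothetical.

```python
import hashlib


def term_code(term, width=4):
    """Fixed-width binary code for a term: first `width` bytes of its MD5 digest."""
    return hashlib.md5(term.encode("utf-8")).digest()[:width]


def term_vector(tokens, idf, size=10):
    """Codes for the `size` highest-idf distinct terms in a document."""
    top = sorted(set(tokens), key=lambda t: (-idf.get(t, 0.0), t))[:size]
    return frozenset(term_code(t) for t in top)


def shares_threshold(vec_a, vec_b, threshold):
    """True when the vectors have at least `threshold` codes in common."""
    return len(vec_a & vec_b) >= threshold


# Hypothetical idf values: rare names score high, function words score low.
idf = {"the": 0.1, "of": 0.1, "and": 0.1, "was": 0.2, "approved": 2.5,
       "rejected": 2.6, "merger": 3.0, "acme": 5.0, "globex": 5.5}
doc_a = "the merger of acme and globex was approved".split()
doc_b = "the merger of acme and globex was rejected".split()
vec_a = term_vector(doc_a, idf, size=3)
vec_b = term_vector(doc_b, idf, size=3)
```

Because only the top-idf terms enter the vector, the two sample documents produce the same three codes even though they differ in one lower-ranked word. Equal-width codes keep the comparison to fixed-size byte matching rather than variable-length string matching.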

28. A method comprising:
- receiving a user query;
  
  identifying at least first and second documents from a database in response to the user query, with the first and second documents associated with respective first and second feature vectors, each feature vector having a plurality of equal-length binary representations of terms or features within its respective document, with the terms or features selected based on relative magnitude of corresponding inverse-document-frequency (idf) values within a table of inverse-document-frequency values; and
  
  in response to identifying the first and second documents, comparing the first and second feature vectors.
- Dependent claims:
  - 29. The method of claim 28, further comprising:
    - presenting search results identifying at least one of the first and second documents in response to the user query, with the presented listing including an indication, based on the comparison of the first and second feature vectors, of whether the first and second documents contain content that is duplicative of each other.
  - 30. The method of claim 28, wherein presenting search results identifying at least one of the first and second documents comprises presenting a listing of a title of the first document and wherein the indication of whether the first and second documents contain content that is duplicative of each other, includes:
    - presenting a listing of a title of the second document that is below and indented relative to the title of the first document;
      
      or presenting a listing of a title of the second document in a font that differs from that of the title of the first document;
      
      or presenting a folder or other container icon which is selectable to display a listing of one or more documents that identifies the second document.
  - 31. The method of claim 29, wherein comparing the first and second feature vectors occurs in response to an affirmative determination that temporal traits of the first and second documents are within a first range of each other and an affirmative determination that length traits of the first and second documents are within a second range of each other.
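
Claims 29 and 30 describe presenting a duplicate's title below and indented relative to the title of the document it duplicates. A minimal sketch of that presentation step follows, assuming results arrive as dictionaries with a `title` key and that some pairwise `is_duplicate` test is supplied by the caller. The greedy grouping strategy, the `Duplicate:` label, and the four-space indent are illustrative choices, not details from the claims.

```python
def group_duplicates(results, is_duplicate):
    """Greedy grouping: each result joins the first earlier lead it duplicates."""
    groups = []
    for doc in results:
        for lead, dups in groups:
            if is_duplicate(lead, doc):
                dups.append(doc)
                break
        else:
            # No existing lead matched, so this document starts a new group.
            groups.append((doc, []))
    return groups


def render_listing(groups, show_duplicates=True):
    """Numbered titles, with duplicates indented beneath their lead document."""
    lines = []
    for rank, (lead, dups) in enumerate(groups, start=1):
        lines.append(f"{rank}. {lead['title']}")
        if show_duplicates:
            lines.extend(f"    Duplicate: {d['title']}" for d in dups)
        elif dups:
            lines.append(f"    [{len(dups)} duplicate(s) hidden]")
    return lines


# Hypothetical results; equal term sets stand in for a real duplicate test.
results = [
    {"title": "Storm floods coastal town", "terms": {"storm", "floods", "coastal"}},
    {"title": "Council approves budget", "terms": {"council", "budget", "road"}},
    {"title": "Coastal town flooded by storm", "terms": {"storm", "floods", "coastal"}},
]
groups = group_duplicates(results, lambda a, b: a["terms"] == b["terms"])
listing = render_listing(groups)
```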

32. A graphical user interface comprising:
- one or more interactive control features for facilitating user definition of a query; and
  
  at least one interactive control feature for indicating whether search results provided in response to the query are to identify documents determined to have substantial duplicate content.
- Dependent claims:
  - 33. The graphical user interface of claim 32, wherein the one interactive control feature comprises a check box.
  - 34. The graphical user interface of claim 32, further comprising at least one interactive control feature for controlling output of documents determined to have substantial duplicate content with other documents in the search results.
  - 35. The graphical user interface of claim 32, further comprising a query-results region including:
    - one or more interactive control features for invoking display of corresponding search-result documents; and
      
      at least one interactive control feature for identifying and invoking display of a search-result document deemed to be a duplicate of at least another of the search-result documents.

36. A graphical user interface comprising:
- one or more interactive control features for submitting a query; and
  
  a query results region for displaying search results based on the query, the region including at least one interactive control feature for identifying and invoking display or retrieval of a corresponding search-result document and at least one duplicate-indication feature for indicating whether the search results include any documents that are deemed to be duplicative of the corresponding search-result document.
- Dependent claims:
  - 37. The graphical user interface of claim 36, wherein the one duplicate-indication feature comprises a selectable link that is positioned below and indented relative to the one interactive control feature for the corresponding search-result document.
  - 38. The graphical interface of claim 36, further comprising at least one interactive control feature for defining a default setting of whether search results presented in response to a query are to identify documents determined to have substantial duplicate content.
  - 39. The graphical user interface of claim 38, wherein the one interactive control feature comprises a check box.

Current Assignee: Thomson Reuters Enterprise Centre GmbH (The Woodbridge Co. Ltd.)
Original Assignee: West Services
Inventors: Lin, Jie; Conrad, Jack G.; Claussen, Joanne R.S.
Granted Patent: US 7,809,695 B2
US Class Current: 1/1
CPC Class Codes:
G06F 16/30 of unstructured textual dat...
G06F 16/951 Indexing; Web crawling tech...
