Information retrieval systems with duplicate document detection and presentation functions

US 7,809,695 B2
Filed: 05/05/2005
Issued: 10/05/2010
Est. Priority Date: 08/23/2004
Status: Active Grant

Abstract

Many companies provide online search facilities that enable users to conduct computerized searches for documents. Unfortunately, these searches frequently provide results that include duplicate documents—that is, documents that are completely or substantially identical to each other. This problem is particularly vexing when searching news stories, for example. Moreover, the duplicate documents are intermixed in the search results, leaving users to manually manage the complexities of identifying and/or filtering them. Accordingly, the present inventors devised systems, methods, and software that facilitate the identification and/or grouping of duplicate documents in search results. One exemplary system includes a signature generation module which generates document signatures based on length, temporal, and/or content components; a real-time duplicate detection module which uses the document signatures to identify “exact” or “fuzzy” duplicate documents; and a user-interface or presentation module which controls how duplicate documents are presented or suppressed in search results.

39 Claims

1. An information-retrieval system comprising:
- a plurality of databases; and
  
  one or more servers for facilitating client access to the plurality of databases over a network, with the one or more servers collectively comprising:
  
  signature-generation means for generating a plurality of document signatures, with each document signature based on a plurality of features from and their respective positions in a corresponding document in one or more of the databases, the signature-generation means comprising means for forming a document signature based on one or more of the group consisting of a document hash value and a document feature vector, wherein the hash value is based on features and positions of the features within a document;
  
  query-definition means for defining a query and directing identification of search-result documents that include content duplicative of one or more other search-result documents;
  
  duplicate-determination means for determining, based on a subset of the document signatures, whether one or more documents within results of the query include content duplicative of content in one or more other documents within the results;
  
  means for controlling display of results of the query with at least one of the displayed results indicated as including content duplicative of content in one or more other documents within the results; and
  
  means for controlling output of results of the query to a printer or email transmission device, based on user selected options related to output of documents that include content duplicative of content of one or more other documents within the results.
  - 2. The system of claim 1, wherein the signature-generation means comprises means for determining at least two of a temporal, a length, and a content component for each document.
  - 3. The system of claim 1, wherein each means comprises one or more sets of machine-executable instructions.
  - 4. The system of claim 1, wherein the query-definition means for defining a query provides an option to define the query using Boolean or natural language.
  - 5. The system of claim 1, wherein the signature-generation means for generating a plurality of document signatures comprises:
    - means for determining one or more document length features or values;
      
      means for identifying a set of features from each document, based on their corresponding inverse-document-frequency (idf) values;
      
      means for determining positions of the features in each document;
      
      means for concatenating the set of features and their determined positions to define a string;
      
      means for hashing the string to define a hash value for each document;
      
      means for forming a document signature based on the document length feature or values and the hash value for each document; and
      
      storing a document signature in a memory device.
  - 6. The system of claim 5, wherein the signature-generation means further comprises means for rounding the determined positions of each feature;
    - and concatenating the set of features and their rounded determined positions to define a string.
  - 7. The system of claim 1, wherein the query-definition means for defining a query includes means for selecting an option controlling identification of search-result documents that include content duplicative of one or more other search-result documents.
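
The signature-generation steps recited in claims 5 and 6 (select features by idf, find their positions, round the positions, concatenate, hash, and pair the hash with a length value) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `IDF` table, the `tokenize` and `make_signature` names, the `top_n` and `bucket` parameters, the use of first-occurrence positions, rounding down to a multiple of `bucket`, and SHA-1 as the hash are all hypothetical choices that the claims do not specify.

```python
import hashlib
import re

# Hypothetical idf table; a real system would derive these values from its corpus.
IDF = {"antitrust": 7.4, "merger": 6.1, "regulators": 5.2, "approved": 3.9, "the": 0.1}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def make_signature(text, idf, top_n=3, bucket=5):
    """Build a {length, hash} signature from the top-idf terms and their rounded positions."""
    tokens = tokenize(text)

    # Position of the first occurrence of each distinct term (an assumption).
    first_pos = {}
    for i, tok in enumerate(tokens):
        first_pos.setdefault(tok, i)

    # Keep the top_n terms by idf; ties are broken alphabetically for determinism.
    features = sorted(first_pos, key=lambda t: (-idf.get(t, 0.0), t))[:top_n]

    # Round each position down to a multiple of `bucket`, so small shifts do not change the hash.
    parts = [f"{t}:{first_pos[t] // bucket * bucket}" for t in features]

    # Concatenate features and rounded positions into one string, then hash it.
    string = "|".join(parts)
    digest = hashlib.sha1(string.encode("utf-8")).hexdigest()

    # The signature pairs a length value (token count here) with the hash value.
    return {"length": len(tokens), "hash": digest}


a = make_signature("Regulators approved the merger after antitrust review", IDF)
b = make_signature("Regulators approved the merger following antitrust review today", IDF)
print(a["hash"] == b["hash"])  # True: same top terms, same rounded positions
```

Rounding is what lets two lightly edited copies share a hash: a word swapped or appended leaves the selected terms in the same position buckets, while the differing lengths remain available as a separate component of the signature.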

8. An information-retrieval system comprising:
- a plurality of databases; and
  
  a server for providing users access to one or more of the databases, the server including:
  
  means for defining and processing a query to generate results comprising documents that include content duplicative of content within one or more other documents within the results;
  
  duplicate-determination means for determining whether one or more documents within the results of the query include content duplicative of content within one or more other documents within the results, wherein the duplicate-determination means includes:
  
  means for comparing a first document signature for a first one of the documents within the results to a second document signature for a second one of the documents within the results, with each signature based on a plurality of terms and with corresponding positions within the corresponding document;
  
  means for comparing, respectively, first and second lengths and first and second temporal features of the first and second documents;
  
  means for comparing first and second hash values for the respective first and second documents, with each hash value based on features and positions of the features within its respective document; and
  
  wherein the duplicate-determination means is adapted to determine whether the first and second documents are duplicates in response to the results of the means for comparing hash values;
  
  means for controlling display of results of the query with at least one of the displayed results indicated as including content duplicative of content in one or more other documents within the results; and
  
  means for controlling output of results of the query to a printer or email transmission device, based on user selected options related to output of documents that include content duplicative of content of one or more other documents within the results.
  - 9. The system of claim 8, wherein each signature is based on rounded positions of the plurality of terms.
  - 10. The system of claim 8, wherein each means comprises one or more sets of machine-executable instructions.
  - 11. The system of claim 8, wherein the means for defining and processing a query provides an option to define the query using Boolean or natural language.
  - 12. The system of claim 8, wherein the means for comparing first and second hash values is executed in response to the results of the means for comparing first and second lengths and respective first and second temporal features of respective first and second documents.
  - 13. The method of claim 12, wherein the duplicate-determination means is executed in response to a query adapted for automatic future execution and in response to an automatic execution of the query.
  - 14. The system of claim 8, wherein means for comparing first and second hash values is executed only if the length or temporal comparison is affirmative.
  - 15. The system of claim 8, wherein the means for comparing first and second hash values further comprises:
    - means for identifying a set of features from the respective document, based on their corresponding inverse-document-frequency (idf) values;
      
      means for determining positions of the features in the respective document;
      
      means for rounding the determined positions of each feature;
      
      means for concatenating the set of features and their rounded determined positions to define a string; and
      
      means for hashing the string to define the hash value.
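
Claims 12 and 14 describe a cascade in which the hash comparison runs only after the cheaper length and temporal comparisons. The sketch below shows one possible reading, in which both cheap checks must pass before hashes are compared. The function name `find_exact_duplicates`, the tuple layout of `results`, and the `max_days` and `max_length_gap` parameters are hypothetical; the claims do not fix thresholds for this system.

```python
from datetime import date
from itertools import combinations


def find_exact_duplicates(results, max_days, max_length_gap):
    """Return pairs of doc ids whose signatures match.

    `results` is a list of (doc_id, published, length, hash_value) tuples.
    """
    pairs = []
    for (id_a, pub_a, len_a, hash_a), (id_b, pub_b, len_b, hash_b) in combinations(results, 2):
        # Cheap temporal check first: skip pairs published too far apart.
        if abs((pub_a - pub_b).days) > max_days:
            continue
        # Cheap length check next: skip pairs whose lengths differ too much.
        if abs(len_a - len_b) > max_length_gap:
            continue
        # Only now compare the hash values.
        if hash_a == hash_b:
            pairs.append((id_a, id_b))
    return pairs


results = [
    ("d1", date(2004, 8, 1), 500, "abc"),
    ("d2", date(2004, 8, 3), 510, "abc"),
    ("d3", date(2004, 8, 2), 505, "xyz"),
    ("d4", date(2005, 1, 1), 500, "abc"),
]
print(find_exact_duplicates(results, max_days=30, max_length_gap=50))  # [('d1', 'd2')]
```

Ordering the checks this way means most non-duplicate pairs are rejected on date or length alone, which matters when the comparison runs at query time over a full result list.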

16. A computer-implemented method of identifying whether first and second documents contain duplicate content, the method comprising:
- determining whether the first and second documents have corresponding temporal traits that are within a first range of each other;
  
  determining whether the first and second documents have corresponding length traits that are within a second range of each other;
  
  determining, in response to an affirmative determination that the temporal traits of the first and second documents are within the first range of each other and an affirmative determination that the length traits of the first and second documents are within the second range of each other, whether the first and second documents have a significant number of features in common with each other; and
  
  identifying the first and second documents as duplicates in response to determining that the temporal traits of the first and second documents are within the first range of each other, that the lengths of the first and second documents are within the second range of each other, and that the first and second documents have at least a significant number of features in common with each other.
  - 17. The method of claim 16, wherein the first range is not greater than 30 days;
    - the second range is not greater than 20%; and
      
      the significant number of features is predetermined to be at least 80 percent of a number of representative terms in each document, with the terms selected based on corresponding inverse-document-frequency values.
  - 18. The method of claim 16, wherein the first and second documents have respective first and second document signature data structures, with each document signature data structure including:
    - a temporal component based on a date of publication associated with the document;
      
      a length component based on a word count associated with the document; and
      
      a term vector based on a fixed number of (top-ranked) inverse-document frequency terms for the document.
  - 19. The method of claim 18, wherein the term vector includes at least 10 terms.
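
The method of claims 16 through 19 can be illustrated with the limits stated in claim 17 (30 days, 20%, and 80% of representative terms). This is a sketch under assumptions rather than the claimed method itself: the `DocSignature` class and `is_fuzzy_duplicate` function are hypothetical names, the length difference is measured relative to the longer document, and term overlap is computed on sets, none of which the claims prescribe.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocSignature:
    published: date    # temporal component: date of publication
    word_count: int    # length component: word count
    terms: tuple       # term vector: top-ranked idf terms


def is_fuzzy_duplicate(a, b, max_days=30, max_len_diff=0.20, min_overlap=0.80):
    """Apply the temporal and length checks, then the term-overlap check."""
    # First range: publication dates within max_days of each other.
    if abs((a.published - b.published).days) > max_days:
        return False

    # Second range: length difference within max_len_diff of the longer document.
    longer = max(a.word_count, b.word_count)
    if longer == 0 or abs(a.word_count - b.word_count) / longer > max_len_diff:
        return False

    # Significant number of features: the shared terms must cover at least
    # min_overlap of the representative terms in each document.
    terms_a, terms_b = set(a.terms), set(b.terms)
    shared = len(terms_a & terms_b)
    return shared >= min_overlap * len(terms_a) and shared >= min_overlap * len(terms_b)


base = tuple(f"term{i}" for i in range(10))
original = DocSignature(date(2004, 8, 1), 500, base)
rewrite = DocSignature(date(2004, 8, 10), 450, base[:8] + ("extra1", "extra2"))
print(is_fuzzy_duplicate(original, rewrite))  # True: 9 days apart, 10% shorter, 8 of 10 terms shared
```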

20. A method comprising:
- determining whether respective length traits of first and second documents are within a range of each other and whether the first and second documents have at least a threshold number of selected features in common with each other, with each feature selected based on a corresponding inverse-document-frequency (idf) value, wherein determining whether the first and second documents have at least the threshold number of features in common with each other, comprises:
  
  defining respective first and second term vectors for the first and second documents, with each term vector including a plurality of equal-length binary representations of idf terms for its respective document; and
  
  comparing a number of the binary representations of the first term vector to a number of binary representations of the second term vector; and
  
  identifying the first and second documents as duplicates in response to the determination being affirmative.
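
Claim 20 recites term vectors made of equal-length binary representations of idf terms. One way to picture this, offered only as an assumption, is to map each selected term to a fixed-width digest and count how many digests two vectors share. The names `term_vector`, `shared_count`, and `is_duplicate`, the 8-byte BLAKE2b digest, and the absolute `length_range` and `threshold` parameters are hypothetical; the claim does not name an encoding or specific values.

```python
import hashlib


def term_vector(terms):
    """Map each term to an 8-byte digest so every representation has equal length."""
    return frozenset(
        hashlib.blake2b(t.encode("utf-8"), digest_size=8).digest() for t in terms
    )


def shared_count(vec_a, vec_b):
    """Count the binary representations present in both vectors."""
    return len(vec_a & vec_b)


def is_duplicate(len_a, terms_a, len_b, terms_b, length_range, threshold):
    """Length check first, then the count of shared fixed-width representations."""
    if abs(len_a - len_b) > length_range:
        return False
    return shared_count(term_vector(terms_a), term_vector(terms_b)) >= threshold


a_terms = ["antitrust", "merger", "regulators", "approval", "shareholders"]
b_terms = ["antitrust", "merger", "regulators", "approval", "lawsuit"]
print(is_duplicate(500, a_terms, 520, b_terms, length_range=50, threshold=4))  # True
```

Fixed-width representations keep every vector entry the same size regardless of term length, so vectors can be stored and compared as flat byte arrays. A digest this short can in principle collide, which a production system would need to account for.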

21. A computer-implemented method comprising:
- receiving a user query;
  
  identifying at least first and second documents from a database in response to the user query, with the first and second documents associated with respective first and second feature vectors, each feature vector having a plurality of equal-length binary representations of terms or features within its respective document, with the terms or features selected based on relative magnitude of corresponding inverse-document-frequency (idf) values within a table of inverse-document-frequency values;
  
  determining whether temporal traits of the first and second documents are within a first range of each other;
  
  determining whether length traits of the first and second documents are within a second range of each other; and
  
  in response to identifying the first and second documents and at least one determination being affirmative, comparing the first and second feature vectors.
  - 22. The method of claim 21, further comprising:
    - presenting search results identifying at least one of the first and second documents in response to the user query, with the presented listing including an indication, based on the comparison of the first and second feature vectors, of whether the first and second documents contain content that is duplicative of each other.
  - 23. The method of claim 22, wherein comparing the first and second feature vectors occurs in response to an affirmative determination that temporal traits of the first and second documents are within a first range of each other and an affirmative determination that length traits of the first and second documents are within a second range of each other.
  - 24. The method of claim 21, wherein presenting search results identifying at least one of the first and second documents comprises presenting a listing of a title of the first document and wherein the indication of whether the first and second documents contain content that is duplicative of each other, includes:
    - presenting a listing of a title of the second document that is below and indented relative to the title of the first document;
      
      or presenting a listing of a title of the second document in a font that differs from that of the title of the first document;
      
      or presenting a folder or other container icon which is selectable to display a listing of one or more documents that identifies the second document.
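
The first presentation option in claim 24, listing the second document's title below and indented relative to the first, can be mocked up in plain text. The `render_results` function, its two dictionary arguments, and the "(duplicate)" suffix are hypothetical and stand in for whatever interface a real system would use.

```python
def render_results(titles, duplicate_of):
    """Render a result listing with duplicates indented under their primary document.

    `titles` maps doc_id -> title in rank order.
    `duplicate_of` maps a duplicate's doc_id -> the doc_id it duplicates.
    """
    lines = []
    for doc_id, title in titles.items():
        if doc_id in duplicate_of:
            continue  # duplicates are shown under their primary, not at top level
        lines.append(title)
        for dup_id, primary_id in duplicate_of.items():
            if primary_id == doc_id:
                lines.append("    " + titles[dup_id] + " (duplicate)")
    return "\n".join(lines)


titles = {"d1": "Merger approved", "d2": "Merger is approved", "d3": "Rates rise"}
print(render_results(titles, {"d2": "d1"}))
```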

25. A computer-based system for identifying whether first and second documents contain duplicate content, the system comprising a processor, a memory, a user interface, and code executable by the processor, the system further comprising:
- means for receiving a user query;
  
  means for identifying at least first and second documents from a database in response to the user query, with the first and second documents associated with respective first and second feature vectors, each feature vector having a plurality of equal-length binary representations of terms or features within its respective document, with the terms or features selected based on relative magnitude of corresponding inverse-document-frequency (idf) values within a table of inverse-document-frequency values;
  
  first determining means for determining whether temporal traits of the first and second documents are within a first range of each other;
  
  second determining means for determining whether length traits of the first and second documents are within a second range of each other; and
  
  means for comparing the first and second feature vectors in response to having identified the first and second documents and at least one determination being affirmative.
  - 26. The computer-based system of claim 25 further comprising a graphical user interface comprising:
    - a query-definition region on a display device, the region including one or more interactive control features for facilitating user definition of a query; and
      
      at least one interactive control feature, within the query-definition region, for enabling a user to select whether search results provided in response to the query are to identify documents determined to have substantial duplicate content.
  - 27. The computer-based system of claim 26, wherein the one interactive control feature comprises a check box.
  - 28. The computer-based system of claim 26, further comprising at least one interactive control feature for controlling output of documents determined to have substantial duplicate content with other documents in the search results.
  - 29. The computer-based system of claim 26, further comprising a query-results region including:
    - one or more interactive control features for invoking display of corresponding search-result documents; and
      
      at least one interactive control feature for identifying and invoking display of a search-result document deemed to be a duplicate of at least another of the search-result documents.
  - 30. The computer-based system of claim 25 further comprising a graphical user interface displayed on a display device, the interface comprising:
    - one or more interactive control features for submitting a query; and
      
      a query results region displaying search results on the display device based on the query, the region including:
      
      at least one interactive control feature for identifying and invoking display or retrieval of a corresponding search-result document;
      
      at least one duplicate-indication feature for indicating whether the search results include any documents that are deemed to be duplicative of the corresponding search-result document; and
      
      a duplicate count indicator for indicating a number of documents that are deemed to be duplicative.
  - 31. The computer-based system of claim 30, wherein the one duplicate-indication feature comprises a selectable link that is positioned below and indented relative to the one interactive control feature for the corresponding search-result document.
  - 32. The computer-based system of claim 30, further comprising at least one interactive control feature for defining a default setting of whether search results presented in response to a query identify documents determined to have substantial duplicate content.
  - 33. The computer-based system of claim 32, wherein the one interactive control feature comprises a check box.
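
The duplicate count indicator of claim 30 and the check-box setting of claims 26, 27, 32, and 33 can be modeled as data feeding a results view. In this sketch, which is an assumption and not the claimed interface, `result_rows` is a hypothetical function and `identify_duplicates` stands in for the state of the check box.

```python
from collections import Counter


def result_rows(ranked_ids, duplicate_of, identify_duplicates=True):
    """Return (doc_id, duplicate_count) rows for a results view.

    `duplicate_of` maps a duplicate's doc_id -> the doc_id it duplicates.
    When `identify_duplicates` is False, every document is listed with a count of 0.
    """
    if not identify_duplicates:
        return [(doc_id, 0) for doc_id in ranked_ids]
    counts = Counter(duplicate_of.values())
    # List only primary documents, each with the number of duplicates grouped under it.
    return [(doc_id, counts[doc_id]) for doc_id in ranked_ids if doc_id not in duplicate_of]


rows = result_rows(["a", "b", "c", "d"], {"b": "a", "d": "a"})
print(rows)  # [('a', 2), ('c', 0)]
```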

34. An information-retrieval system comprising:
- a plurality of databases; and
  
  a server for providing users access to one or more of the databases, the server including:
  
  means for defining and processing a query to generate results comprising documents that include content duplicative of content within one or more other documents within the results;
  
  duplicate-determination means for determining whether one or more documents within the results of the query include content duplicative of content within one or more other documents within the results, wherein the duplicate-determination means includes:
  
  first means for comparing a first document signature for a first one of the documents within the results to a second document signature for a second one of the documents within the results, with each signature based on a plurality of terms and their corresponding positions within its corresponding document;
  
  second means for comparing, respectively, first and second lengths and first and second temporal features of the first and second documents; and
  
  third means for comparing a set of features common to the first and second documents, the set of features comprising features selected based on a corresponding inverse-document-frequency (idf) value, wherein the first and second documents have at least a threshold number of features in common with each other;
  
  wherein the duplicate-determination means is adapted to determine whether the first and second documents are duplicates in response to the results of the third comparing means;
  
  means for controlling display of results of the query with at least one of the displayed results indicated as including content duplicative of content in one or more other documents within the results; and
  
  means for controlling output of results of the query to a printer or email transmission device, based on user selected options related to output of documents that include content duplicative of content of one or more other documents within the results.
  - 35. The system of claim 34, wherein the means for comparing a set of features is executed in response to the results of the means for comparing first and second lengths and respective first and second temporal features of respective first and second documents.

36. A method comprising:
- determining whether temporal values respectively associated with first and second documents are within a range of each other and whether the first and second documents have at least a threshold number of selected features in common with each other, with each feature selected based on a corresponding inverse-document-frequency (idf) value, wherein determining whether the first and second documents have at least the threshold number of features in common with each other, comprises:
  
  defining respective first and second term vectors for the first and second documents, with each term vector including a plurality of equal-length binary representations of idf terms for its respective document; and
  
  comparing a number of the binary representations of the first term vector to a number of binary representations of the second term vector; and
  
  identifying the first and second documents as duplicates in response to the determination being affirmative.

37. A computer-based system for identifying whether first and second documents contain duplicate content, the system comprising a processor, a memory, a user interface, and code executable by the processor, the system further comprising:
- first determining means for determining whether the first and second documents have corresponding temporal traits that are within a first range of each other;
  
  second determining means for determining whether the first and second documents have corresponding length traits that are within a second range of each other;
  
  third determining means for determining, in response to an affirmative determination that the temporal traits of the first and second documents are within the first range of each other and an affirmative determination that the length traits of the first and second documents are within the second range of each other, whether the first and second documents have a significant number of features in common with each other; and
  
  identifying means for identifying the first and second documents as duplicates in response to determining that the temporal traits of the first and second documents are within the first range of each other, that the lengths of the first and second documents are within the second range of each other, and that the first and second documents have at least a significant number of features in common with each other.

38. A computer-based system for identifying whether first and second documents contain duplicate content, the system comprising a processor, a memory, a user interface, and code executable by the processor, the system further comprising:
- means for determining whether respective length traits of first and second documents are within a range of each other and whether the first and second documents have at least a threshold number of selected features in common with each other, with each feature selected based on a corresponding inverse-document-frequency (idf) value;
  
  means for defining respective first and second term vectors for the first and second documents, with each term vector including a plurality of equal-length binary representations of idf terms for its respective document;
  
  means for comparing a number of the binary representations of the first term vector to a number of binary representations of the second term vector; and
  
  means for identifying the first and second documents as duplicates in response to the determination being affirmative.

39. A computer-based system for identifying whether first and second documents contain duplicate content, the system comprising a processor, a memory, a user interface, and code executable by the processor, the system further comprising:
- means for determining whether temporal values respectively associated with first and second documents are within a range of each other and whether the first and second documents have at least a threshold number of selected features in common with each other, with each feature selected based on a corresponding inverse-document-frequency (idf) value;
  
  means for defining respective first and second term vectors for the first and second documents, with each term vector including a plurality of equal-length binary representations of idf terms for its respective document;
  
  means for comparing a number of the binary representations of the first term vector to a number of binary representations of the second term vector; and
  
  means for identifying the first and second documents as duplicates in response to the determination being affirmative.

Current Assignee: Thomson Reuters Enterprise Centre GmbH (The Woodbridge Co. Ltd.)
Original Assignee: Thomson Reuters Global Resources
Inventors: Lin, Jie; Claussen, Joanne R. S.; Conrad, Jack G.
Primary Examiner(s): Vital, Pierre M
Assistant Examiner(s): Obisesan, Augustine Kunle
Application Number: US11/122,577
Publication Number: US 20060041597A1
Time in Patent Office: 1,979 Days
Field of Search: 707/200, 707/300, 707/100
US Class Current: 707/692
CPC Class Codes:
G06F 16/30 of unstructured textual dat...
G06F 16/951 Indexing; Web crawling tech...
