Phrase-based detection of duplicate documents in an information retrieval system

US 7,711,679 B2
Filed: 07/26/2004
Issued: 05/04/2010
Est. Priority Date: 07/26/2004
Status: Active Grant



Abstract

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results and from the index.

214 Citations

14 Claims

1. A method of detecting a duplicate document, the method comprising:
- selecting a first document and a second document from a set of documents;
  
  comparing, by operation of a processor adapted to manipulate data within a computer system, a document description of the first document with a document description of the second document, wherein the document description of the first document comprises a selected subset of sentences of the first document, the sentences being selected and ordered in the document description as a function of a number of related phrases in the selected sentences, wherein the document description of the second document comprises a selected subset of sentences of the second document, the sentences being selected and ordered in the document description as a function of a number of related phrases in the selected sentences, wherein a phrase g_j is a related phrase of another phrase g_k occurring in the set of documents when an information gain of g_j with respect to g_k exceeds a predetermined threshold, the information gain of g_j with respect to g_k being a function of both actual and expected co-occurrence rates of g_j and g_k in the set of documents; and
  
  responsive to the document description of the first document matching the document description of the second document, identifying the first document and the second document as duplicate documents in the set of documents.
  - 2. The method of claim 1, further comprising:
    - receiving a query comprising at least one phrase;
      
      retrieving a plurality of documents responsive to the query to form the set of documents as a search result set including the first document and the second document; and
      
      discarding at least one of the first document or the second document from the search result set based on the identification of the first and second documents as duplicate documents.
  - 3. The method of claim 2, wherein the document discarded has a lower document significance measure than the other document.
  - 4. The method of claim 3, wherein the document significance measure includes a page rank of the document.
  - 5. The method of claim 3, wherein discarding at least one of the first document or the second document from the set of documents comprises removing the first document or the second document from an index.
  - 6. The method of claim 1, wherein comparing the document description of the first document with a document description of the second document further comprises:
    - for the first document, generating the document description of the first document by selecting sentences of the document, and ordering the selected sentences in the document description as a function of a number of related phrases in the selected sentences; and
      
      for the second document, generating the document description of the second document by selecting sentences of the document, and ordering the selected sentences in the document description as a function of a number of related phrases in the selected sentences.
  - 7. The method of claim 1, wherein comparing the document description of the first document with a document description of the second document further comprises:
    - retrieving a stored document description of the first document comprising selected sentences of the document, wherein the selected sentences are ordered in the document description as a function of a number of related phrases in the selected sentences; and
      
      retrieving a stored document description of the second document comprising selected sentences of the document, wherein the selected sentences are ordered in the document description as a function of a number of related phrases in the selected sentences.
  - 8. The method of claim 1, wherein comparing the document description of the first document with a document description of the second document further comprises:
    - generating for the first document a document description by selecting sentences of the first document, and ordering in the document description as a function of a number of related phrases in each sentence; and
      
      retrieving for the second document a stored document description comprising selected sentences of the second document, wherein the selected sentences are ordered in the document description as a function of a number of related phrases in each sentence.
  - 9. The method of claim 1, wherein selecting a first document and a second document from a set of documents further comprises:
    - selecting the first document and second document during indexing of the first document.
  - 10. The method of claim 1, further comprising storing a document description in association with the document to which the document description corresponds.
  - 11. The method of claim 10, wherein the association is accomplished by concatenating the sentences of the document description, computing a hash value of the concatenated sentences, and storing an identifier of the document in an associative data structure at a data structure location corresponding to the computed hash value.
  - 12. The method of claim 1, wherein the document description of the first document matches the document description of the second document when a hash value of the first document description equals a hash value of the second document description.
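
As a rough illustration of claims 1, 6, 11, and 12, the sketch below builds a document description by counting the related phrases in each sentence, keeping the highest-scoring sentences in score order, concatenating them, and hashing the result. Documents whose hashes land in the same slot of an associative structure are reported as duplicates. The naive sentence splitter, the substring phrase matching, the `top_n` cutoff, the choice of SHA-256, and all function names are assumptions made for illustration; the claims do not specify any of them.

```python
import hashlib
import re
from collections import defaultdict


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def document_description(text, related_phrases, top_n=3):
    """Select and order sentences by how many related phrases each contains."""
    scored = []
    for position, sentence in enumerate(split_sentences(text)):
        lowered = sentence.lower()
        count = sum(1 for phrase in related_phrases if phrase in lowered)
        if count > 0:
            scored.append((count, position, sentence))
    # Highest count first; ties keep the original document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [sentence for _, _, sentence in scored[:top_n]]


def description_hash(description):
    # Concatenate the selected sentences, then hash the concatenation.
    return hashlib.sha256(" ".join(description).encode("utf-8")).hexdigest()


def find_duplicates(documents, related_phrases, top_n=3):
    """Group document ids whose description hashes are equal."""
    table = defaultdict(list)  # hash value -> ids of documents stored there
    for doc_id, text in documents.items():
        description = document_description(text, related_phrases, top_n)
        if description:  # skip documents that contain no related phrases
            table[description_hash(description)].append(doc_id)
    return [ids for ids in table.values() if len(ids) > 1]


# Hypothetical sample data: "a" and "b" differ only in boilerplate sentences.
documents = {
    "a": "Welcome to my page. The Australian Shepherd is a herding dog. "
         "Blue merle coats are common in the breed.",
    "b": "The Australian Shepherd is a herding dog. Copyright 2004. "
         "Blue merle coats are common in the breed.",
    "c": "Border Collies are a herding dog breed too. They love agility training.",
}
related = {"australian shepherd", "herding dog", "blue merle", "agility training"}
duplicate_groups = find_duplicates(documents, related)
print(duplicate_groups)
```

Because only sentences containing related phrases enter the description, documents "a" and "b" produce the same description even though their surrounding boilerplate differs, while "c" does not. A fuller system would then pick which duplicate to discard, for example by a document significance measure as in claims 3 and 4.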

13. A tangible computer readable storage medium storing a computer program executable by a processor for detecting a duplicate document, the operations of the computer program comprising:
- selecting a first document and a second document from a set of documents;
  
  comparing a document description of the first document with a document description of the second document, wherein the document description of the first document comprises a selected subset of sentences of the first document, the sentences being selected and ordered in the document description as a function of a number of related phrases in the selected sentences;
  
  wherein the document description of the second document comprises a selected subset of sentences of the second document, the sentences being selected and ordered in the document description as a function of a number of related phrases in the selected sentences, wherein a phrase g_j is a related phrase of another phrase g_k, occurring in the set of documents when an information gain of g_j with respect to g_k, exceeds a predetermined threshold, the information gain of g_j with respect to g_k being a function of both actual and expected co-occurrence rates of g_j and g_k in the set of documents; and
  
  responsive to the document description corresponding to the first document matching the document description corresponding to the second document, identifying the first document and the second document as duplicate documents in the set of documents.

14. A system for detecting a duplicate document, comprising:
- a document description system, executed by a processor, and configured to associate a set of documents with a set of corresponding document descriptions and store the associations in a memory, wherein a document description of the first document comprises a selected subset of sentences of the first document, the sentences being selected and ordered in the document description as a function of a number of related phrases in the selected sentences;
  
  wherein the document description of the second document comprises a selected subset of sentences of the second document, the sentences being selected and ordered in the document description as a function of a number of related phrases in the selected sentences, wherein a phrase g_j is a related phrase of another phrase g_k occurring in the set of documents when an information gain of g_j with respect to g_k exceeds a predetermined threshold, the information gain of g_j with respect to g_k being a function of both actual and expected co-occurrence rates of g_j and g_k in the set of documents; and
  
  a duplicate detection system, executed by a processor and configured to:
  
  select a first document and a second document from the document description system;
  
  compare the document description corresponding to the first document with the document description corresponding to the second document, and responsive to the document description corresponding to the first document matching the document description corresponding to the second document, identifying the first document and the second document as duplicate documents in the set of documents.
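
The related-phrase test recited in claims 1, 13, and 14 can be sketched as follows. The claims say only that information gain is a function of the actual and expected co-occurrence rates of g_j and g_k. This sketch assumes one simple form, the ratio of the actual rate to the rate expected if the two phrases occurred independently, and uses an illustrative threshold of 1.5. Under that assumption the measure is symmetric, so each pair is reported once. The function name, the input representation (one set of phrases per document), and the sample data are hypothetical.

```python
from itertools import combinations


def related_phrase_pairs(doc_phrases, threshold):
    """Return {(g_j, g_k): gain} for pairs whose gain exceeds the threshold.

    Assumption: gain = actual co-occurrence rate / expected co-occurrence
    rate, where the expected rate treats the two phrases as independent.
    """
    total = len(doc_phrases)
    doc_count = {}   # phrase -> number of documents containing it
    pair_count = {}  # (g_j, g_k) -> number of documents containing both
    for phrases in doc_phrases:
        for g in phrases:
            doc_count[g] = doc_count.get(g, 0) + 1
        for pair in combinations(sorted(phrases), 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1

    related = {}
    for (g_j, g_k), together in pair_count.items():
        actual = together / total
        expected = (doc_count[g_j] / total) * (doc_count[g_k] / total)
        gain = actual / expected
        if gain > threshold:
            related[(g_j, g_k)] = gain
    return related


# Hypothetical corpus: each entry is the set of phrases found in one document.
corpus = [
    {"australian shepherd", "blue merle"},
    {"australian shepherd", "blue merle"},
    {"home page"},
    {"home page"},
    {"home page", "australian shepherd"},
    {"home page"},
    {"home page"},
    {"home page", "blue merle"},
]
related_pairs = related_phrase_pairs(corpus, threshold=1.5)
for pair, gain in related_pairs.items():
    print(pair, round(gain, 3))
```

In this corpus "australian shepherd" and "blue merle" co-occur more often than independence would predict, so the pair passes the threshold, while pairs involving the ubiquitous "home page" do not.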


Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Patterson, Anna Lynn
Primary Examiner(s): Vy; Hung T
Application Number: US10/900,012
Publication Number: US 20080306943A1
Time in Patent Office: 2,108 Days
Field of Search: 707/2-6, 703/2, 704/4, 704/9, 715/500, 715/256, 715/267, 715/513, 715/501.1
US Class Current: 707/715

CPC Class Codes

G06F 16/2237   Vectors, bitmaps or matrices
G06F 16/243   Natural language query form...
G06F 16/24578   using ranking
G06F 16/3322   using system suggestions G0...
G06F 16/3344   using natural language anal...
G06F 16/93   Document management systems
G06F 16/951   Indexing; Web crawling tech...
G06Q 10/10   Office automation; Time man...
