PHRASE-BASED DETECTION OF DUPLICATE DOCUMENTS IN AN INFORMATION RETRIEVAL SYSTEM

US 20080306943A1
Filed: 07/26/2004
Published: 12/11/2008
Est. Priority Date: 07/26/2004
Status: Active Grant

Abstract

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results and from the index.

17 Claims

1. (canceled)

2. (canceled)

3. A method of detecting a duplicate document, the method comprising:
- selecting a first document and a second document from a set of documents;
  
  comparing a document description of the first document with a document description of the second document, wherein the document description of each document comprises selected sentences of the document that are ordered in the document description as a function of a number of phrases in each sentence; and
  
  responsive to the document description of the first document matching the document description of the second document, discarding at least one of the first document or the second document from the set of documents.
  - 4. The method of claim 3, further comprising:
    - receiving a query comprising at least one phrase;
      
      retrieving a plurality of documents responsive to the query to form the set of documents as a search result set including the first document and the second document; and
      
      wherein discarding at least one of the first document or the second document comprises discarding at least one of the first document or the second document from the search result set.
  - 5. The method of claim 3, wherein comparing the document description of the first document with a document description of the second document further comprises:
    - for each of the first and second documents, generating a document description by selecting sentences of the document, and ordering in the document description as a function of a number of phrases in each sentence.
  - 6. The method of claim 3, wherein comparing the document description of the first document with a document description of the second document further comprises:
    - retrieving for each of the first document and the second document a stored document description comprising selected sentences of the document, wherein the selected sentences are ordered in the document description as a function of a number of phrases in each sentence.
  - 7. The method of claim 3, wherein comparing the document description of the first document with a document description of the second document further comprises:
    - generating for the first document a document description by selecting sentences of the first document, and ordering in the document description as a function of a number of phrases in each sentence; and
      
      retrieving for the second document a stored document description comprising selected sentences of the second document, wherein the selected sentences are ordered in the document description as a function of a number of phrases in each sentence.
  - 8. The method of claim 3, wherein selecting a first document and a second document from a set of documents further comprises:
    - selecting the first document and second document during indexing of the first document.
  - 9. The method of claim 3, wherein the phrases as a function of which the sentences of the first document description are ordered are phrases related to the first document, and the phrases as a function of which the sentences of the second document description are ordered are phrases related to the second document.
  - 10. The method of claim 3, wherein a document description is stored in association with the document to which the document description corresponds.
  - 11. The method of claim 10, wherein the association is accomplished by concatenating the sentences of the document description, computing a hash value of the concatenated sentences, and storing an identifier of the document in an associative data structure at a data structure location corresponding to the computed hash value.
  - 12. The method of claim 3, wherein the document description of the first document matches the document description of the second document when a hash value of the first document description equals a hash value of the second document description.
  - 13. The method of claim 3, wherein the document discarded has a lower document significance measure.
  - 14. The method of claim 13, wherein the document significance measure is page rank.
  - 15. The method of claim 13, wherein discarding at least one of the first document or the second document from the set of documents comprises removing the first document or the second document from an index.
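The steps recited in claims 3, 5, 12 and 13 can be illustrated with a minimal sketch. It is not the patented implementation: the `Document` class, the naive sentence splitter, substring-based phrase counting, the `max_sentences` cutoff, and the choice of SHA-256 are all assumptions made for illustration, and `significance` stands in for whatever document significance measure is available.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str
    significance: float  # hypothetical score, e.g. a PageRank-like value


def split_sentences(text: str) -> list[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def phrase_count(sentence: str, phrases: set[str]) -> int:
    # Count how many known phrases occur in the sentence (substring match).
    lowered = sentence.lower()
    return sum(1 for p in phrases if p in lowered)


def describe(doc: Document, phrases: set[str], max_sentences: int = 2) -> list[str]:
    # Order sentences by number of phrases, highest first. sorted() is
    # stable, so sentences with equal counts keep their document order.
    ranked = sorted(
        split_sentences(doc.text),
        key=lambda s: phrase_count(s, phrases),
        reverse=True,
    )
    return ranked[:max_sentences]


def description_hash(description: list[str]) -> str:
    # Concatenate the selected sentences and hash the result.
    return hashlib.sha256(" ".join(description).encode("utf-8")).hexdigest()


def deduplicate(docs: list[Document], phrases: set[str]) -> list[Document]:
    # Associative structure keyed by description hash. When two documents
    # collide, keep the one with the higher significance measure.
    kept: dict[str, Document] = {}
    for doc in docs:
        key = description_hash(describe(doc, phrases))
        current = kept.get(key)
        if current is None or doc.significance > current.significance:
            kept[key] = doc
    return list(kept.values())
```

Passing a search result set to `deduplicate` corresponds to the query-time case of claim 4; running the same check while indexing a new document corresponds to claim 8.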

16. A tangible computer readable storage medium storing a computer program executable by a processor for detecting a duplicate document, the operations of the computer program comprising:
- selecting a first document and a second document from a set of documents;
  
  comparing a document description of the first document with a document description of the second document, wherein the document description of each document comprises selected sentences of the document that are ordered in the document description as a function of a number of phrases in each sentence; and
  
  responsive to the document description of the first document matching the document description of the second document, discarding at least one of the first document or the second document from the set of documents.

17. A system for detecting a duplicate document, comprising:
- a document description system, executed by a processor, and configured to associate a set of documents with a set of corresponding document descriptions and store the associations in a memory, wherein the corresponding document description of each document comprises selected sentences of the document that are ordered in the document description as a function of a number of phrases in each sentence; and
  
  a duplicate detection system, executed by a processor and configured to:
  
  select a first document and a second document from the document description system;
  
  compare the document description corresponding to the first document with the document description corresponding to the second document, and
  
  responsive to the document description corresponding to the first document matching the document description corresponding to the second document, disassociate at least one of the first document or the second document from the set of documents.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Patterson, Anna Lynn
Granted Patent: US 7,711,679 B2
US Class Current: 707/6

CPC Class Codes

G06F 16/2237   Vectors, bitmaps or matrices

G06F 16/243   Natural language query form...

G06F 16/24578   using ranking

G06F 16/3322   using system suggestions G0...

G06F 16/3344   using natural language anal...

G06F 16/93   Document management systems

G06F 16/951   Indexing; Web crawling tech...

G06Q 10/10   Office automation; Time man...
