Phrase-based detection of duplicate documents in an information retrieval system

US 8,108,412 B2
Filed: 03/04/2010
Issued: 01/31/2012
Est. Priority Date: 07/26/2004
Status: Active Grant

Abstract

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results, and from the index.

218 Citations

38 Claims

1. A method of detecting duplicate documents in search results, the method comprising:
- receiving a query comprising at least one phrase;
  
  retrieving a plurality of documents responsive to the query to form a search result, the retrieved documents being selected from a set of documents;
  
  for each of the retrieved documents, generating, by operation of a processor within a computer system, a document description comprising selected sentences of the document, wherein the selected sentences are ordered in the document description as a function of a number of related phrases in each selected sentence, wherein a phrase g_j is a related phrase of another phrase g_k occurring in the set of documents when an information gain of g_j with respect to g_k exceeds a predetermined threshold;
  
  responsive to the document description of at least two documents matching, discarding at least one of the two documents from the search result.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The method of claim 1, wherein the information gain of g_j with respect to g_k is a function of both actual and expected co-occurrence rates of g_j and g_k in the set of documents.
  - 3. The method of claim 1, wherein the document description of a first document matches the document description of a second document when a hash value of the first document description equals a hash value of the second document description.
  - 4. The method of claim 1, further comprising:
    - concatenating the selected sentences of the document descriptions of each of the retrieved documents;
      
      computing a hash value of the concatenated sentences of the document descriptions of each of the retrieved documents; and
      
      comparing the hash values for the retrieved documents to determine if the document descriptions for two retrieved documents match.
  - 5. The method of claim 1, wherein a discarded document has a lower document significance measure than document whose document description matches the document description of the discarded document.
  - 6. The method of claim 5, wherein the document significance measure includes a page rank of the document.
  - 7. The method of claim 1, wherein the related phrases in the selected sentences of a document are related to other phrases of the document.
  - 8. The method of claim 1, wherein the related phrases in the selected sentences of a document are related to other phrases of the set of document.
  - 9. The method of claim 1, wherein the number (N) of selected sentences in the document description for the discarded document is identical to the number of selected sentences in the document description for a document whose document description matches the document description of the discarded document.
  - 10. The method of claim 9, wherein N is between 5 and 10.
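
The steps of claim 1 and its dependent claims 3 to 6 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Doc` class, the function names, the substring-based phrase matching, and the choice of SHA-256 are all assumptions made for illustration, and the set of related phrases is assumed to be lowercase and supplied by the caller.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical container for a retrieved document."""
    doc_id: str
    sentences: list[str]
    significance: float  # assumed score, e.g. a page-rank-like value (claims 5 and 6)


def count_related(sentence: str, related_phrases: set[str]) -> int:
    """Count how many related phrases occur in the sentence (naive substring match)."""
    text = sentence.lower()
    return sum(1 for phrase in related_phrases if phrase in text)


def describe(doc: Doc, related_phrases: set[str], n: int = 5) -> list[str]:
    """Select the n sentences with the most related phrases, highest count first.

    sorted() is stable, so sentences with equal counts keep document order.
    """
    ranked = sorted(
        doc.sentences,
        key=lambda s: count_related(s, related_phrases),
        reverse=True,
    )
    return ranked[:n]


def description_hash(description: list[str]) -> str:
    """Concatenate the selected sentences and hash the result (claim 4)."""
    joined = " ".join(description)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def dedupe(results: list[Doc], related_phrases: set[str], n: int = 5) -> list[Doc]:
    """Keep one document per description hash, preferring higher significance."""
    best: dict[str, Doc] = {}
    for doc in results:
        key = description_hash(describe(doc, related_phrases, n))
        if key not in best or doc.significance > best[key].significance:
            best[key] = doc
    kept_ids = {doc.doc_id for doc in best.values()}
    # Preserve the original result order for the surviving documents.
    return [doc for doc in results if doc.doc_id in kept_ids]
```

Because the description is built from sentences ranked by related-phrase count rather than by position, two copies of a page whose sentences appear in a different order can still produce the same hash in this sketch. Using the same N for every document, as in claims 9 and 10, keeps the descriptions comparable.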

11. A method of detecting duplicate documents in search results, the method comprising:
- receiving a query comprising at least one phrase;
  
  retrieving a plurality of documents responsive to the query to form a search result, the retrieved documents being selected from a set of documents;
  
  for each of the retrieved documents, by operation of a processor within a computer system, retrieving a stored document description comprising selected sentences of the document, wherein the selected sentences are ordered in the document description as a function of a number of related phrases in each sentence, wherein a phrase g_j is a related phrase of another phrase g_k occurring in the set of documents when an information gain of g_j with respect to g_k exceeds a predetermined threshold;
  
  responsive to the document description at least two documents matching, discarding at least one of the two documents from the search result.
- Dependent claims (12, 13, 14, 15, 16, 17, 18, 19)
  - 12. The method of claim 11, wherein the information gain of g_j with respect to g_k is a function of both actual and expected co-occurrence rates of g_j and g_k in the set of documents.
  - 13. The method of claim 11, wherein the document description of a first document matches the document description of a second document when a hash value of the first document description equals a hash value of the second document description.
  - 14. The method of claim 11, wherein a discarded document has a lower document significance measure than document whose document description matches the document description of the discarded document.
  - 15. The method of claim 14, wherein the document significance measure includes a page rank of the document.
  - 16. The method of claim 11, wherein the related phrases in the selected sentences of a document are related to other phrases of the document.
  - 17. The method of claim 11, wherein the related phrases in the selected sentences of a document are related to other phrases of the set of documents.
  - 18. The method of claim 11, wherein the number (N) of selected sentences in the document description for the discarded document is identical to the number of selected sentences in the document description for a document whose document description matches the document description of the discarded document.
  - 19. The method of claim 18, wherein N is between 5 and 10.
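
Claims 2 and 12 state only that the information gain of g_j with respect to g_k is a function of both actual and expected co-occurrence rates. The sketch below shows one plausible form, assumed here for illustration: the ratio of the actual co-occurrence rate to the rate expected if the two phrases occurred independently. The function names, the document-count parameters, and the ratio itself are assumptions, not text from the claims.

```python
def information_gain(
    docs_with_j: int,
    docs_with_k: int,
    docs_with_both: int,
    total_docs: int,
) -> float:
    """Assumed form: actual co-occurrence rate divided by expected rate.

    The expected rate treats g_j and g_k as independent, so it is the
    product of their individual document frequencies.
    """
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    actual = docs_with_both / total_docs
    expected = (docs_with_j / total_docs) * (docs_with_k / total_docs)
    if expected == 0:
        return 0.0  # a phrase that never occurs cannot be related to anything
    return actual / expected


def is_related(
    docs_with_j: int,
    docs_with_k: int,
    docs_with_both: int,
    total_docs: int,
    threshold: float,
) -> bool:
    """g_j counts as related to g_k when the gain exceeds the chosen threshold."""
    gain = information_gain(docs_with_j, docs_with_k, docs_with_both, total_docs)
    return gain > threshold
```

For example, with 1,000 documents where g_j appears in 10, g_k in 20, and both in 8, the actual rate is 0.008 and the expected rate is 0.0002, so the gain under this assumed ratio is about 40. Note that this particular ratio is symmetric in g_j and g_k; the claims do not say whether the actual function is.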

20. A tangible computer readable storage medium storing a computer program executable by a processor for detecting a duplicate document, the operations of the computer program comprising:
- receiving a query comprising at least one phrase;
  
  retrieving a plurality of documents responsive to the query to form a search result, the retrieved documents being selected from a set of documents;
  
  for each of the retrieved documents, generating, by operation of a processor within a computer system, a document description comprising selected sentences of the document, wherein the selected sentences are ordered in the document description as a function of a number of related phrases in each selected sentence, wherein a phrase g_j is a related phrase of another phrase g_k occurring in the set of documents when an information gain of g_j with respect to g_k exceeds a predetermined threshold;
  
  responsive to the document description of at least two documents matching, discarding at least one of the two documents from the search result.
- Dependent claims (21, 22, 23, 24, 25, 26, 27, 28, 29)
  - 21. The computer readable storage medium of claim 20, wherein the information gain of g_j with respect to g_k is a function of both actual and expected co-occurrence rates of g_j and g_k in the set of documents.
  - 22. The computer readable storage medium of claim 20, wherein the document description of a first document matches the document description of a second document when a hash value of the first document description equals a hash value of the second document description.
  - 23. The computer readable storage medium of claim 20, wherein the operations of the computer program further comprise:
    - concatenating the selected sentences of the document descriptions of each of the retrieved documents;
      
      computing a hash value of the concatenated sentences of the document descriptions of each of the retrieved documents; and
      
      comparing the hash values for the retrieved documents to determine if the document descriptions for two retrieved documents match.
  - 24. The computer readable storage medium of claim 20, wherein a discarded document has a lower document significance measure than document whose document description matches the document description of the discarded document.
  - 25. The computer readable storage medium of claim 24, wherein the document significance measure includes a page rank of the document.
  - 26. The computer readable storage medium of claim 20, wherein the related phrases in the selected sentences of a document are related to other phrases of the document.
  - 27. The computer readable storage medium of claim 20, wherein the related phrases in the selected sentences of a document are related to other phrases of the set of document.
  - 28. The computer readable storage medium of claim 20, wherein the number (N) of selected sentences in the document description for the discarded document is identical to the number of selected sentences in the document description for a document whose document description matches the document description of the discarded document.
  - 29. The computer readable storage medium of claim 28, wherein N is between 5 and 10.

30. A tangible computer readable storage medium storing a computer program executable by a processor for detecting a duplicate document, the operations of the computer program comprising:
- receiving a query comprising at least one phrase;
  
  retrieving a plurality of documents responsive to the query to form a search result, the retrieved documents being selected from a set of documents;
  
  for each of the retrieved documents, by operation of a processor within a computer system, retrieving a stored document description comprising selected sentences of the document, wherein the selected sentences are ordered in the document description as a function of a number of related phrases in each sentence, wherein a phrase g_j is a related phrase of another phrase g_k occurring in the set of documents when an information gain of g_j with respect to g_k exceeds a predetermined threshold;
  
  responsive to the document description at least two documents matching, discarding at least one of the two documents from the search result.
- Dependent claims (31, 32, 33, 34, 35, 36, 37, 38)
  - 31. The computer readable storage medium of claim 30, wherein the information gain of g_j with respect to g_k is a function of both actual and expected co-occurrence rates of g_j and g_k in the set of documents.
  - 32. The computer readable storage medium of claim 30, wherein the document description of a first document matches the document description of a second document when a hash value of the first document description equals a hash value of the second document description.
  - 33. The computer readable storage medium of claim 30, wherein a discarded document has a lower document significance measure than document whose document description matches the document description of the discarded document.
  - 34. The computer readable storage medium of claim 33, wherein the document significance measure includes a page rank of the document.
  - 35. The computer readable storage medium of claim 30, wherein the related phrases in the selected sentences of a document are related to other phrases of the document.
  - 36. The computer readable storage medium of claim 30, wherein the related phrases in the selected sentences of a document are related to other phrases of the set of documents.
  - 37. The computer readable storage medium of claim 30, wherein the number (N) of selected sentences in the document description for the discarded document is identical to the number of selected sentences in the document description for a document whose document description matches the document description of the discarded document.
  - 38. The computer readable storage medium of claim 37, wherein N is between 5 and 10.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Patterson, Anna L.
Primary Examiner(s): Vy, Hung T
Application Number: US12/717,687
Publication Number: US 20100161625A1
Time in Patent Office: 698 Days
Field of Search: 707/705, 707/706, 707/709, 707/711, 707/722, 707/741, 707/758, 707/754, 715/205
US Class Current: 707/754
CPC Class Codes:
- G06F 16/2237   Vectors, bitmaps or matrices
- G06F 16/243   Natural language query form...
- G06F 16/24578   using ranking
- G06F 16/3322   using system suggestions G0...
- G06F 16/3344   using natural language anal...
- G06F 16/93   Document management systems
- G06F 16/951   Indexing; Web crawling tech...
- G06Q 10/10   Office automation; Time man...
