Detecting query-specific duplicate documents

US 8,214,359 B1
Filed: 07/19/2010
Issued: 07/03/2012
Est. Priority Date: 02/22/2000
Status: Expired due to Term

Abstract

An improved duplicate detection technique that uses query-relevant information to limit the portion(s) of documents to be compared for similarity is described. Before comparing two documents for similarity, the content of these documents may be condensed based on the query. In one embodiment, query-relevant information or text (also referred to as “snippets”) is extracted from the documents and only the extracted snippets, rather than the entire documents, are compared for purposes of determining similarity.
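
The snippet-based comparison described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the tokenizer, the `window` parameter, and the function names `extract_snippets` and `same_for_query` are all assumptions introduced here.

```python
import re

def tokenize(text):
    """Lowercase alphanumeric word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def extract_snippets(document, keywords, window=3):
    """Return word windows around each keyword occurrence.

    `window` (words kept on each side of a hit) is an illustrative
    parameter, not a value taken from the patent.
    """
    words = tokenize(document)
    targets = {k.lower() for k in keywords}
    snippets = []
    for i, word in enumerate(words):
        if word in targets:
            start = max(0, i - window)
            snippets.append(" ".join(words[start:i + window + 1]))
    return snippets

def same_for_query(doc_a, doc_b, keywords, window=3):
    """Compare only the extracted snippets, not the whole documents."""
    snippets_a = extract_snippets(doc_a, keywords, window)
    snippets_b = extract_snippets(doc_b, keywords, window)
    # Documents without any keyword hit are never treated as duplicates here.
    return bool(snippets_a) and snippets_a == snippets_b
```

Under these assumptions, two pages that differ only in headers or footers compare as equal for a query that hits the shared body text, and as different for a query that hits the text that differs.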

41 Claims

1. A computer-implemented method, comprising:
- receiving a plurality of search results responsive to a query, wherein the query includes one or more search keywords, and wherein the plurality of search results have an associated order, where the particular order is determined using a ranking criteria;
  
  processing each search result in the plurality of search results according to the order for the plurality of search results to generate a final group of search results, the final group of search results including a plurality of final search results from the plurality of search results, the processing including,
  
  adding a first search result in the plurality of search results to the final group of search results, wherein the first search result is first in the order for the plurality of search results, and
  
  for each other search result of the plurality of search results:
  
  determining whether a first document corresponding to the search result is a query-specific duplicate of a second document corresponding to any of the search results in the final group of search results, and
  
  if the first document corresponding to the search result is not a query-specific duplicate of the second document corresponding to any of the remaining search results in the final group of search results, adding the search result to the final set of search results before processing any other search result following the search result in the order, and otherwise not adding the search result to the final set of search results; and
  
  providing the final group of search results.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 39):
  - 2. The method of claim 1, wherein determining whether the first document is the query-specific duplicate of the second document further comprises comparing one or more first query-relevant parts of the first document and one or more second query-relevant parts of the second document, where each first query-relevant part and each second query-relevant part includes at least one of the one or more keywords.
  - 3. The method of claim 2, wherein determining whether the first document is the query-specific duplicate of the second document further comprises determining whether the one or more first query-relevant parts are identical to the one or more second query-relevant parts.
  - 4. The method of claim 2, wherein determining whether the first document is the query-specific duplicate of a second document further comprises determining whether the one or more first query-relevant parts are similar to the one or more second query-relevant parts.
  - 5. The method of claim 4, further comprising determining whether the one or more first query-relevant parts are similar to the one or more second query relevant-parts according to one or more of an edit distance between the one or more first query-relevant parts and the one or more second query-relevant parts, a cosine distance between a feature vector for the one or more first query-relevant parts and a feature vector for the one or more second query-relevant parts, or an analysis of shingles in the one or more first query-relevant parts and shingles in the one or more second query-relevant parts.
  - 6. The method of claim 2, wherein determining whether the first document is the query-specific duplicate of the second document further comprises extracting one or more query-relevant parts from the first document and extracting one or more query-relevant parts from the second document.
  - 7. The method of claim 2, wherein the one or more first query-relevant parts are a limited portion of the first document and the one or more second query-relevant parts are a limited portion of the second document.
  - 8. The method of claim 1, wherein determining whether the first document corresponding to the search result is the query-specific duplicate of the second document corresponding to any of the search results in the final group of search results further comprises comparing query-relevant parts of the first document to query-relevant parts of a respective document corresponding to each search result in the final set of search results.
  - 9. The method of claim 1, wherein the order of the plurality of search results is based on a degree of match, for each search result, between the search keywords and words in a document corresponding to the search result.
  - 10. The method of claim 9, wherein the order of the plurality of search results is further based on an estimate of a quality of each search result.
  - 39. The method of claim 1, wherein the ranking criteria is based at least in part on content associated with the respective search results.
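
The in-order processing recited in claim 1 can be sketched as a single pass over the ranked results. This is an illustrative sketch only: `filter_duplicates` and the caller-supplied predicate `is_duplicate` are hypothetical names, and the claim does not prescribe any particular data structure.

```python
def filter_duplicates(results, is_duplicate):
    """Keep results in ranked order, skipping query-specific duplicates.

    `results` is assumed to be sorted by the ranking criteria already.
    `is_duplicate(a, b)` is a hypothetical caller-supplied predicate that
    decides whether two results are query-specific duplicates.
    """
    final = []
    for result in results:
        # The first result is always kept, because `final` is still empty.
        if not any(is_duplicate(result, kept) for kept in final):
            final.append(result)
    return final
```

Each candidate is compared only against results that have already been accepted, so when two results are duplicates of each other the higher-ranked one survives.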

11. A system, comprising:
- one or more computers configured to perform operations comprising:
  
  receiving a plurality of search results responsive to a query, wherein the query includes one or more search keywords, and wherein the plurality of search results have an associated order, where the particular order is determined using a ranking criteria;
  
  processing each search result in the plurality of search results according to the order for the plurality of search results to generate a final group of search results, the final group of search results including a plurality of final search results from the plurality of search results, the processing including,
  
  adding a first search result in the plurality of search results to the final group of search results, wherein the first search result is first in the order for the plurality of search results, and
  
  for each other search result of the plurality of search results:
  
  determining whether a first document corresponding to the search result is a query-specific duplicate of a second document corresponding to any of the search results in the final group of search results, and
  
  if the first document corresponding to the search result is not a query-specific duplicate of the second document corresponding to any of the remaining search results in the final group of search results, adding the search result to the final set of search results before processing any other search result following the search result in the order, and otherwise not adding the search result to the final set of search results; and
  
  providing the final group of search results.
- Dependent claims (12, 13, 14, 15, 16, 17, 18, 19, 20, 40):
  - 12. The system of claim 11, wherein determining whether the first document is the query-specific duplicate of the second document further comprises comparing one or more first query-relevant parts of the first document and one or more second query-relevant parts of the second document, where each first query-relevant part and each second query-relevant part includes at least one of the one or more keywords.
  - 13. The system of claim 12, wherein determining whether the first document is the query-specific duplicate of the second document further comprises determining whether the one or more first query-relevant parts are identical to the one or more second query-relevant parts.
  - 14. The system of claim 12, wherein determining whether the first document is the query-specific duplicate of a second document further comprises determining whether the one or more first query-relevant parts are similar to the one or more second query-relevant parts.
  - 15. The system of claim 14, wherein the operations further comprise determining whether the one or more first query-relevant parts are similar to the one or more second query relevant-parts according to one or more of an edit distance between the one or more first query-relevant parts and the one or more second query-relevant parts, a cosine distance between a feature vector for the one or more first query-relevant parts and a feature vector for the one or more second query-relevant parts, or an analysis of shingles in the one or more first query-relevant parts and shingles in the one or more second query-relevant parts.
  - 16. The system of claim 12, wherein determining whether the first document is the query-specific duplicate of the second document further comprises extracting one or more query-relevant parts from the first document and extracting one or more query-relevant parts from the second document.
  - 17. The system of claim 12, wherein the one or more first query-relevant parts are a limited portion of the first document and the one or more second query-relevant parts are a limited portion of the second document.
  - 18. The system of claim 11, wherein determining whether the first document corresponding to the search result is the query-specific duplicate of the second document corresponding to any of the search results in the final group of search results further comprises comparing query-relevant parts of the first document to query-relevant parts of a respective document corresponding to each search result in the final set of search results.
  - 19. The system of claim 11, wherein the order of the plurality of search results is based on a degree of match, for each search result, between the search keywords and words in a document corresponding to the search result.
  - 20. The system of claim 19, wherein the order of the plurality of search results is further based on an estimate of a quality of each search result.
  - 40. The system of claim 11, wherein the ranking criteria is based at least in part on content associated with the respective search results.
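
Claims 5 and 15 name three ways to judge whether query-relevant parts are similar: edit distance, cosine distance between feature vectors, and an analysis of shingles. The sketch below shows one plausible reading of each. The word-count feature vectors, the Jaccard overlap used for shingles, and the shingle size `k` are assumptions made here, not details taken from the claims.

```python
import math
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def cosine_similarity(a, b):
    """Cosine of the angle between word-count vectors of two strings."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def shingle_overlap(a, b, k=2):
    """Jaccard overlap of the k-word shingle sets of two strings."""
    def shingles(text):
        words = text.split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}
    sa, sb = shingles(a), shingles(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0
```

A cosine distance can be taken as `1 - cosine_similarity(a, b)`. Any threshold that turns one of these scores into a duplicate decision would be a tuning choice; the claims do not specify one.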

21. A computer storage medium encoded with a computer program, the program comprising instructions that when executed by data processing apparatus cause the data processing apparatus to perform operations comprising:
- receiving a plurality of search results responsive to a query, wherein the query includes one or more search keywords, and wherein the plurality of search results have an associated order, where the particular order is determined using a ranking criteria;
  
  processing each search result in the plurality of search results according to the order for the plurality of search results to generate a final group of search results, the final group of search results including a plurality of final search results from the plurality of search results, the processing including,
  
  adding a first search result in the plurality of search results to the final group of search results, wherein the first search result is first in the order for the plurality of search results, and
  
  for each other search result of the plurality of search results:
  
  determining whether a first document corresponding to the search result is a query-specific duplicate of a second document corresponding to any of the search results in the final group of search results, and
  
  if the first document corresponding to the search result is not a query-specific duplicate of the second document corresponding to any of the remaining search results in the final group of search results, adding the search result to the final set of search results before processing any other search result following the search result in the order, and otherwise not adding the search result to the final set of search results; and
  
  providing the final group of search results.
- Dependent claims (36, 37, 38, 41):
  - 36. The computer storage medium of claim 21, wherein determining whether the first document corresponding to the search result is the query-specific duplicate of the second document corresponding to any of the search results in the final group of search results further comprises comparing query-relevant parts of the first document to query-relevant parts of a respective document corresponding to each search result in the final set of search results.
  - 37. The computer storage medium of claim 21, wherein the order of the plurality of search results is based on a degree of match, for each search result, between the search keywords and words in a document corresponding to the search result.
  - 38. The computer storage medium of claim 37, wherein the order of the plurality of search results is further based on an estimate of a quality of each search result.
  - 41. The computer storage medium of claim 21, wherein the ranking criteria is based at least in part on content associated with the respective search results.

22. A computer storage medium encoded with a computer program, the program comprising instructions that when executed by data processing apparatus cause the data processing apparatus to perform operations comprising:
- receiving search results in response to a query, the query including one or more keywords, the search results including a first search result and a second search result;
  
  generating a set of final search results from the received search results with one or more processors, including:
  
  adding the first search result to the set of final search results;
  
  determining that a first document corresponding to the first search result and a second document corresponding to the second search result are query-specific duplicate documents from a comparison of one or more first query-relevant parts of the first document and one or more second query-relevant parts of the second document, where each query-relevant part includes at least one of the one or more keywords; and
  
  in response to the determination, not adding the second search result to the set of final search results; and
  
  providing the set of final search results.
- Dependent claims (23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35):
  - 23. The computer storage medium of claim 22, wherein:
    - the received search results further include a third search result; and
      
      generating the set of final search results further includes:
      
      determining that the first document and a third document corresponding to the third search result are not query-specific duplicate documents based on a comparison of the one or more first query-relevant parts of the first document and one or more third query-relevant parts of the third document; and
      
      in response to the determination, adding the third search result to the set of final search results.
  - 24. The computer storage medium of claim 22, wherein the set of final search results includes web pages.
  - 25. The computer storage medium of claim 22, wherein the receiving search results and generating a set of final search results are performed automatically, without the need for user intervention.
  - 26. The computer storage medium of claim 22, wherein the query-relevant parts include a predetermined number of characters.
  - 27. The computer storage medium of claim 22, wherein the query-relevant parts include a predetermined number of words.
  - 28. The computer storage medium of claim 22, wherein the query-relevant parts are sentences.
  - 29. The computer storage medium of claim 22, wherein the query-relevant parts are paragraphs.
  - 30. The computer storage medium of claim 22, wherein determining whether the first document is the query-specific duplicate of the second document further comprises comparing one or more first query-relevant parts of the first document and one or more second query-relevant parts of the second document, where each first query-relevant part and each second query-relevant part includes at least one of the one or more keywords.
  - 31. The computer storage medium of claim 30, wherein determining whether the first document is the query-specific duplicate of the second document further comprises determining whether the one or more first query-relevant parts are identical to the one or more second query-relevant parts.
  - 32. The computer storage medium of claim 30, wherein determining whether the first document is the query-specific duplicate of the second document further comprises determining whether the one or more first query-relevant parts are similar to the one or more second query-relevant parts.
  - 33. The computer storage medium of claim 32, wherein the operations further comprise determining whether the one or more first query-relevant parts are similar to the one or more second query relevant-parts according to one or more of an edit distance between the one or more first query-relevant parts and the one or more second query-relevant parts, a cosine distance between a feature vector for the one or more first query-relevant parts and a feature vector for the one or more second query-relevant parts, or an analysis of shingles in the one or more first query-relevant parts and shingles in the one or more second query-relevant parts.
  - 34. The computer storage medium of claim 30, wherein determining whether the first document is the query-specific duplicate of the second document further comprises extracting one or more query-relevant parts from the first document and extracting one or more query-relevant parts from the second document.
  - 35. The computer storage medium of claim 30, wherein the one or more first query-relevant parts are a limited portion of the first document and the one or more second query-relevant parts are a limited portion of the second document.
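
Claims 26 through 29 allow the query-relevant parts to be fixed-size character or word spans, sentences, or paragraphs. The sketch below illustrates the paragraph, sentence, and character variants. The splitting rules (blank lines for paragraphs, terminal punctuation for sentences), the `width` parameter, and all function names are assumptions chosen for illustration.

```python
import re

def _mentions(text, keywords):
    """True if any keyword appears as a whole word in `text` (case-insensitive)."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return any(k.lower() in words for k in keywords)

def relevant_paragraphs(document, keywords):
    """Blank-line-separated paragraphs containing at least one keyword."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    return [p for p in paragraphs if _mentions(p, keywords)]

def relevant_sentences(document, keywords):
    """Sentences containing at least one keyword (naive punctuation split)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    return [s for s in sentences if _mentions(s, keywords)]

def relevant_characters(document, keyword, width=20):
    """Character windows of `width` characters on each side of a keyword hit."""
    parts = []
    for match in re.finditer(re.escape(keyword), document, flags=re.IGNORECASE):
        start = max(0, match.start() - width)
        parts.append(document[start:match.end() + width])
    return parts
```

Smaller units such as character windows ignore more of the surrounding page, while paragraphs require more of the context to match before two documents count as duplicates for the query.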

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Gomes, Benedict A.; Smith, Benjamin T.
Primary Examiner(s): LY, CHEYNE D
Application Number: US12/839,164
Time in Patent Office: 715 Days
Field of Search: 707/692, 707/721, 707/728, 707/737, 707/748
US Class Current: 707/728

CPC Class Codes

G06F 16/338   Presentation of query results
G06F 16/38   Retrieval characterised by ...
G06F 40/226   Validation
G06F 40/232   Orthographic correction, e....
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access
Y10S 707/99937   Sorting
