Detecting query-specific duplicate documents

US 8,452,766 B1
Filed: 07/02/2012
Issued: 05/28/2013
Est. Priority Date: 02/22/2000
Status: Expired due to Term

Abstract

An improved duplicate detection technique that uses query-relevant information to limit the portion(s) of documents to be compared for similarity is described. Before comparing two documents for similarity, the content of these documents may be condensed based on the query. In one embodiment, query-relevant information or text (also referred to as “snippets”) is extracted from the documents and only the extracted snippets, rather than the entire documents, are compared for purposes of determining similarity.
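
The comparison step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how similarity between snippets is measured, so the Jaccard overlap of word shingles used here is a swapped-in, illustrative choice. The function names, the shingle size `k`, and the `threshold` value are all hypothetical.

```python
def normalize(snippet):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(snippet.lower().split())


def shingles(snippet, k=3):
    """Return the set of k-word shingles of a snippet (k is an assumed value)."""
    words = normalize(snippet).split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def are_query_duplicates(snippet_a, snippet_b, threshold=0.8):
    """Compare only the query-relevant snippets, not the full documents.

    The threshold is a hypothetical tuning value, not taken from the patent.
    """
    if normalize(snippet_a) == normalize(snippet_b):
        return True  # identical snippets
    return jaccard(shingles(snippet_a), shingles(snippet_b)) >= threshold
```

Because only the snippets are compared, two pages with different surrounding content can still be treated as duplicates for one query and as distinct results for another.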

16 Claims

1. A computer-implemented method comprising:
- receiving a plurality of search results responsive to a query, wherein the query includes one or more words, wherein each search result identifies a respective document that comprises a plurality of segments, wherein each segment is a distinct sequence of consecutive characters in the respective document;
  
  identifying the plurality of segments for each document by sliding a fixed-length window over content of the document, wherein the content of the document encompassed by the window starting at a particular position in the document defines a particular segment in the document, wherein sliding the fixed-length window over the content of the document comprises skipping space characters and characters that would result in a last character of the window splitting a word;
  
  for each of the plurality of segments in each of the documents, determining a respective count of occurrences of the one or more words of the query that occur in the segment;
  
  for each of the documents, ranking the segments of the document based on the respective counts of the segments; and
  
  identifying one or more highest ranked segments for each of the documents as a query-relevant part of the document.
- Dependent claims:
  - 2. The method of claim 1, further comprising sending the search results that identify the documents and the query-relevant parts of the documents to a computing device in a response to the query.
  - 3. The method of claim 1 wherein each segment is a sentence or a paragraph.
  - 4. The method of claim 1 wherein sliding the fixed-length window over the content of the document comprises sliding a fixed-length window over content of the document a character at a time.
  - 5. The method of claim 1, further comprising determining that a pair of the documents are duplicates based on a comparison of the respective query-relevant parts of the pair of documents.
  - 6. The method of claim 5 wherein the respective query-relevant parts of the pair of documents are identical.
  - 7. The method of claim 5 wherein the respective query-relevant parts of the pair of documents are similar.
  - 8. The method of claim 1 wherein determining a respective count of occurrences of the one or more words of the query that occur in the segment comprises counting each occurrence of any of the one or more words of the query that occur in the segment.
  - 9. The method of claim 1 wherein determining a respective count of occurrences of the one or more words of the query that occur in the segment comprises counting only a first occurrence of any of the one or more words of the query that occur in the segment.
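
The steps of claim 1, together with the counting variants of claims 8 and 9, can be sketched in Python as below. This is one possible reading of the claim language, not the patented implementation. It assumes that "skipping" means discarding window positions that start on a space, or whose last character falls inside a word that continues past the window. The names `segments`, `count_hits`, and `query_relevant_parts`, and the parameters `width`, `top_k`, and `first_only`, are hypothetical.

```python
import re


def segments(text, width):
    """Yield (start, segment) for each admissible fixed-length window position."""
    if width <= 0:
        raise ValueError("width must be positive")
    if not text:
        return
    for start in range(0, max(len(text) - width, 0) + 1):
        end = start + width
        if text[start].isspace():
            continue  # skip positions that begin on a space character
        if end < len(text) and not text[end - 1].isspace() and not text[end].isspace():
            continue  # the window's last character would split a word
        yield start, text[start:end]


def count_hits(segment, query_words, first_only=False):
    """Count query-word occurrences in a segment.

    first_only=False counts every occurrence (as in claim 8);
    first_only=True counts each query word at most once (as in claim 9).
    """
    words = re.findall(r"\w+", segment.lower())
    wanted = {w.lower() for w in query_words}
    if first_only:
        return len(wanted & set(words))
    return sum(1 for w in words if w in wanted)


def query_relevant_parts(text, query_words, width=40, top_k=1, first_only=False):
    """Rank segments by hit count and return the top_k highest-ranked ones."""
    scored = [
        (count_hits(seg, query_words, first_only), start, seg)
        for start, seg in segments(text, width)
    ]
    # Highest count first; ties are broken by earliest start position (an assumption).
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [seg for _, _, seg in scored[:top_k]]


if __name__ == "__main__":
    doc = "Cheap flights to Paris. Book cheap Paris hotels and flights today."
    print(query_relevant_parts(doc, ["cheap", "paris"], width=30))
```

A window may still begin in the middle of a word under this reading, because the claim only constrains the window's last character. A stricter variant could also skip such start positions.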

10. A system comprising:
- one or more computers comprising one or more processors programmed to perform operations comprising:
  
  receiving a plurality of search results responsive to a query, wherein the query includes one or more words, wherein each search result identifies a respective document that comprises a plurality of segments, wherein each segment is a distinct sequence of consecutive characters in the respective document;
  
  identifying the plurality of segments for each document by sliding a fixed-length window over content of the document, wherein the content of the document encompassed by the window starting at a particular position in the document defines a particular segment in the document, wherein sliding the fixed-length window over the content of the document comprises skipping space characters and characters that would result in a last character of the window splitting a word;
  
  for each of the plurality of segments in each of the documents, determining a respective count of occurrences of the one or more words of the query that occur in the segment;
  
  for each of the documents, ranking the segments of the document based on the respective counts of the segments; and
  
  identifying one or more highest ranked segments for each of the documents as a query-relevant part of the document.
- Dependent claims:
  - 11. The system of claim 10, wherein the operations further comprise sending the search results that identify the documents and the query-relevant parts of the documents to a computing device in a response to the query.
  - 12. The system of claim 10 wherein each segment is a sentence or a paragraph.
  - 13. The system of claim 10 wherein sliding the fixed-length window over the content of the document comprises sliding a fixed-length window over content of the document a character at a time.
  - 14. The system of claim 10 wherein determining a respective count of occurrences of the one or more words of the query that occur in the segment comprises counting each occurrence of any of the one or more words of the query that occur in the segment.
  - 15. The system of claim 10 wherein determining a respective count of occurrences of the one or more words of the query that occur in the segment comprises counting only a first occurrence of any of the one or more words of the query that occur in the segment.

16. A non-transitory storage medium having instructions stored thereon that, when executed by one or more computers, cause the one or more computers to perform operations comprising:
- receiving a plurality of search results responsive to a query, wherein the query includes one or more words, wherein each search result identifies a respective document that comprises a plurality of segments, wherein each segment is a distinct sequence of consecutive characters in the respective document;
  
  identifying the plurality of segments for each document by sliding a fixed-length window over content of the document, wherein the content of the document encompassed by the window starting at a particular position in the document defines a particular segment in the document, wherein sliding the fixed-length window over the content of the document comprises skipping space characters and characters that would result in a last character of the window splitting a word;
  
  for each of the plurality of segments in each of the documents, determining a respective count of occurrences of the one or more words of the query that occur in the segment;
  
  for each of the documents, ranking the segments of the document based on the respective counts of the segments; and
  
  identifying one or more highest ranked segments for each of the documents as a query-relevant part of the document.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Gomes, Benedict A.; Smith, Benjamin T.
Primary Examiner(s): LY, CHEYNE D
Application Number: US13/540,450
Time in Patent Office: 330 Days
Field of Search: 707/728, 707/692, 707/721, 707/737, 707/748
US Class Current: 707/728
CPC Class Codes:
G06F 16/338   Presentation of query results
G06F 16/38   Retrieval characterised by ...
G06F 40/226   Validation
G06F 40/232   Orthographic correction, e....
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access
Y10S 707/99937   Sorting
