Detecting query-specific duplicate documents
First Claim
1. A computer-implemented method comprising:
receiving a plurality of search results responsive to a query, wherein the query includes one or more words, wherein each search result identifies a respective document that comprises a plurality of segments, wherein each segment is a distinct sequence of consecutive characters in the respective document;
identifying the plurality of segments for each document by sliding a fixed-length window over content of the document, wherein the content of the document encompassed by the window starting at a particular position in the document defines a particular segment in the document, wherein sliding the fixed-length window over the content of the document comprises skipping space characters and characters that would result in a last character of the window splitting a word;
for each of the plurality of segments in each of the documents, determining a respective count of occurrences of the one or more words of the query that occur in the segment;
for each of the documents, ranking the segments of the document based on the respective counts of the segments; and
identifying one or more highest ranked segments for each of the documents as a query-relevant part of the document.
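The steps recited in the claim above can be sketched in code. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`segments`, `query_word_count`, `query_relevant_parts`), the default window length, and `top_k` are hypothetical. The skipping rule is read here as "drop any window that starts on a space character, and drop any window whose last character falls inside a word"; the claim language may support other readings.

```python
import re
from typing import List, Tuple


def segments(content: str, window: int) -> List[Tuple[int, str]]:
    """Return (start, text) for each fixed-length window that is kept."""
    kept = []
    for start in range(0, len(content) - window + 1):
        end = start + window  # index one past the window's last character
        if content[start].isspace():
            continue  # assumed reading: skip windows starting on a space
        # Assumed reading: the last character "splits a word" when it and
        # the character right after the window are both non-space.
        if (end < len(content)
                and not content[end - 1].isspace()
                and not content[end].isspace()):
            continue
        kept.append((start, content[start:end]))
    return kept


def query_word_count(segment: str, query_words: List[str]) -> int:
    """Count occurrences of any query word in the segment (case-insensitive)."""
    wanted = {w.lower() for w in query_words}
    tokens = re.findall(r"\w+", segment.lower())
    return sum(1 for t in tokens if t in wanted)


def query_relevant_parts(content: str, query_words: List[str],
                         window: int = 20, top_k: int = 1) -> List[str]:
    """Rank segments by query-word count and return the top_k segments."""
    scored = [(query_word_count(text, query_words), start, text)
              for start, text in segments(content, window)]
    # Highest count first; ties broken by earliest start position.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored[:top_k]]
```

Note that a window may still begin in the middle of a word in this sketch, because the claim only recites a constraint on the window's last character; a partial leading word is therefore tokenized as-is.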
Abstract
An improved duplicate detection technique that uses query-relevant information to limit the portion(s) of documents to be compared for similarity is described. Before comparing two documents for similarity, the content of these documents may be condensed based on the query. In one embodiment, query-relevant information or text (also referred to as “snippets”) is extracted from the documents and only the extracted snippets, rather than the entire documents, are compared for purposes of determining similarity.
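To illustrate the idea of comparing only the extracted snippets, the sketch below scores two snippets with Jaccard similarity over their word sets. The abstract does not name a similarity measure, so Jaccard, the 0.9 threshold, and the function names are assumptions chosen purely for illustration.

```python
from typing import Set


def word_set(text: str) -> Set[str]:
    """Lowercased set of whitespace-separated words in the text."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' word sets (assumed measure)."""
    sa, sb = word_set(a), word_set(b)
    if not sa and not sb:
        return 1.0  # two empty snippets are treated as identical
    return len(sa & sb) / len(sa | sb)


def are_query_duplicates(snippet_a: str, snippet_b: str,
                         threshold: float = 0.9) -> bool:
    """Treat two documents as duplicates for this query when their
    query-relevant snippets are similar enough (threshold is hypothetical)."""
    return jaccard(snippet_a, snippet_b) >= threshold
```

Because only the snippets are compared, two pages that differ in navigation or footer text but share the same query-relevant passage would score as duplicates for that query, while scoring differently for a query that surfaces other passages.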
22 Citations
16 Claims
10. A system comprising:
one or more computers comprising one or more processors programmed to perform operations comprising:
receiving a plurality of search results responsive to a query, wherein the query includes one or more words, wherein each search result identifies a respective document that comprises a plurality of segments, wherein each segment is a distinct sequence of consecutive characters in the respective document;
identifying the plurality of segments for each document by sliding a fixed-length window over content of the document, wherein the content of the document encompassed by the window starting at a particular position in the document defines a particular segment in the document, wherein sliding the fixed-length window over the content of the document comprises skipping space characters and characters that would result in a last character of the window splitting a word;
for each of the plurality of segments in each of the documents, determining a respective count of occurrences of the one or more words of the query that occur in the segment;
for each of the documents, ranking the segments of the document based on the respective counts of the segments; and
identifying one or more highest ranked segments for each of the documents as a query-relevant part of the document.
16. A non-transitory storage medium having instructions stored thereon that, when executed by one or more computers, cause the one or more computers to perform operations comprising:
receiving a plurality of search results responsive to a query, wherein the query includes one or more words, wherein each search result identifies a respective document that comprises a plurality of segments, wherein each segment is a distinct sequence of consecutive characters in the respective document;
identifying the plurality of segments for each document by sliding a fixed-length window over content of the document, wherein the content of the document encompassed by the window starting at a particular position in the document defines a particular segment in the document, wherein sliding the fixed-length window over the content of the document comprises skipping space characters and characters that would result in a last character of the window splitting a word;
for each of the plurality of segments in each of the documents, determining a respective count of occurrences of the one or more words of the query that occur in the segment;
for each of the documents, ranking the segments of the document based on the respective counts of the segments; and
identifying one or more highest ranked segments for each of the documents as a query-relevant part of the document.
Specification