Method and apparatus for indexing document content and content comparison with World Wide Web search service
Abstract
Methods and related systems for indexing the contents of documents for comparison with the contents of other documents to identify matching content. A method for comparing the contents of a query document to the content on the World Wide Web is set forth. The contents of a query document are indexed and compared to content from the World Wide Web which is continuously retrieved and indexed. The method for indexing may comprise selecting substrings from the document, hashing the substrings to generate a plurality of hash values having a known range of values, selecting certain hash values to save from the generated hash values, and sorting the saved hash values. Methods for selecting certain hash values to save are set forth.
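The indexing steps named in the abstract (select substrings, hash them into a known range, keep some of the hash values, sort what is kept) can be sketched as follows. This is a minimal illustration, not the patented implementation: the sliding-window width, the SHA-1-based hash, the 32-bit range, and the rule of keeping hash values divisible by `keep_every` are all assumptions made for the example, since the abstract does not specify them.

```python
import hashlib

HASH_RANGE = 2**32  # assumed known range of hash values: [0, 2**32)


def hash_substring(s: str) -> int:
    """Map a substring to an integer in [0, HASH_RANGE)."""
    digest = hashlib.sha1(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # first 4 bytes give a value below 2**32


def index_document(text: str, width: int = 20, keep_every: int = 8) -> list[int]:
    """Return a sorted list of selected hash values for `text`."""
    # Normalise case and whitespace so trivial edits do not change the hashes.
    normalized = " ".join(text.lower().split())
    # Select substrings: every overlapping window of `width` characters.
    n_windows = max(len(normalized) - width + 1, 1)
    substrings = (normalized[i:i + width] for i in range(n_windows))
    # Hash each substring, then keep only hashes divisible by `keep_every`
    # (an assumed selection rule, not necessarily one the patent sets forth).
    selected = {h for h in map(hash_substring, substrings) if h % keep_every == 0}
    # Sort the saved hash values so two indexes can be compared by merging.
    return sorted(selected)
```

Because selection depends only on the hash value, two documents that share a passage keep the same hash values for it, which is what makes the saved subsets comparable.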
22 Claims
1. A method for comparing the contents of a query document to the content on the World Wide Web, the method comprising:
(a) indexing the contents of a query document;
(b) retrieving content from the World Wide Web;
(c) indexing said content from the World Wide Web;
(d) comparing said World Wide Web index to said query document index; and
(e) continuously repeating steps (b) through (d) for different content from the World Wide Web.
Dependent claims: 2, 3, 4, 5, 9, 10, 11, 12.
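Steps (a) through (e) of claim 1 can be pictured as a loop over retrieved pages. The sketch below is only an illustration under stated assumptions: `index_text`, `compare_to_web`, the 20-character window, the set-overlap measure, and the `threshold` parameter are hypothetical names and choices, and retrieval is represented by any iterable of `(url, text)` pairs rather than a real crawler.

```python
import hashlib
from typing import Iterable, Iterator


def index_text(text: str, width: int = 20) -> set[int]:
    """Hypothetical indexer: hash every `width`-character window of the text."""
    normalized = " ".join(text.lower().split())
    n_windows = max(len(normalized) - width + 1, 1)
    windows = (normalized[i:i + width] for i in range(n_windows))
    return {
        int.from_bytes(hashlib.sha1(w.encode("utf-8")).digest()[:4], "big")
        for w in windows
    }


def compare_to_web(
    query_text: str,
    pages: Iterable[tuple[str, str]],
    threshold: float = 0.2,
) -> Iterator[tuple[str, float]]:
    """Yield (url, overlap) for each retrieved page sharing enough content."""
    query_index = index_text(query_text)         # step (a)
    for url, page_text in pages:                 # step (b): content retrieved elsewhere
        page_index = index_text(page_text)       # step (c)
        shared = len(query_index & page_index)   # step (d)
        overlap = shared / len(query_index)      # fraction of query hashes found
        if overlap >= threshold:
            yield url, overlap
        # step (e): the loop continues with the next retrieved page
```

Passing a generator that keeps producing pages makes the comparison run for as long as new content arrives, which is one way to read "continuously repeating".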

6. The method of claim 5, wherein said memory structure is a signature file.

7. The method of claim 6, wherein said step of creating a signature file which summarizes the selected hash values saved from a query document comprises:
creating a bit array in memory;
initializing all bit positions in said bit array to a prescribed logical value;
identifying bit positions in said bit array by applying a series of hash functions to each hash value in the selected hash values from the query document; and
setting said identified bit positions in said bit array to the opposite value of said previously prescribed logical value.
Dependent claim: 8.
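The bit-array construction in claim 7 resembles what is commonly called a Bloom filter. A minimal sketch follows, assuming the prescribed initial value is 0, that the "series of hash functions" is obtained by salting SHA-256 with an index, and that the array size and number of hash functions are free parameters. The class name `SignatureFile` and the `might_contain` lookup are hypothetical additions for illustration; the claim itself only describes building the array.

```python
import hashlib
from typing import Iterator


class SignatureFile:
    """Bit array summarising a set of saved hash values (Bloom-filter style)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # Create the bit array; a fresh bytearray has every bit set to 0.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: int) -> Iterator[int]:
        # A series of hash functions, derived by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode("ascii")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: int) -> None:
        # Set each identified position to 1, the opposite of the initial value.
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: int) -> bool:
        # True when every position is set; false positives are possible,
        # false negatives are not.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )
```

A structure like this lets a candidate hash value be rejected quickly when any of its bit positions is still 0, at the cost of occasional false positives that a full index comparison would have to resolve.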

13. A system for detecting partially or wholly duplicated documents on the World Wide Web comprising:
a plurality of servers, each server of said plurality of servers containing the indexed contents of a plurality of URLs; and
a user interface for querying said indexed contents on said plurality of servers.
Dependent claim: 14.

15. A method for comparing the contents of a query document to the content on the World Wide Web, the method comprising:
(a) indexing the contents of a plurality of URLs from the World Wide Web;
(b) storing said index of contents of a plurality of URLs from the World Wide Web on a plurality of servers;
(c) indexing the contents of a query document; and
(d) comparing said query document index to said index of contents of the World Wide Web.
Dependent claims: 16, 17, 18, 19, 20, 21, 22.
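The arrangement in claims 13 and 15, where URL contents are indexed ahead of time, spread over several servers, and then queried, can be simulated in memory. Everything named here is an assumption for illustration: `IndexServer` stands in for a real server, URLs are assigned to servers by a CRC-32 checksum, the per-server index maps each hash value to the URLs containing it, and the comparison simply counts shared hash values.

```python
import hashlib
import zlib
from collections import defaultdict


def index_text(text: str, width: int = 20) -> set[int]:
    """Hypothetical indexer: hash every `width`-character window of the text."""
    normalized = " ".join(text.lower().split())
    n_windows = max(len(normalized) - width + 1, 1)
    windows = (normalized[i:i + width] for i in range(n_windows))
    return {
        int.from_bytes(hashlib.sha1(w.encode("utf-8")).digest()[:4], "big")
        for w in windows
    }


class IndexServer:
    """Stand-in for one server: maps each hash value to the URLs containing it."""

    def __init__(self) -> None:
        self.postings: dict[int, set[str]] = defaultdict(set)

    def store(self, url: str, hashes: set[int]) -> None:
        for h in hashes:
            self.postings[h].add(url)

    def match_counts(self, query_hashes: set[int]) -> dict[str, int]:
        """Count, per URL on this server, how many query hashes it shares."""
        counts: dict[str, int] = defaultdict(int)
        for h in query_hashes:
            for url in self.postings.get(h, ()):
                counts[url] += 1
        return dict(counts)


def build_servers(pages: dict[str, str], num_servers: int) -> list[IndexServer]:
    servers = [IndexServer() for _ in range(num_servers)]
    for url, text in pages.items():
        # Assign each URL to exactly one server using a stable checksum.
        shard = zlib.crc32(url.encode("utf-8")) % num_servers
        servers[shard].store(url, index_text(text))    # steps (a) and (b)
    return servers


def query_servers(servers: list[IndexServer], query_text: str) -> dict[str, int]:
    query_hashes = index_text(query_text)               # step (c)
    merged: dict[str, int] = {}
    for server in servers:                               # step (d)
        merged.update(server.match_counts(query_hashes))
    return merged
```

Since each URL lives on one server only, the per-server results never collide and can be merged with a plain dictionary update; a real deployment would query the servers in parallel rather than in a loop.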
Specification