SYSTEM AND METHOD FOR ONLINE DUPLICATE DETECTION AND ELIMINATION IN A WEB CRAWLER
Abstract
As part of the normal crawling process, a crawler parses a page and computes a de-tagged hash, called a fingerprint, of the page content. A lookup structure of tuples consisting of the host hash (the hash of the host portion of the URL) and the page fingerprint is maintained. Before the crawler writes a page to a store, this lookup structure is consulted. If the lookup structure already contains the tuple (i.e., host hash and fingerprint), then the page is not written to the store. Thus, many duplicates are eliminated at the crawler itself, saving the CPU and disk cycles that would otherwise be spent on downstream duplicate-elimination processes.
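The lookup described in the Abstract can be pictured as a small in-memory set keyed by the (host hash, fingerprint) tuple, consulted before each write to the store. The following is only an illustrative Python sketch: the SHA-1 digests, the regex-based de-tagging, and the helper names (fingerprint, host_hash, maybe_store) are assumptions made for the example, not details prescribed by the patent.

```python
import hashlib
import re
from urllib.parse import urlparse


def fingerprint(html: str) -> str:
    """De-tag the page and hash the remaining content (the 'fingerprint')."""
    text = re.sub(r"<[^>]*>", "", html)   # crude tag removal, for illustration only
    text = " ".join(text.split())          # normalize whitespace before hashing
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def host_hash(url: str) -> str:
    """Hash only the host portion of the URL."""
    return hashlib.sha1(urlparse(url).netloc.encode("utf-8")).hexdigest()


# The lookup structure: a set of (host hash, fingerprint) tuples already seen.
seen = set()


def maybe_store(url: str, html: str, store: dict) -> bool:
    """Consult the lookup structure before writing; skip pages whose tuple is present."""
    key = (host_hash(url), fingerprint(html))
    if key in seen:
        return False        # duplicate within the same host: not written to the store
    seen.add(key)
    store[url] = html
    return True
```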
Claims
1. A method comprising:
following at least one link contained in a first document to locate a plurality of second documents, wherein said first document and said second documents are accessible through a computerized network;
parsing each of said second documents into content and location information;
hashing said content to produce a content file for each of said second documents;
hashing said location information to produce a location file for each of said second documents;
combining said content file and said location file into a combination file for each of said second documents to produce a plurality of combination files;
comparing said combination files to identify duplicate second documents;
eliminating said duplicate second documents;
storing ones of said second documents that are not duplicate second documents;
indexing said ones of said second documents that are stored; and
performing data mining upon said ones of said second documents that are stored.
Dependent claims: 2, 3, 4, 5, 6.
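As an illustration of the steps recited in claim 1, the following Python sketch hashes content and location information separately, concatenates the two digests into a "combination file," and compares those keys to eliminate duplicates before storage. The digest algorithm, the decision to hash the full location string (the Abstract uses only the host portion of the URL), and the function names are assumptions made for this example, not the claimed implementation.

```python
import hashlib


def combination_key(content: str, location: str) -> bytes:
    """Hash content and location separately, then combine the two digests."""
    content_digest = hashlib.sha1(content.encode("utf-8")).digest()    # the 'content file'
    location_digest = hashlib.sha1(location.encode("utf-8")).digest()  # the 'location file'
    return content_digest + location_digest                            # the 'combination file'


def deduplicate(second_documents):
    """second_documents: iterable of (content, location_information) pairs from the parser."""
    kept, seen_keys = [], set()
    for content, location in second_documents:
        key = combination_key(content, location)
        if key in seen_keys:     # comparing combination files identifies duplicates
            continue             # the duplicate second document is eliminated
        seen_keys.add(key)
        kept.append((content, location))
    return kept                  # non-duplicates are then stored, indexed, and mined
```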
7. A method comprising:
following at least one link contained in a first web page to locate a plurality of second web pages, wherein said first web page and said second web pages are accessible through the Internet;
parsing each of said second web pages into content and location information;
hashing said content to produce a content file for each of said second web pages;
hashing said location information to produce a location file for each of said second web pages;
combining said content file and said location file into a combination file for each of said second web pages to produce a plurality of combination files;
comparing said combination files to identify duplicate second web pages;
eliminating said duplicate second web pages, comprising eliminating duplicate custom error web pages, wherein said duplicate custom error web pages comprise a similar content, a similar content provider, and a different uniform resource locator (URL);
storing ones of said second web pages that are not duplicate second web pages;
indexing said ones of said second web pages that are stored; and
performing data mining upon said ones of said second web pages that are stored.
Dependent claims: 8, 9, 10.
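Claim 7's handling of duplicate custom error web pages (similar content, similar content provider, different URL) can be illustrated by keying the comparison on the host and content alone, so that error pages served under many URLs of the same host collapse to a single stored copy. The snippet below is a self-contained sketch under that assumption; the example URLs, the dedup_key helper, and the SHA-1 digests are illustrative only.

```python
import hashlib
from urllib.parse import urlparse


def dedup_key(url: str, content: str) -> tuple:
    # (host hash, content fingerprint): the URL path and query are deliberately ignored,
    # so custom error pages with the same content and host but different URLs collide.
    return (hashlib.sha1(urlparse(url).netloc.encode("utf-8")).hexdigest(),
            hashlib.sha1(" ".join(content.split()).encode("utf-8")).hexdigest())


pages = [
    ("http://shop.example.com/discontinued-item", "Sorry, that page was not found."),
    ("http://shop.example.com/misspelled-url",    "Sorry, that page was not found."),
    ("http://blog.example.org/missing-post",      "Sorry, that page was not found."),
]

stored, seen = {}, set()
for url, text in pages:
    key = dedup_key(url, text)
    if key not in seen:          # the first occurrence per (host, content) is kept
        seen.add(key)
        stored[url] = text

# stored now holds one shop.example.com error page and the blog.example.org page;
# the second shop.example.com URL was eliminated as a duplicate custom error web page.
```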
11. A system comprising:
a browser adapted to follow at least one link contained in a first document to locate a plurality of second documents, wherein said first document and said second documents are accessible through a computerized network;
a parser operatively connected to said browser, wherein said parser is adapted to parse each of said second documents into content and location information;
a hasher operatively connected to said parser, wherein said hasher is adapted to hash said content to produce a content file for each of said second documents, and wherein said hasher is adapted to hash said location information to produce a location file for each of said second documents;
a processor operatively connected to said hasher, wherein said processor is adapted to combine said content file and said location file into a combination file for each of said second documents to produce a plurality of combination files;
a comparator operatively connected to said processor, wherein said comparator is adapted to compare said combination files to identify duplicate second documents;
a filter operatively connected to said comparator, wherein said filter is adapted to eliminate said duplicate second documents;
a memory operatively connected to said filter, wherein said memory is adapted to store ones of said second documents that are not duplicate second documents;
an indexer operatively connected to said memory, wherein said indexer is adapted to index said ones of said second documents that are stored; and
a data miner operatively connected to said indexer, wherein said data miner is adapted to perform data mining upon said ones of said second documents that are stored.
Dependent claims: 12, 13, 14, 15, 16.
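One way to picture the system of claim 11 is as a pipeline of small components, each playing one of the recited roles (parser, hasher, processor, comparator/filter, memory). The class names, the dictionary-based "files," and the SHA-1/host-hash choices below are assumptions made for illustration rather than the claimed hardware or software arrangement; the indexer and data miner are only noted in comments as consumers of the store.

```python
import hashlib
from urllib.parse import urlparse


class Parser:
    """Splits a fetched document into content and location information."""
    def parse(self, url, html):
        return {"location": url, "content": " ".join(html.split())}


class Hasher:
    """Produces the content file and the location file (here, hex digests)."""
    def hash(self, doc):
        doc["content_file"] = hashlib.sha1(doc["content"].encode("utf-8")).hexdigest()
        doc["location_file"] = hashlib.sha1(
            urlparse(doc["location"]).netloc.encode("utf-8")).hexdigest()
        return doc


class Processor:
    """Combines the content file and the location file into a combination file."""
    def combine(self, doc):
        doc["combination_file"] = doc["location_file"] + doc["content_file"]
        return doc


class Filter:
    """Comparator and filter: eliminates documents whose combination file was seen before."""
    def __init__(self):
        self.seen = set()

    def keep(self, doc):
        if doc["combination_file"] in self.seen:
            return False
        self.seen.add(doc["combination_file"])
        return True


class Memory:
    """Stores non-duplicate documents; an indexer and data miner would consume this store."""
    def __init__(self):
        self.store = {}

    def put(self, doc):
        self.store[doc["location"]] = doc


def crawl(pages):
    """pages: iterable of (url, html) pairs reached by the browser via followed links."""
    parser, hasher, processor, dedup, memory = Parser(), Hasher(), Processor(), Filter(), Memory()
    for url, html in pages:
        doc = processor.combine(hasher.hash(parser.parse(url, html)))
        if dedup.keep(doc):          # duplicates are eliminated before storage
            memory.put(doc)
    return memory.store
```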
17. A system comprising:
a browser adapted to follow at least one link contained in a first web page to locate a plurality of second web pages, wherein said first web page and said second web pages are accessible through the Internet;
a parser operatively connected to said browser, wherein said parser is adapted to parse each of said second web pages into content and location information;
a hasher operatively connected to said parser, wherein said hasher is adapted to hash said content to produce a content file for each of said second web pages, and wherein said hasher is adapted to hash said location information to produce a location file for each of said second web pages;
a processor operatively connected to said hasher, wherein said processor is adapted to combine said content file and said location file into a combination file for each of said second web pages to produce a plurality of combination files;
a comparator operatively connected to said processor, wherein said comparator is adapted to compare said combination files to identify duplicate second web pages;
a filter operatively connected to said comparator, wherein said filter is adapted to eliminate said duplicate second web pages, and wherein said filter is further adapted to eliminate duplicate custom error web pages, wherein said duplicate custom error web pages comprise a similar content, a similar content provider, and a different uniform resource locator (URL);
a memory operatively connected to said filter, wherein said memory is adapted to store ones of said second web pages that are not duplicate second web pages;
an indexer operatively connected to said memory, wherein said indexer is adapted to index said ones of said second web pages that are stored; and
a data miner operatively connected to said indexer, wherein said data miner is adapted to perform data mining upon said ones of said second web pages that are stored.
Dependent claims: 18, 19, 20.
Specification