SYSTEM AND METHOD FOR ONLINE DUPLICATE DETECTION AND ELIMINATION IN A WEB CRAWLER
Abstract
As part of the normal crawling process, a crawler parses a page and computes a de-tagged hash, called a fingerprint, of the page content. A lookup structure of tuples consisting of the host hash (the hash of the host portion of the URL) and the page fingerprint is maintained. Before the crawler writes a page to a store, this lookup structure is consulted. If the lookup structure already contains the tuple (i.e., host hash and fingerprint), then the page is not written to the store. Thus, many duplicates are eliminated at the crawler itself, saving the CPU and disk cycles that would otherwise be spent on downstream duplicate-elimination processes.
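The lookup described in the Abstract can be pictured as a small in-memory set keyed by the (host hash, fingerprint) tuple, consulted before each write to the store. The following is only an illustrative Python sketch: the SHA-1 digests, the regex-based de-tagging, and the helper names (fingerprint, host_hash, maybe_store) are assumptions made for the example, not details prescribed by the patent.

```python
import hashlib
import re
from urllib.parse import urlparse


def fingerprint(html: str) -> str:
    """De-tag the page and hash the remaining content (the 'fingerprint')."""
    text = re.sub(r"<[^>]*>", "", html)   # crude tag removal, for illustration only
    text = " ".join(text.split())          # normalize whitespace before hashing
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def host_hash(url: str) -> str:
    """Hash only the host portion of the URL."""
    return hashlib.sha1(urlparse(url).netloc.encode("utf-8")).hexdigest()


# The lookup structure: a set of (host hash, fingerprint) tuples already seen.
seen = set()


def maybe_store(url: str, html: str, store: dict) -> bool:
    """Consult the lookup structure before writing; skip pages whose tuple is present."""
    key = (host_hash(url), fingerprint(html))
    if key in seen:
        return False        # duplicate within the same host: not written to the store
    seen.add(key)
    store[url] = html
    return True
```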
Claims
1. A method comprising:
following at least one link contained in a first document to locate a plurality of second documents, wherein said first document and said second documents are accessible through a computerized network;
parsing each of said second documents into content and location information;
hashing said content to produce a content file for each of said second documents;
hashing said location information to produce a location file for each of said second documents;
combining said content file and said location file into a combination file for each of said second documents to produce a plurality of combination files;
comparing said combination files to identify duplicate second documents;
eliminating said duplicate second documents;
storing ones of said second documents that are not duplicate second documents;
indexing said ones of said second documents that are stored; and
performing data mining upon said ones of said second documents that are stored.
Dependent claims: 2, 3, 4, 5, 6.
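As an illustration of the steps recited in claim 1, the following Python sketch hashes content and location information separately, concatenates the two digests into a "combination file," and compares those keys to eliminate duplicates before storage. The digest algorithm, the decision to hash the full location string (the Abstract uses only the host portion of the URL), and the function names are assumptions made for this example, not the claimed implementation.

```python
import hashlib


def combination_key(content: str, location: str) -> bytes:
    """Hash content and location separately, then combine the two digests."""
    content_digest = hashlib.sha1(content.encode("utf-8")).digest()    # the 'content file'
    location_digest = hashlib.sha1(location.encode("utf-8")).digest()  # the 'location file'
    return content_digest + location_digest                            # the 'combination file'


def deduplicate(second_documents):
    """second_documents: iterable of (content, location_information) pairs from the parser."""
    kept, seen_keys = [], set()
    for content, location in second_documents:
        key = combination_key(content, location)
        if key in seen_keys:     # comparing combination files identifies duplicates
            continue             # the duplicate second document is eliminated
        seen_keys.add(key)
        kept.append((content, location))
    return kept                  # non-duplicates are then stored, indexed, and mined
```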
7. A method comprising:
following at least one link contained in a first web page to locate a plurality of second web pages, wherein said first web page and said second web pages are accessible through the Internet;
parsing each of said second web pages into content and location information;
hashing said content to produce a content file for each of said second web pages;
hashing said location information to produce a location file for each of said second web pages;
combining said content file and said location file into a combination file for each of said second web pages to produce a plurality of combination files;
comparing said combination files to identify duplicate second web pages;
eliminating said duplicate second web pages, comprising eliminating duplicate custom error web pages, wherein said duplicate custom error web pages comprise a similar content, a similar content provider, and a different uniform resource locator (URL);
storing ones of said second web pages that are not duplicate second web pages;
indexing said ones of said second web pages that are stored; and
performing data mining upon said ones of said second web pages that are stored.
Dependent claims: 8, 9, 10.
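Claim 7's handling of duplicate custom error web pages (similar content, similar content provider, different URL) can be illustrated by keying the comparison on the host and content alone, so that error pages served under many URLs of the same host collapse to a single stored copy. The snippet below is a self-contained sketch under that assumption; the example URLs, the dedup_key helper, and the SHA-1 digests are illustrative only.

```python
import hashlib
from urllib.parse import urlparse


def dedup_key(url: str, content: str) -> tuple:
    # (host hash, content fingerprint): the URL path and query are deliberately ignored,
    # so custom error pages with the same content and host but different URLs collide.
    return (hashlib.sha1(urlparse(url).netloc.encode("utf-8")).hexdigest(),
            hashlib.sha1(" ".join(content.split()).encode("utf-8")).hexdigest())


pages = [
    ("http://shop.example.com/discontinued-item", "Sorry, that page was not found."),
    ("http://shop.example.com/misspelled-url",    "Sorry, that page was not found."),
    ("http://blog.example.org/missing-post",      "Sorry, that page was not found."),
]

stored, seen = {}, set()
for url, text in pages:
    key = dedup_key(url, text)
    if key not in seen:          # the first occurrence per (host, content) is kept
        seen.add(key)
        stored[url] = text

# stored now holds one shop.example.com error page and the blog.example.org page;
# the second shop.example.com URL was eliminated as a duplicate custom error web page.
```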
11. A system comprising:
a browser adapted to follow at least one link contained in a first document to locate a plurality of second documents, wherein said first document and said second documents are accessible through a computerized network;
a parser operatively connected to said browser, wherein said parser is adapted to parse each of said second documents into content and location information;
a hasher operatively connected to said parser, wherein said hasher is adapted to hash said content to produce a content file for each of said second documents, and wherein said hasher is adapted to hash said location information to produce a location file for each of said second documents;
a processor operatively connected to said hasher, wherein said processor is adapted to combine said content file and said location file into a combination file for each of said second documents to produce a plurality of combination files;
a comparator operatively connected to said processor, wherein said comparator is adapted to compare said combination files to identify duplicate second documents;
a filter operatively connected to said comparator, wherein said filter is adapted to eliminate said duplicate second documents;
a memory operatively connected to said filter, wherein said memory is adapted to store ones of said second documents that are not duplicate second documents;
an indexer operatively connected to said memory, wherein said indexer is adapted to index said ones of said second documents that are stored; and
a data miner operatively connected to said indexer, wherein said data miner is adapted to perform data mining upon said ones of said second documents that are stored.
Dependent claims: 12, 13, 14, 15, 16.
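One way to picture the system of claim 11 is as a pipeline of small components, each playing one of the recited roles (parser, hasher, processor, comparator/filter, memory). The class names, the dictionary-based "files," and the SHA-1/host-hash choices below are assumptions made for illustration rather than the claimed hardware or software arrangement; the indexer and data miner are only noted in comments as consumers of the store.

```python
import hashlib
from urllib.parse import urlparse


class Parser:
    """Splits a fetched document into content and location information."""
    def parse(self, url, html):
        return {"location": url, "content": " ".join(html.split())}


class Hasher:
    """Produces the content file and the location file (here, hex digests)."""
    def hash(self, doc):
        doc["content_file"] = hashlib.sha1(doc["content"].encode("utf-8")).hexdigest()
        doc["location_file"] = hashlib.sha1(
            urlparse(doc["location"]).netloc.encode("utf-8")).hexdigest()
        return doc


class Processor:
    """Combines the content file and the location file into a combination file."""
    def combine(self, doc):
        doc["combination_file"] = doc["location_file"] + doc["content_file"]
        return doc


class Filter:
    """Comparator and filter: eliminates documents whose combination file was seen before."""
    def __init__(self):
        self.seen = set()

    def keep(self, doc):
        if doc["combination_file"] in self.seen:
            return False
        self.seen.add(doc["combination_file"])
        return True


class Memory:
    """Stores non-duplicate documents; an indexer and data miner would consume this store."""
    def __init__(self):
        self.store = {}

    def put(self, doc):
        self.store[doc["location"]] = doc


def crawl(pages):
    """pages: iterable of (url, html) pairs reached by the browser via followed links."""
    parser, hasher, processor, dedup, memory = Parser(), Hasher(), Processor(), Filter(), Memory()
    for url, html in pages:
        doc = processor.combine(hasher.hash(parser.parse(url, html)))
        if dedup.keep(doc):          # duplicates are eliminated before storage
            memory.put(doc)
    return memory.store
```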
17. A system comprising:
a browser adapted to follow at least one link contained in a first web page to locate a plurality of second web pages, wherein said first web page and said second web pages are accessible through the Internet;
a parser operatively connected to said browser, wherein said parser is adapted to parse each of said second web pages into content and location information;
a hasher operatively connected to said parser, wherein said hasher is adapted to hash said content to produce a content file for each of said second web pages, and wherein said hasher is adapted to hash said location information to produce a location file for each of said second web pages;
a processor operatively connected to said hasher, wherein said processor is adapted to combine said content file and said location file into a combination file for each of said second web pages to produce a plurality of combination files;
a comparator operatively connected to said processor, wherein said comparator is adapted to compare said combination files to identify duplicate second web pages;
a filter operatively connected to said comparator, wherein said filter is adapted to eliminate said duplicate second web pages, and wherein said filter is further adapted to eliminate duplicate custom error web pages, wherein said duplicate custom error web pages comprise a similar content, a similar content provider, and a different uniform resource locator (URL);
a memory operatively connected to said filter, wherein said memory is adapted to store ones of said second web pages that are not duplicate second web pages;
an indexer operatively connected to said memory, wherein said indexer is adapted to index said ones of said second web pages that are stored; and
a data miner operatively connected to said indexer, wherein said data miner is adapted to perform data mining upon said ones of said second web pages that are stored.
Dependent claims: 18, 19, 20.
Specification