SYSTEM AND METHOD FOR ONLINE DUPLICATE DETECTION AND ELIMINATION IN A WEB CRAWLER

  • US 20080235163A1
  • Filed: 03/22/2007
  • Published: 09/25/2008
  • Est. Priority Date: 03/22/2007
  • Status: Abandoned Application
First Claim

1. A method comprising:

    following at least one link contained in a first document to locate a plurality of second documents, wherein said first document and said second documents are accessible through a computerized network;

    parsing each of said second documents into content and location information;

    hashing said content to produce a content file for each of said second documents;

    hashing said location information to produce a location file for each of said second documents;

    combining said content file and said location file into a combination file for each of said second documents to produce a plurality of combination files;

    comparing said combination files to identify duplicate second documents;

    eliminating said duplicate second documents;

    storing ones of said second documents that are not duplicate second documents;

    indexing said ones of said second documents that are stored; and

    performing data mining upon said ones of said second documents that are stored.
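
The parsing, hashing, combining, comparing, and eliminating steps recited above can be illustrated with a minimal Python sketch. The claim does not name a hash function, a way of combining the two hashes, or a comparison rule, so the choices here (SHA-256, string concatenation, exact-match lookup in a set) are assumptions made for illustration only, as are the names `Document`, `digest`, `combination_key`, and `eliminate_duplicates`. Crawling, storing, indexing, and data mining are left out.

```python
import hashlib
from typing import Iterable, List, NamedTuple, Set


class Document(NamedTuple):
    """A fetched document already parsed into its two parts (hypothetical type)."""
    location: str  # location information, e.g. the URL the crawler followed
    content: str   # content extracted from the document


def digest(text: str) -> str:
    # SHA-256 is an assumption; the claim only says "hashing".
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def combination_key(doc: Document) -> str:
    # Hash content and location separately, then join the two digests.
    # Concatenation with ":" is an assumed way of "combining" them.
    content_hash = digest(doc.content)
    location_hash = digest(doc.location)
    return content_hash + ":" + location_hash


def eliminate_duplicates(docs: Iterable[Document]) -> List[Document]:
    # Keep the first document seen for each combination key and drop the rest.
    # Exact equality of keys is an assumed comparison rule.
    seen: Set[str] = set()
    kept: List[Document] = []
    for doc in docs:
        key = combination_key(doc)
        if key in seen:
            continue  # duplicate: same content hash and same location hash
        seen.add(key)
        kept.append(doc)
    return kept


crawled = [
    Document("http://example.com/a", "hello world"),
    Document("http://example.com/a", "hello world"),  # repeat fetch of the same page
    Document("http://example.com/b", "hello world"),  # same content, other location
]
unique = eliminate_duplicates(crawled)
```

Under these assumptions, two documents count as duplicates only when both the content hash and the location hash match, so the third document in `crawled` is kept even though its content repeats. A different combining or comparison rule would change that outcome.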
