Method and system for detecting duplicate documents in web crawls

US 6,547,829 B1
Filed: 06/30/1999
Issued: 04/15/2003
Est. Priority Date: 06/30/1999
Status: Expired due to Term

First Claim

1. A computer-based method for use in crawling a computer-readable document store, and particularly for detecting duplicate documents during a crawl so as to avoid unnecessarily retrieving and processing such duplicates, comprising the following acts:

(a) obtaining from the document store a content identifier (CID) corresponding to a particular document, wherein the CID is characterized in that;

(1) the CID can be fetched independently of the document itself, (2) the CID uniquely identifies the physical document in that no two different documents would have equal CIDs, and (3) the same document accessible through different URLs would have the same CID;

(b) determining whether the value of the CID is the same as the value of a previously obtained CID corresponding to another document; and

(c) if the value of the CID is not the same as the value of a previously obtained CID, fetching the particular document from the document store.

Abstract

A Web crawler application takes advantage of a document store's ability to provide a content identifier (CID) having a value that is a unique function of the physical storage location of a data object or document, such as a Web page. In operation, the crawler first tries to fetch the CID for a document. If the CID attribute is not supported by the document store, the crawler fetches the document, filters it to obtain a hash function, and commits the document to an index if the hash function is not present in a history table. If the CID is available from the document store, the CID is fetched from the document store. The crawler then determines whether the CID is present in the history table, which indicates whether an identical copy of the document in question has already been indexed under a different URL. If the CID is present, indicating that the document has already been indexed, the new URL is placed in the history file but the document itself is not retrieved from the document store, nor is it filtered again to obtain a CID. If the CID is not present in the history table, the full document is retrieved and indexed. The CID data structure is an extension of a known globally unique ID (GUID). Whereas the GUID is a 16-byte number, the CID comprises a 16-byte GUID plus an additional 6-byte number.
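The decision flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `DemoStore`, `get_cid`, `fetch`, and `crawl` are hypothetical names, the history table and index are plain dictionaries, and SHA-256 stands in for the unspecified content hash used when the store does not support CIDs.

```python
import hashlib


class DemoStore:
    """Hypothetical in-memory document store; not an API from the patent."""

    def __init__(self, docs, cids=None):
        self.docs = docs          # URL -> document bytes
        self.cids = cids or {}    # URL -> CID bytes; empty if CIDs are unsupported
        self.fetches = 0          # counts full document retrievals

    def get_cid(self, url):
        # Returning None models "CID attribute not supported by the store".
        return self.cids.get(url)

    def fetch(self, url):
        self.fetches += 1
        return self.docs[url]


def crawl(store, urls):
    history = {}  # URL -> identifier (CID or content hash)
    index = {}    # identifier -> document committed to the index
    for url in urls:
        cid = store.get_cid(url)
        if cid is not None:
            key = ("cid", cid)
            if key in index:
                # Known CID: record the new URL, skip the fetch entirely.
                history[url] = key
                continue
            doc = store.fetch(url)
        else:
            # Fallback: the document must be fetched before it can be hashed.
            doc = store.fetch(url)
            key = ("hash", hashlib.sha256(doc).digest())
            if key in index:
                history[url] = key
                continue
        history[url] = key
        index[key] = doc
    return history, index


store = DemoStore(
    {"/a": b"x", "/alias": b"x", "/b": b"y"},
    {"/a": b"C1", "/alias": b"C1", "/b": b"C2"},
)
history, index = crawl(store, ["/a", "/alias", "/b"])
```

In this sketch, a store that supports CIDs lets the crawler skip the retrieval of `/alias` altogether, whereas the hash fallback still detects the duplicate but only after paying for the full fetch.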

22 Claims

1. A computer-based method for use in crawling a computer-readable document store, and particularly for detecting duplicate documents during a crawl so as to avoid unnecessarily retrieving and processing such duplicates, comprising the following acts:
- (a) obtaining from the document store a content identifier (CID) corresponding to a particular document, wherein the CID is characterized in that;
  
  (1) the CID can be fetched independently of the document itself, (2) the CID uniquely identifies the physical document in that no two different documents would have equal CIDs, and (3) the same document accessible through different URLs would have the same CID;
  
  (b) determining whether the value of the CID is the same as the value of a previously obtained CID corresponding to another document; and
  
  (c) if the value of the CID is not the same as the value of a previously obtained CID, fetching the particular document from the document store.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13):
  - 2. A method as recited in claim 1, wherein the CID is a number that has a prescribed format and is globally unique.
  - 3. A method as recited in claim 2, wherein the CIDs of any two different documents will have different values.
  - 4. A method as recited in claim 3, wherein the CID is generated as a value which is a function of the physical storage location of the document.
  - 5. A method as recited in claim 4, wherein the CID of a document that is copied from a first storage location to a second storage location remains unchanged if the document is unmodified.
  - 6. A method as recited in claim 1, wherein the CID is obtained from the document store by querying the document store with the address specifier of the particular document.
  - 7. A method as recited in claim 1, further comprising indexing the particular document after it has been fetched from the document store.
  - 8. A method as recited in claim 1, further comprising, if the value of the CID is the same as the value of a previously obtained CID, storing the address specifier of the particular document in a history table, without fetching the particular document from the document store.
  - 9. A method as recited in claim 1, wherein the method is executed by a server computer coupled by a network to the document store.
  - 10. A method as recited in claim 1, wherein the method is employed in connection with a Web crawler application.
  - 11. A method as recited in claim 1, wherein the method is employed in connection with a mail server application.
  - 12. A method as recited in claim 1, wherein the method is employed in connection with a directory service.
  - 13. A method as recited in claim 1, wherein the method is employed in connection with a system requiring indexing or one-way replication of data, to optimize replication by not copying duplicate data.

14. A Web crawling method, comprising:
- providing a history table containing URLs of documents that have been indexed during a previous crawl, and content identifiers (CIDs) for such documents;
  
  for a first URL encountered during an incremental crawl, fetching from a document store a CID for the document corresponding to the first URL;
  
  determining whether a CID having the same value as the one just obtained from the document store exists in the history table;
  
  if a CID having the same value is not present in the history table, performing the following acts;
  
  (1) fetching the document corresponding to the first URL from the document store;
  
  (2) committing the first URL and CID to the history table; and
  
  (3) committing the document corresponding to the first URL to an index; and
  
  if a CID having the same value is present in the history table, committing the first URL to the history table.
- Dependent claims (15, 16, 17, 18):
  - 15. A method as recited in claim 14, wherein the CID comprises a data structure that is an extension of a globally unique identifier (GUID).
  - 16. A method as recited in claim 15, wherein the CID data structure includes (1) a 60-bit system time; (2) a 4-bit version number; (3) a 16-bit clock sequence 48; and (4) a 48-bit network address; and (5) a local counter value.
  - 17. A method as recited in claim 16, wherein the local counter value is a six-byte number.
  - 18. A computer-readable storage medium containing computer executable code for instructing a computer to carry out the steps recited in claim 14.
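
The field widths recited in claims 16 and 17 add up to 22 bytes: 60 + 4 + 16 + 48 bits form the 16-byte GUID portion, and the six-byte local counter follows. A minimal sketch of packing and unpacking such a value is shown below. The field names, the big-endian byte order, and the order in which the fields are concatenated are assumptions made for illustration; the claims state only the widths.

```python
# (name, width in bits); names and ordering are illustrative assumptions.
FIELDS = (
    ("system_time", 60),
    ("version", 4),
    ("clock_seq", 16),
    ("node", 48),      # network address
    ("counter", 48),   # six-byte local counter (claim 17)
)
CID_BYTES = sum(width for _, width in FIELDS) // 8  # 176 bits -> 22 bytes


def pack_cid(**values) -> bytes:
    """Concatenate the fields, most significant first, into 22 bytes."""
    n = 0
    for name, width in FIELDS:
        v = values[name]
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        n = (n << width) | v
    return n.to_bytes(CID_BYTES, "big")


def unpack_cid(raw: bytes) -> dict:
    """Split a 22-byte CID back into its named fields."""
    if len(raw) != CID_BYTES:
        raise ValueError("a CID in this sketch is exactly 22 bytes")
    n = int.from_bytes(raw, "big")
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = n & ((1 << width) - 1)
        n >>= width
    return out


example = dict(system_time=123456789, version=1, clock_seq=42,
               node=0x0123456789AB, counter=7)
raw = pack_cid(**example)
```

Because two CIDs are compared only for equality during a crawl, the packed bytes can be used directly as a key in the history table without decoding the individual fields.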

19. A computer system comprising:
- a server computer;
  
  a document store operatively coupled to the server computer, wherein the document store contains a plurality of electronic documents, and wherein the document store provides content identifiers (CIDs) for documents in the document store, wherein the CID is characterized in that;
  
  (1) the CID can be fetched independently of the document itself, (2) the CID uniquely identifies the physical document in that no two different documents would have equal CIDs, and (3) the same document accessible through different URLs would have the same CID;
  
  a computer readable storage medium operatively coupled to the server computer; and
  
  a computer-executable crawler application stored on the computer readable storage medium, wherein the crawler application is provided with the CIDs of selected documents on request.
- Dependent claims (20, 21, 22):
  - 20. A system as recited in claim 19, wherein the crawler application, when executed by the server, causes the following acts to be carried out by the server:
  - 21. A system as recited in claim 20, wherein the server computer comprises a member of a group consisting of a Web server, a mail server, a file server and a database server.
  - 22. A system as recited in claim 19, wherein each CID has a value which is a function of the physical storage location of the document to which it relates.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Meyerzon, Dmitriy; Terek, F. Soner; Shoroff, Srikanth; Norin, Scott
Primary Examiner(s): Feild, Joseph H.
Assistant Examiner(s): Desai, Rachna Singh
Application Number: US09/343,511
Time in Patent Office: 1,385 Days
Field of Search: 707/500.1, 707/3, 707/2, 707/10, 707/104.1
US Class Current: 715/234
CPC Class Codes:
G06F 16/951   Indexing; Web crawling tech...
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99945   Object-oriented database st...
