Method and system for detecting duplicate documents in web crawls
Abstract
A Web crawler application takes advantage of a document store's ability to provide a content identifier (CID) having a value that is a unique function of the physical storage location of a data object or document, such as a Web page. In operation, the crawler first tries to fetch the CID for a document. If the CID attribute is not supported by the document store, the crawler fetches the document, filters it to obtain a hash function, and commits the document to an index if the hash function is not present in a history table. If the CID is available from the document store, the CID is fetched from the document store. The crawler then determines whether the CID is present in the history table, which indicates whether an identical copy of the document in question has already been indexed under a different URL. If the CID is present, indicating that the document has already been indexed, the new URL is placed in the history file but the document itself is not retrieved from the document store, nor is it filtered again to obtain a CID. If the CID is not present in the history table, the full document is retrieved and indexed. The CID data structure is an extension of a known globally unique ID (GUID). Whereas the GUID is a 16-byte number, the CID comprises a 16-byte GUID plus an additional 6-byte number.
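The flow summarized in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the `DocumentStore` interface, the method names `fetch_cid` and `fetch_document`, the `InMemoryStore` test double, the choice of SHA-256 as the fallback content hash, and the in-memory dictionaries standing in for the history table and index are all assumptions introduced for the example. Recording the URL of a duplicate found through the hash fallback is likewise an assumption; the abstract states that behavior only for the CID path.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol, Tuple


@dataclass(frozen=True)
class CID:
    """Content identifier: a 16-byte GUID plus an additional 6-byte number."""
    guid: bytes   # 16 bytes
    extra: bytes  # 6 bytes

    def __post_init__(self) -> None:
        if len(self.guid) != 16 or len(self.extra) != 6:
            raise ValueError("CID must be a 16-byte GUID plus a 6-byte number")

    def to_bytes(self) -> bytes:
        return self.guid + self.extra  # 22 bytes in total


class DocumentStore(Protocol):
    """Hypothetical store interface; returns None when CIDs are unsupported."""
    def fetch_cid(self, url: str) -> Optional[CID]: ...
    def fetch_document(self, url: str) -> bytes: ...


class InMemoryStore:
    """Test double that counts how often full documents are fetched."""
    def __init__(self, docs: Dict[str, Tuple[CID, bytes]], supports_cid: bool = True):
        self.docs = docs
        self.supports_cid = supports_cid
        self.document_fetches = 0

    def fetch_cid(self, url: str) -> Optional[CID]:
        return self.docs[url][0] if self.supports_cid else None

    def fetch_document(self, url: str) -> bytes:
        self.document_fetches += 1
        return self.docs[url][1]


class Crawler:
    def __init__(self, store: DocumentStore) -> None:
        self.store = store
        self.history: Dict[bytes, List[str]] = {}  # CID or hash key -> URLs seen
        self.index: Dict[str, bytes] = {}          # first URL -> document body

    def visit(self, url: str) -> bool:
        """Return True if the document was committed to the index."""
        cid = self.store.fetch_cid(url)
        if cid is not None:
            key = b"cid:" + cid.to_bytes()
            if key in self.history:
                # Duplicate: record the new URL, skip fetching the document.
                self.history[key].append(url)
                return False
            body = self.store.fetch_document(url)
        else:
            # CID unsupported: fetch the document and hash its content instead.
            body = self.store.fetch_document(url)
            key = b"hash:" + hashlib.sha256(body).digest()
            if key in self.history:
                self.history[key].append(url)  # assumption, see lead-in
                return False
        self.history[key] = [url]
        self.index[url] = body
        return True
```

With a store that supports CIDs, visiting two URLs that share a CID triggers only one document fetch; with `supports_cid=False` both documents are fetched and the duplicate is caught by the content hash instead.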
22 Claims
1. A computer-based method for use in crawling a computer-readable document store, and particularly for detecting duplicate documents during a crawl so as to avoid unnecessarily retrieving and processing such duplicates, comprising the following acts:
(a) obtaining from the document store a content identifier (CID) corresponding to a particular document, wherein the CID is characterized in that:
(1) the CID can be fetched independently of the document itself,
(2) the CID uniquely identifies the physical document in that no two different documents would have equal CIDs, and
(3) the same document accessible through different URLs would have the same CID;
(b) determining whether the value of the CID is the same as the value of a previously obtained CID corresponding to another document; and
(c) if the value of the CID is not the same as the value of a previously obtained CID, fetching the particular document from the document store.
14. A Web crawling method, comprising:
providing a history table containing URLs of documents that have been indexed during a previous crawl, and content identifiers (CIDs) for such documents;
for a first URL encountered during an incremental crawl, fetching from a document store a CID for the document corresponding to the first URL;
determining whether a CID having the same value as the one just obtained from the document store exists in the history table;
if a CID having the same value is not present in the history table, performing the following acts:
(1) fetching the document corresponding to the first URL from the document store;
(2) committing the first URL and CID to the history table; and
(3) committing the document corresponding to the first URL to an index; and
if a CID having the same value is present in the history table, committing the first URL to the history table.
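The steps of claim 14 can be illustrated with a persistent history table. The sketch below is an assumption-laden example, not the claimed implementation: the SQLite schema, the function names `open_history` and `crawl_url`, the `fetch_cid` and `fetch_document` callables, and the dictionary used as an index are all hypothetical. Storing the CID next to the URL in the duplicate case is also an assumption; the claim only requires committing the URL.

```python
import sqlite3
from typing import Callable, Dict


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a history table of URLs and their CIDs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS history ("
        "url TEXT PRIMARY KEY, cid BLOB NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS history_cid ON history (cid)")
    return conn


def crawl_url(
    conn: sqlite3.Connection,
    url: str,
    fetch_cid: Callable[[str], bytes],
    fetch_document: Callable[[str], bytes],
    index: Dict[str, bytes],
) -> bool:
    """Process one URL of an incremental crawl; return True if it was indexed."""
    cid = fetch_cid(url)  # fetched from the store without fetching the document
    seen = conn.execute(
        "SELECT 1 FROM history WHERE cid = ? LIMIT 1", (cid,)
    ).fetchone() is not None

    if not seen:
        document = fetch_document(url)                      # act (1)
        conn.execute(
            "INSERT OR REPLACE INTO history (url, cid) VALUES (?, ?)",
            (url, cid),
        )                                                   # act (2)
        index[url] = document                               # act (3)
        conn.commit()
        return True

    # CID already known: commit only the URL; the document is not fetched.
    conn.execute(
        "INSERT OR REPLACE INTO history (url, cid) VALUES (?, ?)", (url, cid)
    )
    conn.commit()
    return False
```

Because the history table survives between crawls, a URL whose CID is unchanged since the previous crawl is skipped on the next incremental pass, as is a new URL that points at an already indexed document.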
19. A computer system comprising:
a server computer;
a document store operatively coupled to the server computer, wherein the document store contains a plurality of electronic documents, and wherein the document store provides content identifiers (CIDs) for documents in the document store, wherein the CID is characterized in that:
(1) the CID can be fetched independently of the document itself,
(2) the CID uniquely identifies the physical document in that no two different documents would have equal CIDs, and
(3) the same document accessible through different URLs would have the same CID;
a computer readable storage medium operatively coupled to the server computer; and
a computer-executable crawler application stored on the computer readable storage medium, wherein the crawler application is provided with the CIDs of selected documents on request.
obtaining from the document store the CID corresponding to a particular document;
determining whether the value of the CID is the same as the value of a previously obtained CID corresponding to another document; and
if the value of the CID is not the same as the value of a previously obtained CID, fetching the particular document from the document store.
21. A system as recited in claim 20, wherein the server computer comprises a member of a group consisting of a Web server, a mail server, a file server and a database server.
22. A system as recited in claim 19, wherein each CID has a value which is a function of the physical storage location of the document to which it relates.
Specification