Method and system for incremental web crawling
Abstract
A Web crawler creates an index of documents in a document store on a computer network. In an initial crawl, the crawler creates a first full index for the document store. The first full crawl is based on a set of predefined “seed” URLs and crawl restrictions, and involves recursively retrieving each folder/document directly or indirectly linked to the seed URLs. In the process of creating the first full index, the crawler creates a History Table containing a list of URLs for each folder and document found in the first full crawl. The History Table also includes a local commit time (LCT) for each document and a deleted documents count (DDC) and LCT or maximum LCT (MLCT) for each folder (this assumes that the store supports a folder hierarchy and the MLCT, LCT and DDC properties). Thereafter, in an incremental crawl, the crawler determines, for each folder, (1) whether the DDC for that folder has changed and (2) whether the MLCT is more recent than the corresponding value in the History Table. If the DDC has changed, the crawler obtains a full list of items (URLs) in that folder, and compares the list with the URLs in the History Table to identify the deleted documents. The deleted documents are then deleted from the History Table and index. If the MLCT is more recent, the crawler queries the document store for the URLs of linked documents having a LCT more recent than the MLCT in the History Table for the folder. The History Table and index are then updated accordingly to reflect the changes to the document store.
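The two comparisons described in the abstract can be illustrated with a minimal sketch. It assumes a History Table row that stores, per folder, the DDC, the MLCT, and the set of document URLs seen at the last crawl. The names `FolderRecord`, `deleted_since_last_crawl`, and `changed_since_last_crawl`, the integer timestamps, and the example URLs are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class FolderRecord:
    """History Table row for one folder (hypothetical layout)."""
    ddc: int   # deleted documents count recorded at the last crawl
    mlct: int  # maximum local commit time recorded at the last crawl
    doc_urls: set = field(default_factory=set)  # document URLs recorded for the folder


def deleted_since_last_crawl(record, current_urls):
    """Return URLs present in the History Table but no longer listed by the store."""
    return record.doc_urls - set(current_urls)


def changed_since_last_crawl(record, current_lcts):
    """Return URLs whose LCT is later than the MLCT stored for the folder.

    current_lcts maps URL -> LCT as reported by the document store.
    """
    return {url for url, lct in current_lcts.items() if lct > record.mlct}


# Example with made-up values: one document removed, one modified, one added.
record = FolderRecord(ddc=0, mlct=100, doc_urls={"/docs/a", "/docs/b", "/docs/c"})
deleted = deleted_since_last_crawl(record, ["/docs/a", "/docs/c"])
changed = changed_since_last_crawl(
    record, {"/docs/a": 90, "/docs/c": 130, "/docs/d": 140}
)
```

In this sketch a newly added document and a modified document are found by the same LCT filter, which matches the abstract's single query for documents with an LCT more recent than the stored MLCT.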
205 Citations
23 Claims
1. A computer-based method for performing an incremental crawl of a computer-readable document store in a manner that facilitates an efficient determination of whether and how the document store has been incremented from a prior state, comprising the following acts:
(a) determining from the document store whether a deleted documents count (DDC) for a first folder has changed from a value of the DDC as determined during a previous crawl of the document store;
(b) if the DDC has changed, identifying the documents that have been deleted from the first folder subsequent to the previous crawl; and
(c) if the DDC has not changed, determining whether a maximum local commit time (MLCT) associated with the first folder is later than a value of the MLCT as determined during the previous crawl, and, if it is later, identifying the documents that have been added to the folder or modified subsequent to the previous crawl.
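The ordering of acts (a) through (c) can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name `incremental_step`, the string labels it returns, and the integer timestamps are hypothetical, and the sketch decides which follow-up work a folder needs without performing it.

```python
def incremental_step(prev_ddc, prev_mlct, cur_ddc, cur_mlct):
    """Decide the follow-up work for one folder, following the order of claim 1.

    prev_* are the values recorded during the previous crawl;
    cur_* are the values read from the document store now.
    """
    if cur_ddc != prev_ddc:
        # Act (b): the DDC changed, so deleted documents must be identified.
        return "find_deleted"
    if cur_mlct > prev_mlct:
        # Act (c): the DDC is unchanged but the MLCT is later.
        return "find_added_or_modified"
    # Neither property changed; the folder needs no further work.
    return "unchanged"


# Example calls with made-up values.
after_delete = incremental_step(prev_ddc=2, prev_mlct=100, cur_ddc=3, cur_mlct=150)
after_edit = incremental_step(prev_ddc=2, prev_mlct=100, cur_ddc=2, cur_mlct=150)
no_change = incremental_step(prev_ddc=2, prev_mlct=100, cur_ddc=2, cur_mlct=100)
```

As worded in claim 1, the MLCT comparison is reached only when the DDC has not changed, which is why the first call reports only the deletion branch even though its MLCT is also later.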
14. A computer-executable crawler application stored on a computer readable storage medium that is accessible to a server computer coupled by a network to a document store, wherein the document store contains a plurality of electronic documents and folders containing references to one or more documents, and wherein the document store provides local commit times (LCTs) and maximum LCTs (MLCTs) for documents and folders in the document store and deleted documents counts (DDCs) for folders in the document store, comprising:
(a) executable code for determining whether the DDC for a first folder has changed from a value of the DDC as determined during a previous crawl of the document store, and, if the DDC has changed, identifying the documents that have been deleted from the first folder since the previous crawl; and
(b) executable code for determining whether the MLCT associated with the first folder is later than a value of the MLCT as determined during the previous crawl, and, if it is later, identifying the documents that have been added to the folder or modified subsequent to the previous crawl.
19. A computer system comprising:
a server computer; and
a document store operatively coupled to the server computer, wherein the document store contains a plurality of electronic documents and folders containing references to one or more documents, and wherein the document store provides properties including local commit times (LCTs) and maximum LCTs (MLCTs) for documents and folders in the document store and deleted documents counts (DDCs) for folders in the document store;
wherein the LCT, MLCT and DDC properties are provided for each folder, and a LCT is provided for each document;
wherein the LCT for a folder changes whenever a folder specific property is modified;
wherein the MLCT for the folder changes whenever any contained document's LCT changes; and
wherein the LCT of a document changes when a document is modified.
a computer readable storage medium operatively coupled to the server computer; and
a computer-executable crawler application stored on the computer readable storage medium.
21. A computer system comprising:
a server computer;
a document store operatively coupled to the server computer, wherein the document store contains a plurality of electronic documents and folders containing references to one or more documents, and wherein the document store provides local commit times (LCTs) and maximum LCTs (MLCTs) for documents and folders in the document store and deleted documents counts (DDCs) for folders in the document store;
a computer readable storage medium operatively coupled to the server computer; and
a computer-executable crawler application stored on the computer readable storage medium;
wherein the crawler application, when executed by the server, causes the following acts to be carried out by the server:
determining whether the DDC for a first folder has changed from a value of the DDC as determined during a previous crawl of the document store, and, if the DDC has changed, identifying the documents that have been deleted from the first folder since the previous crawl; and
determining whether the MLCT associated with the first folder is later than a value of the MLCT as determined during the previous crawl, and, if it is later, identifying the documents that have been added to the folder or modified subsequent to the previous crawl.
23. A computer-readable document store, comprising a plurality of electronic documents and folders containing references to one or more documents, wherein the document store includes properties including local commit times (LCTs) and maximum LCTs (MLCTs) for documents and folders in the document store and deleted documents counts (DDCs) for folders in the document store;
wherein the LCT, MLCT and DDC properties are provided for each folder, and a LCT is provided for each document;
wherein the LCT for a folder changes whenever a folder specific property is modified;
wherein the MLCT for the folder changes whenever any contained document's LCT changes; and
wherein the LCT of a document changes when a document is modified.
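The property rules recited in claim 23 can be illustrated from the store's side with a toy model. This is a sketch under assumptions, not the patented store: the `Folder` class, its method names, and the use of a shared counter as a commit clock are hypothetical, and incrementing the DDC by one on each deletion is an assumed behavior that the claim itself does not spell out.

```python
import itertools


class Folder:
    """Toy folder that maintains LCT, MLCT and DDC (hypothetical model)."""

    def __init__(self, clock):
        self._clock = clock     # shared monotonically increasing commit clock
        self.lct = next(clock)  # folder's own local commit time
        self.mlct = 0           # latest LCT among contained documents
        self.ddc = 0            # deleted documents count
        self.doc_lct = {}       # document URL -> LCT
        self.props = {}         # folder-specific properties

    def set_property(self, name, value):
        """Modifying a folder-specific property changes the folder's LCT."""
        self.props[name] = value
        self.lct = next(self._clock)

    def write_document(self, url):
        """Adding or modifying a document changes its LCT and the folder's MLCT."""
        stamp = next(self._clock)
        self.doc_lct[url] = stamp
        self.mlct = max(self.mlct, stamp)

    def delete_document(self, url):
        """Assumption: each deletion increments the DDC by one."""
        del self.doc_lct[url]
        self.ddc += 1


# Example with made-up URLs and a counter standing in for commit times.
folder = Folder(itertools.count(1))
folder.write_document("/docs/a")
folder.write_document("/docs/b")
folder.set_property("title", "Reports")
folder.delete_document("/docs/a")
```

In this model a deletion leaves the MLCT untouched, so a crawler could only notice it through the DDC, which is consistent with the claims checking the two properties separately.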
Specification