Synchronizing crawler with notification source
Abstract
A method and system for the processing and maintenance of electronic information retrieved from electronic documents stored on a computer network. The gatherer program of the present invention employs a crawler to crawl a portion of the computer network to retrieve electronic documents found during the crawl and that meet a set of crawl restriction rules. Some or all of the data contained in the copies of electronic documents is then stored in a data store such as an index. The invention keeps the data in the data store current by accepting notifications of when a previously retrieved document has changed. The notifications are sent by a notification source that monitors a space containing the previously retrieved documents for changes occurring after the document was last retrieved by the gatherer program. Because the document is being monitored for changes by the notification source, the gatherer program only needs to retrieve the document again when the gatherer program has been notified that the document has changed. If the notification source experiences a discontinuity, such as a system shutdown, the notification source requests that the gatherer perform an initialization crawl to retrieve any documents that changed while the notification source was not operational.
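The notification flow described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the patent: the `Gatherer` class, its method names, and the in-memory `network` dictionary standing in for the computer network.

```python
from dataclasses import dataclass, field


@dataclass
class Gatherer:
    """Toy gatherer that keeps a data store current from notification messages."""
    network: dict                      # hypothetical stand-in: address -> current document text
    data_store: dict = field(default_factory=dict)
    fetch_count: int = 0

    def retrieve(self, address):
        # Fetch a fresh copy and store it under its address.
        self.fetch_count += 1
        self.data_store[address] = self.network[address]

    def on_change(self, address):
        # A notification names one changed document; only that one is re-fetched.
        self.retrieve(address)

    def on_initialize(self, monitored):
        # After a discontinuity the source cannot say what changed,
        # so every monitored address is retrieved again.
        for address in monitored:
            self.retrieve(address)


network = {"doc/a": "v1", "doc/b": "v1"}
gatherer = Gatherer(network)
gatherer.on_initialize(list(network))   # first crawl over the monitored space
network["doc/a"] = "v2"                 # a document changes on the network
gatherer.on_change("doc/a")             # the notification triggers one retrieval
```

In this sketch an unchanged document is never fetched a second time, which is the saving the abstract attributes to monitoring by the notification source.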
25 Claims
-
1. A computer-based method of retrieving and maintaining information about electronic documents stored on a computer network, each electronic document having an associated document address specification, the method comprising:
-
(a) retrieving information about selected electronic documents, including their document address specification, from the computer network, for each selected document the retrieving including:
(i) adding the associated document address specification to a transaction log;
(ii) returning a copy of the electronic document and marking the document address specification in the transaction log;
(iii) parsing the returned electronic copy to identify links to other electronic documents having document address specifications;
(iv) adding the document address specifications of the other electronic documents to the transaction log unless previously in the transaction log; and
(v) repeating (ii)-(iv) until there are no unmarked document address specifications in the transaction log;
(b) storing at least some of the data associated with each returned electronic document copy in a data store, the document address specification being associated with the data for retrieval of the data from the data store;
(c) providing a notification source for monitoring the document address specification corresponding to each returned electronic document copy for a change made to the electronic document associated with the document address specification;
(d) sending a notification message from the notification source when the monitoring by a notification source detects a change in the electronic document data associated with the document address specification;
(e) processing the notification message so as to cause an electronic document copy containing the detected changes to be returned; and
(f) updating at least some of the data associated with the returned electronic copy in the data store based on the returned electronic document copy containing the detected changes.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
synchronizing the notification source with the data store, the notification source and the data store being synchronized when the at least some of the data associated with each returned electronic document copy stored in the data store corresponds to the returned electronic document.
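Steps (a)(i) through (a)(v) and step (b) of claim 1 can be sketched as a loop over a transaction log. This is only an illustrative reading of the claim language: the `crawl` function, the `[[address]]` link syntax, and the dictionary used as the transaction log are assumptions, not details from the patent.

```python
import re

LINK = re.compile(r"\[\[(.+?)\]\]")  # assumed link syntax: [[address]]


def crawl(seed, fetch):
    """Crawl from `seed` and return {address: document text}."""
    log = {seed: False}          # (i) transaction log: address -> marked?
    data_store = {}
    while True:
        unmarked = [a for a, marked in log.items() if not marked]
        if not unmarked:         # (v) stop when no unmarked address remains
            return data_store
        address = unmarked[0]
        text = fetch(address)    # (ii) return a copy of the document ...
        log[address] = True      # ... and mark its address in the log
        data_store[address] = text          # (b) store data keyed by address
        for link in LINK.findall(text):     # (iii) parse the copy for links
            log.setdefault(link, False)     # (iv) add unless already in the log


pages = {"a": "see [[b]] and [[c]]", "b": "back to [[a]]", "c": "no links"}
store = crawl("a", lambda address: pages.get(address, ""))
```

Because an address is added only when it is not already in the log, the link from `b` back to `a` does not cause a second retrieval. The rescan for unmarked entries on every pass keeps the sketch short; a queue would avoid it.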
-
-
3. The method of claim 2, wherein synchronizing the notification source with the data store includes performing an initialization crawl, the initialization crawl comprising:
-
(a) seeding the transaction log with at least one document address specification;
(b) iteratively retrieving document address specifications from the transaction log; and
(c) processing each document address specification retrieved from the transaction log, the processing comprising:
(i) retrieving an electronic copy of the electronic document data associated with the document address specification; and
(ii) updating the data store with the electronic copy of the electronic document data.
-
-
4. The method of claim 3, wherein the initialization crawl is invoked when the method is first instanced.
-
5. The method of claim 3, wherein the initialization crawl is invoked by the notification source when the notification source sends an initialization message.
-
6. The method of claim 5, wherein the initialization message is sent by the notification source when the notification source is first instanced.
-
7. The method of claim 5, wherein the initialization message is sent by the notification source after the notification source experiences a discontinuity.
-
8. The method of claim 3, wherein the initialization crawl is a first crawl, the first crawl further comprising creating a new instance of the data store for use during the first crawl.
-
9. The method of claim 3, wherein the transaction log is seeded with document address specifications monitored by the notification source.
-
10. The method of claim 3, wherein the transaction log is seeded with a set of document address specifications stored in a history map, the history map comprising a list of document address specifications associated with data associated with returned electronic document copies stored in the data store;
- and wherein the initialization crawl also comprises:
determining if a timestamp associated with each electronic document copy associated with a document address specification matches the data associated with the electronic document copy;
if the electronic document copy does not have a timestamp that matches the timestamp associated with the electronic document data associated with the electronic document copy, returning the electronic document data; and
if the electronic document copy has a timestamp that matches the timestamp associated with the electronic document data associated with the electronic copy, not returning the electronic document data.
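The timestamp test in claim 10 can be sketched as follows. The function name, its parameters, and the use of plain dictionaries for the history map and data store are hypothetical; the claim does not say how timestamps are obtained or compared, so simple equality is assumed.

```python
def initialization_crawl(history_map, current_timestamp, fetch, data_store):
    """Re-fetch only documents whose timestamp differs from the stored one.

    history_map: address -> timestamp recorded at the last retrieval.
    current_timestamp: callable giving the document's present timestamp.
    fetch: callable returning the document data for an address.
    """
    refetched = []
    for address, stored in list(history_map.items()):
        now = current_timestamp(address)
        if now == stored:
            continue                           # timestamps match: data is not returned
        data_store[address] = fetch(address)   # mismatch: return the data and store it
        history_map[address] = now             # remember the new timestamp
        refetched.append(address)
    return refetched


history = {"a": 100, "b": 100}
stamps = {"a": 100, "b": 250}          # document "b" changed while unmonitored
docs = {"a": "old a", "b": "new b"}
store = {"a": "old a", "b": "old b"}
changed = initialization_crawl(history, stamps.__getitem__, docs.__getitem__, store)
```

Only `b` is retrieved in this example, since its timestamp no longer matches the one recorded in the history map.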
-
11. The method of claim 12, wherein the initialization crawl is performed in response to an initialize message sent by the notification source.
-
12. The method of claim 3, wherein the notification messages are continuously processed from a time immediately following the initialization crawl until the initialization crawl ends.
-
13. The method of claim 1, wherein the notification source monitors a plurality of electronic documents.
-
14. A computer-based system for retrieving and maintaining information associated with a plurality of electronic documents stored on a computer network, the system comprising:
-
(a) a gatherer for performing an initialization crawl, said initialization crawl comprising:
(i) adding a document address specification to a transaction log;
(ii) retrieving a copy of a source electronic document from a location on the computer network, the location of the source electronic document defined by a document address specification currently in the transaction log and marking the document address specification in the transaction log;
(iii) parsing the retrieved source electronic document to identify links to other source electronic documents having document address specifications;
(iv) adding the document address specifications of the other source electronic documents to the transaction log unless previously in the transaction log;
(v) storing at least some of an original information content from the copy of the source electronic document in a data store;
(vi) associating the original information content that is stored in the data store with the document address specification of the source electronic document; and
(vii) repeating (ii)-(vi) until there are no unmarked document address specifications in the transaction log;
(b) a notification source for monitoring the source electronic document stored at a document address specification, said monitoring the source electronic document comprising:
(i) detecting when a change has been made to the original information content of the source electronic document; and
(ii) sending a notification message to the gatherer when the notification source detects that the source electronic document has been changed, the notification message including the document address specification of the source electronic document that has been changed; and
(c) said notification retrieval being performed by said gatherer when said gatherer receives a notification message, said notification retrieval comprising:
(i) retrieving a second copy of the source electronic document from a location on said computer network;
(ii) updating at least some of the original information content in the data store based on the retrieved second copy of the source electronic document; and
(iii) maintaining the association of the updated original information contained in the data store with the document address specification of the source electronic document.
Dependent claims: 15, 16, 17
-
-
18. A computer readable medium having computer-executable instructions for retrieving and maintaining information about electronic documents stored on a computer network, each electronic document having an associated document address specification, comprising:
-
(a) performing an initialization crawl, the initialization crawl comprising:
(i) adding a document address specification to a transaction log;
(ii) returning a copy of the electronic document associated with the document address specification and marking the document address specification in the transaction log;
(iii) parsing the returned electronic document copy to identify links to other electronic documents having document address specifications;
(iv) adding the document address specifications of the other electronic documents to the transaction log unless previously in the transaction log; and
(v) repeating (ii)-(iv) until there are no unmarked document address specifications in the transaction log;
(b) monitoring with a notification source the electronic documents associated with said document address specifications for a change to the electronic document;
(c) sending a notification message from the notification source when an electronic document has changed; and
(d) performing a notification retrieval in response to the notification message, the notification retrieval comprising retrieving a new copy of the electronic document from the computer network.
Dependent claims: 19, 20, 21
storing the associated information content of the electronic document in a data store.
-
-
21. The computer-readable medium containing computer-executable instructions for retrieving information from a computer network of claim 20, further comprising:
updating the associated information content of the electronic document in the data store with the information content derived from the new copy of the electronic document.
-
22. A computer-based system for retrieving and maintaining information about electronic documents stored on a computer network, each of said electronic documents having a document address specification, the system comprising:
-
(a) means for performing an initialization crawl including:
(i) means for adding a document address specification to a transaction log;
(ii) means for returning a copy of the electronic document associated with the document address specification and marking the document address specification in the transaction log;
(iii) means for parsing the returned electronic document copy to identify links to other electronic documents having document address specifications;
(iv) means for adding the document address specifications of the other electronic documents to the transaction log unless previously in the transaction log; and
(v) means for repeating (ii)-(iv) until there are no unmarked document address specifications in the transaction log;
(b) means for detecting changes in the electronic documents associated with said document address specifications; and
(c) means for performing a notification retrieval in response to the detection of a change in an electronic document, the notification retrieval comprising retrieving a new copy of the electronic document from the computer network.
Dependent claims: 23, 24, 25
means for storing information associated with retrieved electronic documents in a data store; and
means for updating the associated information stored in the data store based on information contained in the retrieved new copy of the electronic document.
-
-
24. The computer-based system of claim 23, wherein the initialization crawl is performed in response to a request made by the means for detecting a change in the electronic documents when the means for detecting a change in the electronic documents first begins to operate.
-
25. The computer-based system of claim 23, wherein the initialization crawl is performed in response to a request made by the means for detecting a change in the electronic documents when the means for detecting a change in the electronic documents experiences a discontinuity.
Specification