Checksum-comparing change-detection tool indicating degree and location of change of internet documents
Abstract
A change-detection web server automatically checks web-page documents for recent changes. The server retrieves and compares documents one or more times a week, and the user is notified by electronic mail when a change is detected. The user registers a web-page document by submitting an e-mail address and the uniform resource locator (URL) of the desired document. The document is fetched, and the user can select text of interest on the page. Non-selected text is ignored; only changes in the selected text are reported back to the user, so changes to less relevant parts of the document are ignored. The document is divided into sections bounded by hyper-text markup-language (HTML) tags. A checksum is generated and stored for each HTML-bound section. Storage requirements are reduced since only checksums are stored rather than the original documents. During periodic comparisons a fresh copy of the document is retrieved and divided into HTML-bound sections, and a checksum is generated for each section. The freshly generated checksums are compared to the archived checksums. Sections with non-matching checksums are highlighted as changed, and the percentage of changed sections is reported. The user-defined selection is also stored as a checksum and compared to a freshly generated checksum. Changed checksums outside the user-defined selection do not generate a change notification. Re-ordering of sections does not generate a change notification when the checksums otherwise match. Thus format and layout changes do not generate change notifications, and the frequency of notices to the user is reduced.
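The section-level comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the regular expression `TAG`, the function names `section_checksums` and `has_changed`, and the choice of CRC-32 from `zlib` are all assumptions made for illustration.

```python
import re
import zlib

# Assumption: a tag is anything between "<" and ">"; real HTML needs a proper parser.
TAG = re.compile(r"<[^>]*>")

def section_checksums(html: str) -> list[int]:
    """Split the page at tags and checksum each non-empty text section."""
    sections = (s.strip() for s in TAG.split(html))
    return [zlib.crc32(s.encode("utf-8")) for s in sections if s]

def has_changed(archived: list[int], fresh: list[int]) -> bool:
    """Report a change only if some archived checksum has no fresh match."""
    fresh_set = set(fresh)
    return any(checksum not in fresh_set for checksum in archived)

original = "<h1>News</h1><p>Rain today</p><p>Contact us</p>"
reordered = "<p>Contact us</p><h1>News</h1><p>Rain today</p>"
edited = "<h1>News</h1><p>Sun today</p><p>Contact us</p>"

# Only the checksums are kept, not the page itself.
archived = section_checksums(original)
```

Because the tags themselves are dropped before checksumming and the comparison ignores order, `has_changed(archived, section_checksums(reordered))` is false in this sketch, while the edited page is flagged.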
17 Claims
1. A change-detection web server comprising:
a network connection for transmitting and receiving packets from a remote client and a remote document server;
a responder, coupled to the network connection, for communicating with the remote client, the responder registering a document for change detection by receiving from the remote client a uniform-resource-locator (URL) identifying the document, the responder fetching the document from the remote document server and generating an original checksum for a checked portion of the document, the checked portion being less than the entire document;
archival storage means, coupled to the responder, for receiving the URL and the original checksum from the responder when the document is registered by the remote client, the archival storage means for storing a plurality of records each containing a URL and a checksum for a registered document;
a periodic fetcher, coupled to the archival storage means and the network connection, for periodically re-fetching the document from the remote document server by transmitting the URL from the archival storage means to the network connection, the periodic fetcher receiving a fresh copy of the document from the remote document server;
a checksum generator, coupled to receive the fresh copy of the document from the periodic fetcher, for generating a fresh checksum of a portion of the fresh copy of the document and comparing the fresh checksum to the original checksum, the checksum generator signaling a detected change to the remote client when the fresh checksum does not match the original checksum, whereby a change in the document is detected by comparing a checksum for the checked portion of the document, wherein changes in portions of the document outside the checked portion are not signaled to the remote client.
whereby storage requirements for the archival storage means are reduced by archiving checksums and not entire documents.
3. The change-detection web server of claim 2 further comprising:
selection means, coupled to the responder, for receiving a selection from the remote client, the selection identifying boundaries of the checked portion of the document;
parsing means, coupled to the checksum generator, for parsing the fresh copy and generating checksums for a plurality of portions of the fresh copy;
compare means, coupled to the parsing means, for signaling a match when any of the checksums generated by the parsing means matches the original checksum from the archival storage means, whereby a change in the document is detected when the match is not signaled by the compare means, the parsing means generating a plurality of checksums for the plurality of portions of the fresh copy.
4. The change-detection web server of claim 3 wherein the archival storage means further comprises:
a length field for indicating a size of the checked portion, the length field written by the selection means, the parsing means generating each checksum for portions having the size of the checked portion, whereby the size of the checked portion is stored and used by the parsing means.
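One way to read claims 3 and 4 is as a sliding-window search: the archive keeps only a checksum and a length for the selected portion, and the fresh copy is scanned for any window of that length whose checksum matches. The Python sketch below illustrates that reading under stated assumptions. The function names, the use of byte offsets, and CRC-32 as the checksum are hypothetical, not taken from the claims.

```python
import zlib

def register_selection(document: str, start: int, end: int) -> tuple[int, int]:
    """Return (checksum, length) for the user-selected span document[start:end].

    Only these two values would be archived, not the text itself.
    """
    selection = document[start:end].encode("utf-8")
    return zlib.crc32(selection), len(selection)

def selection_still_present(fresh: str, checksum: int, length: int) -> bool:
    """Signal a match if any window of the stored length has the stored checksum."""
    data = fresh.encode("utf-8")
    return any(
        zlib.crc32(data[i:i + length]) == checksum
        for i in range(len(data) - length + 1)
    )

page = "Header. Price: $10. Footer."
start = page.index("Price")
checksum, length = register_selection(page, start, start + len("Price: $10."))
```

In this sketch the selection is still found when text before it is added or removed, since every offset is tried. Recomputing a CRC per window is quadratic in the worst case; a rolling checksum would be the usual optimization, but the claims do not specify one.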
5. The change-detection web server of claim 1 wherein the document is a hyper-text markup-language (HTML) document containing HTML tags, the HTML tags for indicating formatting, layout, and hyper-links specifying URLs of other servers, the change-detection web server further comprising:
divider means, coupled to the responder, for dividing the document into portions bound by the HTML tags;
checksum means for generating original checksums, an original checksum generated for each portion bound by HTML tags;
the archival storage means storing the original checksums for the portions bound by the HTML tags;
the checksum generator further comprising:
second divider means for dividing the fresh copy of the document into portions bound by the HTML tags;
second checksum means for generating fresh checksums for portions of the fresh copy bound by HTML tags in the fresh copy of the document;
compare means, receiving the fresh checksums of the fresh copy from the second checksum means, for comparing the fresh checksums to the original checksums from the archival storage means;
report means for signaling a change in the document when an original checksum for the document has no matching fresh checksum, whereby checksums are generated and stored for portions of the document bound by the HTML tags.
6. The change-detection web server of claim 5 wherein the report means further comprises:
mailer means, coupled to the network connection, for sending a change notification message to the remote client when the change is signaled, wherein the responder receives an electronic-mail address from the remote client, the responder storing the electronic-mail address of the remote client in the archival storage means, and the mailer means reading the electronic-mail address from the archival storage means, the change notification message being sent to the remote client as an electronic-mail message addressed to the electronic-mail address, whereby the remote client is notified of the change by electronic mail.
7. The change-detection web server of claim 6 further comprising:
change statistics generator, coupled to the compare means, for counting a total number of portions in the document and for determining a number of original checksums without matching fresh checksums, the change statistics generator coupled to the mailer means to include in the electronic-mail message an indication of a degree of changes in the document, wherein the degree of changes is determined for the document and included in the electronic-mail message to the remote client when a change is detected.
8. The change-detection web server of claim 7 wherein the degree of changes in the document is the number of original checksums without matching fresh checksums divided by the total number of portions in the document,
whereby the degree of change reported to the remote client indicates a fraction of portions of the document which have changed.
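The ratio in claim 8 can be written out as a short Python sketch. It assumes that the "total number of portions" is the count of archived original checksums and that duplicate sections are matched one-for-one. The name `degree_of_change` is hypothetical.

```python
from collections import Counter

def degree_of_change(original: list[int], fresh: list[int]) -> float:
    """Fraction of original checksums that have no matching fresh checksum."""
    if not original:
        return 0.0
    # Counter subtraction keeps only originals left unmatched, respecting duplicates.
    unmatched = Counter(original) - Counter(fresh)
    return sum(unmatched.values()) / len(original)
```

For example, with four original checksums of which one has no counterpart in the fresh copy, the function returns 0.25 regardless of the order in which the fresh checksums arrive.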
9. The change-detection web server of claim 7 further comprising:
highlighting means, coupled to the mailer means, for attaching the fresh copy of the document to the electronic-mail message, the fresh copy having highlighting marks inserted to indicate which portions of the document have mismatching checksums, whereby the fresh copy of the document is highlighted to indicate changes to the remote client.
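As a rough illustration of the highlighting step, the Python sketch below wraps each text section of the fresh copy whose checksum is absent from the archived set. The claim does not say what the highlighting marks look like; the `<mark>` element, the regular expression, and the function name `highlight_changes` are assumptions chosen for the example.

```python
import re
import zlib

# The capturing group makes re.split keep the tags, so the page can be rebuilt.
TAG_SPLIT = re.compile(r"(<[^>]*>)")

def highlight_changes(fresh_html: str, original_crcs: set[int]) -> str:
    """Return the fresh copy with unmatched text sections wrapped in <mark>."""
    pieces = TAG_SPLIT.split(fresh_html)
    out = []
    for i, piece in enumerate(pieces):
        # Even positions are text between tags; odd positions are the tags.
        text = piece.strip()
        if i % 2 == 0 and text:
            if zlib.crc32(text.encode("utf-8")) not in original_crcs:
                piece = f"<mark>{piece}</mark>"
        out.append(piece)
    return "".join(out)
```

The resulting string could then be attached to the notification message in place of the unmodified fresh copy.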
10. The change-detection web server of claim 9 wherein the packets transmitted to the network connection are TCP/IP packets and wherein the remote client and the remote document server are on the Internet.
11. A computer-implemented method for detecting recent changes in a document and notifying a user of the recent changes, the method comprising the steps of:
registering the document by receiving an address of the user and a locator for the document;
fetching the document from a remote server by transmitting the locator to a network server;
determining when the document is a web page with hidden tags;
when the document is a web page with hidden tags;
dividing the document into sections, each section beginning and ending with a tag, the tag not directly visible to a user viewing the document on a browser;
generating a cyclical-redundancy-checksum (CRC) for each section of the document;
storing the CRC generated for each section of the document in a database together with the locator of the document and the address of the user;
after a period of time;
reading the locator from the database and transmitting the locator to the remote server to fetch a recent copy of the document;
when the document is a web page with hidden tags;
dividing the recent copy of the document into sections, each section beginning and ending with a tag;
generating a recent cyclical-redundancy-checksum (CRC) for each section of the recent copy of the document;
reading the CRC's from the database and comparing the CRC's to the recent CRC's to determine which CRC's from the database do not have a matching recent CRC;
signaling that a change is detected when a CRC from the database does not have a matching recent CRC, whereby the document is not stored in the database, which stores CRC's for tag-bound sections of the web page with hidden tags.
reading the address of the user from the database and sending a message to the address of the user stating that a change has occurred, whereby the user is notified by a message when a change is detected.
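The register, re-fetch, compare, and notify steps of the method can be sketched end to end in Python. This is an illustrative outline only: the in-memory `dict` standing in for the database, the injected `fetch` and `notify` callables, the simplistic tag pattern, and all function names are assumptions, and the check for whether a document has hidden tags is omitted.

```python
import re
import zlib
from typing import Callable

TAG = re.compile(r"<[^>]*>")  # assumption: simplistic tag pattern

def section_crcs(document: str) -> list[int]:
    """CRC-32 of each non-empty section found between tags."""
    sections = (s.strip() for s in TAG.split(document))
    return [zlib.crc32(s.encode("utf-8")) for s in sections if s]

def register(db: dict, locator: str, address: str,
             fetch: Callable[[str], str]) -> None:
    """Store the CRCs with the locator and the user's address, not the document."""
    db[locator] = {"address": address, "crcs": section_crcs(fetch(locator))}

def recheck(db: dict, fetch: Callable[[str], str],
            notify: Callable[[str, str], None]) -> None:
    """Re-fetch every registered document and message the user on a mismatch."""
    for locator, record in db.items():
        recent = set(section_crcs(fetch(locator)))
        if any(crc not in recent for crc in record["crcs"]):
            notify(record["address"], f"{locator} has changed")
```

In a real deployment `fetch` would perform an HTTP request and `recheck` would run on a schedule such as weekly; passing them in as parameters keeps the sketch testable without a network.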
13. The computer-implemented method of claim 12 wherein the step of signaling that a change is detected further comprises:
including an indication of a degree of change in the message to the user, the degree of change for the document being a function of a number of CRC's from the database that do not have a matching recent CRC, whereby the message to the user indicates the degree of change to the document.
14. The computer-implemented method of claim 13 wherein the degree of change is expressed as the number of CRC's from the database that do not have a matching recent CRC, as a percentage of a total number of CRC's for the document,
whereby the percentage of change of the document is sent to the user in the message.
15. The computer-implemented method of claim 12 wherein the document is a web-page document on the world-wide web and the locator is a uniform-resource locator (URL).
16. The computer-implemented method of claim 12 wherein the period of time is about a week.
17. The computer-implemented method of claim 16 wherein the tags are not included when generating the CRC's,
whereby formatting changes embedded in the tags do not signal a change, reducing occurrences of change notifications when only minor formatting changes occur to the document.
Specification