Method of web crawling utilizing crawl numbers
Abstract
A computer based system and method of retrieving information pertaining to electronic documents on a computer network is disclosed. The method includes maintaining a database that associates each electronic document with a corresponding crawl number that indicates the most recent crawl during which a change to the document was detected. During a subsequent crawl, electronic documents that have changed since the previous crawl are retrieved, and selected data is stored in a database. The retrieved document information is marked with a crawl number. During subsequent searches, crawl numbers are used to determine documents that have changed since a specified crawl.
29 Claims
1. A computer based method of retrieving information from a computer network (Web) having a plurality of electronic documents stored thereon, wherein each electronic document has a corresponding document address specification that provides information for locating the electronic document, the method including performing a current Web crawl comprising:
assigning a current crawl number to the current Web crawl, said current crawl number being the next number in a numerical sequence of numbers;
determining whether an electronic document has been retrieved during a previous Web crawl and associated with a crawl number modified;
if the electronic document has not been retrieved during a previous Web crawl and associated with a crawl number modified, associating the current crawl number with the electronic document as its crawl number modified;
if the electronic document has been retrieved during a previous Web crawl and associated with a crawl number modified, determining whether the actual content of the electronic document has been modified subsequent to the previous retrieval; and
if the actual content of the electronic document has been modified subsequent to the previous retrieval, associating the current crawl number with the electronic document as its crawl number modified.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
determining whether the electronic document has an associated time stamp matching a previously stored time stamp associated with the electronic document;
if the electronic document does not have an associated time stamp matching the previously stored time stamp, retrieving the electronic document by using a document address specification; and
if the electronic document has an associated time stamp matching the previously stored time stamp, not retrieving the electronic document.
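The bookkeeping recited in claim 1 can be illustrated with a minimal sketch. The names `CrawlStore`, `begin_crawl`, and `record` are hypothetical, and an in-memory dictionary stands in for whatever database an actual implementation would use; how a content change is detected is left to the caller.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CrawlStore:
    """Hypothetical in-memory stand-in for the crawl database."""

    last_crawl_number: int = 0
    # Maps document address -> crawl number modified.
    crawl_number_modified: Dict[str, int] = field(default_factory=dict)

    def begin_crawl(self) -> int:
        """Assign the next number in the sequence to the current crawl."""
        self.last_crawl_number += 1
        return self.last_crawl_number

    def record(self, address: str, current_crawl: int, content_changed: bool) -> int:
        """Update and return the crawl number modified for one document."""
        if address not in self.crawl_number_modified:
            # Never retrieved before: stamp with the current crawl number.
            self.crawl_number_modified[address] = current_crawl
        elif content_changed:
            # Retrieved before and the content differs: re-stamp.
            self.crawl_number_modified[address] = current_crawl
        # Otherwise the earlier crawl number modified is left untouched.
        return self.crawl_number_modified[address]
```

A document that is unchanged across several crawls keeps the number of the crawl in which it was last seen to change, which is what later makes "changed since crawl N" queries cheap.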
3. The method of claim 1, wherein determining whether the actual content of the electronic document has been modified subsequent to the previous retrieval comprises:
determining current representation data corresponding to the electronic document; and
comparing the current representation data corresponding to the electronic document with previous representation data corresponding to the electronic document and determined prior to performing the current Web crawl.
4. The method of claim 3, wherein the current representation data is a hash value and determining the representation data comprises performing a hash function.
5. The method of claim 4, wherein the hash function is a secure hash function.
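One way to picture the representation data of claims 3 to 5 is a digest compared against the value stored by an earlier crawl. The sketch below assumes SHA-256 as the secure hash function; the claims do not name a specific algorithm, and the function names are hypothetical.

```python
import hashlib


def representation(content: bytes) -> str:
    """Return a SHA-256 hex digest as the document's representation data."""
    return hashlib.sha256(content).hexdigest()


def content_modified(content: bytes, previous_digest: str) -> bool:
    """Compare the current digest with the one stored by an earlier crawl."""
    return representation(content) != previous_digest
```

Storing only the digest lets the crawler detect a change without keeping a full copy of the earlier document.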
6. The method of claim 3, wherein determining representation data comprises:
filtering out selected data from the electronic document; and
determining representation data representative of data from the electronic document that has not been filtered out.
7. The method of claim 6, wherein filtering out selected data includes filtering out text format specification data.
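The filtering step of claims 6 and 7 might look like the following sketch, which assumes the documents are HTML and treats markup tags and runs of whitespace as the format specification data to discard. The helper names are hypothetical, and a real filter would likely be more selective (for example, this one still hashes the text inside script and style elements).

```python
import hashlib
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects character data and ignores markup tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def filtered_digest(html: str) -> str:
    """Hash only the text left after dropping tags and extra whitespace."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so layout-only edits do not alter the digest.
    text = " ".join("".join(parser.chunks).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

With this filter, a page whose markup is restyled but whose text is unchanged produces the same digest and is therefore not treated as modified.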
8. The method of claim 1, wherein determining whether the actual content of the electronic document has been modified subsequent to the previous retrieval comprises:
(a) determining whether the electronic document has an associated time stamp matching a previously stored time stamp associated with the electronic document;
(b) if the electronic document does not have an associated time stamp matching the previously stored time stamp, performing a document comparison by:
(i) retrieving the electronic document;
(ii) determining current representation data corresponding to the electronic document; and
(iii) comparing the current representation data corresponding to the electronic document with previous representation data corresponding to the electronic document and determined prior to the current Web crawl.
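The two-stage test of claim 8 can be sketched as a cheap time stamp comparison followed, only when needed, by retrieval and a digest comparison. The function name, the `fetch` callback, and the use of SHA-256 are assumptions made for illustration; the sketch also assumes that matching time stamps are taken to mean "not modified".

```python
import hashlib
from typing import Callable, Optional


def changed_since_last_crawl(
    current_timestamp: Optional[str],
    stored_timestamp: Optional[str],
    stored_digest: str,
    fetch: Callable[[], bytes],
) -> bool:
    """Two-stage check: time stamp test first, then a content hash."""
    if current_timestamp is not None and current_timestamp == stored_timestamp:
        # Time stamps match: treat as unmodified and skip retrieval.
        return False
    # Time stamps differ or are missing: retrieve and compare digests.
    content = fetch()
    return hashlib.sha256(content).hexdigest() != stored_digest
```

A differing time stamp alone does not mark the document as modified; the digest comparison has the final say, so a file that was merely re-saved keeps its earlier crawl number modified.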
9. The method of claim 2, wherein determining whether the actual content of the electronic document has been modified subsequent to the previous retrieval further comprises:
sending a request to a server to transfer the electronic document, wherein the transfer is based on whether the time stamp associated with the electronic document is more recent than a time stamp included in the request; and
in the event that the server does not transfer the electronic document, determining that the electronic document has not been modified.
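The request described in claim 9 resembles an HTTP conditional GET, although the claim does not name a protocol. The sketch below assumes HTTP and the standard `If-Modified-Since` header; the function names are hypothetical. A `304 Not Modified` response means the server did not transfer the document.

```python
import urllib.error
import urllib.request
from email.utils import formatdate
from typing import Optional


def build_conditional_request(url: str, last_seen_epoch: float) -> urllib.request.Request:
    """Build a GET asking the server to send the body only if it is newer."""
    stamp = formatdate(last_seen_epoch, usegmt=True)  # HTTP-date in GMT
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})


def fetch_if_modified(url: str, last_seen_epoch: float) -> Optional[bytes]:
    """Return the body if transferred, or None when the server answers 304."""
    request = build_conditional_request(url, last_seen_epoch)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # No transfer: treat the document as not modified.
            return None
        raise
```

`urllib` reports a 304 as an `HTTPError`, which is why the unmodified case is handled in the `except` branch.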
10. The method of claim 1, further comprising:
receiving a request to retrieve a list of electronic documents that match a query, wherein the query includes a criteria to match electronic documents that have been modified subsequent to performing a previous Web crawl; and
in response to receiving the request to retrieve a list of electronic documents, retrieving a set of document address specifications corresponding to electronic documents having an associated crawl number modified assigned to the current Web crawl.
11. The method of claim 1, further comprising:
receiving a request to retrieve a list of electronic documents that have been modified subsequent to performing a previous Web crawl; and
in response to receiving the request to retrieve a list of electronic documents, retrieving a set of document address specifications corresponding to electronic documents having an associated crawl number modified assigned to a crawl more recent than said previous Web crawl.
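The lookup in claims 10 and 11 reduces to a comparison on the stored crawl number modified. A minimal sketch, assuming the index is a plain mapping from document address to crawl number modified (the function name is hypothetical):

```python
from typing import Dict, List


def modified_since(crawl_number_modified: Dict[str, int], previous_crawl: int) -> List[str]:
    """Return addresses whose crawl number modified is newer than previous_crawl."""
    return sorted(
        address
        for address, number in crawl_number_modified.items()
        if number > previous_crawl
    )
```

Because the comparison uses only the stored number, the query needs no access to document content or time stamps.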
12. The method of claim 1, further comprising:
receiving, prior to performing the current Web crawl, a first request to retrieve a list of electronic documents that match a specified criteria;
in response to receiving the first request, providing a list of electronic documents that match the specified criteria;
receiving, after the current Web crawl, a second request to retrieve a list of electronic documents that match the specified criteria;
in response to receiving the second request, retrieving a second list of electronic documents that were modified after the current Web crawl and that match the first specified criteria; and
providing the second list of electronic documents.
13. The method of claim 1, wherein performing the current Web crawl further comprises:
(a) determining at least one hyperlink contained within the electronic document, each hyperlink including a hyperlink document address specification;
(b) determining whether each hyperlink document address specification included corresponds to an electronic document retrieved prior to the current Web crawl;
(c) in the event that the hyperlink document address specification corresponds to a linked electronic document retrieved prior to the current Web crawl, processing the hyperlink document address specification, said processing comprising:
(i) determining whether the actual content of the linked electronic document has been modified subsequent to the prior retrieval of the electronic document; and
(ii) in the event that the actual content of the linked electronic document has been modified, storing data from the linked electronic document and associating the current crawl number to the linked electronic document.
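Steps (a) and (b) of claim 13 can be sketched as extracting link addresses and sorting them by whether they were retrieved in an earlier crawl. The sketch assumes HTML documents and a set of previously retrieved addresses; `split_links` and `_LinkCollector` are hypothetical names. Links in the first returned list would then go through the modification check of step (c).

```python
from html.parser import HTMLParser
from typing import List, Set, Tuple
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects href values from anchor tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_links(html: str, base_url: str, known: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (previously retrieved, never retrieved) absolute link addresses."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    seen, fresh = [], []
    for href in collector.hrefs:
        # Resolve relative links against the address of the containing page.
        address = urljoin(base_url, href)
        (seen if address in known else fresh).append(address)
    return seen, fresh
```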
14. A computer based method of retrieving information from a computer network (Web) having a plurality of electronic documents stored thereon, wherein each electronic document has a corresponding document address specification that provides information for locating the electronic document, the method comprising:
(a) performing a Web crawl, wherein performing the Web crawl includes:
(i) assigning a current crawl number to the Web crawl, said current crawl number establishing an order in which the Web crawl occurred;
(ii) retrieving at least a portion of information contained within each of a plurality of electronic documents that have not previously been retrieved in a prior Web crawl;
(iii) retrieving at least a portion of information contained within each of a plurality of electronic documents that have been modified subsequent to a prior Web crawl; and
(iv) storing, in an index, the information retrieved from each of the plurality of electronic documents that have not been previously retrieved in a prior Web crawl and each of the plurality of electronic documents that have been modified subsequent to a prior Web crawl and associating the information with a crawl number modified that corresponds to the current crawl number assigned to the Web crawl; and
(b) in response to receiving, subsequent to said Web crawl, a request to retrieve a list of electronic documents that have been modified subsequent to said prior Web crawl, selectively retrieving, from the index, said information corresponding to electronic documents that have a corresponding crawl number modified that exceeds the current crawl number of the said prior Web crawl.
Dependent claims: 15, 25
16. A computer-readable medium having computer-executable instructions for retrieving information from a computer network (Web), wherein retrieving information from the computer network includes performing a current Web crawl, wherein performing the current Web crawl comprises:
assigning a current crawl number to the current Web crawl, said current crawl number establishing an order in which the Web crawl occurred;
receiving a document address specification corresponding to an electronic document stored on the computer network;
determining whether the electronic document has been retrieved during a previous Web crawl;
if the electronic document has not been retrieved during a previous Web crawl, storing data from the electronic document and associating the data from the electronic document with a crawl number modified corresponding to the current crawl number assigned to the current Web crawl;
if the electronic document has been retrieved during a previous Web crawl, determining whether the actual content of the electronic document has been modified subsequent to the previous Web crawl; and
if the actual content of the electronic document has been modified subsequent to the previous Web crawl, storing data from the electronic document and associating the data from the electronic document with a crawl number modified corresponding to the current crawl number assigned to the current Web crawl.
Dependent claims: 17, 18, 19, 20, 26
retrieving the electronic document;
calculating a current hash value corresponding to the electronic document;
comparing the current hash value with a previously determined hash value corresponding to the electronic document;
if the current hash value matches the previously determined hash value, determining that the actual content of the electronic document is not modified; and
if the current hash value does not match the previously determined hash value, determining that the actual content of the electronic document is modified.
18. The computer-readable medium of claim 16, wherein the computer-executable instructions for determining whether the actual content of the electronic document has been modified comprises computer-executable instructions for:
filtering out selected data from the electronic document; and
calculating the current hash value based on data from the electronic document that has not been filtered out.
19. The computer-readable medium of claim 16, having further computer-executable instructions for:
receiving a request to retrieve a list of electronic documents that have been modified subsequent to performing a previous Web crawl; and
in response to receiving the request to retrieve a list of electronic documents, retrieving a set of document address specifications corresponding to electronic documents having an associated crawl number modified that is equal to or greater than the current crawl number assigned to the previous Web crawl.
20. The computer-readable medium of claim 16, having further computer-executable instructions for:
receiving a request to retrieve a list of electronic documents that have been modified subsequent to performing a previous Web crawl; and
in response to receiving the request to retrieve a list of electronic documents, filtering out document address specifications corresponding to electronic documents having an associated crawl number modified that matches the current crawl number assigned to said previous Web crawl.
26. The computer-readable medium as recited in claim 16, wherein the crawl number is a next number in a numerical sequence of numbers.
21. A system for retrieving information stored on a computer network (Web), the system comprising:
(a) a computer network (Web) including at least one server having a plurality of electronic documents stored thereon, including a first electronic document, each electronic document having a corresponding Web address;
(b) a database containing information corresponding to the plurality of electronic documents, including information corresponding to the first electronic document; and
(c) a crawler program for performing a current Web crawl, the crawler program comprising computer-executable instructions for:
(i) assigning a current crawl number to the current Web crawl, the current crawl number establishing an order in which the Web crawl occurred;
(ii) retrieving a Web address corresponding to the first electronic document;
(iii) determining whether the first electronic document has information corresponding to it in the database;
(iv) if the first electronic document does not have information corresponding to it in the database, storing information corresponding to the first electronic document in the database, including a crawl number modified that corresponds to the current crawl number;
(v) if the first electronic document has information corresponding to it in the database, determining whether the first electronic document is more recent than the database information corresponding to the first electronic document; and
(vi) if the first electronic document is more recent than the database information corresponding to the first electronic document, storing information corresponding to the first electronic document in the database, including a crawl number modified that corresponds to the current crawl number.
Dependent claims: 22, 23, 24, 27
retrieving a previously calculated hash value corresponding to the first electronic document from the database;
calculating a new hash value corresponding to the first electronic document; and
if the new hash value is different from the previously calculated hash value, determining that the first electronic document is more recent than the database information corresponding to the first electronic document.
23. The system of claim 21, wherein the crawler program further comprises computer-executable instructions for filtering the first electronic document to exclude a portion of the data contained within the first electronic document prior to calculating the new hash value corresponding to the first electronic document.
24. The system of claim 21, further comprising a search engine containing computer-executable instructions for:
determining a set of electronic documents corresponding to a specified criteria, the specified criteria including a specification of a crawl number modified; and
retrieving a list of electronic documents based on the specified criteria, including the specification of the crawl number modified.
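A search that treats the crawl number modified as one more criterion, as in claim 24, might be sketched as follows. The `IndexEntry` record, the keyword match, and the "at least this crawl number" comparison are all assumptions for illustration; the claim only requires that the criteria include a specification of a crawl number modified.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndexEntry:
    """Hypothetical index record for one document."""

    address: str
    text: str
    crawl_number_modified: int


def search(entries: List[IndexEntry], term: str, min_crawl_number: int) -> List[str]:
    """Match on a keyword and on a minimum crawl number modified."""
    needle = term.lower()
    return [
        entry.address
        for entry in entries
        if needle in entry.text.lower()
        and entry.crawl_number_modified >= min_crawl_number
    ]
```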
27. The system as recited in claim 21, wherein the crawl number is a next number in a numerical sequence of numbers.
28. A computer based method of retrieving information from a computer network (Web) having a plurality of electronic documents stored thereon, wherein each electronic document has a corresponding document address specification that provides information for locating the electronic document, the method comprising:
(a) performing a Web crawl, wherein performing the Web crawl includes:
(i) assigning a current crawl number to the Web crawl, said current crawl number establishing an order in which the Web crawl occurred;
(ii) retrieving at least a portion of information contained within each of a plurality of electronic documents that have not previously been retrieved in a prior Web crawl;
(iii) retrieving at least a portion of information contained within each of a plurality of electronic documents that have been modified subsequent to a prior Web crawl; and
(iv) storing, in an index, the information retrieved from each of the plurality of electronic documents that have not been previously retrieved in a prior Web crawl and each of the plurality of electronic documents that have been modified subsequent to a prior Web crawl and associating the information with a crawl number modified that corresponds to the current crawl number assigned to the Web crawl; and
(b) obtaining a request to retrieve a list of electronic documents that have been modified subsequent to an identified Web crawl;
(c) associating a crawl number with the identified Web crawl; and
(d) selectively retrieving, from the index, information corresponding to electronic documents that have a corresponding crawl number modified that exceeds the current crawl number of the crawl number associated with the identified Web crawl.
Dependent claims: 29
Specification