System and method for enhanced browser-based web crawling
First Claim
1. A method for indexing dynamic data documents, the method comprising:
retrieving, to a server, with a web crawler from a network address, a dynamic data document with client-side scripting code therein;
executing, at the server, a web-browser, as part of the web crawler, wherein the web-browser renders an in-memory copy of the dynamic data document which has been retrieved, wherein the in-memory copy of the dynamic data document maintains a rendered web-browser display format and a rendered web-browser display layout of the dynamic data document when the web-browser renders the in-memory copy of the dynamic data document;
executing, at the server instead of a client system, a browser scripting engine as part of the web-browser, wherein the browser scripting engine executes the client-side scripting code and loads content as directed by the client-side scripting code into the in-memory copy creating a final web-browser display representation of the dynamic data document so that the final web-browser display representation is substantially similar to when the dynamic data document is rendered at a user's web-browser and viewed by a user in the user's web-browser running on the client system when all the dynamic data is viewed; and
indexing, at the server, the content in the memory, wherein the content being indexed is the content which has been loaded by the browser scripting engine in order to index the dynamic data document as if being viewed by the user in the user's web-browser on the client system.
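As a rough illustration of the final indexing step recited above, the sketch below assumes the server-side browser has already executed the client-side scripts and returned the final markup as a string. It extracts the visible text and adds it to a small inverted index. The names `VisibleTextExtractor` and `index_rendered_document` are hypothetical, and Python's standard `html.parser` module stands in for whatever parser an actual implementation would use; none of this is taken from the patent itself.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that a user would actually see on the page.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def index_rendered_document(url, final_markup, index=None):
    """Add the words of the rendered markup to an inverted index.

    The index maps each lowercase word to the set of URLs containing it.
    """
    index = defaultdict(set) if index is None else index
    parser = VisibleTextExtractor()
    parser.feed(final_markup)
    parser.close()
    text = " ".join(parser.chunks).lower()
    for word in re.findall(r"[a-z0-9]+", text):
        index[word].add(url)
    return index
```

Because the input is the markup after script execution, text that a script loaded into the page is indexed like any other visible text, while the script source itself is left out.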
1 Assignment
0 Petitions
Abstract
This invention pioneers an enhanced crawling mechanism and technique called “Enhanced Browser Based Web Crawling”. It permits the fault-tolerant gathering of dynamic data documents on the World Wide Web (WWW). The Enhanced Browser Based Web Crawler technology of this invention is implemented by incorporating the functionality of a web browser into the crawler engine so that documents are properly analyzed. Essentially, the Enhanced Browser Based Crawler acts like a web browser after retrieving the initially requested document. It loads additional or included documents as needed (e.g., inline frames, frames, images, applets, audio, video, or equivalents). The Crawler then executes client-side script or code and produces the final HTML markup. A web browser would ordinarily use this final HTML markup to render the document for presentation to the user. Unlike a web browser, however, this invention does not render the composed document for viewing purposes. Rather, it analyzes or summarizes the document, thereby extracting valuable metadata and other important information contained within it. This invention also integrates optical character recognition (OCR) techniques into the crawler architecture, so that the web crawler's summarization process can properly summarize image content (e.g., GIF, JPEG, or an equivalent) without errors.
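To make the "loads additional or included documents" step more concrete, here is a minimal sketch that scans retrieved markup for included resources and resolves their addresses against the page URL. It is an assumption-laden illustration, not the patented implementation: `IncludedResourceFinder`, `find_included_resources`, and the `SRC_TAGS` set are hypothetical names, only tags that carry a `src` attribute are covered, and applets are left out because their `code` and `codebase` attributes resolve differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags whose "src" attribute points at a document the crawler would load next.
# (Hypothetical selection for this sketch.)
SRC_TAGS = {"iframe", "frame", "img", "audio", "video", "source", "embed", "script"}


class IncludedResourceFinder(HTMLParser):
    """Records (tag, absolute URL) pairs for resources included by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        if tag in SRC_TAGS:
            src = dict(attrs).get("src")
            if src:
                # Resolve relative references against the page's own address.
                self.resources.append((tag, urljoin(self.base_url, src)))


def find_included_resources(base_url, markup):
    """Return the included resources of a page in document order."""
    finder = IncludedResourceFinder(base_url)
    finder.feed(markup)
    finder.close()
    return finder.resources
```

A crawler following the abstract would fetch each returned URL in turn and repeat the scan on any frame documents before running scripts and summarizing the composed result.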
13 Claims
1. A method for indexing dynamic data documents, the method comprising:
retrieving, to a server, with a web crawler from a network address, a dynamic data document with client-side scripting code therein;
executing, at the server, a web-browser, as part of the web crawler, wherein the web-browser renders an in-memory copy of the dynamic data document which has been retrieved, wherein the in-memory copy of the dynamic data document maintains a rendered web-browser display format and a rendered web-browser display layout of the dynamic data document when the web-browser renders the in-memory copy of the dynamic data document;
executing, at the server instead of a client system, a browser scripting engine as part of the web-browser, wherein the browser scripting engine executes the client-side scripting code and loads content as directed by the client-side scripting code into the in-memory copy creating a final web-browser display representation of the dynamic data document so that the final web-browser display representation is substantially similar to when the dynamic data document is rendered at a user's web-browser and viewed by a user in the user's web-browser running on the client system when all the dynamic data is viewed; and
indexing, at the server, the content in the memory, wherein the content being indexed is the content which has been loaded by the browser scripting engine in order to index the dynamic data document as if being viewed by the user in the user's web-browser on the client system.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
Specification