System and method for enhanced browser-based web crawling

US 7,519,902 B1
Filed: 06/30/2000
Issued: 04/14/2009
Est. Priority Date: 06/30/2000
Status: Expired due to Fees

Abstract

This invention pioneers an enhanced crawling mechanism and technique called “Enhanced Browser Based Web Crawling”. It permits the fault-tolerant gathering of dynamic data documents on the World Wide Web (WWW). The Enhanced Browser Based Web Crawler technology of this invention is implemented by incorporating the intricate functionality of a web browser into the crawler engine so that documents are properly analyzed. Essentially, the Enhanced Browser Based Crawler acts similarly to a web browser after retrieving the initially requested document. It then loads additional or included documents as needed or required (e.g. inline-frames, frames, images, applets, audio, video, or equivalents.). The Crawler then executes client side script or code and produces the final HTML markup. This final HTML markup is ordinarily used for the rendering for user presentation process. However, unlike a web browser this invention does not render the composed document for viewing purposes. Rather it analyzes or summarizes it, thereby extracting valuable metadata and other important information contained within the document. Also, this invention introduces the integration of optical character recognition (OCR) techniques into the crawler architecture. The reason for this is to enable the web crawler summarization process to properly summarize image content (e.g. GIF, JPEG or an equivalent) without errors.
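
The extraction stage described in the abstract can be illustrated with a minimal Python sketch. It assumes the final HTML markup has already been produced (scripts executed, included documents loaded) and only shows how visible text, image alternate text, and OCR output might be gathered into one list. The names `TextMapBuilder` and `build_text_map` and the `ocr` callable are hypothetical and do not appear in the patent; the OCR engine is left as a pluggable function because the abstract does not name one.

```python
from html.parser import HTMLParser
from typing import Callable, List, Optional


class TextMapBuilder(HTMLParser):
    """Collects visible text and image text from final HTML markup."""

    def __init__(self, ocr: Optional[Callable[[str], str]] = None):
        super().__init__()
        self.text_map: List[str] = []
        self._ocr = ocr          # hypothetical OCR hook: image URL -> text
        self._skip_depth = 0     # greater than 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            attr = dict(attrs)
            alt = (attr.get("alt") or "").strip()
            if alt:
                self.text_map.append(alt)
            src = attr.get("src")
            if self._ocr and src:
                recognized = self._ocr(src).strip()
                if recognized:
                    self.text_map.append(recognized)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.text_map.append(text)


def build_text_map(final_html: str,
                   ocr: Optional[Callable[[str], str]] = None) -> List[str]:
    """Return the text fragments found in already-rendered markup."""
    builder = TextMapBuilder(ocr)
    builder.feed(final_html)
    builder.close()
    return builder.text_map


page = ('<html><body><h1>Sale</h1>'
        '<img src="banner.gif" alt="Big banner">'
        '<script>var x = 1;</script><p>Ends Friday</p></body></html>')
# A stand-in OCR function; a real engine would decode the image bytes.
fake_ocr = lambda src: "50% OFF" if src == "banner.gif" else ""
print(build_text_map(page, ocr=fake_ocr))
```

Passing the OCR step as a callable keeps the sketch runnable with the standard library alone; which engine fills that role is an open choice here.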

13 Claims

1. A method for indexing dynamic data documents, the method comprising:
- retrieving, to a server, with a web crawler from a network address, a dynamic data document with client-side scripting code therein;
  
  executing, at the server, a web-browser, as part of the web crawler, wherein the web-browser renders an in-memory copy of the dynamic data document which has been retrieved, wherein the in-memory copy of the dynamic data document maintains a rendered web-browser display format and a rendered web-browser display layout of the dynamic data document when the web-browser renders the in-memory copy of the dynamic data document;
  
  executing, at the server instead of a client system, a browser scripting engine as part of the web-browser, wherein the browser scripting engine executes the client-side scripting code and loads content as directed by the client-side scripting code into the in-memory copy creating a final web-browser display representation of the dynamic data document so that the final web-browser display representation is substantially similar to when the dynamic data document is rendered at a user's web-browser and viewed by a user in the user's web-browser running on the client system when all the dynamic data is viewed; and
  
  indexing, at the server, the content in the memory, wherein the content being indexed is the content which has been loaded by the browser scripting engine in order to index the dynamic data document as if being viewed by the user in the user's web-browser on the client system.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13):
  - 2. The method as defined in claim 1, wherein the content that is loaded into the in-memory copy comprises one or more images with textual content embedded therein including at least one of an in-line GIF image and an in-line JPEG image.
  - 3. The method as defined in claim 1, wherein the executing a browser scripting engine as part of the web-browser for loading content as directed by the client-side scripting code into the in-memory copy further comprises executing on the server one or more Java applets with textual content embedded therein.
  - 4. The method as defined in claim 3, wherein the executing a browser scripting engine as part of the web-browser for loading content as directed by the client-side scripting code into the in-memory copy further comprises executing on the server one or more Java Script components with textual content embedded therein.
  - 5. The method as defined in claim 1, wherein the executing a browser scripting engine as part of the web-browser for loading content as directed by the client-side scripting code into the in-memory copy further comprises executing the client-side scripting code on the server which directs the loading of web documents selected from the group of documents consisting of in-line frames, frames, and equivalents.
  - 6. The method as defined in claim 1, wherein the retrieving the dynamic data document further comprises performing the following sub-steps of:
    - initializing a first list with seed values;
      
      checking if there are any URLs to be processed and in response that any URL exists to be processed then performing the following sub-steps of;
      
      determining if a URL is in a second list; and
      
      in response that a URL is not in the second list then performing the following sub-steps of;
      
      inserting the URL into the first list;
      
      scheduling the URL for crawling;
      
      crawling the URL when scheduled to do so;
      
      removing the URL from the first list after the scheduled crawling;
      
      entering the URL into the second list; and
      
      repeating the checking step until there are no more URLs to be processed;
      
      where if the determining step determines that the URL is in the second list then repeating the checking step until there are no more URLs to be processed.
  - 7. The method as defined in claim 6, wherein the sub-step of initializing a first list with seed values further includes the list being a URL pool.
  - 8. The method as defined in claim 6, wherein the sub-step of determining if a URL is in a second list further includes the second list being a visited pool.
  - 9. The method as defined in claim 6, wherein the sub-step of crawling further comprises the sub-steps of:
    - issuing an HTTP command to a web server named in the URL;
      
      receiving contents of an HTML page as a result of the issued HTTP command; and
      
      passing on the contents of the HTML page to a Page Rendering subroutine.
  - 10. The method as defined in claim 9, further including the sub-steps performed by the Page Rendering subroutine comprising:
    - receiving the contents of the HTML page in the Page Rendering subroutine;
      
      building an in-memory copy of a web-browser layout for the HTML page and if more data is needed to properly form the copy, then performing the sub-steps of;
      
      requesting additional web-based information;
      
      gathering this additional web-based information;
      
      inserting any URLs associated with this additional web-based information into the second list and a URL cache;
      
      building a final amended in-memory copy; and
      
      forwarding the final amended in-memory copy to an Extraction subroutine;
      
      wherein, if no more data is needed to properly form the in-memory representation, then forwarding the in-memory copy to the Extraction subroutine.
  - 11. The method as defined in claim 10, further including the sub-steps performed by the Page Extraction subroutine comprising:
    - accessing a set of memory structures of the Page Renderer;
      
      copying a text portion of the structures into a text map;
      
      inspecting any in-line GIF and JPEG image references in the memory structures;
      
      extracting alternate text attributes;
      
      adding the alternate text attributes to a text map;
      
      invoking an optical character recognition engine;
      
      analyzing any in-line GIF and JPEG images using the optical character recognition engine for text content;
      
      extracting text content from the GIF and JPEG images;
      
      adding text content from the images to the text map; and
      
      forwarding the text map to a Page Summarizer subroutine.
  - 12. The method as defined in claim 11, further including the sub-steps performed by the Page Summarizer subroutine comprising:
    - receiving a text map from the Page Extractor subroutine;
      
      processing the text map in an application-specific manner;
      
      applying data extraction patterns to the text map;
      
      translating resultant data from the applying step;
      
      forwarding any URLs present in the text map to a manager subroutine; and
      
      forwarding any extracted data and metadata to application logic.
  - 13. The method of claim 1, wherein the indexing further comprises:
    - analyzing and summarizing, at the server, the final web-browser display representation of the dynamic data document to produce a text map for the dynamic data document; and
      
      using optical character recognition on the content that has been loaded into the in-memory copy to extract textual content for adding to the textual map for the dynamic data document.
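
The URL bookkeeping recited in claims 6 through 8 (a URL pool seeded with start values and a visited pool consulted before each crawl) can be sketched as a short Python loop. This is a simplified illustration, not the claimed method itself: the function name `crawl`, the `fetch_links` callable, and the `max_pages` cap are assumptions added for the example, and the sketch removes a URL from the pool before crawling it rather than after. `fetch_links` stands in for the whole fetch, render, extract, and summarize chain of claims 9 through 12 and simply returns the URLs discovered on a page.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Set


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], Iterable[str]],
          max_pages: int = 100) -> List[str]:
    """Crawl from the seeds and return URLs in the order they were visited."""
    url_pool: Deque[str] = deque(seeds)   # the "first list" (URL pool)
    visited_pool: Set[str] = set()        # the "second list" (visited pool)
    order: List[str] = []

    # Repeat while there are URLs left to process (and the cap is not hit).
    while url_pool and len(order) < max_pages:
        url = url_pool.popleft()
        if url in visited_pool:
            continue                      # already crawled, check the next URL
        visited_pool.add(url)
        order.append(url)
        # Hypothetical hook: crawl the page and collect the URLs it yields.
        for link in fetch_links(url):
            if link not in visited_pool:
                url_pool.append(link)
    return order


# Usage with an in-memory link graph instead of real HTTP requests.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["d"], "d": []}
print(crawl(["a"], lambda u: graph.get(u, [])))
```

The visited pool is a set so that the membership check stays cheap as the crawl grows; the cap on pages is only there to keep the example bounded.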

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Kraft, Reiner; Myllymaki, Jussi P.
Primary Examiner(s): Hong; Stephen S
Assistant Examiner(s): Stork; Kyle R
Application Number: US09/607,370
Time in Patent Office: 3,210 Days
Field of Search: 715/501.1, 715/500, 715/513, 715/511, 715/526, 715/523, 715/517, 715/530, 715/234, 715/243, 715/254, 709/225, 707/104.1
US Class Current: 715/234
CPC Class Codes: G06F 16/951 Indexing; Web crawling tech...
