Method and system for obtaining script related information for website crawling

US 20060190561A1
Filed: 03/03/2006
Published: 08/24/2006
Est. Priority Date: 06/19/2002
Status: Abandoned Application

Abstract

A web crawler system has an automatic website crawler and a virtual browser that provides script related information to the website crawler. The virtual browser transforms an HTML document included in a web page of the website into an XML document, and builds a document object model (DOM) containing document objects in a tree structure based on the XML document. The virtual browser extracts from the DOM scripts that are potentially executable, and executes the extracted scripts using a browser object model (BOM) provided for the virtual browser, containing objects, methods, and properties that are used for script execution, so as to capture script related information generated by execution of the scripts.
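The first three stages described in the abstract (HTML to XML, DOM building, script extraction) can be sketched as follows. This is only an illustrative sketch, not the applicants' implementation: the class `HtmlToXml`, the functions `build_dom` and `extract_scripts`, the list of event handler attributes, and the XPath-style location queries are all assumptions made for the example, and Python's standard `html.parser` and `xml.etree.ElementTree` modules stand in for the HTML transformer and DOM builder.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical "script location list": places where a script may reside.
SCRIPT_TAG_QUERY = ".//script"
EVENT_HANDLER_ATTRS = ("onclick", "onload", "onsubmit", "onmouseover")
VOID_TAGS = {"br", "img", "input", "meta", "link", "hr"}


class HtmlToXml(HTMLParser):
    """Turns lenient HTML into a well-formed element tree (the XML document)."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self._stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:  # void elements never get an end tag
            self._stack.append(elem)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        elem = self._stack[-1]
        if len(elem):  # text after a child element is stored as that child's tail
            elem[-1].tail = (elem[-1].tail or "") + data
        else:
            elem.text = (elem.text or "") + data


def build_dom(html):
    """Builds a tree-structured DOM from an HTML string."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return parser.root


def extract_scripts(dom):
    """Returns (location, source) pairs for every script located in the DOM."""
    found = []
    for elem in dom.iterfind(SCRIPT_TAG_QUERY):
        if elem.text and elem.text.strip():
            found.append(("script", elem.text.strip()))
    for attr in EVENT_HANDLER_ATTRS:
        # One location query per event handler attribute.
        for elem in dom.iterfind(f".//*[@{attr}]"):
            found.append((f"{elem.tag}@{attr}", elem.get(attr)))
    return found


page = """<html><body onload="init()">
<script>location.href = '/news/' + 'today.html';</script>
<a href="#" onclick="window.open('/popup.html')">Open</a>
</body></html>"""

for location, source in extract_scripts(build_dom(page)):
    print(location, "->", source)
```

Executing the extracted script text would additionally require a script engine, which the Python standard library does not provide and which this sketch deliberately leaves out.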

33 Claims

1. A virtual browser for obtaining script related information for website crawling, the virtual browser comprising:
- an HTML transformer for transforming an HTML document included in a web page of the website into an XML document;
  
  a DOM builder for building a document object model (DOM) based on the XML document;
  
  a script extractor for extracting one or more scripts from the DOM;
  
  a BOM provider for providing a browser object model (BOM) containing BOM objects and methods that are usable by the scripts for script execution; and
  
  a script execution engine for executing the scripts extracted by the script extractor using one or more of the objects and methods of the BOM provided by the BOM provider to capture script related information generated by execution of the scripts.
- Dependent claims:
  - 2. The virtual browser as claimed in claim 1, wherein the script related information includes a URL generated by a script, HTML content generated by a script, a cookie generated by a script, and/or an HTTP request initiated by scripts.
  - 3. The virtual browser as claimed in claim 1, wherein the DOM builder builds the DOM having a tree structure representing elements in the HTML document as represented by the XML document.
  - 4. The virtual browser as claimed in claim 1, wherein the script extractor comprises:
    - a script location list containing potential locations for a script to reside in the DOM;
      
      a script locator for locating the scripts in the DOM using the script location list; and
      
      a script extraction handler for handling extraction of the located scripts.
  - 5. The virtual browser as claimed in claim 4, wherein the script location list includes location information of scripts related to specified tags and event handlers.
  - 6. The virtual browser as claimed in claim 4, wherein the script extractor further comprises a set of location queries that permit extraction of scripts contained in event handlers; and the script extraction handler extracts a script contained in an event handler in the DOM using a relevant location query.
  - 7. The virtual browser as claimed in claim 1, wherein the BOM provider provides the BOM objects that allow capturing of the script related information during the execution of the scripts.
  - 8. The virtual browser as claimed in claim 7, wherein the virtual browser further comprises an information handler for interfacing with the BOM objects to capture the script related information generated by the script execution.
  - 9. The virtual browser as claimed in claim 1, wherein the BOM provider provides the BOM objects that allow retrieval, modification, addition and/or deletion of information contained in the DOM by one or more of the scripts.
  - 10. A web crawler system for crawling a website, the web crawler system comprising:
    - a website crawler for automatically crawling the website; and
      
      the virtual browser recited in claim 1.

11. A method of obtaining script related information for website crawling, the method comprising the steps of:
- receiving a web page of a website;
  
  transforming an HTML document included in the web page into an XML document;
  
  building a document object model (DOM) based on the XML document;
  
  extracting one or more scripts from the DOM;
  
  providing a browser object model (BOM) containing BOM objects and methods that are usable by the scripts for script execution;
  
  executing the scripts extracted by the script extractor using one or more of the objects and methods of the BOM; and
  
  capturing script related information generated by the execution of the scripts.
- Dependent claims:
  - 12. The method as claimed in claim 11, wherein the capturing step captures the script related information including a URL generated by a script, HTML content generated by a script, a cookie generated by a script, and/or an HTTP request initiated by a script.
  - 13. The method as claimed in claim 11, wherein the DOM building step builds the DOM having a tree structure representing elements in the HTML document as represented by the XML document.
  - 14. The method as claimed in claim 11, wherein the script extracting step comprises the step of locating the scripts in the DOM using a script location list containing potential locations for a script to reside in the DOM.
  - 15. The method as claimed in claim 14, wherein the script locating step uses the script location list including location information of scripts related to specified tags and event handlers.
  - 16. The method as claimed in claim 14, wherein the script extracting step comprises the steps of:
    - providing a set of location queries that permit extraction of scripts contained in event handlers; and
      
      extracting a script contained in an event handler in the DOM using a relevant location query selected from the set of location queries.
  - 17. The method as claimed in claim 11, wherein the BOM providing step provides the BOM objects that allow capturing of the script related information.
  - 18. The method as claimed in claim 17, wherein:
    - the executing step allows the scripts to make calls into relevant ones of the BOM objects; and
      
      the capturing step interfaces with the BOM objects to capture the script related information generated by the script execution.
  - 19. The method as claimed in claim 11, wherein the BOM providing step provides the BOM objects that allow changes of information contained in the DOM by execution of one or more scripts.
  - 20. The method as claimed in claim 11, wherein the BOM providing step provides the BOM objects that are free of behaviours that are undesirable for performing web crawling.
  - 21. The method as claimed in claim 11 further comprising the steps of:
    - providing the script related information to a website crawler; and
      
      automatically crawling the website by the website crawler using the script related information.
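The BOM-related steps of claims 11 to 20 (providing BOM objects that scripts call into, and capturing what those calls generate) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the class names `CaptureLog`, `Document`, `Location`, and `Window`, the choice of `alert` as an example of a behaviour that is undesirable while crawling, and above all the use of a plain Python callable as a stand-in for a script run by a real script execution engine.

```python
from urllib.parse import urljoin


class CaptureLog:
    """Collects script related information: URLs, HTML content, and cookies."""

    def __init__(self):
        self.urls, self.html, self.cookies = [], [], []


class Document:
    def __init__(self, log):
        self._log = log

    def write(self, markup):
        self._log.html.append(markup)  # HTML content generated by a script

    @property
    def cookie(self):
        return "; ".join(self._log.cookies)

    @cookie.setter
    def cookie(self, value):
        self._log.cookies.append(value)  # cookie generated by a script


class Location:
    def __init__(self, log, base_url):
        self._log, self._base = log, base_url

    @property
    def href(self):
        return self._base

    @href.setter
    def href(self, value):
        # Record the navigation target instead of navigating.
        self._log.urls.append(urljoin(self._base, value))


class Window:
    def __init__(self, log, base_url):
        self._log, self._base = log, base_url
        self.document = Document(log)
        self.location = Location(log, base_url)

    def open(self, url, *args):
        self._log.urls.append(urljoin(self._base, url))  # no window is opened

    def alert(self, message):
        pass  # assumed example of an undesirable behaviour: a blocking dialog


def execute(scripts, base_url):
    """Runs each stand-in script against a fresh BOM and returns the capture log."""
    log = CaptureLog()
    window = Window(log, base_url)
    for script in scripts:
        try:
            script(window)
        except Exception:
            continue  # one failing script should not stop the others
    return log


def stand_in_script(window):
    # Python stand-in for what a page script might do through the BOM.
    window.document.cookie = "visited=1"
    window.document.write('<a href="/news">News</a>')
    window.location.href = "products/" + "list.html"
    window.alert("Welcome")
    window.open("/popup.html")


log = execute([stand_in_script], "http://example.com/shop/index.html")
print(log.urls)
print(log.cookies)
print(log.html)
```

The design point is that the BOM objects record side effects rather than performing them, so the captured URLs can be handed to the website crawler without the virtual browser ever navigating or opening a window.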

22. A computer readable medium storing instructions or statements for use in the execution in a computer of a method of obtaining script related information for website crawling, the method comprising steps of:
- receiving a web page of a website;
  
  transforming an HTML document included in the web page into an XML document;
  
  building a document object model (DOM) based on the XML document;
  
  extracting one or more scripts from the DOM;
  
  providing a browser object model (BOM) containing BOM objects and methods that are usable by the scripts for script execution;
  
  executing the scripts extracted by the script extractor using one or more of the relevant objects and methods of the BOM; and
  
  capturing script related information generated by the execution of the scripts.

23. A propagated signal carrier carrying signals containing computer executable instructions that can be read and executed by a computer, the computer executable instructions being used to execute a method of obtaining script related information for website crawling, the method comprising the steps of:
- receiving a web page of a website;
  
  transforming an HTML document included in the web page into an XML document;
  
  building a document object model (DOM) based on the XML document;
  
  extracting one or more scripts from the DOM;
  
  providing a browser object model (BOM) containing BOM objects and methods that are usable by the scripts for script execution;
  
  executing the scripts extracted by the script extractor using one or more of the relevant objects and methods of the BOM; and
  
  capturing script related information generated by the execution of the scripts.

24. A URL resolution system for resolving Universal Resource Locators (URLs), the URL resolution system comprising:
- a website crawler for crawling a website and for locating script code which is used to dynamically create at least one script URL; and
  
  a script URL resolution component for causing examination of the script code located during the crawling and causing execution of the script code to obtain the script URL.
- Dependent claims:
  - 25. The URL resolution system as claimed in claim 24 wherein the website includes one or more web pages, and the website crawler crawls individual web pages associated with websites, and has a crawling controller for controlling the website crawler.
  - 26. The URL resolution system as claimed in claim 25 wherein the website crawler has a script code detector for determining if a web page uses script code to dynamically create at least one script URL.
  - 27. The URL resolution system as claimed in claim 26 wherein the script code detector has a notification generating function for generating a notification when the script code detector locates a web page that uses script code to dynamically create at least one script URL.
  - 28. The URL resolution system as claimed in claim 25 wherein the crawling controller receives results of script code examination from the script URL resolution component, and controls the website crawler based on the examination results.
  - 29. The URL resolution system as claimed in claim 24 wherein the website includes one or more web pages, the script code has a specific part that is used to create the script URL, and the script URL resolution component comprises:
    - a web page loading controller for instructing a web page examiner to load the web page located by the website crawler; and
      
      a script code execution controller for instructing the web page examiner to execute the specific part of the script code used in the loaded web page to obtain the script URL.

30. A method for resolving Universal Resource Locators (URLs), the method comprising steps of:
- locating script code which creates at least one script URL while crawling a website; and
  
  examining the script code to obtain the script URL from the examination result by executing the script code.
- Dependent claims:
  - 31. The method as claimed in claim 30 wherein a website has one or more web pages;
    - the locating step locates a web page that uses script code to dynamically create at least one script URL, the script code having a specific part that is used for the creation of the script URL; and
      
      the examination step comprises steps of:
      
      loading the located web page; and
      
      executing the specific part of the script code in the loaded web page to resolve the script URL.
  - 32. The method as claimed in claim 31 further comprising a step of continuing crawling of a web page identified by the script URL.
  - 33. The method as claimed in claim 30 further comprising steps of:
    - obtaining examination results including the script URL when the examination step is successful and a failure result when the examination step fails to obtain the script URL; and
      
      presenting to a user the examination result including the script URL and/or the failure result.
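The URL resolution flow of claims 24 to 33 (locate script code that creates URLs while crawling, examine it by execution, continue crawling the resolved URLs, and report successes and failures) might be sketched as below. The regular expressions used as a script code detector, the in-memory `SITE` dictionary, and the `fake_resolve` function are all hypothetical; `fake_resolve` merely stands in for a script URL resolution component that would actually load the page and execute its script code.

```python
import re
from collections import deque

STATIC_LINK = re.compile(r'href="([^"#]+)"')
# Hypothetical detector: script text that appears to build a URL at run time.
SCRIPT_URL_HINT = re.compile(r"location\.href\s*=|window\.open\(")


def crawl(start, fetch, resolve):
    """Breadth-first crawl; returns (visited URLs, per-page examination results)."""
    queue, seen = deque([start]), {start}
    visited, results = [], {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        links = STATIC_LINK.findall(html)
        if SCRIPT_URL_HINT.search(html):
            try:
                script_urls = resolve(url, html)  # examination by execution
                results[url] = ("resolved", script_urls)
                links.extend(script_urls)
            except Exception as exc:
                results[url] = ("failed", str(exc))
        for link in links:
            if link not in seen:  # continue crawling newly found pages
                seen.add(link)
                queue.append(link)
    return visited, results


# Hypothetical three-page website held in memory.
SITE = {
    "/": '<a href="/about">About</a><script>location.href = base + "list";</script>',
    "/about": "<p>About us</p>",
    "/products/list": "<script>window.open(popupUrl());</script>",
}


def fake_resolve(url, html):
    # Stand-in resolver: succeeds for the home page, fails everywhere else.
    if url == "/":
        return ["/products/list"]
    raise RuntimeError("script could not be executed")


visited, results = crawl("/", SITE.get, fake_resolve)
for page, (status, detail) in results.items():
    print(page, status, detail)  # present examination results to the user
```

Passing `fetch` and `resolve` in as callables keeps the crawling controller independent of how pages are loaded and how scripts are executed, which mirrors the separation between the website crawler and the script URL resolution component in claim 24.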


Current Assignee: International Business Machines Corporation
Original Assignee: Watchfire Corp. (Ontario) (International Business Machines Corporation)
Inventors: Chorneyko, Darcy Steven; Grancharov, Constantine; Conboy, Craig; Smith, Duncan; McDougall, Derek Lawrence Ross; Rolleston, Andrew
Application Number: US11/367,752
Publication Number: US 20060190561A1
US Class Current: 709/217
CPC Class Codes: G06F 16/951 Indexing; Web crawling tech...
