Method and system for obtaining script related information for website crawling
Abstract
A web crawler system has an automatic website crawler and a virtual browser that provides script related information to the website crawler. The virtual browser transforms an HTML document included in a web page of the website into an XML document and, based on the XML document, builds a document object model (DOM) containing document objects in a tree structure. The virtual browser extracts potentially executable scripts from the DOM and executes them using a browser object model (BOM) provided for the virtual browser, which contains the objects, methods, and properties used for script execution, so as to capture script related information generated by execution of the scripts.
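The first three stages named in the abstract (HTML to XML, XML to DOM, script extraction) can be illustrated with a minimal sketch. It assumes Python's standard `html.parser` and `xml.etree.ElementTree` modules as stand-ins for the HTML transformer and DOM builder; the names `HtmlToXml`, `html_to_xml`, `build_dom`, `extract_scripts`, the synthetic `document` root element, and the `VOID_TAGS` list are hypothetical and are not taken from the patent.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption: a short, non-exhaustive list of HTML tags that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HtmlToXml(HTMLParser):
    """Rebuilds loosely structured HTML as a well-formed element tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # synthetic root guarantees one top-level element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        attrib = {name: (value or "") for name, value in attrs}
        node = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        ET.SubElement(self.stack[-1], tag, {name: (value or "") for name, value in attrs})

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        node = self.stack[-1]
        if len(node):  # text after a child element belongs to that child's tail
            node[-1].tail = (node[-1].tail or "") + data
        else:
            node.text = (node.text or "") + data


def html_to_xml(html: str) -> str:
    """HTML transformer: returns a well-formed XML string."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")


def build_dom(xml_text: str) -> ET.Element:
    """DOM builder: parses the XML document into a tree of element objects."""
    return ET.fromstring(xml_text)


def extract_scripts(dom: ET.Element) -> list:
    """Script extractor: collects the non-empty bodies of <script> elements."""
    bodies = ((el.text or "").strip() for el in dom.iter("script"))
    return [body for body in bodies if body]
```

For example, `extract_scripts(build_dom(html_to_xml(page_html)))` returns the inline script bodies of `page_html` in document order, even when the HTML leaves tags such as `<p>` unclosed.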
33 Claims
1. A virtual browser for obtaining script related information for website crawling, the virtual browser comprising:
an HTML transformer for transforming an HTML document included in a web page of the website into an XML document;
a DOM builder for building a document object model (DOM) based on the XML document;
a script extractor for extracting one or more scripts from the DOM;
a BOM provider for providing a browser object model (BOM) containing BOM objects and methods that are usable by the scripts for script execution; and
a script execution engine for executing the scripts extracted by the script extractor using one or more of the objects and methods of the BOM provided by the BOM provider to capture script related information generated by execution of the scripts.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
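How a BOM provider and a script execution engine might cooperate to capture script related information can be sketched as follows. This is an illustration under strong assumptions, not the patented implementation: the classes `Capture`, `Window`, and `Document` and the functions `provide_bom` and `execute_scripts` are hypothetical, and Python's `exec` stands in for a real JavaScript engine, so only statements that happen to be valid in both languages (such as `window.open("/a" + "b");`) will run. Clearing `__builtins__` is not a security sandbox.

```python
from dataclasses import dataclass, field


@dataclass
class Capture:
    """Script related information recorded during execution."""
    opened_urls: list = field(default_factory=list)
    written_html: list = field(default_factory=list)
    errors: list = field(default_factory=list)


class Window:
    """Hypothetical BOM object: records navigation instead of performing it."""

    def __init__(self, capture: Capture):
        self._capture = capture

    def open(self, url, *args):
        self._capture.opened_urls.append(url)


class Document:
    """Hypothetical BOM object: records markup written by scripts."""

    def __init__(self, capture: Capture):
        self._capture = capture

    def write(self, markup):
        self._capture.written_html.append(markup)


def provide_bom(capture: Capture) -> dict:
    """BOM provider: the names that scripts are allowed to see."""
    return {"window": Window(capture), "document": Document(capture)}


def execute_scripts(scripts) -> Capture:
    """Execution engine stand-in: runs each script against the BOM."""
    capture = Capture()
    bom = provide_bom(capture)
    for source in scripts:
        try:
            # Assumption: exec replaces a JavaScript interpreter in this sketch.
            exec(source, {"__builtins__": {}}, dict(bom))
        except Exception as exc:
            # A failing script is recorded and does not stop the crawl.
            capture.errors.append(f"{type(exc).__name__}: {exc}")
    return capture
```

The design point the sketch tries to show is that the BOM objects do no rendering or networking; they only record what the scripts asked the browser to do, which is the information a crawler needs.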
11. A method of obtaining script related information for website crawling, the method comprising the steps of:
receiving a web page of a website;
transforming an HTML document included in the web page into an XML document;
building a document object model (DOM) based on the XML document;
extracting one or more scripts from the DOM;
providing a browser object model (BOM) containing BOM objects and methods that are usable by the scripts for script execution;
executing the scripts extracted by the script extractor using one or more of the objects and methods of the BOM; and
capturing script related information generated by the execution of the scripts.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20, 21
22. A computer readable medium storing instructions or statements for use in the execution in a computer of a method of obtaining script related information for website crawling, the method comprising steps of:
receiving a web page of a website;
transforming an HTML document included in the web page into an XML document;
building a document object model (DOM) based on the XML document;
extracting one or more scripts from the DOM;
providing a browser object model (BOM) containing BOM objects and methods that are usable by the scripts for script execution;
executing the scripts extracted by the script extractor using one or more of the relevant objects and methods of the BOM; and
capturing script related information generated by the execution of the scripts.
23. A propagated signal carrier carrying signals containing computer executable instructions that can be read and executed by a computer, the computer executable instructions being used to execute a method of obtaining script related information for website crawling, the method comprising the steps of:
receiving a web page of a website;
transforming an HTML document included in the web page into an XML document;
building a document object model (DOM) based on the XML document;
extracting one or more scripts from the DOM;
providing a browser object model (BOM) containing BOM objects and methods that are usable by the scripts for script execution;
executing the scripts extracted by the script extractor using one or more of the relevant objects and methods of the BOM; and
capturing script related information generated by the execution of the scripts.
24. A URL resolution system for resolving Universal Resource Locators (URLs), the URL resolution system comprising:
a website crawler for crawling a website and for locating script code which is used to dynamically create at least one script URL; and
a script URL resolution component for causing examination of the script code located during the crawling and causing execution of the script code to obtain the script URL.
Dependent claims: 25, 26, 27, 28, 29
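The "locating script code" part of the URL resolution system can be sketched as a walk over an already-built DOM. The claim does not say where script code is found, so the three sources checked here (inline `<script>` bodies, `on*` event handler attributes, and `javascript:` hrefs) are assumptions, as is the function name `locate_script_code`.

```python
import xml.etree.ElementTree as ET


def locate_script_code(dom: ET.Element) -> list:
    """Returns (source_kind, code) pairs in document order."""
    found = []
    for el in dom.iter():
        # Inline script bodies.
        if el.tag == "script" and (el.text or "").strip():
            found.append(("inline", el.text.strip()))
        for name, value in el.attrib.items():
            name, value = name.lower(), value.strip()
            if name.startswith("on"):
                # Event handlers such as onclick or onload.
                found.append((name, value))
            elif name == "href" and value.lower().startswith("javascript:"):
                # Links whose target is computed by script code.
                found.append(("href", value[len("javascript:"):]))
    return found
```

Plain links such as `<a href="/static">` are skipped, since an ordinary crawler can already follow them without executing anything.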
30. A method for resolving Universal Resource Locators (URLs), the method comprising steps of:
locating script code which creates at least one script URL while crawling a website; and
examining the script code to obtain the script URL from the examination result by executing the script code.
Dependent claims: 31, 32, 33
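Obtaining a script URL by executing the script code, rather than by pattern matching on its text, can be illustrated with the following sketch. The names `RecordingWindow` and `resolve_script_urls` are hypothetical, `exec` is again only a stand-in for a JavaScript engine (it accepts just the statements that are valid in both languages), and resolving relative URLs against the page URL with `urllib.parse.urljoin` is an added assumption rather than something the claim states.

```python
from urllib.parse import urljoin


class RecordingWindow:
    """Hypothetical window object that records every URL a script navigates to."""

    def __init__(self):
        self.urls = []

    def open(self, url, *args):
        self.urls.append(url)

    @property
    def location(self):
        return self.urls[-1] if self.urls else ""

    @location.setter
    def location(self, url):
        # Assigning window.location counts as a navigation.
        self.urls.append(url)


def resolve_script_urls(script_code: str, page_url: str) -> list:
    """Executes the script code and returns absolute URLs it produced."""
    window = RecordingWindow()
    try:
        # Assumption: exec replaces a JavaScript interpreter in this sketch.
        exec(script_code, {"__builtins__": {}}, {"window": window})
    except Exception:
        return []  # code that cannot be executed yields no URLs
    return [urljoin(page_url, url) for url in window.urls]
```

Because the URL is assembled at run time (for instance by string concatenation), it only exists once the code has run, which is why execution is needed to recover it.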
Specification