System and method for automatically gathering dynamic content and resources on the world wide web by simulating user interaction and managing session information
Abstract
An apparatus and method for a web crawler to automatically simulate user interaction with a dynamic website in order to gather and extract information from the site. This interactive web crawler will be able to create a search query string for any one of a number of desired search topics and systematically crawl dynamic personalized content on a website and retrieve the information desired by the user/client.
20 Claims
1. An automated method of gathering dynamic content and resources on the world wide web by simulating user interaction and managing session information, the method comprising the steps of:
providing a site database of dynamic websites requiring interaction to download contents thereof, said site database containing session data for the dynamic websites and document type definitions (“DTD”) including descriptions of how to interact with the dynamic websites;
identifying and retrieving at least one uniform resource locator (“URL”) for a dynamic website to be analyzed;
identifying and retrieving a session data and DTD for said URL from the site database;
creating a query template for the retrieved URL using said identified DTD describing how to interact with the URL to simulate user interaction;
identifying at least one search topic to be searched on said URL;
inserting said at least one search topic into said query template to form a search query string;
querying said URL with said query string comprising said identified DTD and said at least one search topic;
retrieving at least one result of said query, thereby automatically simulating user interaction with said dynamic website to gather and extract said at least one result.
2. The method of claim 1 further comprising the steps of:
determining if said URL is to be searched with at least one additional search topic;
performing at least one additional query of said URL with said DTD and said at least one additional search topic;
retrieving at least one result of said at least one additional search topic query; and
repeating the foregoing steps for a plurality of at least one additional search topic to be searched on said URL.
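For illustration only, and not as part of the claims, the steps above can be sketched in Python. The sketch assumes the stored interaction description reduces to a form-field name plus fixed parameters; the claims describe it only as a DTD. `SiteRecord`, `SITE_DB`, `make_query_template`, `build_queries`, and the example.com URL are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass
class SiteRecord:
    """Hypothetical site-database entry: session data plus an interaction description."""
    url: str
    session: dict        # e.g. cookie name/value pairs captured earlier
    topic_field: str     # assumed: name of the form field that carries the search topic
    fixed_fields: dict   # assumed: other fields the site expects on every query


# Hypothetical in-memory stand-in for the site database.
SITE_DB = {
    "https://example.com/search": SiteRecord(
        url="https://example.com/search",
        session={"SESSIONID": "abc123"},
        topic_field="q",
        fixed_fields={"lang": "en"},
    )
}


def make_query_template(record: SiteRecord):
    """Return a function that turns one search topic into a full query string."""
    def fill(topic: str) -> str:
        params = dict(record.fixed_fields)
        params[record.topic_field] = topic
        return f"{record.url}?{urlencode(params)}"
    return fill


def build_queries(url: str, topics: list[str]) -> list[str]:
    """Look up the site record, create its template, and fill it once per topic."""
    record = SITE_DB[url]
    template = make_query_template(record)
    return [template(topic) for topic in topics]


queries = build_queries("https://example.com/search", ["web crawler", "cookies"])
for query in queries:
    print(query)
```

Creating the template once per site and filling it once per topic mirrors the repetition over additional search topics recited above. Sending each query string over HTTP is left out so the sketch runs without network access.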
3. The method of claim 1 wherein said search query string is adapted to be submitted to said URL to perform a hypertext transfer protocol request.
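As a rough illustration, not drawn from the patent text, a query string can be wrapped in an HTTP request object that also carries stored session data. The sketch assumes the session data is a set of cookie name/value pairs and that the request uses GET; `build_request` and the example URL are hypothetical.

```python
from urllib.request import Request


def build_request(query_string: str, session: dict) -> Request:
    """Wrap a query string in a GET request and attach session cookies, if any."""
    request = Request(query_string, method="GET")
    if session:
        cookie = "; ".join(f"{name}={value}" for name, value in session.items())
        request.add_header("Cookie", cookie)
    return request


request = build_request(
    "https://example.com/search?lang=en&q=web+crawler",
    {"SESSIONID": "abc123"},
)
print(request.get_method(), request.full_url)
print(request.get_header("Cookie"))
```

The request is only constructed here. Passing it to `urllib.request.urlopen` would actually submit it.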
4. The method of claim 1 further comprising the steps, after the step of retrieving at least one search result, of:
determining if additional search results are available;
performing a page navigation to retrieve at least one additional search result from at least one page of search results.
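One possible way to picture the page-navigation step is the following sketch. It assumes that further results are signalled by an `<a rel="next">` link, which is an assumption of this example and not something the claims specify. `NextLinkFinder`, `collect_pages`, and the `FAKE_SITE` pages are hypothetical, and the fetch function is injected so the sketch runs offline.

```python
from html.parser import HTMLParser
from typing import Callable, Optional
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a rel="next"> link found in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.next_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("rel") == "next" and self.next_href is None:
            self.next_href = attributes.get("href")


def collect_pages(start_url: str, fetch: Callable[[str], str], max_pages: int = 10) -> list:
    """Follow "next" links from start_url, returning the HTML of each result page."""
    pages = []
    seen = set()
    url: Optional[str] = start_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)  # guard against navigation loops
        html = fetch(url)
        pages.append(html)
        finder = NextLinkFinder()
        finder.feed(html)
        url = urljoin(url, finder.next_href) if finder.next_href else None
    return pages


# Hypothetical two-page result set standing in for a live website.
FAKE_SITE = {
    "https://example.com/results/page1": '<p>result 1</p><a rel="next" href="page2">Next</a>',
    "https://example.com/results/page2": "<p>result 2</p>",
}
pages = collect_pages("https://example.com/results/page1", FAKE_SITE.__getitem__)
print(len(pages))
```

The `max_pages` limit and the `seen` set are defensive choices of this sketch so that a site with circular paging links cannot keep the crawler running indefinitely.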
5. An article of manufacture comprising:
a site database of dynamic websites requiring interaction to download contents thereof, said site database containing session data for the dynamic websites and document type definitions (“DTD”) including descriptions of how to interact with the dynamic websites; and
a computer usable medium having computer readable program code means for automatically gathering dynamic content and resources on the world wide web by simulating user interaction and managing session information, the computer readable program code means in said article of manufacture comprising;
computer readable program code means to identify and retrieve a URL for a dynamic website to be queried;
computer readable program code means to identify and retrieve a session data and DTD for said URL from the site database;
computer readable program code means to create a query template for the retrieved URL using said identified DTD describing how to interact with the URL to simulate user interaction;
computer readable program code means to identify at least one search topic to be searched on said URL;
computer readable program code means to insert said at least one search topic into said query template to form a search query string;
computer readable program code means to query said URL with said query string comprising said identified DTD and said at least one search topic;
computer readable program code means to retrieve at least one result of said query, thereby automatically simulating user interaction with said dynamic website to gather and extract said at least one result.
6. The article of claim 5 further comprising:
computer readable program code means to determine if said URL is to be searched with at least one additional search topic;
computer readable program code means to perform at least one additional query of said URL with said DTD and said at least one additional search topic;
computer readable program code means to retrieve at least one result of said at least one additional query; and
computer readable program code means to repeat the foregoing steps for a plurality of at least one additional search topic to be searched on said URL.
7. The article of claim 5 wherein said search query string is adapted to be submitted to said URL to perform a hypertext transfer protocol request.
8. The article of claim 5 further comprising:
computer readable program code means for determining if additional search results are available;
computer readable program code means for performing a page navigation to retrieve at least one additional search result from at least one page of search results.
9. A computer program product comprising:
a site database of dynamic websites requiring interaction to download contents thereof, said site database containing session data for the dynamic websites and document type definitions (“DTD”) including descriptions of how to interact with the dynamic websites; and
a computer usable medium having computer readable program code means embodied in said medium for automatically gathering dynamic content and resources on the world wide web by simulating user interaction and managing session information, said computer program product having;
computer readable program code means for causing a computer to identify and retrieve a URL for a dynamic website to be queried;
computer readable program code means for causing a computer to identify and retrieve a session data and DTD for said URL from the site database;
computer readable program code means to create a query template for the retrieved URL using said identified DTD describing how to interact with the URL to simulate user interaction;
computer readable program code means for causing a computer to identify at least one search topic to be searched on said URL;
computer readable program code means to insert said at least one search topic into said query template to form a search query string;
computer readable program code means for causing a computer to query said URL with said query string comprising said identified DTD and said at least one search topic;
computer readable program code means for causing a computer to retrieve at least one result of said query, thereby automatically simulating user interaction with said dynamic website to gather and extract said at least one result.
10. The computer product of claim 9 further comprising:
computer readable program code means for causing a computer to determine if said URL is to be searched with a second search topic;
computer readable program code means for causing a computer to perform a second query of said URL with said DTD and said second search topic;
computer readable program code means for causing a computer to retrieve at least one result of said second query; and
computer readable program code means for causing a computer to repeat the foregoing steps for a plurality of search topics to be searched on said URL.
11. The computer product of claim 9 wherein said search query string is adapted to be submitted to said URL to perform a hypertext transfer protocol request.
12. The computer product of claim 9 further comprising:
computer readable program code means for causing a computer to determine if additional search results are available;
computer readable program code means for causing a computer to perform a page navigation to retrieve at least one additional search result from at least one page of search results.
13. A computer program product for automatically gathering dynamic content and resources on the world wide web, said computer program product comprising:
a site database of dynamic websites requiring interaction to download contents thereof, said site database containing session data for the dynamic websites and document type definitions including descriptions of how to interact with the dynamic websites; and
a computer usable medium having computer readable program code means embodied in said medium for causing a computer to simulate user interaction and managing session information with a website, said computer program product having;
computer readable program code means for causing a computer to determine at least one dynamic website to be searched, said website having a uniform resource locator;
computer readable program code means for causing a computer to determine a session data and document type definition, from the site database, for said at least one dynamic website to be searched;
computer readable program code means for causing a computer to create a query template for a website to simulate user interaction, said query template containing said uniform resource locator and said document type definition describing how to interact with the uniform resource locator;
computer readable program code means for causing a computer to determine at least one search topic to be searched on said website;
computer readable program code means for causing a computer to insert said topic into said query template to form a search query string;
computer readable program code means for causing a computer to query said website with said query string;
computer readable program code means for causing a computer to receive at least one result from said query;
computer readable program code means for causing a computer to determine if there is a second search topic to be searched on said website;
computer readable program code means for causing a computer to create a second search query string containing said uniform resource locator and said document type definition for said website and said second topic to be searched;
computer readable program code means for causing a computer to execute a second query of said website with said second search query string;
computer readable program code means for causing a computer to receive at least one result from said second query;
computer readable program code means for causing a computer to execute a plurality of queries for a plurality of search topics to be searched on said website, thereby automatically simulating user interaction with said website to gather and extract results from said website.
14. An automated method of gathering and extracting content and information from a dynamic website comprising the steps of:
identifying and retrieving a uniform resource locator (“URL”) for a website to be searched;
determining from the site database if said URL is a dynamic website requiring interaction to download content thereof;
if said URL is a dynamic website, obtaining a session data for said URL and storing said data in a site database of dynamic websites, said site database further containing document type definitions including descriptions of how to interact with the dynamic websites;
formatting a query template for said URL using said session data and a document type definition describing how to interact with the dynamic website from said site database to simulate user interaction;
formatting said query template with a first topic to be searched to form a first search query string;
performing a hypertext transfer protocol request of said dynamic website with said first search query string;
processing a first set of search results for said first search query string, thereby automatically simulating user interaction with said dynamic website to gather and extract said set of search results from said dynamic website.
15. The method of claim 14 further comprising the steps of:
determining if there is at least one additional topic to be searched on said website;
inserting said at least one additional topic into said search query string to form at least one additional topic search query string;
performing a hypertext transfer protocol request of said website with said at least one additional topic search query string;
processing at least one additional topic set of search results for said at least one additional topic search query string;
repeating the foregoing for a plurality of at least one additional topic to be searched on said website.
16. The method of claim 14 wherein said step of determining if said URL is a dynamic website further comprises the steps of:
performing a hypertext transfer protocol GET method of said website;
downloading a content of said website into said site database, said content containing a header;
scanning said header for said session data, said session data represented by a cookie.
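The header scan described above might look like the following sketch, offered as an illustration only. It assumes the response headers are available as (name, value) pairs and treats the presence of any `Set-Cookie` header as the sign of session data. `extract_session_data`, `is_dynamic`, and the sample headers are hypothetical.

```python
from http.cookies import SimpleCookie


def extract_session_data(headers: list) -> dict:
    """Collect cookie name/value pairs from any Set-Cookie response headers."""
    session = {}
    for name, value in headers:
        if name.lower() != "set-cookie":
            continue
        cookie = SimpleCookie()
        cookie.load(value)  # parses "NAME=value; Path=/; ..." into morsels
        for key, morsel in cookie.items():
            session[key] = morsel.value
    return session


def is_dynamic(headers: list) -> bool:
    """Assumed heuristic: a site that sets a cookie is treated as dynamic."""
    return bool(extract_session_data(headers))


# Hypothetical headers as they might come back from a GET of the website.
headers = [
    ("Content-Type", "text/html"),
    ("Set-Cookie", "SESSIONID=abc123; Path=/; HttpOnly"),
]
session = extract_session_data(headers)
print(session, is_dynamic(headers))
```

The resulting dictionary is the kind of value that could be stored in a site database and attached to later queries.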
17. An article of manufacture comprising:
a site database of dynamic websites requiring interaction to download contents thereof, said site database containing session data for the dynamic websites and document type definitions including descriptions of how to interact with the dynamic websites; and
a computer usable medium having computer readable program code means for automatically gathering and extracting content and information from a dynamic website, the computer readable program code means in said article of manufacture comprising;
computer readable program code means to identify and retrieve a URL for a website to be queried;
computer readable program code means to determine if said URL is a dynamic website requiring interaction to download content thereof;
computer readable program code means for obtaining a session data for said URL and storing said data in said site database;
computer readable program code means for formatting a query template for said URL using said session data and a document type definition describing how to interact with the dynamic website from said site database to simulate user interaction;
computer readable program code means for formatting said query template with a first topic to be searched to form a first search query string;
computer readable program code means for performing a hypertext transfer protocol request of said dynamic website with said first search query string;
computer readable program code means for processing a first set of search results for said first search query string, thereby automatically simulating user interaction with said dynamic website to gather and extract said set of search results from said dynamic website.
18. The article of claim 17 further comprising:
computer readable program code means for performing a hypertext transfer protocol GET method of said website;
computer readable program code means for downloading a content of said website into said site database, said content containing a header;
computer readable program code means scanning said header for said session data, said session data represented by a cookie.
19. A computer program product comprising:
a site database of dynamic websites requiring interaction to download contents thereof, said site database containing session data for the dynamic websites and document type definitions including descriptions of how to interact with the dynamic websites; and
a computer usable medium having computer readable program code means embodied in said medium for gathering and extracting content and information from a dynamic website, said computer program product having;
computer readable program code means for causing a computer to identify and retrieve a uniform resource locator (“URL”) for a website to be searched;
computer readable program code means for causing a computer to determine if said URL is a dynamic website requiring interaction to download content thereof;
computer readable program code means for causing a computer to obtain a session data for said URL and storing said data in said site database;
computer readable program code means for causing a computer to format a query template for said URL using said session data and a document type definition describing how to interact with the dynamic website from said site database to simulate user interaction;
computer readable program code means for causing a computer to format said query template with a first topic to be searched to form a first search query string;
computer readable program code means for causing a computer to perform a hypertext transfer protocol request of said dynamic website with said first search query string;
computer readable program code means for causing a computer to process a first set of search results for said first search query string, thereby automatically simulating user interaction with said dynamic website to gather and extract said set of search results from said dynamic website.
20. The computer program product of claim 19 further comprising:
computer readable program code means for causing a computer to perform a hypertext transfer protocol GET method of said website;
computer readable program code means for causing a computer to download a content of said website into said site database, said content containing a header;
computer readable program code means for causing a computer to scan said header for said session data, said session data represented by a cookie.
Specification