Automated access to web content based on log analysis
Abstract
The present invention provides a manner for providing Web crawlers capable of efficiently accessing Web content not accessible via static hyperlinks. Log files are maintained of communications between a Web browser and a Web server resulting from real user accesses to the content associated with dynamic hyperlinks. These log files represent past users' accesses to the content and are used to generate Web crawler accesses. This approach allows a crawler to accurately mimic real users, resulting in a capability of the crawler to automatically access all the content that real users would have access to.
31 Citations
10 Claims
1. A method of determining parameter combinations for automated web crawler access to World Wide Web content that is accessible based on parameters resulting from real user interactions with a World Wide Web site, said method comprising:
- maintaining at least one log file containing user queries resulting from previous real user HTML interactions with said World Wide Web, said user queries comprising entries;
- analyzing said log file to determine parameter combinations and to generate synthetic queries for input to said web crawler, said web crawler using said input for automated access to said World Wide Web content, said analyzing step further comprising:
  - ranking entries according to their frequency of occurrence;
  - for a set of entries resulting from unlimited text entries, excluding entries ranked below a predetermined number; and
  - wherein said synthetic queries are determined by producing combinations of entries from each set of entries.
Dependent claims: 2, 3, 4
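The analyzing step of claim 1 (rank entries by frequency, cut off the tail for unlimited text entries, then combine one entry from each set) can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `synthesize_queries`, the representation of a logged query as a dict of field name to value, and the parameters `free_text_fields` and `top_n` are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def synthesize_queries(logged_queries, free_text_fields, top_n):
    """Build synthetic queries from logged form submissions (illustrative only).

    logged_queries: list of dicts mapping field name -> submitted value.
    free_text_fields: names of fields treated as unlimited text entries.
    top_n: the "predetermined number" of top-ranked values kept for those fields.
    """
    # Count how often each value was submitted for each field.
    counts = {}
    for query in logged_queries:
        for field, value in query.items():
            counts.setdefault(field, Counter())[value] += 1

    # Rank values by frequency; truncate only the unlimited-text fields.
    kept = {}
    for field, counter in counts.items():
        ranked = [value for value, _ in counter.most_common()]
        kept[field] = ranked[:top_n] if field in free_text_fields else ranked

    # One synthetic query per combination of kept values, one value per field.
    fields = sorted(kept)
    return [dict(zip(fields, combo))
            for combo in product(*(kept[f] for f in fields))]
```

With a hypothetical log in which the free-text field `q` saw "java" three times and "perl" once, `top_n=1` keeps only "java" for `q`, while every value of a predefined-set field such as `lang` is retained and combined with it.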
5. A method of increasing web crawler penetration of Web databases accessible via HTML forms, said method comprising:
- reviewing previous real user form input data, said previous real user form input data maintained in a log file, said log file maintained in a proxy server;
- identifying possible HTML form input data for said Web crawler from said previous real user form input data by synthesis of entries for any of: predefined sets, limited text entries or unlimited text entries; and
- providing said identified form input data to said Web crawler during an instantiation of automated access to said Web databases by said Web crawler.
Dependent claims: 6, 7
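The reviewing step of claim 5 could look roughly like the sketch below, which collects the values real users submitted for each form field. The log format is an assumption: the claim does not specify one, so the sketch supposes each line of the proxy log is simply a requested URL whose query string carries the form input. The function name `form_inputs_from_log` and the `form_path` parameter are likewise hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def form_inputs_from_log(log_lines, form_path):
    """Collect submitted values per form field from logged request URLs.

    Assumes one URL per log line; real proxy logs carry more columns.
    """
    values = {}
    for line in log_lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        parts = urlsplit(url)
        if parts.path != form_path:
            continue  # request was not a submission of the form of interest
        # parse_qsl yields (field, value) pairs from the query string.
        for field, value in parse_qsl(parts.query):
            values.setdefault(field, []).append(value)
    return values
```

The per-field value lists it returns are the raw material from which entries for predefined sets, limited text entries, or unlimited text entries would then be synthesized.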
8. A method of emulating real user access to World Wide Web content dynamically accessible via an HTML form, said method comprising:
- maintaining a log containing real user entries into each input item of said HTML form;
- ranking entries for each input item according to their frequency of occurrence;
- for each unlimited text entry input item, excluding entries ranked below a predetermined number;
- determining combinations of entries from each set of entries; and
- emulating real user access to World Wide Web content dynamically accessible via an HTML form by automatically accessing said content using said combinations of entries as HTML input for a web crawler.
Dependent claims: 9, 10
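The final emulating step of claim 8 hands combinations of entries to the crawler as form input. One way this might be done, assuming the form submits via HTTP GET (the claim does not say which method is used), is to encode each combination as a query string appended to the form's action URL. The name `crawl_urls` and its parameters are hypothetical.

```python
from urllib.parse import urlencode


def crawl_urls(form_action, combinations):
    """Turn each field/value combination into a GET URL for the crawler.

    Assumes a GET form; a POST form would send the encoded pairs in the body.
    """
    return [f"{form_action}?{urlencode(combo)}" for combo in combinations]
```

The resulting URLs can be queued like any statically linked page, so the crawler itself needs no special handling for form-backed content.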
Specification