Automated access to web content based on log analysis

US 7,483,910 B2
Filed: 01/11/2002
Issued: 01/27/2009
Est. Priority Date: 01/11/2002
Status: Expired due to Fees

1 Assignment

0 Petitions

Abstract

The present invention provides a manner for providing Web crawlers capable of efficiently accessing Web content not accessible via static hyperlinks. Log files are maintained of communications between a Web browser and a Web server resulting from real user accesses to the content associated with dynamic hyperlinks. These log files represent past user's accesses to the content and are used to generate Web crawler accesses. This approach allows a crawler to accurately mimic real users, resulting in a capability of the crawler to automatically access all the content that real users would have access to.

31 Citations

10 Claims

1. A method of determining parameter combinations for automated web crawler access to World Wide Web content that is accessible based on parameters resulting from real user interactions with a World Wide Web site, said method comprising:
- maintaining at least one log file containing user queries resulting from previous real user HTML interactions with said World Wide Web, said user queries comprising entries;
- analyzing said log file to determine parameter combinations and to generate synthetic queries for input to said web crawler, said web crawler using said input for automated access to said World Wide Web content, said analyzing step further comprising:
- ranking entries according to their frequency of occurrence;
- for a set of entries resulting from unlimited text entries, excluding entries ranked below a predetermined number; and
- wherein said synthetic queries are determined by producing combinations of entries from each set of entries.

2. A method of determining parameter combinations for automated access to World Wide Web content that is accessible based on parameters resulting from real user interactions with a World Wide Web site, as per claim 1, wherein said synthetic queries are determined by producing all combinations of entries from each set of entries.

3. A method of determining parameter combinations for automated access to World Wide Web content that is accessible based on parameters resulting from real user interactions with a World Wide Web site, as per claim 1, wherein entries resulting from limited text entries and unlimited text entries have stop words removed and remaining words stemmed.

4. A method of determining parameter combinations for automated access to World Wide Web content that is accessible based on parameters resulting from real user interactions with a World Wide Web site, as per claim 1, wherein said log file is maintained by a proxy server that logs communications between a client and a Web server resulting from real user accesses to said World Wide Web content.
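
The ranking, cutoff, and combination steps recited in claims 1 and 2 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the `logged` dictionary layout, and the `top_n` parameter are all assumptions made for illustration, and "ranked below a predetermined number" is read here as "outside the top N most frequent entries".

```python
from collections import Counter
from itertools import product


def rank_entries(entries):
    """Return the distinct entries ordered from most to least frequent."""
    return [value for value, _ in Counter(entries).most_common()]


def synthesize_queries(logged, unlimited_fields, top_n):
    """Combine ranked entries from every field into synthetic queries.

    logged: hypothetical dict mapping a form field name to its logged entries.
    unlimited_fields: names of free-text fields, which get the top-N cutoff.
    """
    fields = sorted(logged)
    ranked = {}
    for field in fields:
        values = rank_entries(logged[field])
        if field in unlimited_fields:
            values = values[:top_n]  # drop entries ranked below the cutoff
        ranked[field] = values
    # Claim 2 variant: every combination of one entry per field.
    combos = product(*(ranked[field] for field in fields))
    return [dict(zip(fields, combo)) for combo in combos]


logged = {
    "category": ["books", "music", "books"],
    "q": ["java", "python", "java", "rust", "java", "python"],
}
queries = synthesize_queries(logged, unlimited_fields={"q"}, top_n=2)
```

With this sample data the rarely used free-text entry `rust` falls outside the cutoff, so the crawler input is limited to the four combinations of the two categories and the two most frequent search terms.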

5. A method of increasing web crawler penetration of Web databases accessible via HTML forms, said method comprising:
- reviewing previous real user form input data, said previous real user form input data maintained in a log file, said log file maintained in a proxy server;
- identifying possible HTML form input data for said Web crawler from said previous real user form input data by synthesis of entries for any of: predefined sets, limited text entries or unlimited text entries; and
- providing said identified form input data to said Web crawler during an instantiation of automated access to said Web databases by said Web crawler.

6. A method of increasing web crawler penetration of Web databases accessible via HTML forms, as per claim 5, wherein said synthesis comprises:
- ranking any entries for predetermined sets;
- ranking any entries for limited text entries;
- ranking any entries for unlimited text entries;
- excluding entries for unlimited text entries ranked below a predetermined number; and
- pairing entries from each set of ranked entries.

7. A method of increasing web crawler penetration of Web databases accessible via HTML forms, as per claim 6, wherein said synthesis further comprises:
- removing stop words and stemming remaining words for entries resulting from limited text entries and unlimited text entries.
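
The stop-word removal and stemming step of claim 7 can be sketched as a normalization pass applied to each logged text entry before ranking. The claims do not name a stop-word list or a stemming algorithm, so everything below is an assumption: `STOP_WORDS` is a tiny illustrative list, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Illustrative stop-word list; a real deployment would use a fuller one.
STOP_WORDS = frozenset({"a", "an", "and", "for", "in", "of", "on", "the", "to"})
SUFFIXES = ("ing", "es", "s")  # checked in this order


def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize_entry(text):
    """Lowercase an entry, drop stop words, and stem the remaining words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(naive_stem(w) for w in words if w not in STOP_WORDS)


first = normalize_entry("searching for books")
second = normalize_entry("Searching Books")
```

Both sample entries normalize to the same string, so they would be counted as one entry when frequencies are ranked, which is the practical reason for normalizing before ranking.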

8. A method of emulating real user access to World Wide Web content dynamically accessible via an HTML form, said method comprising:
- maintaining a log containing real user entries into each input item of said HTML form;
- ranking entries for each input item according to their frequency of occurrence;
- for each unlimited text entry input item, excluding entries ranked below a predetermined number;
- determining combinations of entries from each set of entries; and
- emulating real user access to World Wide Web content dynamically accessible via an HTML form by automatically accessing said content using said combinations of entries as HTML input for a webcrawler.

9. A method of emulating real user access to World Wide Web content dynamically accessible via an HTML form, as per claim 8, wherein entries resulting from limited text entries and unlimited text entries have stop words removed and remaining words stemmed.

10. A method of emulating real user access to World Wide Web content dynamically accessible via an HTML form, as per claim 8, wherein said log file is maintained by a proxy server that logs communications between a client and a Web server resulting from real user accesses to said World Wide Web content.
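
The two ends of the pipeline in claims 8 and 10, collecting per-field entries from a proxy log and turning a chosen combination back into a crawler request, might look like the following Python sketch. It assumes the form is submitted with GET so that entries appear in the logged URL's query string; the log layout, the `example.com` URLs, and both function names are hypothetical. Forms submitted with POST would need the request body logged instead.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit


def entries_by_field(logged_urls):
    """Group logged query-string values by form field name."""
    grouped = defaultdict(list)
    for url in logged_urls:
        for field, value in parse_qsl(urlsplit(url).query):
            grouped[field].append(value)
    return dict(grouped)


def crawler_url(action_url, combination):
    """Encode one combination of entries as a GET request URL."""
    return action_url + "?" + urlencode(combination)


# Hypothetical proxy log: one requested URL per real user form submission.
log = [
    "http://example.com/search?q=web+crawler&cat=books",
    "http://example.com/search?q=java&cat=books",
]
fields = entries_by_field(log)
url = crawler_url("http://example.com/search", {"q": "web crawler", "cat": "books"})
```

The grouped entries are what the ranking and cutoff steps would consume, and `crawler_url` shows how a selected combination could be replayed against the same form target.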

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Myllymaki, Jussi Petri; Beyer, Kevin Scott
Primary Examiner(s): AL HASHEMI, SANA A
Application Number: US10/042,367
Publication Number: US 20030135487A1
Time in Patent Office: 2,573 Days
Field of Search: 707/3, 707/4, 707/5, 707/10, 707/100-102, 705/10
US Class Current: 1/1
CPC Class Codes:
G06F 16/951   Indexing; Web crawling tech...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99942   Manipulating data structure...
Y10S 707/99943   Generating database or data...
