Enabling a web-crawling robot to collect information from web sites that tailor information content to the capabilities of accessing devices
Abstract
A web-crawling robot retrieves information from a web server that tailors information content to the capability of an accessing device. A link deriving unit in a proxy server for relaying data exchanged between the robot and the site analyzes a response from the site to the robot, and acquires information on a user agent corresponding to a particular kind of content of a link destination. On the basis of the information, a user agent information editing unit in the proxy server adds user agent information to the content retrieval request from the web-crawling robot to the site so as to disguise it as a content retrieval request issued from a given user agent, thereby acquiring a response corresponding to capabilities of the user agent.
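As an illustration only, the header rewriting described in the abstract can be sketched in Python. The function name `disguise_request` and the user agent strings below are hypothetical and do not come from the patent; the sketch assumes request headers are available as a plain dictionary.

```python
def disguise_request(headers: dict[str, str], target_user_agent: str) -> dict[str, str]:
    """Return a copy of the request headers with User-Agent set to the target.

    Header names are matched case-insensitively, so any existing
    User-Agent entry from the robot is dropped before the new one is added.
    """
    rewritten = {
        name: value
        for name, value in headers.items()
        if name.lower() != "user-agent"
    }
    rewritten["User-Agent"] = target_user_agent
    return rewritten


# Hypothetical request as issued by the robot.
robot_headers = {"Host": "example.com", "user-agent": "ExampleRobot/1.0"}

# The proxy forwards the request as if a phone browser had issued it.
forwarded = disguise_request(robot_headers, "ExamplePhoneBrowser/1.0")
```

The input dictionary is left unmodified, so the proxy can still log what the robot originally sent.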
16 Claims
1. A server, comprising:
a proxy function unit for relaying data exchanged between a web server site on a network and a web-crawling robot collecting contents by accessing the site;
a link deriving unit for expanding a link and acquiring information on a user agent corresponding to content of a link destination when said proxy function unit receives a response from the site to a content retrieval request issued from said web-crawling robot to said site and if a link destination of the link included in the response has dynamic content that differs according to a type of user agent which is an access source; and
a user agent information editing unit for converting user agent information included in the content retrieval request to user agent information corresponding to said content of the link destination when said proxy function unit receives the content retrieval request from said web-crawling robot issued based on derived links.
6. A server, comprising:
first transmitter-receiver means for exchanging data with a web-crawling robot;
editing means for disguising user agent information included in a request from said web-crawling robot to a web site, received by said first transmitter-receiver means, as a content retrieval request issued by a given user agent; and
second transmitter-receiver means for exchanging data with said site and sending user agent information disguised by said editing means to said destination site.
9. A server, comprising:
a proxy function unit for relaying data exchanged between a web site on a network and a web-crawling robot collecting information by accessing the web site;
a link deriving unit for replacing uniform resource locators (URLs) of a destination of a link included in a response, from the web site to a content retrieval request issued from said web-crawling robot, with substitute URLs individually corresponding to said user agent corresponding to a web content concerned when said proxy function unit receives a response from the web site and if the link destination of the link included in the response has web content that varies according to a type of user agent which is an access source;
a URL conversion unit for replacing the substitute URL with the original URL of the link destination when said proxy function unit receives a content retrieval request issued from said web-crawling robot to said substitute URL as a destination; and
a user agent information editing unit for converting user agent information in a hypertext transfer protocol header of the content retrieval request to said user agent information corresponding to the substitute URL when said proxy function unit receives the content retrieval request to said alternate URL as a destination.
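The substitute-URL mechanism of claim 9 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the proxy prefix `http://proxy.invalid/ua/`, the agent keys, the user agent strings, and both function names are hypothetical. The sketch encodes the agent key and the original URL into the substitute URL itself, which is only one of several ways the correspondence could be kept.

```python
from urllib.parse import quote, unquote

# Hypothetical prefix under which the proxy publishes substitute URLs.
PROXY_PREFIX = "http://proxy.invalid/ua/"

# Hypothetical mapping from a short agent key to a User-Agent string.
USER_AGENTS = {
    "pc": "ExampleDesktopBrowser/1.0",
    "phone": "ExamplePhoneBrowser/1.0",
}


def make_substitute_url(original_url: str, agent_key: str) -> str:
    """Build a substitute URL that corresponds to one user agent."""
    if agent_key not in USER_AGENTS:
        raise KeyError(f"unknown agent key: {agent_key}")
    # safe="" percent-encodes every reserved character, including "/".
    return f"{PROXY_PREFIX}{agent_key}/{quote(original_url, safe='')}"


def resolve_substitute_url(substitute_url: str) -> tuple[str, str]:
    """Recover the original URL and the User-Agent string to send with it."""
    if not substitute_url.startswith(PROXY_PREFIX):
        raise ValueError("not a substitute URL")
    agent_key, _, encoded = substitute_url[len(PROXY_PREFIX):].partition("/")
    return unquote(encoded), USER_AGENTS[agent_key]


# The link deriving step rewrites a link found in a response ...
substitute = make_substitute_url("http://example.com/news?id=1", "phone")

# ... and when the robot later requests it, the proxy restores the original.
original_url, user_agent = resolve_substitute_url(substitute)
```

Because each user agent gets its own substitute URL, a robot that deduplicates by URL still requests the same destination once per user agent.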
11. A method of collecting information from a web site on a network by using a computer connected thereto, comprising:
receiving a response from the site to a content retrieval request and, if a link destination of a link included in the received response has content that varies according to a type of user agent which is an access source, expanding the link and acquiring information on a user agent corresponding to the content of the link destination;
sending a content retrieval request disguised as a content retrieval request issued from said user agent to said link destination of the link included in said response on the basis of said user agent information; and
acquiring a response according to the type of user agent from said link destination.
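The three steps of claim 11 (expand the link, send a disguised request per user agent, acquire each response) can be sketched as below. The claim does not say how the computer learns that a destination varies by user agent, so the sketch takes that as an input flag. All names, the user agent strings, and the simulated site `fake_site` are hypothetical stand-ins for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ExpandedLink:
    url: str
    # None means "send the robot's own User-Agent unchanged".
    user_agent: Optional[str]


def expand_link(url: str, varies_by_agent: bool, agents: list[str]) -> list[ExpandedLink]:
    """Expand one link into one entry per user agent when content varies."""
    if not varies_by_agent:
        return [ExpandedLink(url, None)]
    return [ExpandedLink(url, agent) for agent in agents]


def collect(
    url: str,
    varies_by_agent: bool,
    agents: list[str],
    robot_agent: str,
    fetch: Callable[[str, str], str],
) -> dict[str, str]:
    """Fetch the destination once per expanded link, keyed by User-Agent sent."""
    results = {}
    for link in expand_link(url, varies_by_agent, agents):
        sent_agent = link.user_agent or robot_agent
        results[sent_agent] = fetch(link.url, sent_agent)
    return results


def fake_site(url: str, user_agent: str) -> str:
    """Simulated site that tailors its content to the requesting device."""
    return "compact page" if "Phone" in user_agent else "full page"


agents = ["ExampleDesktopBrowser/1.0", "ExamplePhoneBrowser/1.0"]
pages = collect("http://example.com/news", True, agents, "ExampleRobot/1.0", fake_site)
```

Passing the fetch function in as a parameter keeps the sketch free of network access; a real deployment would issue HTTP requests instead.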
13. A method for collecting information from a site on a network by using a computer connected thereto, comprising:
receiving a response from the site to a content retrieval request issued from a given web-crawling robot and, if a link destination of a link included in the received response has web content that varies according to a type of user agent which is an access source, replacing a URL which is the link destination of the link included in the response with a substitute URL individually corresponding to said user agent corresponding to the web content and sending the response to the web-crawling robot;
receiving the content retrieval request issued from said web-crawling robot to said substitute URL as a destination, replacing said substitute URL with the original link destination URL, converting user agent information at an HTTP header of the content retrieval request to information on said user agent corresponding to the substitute URL, and sending the content retrieval request to the link destination; and
receiving a response from said link destination to said content retrieval request whose user agent information was converted, adding identification information of the user agent to the response, and sending the response to said web-crawling robot.
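The final step of claim 13, adding identification information of the user agent to the response before returning it to the robot, might look like the sketch below. The claim does not specify where the identification is placed; carrying it in a response header named `X-Retrieved-As-User-Agent` is purely an assumption made for this example, as is the function name.

```python
# Hypothetical header name; the claim does not prescribe one.
ID_HEADER = "X-Retrieved-As-User-Agent"


def tag_response(headers: dict[str, str], user_agent: str) -> dict[str, str]:
    """Return a copy of the response headers that records the user agent used."""
    tagged = dict(headers)
    tagged[ID_HEADER] = user_agent
    return tagged


site_headers = {"Content-Type": "text/html"}
tagged_headers = tag_response(site_headers, "ExamplePhoneBrowser/1.0")
```

With the tag in place, the robot can index the same original URL several times and still tell which device variant each copy represents.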
14. A program product for controlling a computer connected to a network, the program enabling the computer to serve as:
transmitter-receiver means for relaying data exchanged between a site on said network and a web-crawling robot;
link expansion means for receiving a response from the site to a content retrieval request issued from said web-crawling robot by using said transmitter-receiver means and, if a link destination of a link included in the response has content that varies according to a type of user agent which is an access source, expanding the link and acquiring information on the user agent corresponding to the content of the link destination; and
user agent information editing means for converting user agent information included in the content retrieval request to said user agent information corresponding to said content of the link destination when said transmitter-receiver means receives the content retrieval request from said web-crawling robot issued, based on said expanded link.
15. A program product for controlling a computer connected to a network, the program enabling the computer to serve as:
transmitter-receiver means for relaying data exchanged between a web site on the network and a web-crawling robot collecting information by accessing the web site;
link expansion means for receiving a response from the web site to a content retrieval request issued from said web-crawling robot site by using said transmitter-receiver means and, if a link destination of a link included in the response has web content that varies according to a type of user agent which is an access source, replacing a URL of the link destination of the link included in the response with a substitute URL individually corresponding to said user agent corresponding to the web content;
URL conversion means for replacing the substitute URL with the original link destination URL when said transmitter-receiver means receives the content retrieval request issued from said web-crawling robot to said substitute URL as a destination; and
user agent information editing means for converting user agent information in an HTTP header of the content retrieval request to information on said user agent corresponding to the substitute URL when said transmitter-receiver means receives the content retrieval request issued from said web-crawling robot to said substitute URL as a destination.
Specification