Method and apparatus for generating time-series data from web pages
Abstract
According to one embodiment, Web pages that match a user's designated collection condition are collected from a plurality of Web sites. The collected Web pages are divided into a plurality of clusters, based on URL information of the Web pages. A date expression is extracted from Web pages included in each of the clusters. A typical date expression form is determined for each of the clusters, based on the extracted date expression. The Web pages included in each of the clusters are divided into a plurality of items, based on the date expression form. The items are sorted for each of the clusters in order of time, based on date expressions corresponding to the items. Time-series data is generated for each of the clusters by sorting the items.
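The date-related steps of the abstract (choosing a typical date expression form for a cluster, cutting pages into items at each date, and sorting the items by time) can be illustrated with a minimal Python sketch. The date forms in `DATE_FORMS`, the function names, and the rule that an item runs from one date expression to the next are all assumptions made for illustration; the source does not specify them.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical date-expression forms; the source does not enumerate any.
DATE_FORMS = {
    "YYYY-MM-DD": re.compile(r"\b(?P<y>\d{4})-(?P<m>\d{1,2})-(?P<d>\d{1,2})\b"),
    "YYYY/MM/DD": re.compile(r"\b(?P<y>\d{4})/(?P<m>\d{1,2})/(?P<d>\d{1,2})\b"),
    "MM/DD/YYYY": re.compile(r"\b(?P<m>\d{1,2})/(?P<d>\d{1,2})/(?P<y>\d{4})\b"),
}


def typical_form(pages):
    """Return the name of the form that matches most often in a cluster."""
    counts = Counter({name: 0 for name in DATE_FORMS})
    for text in pages:
        for name, pattern in DATE_FORMS.items():
            counts[name] += len(pattern.findall(text))
    name, hits = counts.most_common(1)[0]
    return name if hits else None


def split_items(text, form):
    """Cut a page into (date, text) items, one per date of the given form.

    Assumption: an item starts at a date expression and ends just before
    the next one; text before the first date is dropped.
    """
    matches = list(DATE_FORMS[form].finditer(text))
    items = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        try:
            when = date(int(m["y"]), int(m["m"]), int(m["d"]))
        except ValueError:
            continue  # looks like a date but is not a valid calendar day
        items.append((when, text[m.start():end].strip()))
    return items


def time_series(pages):
    """Build the time-ordered item list for one cluster of page texts."""
    form = typical_form(pages)
    if form is None:
        return []
    items = [item for text in pages for item in split_items(text, form)]
    return sorted(items, key=lambda item: item[0])
```

Calling `time_series` on the page texts of one cluster returns that cluster's items in chronological order; dates written in a form other than the cluster's typical one are not used as item boundaries.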
10 Claims
1. A computer-implemented method of generating time-series data from Web pages, comprising:
collecting Web pages, which match a user's designated collection condition, from a plurality of Web sites, the collecting including storing the collected Web pages in a storage device;
dividing a set of Web pages stored in the storage device into a plurality of clusters, based on URL information of the Web pages;
extracting a date expression from Web pages included in each of the clusters;
determining a typical date expression form for each of the clusters, based on the extracted date expression;
dividing the Web pages included in each of the clusters into a plurality of items with reference to a location where a date expression of the date expression form appears, based on the date expression form;
generating time-series data for each of the clusters by sorting the items for each of the clusters in order of time, based on date expressions corresponding to the items;
providing the Web pages stored in the storage device with URL features, which represent features of URL information of the Web pages, as features of the Web pages, the providing including extracting the URL features by dividing the URL information; and
generating feature vectors, which represent the URL features provided for the Web pages, as URL feature vectors, based on the URL features, wherein:
the set of Web pages corresponding to the URL feature vector of each of the Web pages is divided into a plurality of clusters, based on the URL feature vector;
the URL features each include part of URL information from which each of the URL features is extracted, as an attribute;
the URL feature vectors have all attributes of the URL features in common without any redundancy; and
the providing includes:
dividing the URL information into a plurality of divided character strings by a plurality of delimiters; and
determining each of the divided character strings as one of an attribute and an attribute value of each of the URL features, the determining including setting each of the divided character strings to one of an attribute having presence or absence of a character string as an attribute value, an attribute having a divided character string subsequent to one of the delimiters as an attribute value, and an attribute value having a divided character string precedent to one of the delimiters as an attribute, in accordance with types of the delimiters, and wherein:
the delimiters are classified into two types of a first delimiter and a second delimiter;
the dividing the URL information includes dividing the URL information into a plurality of divided character strings by the first delimiter, and dividing some of the divided character strings, which include the second delimiter, into a pair of divided character strings by the second delimiter; and
the determining each of the divided character strings includes determining the divided character strings, which are obtained by the first delimiter and exclude the second delimiter, as an attribute, determining presence or absence of the character strings as an attribute value of the attribute, determining one of the divided character strings obtained by the second delimiter, which is precedent to the second delimiter, as an attribute, and determining other of the divided character strings, which is subsequent to the second delimiter, as an attribute value.
Dependent claims: 2, 3, 4, 5, 6.
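The two-delimiter URL feature extraction recited in claim 1 can be sketched in Python as follows. The claim does not name the delimiters, so treating `/`, `?` and `&` as first delimiters and `=` as the second delimiter is an assumption, as are the function name `url_features` and the use of `True` to encode "presence" as an attribute value.

```python
import re

# Assumption: the claim does not say which characters are delimiters.
FIRST_DELIMITER = re.compile(r"[/?&]")  # splits the URL into character strings
SECOND_DELIMITER = "="                  # splits one string into attribute/value


def url_features(url):
    """Return a dict mapping each URL attribute to its attribute value."""
    features = {}
    for token in FIRST_DELIMITER.split(url):
        if not token:
            continue  # empty strings, e.g. between the two slashes of "//"
        if SECOND_DELIMITER in token:
            # String before the second delimiter is the attribute,
            # string after it is the attribute value.
            attribute, value = token.split(SECOND_DELIMITER, 1)
            features[attribute] = value
        else:
            # The string itself is the attribute; its value is "present".
            features[token] = True
    return features


example = url_features("http://example.com/news/list.php?id=12&page=3")
```

In this sketch the scheme token (`http:`) is kept as an ordinary presence attribute; a real system might choose to drop it before extracting features.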
7. An apparatus that generates time-series data from Web pages, comprising:
a processor;
a memory coupled to the processor storing instructions causing the processor to implement the following functionalities:
a user interface which receives a collection condition from a user;
a collection unit to collect Web pages, which match the collection condition received by the user interface, from a plurality of Web sites;
a data storage which stores a set of Web pages collected by the collection unit;
a data dividing unit to divide the set of Web pages stored in the data storage into a plurality of clusters, based on URL information of the Web pages;
a determining unit to extract a date expression from Web pages included in each of the clusters and determine a typical date expression form for each of the clusters, based on the extracted date expression;
an item dividing unit to divide the Web pages included in each of the clusters into a plurality of items with reference to a location where a date expression of the date expression form appears, based on the date expression form determined by the determining unit;
a time-series data generating unit to generate time-series data for each of the clusters by sorting the items, which are obtained by the item dividing unit, for each of the clusters in order of time, based on date expressions corresponding to the items;
a providing unit to provide the Web pages stored in the storage device with URL features, which represent features of URL information of the Web pages, as features of the Web pages, the providing unit including an extracting module to extract the URL features by dividing the URL information; and
a generating unit to generate feature vectors, which represent the URL features provided for the Web pages, as URL feature vectors, based on the URL features, wherein:
the data dividing unit is to divide the set of Web pages corresponding to the URL feature vector of each of the Web pages into a plurality of clusters, based on the URL feature vector;
the URL features each include part of URL information from which each of the URL features is extracted, as an attribute;
the URL feature vectors have all attributes of the URL features in common without any redundancy; and
the providing unit includes:
a dividing module to divide the URL information into a plurality of divided character strings by a plurality of delimiters; and
a determining module to determine each of the divided character strings as one of an attribute and an attribute value of each of the URL features, thereby to set each of the divided character strings to one of an attribute having presence or absence of a character string as an attribute value, an attribute having a divided character string subsequent to one of the delimiters as an attribute value, and an attribute value having a divided character string precedent to one of the delimiters as an attribute, in accordance with types of the delimiters, and wherein:
the delimiters are classified into two types of a first delimiter and a second delimiter;
the dividing module is to divide the URL information into a plurality of divided character strings by the first delimiter, thereby to divide some of the divided character strings, which include the second delimiter, into a pair of divided character strings by the second delimiter; and
the determining module is to determine the divided character strings, which are obtained by the first delimiter and exclude the second delimiter, as an attribute, thereby to determine presence or absence of the character strings as an attribute value of the attribute, the determining module being to determine one of the divided character strings obtained by the second delimiter, which is precedent to the second delimiter, as an attribute, thereby to determine other of the divided character strings, which is subsequent to the second delimiter, as an attribute value.
Dependent claims: 8, 9, 10.
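The generating unit and data dividing unit of claim 7 can be pictured with a small Python sketch: every page's URL features are laid out over one shared, duplicate-free attribute list, and pages are then grouped by vector. The claims do not name a clustering algorithm, so the grouping rule below (pages fall into the same cluster when the same attributes are present, regardless of value) is a deliberately simple stand-in, and the names `feature_vectors` and `cluster_by_shape` are hypothetical.

```python
from collections import defaultdict


def feature_vectors(page_features):
    """Lay out every page's features over one shared attribute list.

    page_features maps a URL to its {attribute: value} dict. The shared
    list holds each attribute exactly once (no redundancy); a page that
    lacks an attribute gets None in that position.
    """
    attributes = sorted({a for feats in page_features.values() for a in feats})
    vectors = {
        url: [feats.get(a) for a in attributes]
        for url, feats in page_features.items()
    }
    return attributes, vectors


def cluster_by_shape(vectors):
    """Group URLs whose vectors have the same attributes present.

    Stand-in for the unspecified clustering step: values such as id=1
    and id=2 are ignored, only presence or absence is compared.
    """
    clusters = defaultdict(list)
    for url, vector in vectors.items():
        shape = tuple(value is not None for value in vector)
        clusters[shape].append(url)
    return list(clusters.values())


page_features = {
    "http://example.com/news?id=1": {"example.com": True, "news": True, "id": "1"},
    "http://example.com/news?id=2": {"example.com": True, "news": True, "id": "2"},
    "http://example.com/blog": {"example.com": True, "blog": True},
}
attributes, vectors = feature_vectors(page_features)
clusters = cluster_by_shape(vectors)
```

With this rule the two `news` pages share a cluster because they differ only in the value of `id`, while the `blog` page forms its own cluster. A distance-based method over the same vectors would be an equally plausible reading of the claim.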
Specification