Method and apparatus for generating time-series data from web pages

  • US 7,526,462 B2
  • Filed: 03/16/2006
  • Issued: 04/28/2009
  • Est. Priority Date: 05/26/2005
  • Status: Expired due to Fees
First Claim

1. A computer-implemented method of generating time-series data from Web pages, comprising:

  • collecting Web pages, which match a user's designated collection condition, from a plurality of Web sites, the collecting including storing the collected Web pages in a storage device;

    dividing a set of Web pages stored in the storage device into a plurality of clusters, based on URL information of the Web pages;

    extracting a date expression from Web pages included in each of the clusters;

    determining a typical date expression form for each of the clusters, based on the extracted date expression;

    dividing the Web pages included in each of the clusters into a plurality of items with reference to a location where a date expression of the date expression form appears, based on the date expression form;

    generating time-series data for each of the clusters by sorting the items for each of the clusters in order of time, based on date expressions corresponding to the items;

    providing the Web pages stored in the storage device with URL features, which represent features of URL information of the Web pages, as features of the Web pages, the providing including extracting the URL features by dividing the URL information; and

    generating feature vectors, which represent the URL features provided for the Web pages, as URL feature vectors, based on the URL features, wherein:

    the set of Web pages corresponding to the URL feature vector of each of the Web pages is divided into a plurality of clusters, based on the URL feature vector;

    the URL features each include part of URL information from which each of the URL features is extracted, as an attribute;

    the URL feature vectors have all attributes of the URL features in common without any redundancy; and

    the providing includes:

    dividing the URL information into a plurality of divided character strings by a plurality of delimiters; and

    determining each of the divided character strings as one of an attribute and an attribute value of each of the URL features, the determining including setting each of the divided character strings to one of an attribute having presence or absence of a character string as an attribute value, an attribute having a divided character string subsequent to one of the delimiters as an attribute value, and an attribute value having a divided character string precedent to one of the delimiters as an attribute, in accordance with types of the delimiters, and wherein:

    the delimiters are classified into two types of a first delimiter and a second delimiter;

    the dividing the URL information includes dividing the URL information into a plurality of divided character strings by the first delimiter, and dividing some of the divided character strings, which include the second delimiter, into a pair of divided character strings by the second delimiter; and

    the determining each of the divided character strings includes determining the divided character strings, which are obtained by the first delimiter and exclude the second delimiter, as an attribute, determining presence or absence of the character strings as an attribute value of the attribute, determining one of the divided character strings obtained by the second delimiter, which is precedent to the second delimiter, as an attribute, and determining other of the divided character strings, which is subsequent to the second delimiter, as an attribute value.
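
The URL feature steps recited in the claim (dividing by a first delimiter, dividing again by a second delimiter, then building vectors that share one non-redundant attribute set) can be illustrated with a minimal Python sketch. The claim does not name concrete delimiters, so treating "/", "?" and "&" as the first delimiter type and "=" as the second is an assumption made here for illustration. The function names, the scheme stripping, and the example URLs are likewise hypothetical and are not taken from the patent.

```python
import re
from typing import Dict, List, Tuple, Union

Value = Union[str, bool]

# Assumed delimiter choices; the claim only distinguishes two delimiter types.
FIRST_DELIMITERS = r"[/?&]"   # first type: splits the URL into character strings
SECOND_DELIMITER = "="        # second type: splits one string into a pair


def url_features(url: str) -> Dict[str, Value]:
    """Map one URL to attribute -> attribute value pairs."""
    # Drop the scheme ("http://") so it does not become a feature (assumption).
    rest = url.split("://", 1)[-1]
    features: Dict[str, Value] = {}
    for piece in re.split(FIRST_DELIMITERS, rest):
        if not piece:
            continue  # skip empty strings produced by adjacent delimiters
        if SECOND_DELIMITER in piece:
            # String before the second delimiter is the attribute,
            # string after it is the attribute value.
            attribute, value = piece.split(SECOND_DELIMITER, 1)
            features[attribute] = value
        else:
            # No second delimiter: the string itself is the attribute and
            # its presence (True) or absence (False) is the value.
            features[piece] = True
    return features


def url_feature_vectors(urls: List[str]) -> Tuple[List[str], List[List[Value]]]:
    """Build vectors over one shared attribute list without duplicates."""
    per_url = [url_features(u) for u in urls]
    attributes = sorted({a for f in per_url for a in f})
    # Attributes missing from a URL are recorded as absent (False).
    vectors = [[f.get(a, False) for a in attributes] for f in per_url]
    return attributes, vectors


if __name__ == "__main__":
    attrs, vecs = url_feature_vectors([
        "http://example.com/news/item.cgi?id=12&lang=en",
        "http://example.com/blog/2006/03",
    ])
    print(attrs)
    for v in vecs:
        print(v)
```

In this sketch every vector is indexed by the same sorted attribute list, so two pages can be compared position by position before clustering. The clustering method itself is not specified in the claim and is left out here.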
