
Training set construction for taxonomic classification

  • US 8,122,005 B1
  • Filed: 10/22/2009
  • Issued: 02/21/2012
  • Est. Priority Date: 10/22/2009
  • Status: Active Grant
First Claim

1. A computer system including instructions stored on a computer-readable medium, the computer system comprising:

  • a training set generator configured to input a taxonomy including a hierarchy of categories and a plurality of top-level sites, and to output a training set of categorized data, the training set generator including:

    a crawler configured to crawl each of the top-level sites to determine at least one lower-level site associated therewith and to store the top-level sites and associated lower-level sites as crawl data;

    an extractor configured to determine, for each of the top-level sites, a corresponding site-specific extraction template associating at least one portion of the corresponding top-level site with at least one category of the hierarchy of categories, and further configured to apply each site-specific extraction template to corresponding crawl data to thereby associate the crawl data with the categories of the hierarchical categories and obtain categorized data of the training set,

    wherein the extractor includes at least one site-specific extractor that is configured to apply the site-specific extraction template to the crawl data associated with the corresponding top-level site, including extracting instances of the categories from the crawl data and labeling the extracted instances using the categories within the training set, and

    wherein the training set generator is configured to periodically replace the training set with a modified training set, including re-crawling the top-level sites and re-applying the site-specific extraction template to the crawl data associated with the corresponding top-level site, and extracting new instances of the categories from the crawl data and labeling the new instances using the categories within the modified training set.
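The elements recited in claim 1 can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the web is replaced by an in-memory dictionary, the taxonomy is a flat set of category paths, and each site-specific template is a hand-written mapping from a URL prefix to a category (the claim has the extractor determine the template, which is not modeled here). All names (`ExtractionTemplate`, `crawl`, `extract`, `TrainingSetGenerator`, `refresh`, `news.example`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

# Assumption: a stand-in for the web, mapping each top-level site to its
# lower-level pages ({page URL: page text}). A real crawler would fetch these.
FakeWeb = Dict[str, Dict[str, str]]
CrawlData = Dict[str, Dict[str, str]]
TrainingSet = List[Tuple[str, str]]  # (instance text, category label)


@dataclass
class ExtractionTemplate:
    """Hypothetical site-specific template: URL prefix -> taxonomy category."""
    site: str
    prefix_to_category: Dict[str, str]


def crawl(web: FakeWeb, top_level_sites: List[str]) -> CrawlData:
    """Store each top-level site with its associated lower-level pages."""
    return {site: dict(web.get(site, {})) for site in top_level_sites}


def extract(crawl_data: CrawlData,
            templates: Dict[str, ExtractionTemplate],
            taxonomy: Set[str]) -> TrainingSet:
    """Apply each site's own template to that site's crawl data and label it."""
    labeled: TrainingSet = []
    for site, pages in crawl_data.items():
        template = templates[site]
        for url, text in sorted(pages.items()):
            for prefix, category in template.prefix_to_category.items():
                # Only label with categories that exist in the taxonomy.
                if url.startswith(prefix) and category in taxonomy:
                    labeled.append((text, category))
                    break
    return labeled


class TrainingSetGenerator:
    """Takes a taxonomy and top-level sites; produces labeled training data."""

    def __init__(self, taxonomy: Set[str], top_level_sites: List[str],
                 templates: Dict[str, ExtractionTemplate]) -> None:
        self.taxonomy = taxonomy
        self.top_level_sites = top_level_sites
        self.templates = templates
        self.training_set: TrainingSet = []

    def refresh(self, web: FakeWeb) -> TrainingSet:
        """Re-crawl and re-apply the templates, replacing the old training set."""
        crawl_data = crawl(web, self.top_level_sites)
        self.training_set = extract(crawl_data, self.templates, self.taxonomy)
        return self.training_set


# Illustrative data only; the site and its pages are made up.
taxonomy = {"News/Sports", "News/Technology"}
web: FakeWeb = {
    "news.example": {
        "news.example/sports/match-report": "Late goal wins the cup final",
        "news.example/tech/chip-launch": "New processor doubles battery life",
    }
}
templates = {
    "news.example": ExtractionTemplate(
        site="news.example",
        prefix_to_category={
            "news.example/sports/": "News/Sports",
            "news.example/tech/": "News/Technology",
        },
    )
}
generator = TrainingSetGenerator(taxonomy, ["news.example"], templates)
first_training_set = generator.refresh(web)
```

Calling `refresh` again after the site's content changes discards the earlier labeled instances and rebuilds the set from the fresh crawl, which corresponds to the periodic replacement recited in the final wherein clause. In practice the call would be driven by a scheduler, which is left out of the sketch.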

  • 2 Assignments