Training set construction for taxonomic classification
Abstract
A training set generator may be configured to input a taxonomy including a hierarchy of categories and a plurality of top-level sites, and to output a training set of categorized data. The training set generator may include a crawler configured to crawl each of the top-level sites to determine at least one lower-level site associated therewith and to store the top-level sites and associated lower-level sites as crawl data. The training set generator also may include an extractor configured to determine, for each of the top-level sites, a corresponding site-specific extraction template associating at least one portion of the corresponding top-level site with at least one category of the hierarchy of categories, and further configured to apply each site-specific extraction template to corresponding crawl data to thereby associate the crawl data with the categories of the hierarchical categories and obtain categorized data of the training set.
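As a rough illustration of the flow the abstract describes, the sketch below models crawl data as an in-memory mapping from each top-level site to its lower-level pages, and represents a site-specific extraction template as a mapping from URL path prefixes to taxonomy categories. These representations, the names `ExtractionTemplate` and `apply_template`, and the example site `shop.example` are assumptions made for illustration only; the text above does not prescribe any of them, and a real crawler would fetch pages over the network.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical taxonomy: category -> parent category (None for a root).
TAXONOMY = {
    "Electronics": None,
    "Cameras": "Electronics",
    "Laptops": "Electronics",
}


@dataclass
class ExtractionTemplate:
    """Assumed template form: URL path prefix -> category of the taxonomy."""
    site: str
    prefix_to_category: dict

    def category_for(self, url):
        path = urlparse(url).path
        # Prefer the longest matching prefix so deeper sections win.
        matches = [p for p in self.prefix_to_category if path.startswith(p)]
        if not matches:
            return None
        return self.prefix_to_category[max(matches, key=len)]


def apply_template(template, crawl_data):
    """Label each crawled page of the template's site with a category."""
    labeled = []
    for url, text in crawl_data.get(template.site, {}).items():
        category = template.category_for(url)
        # Pages that match no template portion are left out of the set.
        if category in TAXONOMY:
            labeled.append((text, category))
    return labeled


# Stand-in for crawler output: top-level site -> {lower-level URL: page text}.
CRAWL_DATA = {
    "shop.example": {
        "https://shop.example/cameras/dslr-100": "24MP DSLR camera body",
        "https://shop.example/laptops/ultra-13": "13-inch ultrabook",
        "https://shop.example/about": "About our store",
    }
}

TEMPLATE = ExtractionTemplate(
    site="shop.example",
    prefix_to_category={"/cameras/": "Cameras", "/laptops/": "Laptops"},
)

TRAINING_SET = apply_template(TEMPLATE, CRAWL_DATA)
```

Here the "portion" of a site that a template associates with a category is reduced to a URL path prefix; a template could just as well key on page regions or breadcrumb markup, which this sketch does not attempt.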
18 Claims
1. A computer system including instructions stored on a computer-readable medium, the computer system comprising:
a training set generator configured to input a taxonomy including a hierarchy of categories and a plurality of top-level sites, and to output a training set of categorized data, the training set generator including:
a crawler configured to crawl each of the top-level sites to determine at least one lower-level site associated therewith and to store the top-level sites and associated lower-level sites as crawl data;
an extractor configured to determine, for each of the top-level sites, a corresponding site-specific extraction template associating at least one portion of the corresponding top-level site with at least one category of the hierarchy of categories, and further configured to apply each site-specific extraction template to corresponding crawl data to thereby associate the crawl data with the categories of the hierarchical categories and obtain categorized data of the training set,
wherein the extractor includes at least one site-specific extractor that is configured to apply the site-specific extraction template to the crawl data associated with the corresponding top-level site, including extracting instances of the categories from the crawl data and labeling the extracted instances using the categories within the training set, and
wherein the training set generator is configured to periodically replace the training set with a modified training set, including re-crawling the top-level sites and re-applying the site-specific extraction template to the crawl data associated with the corresponding top-level site, and extracting new instances of the categories from the crawl data and labeling the new instances using the categories within the modified training set.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A computer system including instructions stored on a computer-readable medium, the computer system comprising:
a training set generator configured to input a taxonomy including a hierarchy of categories and a plurality of top-level sites, and to output a training set of categorized data, the training set generator including:
a crawler configured to crawl each of the top-level sites to determine at least one lower-level site associated therewith and to store the top-level sites and associated lower-level sites as crawl data;
an extractor configured to determine, for each of the top-level sites, a corresponding site-specific extraction template associating at least one portion of the corresponding top-level site with at least one category of the hierarchy of categories, and further configured to apply each site-specific extraction template to corresponding crawl data to thereby associate the crawl data with the categories of the hierarchical categories and obtain categorized data of the training set,
wherein the training set generator is configured to determine a new category of the hierarchy of categories, based on the crawl data, and configured to augment the taxonomy by adding the new category thereto.
12. A computer-implemented method comprising:
determining a taxonomy including a hierarchy of categories;
determining a plurality of top-level sites related to the taxonomy;
determining, for each of the top-level sites, a corresponding site-specific extraction template associating at least one portion of the corresponding top-level site with at least one category of the hierarchy of categories;
crawling each of the top-level sites to determine at least one lower-level site associated therewith;
storing the top-level sites and associated lower-level sites as crawl data;
applying each site-specific extraction template to corresponding crawl data to thereby associate the crawl data with the categories of the hierarchical categories and obtain categorized data, including applying the site-specific extraction template to the crawl data associated with the corresponding top-level site, and further including extracting instances of the categories from the crawl data and labeling the extracted instances using the categories within the training set, and
replacing, periodically, the training set with a modified training set, including re-crawling the top-level sites and re-applying the site-specific extraction template to the crawl data associated with the corresponding top-level site, and further including extracting new instances of the categories from the crawl data and labeling the new instances using the categories within the modified training set.
Dependent claims: 13, 14, 15
16. A computer program product, the computer program product being tangibly embodied on a computer-readable medium and including executable code that, when executed, is configured to cause a data processing apparatus to:
determine a taxonomy including a hierarchy of categories;
determine a plurality of top-level sites related to the taxonomy;
determine, for each of the top-level sites, a corresponding site-specific extraction template associating at least one portion of the corresponding top-level site with at least one category of the hierarchy of categories;
crawl each of the top-level sites to determine at least one lower-level site associated therewith;
store the top-level sites and associated lower-level sites as crawl data; and
apply each site-specific extraction template to corresponding crawl data to thereby associate the crawl data with the categories of the hierarchical categories and obtain categorized data, including applying the site-specific extraction template to the crawl data associated with the corresponding top-level site, and further including extracting instances of the categories from the crawl data and labeling the extracted instances using the categories within the training set, and
replace, periodically, the training set with a modified training set, including re-crawling the top-level sites and re-applying the site-specific extraction template to the crawl data associated with the corresponding top-level site, and further including extracting new instances of the categories from the crawl data and labeling the new instances using the categories within the modified training set.
Dependent claims: 17, 18
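The periodic replacement recited in claims 1, 12, and 16, and the taxonomy augmentation recited in claim 11, can be sketched together as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: `crawl` and `extract` are hypothetical callables standing in for the crawler and the site-specific extractor, the training set is rebuilt wholesale on each refresh, and a new category is adopted only when its parent already exists in the hierarchy, which is one possible heuristic and not a rule taken from the claims.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TrainingSetGenerator:
    taxonomy: dict          # category -> parent category (None for a root)
    top_level_sites: list
    crawl: Callable         # hypothetical: site -> crawl data for that site
    extract: Callable       # hypothetical: (site, data) -> [(text, category, parent)]
    training_set: list = field(default_factory=list)

    def refresh(self):
        """Re-crawl, re-apply extraction, and replace the training set."""
        modified = []
        for site in self.top_level_sites:
            pages = self.crawl(site)
            for text, category, parent in self.extract(site, pages):
                if category not in self.taxonomy:
                    # Assumed heuristic: adopt an unseen category only when
                    # its parent is already part of the hierarchy.
                    if parent not in self.taxonomy:
                        continue
                    self.taxonomy[category] = parent
                modified.append((text, category))
        # Replace rather than append, so stale instances drop out.
        self.training_set = modified
        return modified


# Stand-in crawl results that change between refreshes (illustrative data).
SNAPSHOTS = {
    "shop.example": [("24MP DSLR camera body", "Cameras", "Electronics")],
}


def fake_crawl(site):
    return list(SNAPSHOTS[site])


def fake_extract(site, pages):
    # The fake crawl data is already in (text, category, parent) form.
    return pages


generator = TrainingSetGenerator(
    taxonomy={"Electronics": None, "Cameras": "Electronics"},
    top_level_sites=["shop.example"],
    crawl=fake_crawl,
    extract=fake_extract,
)
first = generator.refresh()

# Simulate the site changing before the next scheduled refresh.
SNAPSHOTS["shop.example"] = [("4K camera drone", "Drones", "Electronics")]
second = generator.refresh()
```

In a deployment, `refresh` would be triggered by a scheduler; the claims shown here say only "periodically" and do not fix an interval.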
Specification