Training set construction for taxonomic classification
First Claim
1. A computer system comprising:
at least one processor; and
a computer-readable medium storing instructions that, when executed by the at least one processor, cause the computer system to execute:
a training set generator configured to input a taxonomy including a hierarchy of categories and a plurality of top-level sites, and to output a training set of categorized data, the training set generator including:
a crawler configured to crawl each of the top-level sites to determine at least one lower-level site associated therewith and to store the top-level sites and associated lower-level sites as crawl data, the crawler including a site finder configured to receive the plurality of top-level sites from a user, and
an extractor configured to determine, for each of the top-level sites, a corresponding site-specific extraction template associating at least one portion of the corresponding top-level site with at least one category of the hierarchy of categories, and further configured to apply each site-specific extraction template to corresponding crawl data to thereby associate the crawl data with the categories of the hierarchical categories and obtain categorized data of the training set.
Abstract
A training set generator may be configured to input a taxonomy including a hierarchy of categories and a plurality of top-level sites, and to output a training set of categorized data. The training set generator may include a crawler configured to crawl each of the top-level sites to determine at least one lower-level site associated therewith and to store the top-level sites and associated lower-level sites as crawl data. The training set generator also may include an extractor configured to determine, for each of the top-level sites, a corresponding site-specific extraction template associating at least one portion of the corresponding top-level site with at least one category of the hierarchy of categories, and further configured to apply each site-specific extraction template to corresponding crawl data to thereby associate the crawl data with the categories of the hierarchical categories and obtain categorized data of the training set.
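As an illustration only, the idea of a site-specific extraction template can be sketched in Python. The abstract does not say which portion of a site a template keys on; this sketch assumes URL path prefixes. The class `ExtractionTemplate`, its `rules` field, the site `https://shop.example`, and the category names are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionTemplate:
    """Hypothetical site-specific template: URL-path patterns mapped to category paths."""
    site: str
    rules: tuple  # (regex pattern, category path) pairs, checked in order

    def categorize(self, url):
        """Return the category path of the first matching rule, or None."""
        if not url.startswith(self.site):
            return None
        path = url[len(self.site):]
        for pattern, category in self.rules:
            if re.match(pattern, path):
                return category
        return None


# Assumed example: one template for one top-level site.
template = ExtractionTemplate(
    site="https://shop.example",
    rules=(
        (r"/laptops/", ("Electronics", "Computers", "Laptops")),
        (r"/phones/", ("Electronics", "Phones")),
    ),
)

# Assumed crawl data: the top-level site's lower-level pages.
crawl_data = [
    "https://shop.example/laptops/model-a",
    "https://shop.example/phones/model-b",
    "https://shop.example/about",
]

# Applying the template yields categorized data; uncategorized pages are dropped.
training_set = []
for url in crawl_data:
    category = template.categorize(url)
    if category is not None:
        training_set.append((url, category))
```

In this sketch each category is stored as a full path through the hierarchy, so a downstream classifier could be trained at any level of the taxonomy.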
20 Claims
1. A computer system comprising:
at least one processor; and
a computer-readable medium storing instructions that, when executed by the at least one processor, cause the computer system to execute:
a training set generator configured to input a taxonomy including a hierarchy of categories and a plurality of top-level sites, and to output a training set of categorized data, the training set generator including:
a crawler configured to crawl each of the top-level sites to determine at least one lower-level site associated therewith and to store the top-level sites and associated lower-level sites as crawl data, the crawler including a site finder configured to receive the plurality of top-level sites from a user, and
an extractor configured to determine, for each of the top-level sites, a corresponding site-specific extraction template associating at least one portion of the corresponding top-level site with at least one category of the hierarchy of categories, and further configured to apply each site-specific extraction template to corresponding crawl data to thereby associate the crawl data with the categories of the hierarchical categories and obtain categorized data of the training set.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A computer-implemented method comprising:
determining a taxonomy including a hierarchy of categories;
receiving a plurality of top-level sites related to the taxonomy from a user;
determining, for each of the top-level sites, a corresponding site-specific extraction template associating at least one portion of the corresponding top-level site with at least one category of the hierarchy of categories;
crawling each of the top-level sites to determine at least one lower-level site associated therewith;
storing the top-level sites and associated lower-level sites as crawl data; and
applying each site-specific extraction template to corresponding crawl data to thereby associate the crawl data with the categories of the hierarchical categories and obtain categorized data for a training set.
Dependent claims: 14, 15, 16
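The method steps above can be sketched end to end in Python. This is a minimal sketch under stated assumptions, not the claimed implementation: an in-memory link table `LINKS` stands in for the web, the taxonomy is a child-to-parent mapping, and `determine_template` uses an assumed heuristic (matching the first URL path segment against category names). The claim itself does not specify how a template is determined. All names and URLs are hypothetical.

```python
from collections import deque

# Hypothetical taxonomy: category -> parent category (None marks the root).
TAXONOMY = {"Electronics": None, "Laptops": "Electronics", "Phones": "Electronics"}

# Hypothetical stand-in for the web: page URL -> outgoing links.
LINKS = {
    "https://shop.example": [
        "https://shop.example/laptops",
        "https://shop.example/phones",
        "https://other.example",
    ],
    "https://shop.example/laptops": ["https://shop.example/laptops/model-a"],
    "https://shop.example/phones": ["https://shop.example/phones/model-b"],
}


def crawl(top_level_site, links):
    """Breadth-first crawl that stays within one top-level site."""
    seen, queue = {top_level_site}, deque([top_level_site])
    while queue:
        page = queue.popleft()
        for link in links.get(page, []):
            if link.startswith(top_level_site + "/") and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(seen)


def first_segment(url, site):
    """First path segment of url relative to site ('' for the site itself)."""
    return url[len(site) + 1:].split("/")[0]


def determine_template(top_level_site, links, taxonomy):
    """Assumed heuristic: map path segments to categories with matching names."""
    by_name = {name.lower(): name for name in taxonomy}
    template = {}
    for link in links.get(top_level_site, []):
        if link.startswith(top_level_site + "/"):
            segment = first_segment(link, top_level_site)
            if segment in by_name:
                template[segment] = by_name[segment]
    return template


def category_path(category, taxonomy):
    """Follow parent links to build the path from the root to the category."""
    path = []
    while category is not None:
        path.append(category)
        category = taxonomy[category]
    return tuple(reversed(path))


def build_training_set(top_level_sites, links, taxonomy):
    """Determine a template per site, crawl it, and apply the template."""
    training_set = []
    for site in top_level_sites:
        template = determine_template(site, links, taxonomy)
        crawl_data = crawl(site, links)  # top-level site plus lower-level pages
        for url in crawl_data:
            segment = first_segment(url, site)
            if segment in template:
                training_set.append((url, category_path(template[segment], taxonomy)))
    return training_set


result = build_training_set(["https://shop.example"], LINKS, TAXONOMY)
```

Here the external link to `https://other.example` is never followed, and the top-level page itself receives no category because it matches no template rule.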
17. A computer program product, the computer program product being tangibly embodied on a non-transitory computer-readable medium and including executable code that, when executed, is configured to cause a data processing apparatus to:
determine a taxonomy including a hierarchy of categories;
receive a plurality of top-level sites related to the taxonomy from a user;
determine, for each of the top-level sites, a corresponding site-specific extraction template associating at least one portion of the corresponding top-level site with at least one category of the hierarchy of categories;
crawl each of the top-level sites to determine at least one lower-level site associated therewith;
store the top-level sites and associated lower-level sites as crawl data; and
apply each site-specific extraction template to corresponding crawl data to thereby associate the crawl data with the categories of the hierarchical categories and obtain categorized data for a training set.
Dependent claims: 18, 19, 20
Specification