Optimized web domains classification based on progressive crawling with clustering
First Claim
1. A system for optimized web domains classification based on progressive crawling with clustering, comprising:
- a processor configured to:
crawl a domain to collect data for a subset of pages of a corpus of content associated with the domain;
classify each of the crawled pages into one or more category clusters, wherein the category clusters represent a content categorization of the corpus of content associated with the domain, and wherein the classifying of the each of the crawled pages into the one or more category clusters comprises:
determine a category for the each of the crawled pages in the domain;
group more than one page having the same category into a first cluster;
determine whether a number of the more than one page of the first cluster exceeds a first threshold; and
in the event that the number of the more than one page of the first cluster does not exceed the first threshold, select a new page within the domain to crawl and classify; and
determine which of the one or more category clusters to publish for the domain; and
a memory coupled to the processor and configured to provide the processor with instructions.
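The progressive loop recited in the first claim can be illustrated with a minimal Python sketch. Only the claimed steps come from the patent text. The names `progressive_classify`, `fetch_page`, `categorize`, and `toy_categorize`, the in-memory `SITE` data, and the threshold value are all hypothetical. The stopping rule (stop as soon as one cluster exceeds the threshold, or when the candidate pages run out) is one possible reading, not necessarily the patented implementation.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def progressive_classify(
    candidate_urls: Iterable[str],
    fetch_page: Callable[[str], str],
    categorize: Callable[[str], str],
    first_threshold: int,
) -> Dict[str, List[str]]:
    """Crawl pages one at a time until some category cluster exceeds the threshold."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for url in candidate_urls:           # select a new page within the domain
        content = fetch_page(url)         # crawl it
        category = categorize(content)    # determine its category
        clusters[category].append(url)    # group pages that share a category
        if len(clusters[category]) > first_threshold:
            break                         # this cluster is large enough; stop crawling
    return dict(clusters)


# Toy in-memory "domain" standing in for real HTTP fetches (hypothetical data).
SITE = {
    "/": "sports scores",
    "/news": "news headlines",
    "/nba": "sports basketball",
    "/nfl": "sports football",
    "/about": "company info",
}


def toy_categorize(text: str) -> str:
    # Hypothetical keyword rule; a real system would use a trained classifier.
    return "sports" if "sports" in text else "other"


result = progressive_classify(SITE, SITE.__getitem__, toy_categorize, first_threshold=2)
```

In this toy run the loop stops after the third "sports" page, so `/about` is never fetched. That early exit is the saving progressive crawling aims for: only a subset of the domain is crawled.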
Abstract
Techniques for optimized web domains classification based on progressive crawling with clustering are disclosed. In some embodiments, optimized web domains classification based on progressive crawling with clustering includes crawling a domain (e.g., a web site domain) to collect data for a subset of pages (e.g., web pages) of a corpus of content associated with the domain; classifying each of the crawled pages into one or more category clusters, in which the category clusters represent a content categorization of the corpus of content associated with the domain (e.g., a URL content categorization for the domain, host of that domain, and/or directory of that domain); and determining which of the one or more category clusters to publish for the domain.
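The abstract does not say how the clusters to publish are chosen. As a hedged illustration only, the sketch below assumes one simple policy: publish every category whose share of the crawled pages meets a minimum fraction. The function name `select_clusters_to_publish`, the `min_share` parameter, and the sample data are hypothetical and are not taken from the patent.

```python
from typing import Dict, List


def select_clusters_to_publish(
    clusters: Dict[str, List[str]], min_share: float
) -> List[str]:
    """Return categories whose share of crawled pages is at least min_share."""
    total = sum(len(pages) for pages in clusters.values())
    if total == 0:
        return []  # nothing was crawled, so nothing can be published
    # Rank categories by cluster size, largest first (assumed ordering policy).
    ranked = sorted(clusters, key=lambda c: len(clusters[c]), reverse=True)
    return [c for c in ranked if len(clusters[c]) / total >= min_share]


published = select_clusters_to_publish(
    {"sports": ["/", "/nba", "/nfl"], "news": ["/news"]}, min_share=0.5
)
```

Here "sports" holds three of the four crawled pages and is published, while "news" falls below the assumed 50% share and is not.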
40 Citations
25 Claims
1. A system for optimized web domains classification based on progressive crawling with clustering, comprising:
- a processor configured to:
crawl a domain to collect data for a subset of pages of a corpus of content associated with the domain;
classify each of the crawled pages into one or more category clusters, wherein the category clusters represent a content categorization of the corpus of content associated with the domain, and wherein the classifying of the each of the crawled pages into the one or more category clusters comprises:
determine a category for the each of the crawled pages in the domain;
group more than one page having the same category into a first cluster;
determine whether a number of the more than one page of the first cluster exceeds a first threshold; and
in the event that the number of the more than one page of the first cluster does not exceed the first threshold, select a new page within the domain to crawl and classify; and
determine which of the one or more category clusters to publish for the domain; and
a memory coupled to the processor and configured to provide the processor with instructions.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A method of optimized web domains classification based on progressive crawling with clustering, comprising:
- crawling a domain to collect data for a subset of pages of a corpus of content associated with the domain;
classifying each of the crawled pages into one or more category clusters, wherein the category clusters represent a content categorization of the corpus of content associated with the domain, and wherein the classifying of the each of the crawled pages into the one or more category clusters comprises:
determining a category for the each of the crawled pages in the domain;
grouping more than one page having the same category into a first cluster;
determining whether a number of the more than one page of the first cluster exceeds a first threshold; and
in the event that the number of the more than one page of the first cluster does not exceed the first threshold, selecting a new page within the domain to crawl and classify; and
determining which of the one or more category clusters to publish for the domain.
Dependent claims: 13, 14, 15, 16
17. A computer program product for optimized web domains classification based on progressive crawling with clustering, the computer program product being embodied in a tangible non-transitory computer readable storage medium and comprising computer instructions for:
- crawling a domain to collect data for a subset of pages of a corpus of content associated with the domain;
classifying each of the crawled pages into one or more category clusters, wherein the category clusters represent a content categorization of the corpus of content associated with the domain, and wherein the classifying of the each of the crawled pages into the one or more category clusters comprises:
determining a category for the each of the crawled pages in the domain;
grouping more than one page having the same category into a first cluster;
determining whether a number of the more than one page of the first cluster exceeds a first threshold; and
in the event that the number of the more than one page of the first cluster does not exceed the first threshold, selecting a new page within the domain to crawl and classify; and
determining which of the one or more category clusters to publish for the domain.
Dependent claims: 18, 19, 20, 21
22. A system that implements a cloud service for providing optimized web domains classification based on progressive crawling with clustering, comprising:
- a processor configured to:
distribute a first Uniform Resource Locator (URL) content categorization data feed to a first plurality of subscribers, wherein the first URL content categorization data feed is collected using an optimized web domains classification based on progressive crawling with clustering to determine which category clusters to publish for each categorized web domain, and wherein the distributing of the first Uniform Resource Locator (URL) content categorization data feed to the first plurality of subscribers comprises:
receive a request to classify content for a first web domain from a first security device;
automatically classify the content for the first web domain, comprising:
crawl a plurality of pages in the first web domain;
determine a category for the plurality of pages in the first web domain;
group more than one page having the same category into a first cluster;
determine whether a number of the more than one page of the first cluster exceeds a first threshold; and
in the event that the number of the more than one page of the first cluster does not exceed the first threshold, select a new page within the domain to crawl and classify; and
send the classification for the content for the first web domain to the first security device; and
distribute a second URL content categorization data feed to a second plurality of subscribers, wherein the second URL content categorization data feed is collected using an optimized web domains classification based on progressive crawling with clustering to determine which category clusters to publish for each categorized web domain; and
a memory coupled to the processor and configured to provide the processor with instructions.
Dependent claims: 23, 24, 25
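The request-and-feed flow of claim 22 might be organized roughly as in the Python sketch below. The class `CategorizationService`, its method names, the callback-based subscriber model, and the stub classifier are all hypothetical. The sketch also assumes that a domain classified on request is folded into the next feed, which the claim itself does not state.

```python
from typing import Callable, Dict, List

Feed = Dict[str, List[str]]  # domain -> published categories


class CategorizationService:
    """Toy in-process stand-in for the claimed cloud service."""

    def __init__(self, classify_domain: Callable[[str], List[str]]) -> None:
        self._classify_domain = classify_domain
        self._feed: Feed = {}
        self._subscribers: List[Callable[[Feed], None]] = []

    def subscribe(self, callback: Callable[[Feed], None]) -> None:
        # Register a subscriber that will receive feed snapshots.
        self._subscribers.append(callback)

    def handle_request(self, domain: str) -> List[str]:
        # Classify on demand and return the result to the requesting device.
        categories = self._classify_domain(domain)
        self._feed[domain] = categories  # assumption: result joins the next feed
        return categories

    def distribute_feed(self) -> None:
        # Send every subscriber its own copy of the current feed.
        for callback in self._subscribers:
            callback(dict(self._feed))


# Stub classifier in place of the progressive crawl (hypothetical).
service = CategorizationService(lambda domain: ["sports"])
received: List[Feed] = []
service.subscribe(received.append)

answer = service.handle_request("example.com")  # request from a security device
service.distribute_feed()                        # subsequent feed to subscribers
```

Keeping the classifier as an injected callable separates the crawl-and-cluster logic from the distribution logic, so either can be swapped out without touching the other.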
Specification