Optimized web domains classification based on progressive crawling with clustering

US 8,972,376 B1
Filed: 01/02/2013
Issued: 03/03/2015
Est. Priority Date: 01/02/2013
Status: Active Grant

Abstract

Techniques for optimized web domains classification based on progressive crawling with clustering are disclosed. In some embodiments, optimized web domains classification based on progressive crawling with clustering includes crawling a domain (e.g., a web site domain) to collect data for a subset of pages (e.g., web pages) of a corpus of content associated with the domain; classifying each of the crawled pages into one or more category clusters, in which the category clusters represent a content categorization of the corpus of content associated with the domain (e.g., a URL content categorization for the domain, host of that domain, and/or directory of that domain); and determining which of the one or more category clusters to publish for the domain.
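
The abstract mentions categorization at the level of the domain, a host of that domain, or a directory of that domain. As a rough illustration only, the sketch below derives one lookup key per level from a URL. The function name `classification_keys` and the "last two labels" rule for the domain are assumptions made for this example, not something the patent specifies; a production system would need a public-suffix list to handle names such as `example.co.uk`.

```python
from typing import Dict
from urllib.parse import urlsplit


def classification_keys(url: str) -> Dict[str, str]:
    """Return illustrative lookup keys at domain, host, and directory level."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    # Assumption: treat the last two labels as the domain. This is a
    # simplification and is wrong for multi-label public suffixes.
    domain = ".".join(labels[-2:]) if len(labels) >= 2 else host
    # Directory: the path up to and including the last "/".
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return {
        "domain": domain,
        "host": host,
        "directory": host + directory,
    }


keys = classification_keys("http://news.example.com/sports/football/index.html")
print(keys)
```

A category cluster could then be stored against whichever of the three keys matches the level being classified.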

25 Claims

1. A system for optimized web domains classification based on progressive crawling with clustering, comprising:
- a processor configured to:
  - crawl a domain to collect data for a subset of pages of a corpus of content associated with the domain;
  - classify each of the crawled pages into one or more category clusters, wherein the category clusters represent a content categorization of the corpus of content associated with the domain, and wherein the classifying of the each of the crawled pages into the one or more category clusters comprises:
    - determine a category for the each of the crawled pages in the domain;
    - group more than one page having the same category into a first cluster;
    - determine whether a number of the more than one page of the first cluster exceeds a first threshold; and
    - in the event that the number of the more than one page of the first cluster does not exceed the first threshold, select a new page within the domain to crawl and classify; and
  - determine which of the one or more category clusters to publish for the domain; and
- a memory coupled to the processor and configured to provide the processor with instructions.
- - 2. The system recited in claim 1, wherein the processor is further configured to:
    - determine a sub-entry point to randomly select as a next web page to crawl of the subset of web pages.
  - 3. The system recited in claim 1, wherein classifying each of the crawled pages into one or more category clusters includes associating each of the crawled pages with a Uniform Resource Locator (URL) content categorization.
  - 4. The system recited in claim 1, wherein the processor is further configured to:
    - promote a cluster to a primary category cluster for the domain.
  - 5. The system recited in claim 1, wherein the processor is further configured to:
    - demote a primary category cluster to a secondary category cluster for the domain.
  - 6. The system recited in claim 1, wherein the processor is further configured to:
    - determine which of the one or more category clusters to promote into a primary category cluster or a secondary category cluster, or to demote.
  - 7. The system recited in claim 1, wherein the processor is further configured to:
    - determine which of the one or more category clusters to promote into a primary category cluster or a secondary category cluster, or to demote using one or more heuristics to determine a confidence level.
  - 8. The system recited in claim 1, wherein the processor is further configured to:
    - determine which of the one or more category clusters to promote into a primary category cluster or a secondary category cluster, or to demote using one or more heuristics to determine a confidence level, including using a rate of cluster size growth.
  - 9. The system recited in claim 1, wherein the processor is further configured to:
    - perform domain classification at a domain level, a host level, and/or a path level.
  - 10. The system recited in claim 1, wherein the processor is further configured to:
    - classify content based on requests for content received from one or more of a plurality of security devices.
  - 11. The system recited in claim 1, further comprising:
    - in the event that the number of the more than one page of the first cluster exceeds the first threshold, identifying a second cluster in the domain based on pages not related to the first cluster, wherein the second cluster includes a number of pages exceeding a second threshold, the second threshold being less than the first threshold.
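
The thresholded loop in claim 1, together with the lower second threshold of claim 11, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `progressive_classify`, the `categorize` callback, and the choice to stop crawling as soon as the first cluster exceeds its threshold are all hypothetical, and the `pages` iterable stands in for an actual crawler.

```python
from collections import Counter
from typing import Callable, Iterable, List


def progressive_classify(
    pages: Iterable[str],
    categorize: Callable[[str], str],
    first_threshold: int,
    second_threshold: int,
) -> List[str]:
    """Crawl pages one at a time until one category cluster is large enough."""
    if second_threshold >= first_threshold:
        raise ValueError("second_threshold must be less than first_threshold")

    clusters: Counter = Counter()
    for page in pages:  # each iteration stands in for "select a new page"
        clusters[categorize(page)] += 1
        first, size = clusters.most_common(1)[0]
        if size > first_threshold:
            # First cluster established. Look for further clusters among the
            # pages that do not belong to it, using the lower threshold.
            others = sorted(
                cat for cat, n in clusters.items()
                if cat != first and n > second_threshold
            )
            return [first] + others
    # Ran out of pages before any cluster exceeded the first threshold.
    return []


prefix_to_category = {"n": "news", "s": "sports", "x": "other"}
pages = ["n1", "n2", "s1", "n3", "s2", "n4", "x1"]
print(progressive_classify(pages, lambda p: prefix_to_category[p[0]], 3, 1))
```

In this run the loop stops after `n4`, so `x1` is never crawled; that early stop is what makes the crawl "progressive" in this sketch.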

12. A method of optimized web domains classification based on progressive crawling with clustering, comprising:
- crawling a domain to collect data for a subset of pages of a corpus of content associated with the domain;
- classifying each of the crawled pages into one or more category clusters, wherein the category clusters represent a content categorization of the corpus of content associated with the domain, and wherein the classifying of the each of the crawled pages into the one or more category clusters comprises:
  - determining a category for the each of the crawled pages in the domain;
  - grouping more than one page having the same category into a first cluster;
  - determining whether a number of the more than one page of the first cluster exceeds a first threshold; and
  - in the event that the number of the more than one page of the first cluster does not exceed the first threshold, selecting a new page within the domain to crawl and classify; and
- determining which of the one or more category clusters to publish for the domain.
- - 13. The method of claim 12, further comprising:
    - determining a sub-entry point to randomly select as a next web page to crawl of the subset of web pages.
  - 14. The method of claim 12, wherein classifying each of the crawled pages into one or more category clusters includes associating each of the crawled pages with a Uniform Resource Locator (URL) content categorization.
  - 15. The method of claim 12, further comprising:
    - determining which of the one or more category clusters to promote into a primary category cluster or a secondary category cluster, or to demote.
  - 16. The method of claim 12, further comprising:
    - classifying content based on requests for content received from one or more of a plurality of security devices.
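
Claim 13 recites randomly selecting a sub-entry point as the next page to crawl. One simple way to picture this is a uniform random pick among sub-entry points that have not yet been crawled. The helper name `pick_next_page`, the use of sets, and the uniform distribution are assumptions made for this sketch; the claim does not say how the random selection is done.

```python
import random
from typing import Optional, Set


def pick_next_page(
    sub_entry_points: Set[str],
    crawled: Set[str],
    rng: random.Random,
) -> Optional[str]:
    """Pick an uncrawled sub-entry point uniformly at random, or None."""
    # Sorting makes the result reproducible for a seeded generator.
    candidates = sorted(sub_entry_points - crawled)
    return rng.choice(candidates) if candidates else None


rng = random.Random(0)
entries = {"/sports/", "/news/", "/shop/"}
print(pick_next_page(entries, {"/news/"}, rng))
```

Passing an explicit `random.Random` instance keeps the selection testable without touching the global random state.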

17. A computer program product for optimized web domains classification based on progressive crawling with clustering, the computer program product being embodied in a tangible non-transitory computer readable storage medium and comprising computer instructions for:
- crawling a domain to collect data for a subset of pages of a corpus of content associated with the domain;
- classifying each of the crawled pages into one or more category clusters, wherein the category clusters represent a content categorization of the corpus of content associated with the domain, and wherein the classifying of the each of the crawled pages into the one or more category clusters comprises:
  - determining a category for the each of the crawled pages in the domain;
  - grouping more than one page having the same category into a first cluster;
  - determining whether a number of the more than one page of the first cluster exceeds a first threshold; and
  - in the event that the number of the more than one page of the first cluster does not exceed the first threshold, selecting a new page within the domain to crawl and classify; and
- determining which of the one or more category clusters to publish for the domain.
- - 18. The computer program product recited in claim 17, further comprising computer instructions for:
    - determining a sub-entry point to randomly select as a next web page to crawl of the subset of web pages.
  - 19. The computer program product recited in claim 17, wherein classifying each of the crawled pages into one or more category clusters includes associating each of the crawled pages with a Uniform Resource Locator (URL) content categorization.
  - 20. The computer program product recited in claim 17, further comprising computer instructions for:
    - determining which of the one or more category clusters to promote into a primary category cluster or a secondary category cluster, or to demote.
  - 21. The computer program product recited in claim 17, further comprising computer instructions for:
    - classifying content based on requests for content received from one or more of a plurality of security devices.
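
The claims describe promoting a cluster to primary or secondary, or demoting it, and claims 7 and 8 mention heuristics that yield a confidence level, including a rate of cluster size growth. The sketch below shows one hypothetical way such a heuristic might be combined. The `ClusterStats` fields, the 0.7/0.3 weights, and the cutoff values are invented for illustration; the claims do not give a formula.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    category: str
    size: int           # pages currently in the cluster
    previous_size: int  # cluster size at the previous checkpoint
    total_pages: int    # pages crawled so far in the domain


def confidence(c: ClusterStats) -> float:
    """Blend the cluster's share of pages with its recent growth rate."""
    share = c.size / c.total_pages if c.total_pages else 0.0
    growth = (c.size - c.previous_size) / c.size if c.size else 0.0
    return 0.7 * share + 0.3 * growth  # weights are hypothetical


def rank(c: ClusterStats, primary_cutoff: float = 0.5,
         secondary_cutoff: float = 0.2) -> str:
    """Map a confidence score to primary, secondary, or demoted."""
    score = confidence(c)
    if score >= primary_cutoff:
        return "primary"
    if score >= secondary_cutoff:
        return "secondary"
    return "demoted"


for stats in (ClusterStats("news", 80, 60, 100),
              ClusterStats("sports", 15, 10, 100),
              ClusterStats("other", 5, 5, 100)):
    print(stats.category, rank(stats))
```

A cluster that is small but still growing quickly scores higher here than a static cluster of the same size, which is one plausible reading of why growth rate would feed the confidence level.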

22. A system that implements a cloud service for providing optimized web domains classification based on progressive crawling with clustering, comprising:
- a processor configured to:
  - distribute a first Uniform Resource Locator (URL) content categorization data feed to a first plurality of subscribers, wherein the first URL content categorization data feed is collected using an optimized web domains classification based on progressive crawling with clustering to determine which category clusters to publish for each categorized web domain, and wherein the distributing of the first Uniform Resource Locator (URL) content categorization data feed to the first plurality of subscribers comprises:
    - receive a request to classify content for a first web domain from a first security device;
    - automatically classify the content for the first web domain, comprising:
      - crawl a plurality of pages in the first web domain;
      - determine a category for the plurality of pages in the first web domain;
      - group more than one page having the same category into a first cluster;
      - determine whether a number of the more than one page of the first cluster exceeds a first threshold; and
      - in the event that the number of the more than one page of the first cluster does not exceed the first threshold, select a new page within the domain to crawl and classify; and
    - send the classification for the content for the first web domain to the first security device; and
  - distribute a second URL content categorization data feed to a second plurality of subscribers, wherein the second URL content categorization data feed is collected using an optimized web domains classification based on progressive crawling with clustering to determine which category clusters to publish for each categorized web domain; and
- a memory coupled to the processor and configured to provide the processor with instructions.
- - 23. The system recited in claim 22, wherein the first plurality of subscribers are associated with a first geography and/or language, and wherein the second plurality of subscribers are associated with a second geography and/or language.
  - 24. The system recited in claim 22, wherein the processor is further configured to:
    - receive a request to classify content for a second web domain from a second security device;
    - automatically classify the content for the second web domain; and
    - send the classification for the content for the second web domain to the second security device.
  - 25. The system recited in claim 22, wherein the processor is further configured to:
    - receive a request to classify content for a second web domain from a second security device;
    - automatically classify the content for the second web domain; and
    - send the classification for the content for the second web domain to the second security device;
    - wherein the first security device stores different URL categorization data than the second security device based on different URL requests passing through first security device.
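
Claims 22 through 25 describe security devices that request classifications from the cloud service and end up storing different URL categorization data depending on the requests each one sees. A minimal sketch of that request-and-store pattern follows. The `SecurityDevice` class, its `cache` dictionary, and the `fake_cloud` lookup are hypothetical stand-ins; the patent does not describe the devices' storage or the service's interface in these terms.

```python
from typing import Callable, Dict, List


class SecurityDevice:
    """Hypothetical device that stores only the classifications it requested."""

    def __init__(self, cloud_lookup: Callable[[str], str]) -> None:
        self._lookup = cloud_lookup
        self.cache: Dict[str, str] = {}

    def categorize(self, domain: str) -> str:
        # Ask the cloud service only for domains this device has not seen.
        if domain not in self.cache:
            self.cache[domain] = self._lookup(domain)
        return self.cache[domain]


requests_seen: List[str] = []


def fake_cloud(domain: str) -> str:
    """Stand-in for the cloud classification service (illustrative data)."""
    requests_seen.append(domain)
    return {"news.example": "news", "shop.example": "shopping"}.get(domain, "unknown")


first = SecurityDevice(fake_cloud)
second = SecurityDevice(fake_cloud)
first.categorize("news.example")
first.categorize("news.example")   # served from the first device's own store
second.categorize("shop.example")
print(first.cache, second.cache, requests_seen)
```

Because each device fills its store from its own traffic, the two caches diverge, which mirrors the situation recited in claim 25.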

Current Assignee: Palo Alto Networks Incorporated
Original Assignee: Palo Alto Networks Incorporated
Inventors: Xu, Lin; Lazzarato, Renzo; Gailis, Renars
Primary Examiner(s): LE, HUNG D
Application Number: US13/732,860
Time in Patent Office: 790 Days
Field of Search: None
US Class Current: 707/710
CPC Class Codes: G06F 16/951 Indexing; Web crawling tech...
