System and method of analyzing web content
First Claim
1. A method of collecting data associated with a plurality of URLs, the method implemented on one or more computer processors, and comprising:
receiving a configuration plug-in, the plug-in specifying a web-crawling mode;
receiving URL data;
determining a plurality of work units from the URL data, each work unit comprising a URL;
determining whether one of a plurality of dispatchers is available for receiving one of the plurality of work units;
sending the one of the plurality of work units to the one of the plurality of dispatchers; and
retrieving content associated with the URL of the work unit, using the one of the plurality of dispatchers based on the web-crawling mode specified by the configuration plug-in,
wherein at least one of the plurality of dispatchers is configured to retrieve content and store the content in a database,
wherein at least one of the plurality of dispatchers is configured to download executable content and execute the executable content in a sandbox environment, and
wherein at least one of the plurality of dispatchers is configured to replace the at least one of the plurality of dispatchers configured to download executable content if the at least one of the plurality of dispatchers configured to download executable content is damaged by execution of the executable content.
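The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the names WorkUnit, Dispatcher, SandboxDamaged, crawl, and the dictionary-based plug-in are hypothetical, the fetch function is supplied by the caller, and the "database" is a plain in-memory list. Damage to a dispatcher is assumed to surface as an exception.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


class SandboxDamaged(Exception):
    """Hypothetical signal that executing downloaded content damaged a dispatcher."""


@dataclass
class WorkUnit:
    url: str


@dataclass
class Dispatcher:
    name: str
    busy: bool = False
    damaged: bool = False

    def retrieve(self, unit: WorkUnit, mode: str,
                 fetch: Callable[[str], str]) -> Dict[str, str]:
        # The crawling mode is the one specified by the configuration plug-in.
        return {"url": unit.url, "mode": mode,
                "content": fetch(unit.url), "dispatcher": self.name}


def make_work_units(url_data: List[str]) -> List[WorkUnit]:
    # One work unit per distinct URL, preserving input order (an assumption).
    return [WorkUnit(u) for u in dict.fromkeys(url_data)]


def find_available(pool: List[Dispatcher]) -> Optional[Dispatcher]:
    # A dispatcher is available if it is neither busy nor damaged.
    return next((d for d in pool if not d.busy and not d.damaged), None)


def replace_damaged(pool: List[Dispatcher]) -> None:
    # Swap each damaged dispatcher for a fresh one in the same slot.
    for i, d in enumerate(pool):
        if d.damaged:
            pool[i] = Dispatcher(name=d.name + "-replacement")


def crawl(plugin: Dict[str, str], url_data: List[str],
          pool: List[Dispatcher], fetch: Callable[[str], str]) -> List[Dict[str, str]]:
    database: List[Dict[str, str]] = []  # stand-in for a real database
    mode = plugin["mode"]
    for unit in make_work_units(url_data):
        dispatcher = find_available(pool)
        if dispatcher is None:
            raise RuntimeError("no dispatcher available")
        try:
            database.append(dispatcher.retrieve(unit, mode, fetch))
        except SandboxDamaged:
            dispatcher.damaged = True
            replace_damaged(pool)
    return database
```

In this sketch a work unit whose retrieval damages its dispatcher is dropped rather than retried; the claim does not say which behavior is intended.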
21 Assignments
0 Petitions
Abstract
A system and method are provided for identifying inappropriate content in websites on a network. Unrecognized uniform resource locators (URLs) or other web content are accessed by workstations and are identified as possibly having malicious content. The URLs or web content may be preprocessed within a gateway server module or some other software module to collect additional information related to the URLs. The URLs may be scanned for known attack signatures, and if any are found, they may be tagged as candidate URLs in need of further analysis by a classification module.
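The signature-scanning step described in the abstract might look roughly like the following sketch. The signature patterns, the ATTACK_SIGNATURES table, and the tag_candidates function are hypothetical and chosen only for illustration; the abstract does not specify what the known attack signatures are or how matches are recorded.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical signatures for illustration only; not taken from the patent.
ATTACK_SIGNATURES = {
    "script-injection": re.compile(r"<script", re.IGNORECASE),
    "path-traversal": re.compile(r"\.\./"),
}


def tag_candidates(urls: Iterable[str]) -> List[Dict[str, object]]:
    """Return the URLs that match at least one signature, tagged with the matches."""
    candidates = []
    for url in urls:
        hits = [name for name, pattern in ATTACK_SIGNATURES.items()
                if pattern.search(url)]
        if hits:
            # Tagged URLs would be passed on to a classification module.
            candidates.append({"url": url, "signatures": hits})
    return candidates
```

URLs with no matching signature are simply left out of the result, on the assumption that only tagged candidates need further analysis.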
662 Citations
15 Claims
Specification