System and method of analyzing web content

US 8,020,206 B2
Filed: 07/10/2006
Issued: 09/13/2011
Est. Priority Date: 07/10/2006
Status: Active Grant

21 Assignments

0 Petitions

Abstract

A system and method are provided for identifying inappropriate content in websites on a network. Unrecognized uniform resource locators (URLs) or other web content are accessed by workstations and are identified as possibly having malicious content. The URLs or web content may be preprocessed within a gateway server module or some other software module to collect additional information related to the URLs. The URLs may be scanned for known attack signatures, and if any are found, they may be tagged as candidate URLs in need of further analysis by a classification module.
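
The signature-scanning step described in the abstract can be illustrated with a minimal sketch. The signature names and patterns below are hypothetical placeholders chosen for illustration, and `tag_candidates` is an assumed helper name; the patent does not specify any particular signatures or matching technique.

```python
import re
from typing import Dict, Iterable, List, Pattern

# Hypothetical attack signatures, for illustration only.
SIGNATURES: Dict[str, Pattern[str]] = {
    "script-injection": re.compile(r"<script", re.IGNORECASE),
    "path-traversal": re.compile(r"\.\./"),
}


def tag_candidates(
    urls: Iterable[str],
    signatures: Dict[str, Pattern[str]] = SIGNATURES,
) -> Dict[str, List[str]]:
    """Return the URLs that match at least one signature, with the matching names.

    URLs that match nothing are left out; the tagged ones are the
    candidates that would be passed on for further classification.
    """
    candidates: Dict[str, List[str]] = {}
    for url in urls:
        hits = [name for name, pattern in signatures.items() if pattern.search(url)]
        if hits:
            candidates[url] = hits
    return candidates
```

In this sketch a URL with no signature hits is simply skipped, so only tagged candidates reach the later classification stage.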

662 Citations

15 Claims

1. A method of collecting data associated with a plurality of URLs, the method implemented on one or more computer processors, and comprising:
- receiving a configuration plug-in, the plug-in specifying a web-crawling mode;
  
  receiving URL data;
  
  determining a plurality of work units from the URL data, each work unit comprising a URL;
  
  determining whether one of a plurality of dispatchers is available for receiving one of the plurality of work units;
  
  sending the one of the plurality of work units to the one of the plurality of dispatchers; and
  
  retrieving content associated with the URL of the work unit, using the one of the plurality of dispatchers based on the web-crawling mode specified by the configuration plug-in,
  
  wherein at least one of the plurality of dispatchers is configured to retrieve content and store the content in a database,
  
  wherein at least one of the plurality of dispatchers is configured to download executable content and execute the executable content in a sandbox environment, and
  
  wherein at least one of the plurality of dispatchers is configured to replace the at least one of the plurality of dispatchers configured to download executable content if the at least one of the plurality of dispatchers configured to download executable content is damaged by execution of the executable content.
  - 2. The computer-implemented method of claim 1, wherein the web-crawling mode comprises specifying the protocol with which a dispatcher is to request web content, the protocol comprising one of FTP, NNTP, or HTTP.
  - 3. The method of claim 1, further comprising assigning each of the work units a priority.
  - 4. The method of claim 3, wherein the one of the plurality of dispatchers to which the one of the plurality of work units is sent is determined based on at least the priority assigned to the work unit.
  - 5. The method of claim 1, wherein at least one of the dispatchers comprises a web crawler.
  - 6. The method of claim 1, wherein at least one of the dispatchers comprises an active honey miner.
  - 7. The method of claim 6, wherein the active honey miner is located within the sandbox environment.
  - 8. The method of claim 7, wherein the sandbox comprises a virtual machine emulating an operating system.
  - 9. The method of claim 8, wherein a server records URLs visited by the active honey miner as a result of executing the executable content.
  - 10. The method of claim 2, wherein the plug-in comprises an HTTP plug-in.
  - 11. The method of claim 2, wherein the plug-in comprises an FTP plug-in.
  - 12. The method of claim 2, wherein the plug-in comprises an NNTP plug-in.
  - 13. The method of claim 1, wherein the plug-in specifies that the dispatchers add URL links within the content to the URL data to be analyzed.
  - 14. The method of claim 1, wherein the plurality of dispatchers are controlled by a driver.
  - 15. The method of claim 14, wherein the driver determines when additional URL data is to be received.
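
The flow recited in claims 1, 3, 4, 14 and 15 can be sketched as follows. This is an illustrative reading under stated assumptions, not the patented implementation: every class and function name (`CrawlPlugin`, `WorkUnit`, `Dispatcher`, `Driver`, `fake_fetch`) is hypothetical, a lower priority number is assumed to mean "served first", the database is modeled as an in-memory dictionary, and network retrieval is replaced by a stub so the example runs offline.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

# All names here are hypothetical; the claims do not prescribe an API.
Fetch = Callable[[str, str], str]  # (protocol, url) -> retrieved content


@dataclass(frozen=True)
class CrawlPlugin:
    """Configuration plug-in naming the web-crawling mode (here, a protocol)."""
    protocol: str  # e.g. "HTTP", "FTP", or "NNTP"


@dataclass(order=True)
class WorkUnit:
    priority: int                    # assumption: lower value is served first
    url: str = field(compare=False)  # each work unit comprises a URL


class Dispatcher:
    def __init__(self, name: str, fetch: Fetch) -> None:
        self.name = name
        self.fetch = fetch
        self.busy = False
        self.damaged = False  # stands in for damage from sandboxed execution

    def is_available(self) -> bool:
        return not (self.busy or self.damaged)

    def retrieve(self, unit: WorkUnit, plugin: CrawlPlugin,
                 database: Dict[str, str]) -> None:
        """Retrieve content in the plug-in's mode and store it in the database."""
        self.busy = True
        try:
            database[unit.url] = self.fetch(plugin.protocol, unit.url)
        finally:
            self.busy = False


class Driver:
    """Controls the dispatchers and decides when more URL data is taken in."""

    def __init__(self, plugin: CrawlPlugin, dispatchers: Iterable[Dispatcher]) -> None:
        self.plugin = plugin
        self.dispatchers = list(dispatchers)
        self.queue: List[WorkUnit] = []
        self.database: Dict[str, str] = {}

    def add_urls(self, urls: Iterable[str], priority: int = 0) -> None:
        # Turn received URL data into prioritized work units.
        for url in urls:
            heapq.heappush(self.queue, WorkUnit(priority, url))

    def replace_damaged(self, make_dispatcher: Callable[[str], Dispatcher]) -> None:
        # Swap every damaged dispatcher for a fresh one with the same name.
        self.dispatchers = [
            make_dispatcher(d.name) if d.damaged else d for d in self.dispatchers
        ]

    def run(self) -> None:
        while self.queue:
            dispatcher = next((d for d in self.dispatchers if d.is_available()), None)
            if dispatcher is None:
                break  # no dispatcher available; leave remaining units queued
            dispatcher.retrieve(heapq.heappop(self.queue), self.plugin, self.database)


def fake_fetch(protocol: str, url: str) -> str:
    """Offline stand-in for real retrieval over the given protocol."""
    return f"{protocol} content for {url}"
```

A driver built as `Driver(CrawlPlugin("HTTP"), [Dispatcher("d1", fake_fetch)])` processes queued URLs in priority order. If a dispatcher is flagged as damaged, work stays queued until `replace_damaged` supplies a replacement.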

Current Assignee: Forcepoint LLC
Original Assignee: Forcepoint LLC
Inventors: Baddour, Victor Louie; Verenini, Nicholas Joseph; Hubbard, Dan
Primary Examiner(s): Orgad; Edan
Assistant Examiner(s): JACKSON, JENISE E
Application Number: US11/484,240
Publication Number: US 20080010368A1
Time in Patent Office: 1,891 Days
Field of Search: 726/22, 726/3, 726/26, 713/154, 713/165
US Class Current: 726/22
CPC Class Codes:
G06F 16/285   Clustering or classification
G06F 16/951   Indexing; Web crawling tech...
H04L 63/1441   Countermeasures against mal...
