Apparatus and method for gathering of objectional web sites
Abstract
An apparatus and method for collecting harmful web sites are provided. In the apparatus, a start uniform resource locator (URL) database (DB) stores URLs of harmful web pages. A URL examination and distribution unit first removes, from the URLs stored in the start URL DB, redundant URLs that differ from each other but indicate identical web pages; it then removes, from the remaining URLs, those corresponding to web sites already collected, and provides the resulting URLs grouped in relation to predetermined hosts. A web site collection unit collects web contents of the web sites corresponding to the URLs received from the URL examination and distribution unit. A URL extraction unit extracts URLs from the links included in the web contents collected by the web site collection unit, identifies harmless URLs among the extracted URLs based on top-level domain names and a harmless URL list, and removes the identified harmless URLs from the URLs targeted for collection. The apparatus and method thereby help the harmful site database maintain accurate, abundant, and up-to-date information.
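The behavior of the URL examination and distribution unit can be illustrated with a minimal Python sketch. The abstract does not say how redundant URLs are recognized or how "already collected" is tracked, so the normalization rules below (lowercasing the scheme and host, dropping default ports and fragments), the use of host names to represent collected sites, and the function names `normalize` and `examine_and_distribute` are all assumptions made for illustration, not the patented implementation.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Reduce URL spellings that likely name the same page to one form.

    The rules here are assumed for illustration; the source does not list them.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(scheme)
    netloc = host if parts.port in (None, default_port) else f"{host}:{parts.port}"
    path = parts.path or "/"
    # The fragment is dropped because it does not change which page is fetched.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def examine_and_distribute(start_urls, collected_hosts):
    """Deduplicate start URLs, skip collected sites, and group the rest by host."""
    seen = set()
    groups = defaultdict(list)
    for url in start_urls:
        canonical = normalize(url)
        if canonical in seen:
            continue  # redundant: a different spelling of a URL already kept
        seen.add(canonical)
        host = urlsplit(canonical).hostname
        if not host or host in collected_hosts:
            continue  # no usable host, or the site was already collected
        groups[host].append(canonical)
    return dict(groups)
```

Grouping by host lets each group be handed to a separate collector, which is one plausible reading of "grouped in relation to predetermined hosts".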
37 Citations
14 Claims
1. A harmful site collection apparatus comprising:
a start uniform resource locator (URL) database (DB) storing URLs of harmful web pages;
a URL examination and distribution unit providing URLs grouped in relation to predetermined hosts, the URLs obtained by removing redundant URLs that are different to each other but indicate identical web pages, among the URLs stored in the start URL DB, and then among the remaining URLs, removing URLs corresponding to web sites already collected;
a web site collection unit collecting web contents of the web sites corresponding to the URLs received from the URL examination and distribution unit; and
a URL extraction unit extracting URLs in the links included in the web contents collected by the web site collection unit, identifying harmless URLs based on top-level domain names and a harmless URL list among the extracted URLs, and removing the identified harmless URLs from the URLs that are the object of the collection.
2. The apparatus of claim 1, wherein the web site collection unit determines whether or not a characteristic pattern that occurs when the web site is accessed is similar to a characteristic pattern that occurs when a harmful site is accessed.
10. A harmful site collection method comprising:
removing redundant URLs that are different to each other but indicate identical web pages, among URLs stored in a start URL DB, then removing URLs corresponding to web sites already collected among the remaining URLs, then dividing the URLs into groups in relation to predetermined hosts and providing the groups of URLs;
collecting web contents of the web sites corresponding to the arranged URLs and based on a characteristic pattern that occurs when a harmful site is accessed, analyzing whether or not the web site is harmful; and
extracting URLs from links included in the collected web contents, identifying harmless URLs among the extracted URLs, based on top-level domain names and a harmless URL list, and removing the identified harmless URLs from the URLs that are the object of the collection.
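The extraction step of the method can be sketched in Python as follows. This is only an illustration under stated assumptions: the claims do not name which top-level domains count as harmless or what the harmless URL list contains, so `HARMLESS_TLDS`, `HARMLESS_HOSTS`, and the helper names `LinkParser`, `is_harmless`, and `extract_candidates` are hypothetical, and matching the harmless list by host name is one possible interpretation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Hypothetical values; the source does not enumerate them.
HARMLESS_TLDS = {"gov", "edu"}
HARMLESS_HOSTS = {"www.example.org"}


class LinkParser(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, value))


def is_harmless(url):
    """Return True if the URL matches a harmless TLD or a listed harmless host."""
    host = (urlsplit(url).hostname or "").lower()
    tld = host.rsplit(".", 1)[-1]
    return tld in HARMLESS_TLDS or host in HARMLESS_HOSTS


def extract_candidates(html, base_url):
    """Extract links from collected content and drop the harmless ones."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return [
        url
        for url in parser.links
        if urlsplit(url).scheme in ("http", "https") and not is_harmless(url)
    ]
```

The URLs that survive the filter would be fed back as collection targets, so removing harmless ones early keeps the collector from spending effort on sites that are unlikely to be harmful.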
Specification