Method and apparatus for automatic information filtering using URL hierarchical structure and automatic word weight learning
First Claim
1. A method of automatic information filtering for identifying inappropriate information among various information provided through Internet and blocking presentation of identified inappropriate information, comprising the steps of:
entering an HTML (HyperText Markup Language) information provided through the Internet;
judging whether a URL (Uniform Resource Locator) of said HTML information entered from the Internet is a top page URL or not, the top page URL being a URL ending with a prescribed character string defined according to a URL hierarchical structure by which each URL is constructed;
extracting words appearing in information indicated by the top page URL and carrying out an automatic filtering to judge whether said information indicated by the top page URL is inappropriate or not according to the words extracted from said information indicated by the top page URL, when said URL of said HTML information is the top page URL;
registering an upper level URL derived from the top page URL into an inappropriate upper level URL list and blocking presentation of said information indicated by the top page URL, when said information indicated by the top page URL is judged as inappropriate by the automatic filtering, the upper level URL being derived from the top page URL by keeping a character string constituting the top page URL only up to a rightmost slash;
comparing said URL of said HTML information with each URL registered in the inappropriate upper level URL list and judging whether there is any matching URL in the inappropriate upper level URL list when said URL of said HTML information is not the top page URL, and blocking presentation of information indicated by said URL of said HTML information when there is a matching URL in the inappropriate upper level URL list, the matching URL being one upper level URL whose character string is contained in said URL of said HTML information;
extracting words appearing in said information indicated by said URL of said HTML information, and carrying out the automatic filtering to judge whether said information indicated by said URL of said HTML information is inappropriate or not according to the words extracted from said information indicated by said URL of said HTML information, when there is no matching URL in the inappropriate upper level URL list; and
blocking presentation of said information indicated by said URL of said HTML information when said information indicated by said URL of said HTML information is judged as inappropriate by the automatic filtering.
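The steps of claim 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the suffix list `TOP_PAGE_SUFFIXES`, the class name `UrlHierarchyFilter`, the word-weight dictionary, and the threshold are all hypothetical, since the claim only speaks of "a prescribed character string" and "an automatic filtering" based on extracted words.

```python
import re
from html.parser import HTMLParser

# Assumption: the claim only says "a prescribed character string";
# these suffixes are illustrative choices, not taken from the patent.
TOP_PAGE_SUFFIXES = ("/", "/index.html", "/index.htm")


def is_top_page_url(url):
    """Judge whether the URL ends with one of the prescribed strings."""
    return url.endswith(TOP_PAGE_SUFFIXES)


def upper_level_url(top_page_url):
    """Keep the URL string only up to (and including) its rightmost slash."""
    return top_page_url[: top_page_url.rfind("/") + 1]


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_words(html):
    """Extract lowercase alphanumeric words from the text of an HTML page."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower())


class UrlHierarchyFilter:
    """Hypothetical filter combining word-based judging with the URL list."""

    def __init__(self, weights, threshold):
        self.weights = weights        # assumed word -> weight mapping
        self.threshold = threshold    # assumed decision threshold
        self.inappropriate_upper_urls = set()

    def _judge(self, html):
        # Stand-in for the "automatic filtering": sum of word weights.
        score = sum(self.weights.get(w, 0.0) for w in extract_words(html))
        return score > self.threshold

    def should_block(self, url, html):
        if is_top_page_url(url):
            if self._judge(html):
                # Register the upper level URL and block the top page.
                self.inappropriate_upper_urls.add(upper_level_url(url))
                return True
            return False
        # Not a top page: block if a registered upper level URL is
        # contained in this URL, without looking at the page content.
        if any(upper in url for upper in self.inappropriate_upper_urls):
            return True
        # Otherwise fall back to word-based filtering of the page itself.
        return self._judge(html)
```

With this sketch, once a top page such as `http://example.com/a/index.html` is judged inappropriate, a later image-only page under `http://example.com/a/` is blocked through the list lookup even though it contains no words to score.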
1 Assignment
0 Petitions
Abstract
The disclosed method and apparatus for automatic information filtering improve both precision and recall, and can accurately judge whether content is inappropriate even for a page that contains little or no text and displays only images, by utilizing an upper level URL of a URL given in a hierarchical structure. The disclosed method and apparatus can also set word weights easily and accurately, and judge the inappropriateness of information using those weights, through automatic learning based on a linear discrimination function that separates inappropriate information from appropriate information in a vector space.
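As a rough illustration of learning word weights with a linear discrimination function, the sketch below fits weights over word-count vectors using the classic perceptron rule. The abstract does not state which learning rule the patent uses, so the perceptron is a swapped-in, well-known technique chosen for brevity; the function names, the bias term, and the label convention (+1 inappropriate, -1 appropriate) are assumptions made for this example.

```python
from collections import Counter


def discriminant(words, weights, bias):
    """Linear discrimination function over word counts: bias + sum(w_i * x_i)."""
    counts = Counter(words)
    return bias + sum(weights.get(w, 0.0) * c for w, c in counts.items())


def train_word_weights(docs, labels, epochs=20, lr=1.0):
    """Fit word weights with the perceptron rule (an assumed learning rule).

    docs   -- list of word lists, one per training page
    labels -- +1 for inappropriate pages, -1 for appropriate pages
    """
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for words, y in zip(docs, labels):
            if y * discriminant(words, weights, bias) <= 0:
                # Misclassified: move the separating hyperplane toward
                # the correct side for this example.
                for w, c in Counter(words).items():
                    weights[w] = weights.get(w, 0.0) + lr * y * c
                bias += lr * y
                mistakes += 1
        if mistakes == 0:
            break  # every training page is on the correct side
    return weights, bias


# Toy training data, invented purely for illustration.
docs = [["adult", "pics"], ["adult", "video"],
        ["news", "sports"], ["weather", "news"]]
labels = [1, 1, -1, -1]
weights, bias = train_word_weights(docs, labels)
```

A page is then judged inappropriate when `discriminant(words, weights, bias)` is positive. On linearly separable training data the perceptron rule is guaranteed to find some separating set of weights if given enough passes; real page collections are rarely that clean, so this should be read as a sketch of the idea rather than a recommended training procedure.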
70 Citations
6 Claims
3. An automatic information filtering apparatus for identifying inappropriate information among various information provided through Internet and blocking presentation of identified inappropriate information, comprising:
an input unit for entering an HTML (HyperText Markup Language) information provided through the Internet;
a top page URL judging unit for judging whether a URL (Uniform Resource Locator) of said HTML information entered from the Internet is a top page URL or not, the top page URL being a URL ending with a prescribed character string defined according to a URL hierarchical structure by which each URL is constructed;
a first automatic filtering unit for extracting words appearing in information indicated by the top page URL and carrying out an automatic filtering to judge whether said information indicated by the top page URL is inappropriate or not according to the words extracted from said information indicated by the top page URL, when said URL of said HTML information is the top page URL;
an inappropriate upper level URL list registration unit for registering an upper level URL derived from the top page URL into an inappropriate upper level URL list and blocking presentation of said information indicated by the top page URL, when said information indicated by the top page URL is judged as inappropriate by the automatic filtering, the upper level URL being derived from the top page URL by keeping a character string constituting the top page URL only up to a rightmost slash;
an inappropriate URL judging unit for comparing said URL of said HTML information with each URL registered in the inappropriate upper level URL list and judging whether there is any matching URL in the inappropriate upper level URL list when said URL of said HTML information is not the top page URL, and blocking presentation of information indicated by said URL of said HTML information when there is a matching URL in the inappropriate upper level URL list, the matching URL being one upper level URL whose character string is contained in said URL of said HTML information;
a second automatic filtering unit for extracting words appearing in said information indicated by said URL of said HTML information, and carrying out the automatic filtering to judge whether said information indicated by said URL of said HTML information is inappropriate or not according to the words extracted from said information indicated by said URL of said HTML information, when there is no matching URL in the inappropriate upper level URL list; and
an information presentation blocking unit for blocking presentation of said information indicated by said URL of said HTML information when said information indicated by said URL of said HTML information is judged as inappropriate by the automatic filtering.
5. A computer usable medium having computer readable program codes embodied therein for causing a computer to function as an automatic information filtering apparatus for identifying inappropriate information among various information provided through Internet and blocking presentation of identified inappropriate information, the computer readable program codes include:
a first computer readable program code for causing said computer to enter an HTML (HyperText Markup Language) information provided through the Internet;
a second computer readable program code for causing said computer to judge whether a URL (Uniform Resource Locator) of said HTML information entered from the Internet is a top page URL or not, the top page URL being a URL ending with a prescribed character string defined according to the URL hierarchical structure by which each URL is constructed;
a third computer readable program code for causing said computer to extract words appearing in information indicated by the top page URL and carry out an automatic filtering to judge whether said information indicated by the top page URL is inappropriate or not according to the words extracted from said information indicated by the top page URL, when said URL of said HTML information is the top page URL;
a fourth computer readable program code for causing said computer to register an upper level URL derived from the top page URL into an inappropriate upper level URL list and block presentation of said information indicated by the top page URL, when said information indicated by the top page URL is judged as inappropriate by the automatic filtering, the upper level URL being derived from the top page URL by keeping a character string constituting the top page URL only up to a rightmost slash;
a fifth computer readable program code for causing said computer to compare said URL of said HTML information with each URL registered in the inappropriate upper level URL list and judge whether there is any matching URL in the inappropriate upper level URL list when said URL of said HTML information is not the top page URL, and block presentation of information indicated by said URL of said HTML information when there is a matching URL in the inappropriate upper level URL list, the matching URL being one upper level URL whose character string is contained in said URL of said HTML information;
a sixth computer readable program code for causing said computer to extract words appearing in said information indicated by said URL of said HTML information, and carry out the automatic filtering to judge whether said information indicated by said URL of said HTML information is inappropriate or not according to the words extracted from said information indicated by said URL of said HTML information, when there is no matching URL in the inappropriate upper level URL list; and
a seventh computer readable program code for causing said computer to block presentation of said information indicated by said URL of said HTML information when said information indicated by said URL of said HTML information is judged as inappropriate by the automatic filtering.
Specification