Method and system for efficient and exhaustive URL categorization
First Claim
1. A method for categorizing URLs (Uniform Resource Locators) of web pages accessed by users over an IP (Internet Protocol) based data network, the method comprising:
collecting by means of at least one monitoring probe real time data from IP data traffic occurring on the IP based data network;
extracting from said collected real time data parameters related to a web page, said parameters including an URL of the web page;
processing said URL with a rule based categorization engine, to associate a matching category to the URL of said web page, the matching category being inferred from a pre-defined list of categories;
when no matching category is inferred, transferring said URL of said web page to a semantic based categorization engine; and
processing said transferred URL by the semantic based categorization engine, said processing consisting in;
extracting textual content from content of said web page associated to said URL,
performing a semantic analysis of said textual content, and
associating a matching category to the transferred URL of the web page based on the semantic analysis of the textual content extracted from the web page, the matching category being inferred from a pre-defined list of categories,
wherein the URLs for which no matching category has been inferred by the rule based categorization engine over a determined period of time are memorized, wherein only the N URLs having the highest occurrence for which no matching category has been inferred by the rule based categorization engine over the determined period of time are transferred to the semantic based categorization engine, and wherein N is a pre-defined number of URLs.
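The interplay between the rule based engine and the top-N selection recited in the wherein clauses can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the rule table `RULES`, the host-suffix matching in `rule_based_category`, and the `UncategorizedTracker` class are all hypothetical names introduced here, and the semantic based engine itself is left out.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical rule table mapping a host suffix to a category.
# The claim only requires "a rule based categorization engine";
# host-suffix matching is an assumption made for this sketch.
RULES = {"news.example": "news", "shop.example": "shopping"}


def rule_based_category(url):
    """Return the category of the first matching host rule, else None."""
    host = urlsplit(url).hostname or ""
    for suffix, category in RULES.items():
        if host == suffix or host.endswith("." + suffix):
            return category
    return None


class UncategorizedTracker:
    """Memorizes URLs the rule engine could not categorize in one period."""

    def __init__(self, top_n):
        self.top_n = top_n        # N, the pre-defined number of URLs
        self.misses = Counter()   # occurrence count per uncategorized URL

    def observe(self, url):
        """Categorize one observed URL; count it if no rule matched."""
        category = rule_based_category(url)
        if category is None:
            self.misses[url] += 1
        return category

    def end_of_period(self):
        """Return the N most frequent uncategorized URLs, then reset."""
        top = [url for url, _ in self.misses.most_common(self.top_n)]
        self.misses.clear()
        return top
```

In this sketch, only the list returned by `end_of_period()` would be handed to the semantic based engine, which bounds the cost of the more expensive analysis to N URLs per period.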
2 Assignments
0 Petitions
Abstract
The present method and system relate to categorizing URLs (Uniform Resource Locators) of web pages accessed by multiple users over an IP (Internet Protocol) based data network. The method and system collect real time data from IP data traffic occurring on the IP based data network, and extract parameters from the collected real time data, the parameters including an URL of a web page. The URL is processed by a rule based categorization engine, to associate a matching category to the URL of the web page. When no matching category is inferred, the URL is transferred to a semantic based categorization engine. A matching category is associated to the transferred URL by the semantic based categorization engine, based on a semantic analysis of the textual content extracted from the web page associated to the URL.
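The semantic based stage described above (extract the textual content of the page, analyze it, associate a category) can be sketched as follows. The abstract does not say how the semantic analysis is performed, so this sketch substitutes a crude keyword-overlap score as a stand-in. The names `TextExtractor`, `extract_text`, `CATEGORY_KEYWORDS` and `semantic_category` are hypothetical, and the keyword sets are invented for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def extract_text(html):
    """Return the page text with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Hypothetical pre-defined list of categories with illustrative keywords.
CATEGORY_KEYWORDS = {
    "sports": {"match", "league", "score", "team"},
    "finance": {"stock", "market", "bank", "loan"},
}


def semantic_category(html):
    """Pick the category whose keywords occur most often, else None.

    Keyword counting is an assumption standing in for the semantic
    analysis, which the source does not detail.
    """
    words = re.findall(r"[a-z]+", extract_text(html).lower())
    scores = {
        category: sum(word in keywords for word in words)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Skipping script and style bodies keeps identifiers in embedded code from being counted as page text, which matters because the category is meant to reflect what the page says rather than how it is built.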
17 Citations
14 Claims
1. A method for categorizing URLs (Uniform Resource Locators) of web pages accessed by users over an IP (Internet Protocol) based data network, the method comprising:
collecting by means of at least one monitoring probe real time data from IP data traffic occurring on the IP based data network;
extracting from said collected real time data parameters related to a web page, said parameters including an URL of the web page;
processing said URL with a rule based categorization engine, to associate a matching category to the URL of said web page, the matching category being inferred from a pre-defined list of categories;
when no matching category is inferred, transferring said URL of said web page to a semantic based categorization engine; and
processing said transferred URL by the semantic based categorization engine, said processing consisting in;
extracting textual content from content of said web page associated to said URL,
performing a semantic analysis of said textual content, and
associating a matching category to the transferred URL of the web page based on the semantic analysis of the textual content extracted from the web page, the matching category being inferred from a pre-defined list of categories,
wherein the URLs for which no matching category has been inferred by the rule based categorization engine over a determined period of time are memorized, wherein only the N URLs having the highest occurrence for which no matching category has been inferred by the rule based categorization engine over the determined period of time are transferred to the semantic based categorization engine, and wherein N is a pre-defined number of URLs.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system for categorizing URLs of web pages accessed by users over an IP based data network, the system comprising:
at least one monitoring probe for collecting real time data from IP data traffic occurring on the IP based data network, and for extracting from said collected real time data parameters related to a web page, said parameters including an URL of the web page;
a computer hardware processor for executing instructions implementing;
a rule based categorization engine for processing said URL, to associate a matching category to the URL of said web page, the matching category being inferred from a pre-defined list of categories;
a semantic based categorization engine for further processing said URL of said web page, when no matching category is inferred by the rule based categorization engine, the further processing consisting in;
extracting textual content from content of said web page associated to said URL,
performing a semantic analysis of said textual content, and
associating a matching category to the URL of the web page, based on the semantic analysis of the textual content extracted from the web page, the matching category being inferred from a pre-defined list of categories,
wherein the URLs for which no matching category has been inferred by the rule based categorization engine over a determined period of time are memorized, wherein only the N URLs having the highest occurrence for which no matching category has been inferred by the rule based categorization engine over the determined period of time are transferred to the semantic based categorization engine, and wherein N is a pre-defined number of URLs.
Dependent claims: 9, 10, 11, 12, 13, 14
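The probe's extraction of a URL from collected traffic can be sketched for the simplest case. The claim does not specify how the probe parses traffic, so this sketch assumes an unencrypted HTTP/1.x request whose header section arrives in a single payload. The function name `url_from_http_request` is hypothetical.

```python
def url_from_http_request(payload):
    """Rebuild the URL from a plaintext HTTP/1.x request, else None.

    Assumes the full header section is contained in `payload` (bytes).
    """
    head = payload.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")

    # Request line: METHOD SP request-target SP HTTP-version
    parts = lines[0].split(" ")
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        return None
    target = parts[1]

    # Absolute-form targets already carry the full URL.
    if target.startswith(("http://", "https://")):
        return target

    # Origin-form targets need the Host header to complete the URL.
    host = None
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            host = value.strip()
            break
    if host is None or not target.startswith("/"):
        return None
    return "http://" + host + target
```

Payloads that do not look like an HTTP request, such as the start of a TLS handshake, yield `None` in this sketch and would need a different extraction path.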
Specification