System and method for website categorization
3 Assignments
0 Petitions
Abstract
Systems and methods for the categorization of websites are presented. A website is categorized using one or a combination of its domain name and its web page content. The domain name is tokenized, and the tokens compared to categories in a category structure to determine probabilities that the token belongs to each category. Combinations of tokens are similarly compared to the categories. A category may be determined with reference to a vector space in which a training set of websites having known categories is converted according to a methodology into reference vectors containing keyword frequencies. A target website is converted to a target vector using the same methodology, and a distance score of the target vector to each reference vector is calculated. The website represented by the target vector is assigned the category of the reference vector having the lowest distance score.
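The vector-space step described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name a distance measure or a keyword vocabulary, so the Euclidean distance, the `KEYWORDS` list, the helper names, and the sample training texts are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical keyword vocabulary; the source does not specify one.
KEYWORDS = ["pizza", "menu", "loan", "bank", "recipe"]


def to_vector(text, keywords=KEYWORDS):
    """Convert page text into a vector of relative keyword frequencies."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty text
    return [counts[k] / total for k in keywords]


def euclidean(a, b):
    """Distance score between two vectors (assumed metric, not from the source)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def categorize(target_text, training_set):
    """Return the category of the reference vector closest to the target.

    training_set is a list of (text, known_category) pairs; each text is
    converted with the same methodology as the target.
    """
    target = to_vector(target_text)
    _, category = min(
        training_set,
        key=lambda item: euclidean(target, to_vector(item[0])),
    )
    return category


# Hypothetical training set of websites with known categories.
training_set = [
    ("pizza menu pizza recipe", "restaurants"),
    ("bank loan bank loan", "finance"),
]

result = categorize("our menu has pizza and more pizza", training_set)
print(result)
```

Because the target and the reference texts pass through the same `to_vector` function, their distance scores are directly comparable, which is the property the abstract relies on.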
17 Citations
18 Claims
1. A method, comprising:
receiving, by at least one server communicatively coupled to a network, one or more tokens together forming all or part of a string comprising a domain name;
comparing, by the at least one server, each of the one or more tokens to each of a plurality of categories in a category structure to determine, for each pairing of one of the tokens with one of the categories, a token probability that the token belongs to the category;
for one or more of the token probabilities, increasing or reducing the token probability according to a frequency at which the category associated with the token probability is selected as a correct category or declined as an incorrect category for the token associated with the token probability, the frequency identified from a plurality of domain name searches previously processed by a first of the at least one server;
calculating, by the at least one server from the token probabilities, a final probability of the string belonging to each category; and
categorizing, by the at least one server, the token in the category having the highest final probability.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A system, comprising:
at least one server computer in communication with a network, the at least one server computer including a processor configured to:
receive one or more tokens together forming all or part of a string comprising a domain name;
compare each of the one or more tokens to each of a plurality of categories in a category structure to determine, for each pairing of one of the tokens with one of the categories, a token probability that the token belongs to the category;
for one or more of the token probabilities, increase or reduce the token probability according to a first frequency at which the category associated with the token probability is selected as a correct category or declined as an incorrect category for the token associated with the token probability, the first frequency identified from a plurality of domain name searches previously processed by a first of the at least one server;
calculate, from the token probabilities for each of the plurality of categories, a final probability of the domain name belonging to the category; and
categorize the domain name in the category having the highest final probability.
Dependent claims: 12, 13, 14, 15, 16, 17, 18
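The steps recited in the independent claims (per-token probabilities, adjustment by feedback frequency, a final per-category probability, and selection of the highest) can be illustrated with a minimal sketch. The claims do not specify an adjustment formula or how token probabilities are combined, so the linear scaling in `adjust`, the averaging in `categorize_domain`, the `weight` parameter, and all sample numbers below are assumptions made for illustration only.

```python
def adjust(prob, selected, declined, weight=0.5):
    """Raise or lower a token probability based on prior search feedback.

    selected / declined count how often the category was chosen as correct
    or rejected as incorrect for this token in earlier searches.
    The scaling rule is an assumption, not taken from the claims.
    """
    total = selected + declined
    if total == 0:
        return prob  # no feedback recorded, leave unchanged
    net = (selected - declined) / total  # ranges from -1 to 1
    return min(1.0, max(0.0, prob * (1 + weight * net)))


def categorize_domain(tokens, base_probs, feedback, categories):
    """Return (best_category, final_probabilities) for a tokenized domain name."""
    final = {}
    for cat in categories:
        adjusted = []
        for tok in tokens:
            p = base_probs.get((tok, cat), 0.0)
            sel, dec = feedback.get((tok, cat), (0, 0))
            adjusted.append(adjust(p, sel, dec))
        # Assumed combination rule: mean of the adjusted token probabilities.
        final[cat] = sum(adjusted) / len(adjusted) if adjusted else 0.0
    best = max(final, key=final.get)
    return best, final


# Hypothetical data for a domain name tokenized as "joes" + "pizza".
tokens = ["joes", "pizza"]
categories = ["restaurants", "finance"]
base_probs = {
    ("joes", "restaurants"): 0.3,
    ("joes", "finance"): 0.3,
    ("pizza", "restaurants"): 0.8,
    ("pizza", "finance"): 0.1,
}
# (times selected as correct, times declined as incorrect)
feedback = {
    ("pizza", "restaurants"): (9, 1),
    ("joes", "finance"): (1, 3),
}

best, final = categorize_domain(tokens, base_probs, feedback, categories)
print(best, final)
```

Keeping the feedback counts separate from the base probabilities means the adjustment can be recomputed as more searches are processed, without altering the underlying token-to-category comparison.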
Specification