System and method for identifying website verticals
Abstract
Systems and methods for the categorization of websites are presented. A website is categorized using one or a combination of its domain name and its web page content. The domain name is tokenized, and the tokens compared to categories in a category structure to determine probabilities that the token belongs to each category. Combinations of tokens are similarly compared to the categories. A category may be determined with reference to a vector space in which a training set of websites having known categories is converted according to a methodology into reference vectors containing keyword frequencies. A target website is converted to a target vector using the same methodology, and a distance score of the target vector to each reference vector is calculated. The website represented by the target vector is assigned the category of the reference vector having the lowest distance score.
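The distance-score step described in the abstract can be sketched as a nearest-reference lookup. This is a minimal illustration only: the abstract does not name a distance metric, so Euclidean distance is an assumption here, and the function names, category labels, and vectors are hypothetical.

```python
import math


def distance_score(target, reference):
    """Euclidean distance between two equal-length frequency vectors.

    Assumption: the source does not specify the metric; Euclidean
    distance is used purely for illustration.
    """
    return math.sqrt(sum((t - r) ** 2 for t, r in zip(target, reference)))


def categorize(target_vector, references):
    """Return the categories of the reference vector with the lowest score.

    `references` is a list of (vector, categories) pairs built from a
    training set of websites with known categories.
    """
    best_vector, best_categories = min(
        references, key=lambda ref: distance_score(target_vector, ref[0])
    )
    return best_categories


# Hypothetical reference vectors over a five-keyword vocabulary.
references = [
    ([4, 2, 1, 0, 0], ["Restaurants"]),
    ([0, 0, 0, 5, 3], ["Financial Services"]),
]

assigned = categorize([3, 1, 1, 0, 0], references)
print(assigned)  # ['Restaurants']
```

The target vector sits at distance sqrt(2) from the first reference and sqrt(45) from the second, so the first reference's categories are assigned.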
20 Claims
1. A method, comprising:
receiving, by at least one server communicatively coupled to a network, a list of a plurality of first keywords, the plurality of first keywords obtained by scraping each web page of a plurality of web pages in a target website, each of the plurality of web pages having at least one of the first keywords obtained therefrom;
converting, by the at least one server, the list into a target vector representing the target website, the target vector comprising a plurality of elements each associated with a corresponding second keyword of a plurality of second keywords, the plurality of second keywords being selected from a corpus of websites, by:
counting the number of times each second keyword of the plurality of second keywords appears in the list to produce a corresponding frequency of appearance of each second keyword in the target website; and
storing, in each element of the plurality of elements, the corresponding frequency of appearance of the corresponding second keyword;
comparing, by the at least one server, the target vector to a plurality of reference vectors each being assigned one or more categories of a category structure; and
assigning, by the at least one server, the assigned one or more categories of the closest matching reference vector to the target website.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
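The converting step of claim 1 (counting each second keyword in the scraped list and storing the count in the matching vector element) might look like the following. It is a sketch under stated assumptions: the function name, the vocabulary, and the scraped keywords are all hypothetical, and the claim does not prescribe any particular data structure.

```python
from collections import Counter


def build_target_vector(scraped_keywords, vocabulary):
    """Convert a scraped keyword list into a frequency vector.

    `scraped_keywords` plays the role of the "first keywords" gathered
    from the target website's pages; `vocabulary` plays the role of the
    "second keywords" selected from a corpus of websites.
    """
    counts = Counter(scraped_keywords)
    # One element per vocabulary keyword, in vocabulary order. Keywords
    # that never appear in the scraped list get a frequency of 0, and
    # scraped keywords outside the vocabulary are ignored.
    return [counts[keyword] for keyword in vocabulary]


# Hypothetical vocabulary and scraped keywords.
vocabulary = ["pizza", "menu", "delivery", "mortgage", "loan"]
scraped_keywords = ["pizza", "menu", "pizza", "delivery", "contact", "pizza"]

target_vector = build_target_vector(scraped_keywords, vocabulary)
print(target_vector)  # [3, 1, 1, 0, 0]
```

Fixing the element order to the vocabulary order is what makes the target vector directly comparable to reference vectors built with the same methodology.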
14. A system, comprising:
at least one server computer in communication with a network, the at least one server computer including a processor configured to:
receive a list of a plurality of first keywords each collected from one of a plurality of web pages of a target website, each of the plurality of web pages having at least one of the first keywords collected therefrom;
create a target vector representing the target website, the target vector comprising a plurality of elements each signifying a frequency of appearance of a corresponding second keyword of a plurality of second keywords within the target website, the plurality of second keywords being selected from a corpus of websites;
determine, for each second keyword of the plurality of second keywords, a corresponding count of the number of times the second keyword appears in the list;
determine, for each element of the plurality of elements, a corresponding value based on the corresponding count of the corresponding second keyword;
compare the target vector to a plurality of reference vectors each being assigned one or more categories of a category structure; and
assign the assigned one or more categories of the closest matching reference vector to the target website.
Dependent claims: 15, 16, 17, 18, 19, 20
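Claim 14 only requires each element's value to be "based on" the corresponding count, which leaves room for something other than the raw count. One possible reading, shown purely as an assumption, is a relative frequency (each count divided by the total of all vocabulary counts), which keeps large and small websites on a comparable scale. The function name and data below are hypothetical.

```python
from collections import Counter


def element_values(keyword_list, vocabulary):
    """Derive vector element values from per-keyword counts.

    Assumption: values are relative frequencies. The claim does not
    specify how a value is derived from its count.
    """
    counts = Counter(keyword_list)
    raw = [counts[keyword] for keyword in vocabulary]
    total = sum(raw)
    if total == 0:
        # No vocabulary keyword appeared; avoid dividing by zero.
        return [0.0] * len(vocabulary)
    return [count / total for count in raw]


# Hypothetical vocabulary and collected keywords.
vocabulary = ["pizza", "mortgage", "loan"]
keyword_list = ["loan", "mortgage", "loan", "rates"]

values = element_values(keyword_list, vocabulary)
print(values)
```

Here the raw counts are 0, 1, and 2, so the resulting values are 0, one third, and two thirds.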
Specification