TECHNIQUES FOR TOKENIZING URLS
First Claim
1. A method for tokenizing URLs, comprising:
tokenizing, based upon generic delimiters, URLs of each of a plurality of documents of a website into a plurality of components;
for each particular component of the plurality of components, locating website-specific delimiters in the particular component;
calculating a delimiter support threshold for each particular website-specific delimiter of located site-specific delimiters;
determining whether delimiter support for each particular website-specific delimiter is greater than a specified delimiter support threshold;
in response to determining that the site specific delimiter support for the particular website-specific delimiter is greater than the specified delimiter support threshold, tokenizing the particular component based upon the particular website-specific delimiter;
for each particular token of the particular component, calculating a token support threshold for the particular token;
determining whether token support for the particular token is greater than a specified token support threshold; and
in response to determining that the token support for the component token is greater than the specified token support threshold, using the particular token to generate a description of the website.
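The steps of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. Several things here are assumptions made for illustration only: the generic delimiter set (`/`, `?`, `&`, `=`), the definition of "support" as the fraction of the site's URLs in which a delimiter or token occurs, the fixed thresholds of 0.5, and the function names `split_components`, `support`, and `describe_site`. Only non-alphanumeric symbols are treated as candidate website-specific delimiters.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


def split_components(url):
    """Split a URL's path and query on an assumed set of generic delimiters."""
    parts = urlsplit(url)
    raw = parts.path + "?" + parts.query
    return [c for c in re.split("[/?&=]", raw) if c]


def support(items_per_url):
    """Assumed definition: fraction of URLs in which each item occurs."""
    counts = Counter()
    for items in items_per_url:
        counts.update(set(items))
    total = len(items_per_url)
    return {item: n / total for item, n in counts.items()}


def describe_site(urls, delimiter_threshold=0.5, token_threshold=0.5):
    """Return tokens whose support exceeds the token threshold."""
    comps_per_url = [split_components(u) for u in urls]

    # Candidate website-specific delimiters: non-alphanumeric symbols
    # found inside the components of each URL.
    candidates = [re.findall(r"[^A-Za-z0-9]", "".join(comps))
                  for comps in comps_per_url]
    delim_support = support(candidates)
    valid = sorted(d for d, s in delim_support.items()
                   if s > delimiter_threshold)
    pattern = "|".join(re.escape(d) for d in valid)

    # Tokenize each component on the delimiters that passed the threshold.
    tokens_per_url = []
    for comps in comps_per_url:
        tokens = []
        for comp in comps:
            pieces = re.split(pattern, comp) if pattern else [comp]
            tokens.extend(p for p in pieces if p)
        tokens_per_url.append(tokens)

    token_support = support(tokens_per_url)
    return sorted(t for t, s in token_support.items() if s > token_threshold)


urls = [
    "http://example.com/product/item_123-red.html",
    "http://example.com/product/item_456-blue.html",
    "http://example.com/product/item_789-red.html",
    "http://example.com/about/contact+us",
]
print(describe_site(urls))  # ['html', 'item', 'product']
```

In this hypothetical corpus, `_`, `-`, and `.` each occur in three of four URLs and pass the delimiter threshold, while `+` occurs in only one and does not, so `contact+us` is left unsplit. The token `red` has support of exactly 0.5 and is excluded because the comparison is strictly "greater than".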
Abstract
Techniques are described for tokenizing a corpus of URLs of web documents. URLs are first tokenized based upon specified generic delimiters to form components. The components are then tokenized using website-specific delimiters. A website-specific delimiter is any non-alphanumeric symbol, or a unit change, that is specific to a particular website. Support is calculated for the website-specific delimiters and for the tokens resulting from them. Website-specific delimiters and tokens whose support values are above a specified threshold value are valid. Tokenization may also be performed by generating a graph of the corpus of URLs of web documents. Each node of the graph represents a token and each edge represents a delimiter of the URLs. The graph is traversed and the support of each edge is compared to a specified threshold value. If the support of an edge of a node is greater than the threshold value, then the token corresponding to the node is valid.
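The abstract does not define "unit change". One plausible reading, assumed here purely for illustration, is a transition between a run of letters and a run of digits within a component. Under that assumption, splitting on unit changes might be sketched as follows; the function name `split_on_unit_change` is hypothetical, and any non-alphanumeric symbols in the input are simply discarded.

```python
import re

# Assumed interpretation: a "unit" is a maximal run of letters or of digits.
UNIT_PATTERN = re.compile(r"[A-Za-z]+|[0-9]+")


def split_on_unit_change(component):
    """Split a component wherever letters change to digits or vice versa."""
    return UNIT_PATTERN.findall(component)


print(split_on_unit_change("item123"))    # ['item', '123']
print(split_on_unit_change("page2of10"))  # ['page', '2', 'of', '10']
```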
25 Citations
24 Claims
1. A method for tokenizing URLs, comprising:
tokenizing, based upon generic delimiters, URLs of each of a plurality of documents of a website into a plurality of components;
for each particular component of the plurality of components, locating website-specific delimiters in the particular component;
calculating a delimiter support threshold for each particular website-specific delimiter of located site-specific delimiters;
determining whether delimiter support for each particular website-specific delimiter is greater than a specified delimiter support threshold;
in response to determining that the site specific delimiter support for the particular website-specific delimiter is greater than the specified delimiter support threshold, tokenizing the particular component based upon the particular website-specific delimiter;
for each particular token of the particular component, calculating a token support threshold for the particular token;
determining whether token support for the particular token is greater than a specified token support threshold; and
in response to determining that the token support for the component token is greater than the specified token support threshold, using the particular token to generate a description of the website.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A method of tokenizing URLs, comprising:
tokenizing, based upon generic delimiters and website-specific delimiters, URLs of each of a plurality of documents of a website into a plurality of components;
generating a graph wherein (a) each node of the graph represents components and (b) edges connecting the nodes represent delimiters;
associating a weight to each node and each edge;
for each particular node of the graph, traversing from the particular node to another node connected by an edge;
comparing the weight of the edge to a specified delimiter support threshold;
if the weight of the edge is greater than the specified delimiter support threshold, then including the node in a set of validated nodes;
if the weight of the edge is not greater than the specified delimiter support threshold, then traversing the graph until reaching a node where the number of incoming edges is equal to the number of nodes in a previous level; and
generating a description of the website based at least in part on the validated nodes.
Dependent claims: 10, 11, 12
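The graph-based method of claim 9 can be partly illustrated with a short Python sketch. It covers only graph construction, weighting, and the threshold comparison; the fallback traversal for edges that fail the threshold is omitted. The following are assumptions rather than details from the claim: URL paths are used as input, the delimiter set is `/`, `_`, and `-`, a node is a `(level, token)` pair, weights are occurrence counts across the corpus, and both endpoints of an edge that passes the threshold are treated as validated. The names `build_graph` and `validated_nodes` are hypothetical.

```python
import re
from collections import Counter


def build_graph(paths, delimiters="/_-"):
    """Build weighted nodes and edges from URL paths.

    Nodes are (level, token) pairs; edges are (source, delimiter, target).
    Weights count how many times each node or edge occurs in the corpus.
    """
    splitter = re.compile("([" + re.escape(delimiters) + "])")
    node_weight, edge_weight = Counter(), Counter()
    for path in paths:
        # With one capture group, split() alternates token, delimiter, token...
        pieces = splitter.split(path.strip("/"))
        tokens, delims = pieces[0::2], pieces[1::2]
        nodes = list(enumerate(tokens))
        node_weight.update(nodes)
        for src, delim, dst in zip(nodes, delims, nodes[1:]):
            edge_weight[(src, delim, dst)] += 1
    return node_weight, edge_weight


def validated_nodes(edge_weight, threshold):
    """Collect endpoints of every edge whose weight exceeds the threshold."""
    valid = set()
    for (src, _delim, dst), weight in edge_weight.items():
        if weight > threshold:
            valid.update((src, dst))
    return valid


paths = ["product/item_1", "product/item_2", "about/us"]
node_weight, edge_weight = build_graph(paths)
print(validated_nodes(edge_weight, threshold=1))
```

In this small hypothetical corpus only the edge from `product` to `item` across `/` occurs more than once, so those two nodes are the only ones validated at a threshold of 1.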
13. A computer-readable storage medium carrying one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to:
tokenize, based upon generic delimiters, URLs of each of a plurality of documents of a website into a plurality of components;
for each particular component of the plurality of components, locate website-specific delimiters in the particular component;
calculate a delimiter support threshold for each particular website-specific delimiter of located site-specific delimiters;
determine whether delimiter support for each particular website-specific delimiter is greater than a specified delimiter support threshold;
in response to determining that the site specific delimiter support for the particular website-specific delimiter is greater than the specified delimiter support threshold, tokenize the particular component based upon the particular website-specific delimiter;
for each particular token of the particular component, calculate a token support threshold for the particular token;
determine whether token support for the particular token is greater than a specified token support threshold; and
in response to determining that the token support for the component token is greater than the specified token support threshold, use the particular token to generate a description of the website.
Dependent claims: 14, 15, 16, 17, 18, 19, 20
21. A computer-readable storage medium carrying one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to:
tokenize, based upon generic delimiters and website-specific delimiters, URLs of each of a plurality of documents of a website into a plurality of components;
generate a graph wherein (a) each node of the graph represents components and (b) edges connecting the nodes represent delimiters;
associate a weight to each node and each edge;
for each particular node of the graph, traverse from the particular node to another node connected by an edge;
compare the weight of the edge to a specified delimiter support threshold;
if the weight of the edge is greater than the specified delimiter support threshold, then include the node in a set of validated nodes;
if the weight of the edge is not greater than the specified delimiter support threshold, then traverse the graph until reaching a node where the number of incoming edges is equal to the number of nodes in a previous level; and
generate a description of the website based at least in part on the validated nodes.
Dependent claims: 22, 23, 24
Specification