TECHNIQUES FOR KEYWORD EXTRACTION FROM URLS USING STATISTICAL ANALYSIS
First Claim
1. A method for post-tokenization processing, comprising:
generating, based upon tokenizations of a URL corpus, regular expressions for URLs in the URL corpus;
receiving a particular URL of a web document;
determining whether the particular URL corresponds to any of the regular expressions generated from the URL corpus;
if the particular URL does not correspond to any of the regular expressions generated from the URL corpus, then (a) tokenizing, based on delimiters and unit changes, the particular URL, and (b) storing each token of the particular URL as a keyword, thereby generating a first set of keywords;
if the particular URL corresponds to at least one of the regular expressions generated from the URL corpus, then (a) retrieving a regular expression associated with the URL that corresponds to the particular URL, and (b) extracting, based upon the regular expression, keywords from the particular URL, thereby generating a second set of keywords;
ranking, based upon an information extraction algorithm, keywords from one of the first set and the second set, thereby producing a ranked set; and
storing the ranked set.
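The branching recited in the claim can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration, not taken from the patent: the `CORPUS_PATTERNS` list stands in for regular expressions generated from a URL corpus, a "unit change" is read as a letter/digit switch or a lower-to-upper case boundary, and `rank` is a placeholder (frequency, then first appearance) for the unspecified information extraction algorithm.

```python
import re
from collections import Counter

# Hypothetical patterns standing in for regular expressions generated from a
# URL corpus; named groups mark the URL parts treated as keywords.
CORPUS_PATTERNS = [
    re.compile(r"https?://shop\.example\.com/(?P<category>[a-z]+)/(?P<item>[a-z]+)-\d+"),
]

DELIMITERS = re.compile(r"[^A-Za-z0-9]+")
# Assumed meaning of "unit change": a switch between letters and digits,
# or a lower-to-upper case boundary inside a delimiter-free chunk.
UNITS = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def tokenize(url):
    """Fallback branch: split on delimiters, then on unit changes."""
    tokens = []
    for chunk in DELIMITERS.split(url):
        tokens.extend(unit.lower() for unit in UNITS.findall(chunk))
    return tokens


def extract_keywords(url):
    """Use a matching corpus pattern if there is one, else tokenize."""
    for pattern in CORPUS_PATTERNS:
        match = pattern.fullmatch(url)
        if match:
            return [value.lower() for value in match.groupdict().values()]
    return tokenize(url)


def rank(keywords):
    """Placeholder ranking: most frequent first, ties by first appearance."""
    counts = Counter(keywords)
    return sorted(counts, key=lambda k: (-counts[k], keywords.index(k)))
```

Calling `rank(extract_keywords(url))` runs one URL through the whole flow. A URL that matches a stored pattern yields only the captured groups, while any other URL falls back to exhaustive tokenization.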
Abstract
Techniques are described for keyword extraction from URLs using regular expression patterns and keyword ranking. Tokenization of URLs also generates regular expressions of URLs from a website. The regular expressions are stored in the form of any type of indexing structure. When a new URL is received, the URL is examined to determine whether the URL is from a website that has previously been tokenized. If the URL is not from such a website, then the URL is tokenized using every delimiter and unit change to extract keywords. If the URL is from a website previously processed, the corresponding regular expression is used to extract keywords from the URL. The keywords extracted from the URLs are then ranked based on any ranking methodology for better relevance and performance.
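The abstract leaves the indexing structure open ("any type of indexing structure"). As one possible reading, the sketch below keys stored patterns by host name so that a new URL is first checked against its own website's patterns. The `PatternIndex` class, its method names, and the example pattern are hypothetical and chosen only for illustration.

```python
import re
from urllib.parse import urlsplit


class PatternIndex:
    """Hypothetical index: host name -> compiled URL patterns for that site."""

    def __init__(self):
        self._by_host = {}

    def add(self, host, pattern):
        # Store the compiled pattern under a lower-cased host key.
        self._by_host.setdefault(host.lower(), []).append(re.compile(pattern))

    def lookup(self, url):
        """Return the first stored pattern that fully matches the URL, else None."""
        host = (urlsplit(url).hostname or "").lower()
        for pattern in self._by_host.get(host, []):
            if pattern.fullmatch(url):
                return pattern
        return None


index = PatternIndex()
index.add(
    "news.example.com",
    r"https?://news\.example\.com/(?P<year>\d{4})/(?P<section>[a-z]+)/(?P<slug>[a-z-]+)",
)
```

A `None` result from `lookup` corresponds to the case where the URL is not from a previously processed website (or fits none of its patterns), which is where the exhaustive delimiter and unit-change tokenization would apply. A dictionary is used here only for simplicity; a trie or database table would serve equally well.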
76 Citations
22 Claims
1. A method for post-tokenization processing, comprising:
generating, based upon tokenizations of a URL corpus, regular expressions for URLs in the URL corpus;
receiving a particular URL of a web document;
determining whether the particular URL corresponds to any of the regular expressions generated from the URL corpus;
if the particular URL does not correspond to any of the regular expressions generated from the URL corpus, then (a) tokenizing, based on delimiters and unit changes, the particular URL, and (b) storing each token of the particular URL as a keyword, thereby generating a first set of keywords;
if the particular URL corresponds to at least one of the regular expressions generated from the URL corpus, then (a) retrieving a regular expression associated with the URL that corresponds to the particular URL, and (b) extracting, based upon the regular expression, keywords from the particular URL, thereby generating a second set of keywords;
ranking, based upon an information extraction algorithm, keywords from one of the first set and the second set, thereby producing a ranked set; and
storing the ranked set.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A computer-readable storage medium carrying one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to:
generate, based upon tokenizations of a URL corpus, regular expressions for URLs in the URL corpus;
receive a particular URL of a web document;
determine whether the particular URL corresponds to any of the regular expressions generated from the URL corpus;
if the particular URL does not correspond to any of the regular expressions generated from the URL corpus, then (a) tokenize, based on delimiters and unit changes, the particular URL, and (b) store each token of the particular URL as a keyword, thereby generating a first set of keywords;
if the particular URL corresponds to at least one of the regular expressions generated from the URL corpus, then (a) retrieve a regular expression associated with the URL that corresponds to the particular URL, and (b) extract, based upon the regular expression, keywords from the particular URL, thereby generating a second set of keywords;
rank, based upon an information extraction algorithm, keywords from one of the first set and the second set, thereby producing a ranked set; and
store the ranked set.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
Specification