
TECHNIQUES FOR KEYWORD EXTRACTION FROM URLS USING STATISTICAL ANALYSIS

  • US 20090089278A1
  • Filed: 11/08/2007
  • Published: 04/02/2009
  • Est. Priority Date: 09/27/2007
  • Status: Abandoned Application
First Claim

1. A method for post-tokenization processing, comprising:

    generating, based upon tokenizations of a URL corpus, regular expressions for URLs in the URL corpus;

    receiving a particular URL of a web document;

    determining whether the particular URL corresponds to any of the regular expressions generated from the URL corpus;

    if the particular URL does not correspond to any of the regular expressions generated from the URL corpus, then (a) tokenizing, based on delimiters and unit changes, the particular URL, and (b) storing each token of the particular URL as a keyword, thereby generating a first set of keywords;

    if the particular URL corresponds to at least one of the regular expressions generated from the URL corpus, then (a) retrieving a regular expression associated with the URL that corresponds to the particular URL, and (b) extracting, based upon the regular expression, keywords from the particular URL, thereby generating a second set of keywords;

    ranking, based upon an information extraction algorithm, keywords from one of the first set and the second set, thereby producing a ranked set; and

    storing the ranked set.
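
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the applicant's implementation, and the claim leaves several details open, so the following are assumptions made only for illustration: delimiters are taken to be any non-alphanumeric character, a "unit change" is taken to be a switch between letters and digits, regular expressions are generated by grouping corpus URLs by host and path depth and turning varying path positions into capture groups, and the unspecified "information extraction algorithm" is replaced by an inverse-document-frequency score. All function names (`tokenize`, `generate_patterns`, `extract_keywords`, `rank_keywords`) are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from urllib.parse import urlsplit

# Assumption: any non-alphanumeric character is a delimiter, and a "unit
# change" is a switch between a run of letters and a run of digits.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+")
WILDCARD = "([^/]+)"


def tokenize(text):
    """Split text into lowercase letter-only and digit-only tokens."""
    return [tok.lower() for tok in TOKEN_RE.findall(text)]


def split_url(url):
    """Return (host, non-empty path segments) for a URL."""
    parts = urlsplit(url)
    return parts.netloc.lower(), [s for s in parts.path.split("/") if s]


def generate_patterns(corpus):
    """Toy generator (assumption): one regex per (host, path depth) group.

    Path positions with a single value across the group stay literal;
    positions whose value varies become capture groups.
    """
    groups = defaultdict(list)
    for url in corpus:
        host, segments = split_url(url)
        groups[(host, len(segments))].append(segments)

    patterns = []
    for (host, depth), rows in groups.items():
        if depth == 0 or len(rows) < 2:
            continue  # too little evidence to generalize
        pieces = [
            re.escape(column[0]) if len(set(column)) == 1 else WILDCARD
            for column in zip(*rows)
        ]
        if WILDCARD in pieces:
            patterns.append(re.compile(re.escape(host) + "/" + "/".join(pieces)))
    return patterns


def extract_keywords(url, patterns):
    """Use a matching regex if one exists; otherwise tokenize the whole URL."""
    host, segments = split_url(url)
    target = host + "/" + "/".join(segments)
    for pattern in patterns:
        match = pattern.fullmatch(target)
        if match:
            # "Second set": tokens taken from the captured (variable) parts.
            return [tok for group in match.groups() for tok in tokenize(group)]
    # "First set": every token of the URL is stored as a keyword.
    return tokenize(url)


def rank_keywords(keywords, corpus):
    """Stand-in ranking (assumption): rarer tokens in the corpus rank higher."""
    doc_freq = Counter()
    for url in corpus:
        doc_freq.update(set(tokenize(url)))
    n = len(corpus)

    def score(tok):
        return math.log((1 + n) / (1 + doc_freq[tok]))

    unique = list(dict.fromkeys(keywords))  # drop duplicates, keep order
    return sorted(unique, key=lambda tok: (-score(tok), tok))


# Hypothetical corpus and URLs, used only to exercise the sketch.
corpus = [
    "http://example.com/news/2007/yahoo-search",
    "http://example.com/news/2008/patent-filing",
    "http://example.com/news/2008/url-keywords",
]
patterns = generate_patterns(corpus)
matched = extract_keywords("http://example.com/news/2009/keyword-extraction", patterns)
unmatched = extract_keywords("http://other.org/a1b", patterns)
ranked = rank_keywords(unmatched, corpus)  # the "ranked set" to be stored
```

In this sketch the first URL matches the generated pattern, so only its variable path parts contribute keywords, while the second URL matches nothing and falls back to plain tokenization. The final storing step is left to the caller, since the claim does not say where the ranked set is stored.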
