TECHNIQUES FOR KEYWORD EXTRACTION FROM URLS USING STATISTICAL ANALYSIS

US 20090089278A1
Filed: 11/08/2007
Published: 04/02/2009
Est. Priority Date: 09/27/2007
Status: Abandoned Application

First Claim

1. A method for post-tokenization processing, comprising:

generating, based upon tokenizations of a URL corpus, regular expressions for URLs in the URL corpus;

receiving a particular URL of a web document;

determining whether the particular URL corresponds to any of the regular expressions generated from the URL corpus;

if the particular URL does not correspond to any of the regular expressions generated from the URL corpus, then (a) tokenizing, based on delimiters and unit changes, the particular URL, and (b) storing each token of the particular URL as a keyword, thereby generating a first set of keywords;

if the particular URL corresponds to at least one of the regular expressions generated from the URL corpus, then (a) retrieving a regular expression associated with the URL that corresponds to the particular URL, and (b) extracting, based upon the regular expression, keywords from the particular URL, thereby generating a second set of keywords;

ranking, based upon an information extraction algorithm, keywords from one of the first set and the second set, thereby producing a ranked set; and

storing the ranked set.
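
The fallback branch of the claim (tokenizing on delimiters and unit changes) can be illustrated with a minimal sketch. It assumes the four delimiters named in claim 2 are the complete set and that character types are letter, number, and symbol as in claim 4; the function names `char_type` and `tokenize` are hypothetical and do not come from the application.

```python
import re

def char_type(ch: str) -> str:
    """Classify a character as a letter, a number, or a symbol."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "number"
    return "symbol"

def tokenize(url: str) -> list[str]:
    """Split on delimiters, then split each piece at every character-type change."""
    tokens = []
    # Assumption: "/", "?", "&" and "=" are the only delimiters.
    for piece in re.split(r"[/?&=]", url):
        current = ""
        for ch in piece:
            # A unit change: this character's type differs from the previous one.
            if current and char_type(ch) != char_type(current[-1]):
                tokens.append(current)
                current = ""
            current += ch
        if current:
            tokens.append(current)
    return tokens

print(tokenize("example.com/shop/item42?id=7&color=red"))
```

In this sketch symbol runs such as "." are kept as tokens, since the claim stores each token as a keyword; a practical system would probably filter them before ranking.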

3 Assignments

0 Petitions

Abstract

Techniques are described for keyword extraction from URLs using regular expression patterns and keyword ranking. Tokenization of URLs also generates regular expressions of URLs from a website. The regular expressions are stored in the form of any type of indexing structure. When a new URL is received, the URL is examined to determine whether the URL is from a website that has previously been tokenized. If the URL is not from such a website, then the URL is tokenized using every delimiter and unit change to extract keywords. If the URL is from a website previously processed, the corresponding regular expression is used to extract keywords from the URL. The keywords extracted from the URLs are then ranked based on any ranking methodology for better relevance and performance.
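
The two-branch flow described in the abstract might be sketched as follows. This is an illustration under stated assumptions, not the applicants' implementation: the `SITE_PATTERNS` table, its hand-written pattern, the host `shop.example.com`, and the function names are all hypothetical, and the fallback tokenizer is simplified to keep only letter and digit runs.

```python
import re
from urllib.parse import urlsplit

# Hypothetical store of per-site patterns. In the described technique these
# would be generated from a tokenized URL corpus; here one is written by hand.
SITE_PATTERNS = {
    "shop.example.com": re.compile(r"^/(?P<category>[a-z]+)/(?P<product>[a-z-]+)/\d+$"),
}

def fallback_tokens(url: str) -> list[str]:
    """Simplified fallback: keep every run of letters and every run of digits."""
    return re.findall(r"[A-Za-z]+|\d+", url)

def extract_keywords(url: str) -> list[str]:
    """Use a stored pattern when the URL matches one, otherwise tokenize fully."""
    parts = urlsplit(url)
    pattern = SITE_PATTERNS.get(parts.netloc)
    match = pattern.match(parts.path) if pattern else None
    if match:
        # Known site: only the captured groups are treated as keywords.
        return list(match.groups())
    # Unknown site, or no stored pattern matches: tokenize the whole URL.
    return fallback_tokens(url)

print(extract_keywords("http://shop.example.com/shoes/running-shoes/1234"))
print(extract_keywords("http://other.example.org/a1/b?x=2"))
```

The pattern branch skips the numeric product identifier entirely, which shows why a site-specific regular expression can yield cleaner keywords than exhaustive tokenization.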

76 Citations

22 Claims

1. A method for post-tokenization processing, comprising:
- generating, based upon tokenizations of a URL corpus, regular expressions for URLs in the URL corpus;
  
  receiving a particular URL of a web document;
  
  determining whether the particular URL corresponds to any of the regular expressions generated from the URL corpus;
  
  if the particular URL does not correspond to any of the regular expressions generated from the URL corpus, then (a) tokenizing, based on delimiters and unit changes, the particular URL, and (b) storing each token of the particular URL as a keyword, thereby generating a first set of keywords;
  
  if the particular URL corresponds to at least one of the regular expressions generated from the URL corpus, then (a) retrieving a regular expression associated with the URL that corresponds to the particular URL, and (b) extracting, based upon the regular expression, keywords from the particular URL, thereby generating a second set of keywords;
  
  ranking, based upon an information extraction algorithm, keywords from one of the first set and the second set, thereby producing a ranked set; and
  
  storing the ranked set.
- - 2. The method of claim 1, wherein delimiters comprise “/,” “?,” “&,” and “=”.
  - 3. The method of claim 1, wherein unit changes comprises identifying, in the URL, a change of one particular type of character to another type of character, not of the particular type.
  - 4. The method of claim 3, wherein types of characters comprise a number, letter or symbol.
  - 5. The method of claim 1, wherein information extraction algorithms comprise TF-IDF.
  - 6. The method of claim 1, wherein information extraction algorithms comprise dictionaries.
  - 7. The method of claim 1, wherein information extraction algorithms comprise mutual information.
  - 8. The method of claim 1, wherein information extraction algorithms are based on measures from information theory.
  - 9. The method of claim 1, wherein regular expressions are stored in an indexing structure.
  - 10. The method of claim 1, wherein regular expressions are stored in the form of any of: a suffix tree, a trie, or a prefix tree.
  - 11. The method of claim 1, wherein regular expressions are stored in the form of a custom index structure.
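
Claim 5 names TF-IDF as one possible ranking measure without fixing a formula. The sketch below is one plausible reading, assuming a smoothed inverse document frequency and a corpus represented as a list of keyword lists (one per URL); the function name `rank_by_tfidf` and the sample data are hypothetical.

```python
import math
from collections import Counter

def rank_by_tfidf(keywords: list[str], corpus: list[list[str]]) -> list[str]:
    """Rank one URL's keywords by TF-IDF against a corpus of keyword lists."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each term once per document
    term_freq = Counter(keywords)
    scores = {}
    for term, count in term_freq.items():
        # Assumption: smoothed IDF, so terms absent from the corpus do not
        # cause a division by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        scores[term] = (count / len(keywords)) * idf
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))

corpus = [["shop", "shoes"], ["shop", "hats"], ["shop", "bags"]]
print(rank_by_tfidf(["shop", "shoes", "running"], corpus))
```

Terms that appear in every URL of a site, such as "shop" here, sink to the bottom of the ranking, while rarer terms rise.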

12. A computer-readable storage medium carrying one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to:
- generate, based upon tokenizations of a URL corpus, regular expressions for URLs in the URL corpus;
  
  receive a particular URL of a web document;
  
  determine whether the particular URL corresponds to any of the regular expressions generated from the URL corpus;
  
  if the particular URL does not correspond to any of the regular expressions generated from the URL corpus, then (a) tokenize, based on delimiters and unit changes, the particular URL, and (b) store each token of the particular URL as a keyword, thereby generating a first set of keywords;
  
  if the particular URL corresponds to at least one of the regular expressions generated from the URL corpus, then (a) retrieve a regular expression associated with the URL that corresponds to the particular URL, and (b) extract, based upon the regular expression, keywords from the particular URL, thereby generating a second set of keywords;
  
  rank, based upon an information extraction algorithm, keywords from one of the first set and the second set, thereby producing a ranked set; and
  
  store the ranked set.
- - 13. The computer-readable storage medium of claim 12, wherein delimiters comprise “/,” “?,” “&,” and “=”.
  - 14. The computer-readable storage medium of claim 12, wherein unit changes comprises identifying, in the URL, a change of one particular type of character to another type of character, not of the particular type.
  - 15. The computer-readable storage medium of claim 14, wherein types of characters comprise a number, letter or symbol.
  - 16. The computer-readable storage medium of claim 12, wherein information extraction algorithms comprise TF-IDF.
  - 17. The computer-readable storage medium of claim 12, wherein information extraction algorithms comprise dictionaries.
  - 18. The computer-readable storage medium of claim 12, wherein information extraction algorithms comprise mutual information.
  - 19. The computer-readable storage medium of claim 12, wherein information extraction algorithms are based on measures from information theory.
  - 20. The computer-readable storage medium of claim 12, wherein regular expressions are stored in an indexing structure.
  - 21. The computer-readable storage medium of claim 12, wherein regular expressions are stored in the form of any of: a suffix tree, a trie, or a prefix tree.
  - 22. The computer-readable storage medium of claim 12, wherein regular expressions are stored in the form of a custom index structure.
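
Claims 20 to 22 leave the indexing structure open. As one illustration, a trie keyed on reversed host labels could hold one pattern per website. Keying on host labels, the `PatternTrie` class, and the `$pattern` marker key are all assumptions made for this sketch, not details from the application.

```python
class PatternTrie:
    """Trie keyed on reversed host labels, holding one pattern string per host."""

    def __init__(self):
        self.root = {}

    def insert(self, host: str, pattern: str) -> None:
        node = self.root
        # "shop.example.com" is walked as com -> example -> shop.
        for label in reversed(host.split(".")):
            node = node.setdefault(label, {})
        # "$" is not valid in a host label, so this key cannot collide.
        node["$pattern"] = pattern

    def lookup(self, host: str):
        """Return the stored pattern for exactly this host, or None."""
        node = self.root
        for label in reversed(host.split(".")):
            node = node.get(label)
            if node is None:
                return None
        return node.get("$pattern")

trie = PatternTrie()
trie.insert("shop.example.com", r"^/[a-z]+/\d+$")
print(trie.lookup("shop.example.com"))
print(trie.lookup("example.com"))
```

Reversing the labels groups all hosts of one domain under a shared prefix, so a lookup cost depends on the number of labels in the host rather than on the number of stored patterns.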

Current Assignee
Oath Inc. (Verizon Communications Inc.)
Original Assignee
Oath Inc. (Verizon Communications Inc.)
Inventors
Poola, Krishna Leela; Ramanujapuram, Arun

Application Number

US11/937,417
Publication Number

US 20090089278A1
Field of Search
US Class Current

1/1
CPC Class Codes

G06F 16/951 Indexing; Web crawling tech...
