TOKENIZATION PLATFORM

  • US 20100023514A1
  • Filed: 07/24/2008
  • Published: 01/28/2010
  • Est. Priority Date: 07/24/2008
  • Status: Active Grant
First Claim

1. A method for tokenizing a character string, comprising:

    (a) determining if there are any words or phrases in a dictionary that match a series of characters within the character string that begins at the first character of the character string;

    (b) for each matching word or phrase identified in step (a), assigning the matching word or phrase to a tokenization path, wherein the tokenization path comprises one or more contiguous words or phrases embedded within the character string, and removing a corresponding series of characters from the beginning of the character string, thereby generating a shortened character string associated with the tokenization path or terminating the tokenization path;

    (c) if no matching word or phrase is identified in step (a), then terminating any tokenization path with which the character string is associated;

    (d) recursively performing steps (a), (b) and (c) for any shortened character string generated in step (b) until all tokenization paths are terminated;

    (e) for any tokenization path formed through the performance of steps (a)-(d), calculating a score based on each word or phrase assigned to the tokenization path; and

    (f) selecting the word(s) and/or phrase(s) associated with a tokenization path having the highest score as tokens associated with the character string.
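
Steps (a) through (f) can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `enumerate_paths`, `tokenize`, and `score_word` are hypothetical, and the scoring rule (sum of squared word lengths) is an assumption chosen only so that longer matches win, since the claim does not state how the score is calculated. The claim also does not say how a path that ends with unmatched characters should be treated, so this sketch simply scores such paths like any other.

```python
from typing import Callable, Iterable, List, Tuple

Path = Tuple[str, ...]


def enumerate_paths(text: str, dictionary: Iterable[str]) -> List[Path]:
    """Return every tokenization path produced by steps (a)-(d)."""
    # Sorted for deterministic order; empty entries are dropped so that
    # every match shortens the string and the recursion always ends.
    entries = sorted(e for e in set(dictionary) if e)
    finished: List[Path] = []

    def extend(remaining: str, path: Path) -> None:
        # Step (a): dictionary entries matching at the first character.
        matches = [e for e in entries if remaining.startswith(e)]
        if not matches:
            # Step (c): no match (or nothing left), so the path terminates.
            finished.append(path)
            return
        for entry in matches:
            # Step (b): assign the match to the path and strip it from the
            # front; step (d): recurse on the shortened string.
            extend(remaining[len(entry):], path + (entry,))

    extend(text, ())
    return finished


def tokenize(
    text: str,
    dictionary: Iterable[str],
    # Assumed scoring rule, not taken from the claim.
    score_word: Callable[[str], float] = lambda w: len(w) ** 2,
) -> List[str]:
    """Steps (e)-(f): score each path and return the best one's tokens."""
    paths = enumerate_paths(text, dictionary)
    best = max(paths, key=lambda p: sum(score_word(w) for w in p))
    return list(best)


if __name__ == "__main__":
    words = {"no", "now", "here", "where"}
    print(enumerate_paths("nowhere", words))
    print(tokenize("nowhere", words))
```

With this dictionary, "nowhere" yields two paths, "no" + "where" and "now" + "here", and the assumed scoring rule picks the first. This exhaustive recursion can revisit the same suffix many times on long strings, so a practical version would likely memoize on the remaining string.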
