TOKENIZATION PLATFORM
Abstract
A tokenization platform and method are described for accurately tokenizing character strings, including but not limited to non-delimited character strings of the type commonly used in Internet domain names and computer filenames, in order to identify the words and phrases occurring therein. In one embodiment, a phased tokenization approach is used in which the final phase is a lexical analysis-based tokenization using a dictionary. The dictionary may be advantageously created and updated based upon one or more query logs associated with respective information retrieval systems, thereby ensuring that the dictionary accurately reflects currently used terminology and captures alternative spellings and presentations of words and phrases submitted by users.
24 Claims
1. A method for tokenizing a character string, comprising:
(a) determining if there are any words or phrases in a dictionary that match a series of characters within the character string that begins at the first character of the character string;
(b) for each matching word or phrase identified in step (a), assigning the matching word or phrase to a tokenization path, wherein the tokenization path comprises one or more contiguous words or phrases embedded within the character string, and removing a corresponding series of characters from the beginning of the character string, thereby generating a shortened character string associated with the tokenization path or terminating the tokenization path;
(c) if no matching word or phrase is identified in step (a), then terminating any tokenization path with which the character string is associated;
(d) recursively performing steps (a), (b) and (c) for any shortened character string generated in step (b) until all tokenization paths are terminated;
(e) for any tokenization path formed through the performance of steps (a)-(d), calculating a score based on each word or phrase assigned to the tokenization path; and
(f) selecting the word(s) and/or phrase(s) associated with a tokenization path having the highest score as tokens associated with the character string.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.

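Steps (a) through (f) of claim 1 can be sketched in Python as follows. This is an illustrative sketch only, not the patented implementation. The claim does not say how a path's score is computed, so the scoring used here (the sum of per-entry weights, minus a penalty for each character left unmatched when the path terminates) is an assumption. The names `enumerate_paths`, `tokenize`, `WEIGHTS`, and `penalty`, and the sample weights, are likewise hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical dictionary: entry -> weight (the weights are made up for illustration).
WEIGHTS: Dict[str, float] = {
    "new": 1.0,
    "news": 1.5,
    "pap": 0.2,
    "paper": 1.5,
    "newspaper": 4.0,
}


def enumerate_paths(s: str, dictionary: Dict[str, float]) -> List[Tuple[List[str], str]]:
    """Return every terminated path as (tokens, unmatched_remainder)."""
    # Step (a): find dictionary entries that match a prefix of s.
    matches = [s[:n] for n in range(1, len(s) + 1) if s[:n] in dictionary]
    if not matches:
        # Step (c): nothing matches (or s is empty), so this path terminates.
        return [([], s)]
    paths: List[Tuple[List[str], str]] = []
    for word in matches:
        # Step (b): assign the match to a path and strip it from the front.
        shortened = s[len(word):]
        # Step (d): recurse on the shortened string.
        for tokens, rest in enumerate_paths(shortened, dictionary):
            paths.append(([word] + tokens, rest))
    return paths


def tokenize(s: str, dictionary: Dict[str, float], penalty: float = 1.0) -> List[str]:
    """Steps (e) and (f): score every path and return the best path's tokens."""

    def score(path: Tuple[List[str], str]) -> float:
        tokens, rest = path
        # Assumed scoring rule: reward known entries, penalize leftover characters.
        return sum(dictionary[t] for t in tokens) - penalty * len(rest)

    best_tokens, _ = max(enumerate_paths(s, dictionary), key=score)
    return best_tokens


print(tokenize("newspaper", WEIGHTS))  # ['newspaper']
```

With the sample weights, the string "newspaper" yields four paths ("new", "news" + "pap", "news" + "paper", and "newspaper"), and the single-entry path scores highest. The plain recursion revisits the same suffix many times on long inputs; memoizing on the shortened string would be a natural refinement.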
10. A method for tokenizing a character string, comprising:
populating a dictionary with words and phrases included in a set of search queries submitted by users of one or more information retrieval systems over a first predetermined time period; and
identifying one or more series of characters within the character string that match a word or phrase populated within the dictionary.
Dependent claims: 11, 12.

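The two steps of claim 10 can be illustrated with a short Python sketch. It rests on several assumptions that the claim does not state: the query log is a list of `(timestamp, query)` pairs, text is lowercased, each query contributes both its individual words and the whole phrase, and a phrase is stored under a space-free key so that it can match a non-delimited string. The names `build_dictionary`, `find_matches`, and `LOG`, and the sample data, are hypothetical.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Tuple

# Hypothetical query log: (submission time, query text).
LOG = [
    (datetime(2024, 1, 5), "used cars"),
    (datetime(2024, 1, 20), "cheap flights"),
    (datetime(2023, 6, 1), "floppy disks"),  # outside the window used below
]


def build_dictionary(
    log: Iterable[Tuple[datetime, str]], start: datetime, end: datetime
) -> Dict[str, str]:
    """Map a space-free lookup key to the word or phrase as submitted."""
    entries: Dict[str, str] = {}
    for when, query in log:
        if not (start <= when < end):
            continue  # keep only queries from the chosen time period
        words = query.lower().split()
        if not words:
            continue
        for word in words:
            entries[word] = word
        # Store the whole phrase under a key with the spaces removed.
        entries["".join(words)] = " ".join(words)
    return entries


def find_matches(s: str, entries: Dict[str, str]) -> List[Tuple[int, str]]:
    """Return (start_index, dictionary_entry) for every matching substring."""
    s = s.lower()
    found: List[Tuple[int, str]] = []
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            if s[i:j] in entries:
                found.append((i, entries[s[i:j]]))
    return found


entries = build_dictionary(LOG, datetime(2024, 1, 1), datetime(2024, 2, 1))
print(find_matches("usedcars", entries))
# [(0, 'used'), (0, 'used cars'), (4, 'cars')]
```

Rebuilding the dictionary over a more recent window is one simple way to keep it aligned with current usage, which is the benefit the abstract attributes to query-log-based dictionaries.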
13. A computer program product comprising a computer-readable medium having computer program logic recorded thereon for enabling a processing unit to tokenize a character string, the computer program logic comprising:
first means for enabling the processing unit to determine if there are any words or phrases in a dictionary that match a series of characters within the character string that begins at the first character of the character string;
second means for enabling the processing unit to assign each matching word or phrase identified by the first means to a tokenization path, wherein the tokenization path comprises one or more contiguous words or phrases embedded within the character string, and to remove a corresponding series of characters from the beginning of the character string, thereby generating a shortened character string associated with the tokenization path or terminating the tokenization path;
third means for enabling the processing unit to terminate any tokenization path with which the character string is associated if no matching word or phrase is identified by the first means;
fourth means for enabling the processing unit to recursively perform the functions associated with the first means, the second means and the third means with respect to any shortened character string generated by the second means until all tokenization paths are terminated;
fifth means for enabling the processing unit to calculate a score for any tokenization path formed by the execution of the first means, the second means, the third means and the fourth means based on each word or phrase assigned to the tokenization path; and
sixth means for enabling the processing unit to select the word(s) and/or phrase(s) associated with a tokenization path having the highest score as tokens associated with the character string.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21.

22. A computer program product comprising a computer-readable medium having computer program logic recorded thereon for enabling a processing unit to tokenize a character string, the computer program logic comprising:
first means for enabling the processing unit to populate a dictionary with words and phrases included in a set of search queries submitted by users of one or more information retrieval systems over a first predetermined time period; and
second means for enabling the processing unit to identify one or more series of characters within the character string that match a word or phrase populated within the dictionary.
Dependent claims: 23, 24.

Specification