Tokenization platform
Abstract
A tokenization platform and method is described for accurately tokenizing character strings, including but not limited to non-delimited character strings of the type commonly used in Internet domain names and computer filenames, to accurately identify words and phrases occurring therein. In one embodiment, a phased tokenization approach is used in which the final phase is a lexical analysis-based tokenization using a dictionary. The dictionary may be advantageously created and updated based upon one or more query logs associated with respective information retrieval systems, thereby ensuring that the dictionary accurately reflects currently-used terminology and captures alternative spellings and presentations of words and phrases submitted by users.
24 Claims
1. A method for tokenizing a character string, comprising:
(a) determining if there are any words or phrases in a dictionary that match a series of characters within the character string that begins at the first character of the character string, wherein the character string comprises a non-delimited character string;
(b) for each matching word or phrase identified in step (a), assigning the matching word or phrase to a tokenization path, wherein the tokenization path comprises one or more contiguous words or phrases embedded within the character string, and removing a corresponding series of characters from the beginning of the character string, thereby generating a shortened character string associated with the tokenization path or terminating the tokenization path;
(c) if no matching word or phrase is identified in step (a), then terminating any tokenization path with which the character string is associated;
(d) recursively performing steps (a), (b) and (c) for any shortened character string generated in step (b) until all tokenization paths are terminated;
(e) for any tokenization path formed through the performance of steps (a)-(d), calculating a score based on each word or phrase assigned to the tokenization path; and
(f) selecting the word(s) and/or phrase(s) associated with a tokenization path having the highest score as tokens associated with the character string.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
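Steps (a) through (f) of claim 1 can be sketched in a few lines of Python. This is a minimal illustration and not the patented implementation: the function names, the sample dictionary, and in particular the scoring rule (sum of squared entry lengths) are assumptions, since the claim only requires a score "based on each word or phrase" in the path. The sketch also assumes that a path terminated with characters left over is still scored alongside complete paths.

```python
from typing import Callable, List, Set, Tuple

Path = Tuple[str, ...]


def enumerate_paths(s: str, dictionary: Set[str], path: Path = ()) -> List[Path]:
    """Steps (a)-(d): return every terminated tokenization path for s."""
    # Step (a): dictionary entries that match characters starting at s[0].
    matches = [s[:n] for n in range(1, len(s) + 1) if s[:n] in dictionary]
    # Step (c): no match (including an exhausted string) terminates the path.
    if not matches:
        return [path] if path else []
    paths: List[Path] = []
    for entry in matches:
        # Step (b): assign the entry to the path and drop its characters;
        # step (d): recurse on the shortened string.
        paths.extend(enumerate_paths(s[len(entry):], dictionary, path + (entry,)))
    return paths


def length_squared_score(path: Path) -> int:
    """Hypothetical step (e) score: favors paths built from longer entries."""
    return sum(len(entry) ** 2 for entry in path)


def tokenize(s: str, dictionary: Set[str],
             score: Callable[[Path], int] = length_squared_score) -> List[str]:
    """Step (f): return the entries of the highest-scoring path."""
    paths = enumerate_paths(s, dictionary)
    return list(max(paths, key=score)) if paths else []


# Hypothetical dictionary and input, for illustration only.
DICTIONARY = {"ex", "expert", "experts", "sex", "change", "exchange"}
print(tokenize("expertsexchange", DICTIONARY))  # ['experts', 'exchange']
```

With this sample dictionary the recursion yields four paths: ("ex"), ("expert", "sex", "change"), ("experts", "ex", "change") and ("experts", "exchange"). The assumed score ranks the last one highest. The plain recursion can revisit the same suffix many times, so a practical version would likely cache results per suffix.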
10. A method for tokenizing a character string, comprising:
populating a dictionary with words and phrases included in a set of search queries submitted by users of one or more information retrieval systems over a first predetermined time period;
identifying one or more series of characters within the character string that match a word or phrase populated within the dictionary, wherein the character string comprises a non-delimited character string; and
designating the identified one or more series of characters within the character string that match a word or phrase populated within the dictionary as a token associated with the character string.
Dependent claims: 11, 12
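The three operations of claim 10 can likewise be sketched in Python. Everything specific here is an assumption made for illustration: the query-log layout (timestamp and query pairs), the half-open time window, the choice to key each phrase by its delimiter-free form so it can match a non-delimited string, and the function names `build_dictionary` and `find_tokens`.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def build_dictionary(query_log: Iterable[Tuple[datetime, str]],
                     start: datetime, end: datetime) -> Dict[str, str]:
    """Map delimiter-free keys to the words/phrases seen in [start, end)."""
    dictionary: Dict[str, str] = {}
    for timestamp, query in query_log:
        if not (start <= timestamp < end):
            continue  # outside the predetermined time period
        words = query.lower().split()
        for word in words:
            dictionary.setdefault(word, word)
        if len(words) > 1:
            # Store the whole query as a phrase, keyed without spaces.
            dictionary.setdefault("".join(words), " ".join(words))
    return dictionary


def find_tokens(s: str, dictionary: Dict[str, str]) -> List[Tuple[int, str]]:
    """Return (offset, entry) for every substring of s found in the dictionary."""
    s = s.lower()
    found: List[Tuple[int, str]] = []
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            entry = dictionary.get(s[i:j])
            if entry is not None:
                found.append((i, entry))
    return found


# Hypothetical query log, for illustration only.
LOG = [
    (datetime(2024, 1, 5), "new york hotels"),
    (datetime(2024, 1, 9), "cheap flights"),
    (datetime(2023, 6, 1), "old query"),  # falls outside the window below
]
DICT = build_dictionary(LOG, datetime(2024, 1, 1), datetime(2024, 2, 1))
print(find_tokens("newyorkhotels", DICT))
```

This sketch reports every match, including overlapping ones such as "new" and "new york hotels" at offset 0. Choosing among overlapping candidates is a separate question that claim 10 does not address.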
13. A computer program product comprising a computer-readable storage device having computer program logic recorded thereon, which, when executed by a processing unit, performs operations to tokenize a character string, the operations comprising:
determining if there are any words or phrases in a dictionary that match a series of characters within the character string that begins at the first character of the character string, wherein the character string comprises a non-delimited character string;
assigning each matching word or phrase identified by said determining to a tokenization path, wherein the tokenization path comprises one or more contiguous words or phrases embedded within the character string, and removing a corresponding series of characters from the beginning of the character string, thereby generating a shortened character string associated with the tokenization path or terminating the tokenization path;
terminating any tokenization path with which the character string is associated if no matching word or phrase is identified by said determining;
recursively performing the functions associated with said determining, said assigning and removing, and said terminating with respect to any shortened character string generated by said assigning and removing until all tokenization paths are terminated;
calculating a score for any tokenization path formed by the execution of said determining, said assigning and removing, said terminating and said performing based on each word or phrase assigned to the tokenization path; and
selecting the word(s) and/or phrase(s) associated with a tokenization path having the highest score as tokens associated with the character string.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21
22. A computer program product comprising a computer-readable storage device having computer program logic recorded thereon, which, when executed by a processing unit, performs operations to tokenize a character string, the operations comprising:
populating a dictionary with words and phrases included in a set of search queries submitted by users of one or more information retrieval systems over a first predetermined time period; and
identifying one or more series of characters within the character string that match a word or phrase populated within the dictionary, wherein the character string comprises a non-delimited character string; and
designating the identified one or more series of characters within the character string that match a word or phrase populated within the dictionary as a token associated with the character string.
Dependent claims: 23, 24
Specification