Tokenization platform
Abstract
A tokenization platform and method are described for tokenizing character strings, including but not limited to non-delimited character strings of the type commonly used in Internet domain names and computer filenames, to accurately identify words and phrases occurring therein. In one embodiment, a phased tokenization approach is used in which the final phase is a lexical analysis-based tokenization using a dictionary. The dictionary may be advantageously created and updated based upon one or more query logs associated with respective information retrieval systems, thereby ensuring that the dictionary accurately reflects currently-used terminology and captures alternative spellings and presentations of words and phrases submitted by users.
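As an illustration of the query-log-driven dictionary described in the abstract, the following is a minimal sketch only, not the patented implementation. It assumes a hypothetical log format of (timestamp, query) pairs, and the function name `build_dictionary`, the `min_count` threshold, and the choice to store each multi-word query as a run-together key are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime


def build_dictionary(query_log, start, end, min_count=1):
    """Collect words and phrases from queries submitted in [start, end).

    query_log is an iterable of (timestamp, query) pairs (assumed format).
    Returns a dict mapping a space-free key to its readable form, so that
    a phrase such as "new york" can be matched against "newyork".
    """
    counts = Counter()
    surface = {}
    for timestamp, query in query_log:
        # Keep only queries inside the predetermined time period.
        if not (start <= timestamp < end):
            continue
        words = query.lower().split()
        entries = [(w, w) for w in words]
        if len(words) > 1:
            # Treat the whole query as one phrase entry (an assumption).
            entries.append(("".join(words), " ".join(words)))
        for key, form in entries:
            counts[key] += 1
            surface.setdefault(key, form)
    # min_count is a hypothetical filter for rarely seen terms.
    return {k: surface[k] for k, c in counts.items() if c >= min_count}
```

Rebuilding the dictionary over a more recent window is one simple way to approximate the "created and updated" behavior the abstract mentions; the source does not specify how updates are scheduled.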
20 Claims
1. A method implemented on a machine having at least one processor, storage, and a communication platform connected to a network for tokenizing a non-delimited character string, comprising:
- creating a dictionary with words and phrases included in a set of search queries submitted via the network by users of one or more information retrieval systems including an Internet search engine over a first predetermined time period;
- identifying one or more series of characters within the non-delimited character string that match a word or phrase in the dictionary, wherein the identifying comprises:
  - determining if there is any word or phrase in the dictionary that matches a series of characters that begins at a first character of the non-delimited character string,
  - assigning, for each matching word or phrase, the matching word or phrase to a tokenization path comprising one or more contiguous words or phrases embedded within the non-delimited character string,
  - removing a corresponding series of characters from beginning of the non-delimited character string to generate a shortened character string or terminate the tokenization path,
  - terminating any tokenization path associated with the non-delimited character string if no matching word or phrase is identified by the determining,
  - recursively performing the above determining, assigning, removing, and terminating with respect to any shortened character string generated by the assigning and removing until all tokenization paths are terminated to generate one or more tokenization paths, and
  - determining a selected tokenization path from the one or more tokenization paths; and
- designating the word or phrase assigned to the selected tokenization path as a token associated with the non-delimited character string.
Dependent claims: 2, 3, 4, 5, 6, 18, 19, 20
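The determining, assigning, removing, terminating, and recursing steps recited in claim 1 can be illustrated with a short sketch. This is an assumption-laden example, not the claimed method itself: the function names are hypothetical, the sketch keeps only paths that consume the entire string (dead-end paths are simply dropped), and the "fewest tokens" selection rule is a stand-in, since the claim does not state how the selected path is chosen.

```python
def tokenization_paths(text, dictionary):
    """Return every split of `text` into consecutive dictionary entries."""
    if not text:
        # Nothing left to match: this path terminates successfully.
        return [[]]
    paths = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        # Determine whether an entry matches at the first character.
        if prefix in dictionary:
            # Remove the matched characters and recurse on the
            # shortened string, assigning the match to each path.
            for rest in tokenization_paths(text[end:], dictionary):
                paths.append([prefix] + rest)
    # An empty result means no entry matched, so the path is terminated.
    return paths


def select_path(paths):
    """Pick the path with the fewest tokens (hypothetical heuristic)."""
    return min(paths, key=len) if paths else None


words = {"new", "york", "newyork", "hotels"}
candidates = tokenization_paths("newyorkhotels", words)
best = select_path(candidates)
```

Plain recursion like this can revisit the same suffix many times; caching results per suffix would be a natural refinement for long strings.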
7. A machine-readable tangible and non-transitory medium having information for enabling a processing unit to tokenize a non-delimited character string, wherein the information, when read by the machine, causes the machine to enable the processing unit to perform the following:
- creating a dictionary with words and phrases included in a set of search queries submitted via a network by users of one or more information retrieval systems including an Internet search engine over a first predetermined time period;
- identifying one or more series of characters within the non-delimited character string that match a word or phrase in the dictionary, wherein the identifying comprises:
  - determining if there is any word or phrase in the dictionary that matches a series of characters that begins at a first character of the non-delimited character string,
  - assigning, for each matching word or phrase, the matching word or phrase to a tokenization path comprising one or more contiguous words or phrases embedded within the non-delimited character string,
  - removing a corresponding series of characters from beginning of the non-delimited character string to generate a shortened character string or terminate the tokenization path,
  - terminating any tokenization path associated with the non-delimited character string if no matching word or phrase is identified by the determining,
  - recursively performing the above determining, assigning, removing, and terminating with respect to any shortened character string generated by the assigning and removing until all tokenization paths are terminated to generate one or more tokenization paths, and
  - determining a selected tokenization path from the one or more tokenization paths; and
- designating the word or phrase assigned to the selected tokenization path as a token associated with the non-delimited character string.
Dependent claims: 8, 9, 10, 11, 12
13. A system, comprising:
- a processing unit; and
- a memory containing a program, which, when executed by the processing unit, performs a method for tokenizing a non-delimited character string, the method comprising:
  - creating a dictionary with words and phrases included in a set of search queries submitted via a network by users of one or more information retrieval systems including an Internet search engine over a first predetermined time period;
  - identifying one or more series of characters within the non-delimited character string that match a word or phrase in the dictionary, wherein the identifying comprises:
    - determining if there is any word or phrase in the dictionary that matches a series of characters that begins at a first character of the non-delimited character string,
    - assigning, for each matching word or phrase, the matching word or phrase to a tokenization path comprising one or more contiguous words or phrases embedded within the non-delimited character string,
    - removing a corresponding series of characters from beginning of the non-delimited character string to generate a shortened character string or terminate the tokenization path,
    - terminating any tokenization path associated with the non-delimited character string if no matching word or phrase is identified by the determining,
    - recursively performing the above determining, assigning, removing, and terminating with respect to any shortened character string generated by the assigning and removing until all tokenization paths are terminated to generate one or more tokenization paths, and
    - determining a selected tokenization path from the one or more tokenization paths; and
  - designating the word or phrase assigned to the selected tokenization path as a token associated with the non-delimited character string.
Dependent claims: 14, 15, 16, 17
Specification