
Tokenization platform

  • US 9,195,738 B2
  • Filed: 08/13/2012
  • Issued: 11/24/2015
  • Est. Priority Date: 07/24/2008
  • Status: Active Grant
First Claim

1. A method implemented on a machine having at least one processor, storage, and a communication platform connected to a network for tokenizing a non-delimited character string, comprising:

  • creating a dictionary with words and phrases included in a set of search queries submitted via the network by users of one or more information retrieval systems including an Internet search engine over a first predetermined time period;

    identifying one or more series of characters within the non-delimited character string that match a word or phrase in the dictionary, wherein the identifying comprises:

        determining if there is any word or phrase in the dictionary that matches a series of characters that begins at a first character of the non-delimited character string,

        assigning, for each matching word or phrase, the matching word or phrase to a tokenization path comprising one or more contiguous words or phrases embedded within the non-delimited character string,

        removing a corresponding series of characters from beginning of the non-delimited character string to generate a shortened character string or terminate the tokenization path,

        terminating any tokenization path associated with the non-delimited character string if no matching word or phrase is identified by the determining,

        recursively performing the above determining, assigning, removing, and terminating with respect to any shortened character string generated by the assigning and removing until all tokenization paths are terminated to generate one or more tokenization paths, and

        determining a selected tokenization path from the one or more tokenization paths; and

    designating the word or phrase assigned to the selected tokenization path as a token associated with the non-delimited character string.
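
The steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation; it only shows one way the recursive prefix matching could work. The function names (`build_dictionary`, `tokenization_paths`, `select_path`) are hypothetical. Three assumptions go beyond the claim text: phrases are matched with their spaces removed, a path is kept only if it consumes the whole string, and the selected path is the one with the fewest tokens (the claim does not state a selection rule).

```python
from typing import Dict, List


def build_dictionary(queries: List[str]) -> Dict[str, str]:
    """Collect words and whole-query phrases from a set of search queries.

    Keys are the space-free forms used for matching (assumption);
    values are the original word or phrase.
    """
    entries: Dict[str, str] = {}
    for query in queries:
        words = query.lower().split()
        if not words:
            continue
        for word in words:
            entries[word] = word
        # Store the whole query as a phrase, keyed without spaces.
        entries["".join(words)] = " ".join(words)
    return entries


def tokenization_paths(text: str, dictionary: Dict[str, str]) -> List[List[str]]:
    """Return every path that splits `text` entirely into dictionary entries."""
    if not text:
        # Nothing left to match: one complete (empty) continuation.
        return [[]]
    paths: List[List[str]] = []
    for end in range(1, len(text) + 1):
        entry = dictionary.get(text[:end])
        if entry is None:
            continue
        # Remove the matched prefix and recurse on the shortened string.
        for rest in tokenization_paths(text[end:], dictionary):
            paths.append([entry] + rest)
    # If no prefix matched, `paths` is empty and this branch terminates.
    return paths


def select_path(paths: List[List[str]]) -> List[str]:
    """Pick one path; fewest tokens is an assumed rule, not from the claim."""
    if not paths:
        return []
    return min(paths, key=len)


if __name__ == "__main__":
    dictionary = build_dictionary(["new york", "pizza"])
    paths = tokenization_paths("newyorkpizza", dictionary)
    print(paths)               # [['new', 'york', 'pizza'], ['new york', 'pizza']]
    print(select_path(paths))  # ['new york', 'pizza']
```

The plain recursion revisits the same suffix many times on long strings; caching results per suffix would be a natural refinement, but it is left out here to keep the correspondence with the claim steps easy to follow.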

  • 9 Assignments