Tokenization platform

US 9,195,738 B2
Filed: 08/13/2012
Issued: 11/24/2015
Est. Priority Date: 07/24/2008
Status: Active Grant

Abstract

A tokenization platform and method is described for accurately tokenizing character strings, including but not limited to non-delimited character strings of the type commonly used in Internet domain names and computer filenames, to accurately identify words and phrases occurring therein. In one embodiment, a phased tokenization approach is used in which the final phase is a lexical analysis-based tokenization using a dictionary. The dictionary may be advantageously created and updated based upon one or more query logs associated with respective information retrieval systems, thereby ensuring that the dictionary accurately reflects currently-used terminology and captures alternative spellings and presentations of words and phrases submitted by users.
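
As an illustration of the query-log dictionary described in the abstract, the following is a minimal Python sketch, not the patented implementation. It assumes the log is a plain list of query strings, treats "phrases" as runs of up to `max_phrase_len` adjacent words, and counts each entry once per distinct query, which is one possible reading of the ratio recited in claims 18 and 19. The names `build_dictionary`, `max_phrase_len`, and `min_ratio` are hypothetical.

```python
from collections import Counter


def build_dictionary(queries, max_phrase_len=3, min_ratio=0.0):
    """Collect words and phrases from a query log (illustrative only).

    Returns a mapping from each entry to the share of distinct queries
    that contain it, keeping only entries whose share exceeds min_ratio.
    """
    # Normalize case and whitespace so repeated queries collapse together.
    distinct = {" ".join(q.lower().split()) for q in queries if q.strip()}
    counts = Counter()
    for query in distinct:
        words = query.split()
        seen = set()
        for n in range(1, max_phrase_len + 1):
            for i in range(len(words) - n + 1):
                seen.add(" ".join(words[i:i + n]))
        counts.update(seen)  # each entry counted once per distinct query
    total = len(distinct)
    return {e: c / total for e, c in counts.items() if c / total > min_ratio}
```

Rebuilding the mapping from a later batch of queries would correspond loosely to the periodic update of claim 2; how often to do so and which threshold to use are left open here.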

20 Claims

1. A method implemented on a machine having at least one processor, storage, and a communication platform connected to a network for tokenizing a non-delimited character string, comprising:
- creating a dictionary with words and phrases included in a set of search queries submitted via the network by users of one or more information retrieval systems including an Internet search engine over a first predetermined time period;
- identifying one or more series of characters within the non-delimited character string that match a word or phrase in the dictionary, wherein the identifying comprises:
  - determining if there is any word or phrase in the dictionary that matches a series of characters that begins at a first character of the non-delimited character string,
  - assigning, for each matching word or phrase, the matching word or phrase to a tokenization path comprising one or more contiguous words or phrases embedded within the non-delimited character string,
  - removing a corresponding series of characters from beginning of the non-delimited character string to generate a shortened character string or terminate the tokenization path,
  - terminating any tokenization path associated with the non-delimited character string if no matching word or phrase is identified by the determining,
  - recursively performing the above determining, assigning, removing, and terminating with respect to any shortened character string generated by the assigning and removing until all tokenization paths are terminated to generate one or more tokenization paths, and
  - determining a selected tokenization path from the one or more tokenization paths; and
- designating the word or phrase assigned to the selected tokenization path as a token associated with the non-delimited character string.
- Dependent claims (2, 3, 4, 5, 6, 18, 19, 20):
  - 2. The method of claim 1, further comprising:
    - periodically updating the dictionary with words and phrases included in additional sets of search queries submitted by users of the one or more information retrieval systems over predetermined time periods that are subsequent to the first predetermined time period.
  - 3. The method of claim 1, further comprising:
    - storing the words and phrases in the dictionary into a prefix tree.
  - 4. The method of claim 3, wherein identifying the one or more series of characters comprises:
    - traversing the prefix tree starting at the first character of the non-delimited character string and proceeding from node to node of the prefix tree based on the sequence of characters in the non-delimited character string until all matching words and phrases within the prefix tree are found.
  - 5. The method of claim 3, wherein storing the words and phrases in the dictionary into a prefix tree comprises:
    - storing four letter prefixes and five letter prefixes associated with the words and phrases in the dictionary as nodes immediately below a root of the prefix tree.
  - 6. The method of claim 1, wherein the one or more information retrieval systems include one or more of:
    - an information retrieval system configured to retrieve web pages;
    - an information retrieval system configured to retrieve images; and
    - an information retrieval system configured to retrieve news content.
  - 18. The method of claim 1, wherein each of the words and phrases in the dictionary has an associated frequency that is above a predetermined threshold.
  - 19. The method of claim 18, wherein the associated frequency for a particular word or phrase represents a ratio between the total number of times the particular word or phrase appears within a distinct search query and the total number of distinct search queries submitted by the users.
  - 20. The method of claim 1, further including replacing at least a subset of the words and phrases stored in the dictionary with words and phrases included in an additional set of search queries submitted by the users.
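
The determining, assigning, removing, terminating, and recursing steps of claim 1 can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the patented implementation: the dictionary is a plain set of strings with no spaces, only paths that consume the whole string are returned, and because the claim does not say how the selected path is chosen, the sketch assumes a "fewest tokens" rule. The names `tokenization_paths` and `select_path` are hypothetical.

```python
def tokenization_paths(text, dictionary):
    """Return every split of text into consecutive dictionary entries."""
    if not text:
        return [[]]  # nothing left to match: the path is complete
    paths = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:          # determining and assigning
            rest = text[end:]           # removing the matched characters
            for tail in tokenization_paths(rest, dictionary):  # recursing
                paths.append([head] + tail)
    return paths  # an empty list means every path from here was terminated


def select_path(paths):
    """Assumed selection rule: prefer the path with the fewest tokens."""
    return min(paths, key=len) if paths else None


words = {"new", "york", "newyork", "hotel", "hotels"}
paths = tokenization_paths("newyorkhotels", words)
# paths == [['new', 'york', 'hotels'], ['newyork', 'hotels']]
# select_path(paths) == ['newyork', 'hotels']
```

The path that starts `new`, `york`, `hotel` is terminated because the remaining `s` matches nothing. Plain recursion can revisit the same remainder many times on long strings; caching results per remainder would avoid that.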

7. A machine-readable tangible and non-transitory medium having information for enabling a processing unit to tokenize a non-delimited character string, wherein the information, when read by the machine, causes the machine to enable the processing unit to perform the following:
- creating a dictionary with words and phrases included in a set of search queries submitted via a network by users of one or more information retrieval systems including an Internet search engine over a first predetermined time period;
- identifying one or more series of characters within the non-delimited character string that match a word or phrase in the dictionary, wherein the identifying comprises:
  - determining if there is any word or phrase in the dictionary that matches a series of characters that begins at a first character of the non-delimited character string,
  - assigning, for each matching word or phrase, the matching word or phrase to a tokenization path comprising one or more contiguous words or phrases embedded within the non-delimited character string,
  - removing a corresponding series of characters from beginning of the non-delimited character string to generate a shortened character string or terminate the tokenization path,
  - terminating any tokenization path associated with the non-delimited character string if no matching word or phrase is identified by the determining,
  - recursively performing the above determining, assigning, removing, and terminating with respect to any shortened character string generated by the assigning and removing until all tokenization paths are terminated to generate one or more tokenization paths, and
  - determining a selected tokenization path from the one or more tokenization paths; and
- designating the word or phrase assigned to the selected tokenization path as a token associated with the non-delimited character string.
- Dependent claims (8, 9, 10, 11, 12):
  - 8. The medium of claim 7, wherein the information, when read by the machine, further causes the machine to perform the following:
    - enabling the processing unit to periodically update the dictionary with words and phrases included in additional sets of search queries submitted by users of the one or more information retrieval systems over predetermined time periods that are subsequent to the first predetermined time period.
  - 9. The medium of claim 7, wherein the information, when read by the machine, further causes the machine to perform the following:
    - enabling the processing unit to store the words and phrases in the dictionary into a prefix tree.
  - 10. The medium of claim 9, wherein the step of enabling the processing unit to identify one or more series of characters comprises:
    - enabling the processing unit to traverse the prefix tree starting at the first character of the non-delimited character string and proceeding from node to node of the prefix tree based on the sequence of characters in the non-delimited character string until all matching words and phrases within the prefix tree are found.
  - 11. The medium of claim 9, wherein the step of enabling the processing unit to store the words and phrases in the dictionary into a prefix tree comprises:
    - enabling the processing unit to store four letter prefixes and five letter prefixes associated with the words and phrases in the dictionary as nodes immediately below a root of the prefix tree.
  - 12. The medium of claim 7, wherein the one or more information retrieval systems include one or more of:
    - an information retrieval system configured to retrieve web pages;
    - an information retrieval system configured to retrieve images; and
    - an information retrieval system configured to retrieve news content.
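
The prefix-tree storage and traversal recited in claims 9 and 10 (and in claims 3 and 4) can be sketched as below. This is a minimal character-per-node trie written for illustration. It does not model the four-letter and five-letter prefix nodes of claims 5 and 11, and the names `TrieNode`, `build_trie`, and `matching_prefixes` are hypothetical.

```python
class TrieNode:
    """One node of a character-keyed prefix tree."""

    def __init__(self):
        self.children = {}     # next character -> child node
        self.is_entry = False  # True if a dictionary entry ends here


def build_trie(entries):
    """Store each dictionary entry as a root-to-node path of characters."""
    root = TrieNode()
    for entry in entries:
        node = root
        for ch in entry:
            node = node.children.setdefault(ch, TrieNode())
        node.is_entry = True
    return root


def matching_prefixes(root, text):
    """Walk the tree along text and collect every entry that is a prefix."""
    matches = []
    node = root
    for i, ch in enumerate(text):
        node = node.children.get(ch)
        if node is None:
            break  # no entry continues with this character
        if node.is_entry:
            matches.append(text[:i + 1])
    return matches


root = build_trie(["new", "news", "newyork", "york"])
# matching_prefixes(root, "newyorkhotels") == ['new', 'newyork']
```

A single walk from the root finds all entries that begin at the first character of the string, so the dictionary lookup does not need to test each candidate prefix separately.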

13. A system, comprising:
- a processing unit; and
- a memory containing a program, which, when executed by the processing unit, performs a method for tokenizing a non-delimited character string, the method comprising:
  - creating a dictionary with words and phrases included in a set of search queries submitted via a network by users of one or more information retrieval systems including an Internet search engine over a first predetermined time period;
  - identifying one or more series of characters within the non-delimited character string that match a word or phrase in the dictionary, wherein the identifying comprises:
    - determining if there is any word or phrase in the dictionary that matches a series of characters that begins at a first character of the non-delimited character string,
    - assigning, for each matching word or phrase, the matching word or phrase to a tokenization path comprising one or more contiguous words or phrases embedded within the non-delimited character string,
    - removing a corresponding series of characters from beginning of the non-delimited character string to generate a shortened character string or terminate the tokenization path,
    - terminating any tokenization path associated with the non-delimited character string if no matching word or phrase is identified by the determining,
    - recursively performing the above determining, assigning, removing, and terminating with respect to any shortened character string generated by the assigning and removing until all tokenization paths are terminated to generate one or more tokenization paths, and
    - determining a selected tokenization path from the one or more tokenization paths; and
  - designating the word or phrase assigned to the selected tokenization path as a token associated with the non-delimited character string.
- Dependent claims (14, 15, 16, 17):
  - 14. The system of claim 13, the method further comprising:
    - periodically updating the dictionary with words and phrases included in additional sets of search queries submitted by users of the one or more information retrieval systems over predetermined time periods that are subsequent to the first predetermined time period.
  - 15. The system of claim 13, further comprising:
    - storing the words and phrases in the dictionary into a prefix tree.
  - 16. The system of claim 15, wherein storing the words and phrases in the dictionary into a prefix tree comprises:
    - storing four letter prefixes and five letter prefixes associated with the words and phrases in the dictionary as nodes immediately below a root of the prefix tree.
  - 17. The system of claim 13, wherein the one or more information retrieval systems include one or more of:
    - an information retrieval system configured to retrieve web pages;
    - an information retrieval system configured to retrieve images; and
    - an information retrieval system configured to retrieve news content.

Current Assignee: R2 Solutions LLC (Acacia Research Corporation)
Original Assignee: Yahoo! Inc. (Apollo Global Management, Inc.)
Inventors: Parikh, Jignashu
Primary Examiner(s): Hudspeth, David
Assistant Examiner(s): Nguyen, Timothy
Application Number: US13/572,825
Publication Number: US 20120310630A1
Time in Patent Office: 1,198 Days
Field of Search: 704/9
US Class Current: 1/1
CPC Class Codes:
G06F 16/3344   using natural language anal...
G06F 40/242   Dictionaries
G06F 40/284   Lexical analysis, e.g. toke...
