Methods and systems for augmenting a token lexicon

US 8,051,096 B1
Filed: 09/30/2004
Issued: 11/01/2011
Est. Priority Date: 09/30/2004
Status: Active Grant

Abstract

Methods and systems for augmenting a token lexicon are presented. In one embodiment, a method comprising identifying a first token from a search request, storing the first token in a lexicon data storage, receiving a character string comprising a second token, wherein the second token is substantially similar to the first token, and parsing the character string using the lexicon data storage to resolve the second token is set forth. According to another embodiment, a method comprising identifying a first token from an internet article, storing the first token in a lexicon data storage, receiving a character string comprising a second token, wherein the second token is substantially similar to the first token, and parsing the character string using the lexicon data storage to resolve the second token is set forth.
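
As a rough illustration of why augmenting the lexicon matters when parsing a string with no token-delineating breaks, the sketch below peels off a first token and recurses on the remainder. It is only an assumption about how such parsing might look: the function name `segment`, the set-based lexicon, and the longest-prefix-first strategy are hypothetical and are not taken from the patent.

```python
from typing import List, Optional, Set


def segment(text: str, lexicon: Set[str]) -> Optional[List[str]]:
    """Split break-free text into lexicon tokens; return None if no split exists."""
    if not text:
        return []
    # Peel off a first token (longest candidate first), then recurse on the rest.
    for end in range(len(text), 0, -1):
        first, rest = text[:end], text[end:]
        if first in lexicon:
            tail = segment(rest, lexicon)
            if tail is not None:
                return [first] + tail
    return None


# Hypothetical lexicon contents, chosen only for the example.
lexicon = {"used", "rare"}
before = segment("usedrarebooks", lexicon)  # "books" is unknown, so no parse
lexicon.add("books")                        # augment the lexicon
after = segment("usedrarebooks", lexicon)
print(before, after)
```

The same string fails to parse before the token is added and parses afterwards. The recursion is unmemoized, so it is suitable only for short strings such as domain labels.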

130 Citations

28 Claims

1. A computer-implemented method, comprising:
- receiving a character string in an alphanumeric format having no token-delineating breaks and comprising one or more tokens in the alphanumeric format; and
  
  for each of the one or more tokens, parsing the received character string into a first portion containing a first token and a second portion containing the remaining tokens;
  
  identifying the first token in one or more logs associated with multiple previously received search requests;
  
  determining a frequency with which the identified first token appears in the one or more logs;
  
  determining whether the determined frequency with which the identified first token appears in the one or more logs exceeds a first threshold level; and
  
  storing the identified first token in a lexicon data storage based on the determination of whether the determined frequency with which the identified first token appears in the one or more logs exceeds the first threshold level, wherein the lexicon data storage comprises an ontology associating at least one of a misspelling of the first token with a correct spelling, or an alternate spelling of the first token with a different spelling.
- Dependent claims (2, 3, 4, 5):
  - 2. The computer-implemented method of claim 1, wherein the first token comprises at least one of a misspelled word, a domain name, an abbreviation, or a proper name.
  - 3. The computer-implemented method of claim 1, wherein the ontology associates the first token with at least one interrelated token.
  - 4. The computer-implemented method of claim 1, wherein storing the identified first token in the lexicon data storage comprises storing the identified first token in the lexicon data storage when the multiple previously received search requests exceed the first threshold level, when the identified first token comprises more than a second threshold level of characters, or when characters in the identified first token are included in a specific character set.
  - 5. The computer-implemented method of claim 1, wherein the character string comprises a domain name.
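
One possible reading of the frequency test in claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the patentee's implementation and not a statement of claim scope: the `Lexicon` class, the `count_in_logs` and `maybe_store` helpers, the whitespace-based counting, and the sample log are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set


@dataclass
class Lexicon:
    """Hypothetical lexicon store: known tokens plus a simple ontology."""
    tokens: Set[str] = field(default_factory=set)
    # Ontology entry: variant spelling -> spelling it is associated with.
    ontology: Dict[str, str] = field(default_factory=dict)


def count_in_logs(token: str, query_log: Iterable[str]) -> int:
    """Count logged search requests containing the token as a whole word."""
    return sum(1 for query in query_log if token in query.lower().split())


def maybe_store(token: str, query_log: Iterable[str], lexicon: Lexicon,
                threshold: int, associated: Optional[str] = None) -> bool:
    """Store the token only if its log frequency exceeds the threshold."""
    if count_in_logs(token, query_log) <= threshold:
        return False
    lexicon.tokens.add(token)
    if associated is not None:
        # Record the misspelling/alternate spelling association.
        lexicon.ontology[token] = associated
    return True


# Made-up log entries for illustration only.
log = ["cheap flights", "googel maps", "googel mail", "googel", "weather"]
lex = Lexicon()
stored_variant = maybe_store("googel", log, lex, threshold=2, associated="google")
stored_rare = maybe_store("weather", log, lex, threshold=2)
print(stored_variant, stored_rare, lex.ontology)
```

Here "exceeds" is modelled as a strict comparison, so a token seen exactly `threshold` times is not stored.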

6. A computer-implemented method, comprising:
- identifying a character string in an alphanumeric format having no token delineating breaks and comprising one or more tokens in the alphanumeric format from an internet-accessible article; and
  
  for each of the one or more tokens, parsing the identified character string into a first portion containing a first token and a second portion containing the remaining tokens;
  
  determining a first frequency with which the first token appears in the internet-accessible article, or a second frequency with which the first token appears at least once in a number of different internet-accessible articles;
  
  determining whether the determined first frequency with which the first token appears in the internet-accessible article exceeds a first threshold level, or whether the determined second frequency with which the first token appears at least once in the number of different internet-accessible articles exceeds a second threshold level; and
  
  storing the first token in a lexicon data storage based on the determination of whether the determined first frequency with which the first token appears in the internet-accessible article exceeds the first threshold level, or whether the determined second frequency with which the first token appears at least once in the number of different internet-accessible articles exceeds the second threshold level, wherein the lexicon data storage comprises an ontology associating at least one of a misspelling of the first token with a correct spelling, or an alternate spelling of the first token with a preferred spelling.
- Dependent claims (7, 8, 9, 10):
  - 7. The computer-implemented method of claim 6, wherein the internet-accessible article comprises at least one of an instant messaging dialog, a chat session, a mailing list archive, and a web page.
  - 8. The computer-implemented method of claim 6, wherein the first token comprises at least one of a misspelled word, a domain name, an abbreviation, or a proper name.
  - 9. The computer-implemented method of claim 6, wherein the ontology associates the first token with at least one interrelated token.
  - 10. The computer-implemented method of claim 6, wherein the character string comprises a domain name.
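
Claim 6 recites two alternative frequencies: how often a token appears within one article, and in how many different articles it appears at least once. A minimal sketch of that distinction is shown below. The tokenizer, function names, thresholds, and sample articles are assumptions made for illustration and do not come from the patent.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase alphanumeric runs; a deliberately simple, assumed tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def article_frequency(token: str, article: str) -> int:
    """First frequency: occurrences of the token within a single article."""
    return tokenize(article).count(token)


def cross_article_frequency(token: str, articles: List[str]) -> int:
    """Second frequency: number of articles containing the token at least once."""
    return sum(1 for article in articles if token in tokenize(article))


def should_store(token: str, article: str, articles: List[str],
                 first_threshold: int, second_threshold: int) -> bool:
    """Store if either frequency exceeds its own threshold."""
    return (article_frequency(token, article) > first_threshold
            or cross_article_frequency(token, articles) > second_threshold)


# Made-up articles for illustration only.
articles = ["blog about blogs and blog posts blog", "my blog", "news"]
print(article_frequency("blog", articles[0]))        # within one article
print(cross_article_frequency("blog", articles))     # across articles
print(should_store("blog", articles[1], articles, 2, 1))
```

Keeping the two counts separate lets a token qualify either because one article uses it heavily or because many articles use it at all.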

11. A computer-implemented method comprising:
- identifying a character string in an alphanumeric format having no token delineating breaks and comprising one or more tokens in the alphanumeric format, wherein the identified character string is included in a plurality of previously received search requests;
  
  for each of the one or more tokens, parsing the identified character string into a first portion containing a first token and a second portion containing the remaining tokens;
  
  determining whether the first token is already included in a lexicon comprising an ontology of interrelated tokens and whether the first token occurs in the plurality of previously received search requests with at least a threshold frequency;
  
  based upon the determination of whether the first token is already included in the lexicon and the determination of whether the first token occurs in the plurality of previously received search results with at least the threshold frequency, identifying a second token that comprises a correct spelling of the first token, or an alternate spelling of the first token; and
  
  adding the first token to the lexicon with an association to the identified second token;
  
  receiving an alphanumeric string of characters comprising a domain name and having no token-delineating breaks;
  
  matching a portion of the received alphanumeric string of characters to the first token using the lexicon; and
  
  replacing the matched portion of the received alphanumeric string of characters with the identified second token contained in the lexicon.
- Dependent claims (12, 13, 14):
  - 12. The computer-implemented method of claim 11, wherein the second token is identified in response to determining that (i) the first token is not already included in the lexicon and (ii) the first token occurs in the plurality of previously received search results with at least the threshold frequency.
  - 13. The computer-implemented method of claim 11, wherein the second token is identified as the correct spelling of the first token, or the alternate spelling of the first token using a spell checker.
  - 14. The computer-implemented method of claim 11, wherein the portion of the received alphanumeric string of characters is matched to the first token using the lexicon in response to a failed attempt to resolve the domain name.
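
The matching and replacing steps of claim 11, together with the failed-resolution trigger of claim 14, might look roughly like the sketch below. Everything named here is hypothetical: `correct_domain`, `resolve_or_correct`, the `resolves` callback (a stand-in for a real name lookup), and the sample data. For simplicity it only examines the leftmost label of the domain name.

```python
from typing import Callable, Dict


def correct_domain(domain: str, ontology: Dict[str, str]) -> str:
    """Replace a known variant in the leftmost label with its associated token."""
    label, dot, suffix = domain.partition(".")
    # Try longer variants first so the most specific match wins.
    for variant in sorted(ontology, key=len, reverse=True):
        if variant in label:
            return label.replace(variant, ontology[variant], 1) + dot + suffix
    return domain


def resolve_or_correct(domain: str, resolves: Callable[[str], bool],
                       ontology: Dict[str, str]) -> str:
    """Consult the lexicon only after an attempt to resolve the domain fails."""
    if resolves(domain):
        return domain
    return correct_domain(domain, ontology)


# Made-up data: a set of "resolvable" names stands in for a real resolver.
known_domains = {"googlemaps.com"}
ontology = {"googel": "google"}
fixed = resolve_or_correct("googelmaps.com", known_domains.__contains__, ontology)
print(fixed)
```

Passing the resolver in as a callback keeps the sketch free of network access and makes the fallback order explicit.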

15. A non-transitory computer-readable storage device comprising program code that, when executed, causes a processor to perform operations comprising:
- receiving a character string in an alphanumeric format having no token-delineating breaks and comprising the first token one or more tokens in the alphanumeric format; and
  
  for each of the one or more tokens, parsing the received character string into a first portion containing a first token and a second portion containing the remaining tokens;
  
  identifying the first token in one or more logs associated with multiple previously received search requests;
  
  determining a frequency with which the identified first token appears in the one or more logs;
  
  determining whether the determined frequency with which the identified first token appears in the one or more logs exceeds a first threshold level; and
  
  storing the identified first token in a lexicon data storage based on the determination of whether the determined frequency with which the identified first token appears in the one or more logs exceeds the first threshold level, wherein the lexicon data storage comprises an ontology associating at least one of a misspelling of the first token with a correct spelling, or an alternate spelling of the first token with a preferred spelling.
- Dependent claims (16, 17, 18, 19):
  - 16. The non-transitory computer-readable storage device of claim 15, wherein the first token comprises at least one of a misspelled word, a domain name, an abbreviation, or a proper name.
  - 17. The non-transitory computer-readable storage device of claim 15, wherein the ontology associates the first token with an interrelated token.
  - 18. The non-transitory computer-readable storage device of claim 15, wherein storing the identified first token in the lexicon data storage comprises storing the identified first token in the lexicon data storage when the multiple previously received search requests exceed the first threshold level, when the identified first token comprises more than a second threshold level of characters, or when characters in the identified first token are included in a specific character set.
  - 19. The non-transitory computer-readable storage device of claim 15, wherein the character string comprises a domain name.

20. A non-transitory computer-readable storage device comprising program code that, when executed, causes a processor to perform operations comprising:
- identifying a character string in an alphanumeric format having no token delineating breaks and comprising one or more tokens in the alphanumeric format from an internet-accessible article; and
  
  for each of the one or more tokens, parsing the identified character string into a first portion containing a first token and a second portion containing the remaining tokens;
  
  determining a first frequency with which the first token appears in the internet-accessible article, or a second frequency with which the first token appears at least once in a number of different internet-accessible articles;
  
  determining whether the determined first frequency with which the first token appears in the internet-accessible article exceeds a first threshold level, or whether the determined second frequency with which the first token appears at least once in the number of different internet-accessible articles exceeds a second threshold level; and
  
  storing the first token in a lexicon data storage based on the determination of whether the determined first frequency with which the first token appears in the internet-accessible article exceeds the first threshold level, or whether the determined second frequency with which the first token appears at least once in the number of different internet-accessible articles exceeds the second threshold level, wherein the lexicon data storage comprises an ontology associating at least one of a misspelling of the first token with a correct spelling, or an alternate spelling of the first token with a preferred spelling.
- Dependent claims (21, 22, 23, 24):
  - 21. The non-transitory computer-readable storage device of claim 20, wherein the internet-accessible article comprises at least one of an instant messaging dialog, a chat session, a mailing list archive, and a web page.
  - 22. The non-transitory computer-readable storage device of claim 20, wherein the first token comprises at least one of a misspelled word, a domain name, an abbreviation, or a proper name.
  - 23. The non-transitory computer-readable storage device of claim 20, wherein the ontology associates the first token with an interrelated token.
  - 24. The non-transitory computer-readable storage device of claim 20, wherein the character string comprises a domain name.

25. A non-transitory computer-readable storage device comprising program code that, when executed, causes a processor to perform operations comprising:
- identifying a character string in an alphanumeric format having no token delineating breaks and comprising one or more tokens in the alphanumeric format, wherein the identified character string is included in a plurality of previously received search requests;
  
  for each of the one or more tokens, parsing the identified character string into a first portion containing a first token and a second portion containing the remaining tokens;
  
  determining whether the first token is already included in a lexicon comprising an ontology of interrelated tokens and whether the first token occurs in the plurality of previously received search requests with at least a threshold frequency;
  
  based upon the determination of whether the first token is already included in the lexicon and the determination of whether the first token occurs in the plurality of previously received search results with at least the threshold frequency, identifying a second token that comprises a correct spelling of the first token, or an alternate spelling of the first token; and
  
  adding the first token to the lexicon with an association to the identified second token;
  
  receiving an alphanumeric string of characters comprising a domain name and having no token-delineating breaks;
  
  matching a portion of the received alphanumeric string of characters to the first token using the lexicon; and
  
  replacing the matched portion of the received alphanumeric string of characters with the identified second token contained in the lexicon.
- Dependent claims (26, 27, 28):
  - 26. The non-transitory computer-readable storage device of claim 25, wherein the second token is identified in response to determining that (i) the first token is not already included in the lexicon and (ii) the first token occurs in the plurality of previously received search results with at least the threshold frequency.
  - 27. The non-transitory computer-readable storage device of claim 25, wherein the second token is identified as the correct spelling of the first token, or the alternate spelling of the first token using a spell checker.
  - 28. The non-transitory computer-readable storage device of claim 25, wherein the portion of the received alphanumeric string of characters is matched to the first token using the lexicon in response to a failed attempt to resolve the domain name.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Elbaz, Gilad Israel; Mandelson, Jacob Leon
Primary Examiner(s): Pham, Hung Q
Assistant Examiner(s): Cheung, Hubert G
Application Number: US10/954,714
Time in Patent Office: 2,588 Days
Field of Search: 707 1-206
US Class Current: 707/778

CPC Class Codes
G06F 16/3334   Selection or weighting of t...
G06F 16/35   Clustering; Classification
G06F 40/232   Orthographic correction, e....
G06F 40/242   Dictionaries
G06F 40/284   Lexical analysis, e.g. toke...
