Linguistic key normalization

US 8,521,516 B2
Filed: 03/25/2009
Issued: 08/27/2013
Est. Priority Date: 03/26/2008
Status: Active Grant

Abstract

Systems, methods, and apparatuses, including computer program products, are provided for training machine learning systems. In some implementations, a method includes receiving a collection of phrases; normalizing a plurality of phrases of the collection, the normalizing being based at least in part on lexicographic normalizing rules; and generating a normalized phrase table including a plurality of key-value pairs. Each key-value pair includes a key corresponding to a normalized phrase and a value corresponding to one or more un-normalized phrases associated with the normalized key, each un-normalized phrase having one or more parameters.
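The table construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `normalize` function (lowercasing plus whitespace collapsing), the `build_normalized_table` helper, and the `count` parameter are all assumptions chosen for the example.

```python
from collections import defaultdict

def normalize(phrase: str) -> str:
    """Hypothetical rule set: lowercase the phrase and collapse whitespace."""
    return " ".join(phrase.lower().split())

def build_normalized_table(phrases):
    """Group un-normalized phrases (with their parameters) under a normalized key."""
    table = defaultdict(list)
    for phrase, params in phrases:
        table[normalize(phrase)].append({"phrase": phrase, "params": params})
    return dict(table)

# Illustrative input only; the "count" parameter is an invented placeholder.
collection = [
    ("New York", {"count": 12}),
    ("new york", {"count": 3}),
    ("NEW  YORK", {"count": 1}),
    ("Boston", {"count": 7}),
]
table = build_normalized_table(collection)
```

Here all three surface forms of "New York" end up in the value stored under the single key `new york`, while each keeps its own parameters.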

122 Citations

26 Claims

1. A computer-implemented method executed by one or more processors, the method comprising:
- receiving a collection of phrases;
  
  normalizing a plurality of phrases of the collection of phrases, the normalizing being based at least in part on lexicographic normalizing rules;
  
  generating a normalized phrase table including a plurality of key-value pairs, each key-value pair having a key that includes a normalized phrase and a value that includes one or more un-normalized phrases associated with the normalized phrase of the key and one or more parameters associated with each un-normalized phrase, the one or more parameters including a translation corresponding to the normalized phrase and a probability for the translation given the normalized phrase;
  
  receiving a training phrase;
  
  normalizing the training phrase according to one or more lexicographic normalization rules;
  
  locating the normalized training phrase in a normalized phrase table, the normalized phrase table including a plurality of key-value pairs, each key-value pair having a key that includes a normalized phrase and a value that includes one or more un-normalized phrases associated with the normalized phrase of the key and one or more parameters associated with each un-normalized phrase;
  
  associating one or more weights to one or more un-normalized phrases associated with the key-value pair for the identified normalized training phrase in the normalized phrase table based on a relation of each associated un-normalized phrase to the received training phrase; and
  
  determining a degree of match between the received training phrase and a specific un-normalized phrase associated with the located normalized training phrase, the degree of match being determined according to a similarity measure, wherein associating one or more weights comprises:
  
  associating a first weight to the specific un-normalized phrase when the training phrase has a high degree of match with the specific un-normalized phrase, and associating a second weight to the specific un-normalized phrase when the training phrase has a low degree of match with the specific un-normalized phrase.
- Dependent claims:
  - 2. The method of claim 1, where normalizing each phrase of the plurality of phrases includes applying one or more normalizing rules, the one or more normalizing rules including rules normalizing based on a case and a morphology of the phrase.
  - 3. The method of claim 1, where the one or more parameters associated with each un-normalized phrase include an identification of a language associated with the phrase.
  - 4. The method of claim 1, where the normalized phrase table includes a plurality of normalized phrases corresponding to un-normalized phrases in a plurality of languages.
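As a rough illustration of claims 1 to 4, the sketch below stores a translation, a probability, and a language tag with each un-normalized phrase, and normalizes by case plus a toy morphology rule. Everything named here is an assumption for the example: the `Entry` type, the `strip_plural` rule (a crude stand-in for real morphological analysis), and the probability values, which are invented.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    phrase: str         # un-normalized surface form
    language: str       # language identification (claim 3)
    translation: str    # translation parameter
    probability: float  # probability of the translation given the normalized phrase

def strip_plural(word: str) -> str:
    # Toy morphology rule (assumption): drop a trailing "s" from words longer than 3 letters.
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def normalize(phrase: str) -> str:
    # Case rule first, then the toy morphology rule, applied word by word.
    return " ".join(strip_plural(w) for w in phrase.lower().split())

def build_table(entries):
    table = defaultdict(list)
    for entry in entries:
        table[normalize(entry.phrase)].append(entry)
    return dict(table)

# Illustrative entries; the probabilities are made up for the example.
entries = [
    Entry("Red Cars", "en", "voitures rouges", 0.6),
    Entry("red car", "en", "voiture rouge", 0.8),
    Entry("Car", "en", "voiture", 0.9),
]
table = build_table(entries)
```

Both "Red Cars" and "red car" share the key `red car`, so a lookup on that key returns each surface form together with its own translation and probability.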

5. A computer-implemented method executed by one or more processors, the method comprising:
- receiving a training phrase;
  
  normalizing the training phrase according to one or more lexicographic normalization rules;
  
  locating the normalized training phrase in a normalized phrase table, the normalized phrase table including a plurality of key-value pairs, each key-value pair having a key that includes a normalized phrase and a value that includes one or more un-normalized phrases associated with the normalized phrase of the key and one or more parameters associated with each un-normalized phrase;
  
  associating one or more weights to one or more un-normalized phrases associated with the key-value pair for the identified normalized training phrase in the normalized phrase table based on a relation of each associated un-normalized phrase to the received training phrase;
  
  determining a degree of match between the received training phrase and a specific un-normalized phrase associated with the located normalized training phrase, the degree of match being determined according to a similarity measure, wherein associating one or more weights comprises:
  
  associating a first weight to the specific un-normalized phrase when the training phrase has a high degree of match with the specific un-normalized phrase, and associating a second weight to the specific un-normalized phrase when the training phrase has a low degree of match with the specific un-normalized phrase; and
  
  training a machine learning model using the one or more un-normalized phrases and the associated one or more weights.
- Dependent claims:
  - 6. The method of claim 5, where associating one or more weights further comprises:
    - assigning a weight to each un-normalized phrase based on a distance measure between each un-normalized phrase and the received training phrase.
  - 7. The method of claim 6, where the distance measure includes determining a distance vector having entries corresponding to the one or more lexicographic normalization rules.
  - 8. The method of claim 5, where training the machine learning model includes:
    - using one or more un-normalized phrases and their associated assigned weights as particular feature functions for the received phrase in a machine learning model.
  - 9. The method of claim 5, where training the machine learning model includes training a language model.
  - 10. The method of claim 5, where training the machine learning model includes training a language identification model.
  - 11. The method of claim 5, where training the machine learning model includes training a statistical machine translation model.
  - 12. The method of claim 5, where identifying the normalized phrase includes identifying a particular chunk of a distributed normalized phrase table including a key value corresponding to the normalized training phrase and searching the identified chunk for the normalized training phrase.
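One possible reading of the weighting in claims 5 to 7 is sketched below. The claims do not specify a similarity measure, so this example assumes one: a distance vector with one entry per normalization rule, where an entry is 1 if the two phrases still differ when that rule is left out. The rule names, the `high`/`low` weights, and the threshold are hypothetical values picked for illustration.

```python
# Hypothetical normalization rules, applied in order.
RULES = [
    ("case", str.lower),
    ("whitespace", lambda s: " ".join(s.split())),
]

def apply_rules(phrase, skip=None):
    """Apply every rule except the one named by `skip`."""
    for name, rule in RULES:
        if name != skip:
            phrase = rule(phrase)
    return phrase

def distance_vector(a, b):
    """One entry per rule: 1 if a and b differ when that rule is not applied."""
    return [int(apply_rules(a, skip=name) != apply_rules(b, skip=name))
            for name, _ in RULES]

def weight_candidates(training_phrase, table, high=1.0, low=0.1, threshold=0.5):
    """Give the first (high) weight to close matches and the second (low) weight otherwise."""
    candidates = table.get(apply_rules(training_phrase), [])
    weights = {}
    for cand in candidates:
        d = distance_vector(training_phrase, cand)
        match = 1.0 - sum(d) / len(RULES)  # assumed degree-of-match score in [0, 1]
        weights[cand] = high if match > threshold else low
    return weights

table = {"new york": ["New York", "new york", "NEW  YORK"]}
weights = weight_candidates("New York", table)
```

With the training phrase "New York", the identical surface form receives the high weight, while the forms that differ in case or spacing receive the low weight. The resulting phrase and weight pairs could then be supplied to whatever model is being trained, for example as weighted features.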

13. A computer program product, encoded on a non-transitory computer-readable medium, operable to cause a data processing apparatus to perform operations comprising:
- receiving a collection of phrases;
  
  normalizing a plurality of phrases of the collection of phrases, the normalizing being based at least in part on lexicographic normalizing rules;
  
  generating a normalized phrase table including a plurality of key-value pairs, each key-value pair having a key that includes a normalized phrase and a value that includes one or more un-normalized phrases associated with the normalized phrase of the key and one or more parameters associated with each un-normalized phrase, the one or more parameters including a translation corresponding to the normalized phrase and a probability for the translation given the normalized phrase;
  
  receiving a training phrase;
  
  normalizing the training phrase according to one or more lexicographic normalization rules;
  
  locating the normalized training phrase in a normalized phrase table, the normalized phrase table including a plurality of key-value pairs, each key-value pair having a key that includes a normalized phrase and a value that includes one or more un-normalized phrases associated with the normalized phrase of the key and one or more parameters associated with each un-normalized phrase;
  
  associating one or more weights to one or more un-normalized phrases associated with the key-value pair for the identified normalized training phrase in the normalized phrase table based on a relation of each associated un-normalized phrase to the received training phrase; and
  
  determining a degree of match between the received training phrase and a specific un-normalized phrase associated with the located normalized training phrase, the degree of match being determined according to a similarity measure, wherein associating one or more weights comprises:
  
  associating a first weight to the specific un-normalized phrase when the training phrase has a high degree of match with the specific un-normalized phrase, and associating a second weight to the specific un-normalized phrase when the training phrase has a low degree of match with the specific un-normalized phrase.
- Dependent claims:
  - 14. The computer program product of claim 13, where normalizing each phrase of the plurality of phrases includes applying one or more normalizing rules, the one or more normalizing rules including rules normalizing based on a case and a morphology of the phrase.
  - 15. The computer program product of claim 13, where the one or more parameters associated with each un-normalized phrase include an identification of a language associated with the phrase.
  - 16. The computer program product of claim 13, where the normalized phrase table includes a plurality of normalized phrases corresponding to un-normalized phrases in a plurality of languages.

17. A computer program product, encoded on a non-transitory computer-readable medium, operable to cause a data processing apparatus to perform operations comprising:
- receiving a training phrase;
  
  normalizing the training phrase according to one or more lexicographic normalization rules;
  
  locating the normalized training phrase in a normalized phrase table, the normalized phrase table including a plurality of key-value pairs, each key-value pair having a key that includes a normalized phrase and a value that includes one or more un-normalized phrases associated with the normalized phrase of the key and one or more parameters associated with each un-normalized phrase;
  
  associating one or more weights to one or more un-normalized phrases associated with the key-value pair for the identified normalized training phrase in the normalized phrase table based on a relation of each associated un-normalized phrase to the received training phrase;
  
  determining a degree of match between the received training phrase and a specific un-normalized phrase associated with the located normalized training phrase, the degree of match being determined according to a similarity measure, wherein associating one or more weights comprises:
  
  associating a first weight to the specific un-normalized phrase when the training phrase has a high degree of match with the specific un-normalized phrase, associating a second weight to the specific un-normalized phrase when the training phrase has a low degree of match with the specific un-normalized phrase; and
  
  training a machine learning model using the one or more un-normalized phrases and the associated one or more weights.
- Dependent claims:
  - 18. The computer program product of claim 17, where associating one or more weights further comprises:
    - assigning a weight to each un-normalized phrase based on a distance measure between each un-normalized phrase and the received training phrase.
  - 19. The computer program product of claim 18, where the distance measure includes determining a distance vector having entries corresponding to the one or more lexicographic normalization rules.
  - 20. The computer program product of claim 17, where training the machine learning model includes:
    - using one or more un-normalized phrases and their associated assigned weights as particular feature functions for the received phrase in a machine learning model.
  - 21. The computer program product of claim 17, where training the machine learning model includes training a language model.
  - 22. The computer program product of claim 17, where training the machine learning model includes training a language identification model.
  - 23. The computer program product of claim 17, where training the machine learning model includes training a statistical machine translation model.
  - 24. The computer program product of claim 17, where identifying the normalized phrase includes identifying a particular chunk of a distributed normalized phrase table including a key value corresponding to the normalized training phrase and searching the identified chunk for the normalized training phrase.
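The chunked lookup in claims 12 and 24 might look something like the sketch below. The claims do not say how keys are assigned to chunks, so hash partitioning is an assumption here (a range partition over sorted keys would be another option); `chunk_id`, `build_chunks`, `lookup`, and `NUM_CHUNKS` are hypothetical names.

```python
import hashlib

NUM_CHUNKS = 4  # illustrative chunk count

def chunk_id(key: str, num_chunks: int = NUM_CHUNKS) -> int:
    """Pick a chunk from a stable hash of the normalized key (assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_chunks

def build_chunks(table: dict, num_chunks: int = NUM_CHUNKS) -> list:
    """Split a normalized phrase table into per-chunk dictionaries."""
    chunks = [dict() for _ in range(num_chunks)]
    for key, value in table.items():
        chunks[chunk_id(key, num_chunks)][key] = value
    return chunks

def lookup(chunks: list, normalized_phrase: str):
    """Identify the chunk that would hold the key, then search only that chunk."""
    chunk = chunks[chunk_id(normalized_phrase, len(chunks))]
    return chunk.get(normalized_phrase)

table = {
    "new york": ["New York", "NEW  YORK"],
    "boston": ["Boston"],
    "red car": ["Red Cars", "red car"],
}
chunks = build_chunks(table)
```

In a distributed setting each chunk would presumably live on a different machine or file, so `lookup` only needs to contact the one chunk selected by the key.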

25. A system comprising:
- one or more computers configured to perform operations including:
  
  receiving a collection of phrases;
  
  normalizing a plurality of phrases of the collection of phrases, the normalizing being based at least in part on lexicographic normalizing rules;
  
  generating a normalized phrase table including a plurality of key-value pairs, each key-value pair having a key that includes a normalized phrase and a value that includes one or more un-normalized phrases associated with the normalized phrase of the key and one or more parameters associated with each un-normalized phrase, the one or more parameters including a translation corresponding to the normalized phrase and a probability for the translation given the normalized phrase;
  
  receiving a training phrase;
  
  normalizing the training phrase according to one or more lexicographic normalization rules;
  
  locating the normalized training phrase in a normalized phrase table, the normalized phrase table including a plurality of key-value pairs, each key-value pair having a key that includes a normalized phrase and a value that includes one or more un-normalized phrases associated with the normalized phrase of the key and one or more parameters associated with each un-normalized phrase;
  
  associating one or more weights to one or more un-normalized phrases associated with the key-value pair for the identified normalized training phrase in the normalized phrase table based on a relation of each associated un-normalized phrase to the received training phrase; and
  
  determining a degree of match between the received training phrase and a specific un-normalized phrase associated with the located normalized training phrase, the degree of match being determined according to a similarity measure, wherein associating one or more weights comprises:
  
  associating a first weight to the specific un-normalized phrase when the training phrase has a high degree of match with the specific un-normalized phrase, and associating a second weight to the specific un-normalized phrase when the training phrase has a low degree of match with the specific un-normalized phrase.

26. A system comprising:
- one or more computers configured to perform operations including:
  
  receiving a training phrase;
  
  normalizing the training phrase according to one or more lexicographic normalization rules;
  
  locating the normalized training phrase in a normalized phrase table, the normalized phrase table including a plurality of key-value pairs, each key-value pair having a key that includes a normalized phrase and a value that includes one or more un-normalized phrases associated with the normalized phrase of the key and one or more parameters associated with each un-normalized phrase;
  
  associating one or more weights to one or more un-normalized phrases associated with the key-value pair for the identified normalized training phrase in the normalized phrase table based on a relation of each associated un-normalized phrase to the received training phrase;
  
  determining a degree of match between the received training phrase and a specific un-normalized phrase associated with the located normalized training phrase, the degree of match being determined according to a similarity measure, wherein associating one or more weights comprises:
  
  associating a first weight to the specific un-normalized phrase when the training phrase has a high degree of match with the specific un-normalized phrase, associating a second weight to the specific un-normalized phrase when the training phrase has a low degree of match with the specific un-normalized phrase; and
  
  training a machine learning model using the one or more un-normalized phrases and the associated one or more weights.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Och, Franz Josef; Tsochandaridis, Ioannis; Genzel, Dmitriy; Thayer, Ignacio E
Primary Examiner(s): Desir, Pierre-Louis
Assistant Examiner(s): Sirjani, Fariba
Application Number: US12/411,224
Publication Number: US 20130151235A1
Time in Patent Office: 1,616 Days
Field of Search: 704/1-10, 707/601-606
US Class Current: 704/10
CPC Class Codes: G06F 40/237 Lexical tools