Computing numeric representations of words in a high-dimensional space

US 9,740,680 B1
Filed: 05/18/2015
Issued: 08/22/2017
Est. Priority Date: 01/15/2013
Status: Active Grant

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for computing numeric representations of words. One of the methods includes obtaining a set of training data, wherein the set of training data comprises sequences of words; training a classifier and an embedding function on the set of training data, wherein training the embedding function comprises obtaining trained values of the embedding function parameters; processing each word in the vocabulary using the embedding function in accordance with the trained values of the embedding function parameters to generate a respective numerical representation of each word in the vocabulary in the high-dimensional space; and associating each word in the vocabulary with the respective numeric representation of the word in the high-dimensional space.
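
The last two steps of the abstract (processing each vocabulary word with the trained embedding function, then associating each word with its numeric representation) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `TRAINED_PARAMS`, `embed`, `build_data_set`, `store_data_set`, and `load_data_set` are hypothetical, the three-dimensional vectors are made-up toy values, and JSON is just one possible storage format, not one the patent specifies.

```python
import json
import os
import tempfile

# Hypothetical trained embedding parameters (toy values, 3 dimensions).
# In practice these would come out of the training step.
TRAINED_PARAMS = {
    "king": [0.50, 0.10, -0.30],
    "queen": [0.45, 0.20, -0.25],
    "apple": [-0.40, 0.70, 0.10],
}

def embed(word):
    """Stand-in embedding function: look up the word's trained vector."""
    return TRAINED_PARAMS[word]

def build_data_set(vocabulary):
    """Associate each word in the vocabulary with its numeric representation."""
    return {word: list(embed(word)) for word in vocabulary}

def store_data_set(data_set, path):
    """Write the word-to-vector mapping to a file (JSON is an assumption)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data_set, handle)

def load_data_set(path):
    """Read a stored word-to-vector mapping back into memory."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

data_set = build_data_set(sorted(TRAINED_PARAMS))
path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
store_data_set(data_set, path)
```

Calling `load_data_set(path)` afterwards returns the same mapping that was stored.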

16 Claims

1. One or more non-transitory computer storage media encoded with a data set, the data set associating each word in a vocabulary of words with a respective numeric representation of the word in a high-dimensional space, wherein the data set indicates, for each word of a plurality of the words in the vocabulary and by the position of the numeric representation of the word in the high-dimensional space, a semantic meaning of the word, wherein the data set indicates, for each of a plurality of pairs of words in the vocabulary and by the relative positions of the numeric representations of the words in the high-dimensional space, a degree of semantic relationship, syntactic relationship, or both between the words in the pair of words, whereby the non-transitory computer storage media, when encoded with the data set, provides the function of representing in a quantitative way semantic and syntactic relationships between and among words in the vocabulary, and wherein the one or more non-transitory computer storage media are encoded with the data set by a process comprising the steps of:
- obtaining a set of training data, wherein the set of training data comprises sequences of words;
  
  training a plurality of classifiers and an embedding function on the set of training data, wherein the embedding function receives an input word and maps the input word to a numeric representation in the high-dimensional space in accordance with a set of embedding function parameters, wherein each of the classifiers corresponds to a respective position surrounding the input word in a sequence of words, and wherein each of the classifiers processes the numeric representation of the input word to generate a respective word score for each word in a pre-determined set of words, wherein each of the respective word scores represents a predicted likelihood that the corresponding word will be found in the corresponding position relative to the input word, and wherein training the embedding function comprises determining trained values of the embedding function parameters;
  
  processing each word in the vocabulary using the embedding function in accordance with the trained values of the embedding function parameters to generate a respective numeric representation of each word in the vocabulary;
  
  generating the data set by associating each word in the vocabulary with the respective numeric representation of the word in the high-dimensional space; and
  
  storing the data set on the one or more non-transitory computer storage media.
- 2. The computer storage media of claim 1, wherein the numeric representations are continuous representations represented using floating-point numbers.
- 3. The computer storage media of claim 1, wherein positions of numeric representations in the high-dimensional space reflect semantic similarities between words represented by the numeric representations.
- 4. The computer storage media of claim 1, wherein positions of numeric representations in the high-dimensional space reflect syntactic similarities between words represented by the numeric representations.
- 5. The computer storage media of claim 1, wherein the embedding function maps the input word to a floating point vector.
- 6. The computer storage media of claim 1, wherein training the plurality of classifiers and the embedding function comprises performing a backpropagation training technique to determine the trained values of the embedding function parameters.
- 7. The computer storage media of claim 1, wherein a dimensionality of the high-dimensional space is on the order of one thousand.
- 8. The computer storage media of claim 1, wherein each of the plurality of classifiers comprises a respective set of classifier parameters, and wherein training the plurality of classifiers and the embedding function on the set of training data comprises determining trained values of each of the sets of classifier parameters.
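
The training arrangement recited in claims 1 through 8 (an embedding function plus one classifier per position around the input word, trained with backpropagation) can be sketched as follows. This is a toy illustration under stated assumptions, not the patented implementation: the function names, the softmax classifier, the plain stochastic gradient descent update, the two offsets (-1 and +1), the 8-dimensional space, the learning rate, and the tiny corpus are all choices made here for readability. The claims mention a dimensionality on the order of one thousand; 8 is used only to keep the example fast.

```python
import math
import random

def softmax(scores):
    """Turn raw word scores into predicted likelihoods that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(a, b):
    """Cosine similarity between two vectors (one way to compare positions)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def train_position_classifiers(sequences, offsets=(-1, 1), dim=8,
                               epochs=200, lr=0.1, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for seq in sequences for w in seq})
    index = {w: i for i, w in enumerate(vocab)}

    def init():
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

    embedding = init()                          # embedding function parameters
    classifiers = {o: init() for o in offsets}  # one classifier per position
    losses = []
    for _ in range(epochs):
        total, count = 0.0, 0
        for seq in sequences:
            for t, word in enumerate(seq):
                for o in offsets:
                    if not 0 <= t + o < len(seq):
                        continue
                    vec = embedding[index[word]]
                    weights = classifiers[o]
                    target = index[seq[t + o]]
                    # Word score for every vocabulary word at this position.
                    probs = softmax([sum(w * v for w, v in zip(row, vec))
                                     for row in weights])
                    total -= math.log(probs[target])
                    count += 1
                    # Backpropagation of the cross-entropy loss.
                    grad = [p - (1.0 if k == target else 0.0)
                            for k, p in enumerate(probs)]
                    grad_vec = [sum(g * row[d] for g, row in zip(grad, weights))
                                for d in range(dim)]
                    for g, row in zip(grad, weights):
                        for d in range(dim):
                            row[d] -= lr * g * vec[d]
                    for d in range(dim):
                        vec[d] -= lr * grad_vec[d]
        losses.append(total / count)
    # Process each vocabulary word with the trained embedding function.
    data_set = {w: list(embedding[index[w]]) for w in vocab}
    return data_set, classifiers, losses

sequences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]
data_set, classifiers, losses = train_position_classifiers(sequences)
print("first/last epoch loss:", losses[0], losses[-1])
print("cosine(cat, dog):", cosine(data_set["cat"], data_set["dog"]))
```

Each entry of `classifiers` holds its own parameter set, as in claim 8, and the returned `data_set` maps every vocabulary word to a floating-point vector, as in claims 2 and 5. On a corpus this small the similarity values are not meaningful; the sketch only shows the mechanics.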

9. One or more non-transitory computer storage media encoded with a data set, the data set associating each word in a vocabulary of words with a respective numeric representation of the word in a high-dimensional space, wherein the data set indicates, for each word of a plurality of the words in the vocabulary and by the position of the numeric representation of the word in the high-dimensional space, a semantic meaning of the word, wherein the data set indicates, for each of a plurality of pairs of words in the vocabulary and by the relative positions of the numeric representations of the words in the high-dimensional space, a degree of semantic relationship, syntactic relationship, or both between the words in the pair of words, whereby the non-transitory computer storage media, when encoded with the data set, provides the function of representing in a quantitative way semantic and syntactic relationships between and among words in the vocabulary, and wherein the one or more non-transitory computer storage media are encoded with the data set by a process comprising the steps of:
- obtaining a set of training data, wherein the set of training data comprises sequences of words;
  
  training a classifier and an embedding function on the set of training data, wherein the embedding function receives a plurality of words surrounding an unknown word in a sequence of words and maps the plurality of words into a numeric representation in accordance with a set of embedding function parameters, wherein the classifier processes the numeric representation of the sequence of words to generate a respective word score for each word in a pre-determined set of words, wherein each of the respective word scores measures a predicted likelihood that the corresponding word is the unknown word, and wherein training the embedding function comprises determining trained values of the embedding function parameters;
  
  processing each word in the vocabulary using the embedding function in accordance with the trained values of the embedding function parameters to generate a respective numeric representation of each word in the vocabulary in the high-dimensional space;
  
  generating the data set by associating each word in the vocabulary with the respective numeric representation of the word in the high-dimensional space; and
  
  storing the data set on the one or more non-transitory computer storage media.
- 10. The computer storage media of claim 9, wherein the numeric representations are continuous representations represented using floating-point numbers.
- 11. The computer storage media of claim 9, wherein positions of numeric representations in the high-dimensional space reflect semantic similarities between words represented by the numeric representations.
- 12. The computer storage media of claim 9, wherein positions of numeric representations in the high-dimensional space reflect syntactic similarities between words represented by the numeric representations.
- 13. The computer storage media of claim 9, wherein the embedding function maps each of the plurality of words to a respective floating point vector and outputs a single merged vector that is a combination of the respective floating point vectors.
- 14. The computer storage media of claim 9, wherein training the classifier and the embedding function comprises performing a backpropagation training technique to determine the trained values of the embedding function parameters.
- 15. The computer storage media of claim 9, wherein a dimensionality of the high-dimensional space is on the order of one thousand.
- 16. The computer storage media of claim 9, wherein the classifier has a set of classifier parameters, and wherein training the plurality of classifiers and the embedding function on the set of training data comprises determining trained values of the set of classifier parameters.
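
The variant in claims 9 through 16 differs in direction: the embedding function receives the words surrounding an unknown word, merges their vectors, and a single classifier scores every candidate for the unknown word. A minimal sketch follows, under stated assumptions: the merge is taken to be an average (claim 13 only says "a combination"), the classifier is a softmax layer trained with plain stochastic gradient descent, and the function names, window size, dimensionality, learning rate, and toy corpus are all hypothetical choices for illustration.

```python
import math
import random

def softmax(scores):
    """Turn raw word scores into predicted likelihoods that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_context_classifier(sequences, window=1, dim=8,
                             epochs=200, lr=0.1, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for seq in sequences for w in seq})
    index = {w: i for i, w in enumerate(vocab)}

    def init():
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

    embedding = init()   # embedding function parameters
    classifier = init()  # parameters of the single classifier
    losses = []
    for _ in range(epochs):
        total, count = 0.0, 0
        for seq in sequences:
            for t, word in enumerate(seq):
                lo, hi = max(0, t - window), min(len(seq), t + window + 1)
                context = [index[seq[j]] for j in range(lo, hi) if j != t]
                if not context:
                    continue
                # Merge the context vectors (averaging is an assumption).
                merged = [sum(embedding[c][d] for c in context) / len(context)
                          for d in range(dim)]
                target = index[word]  # the "unknown" word to be predicted
                probs = softmax([sum(w * m for w, m in zip(row, merged))
                                 for row in classifier])
                total -= math.log(probs[target])
                count += 1
                # Backpropagation of the cross-entropy loss.
                grad = [p - (1.0 if k == target else 0.0)
                        for k, p in enumerate(probs)]
                grad_merged = [sum(g * row[d] for g, row in zip(grad, classifier))
                               for d in range(dim)]
                for g, row in zip(grad, classifier):
                    for d in range(dim):
                        row[d] -= lr * g * merged[d]
                for c in context:
                    for d in range(dim):
                        embedding[c][d] -= lr * grad_merged[d] / len(context)
        losses.append(total / count)
    # Process each vocabulary word with the trained embedding function.
    data_set = {w: list(embedding[index[w]]) for w in vocab}
    return data_set, losses

sequences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]
cbow_data_set, cbow_losses = train_context_classifier(sequences)
print("first/last epoch loss:", cbow_losses[0], cbow_losses[-1])
```

Averaging keeps the merged vector on the same scale regardless of how many context words are available at the edges of a sequence; a sum or concatenation would be equally valid readings of "combination".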

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Mikolov, Tomas; Chen, Kai; Corrado, Gregory S.; Dean, Jeffrey A.
Primary Examiner(s): Abebe, Daniel
Application Number: US14/715,421
Time in Patent Office: 827 Days
Field of Search: 704243
CPC Class Codes:
G06F 40/279   Recognition of textual enti...
G06F 40/30   Semantic analysis
G06N 20/00   Machine learning
G10L 15/06   Creation of reference templ...
