Word vector processing for foreign languages
Abstract
A word vector processing method is provided. Word segmentation is performed on a corpus to obtain words, and n-gram strokes corresponding to the words are determined. Each n-gram stroke represents n successive strokes of a corresponding word. Word vectors of the words and stroke vectors of the n-gram strokes corresponding to the words are initialized. After the word segmentation is performed, the n-gram strokes are determined, and the word vectors and stroke vectors are initialized, the word vectors and the stroke vectors are trained.
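The preparation steps summarized in the abstract can be sketched roughly as follows. This is a minimal Python illustration, not the patented implementation: the function names `ngram_strokes` and `init_vectors`, the integer stroke IDs, the placeholder words `w1` and `w2`, the vector dimension, and the uniform initialization range are all assumptions made for the example. Word segmentation is assumed to have already produced the words.

```python
import random
from typing import Dict, Hashable, Iterable, List, Sequence, Tuple


def ngram_strokes(strokes: Sequence[int], n: int) -> List[Tuple[int, ...]]:
    """Return every run of n successive strokes of a word, in order."""
    return [tuple(strokes[i:i + n]) for i in range(len(strokes) - n + 1)]


def init_vectors(keys: Iterable[Hashable], dim: int,
                 rng: random.Random) -> Dict[Hashable, List[float]]:
    """Assign each key a small random vector (initialization scheme is an assumption)."""
    return {k: [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for k in keys}


# Hypothetical output of word segmentation plus a stroke lookup:
# each word maps to its ordered stroke IDs. The IDs are placeholders.
word_strokes = {"w1": [1, 2, 3, 4], "w2": [2, 3, 4]}

# n-gram strokes per word, here with n = 3.
grams = {w: ngram_strokes(s, 3) for w, s in word_strokes.items()}

rng = random.Random(0)
word_vecs = init_vectors(word_strokes, dim=8, rng=rng)
# One stroke vector per distinct n-gram stroke, shared across words.
stroke_vecs = init_vectors({g for gs in grams.values() for g in gs}, dim=8, rng=rng)
```

In this sketch an n-gram stroke that occurs in several words gets a single shared stroke vector, which is one plausible reading of "stroke vectors of the n-gram strokes corresponding to the words".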
21 Claims
1. A word vector processing method, comprising:
performing word segmentation on a corpus to obtain words;
determining n-gram strokes corresponding to the words, the n-gram stroke representing n successive strokes of a corresponding word;
initializing word vectors of the words and stroke vectors of the n-gram strokes corresponding to the words; and
after performing the word segmentation, determining the n-gram strokes, and initializing the word vectors and stroke vectors, training the word vectors and the stroke vectors, wherein the training the word vectors and the stroke vectors comprises:
determining a designated word in the corpus, and one or more context words of the designated word in the corpus;
determining a degree of similarity between the designated word and the context word according to stroke vectors of n-gram strokes corresponding to the designated word as well as a word vector of the context word;
selecting one or more words from the words as a negative sample word;
determining a degree of similarity between the designated word and each negative sample word;
determining a loss characterization value corresponding to the designated word according to a designated loss function, the degree of similarity between the designated word and the context word, and the degree of similarity between the designated word and each negative sample word; and
updating the word vector of the context word and the stroke vectors of the n-gram strokes corresponding to the designated word according to the loss characterization value.
8. A non-transitory, computer-readable medium storing one or more instructions executable by a computer system to perform operations comprising:
performing word segmentation on a corpus to obtain words;
determining n-gram strokes corresponding to the words, the n-gram stroke representing n successive strokes of a corresponding word;
initializing word vectors of the words and stroke vectors of the n-gram strokes corresponding to the words; and
after performing the word segmentation, determining the n-gram strokes, and initializing the word vectors and stroke vectors, training the word vectors and the stroke vectors, wherein the training the word vectors and the stroke vectors comprises:
determining a designated word in the corpus, and one or more context words of the designated word in the corpus;
determining a degree of similarity between the designated word and the context word according to stroke vectors of n-gram strokes corresponding to the designated word as well as a word vector of the context word;
selecting one or more words from the words as a negative sample word;
determining a degree of similarity between the designated word and each negative sample word;
determining a loss characterization value corresponding to the designated word according to a designated loss function, the degree of similarity between the designated word and the context word, and the degree of similarity between the designated word and each negative sample word; and
updating the word vector of the context word and the stroke vectors of the n-gram strokes corresponding to the designated word according to the loss characterization value.
16. A computer-implemented system, comprising:
one or more computers; and
one or more computer memory devices interoperably coupled with the one or more computers and having tangible, non-transitory, machine-readable media storing one or more instructions that, when executed by the one or more computers, perform one or more operations comprising:
performing word segmentation on a corpus to obtain words;
determining n-gram strokes corresponding to the words, the n-gram stroke representing n successive strokes of a corresponding word;
initializing word vectors of the words and stroke vectors of the n-gram strokes corresponding to the words; and
after performing the word segmentation, determining the n-gram strokes, and initializing the word vectors and stroke vectors, training the word vectors and the stroke vectors, wherein the training the word vectors and the stroke vectors comprises:
determining a designated word in the corpus, and one or more context words of the designated word in the corpus;
determining a degree of similarity between the designated word and the context word according to stroke vectors of n-gram strokes corresponding to the designated word as well as a word vector of the context word;
selecting one or more words from the words as a negative sample word;
determining a degree of similarity between the designated word and each negative sample word;
determining a loss characterization value corresponding to the designated word according to a designated loss function, the degree of similarity between the designated word and the context word, and the degree of similarity between the designated word and each negative sample word; and
updating the word vector of the context word and the stroke vectors of the n-gram strokes corresponding to the designated word according to the loss characterization value.
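The training steps recited in the independent claims can be illustrated with a short Python sketch. Several choices here are assumptions, since the claims leave them open: the degree of similarity is taken as the sum of dot products between each n-gram stroke vector of the designated word and the other word's vector, the "designated loss function" is taken to be a negative-sampling log-sigmoid loss, and negatives are drawn uniformly. The names `similarity`, `sample_negatives`, and `train_step` are hypothetical. Following the claim wording, only the context word vector and the designated word's stroke vectors are updated.

```python
import math
import random
from typing import Dict, List, Sequence, Set, Tuple

Gram = Tuple[int, ...]
Vec = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def similarity(gram_keys: Sequence[Gram], word: str,
               stroke_vecs: Dict[Gram, Vec], word_vecs: Dict[str, Vec]) -> float:
    """Assumed similarity: sum of dot products of stroke vectors with a word vector."""
    w = word_vecs[word]
    return sum(sum(a * b for a, b in zip(stroke_vecs[g], w)) for g in gram_keys)


def sample_negatives(vocab: Sequence[str], exclude: Set[str], k: int,
                     rng: random.Random) -> List[str]:
    """Draw up to k negative sample words uniformly (sampling scheme is an assumption)."""
    candidates = [w for w in vocab if w not in exclude]
    return rng.sample(candidates, min(k, len(candidates)))


def train_step(designated: str, context: str, negatives: Sequence[str],
               grams: Dict[str, List[Gram]], stroke_vecs: Dict[Gram, Vec],
               word_vecs: Dict[str, Vec], lr: float = 0.05) -> float:
    """One gradient step; returns the loss value computed before the update."""
    gram_keys = grams[designated]
    dim = len(word_vecs[context])
    # Sum of the designated word's stroke vectors, taken before any update.
    q = [sum(stroke_vecs[g][i] for g in gram_keys) for i in range(dim)]
    stroke_grad = [0.0] * dim
    ctx_err = 0.0
    loss = 0.0
    for word, label in [(context, 1.0)] + [(w, 0.0) for w in negatives]:
        p = sigmoid(similarity(gram_keys, word, stroke_vecs, word_vecs))
        loss -= math.log(p) if label == 1.0 else math.log(1.0 - p)
        err = p - label  # derivative of the loss with respect to the similarity
        if label == 1.0:
            ctx_err = err
        for i in range(dim):
            stroke_grad[i] += err * word_vecs[word][i]
    # Update the context word vector.
    ctx = word_vecs[context]
    for i in range(dim):
        ctx[i] -= lr * ctx_err * q[i]
    # Update every stroke vector of the designated word.
    for g in gram_keys:
        for i in range(dim):
            stroke_vecs[g][i] -= lr * stroke_grad[i]
    return loss
```

A caller would pick each designated word in the corpus, iterate over its context words, call `sample_negatives` with the designated and context words excluded, and then call `train_step`. Negative sample word vectors are left untouched in this sketch because the claims recite updates only for the context word vector and the stroke vectors.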
Specification