Machine translation using vector space representations
First Claim
1. A method, implemented by a computer, for automatically translating text, comprising:
- (a) generating a conceptual representation space based on source-language documents and target-language documents, wherein respective terms from the source-language documents and the target-language documents have a representation in the conceptual representation space, wherein a polysemous term from the source-language documents has a plurality of representations in the conceptual representation space, each representation of the polysemous term corresponding to a sense of the polysemous term, wherein generating a conceptual representation space comprises generating a Latent Semantic Indexing (LSI) space;
(b) representing a new source-language document in the conceptual representation space, wherein a subset of terms in the new source-language document is represented in the conceptual representation space, such that each term in the subset has a representation in the conceptual representation space;
(c) comparing the plurality of representations of the polysemous term to the representation of a first term in the new source-language document to identify one sense of the polysemous term that is similar to the first term above a threshold;
(d) automatically translating the first term in the new source-language document into a corresponding target-language term based on the one sense of the polysemous term identified in step (c) to provide a machine translation of the new source-language document;
wherein step (d) comprises (d1) subdividing sentences that are longer than a threshold into logically coherent segments, and (d2) automatically translating a term in one of the logically coherent segments into a corresponding target-language term contained in at least one target-language document based on a similarity between the representation of the term and the representation of the corresponding target-language term;
(e) identifying an idiomatic expression contained in at least one of the source-language documents, wherein identifying an idiomatic expression contained in at least one of the source-language documents comprises:
identifying at least one candidate sequence of words;
generating (i) a representation of the at least one candidate sequence of words in the conceptual representation space, and (ii) a representation of each word in the at least one candidate sequence of words in the conceptual representation space;
comparing the representation of the at least one candidate sequence of words with the representation of each word in the at least one candidate sequence of words to determine a difference thereof; and
identifying the at least one candidate sequence of words as an idiomatic expression if the difference is greater than a threshold;
(f) generating a representation of the idiomatic expression in the conceptual representation space; and
(g) automatically translating the idiomatic expression into a target-language term or expression based on a similarity between the representation of the idiomatic expression and the representation of the target-language term.
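The idiom test of step (e) can be illustrated with a minimal sketch. It assumes that vectors for each word and for each candidate word sequence are already available from the conceptual representation space; the toy vectors, the names `WORD_VECTORS`, `PHRASE_VECTORS` and `is_idiom`, the use of mean cosine distance as the "difference", and the threshold value 0.5 are all illustrative assumptions, not details taken from the claim.

```python
import math

# Hypothetical toy vectors standing in for representations in the
# conceptual representation space (not data from the patent).
WORD_VECTORS = {
    "kick": (0.9, 0.1, 0.0),
    "bucket": (0.1, 0.9, 0.0),
    "red": (0.8, 0.2, 0.1),
    "car": (0.2, 0.8, 0.1),
}
PHRASE_VECTORS = {
    ("kick", "bucket"): (0.05, 0.1, 0.95),  # meaning far from both words
    ("red", "car"): (0.5, 0.5, 0.1),        # meaning close to both words
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def phrase_word_difference(words, phrase_vectors, word_vectors):
    """Mean cosine distance between a phrase vector and its word vectors.

    Using the mean is an assumption; the claim only requires "a difference".
    """
    phrase_vec = phrase_vectors[words]
    distances = [1.0 - cosine(phrase_vec, word_vectors[w]) for w in words]
    return sum(distances) / len(distances)


def is_idiom(words, phrase_vectors, word_vectors, threshold=0.5):
    """Flag the candidate sequence as idiomatic if the difference exceeds the threshold."""
    return phrase_word_difference(words, phrase_vectors, word_vectors) > threshold
```

With these toy vectors, the sequence ("kick", "bucket") is flagged as idiomatic because its vector lies far from the vectors of its constituent words, while ("red", "car") is not.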
Abstract
An embodiment of the present invention provides a method for automatically translating text. First, a conceptual representation space is generated based on source-language documents and target-language documents, wherein respective terms from the source-language and target-language documents have a representation in the conceptual representation space. Second, a new source-language document is represented in the conceptual representation space, wherein a subset of terms in the new source-language document is represented in the conceptual representation space, such that each term in the subset has a representation in the conceptual representation space. Then, a term in the new source-language document is automatically translated into a corresponding target-language term based on a similarity between the representation of the term and the representation of the corresponding target-language term.
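The flow described in the abstract can be sketched as follows, under several assumptions that are not stated in the source: the training documents are aligned source/target pairs merged into single columns of a term-document count matrix, the space is obtained by a truncated singular value decomposition using NumPy, and similarity is measured by cosine. The toy English/French corpus, the function names, and the choice of `k=4` are hypothetical, and the sketch assumes no word string is shared between the two languages.

```python
import numpy as np

# Hypothetical toy parallel corpus: each training document pairs an
# English text with its French translation.
PARALLEL_DOCS = [
    ("dog cat", "chien chat"),
    ("dog bone", "chien os"),
    ("cat milk", "chat lait"),
    ("car road", "voiture route"),
    ("car wheel", "voiture roue"),
]


def build_lsi_space(parallel_docs, k):
    """Return ({term: k-dimensional vector}, target vocabulary list)."""
    source_vocab = sorted({w for src, _ in parallel_docs for w in src.split()})
    target_vocab = sorted({w for _, tgt in parallel_docs for w in tgt.split()})
    vocab = source_vocab + target_vocab
    row = {term: i for i, term in enumerate(vocab)}

    # Term-document count matrix; each column merges a source text and its translation.
    counts = np.zeros((len(vocab), len(parallel_docs)))
    for j, (src, tgt) in enumerate(parallel_docs):
        for w in src.split() + tgt.split():
            counts[row[w], j] += 1.0

    # Truncated SVD: keep the k largest singular values.
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    term_vectors = u[:, :k] * s[:k]
    return {t: term_vectors[row[t]] for t in vocab}, target_vocab


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def translate_term(term, vectors, target_vocab):
    """Return the target-language term most similar to `term`, or None if unknown."""
    if term not in vectors:
        return None
    return max(target_vocab, key=lambda t: cosine(vectors[term], vectors[t]))


def translate_document(text, vectors, target_vocab):
    """Translate term by term; terms missing from the space are left unchanged."""
    out = []
    for w in text.split():
        translated = translate_term(w, vectors, target_vocab)
        out.append(translated if translated is not None else w)
    return " ".join(out)
```

In this toy corpus each source term co-occurs with exactly one target term in every document, so the pair receives the same vector and the nearest-neighbor lookup recovers it. Real corpora would need term weighting, a much larger `k`, and handling of word order, none of which this sketch attempts.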
53 Citations
6 Claims
4. A computer program product comprising a computer readable storage medium having control logic stored therein for causing a computer to automatically translate text, the control logic comprising:
computer readable first program code that causes the computer to generate a conceptual representation space based on source-language documents and target-language documents, wherein respective terms from the source-language documents and the target-language documents have a representation in the conceptual representation space, wherein a polysemous term from the source-language documents has a plurality of representations in the conceptual representation space, each representation of the polysemous term corresponding to a sense of the polysemous term, wherein the conceptual representation space is a Latent Semantic Indexing (LSI) space;
computer readable second program code that causes the computer to represent a new source-language document in the conceptual representation space, wherein a subset of terms in the new source-language document is represented in the conceptual representation space, such that each term in the subset has a representation in the conceptual representation space;
computer readable third program code that causes the computer to compare the plurality of representations of the polysemous term to the representation of a first term in the new source-language document to identify one sense of the polysemous term that is similar to the first term above a threshold;
computer readable fourth program code that causes the computer to automatically translate the first term in the new source-language document into a corresponding target-language term based on the one sense of the polysemous term identified by the computer readable third program code to provide a machine translation of the new source-language document;
wherein the computer readable fourth program code comprises computer readable eighth program code that causes the computer to subdivide sentences that are longer than a threshold into logically coherent segments, and computer readable ninth program code that causes the computer to automatically translate a term in one of the logically coherent segments into a corresponding target-language term contained in at least one target-language document based on a similarity between the representation of the term and the representation of the corresponding target-language term;
computer readable fifth program code that causes the computer to identify an idiomatic expression contained in at least one of the source-language documents, wherein the computer readable fifth program code comprises:
code that causes the computer to identify at least one candidate sequence of words;
code that causes the computer to generate (i) a representation of the at least one candidate sequence of words in the conceptual representation space, and (ii) a representation of each word in the at least one candidate sequence of words in the conceptual representation space;
code that causes the computer to compare the representation of the at least one candidate sequence of words with the representation of each word in the at least one candidate sequence of words to determine a difference thereof; and
code that causes the computer to identify the at least one candidate sequence of words as an idiomatic expression if the difference is greater than a threshold;
computer readable sixth program code that causes the computer to generate a representation of the idiomatic expression in the conceptual representation space; and
computer readable seventh program code that causes the computer to automatically translate the idiomatic expression into a target-language term or expression based on a similarity between the representation of the idiomatic expression and the representation of the target-language term.
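The sense-selection behavior attributed to the third and fourth program code can be sketched in a few lines. The sketch assumes each sense of a polysemous term already has its own vector and an associated target-language term, and that the term occurring in the new document has a vector of its own (for example, one derived from its context). The `Sense` type, the two-dimensional toy vectors, the French glosses, and the threshold of 0.8 are hypothetical choices made only for illustration.

```python
import math
from typing import NamedTuple, Optional, Sequence, Tuple


class Sense(NamedTuple):
    label: str                 # human-readable sense description
    vector: Tuple[float, ...]  # representation of this sense in the space
    target_term: str           # target-language term associated with this sense


# Hypothetical sense inventory for the English term "bank".
BANK_SENSES = [
    Sense("financial institution", (0.9, 0.1), "banque"),
    Sense("river edge", (0.1, 0.9), "rive"),
]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_sense(term_vector, senses: Sequence[Sense], threshold=0.8) -> Optional[Sense]:
    """Return the most similar sense if its similarity exceeds the threshold, else None."""
    best = max(senses, key=lambda s: cosine(term_vector, s.vector))
    return best if cosine(term_vector, best.vector) > threshold else None


def translate_with_sense(term_vector, senses, threshold=0.8) -> Optional[str]:
    """Translate using the selected sense; None means no sense was similar enough."""
    sense = select_sense(term_vector, senses, threshold)
    return sense.target_term if sense is not None else None
```

A term vector that falls between the senses, such as (0.5, 0.5) here, matches neither sense above the threshold, so the sketch returns `None` and leaves the fallback behavior to the caller.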
Specification