Machine translation using vector space representations

  • US 7,765,098 B2
  • Filed: 04/24/2006
  • Issued: 07/27/2010
  • Est. Priority Date: 04/26/2005
  • Status: Active Grant
First Claim

1. A method, implemented by a computer, for automatically translating text, comprising:

    (a) generating a conceptual representation space based on source-language documents and target-language documents, wherein respective terms from the source-language documents and the target-language documents have a representation in the conceptual representation space, wherein a polysemous term from the source-language documents has a plurality of representations in the conceptual representation space, each representation of the polysemous term corresponding to a sense of the polysemous term, wherein generating a conceptual representation space comprises generating a Latent Semantic Indexing (LSI) space;

    (b) representing a new source-language document in the conceptual representation space, wherein a subset of terms in the new source-language document is represented in the conceptual representation space, such that each term in the subset has a representation in the conceptual representation space;

    (c) comparing the plurality of representations of the polysemous term to the representation of a first term in the new source-language document to identify one sense of the polysemous term that is similar to the first term above a threshold;

    (d) automatically translating the first term in the new source-language document into a corresponding target-language term based on the one sense of the polysemous term identified in step (c) to provide a machine translation of the new source-language document;

    wherein step (d) comprises (d1) subdividing sentences that are longer than a threshold into logically coherent segments, and (d2) automatically translating a term in one of the logically coherent segments into a corresponding target-language term contained in at least one target-language document based on a similarity between the representation of the term and the representation of the corresponding target-language term;

    (e) identifying an idiomatic expression contained in at least one of the source-language documents, wherein identifying an idiomatic expression contained in at least one of the source-language documents comprises:

    identifying at least one candidate sequence of words;

    generating (i) a representation of the at least one candidate sequence of words in the conceptual representation space, and (ii) a representation of each word in the at least one candidate sequence of words in the conceptual representation space;

    comparing the representation of the at least one candidate sequence of words with the representation of each word in the at least one candidate sequence of words to determine a difference thereof; and

    identifying the at least one candidate sequence of words as an idiomatic expression if the difference is greater than a threshold;

    (f) generating a representation of the idiomatic expression in the conceptual representation space; and

    (g) automatically translating the idiomatic expression into a target-language term or expression based on a similarity between the representation of the idiomatic expression and the representation of the target-language term.
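
The comparisons recited in steps (c), (d), (e) and (g) can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes the LSI space of step (a) has already been built, and it stands in for that space with hand-written three-dimensional toy vectors. The function names, the example terms, the use of cosine similarity, and the choice of mean cosine distance as the "difference" in step (e) are all assumptions made for illustration; the claim does not specify any of them.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_sense(sense_vectors, term_vector, threshold):
    """Step (c), sketched: return the sense most similar to term_vector,
    or None if no sense is similar above the threshold."""
    best_label, best_sim = None, threshold
    for label, vec in sense_vectors.items():
        sim = cosine(vec, term_vector)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label


def nearest_target_term(vector, target_vectors):
    """Steps (d) and (g), sketched: return the target-language term whose
    vector is most similar to the given source-side vector."""
    return max(target_vectors, key=lambda term: cosine(vector, target_vectors[term]))


def is_idiom(phrase_vector, word_vectors, threshold):
    """Step (e), sketched: flag a candidate word sequence as idiomatic when its
    vector lies far from the vectors of its constituent words.
    Assumption: 'difference' is the mean cosine distance to the words."""
    if not word_vectors:
        return False
    distances = [1.0 - cosine(phrase_vector, w) for w in word_vectors]
    return sum(distances) / len(distances) > threshold


# Hypothetical toy vectors standing in for an LSI space (not real data).
BANK_SENSES = {"bank/finance": [0.9, 0.1, 0.0], "bank/river": [0.1, 0.9, 0.0]}
LOAN = [0.8, 0.2, 0.1]  # a context term from the new source-language document
FRENCH_TERMS = {"banque": [0.85, 0.15, 0.0], "rive": [0.15, 0.85, 0.0]}

sense = select_sense(BANK_SENSES, LOAN, threshold=0.5)
translation = nearest_target_term(BANK_SENSES[sense], FRENCH_TERMS)

# "kick the bucket" sits far from "kick" and "bucket"; "red bucket" does not.
kick, bucket, red = [0.9, 0.1, 0.1], [0.1, 0.9, 0.1], [0.5, 0.5, 0.0]
idiomatic = is_idiom([0.0, 0.1, 0.9], [kick, bucket], threshold=0.5)
literal = is_idiom([0.2, 0.8, 0.1], [red, bucket], threshold=0.5)
```

With these toy vectors, `sense` is `"bank/finance"` and `translation` is `"banque"`, while `idiomatic` is `True` and `literal` is `False`. In a real system the vectors would come from a truncated singular value decomposition of a term-document matrix built over the source-language and target-language documents, and the thresholds would need to be tuned empirically.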
