Locating parallel word sequences in electronic documents
First Claim
1. A method comprising the following computer-executable acts:
receiving a first electronic document, wherein the first electronic document comprises a first set of word sequences;
receiving a second electronic document, wherein the second electronic document comprises a second set of word sequences, wherein a word sequence pair comprises a word sequence from the first set of word sequences and a word sequence from the second set of word sequences or an empty word sequence, and wherein the second document comprises a hyperlink to the first document;
automatically correlating the first electronic document and the second electronic document based at least in part upon the hyperlink;
assigning a respective label to each word sequence pair to generate a plurality of possible alignments of word sequences in the first set of word sequences with respect to word sequences in the second set of word sequences;
assigning respective scores to a plurality of different alignments, wherein a score is based at least in part upon a plurality of features comprising:
a first distortion feature that indicates a difference between a position of a previously aligned word sequence and a currently aligned word sequence with respect to at least one word sequence in the first set of word sequences and the respective word sequences in the second set of word sequences; and
a second distortion feature that is indicative of a difference between:
an actual position of the currently aligned word sequence in the second electronic document relative to the previously aligned word sequence in the second electronic document; and
an expected position of the currently aligned word sequence in the second electronic document, the expected position being adjacent to the previously aligned word sequence; and
causing a highest score assigned to an alignment amongst all scores assigned to the plurality of different alignments to be stored in a data repository, wherein the score is indicative of an amount of parallelism between word sequences aligned in the alignment.
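The scoring acts recited in claim 1 can be illustrated with a minimal Python sketch. This is one reading of the claim language, not the patented implementation: the names `distortion_features`, `score`, and `best_alignment`, the `similarity` matrix, the weights, and the linear scoring formula are all assumptions introduced here. The sketch treats an alignment as one label per word sequence in the first document (an index into the second document, or `None` for the empty word sequence), enumerates every labeling by brute force, and keeps the highest-scoring one.

```python
from itertools import product
from typing import Optional, Sequence, Tuple

# One label per word sequence in the first document:
# an index into the second document, or None for the empty word sequence.
Alignment = Tuple[Optional[int], ...]


def distortion_features(alignment: Alignment) -> Tuple[int, int]:
    """Sum two distortion measures over consecutive non-empty labels.

    jump: distance between the previously and currently aligned positions.
    gap:  distance between the actual current position and the expected
          one, taken here to be the slot right after the previous position.
    """
    jump_total = 0
    gap_total = 0
    prev = None
    for cur in alignment:
        if cur is None:  # aligned to the empty word sequence
            continue
        if prev is not None:
            jump_total += abs(cur - prev)
            gap_total += abs(cur - (prev + 1))
        prev = cur
    return jump_total, gap_total


def score(alignment: Alignment, similarity: Sequence[Sequence[float]],
          w_jump: float = 0.1, w_gap: float = 0.2) -> float:
    """Hypothetical linear score: content similarity minus distortion penalties."""
    sim = sum(similarity[i][j] for i, j in enumerate(alignment) if j is not None)
    jump, gap = distortion_features(alignment)
    return sim - w_jump * jump - w_gap * gap


def best_alignment(similarity: Sequence[Sequence[float]]) -> Tuple[float, Alignment]:
    """Enumerate every labeling (exponential; for illustration only)."""
    n_targets = len(similarity[0]) if similarity else 0
    labels = [None, *range(n_targets)]
    candidates = product(labels, repeat=len(similarity))
    return max(((score(a, similarity), a) for a in candidates), key=lambda t: t[0])
```

The returned score is the value that claim 1 would have stored in a data repository. The brute-force enumeration and the fact that two source sequences may share a target are simplifications made to keep the example short.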
Abstract
Systems and methods for automatically extracting parallel word sequences from comparable corpora are described. Electronic documents, such as web pages belonging to a collaborative online encyclopedia, are analyzed to locate parallel word sequences between electronic documents written in different languages. These parallel word sequences are then used to train a machine translation system that can translate text from one language to another.
20 Claims
9. A computing apparatus, comprising:
a processor; and
a memory that is configured with components that are executable by the processor, the components comprising:
a receiver component that receives:
a first electronic document that comprises a first set of word sequences; and
a second electronic document that comprises a second set of word sequences and a hyperlink to the first electronic document, wherein the first electronic document is automatically correlated with the second electronic document based at least in part upon the hyperlink to the first electronic document in the second electronic document;
a feature extractor component that extracts a plurality of features based on the first electronic document and the second electronic document, the plurality of features comprising:
a first distortion feature that is indicative of a difference between a position of a previously aligned word sequence and a currently aligned word sequence with respect to at least one word sequence in the first set of word sequences and the respective word sequences in the second set of word sequences or an empty word sequence; and
a second distortion feature that is indicative of a difference between:
an actual position of the currently aligned word sequence in the second electronic document relative to the previously aligned word sequence in the second electronic document; and
an expected position of the currently aligned word sequence in the second electronic document, the expected position being adjacent to the previously aligned word sequence; and
a ranker component that outputs a ranked list of word sequence pairs, wherein the word sequence pairs comprise a word sequence in the first set of word sequences and a word sequence in the second set of word sequences, wherein the ranked list of word sequence pairs are ranked in an order based at least in part upon the first distortion feature and the second distortion feature and that is indicative of an amount of parallelism between word sequences in the word sequence pairs.
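The ranker component of claim 9 can be pictured with a short Python sketch. The names `CandidatePair`, `pair_score`, and `rank_pairs`, the `similarity` field, and the weights are hypothetical and introduced only for illustration; the claim requires only that the order depend at least in part on the two distortion features.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CandidatePair:
    source_index: int        # position in the first document
    target_index: int        # position in the second document
    prev_target_index: int   # target position of the previously aligned sequence
    similarity: float        # hypothetical content-similarity feature


def pair_score(pair: CandidatePair, w_jump: float = 0.1, w_gap: float = 0.2) -> float:
    """Assumed linear score: similarity minus the two distortion penalties."""
    # First distortion: distance from the previously aligned position.
    jump = abs(pair.target_index - pair.prev_target_index)
    # Second distortion: distance from the expected (adjacent) position.
    gap = abs(pair.target_index - (pair.prev_target_index + 1))
    return pair.similarity - w_jump * jump - w_gap * gap


def rank_pairs(pairs: List[CandidatePair]) -> List[CandidatePair]:
    """Return the pairs ordered from most to least parallel under pair_score."""
    return sorted(pairs, key=pair_score, reverse=True)
```

Because `sorted` is stable, pairs with equal scores keep their input order.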
16. A computer-readable data storage device comprising instructions that, when executed by a processor, cause the processor to perform acts, comprising:
receiving a first web page that comprises a first set of word sequences in a first language;
receiving a second web page that comprises a second set of word sequences in a second language, wherein the first web page and the second web page are web pages in a collaborative encyclopedia, wherein the first web page and the second web page are directed toward same subject matter, wherein the first web page comprises a first hyperlink to the second web page and the second web page comprises a second hyperlink to the first web page;
automatically correlating the first web page and the second web page based at least in part upon the first hyperlink and the second hyperlink;
assigning a score to a plurality of word sequence pairs, wherein a word sequence pair comprises a word sequence in the first set of word sequences and a word sequence in the second set of word sequences, wherein the score is indicative of an amount of parallelism between word sequences in the word sequence pairs, and wherein the score is assigned based at least in part upon a plurality of features comprising:
a first distortion feature that is indicative of a difference between a position of a previously aligned word sequence and a currently aligned word sequence with respect to at least one word sequence in the first set of word sequences and the respective word sequences in the second set of word sequences or an empty word sequence; and
a second distortion feature that is indicative of a difference between:
an actual position of the currently aligned word sequence in the second electronic document relative to the previously aligned word sequence in the second electronic document; and
an expected position of the currently aligned word sequence in the second electronic document, the expected position being adjacent to the previously aligned word sequence; and
comparing a highest score assigned to a word sequence pair amongst all scores assigned to word sequence pairs that comprise a first word sequence to a threshold value; and
if the highest score is above the threshold value, outputting data that indicates that the word sequences in the word sequence pair that has been assigned the highest score are parallel to one another.
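Two acts specific to claim 16, correlating pages through reciprocal hyperlinks and comparing the best score to a threshold, can be sketched in Python as follows. The function names, the `links` mapping, and the way scores are supplied are assumptions made for this example; the claim does not prescribe a data layout.

```python
from typing import Dict, Optional, Set, Tuple


def mutually_linked(links: Dict[str, Set[str]], page_a: str, page_b: str) -> bool:
    """True when each page carries a hyperlink to the other.

    `links` maps a page identifier to the set of pages it links to.
    """
    return page_b in links.get(page_a, set()) and page_a in links.get(page_b, set())


def parallel_match(source_index: int, scores: Dict[int, float],
                   threshold: float) -> Optional[Tuple[int, int]]:
    """Pick the best-scoring target for one source word sequence.

    `scores` maps a target index to the score of the pair
    (source_index, target index). The pair is reported as parallel only
    if its score is strictly above `threshold`; otherwise None is returned.
    """
    if not scores:
        return None
    best_target = max(scores, key=scores.get)
    if scores[best_target] > threshold:
        return (source_index, best_target)
    return None
```

A caller would typically check `mutually_linked` first and run the scoring and `parallel_match` only on page pairs that pass.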
Specification