Locating parallel word sequences in electronic documents

US 8,560,297 B2
Filed: 06/07/2010
Issued: 10/15/2013
Est. Priority Date: 06/07/2010
Status: Active Grant

Abstract

Systems and methods for automatically extracting parallel word sequences from comparable corpora are described. Electronic documents, such as web pages belonging to a collaborative online encyclopedia, are analyzed to locate parallel word sequences between electronic documents written in different languages. These parallel word sequences are then used to train a machine translation system that can translate text from one language to another.

20 Claims

1. A method comprising the following computer-executable acts:
- receiving a first electronic document, wherein the first electronic document comprises a first set of word sequences;
- receiving a second electronic document, wherein the second electronic document comprises a second set of word sequences, wherein a word sequence pair comprises a word sequence from the first set of word sequences and a word sequence from the second set of word sequences or an empty word sequence, and wherein the second document comprises a hyperlink to the first document;
- automatically correlating the first electronic document and the second electronic document based at least in part upon the hyperlink;
- assigning a respective label to each word sequence pair to generate a plurality of possible alignments of word sequences in the first set of word sequences with respect to word sequences in the second set of word sequences;
- assigning respective scores to a plurality of different alignments, wherein a score is based at least in part upon a plurality of features comprising:
  - a first distortion feature that indicates a difference between a position of a previously aligned word sequence and a currently aligned word sequence with respect to at least one word sequence in the first set of word sequences and the respective word sequences in the second set of word sequences; and
  - a second distortion feature that is indicative of a difference between:
    - an actual position of the currently aligned word sequence in the second electronic document relative to the previously aligned word sequence in the second electronic document; and
    - an expected position of the currently aligned word sequence in the second electronic document, the expected position being adjacent to the previously aligned word sequence; and
- causing a highest score assigned to an alignment amongst all scores assigned to the plurality of different alignments to be stored in a data repository, wherein the score is indicative of an amount of parallelism between word sequences aligned in the alignment.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 17, 19, 20):
  - 2. The method of claim 1, wherein the word sequences in the first set of word sequences are in a first language and the word sequences in the second set of word sequences are in a second language.
  - 3. The method of claim 2, wherein the highest score is utilized to train a statistical machine translation system that is configured to translate text in the first language to text in the second language.
  - 4. The method of claim 2, wherein the first electronic document and the second electronic document are a first web page and a second web page, respectively.
  - 5. The method of claim 4, further comprising:
    - determining that the first web page and the second web page are directed toward the same subject matter based at least in part upon the second electronic document comprising the hyperlink to the first electronic document; and
    - automatically correlating the first web page with the second web page based at least in part upon the determining that the first web page and the second web page are directed toward the same subject matter.
  - 6. The method of claim 5, wherein the first web page and the second web page are from an online collaborative encyclopedia.
  - 7. The method of claim 1, wherein the score is assigned based at least in part upon features derived from word alignments between words in word sequences in the first set of word sequences and words in the respective word sequences in the second set of word sequences.
  - 8. The method of claim 1, wherein the score is assigned based at least in part upon word-level induced lexicon features between words in at least one word sequence in the first set of word sequences and words in the respective word sequences in the second set of word sequences.
  - 17. The method of claim 1, wherein the plurality of features further comprises one or more of a feature derived from word alignments between the first electronic document and the second electronic document, a feature derived from a markup of the first electronic document or the second electronic document, or a word-level induced lexicon feature.
  - 19. The method of claim 4, wherein at least one of the first web page or the second web page is a web page pertaining to news items.
  - 20. The method of claim 1, wherein the plurality of features further comprises an image feature indicating whether a word sequence in the first set of word sequences and a word sequence in the second set of word sequences are both captions of an image.
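
The scoring step in claim 1 can be illustrated with a minimal sketch. This is one possible reading of the claim language, not the patented implementation: the function names, the linear weights `w_first` and `w_second`, the `pair_scores` matrix, and the exhaustive search are all assumptions made for illustration. An alignment is represented as a list in which entry `i` is the index of the second-document word sequence aligned to first-document word sequence `i`, or `None` for the empty word sequence.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

# alignment[i] is the index of the aligned word sequence in the second
# document, or None when sequence i is paired with the empty word sequence.
Alignment = Sequence[Optional[int]]


def distortion_features(alignment: Alignment) -> Tuple[int, int]:
    """Return summed (first, second) distortion values for an alignment.

    Assumed interpretation of the claim:
      first  = |current position - previous position|
      second = |actual jump - expected jump|, where the expected position
               is adjacent to the previous one (expected jump of +1).
    """
    first = 0
    second = 0
    prev = None
    for tgt in alignment:
        if tgt is None:  # empty word sequence: no position to compare
            continue
        if prev is not None:
            jump = tgt - prev
            first += abs(jump)
            second += abs(jump - 1)
        prev = tgt
    return first, second


def score(alignment: Alignment,
          pair_scores: List[List[float]],
          w_first: float = -0.1,
          w_second: float = -0.5) -> float:
    """Hypothetical linear score: pair similarity plus weighted distortion."""
    first, second = distortion_features(alignment)
    similarity = sum(pair_scores[i][t]
                     for i, t in enumerate(alignment) if t is not None)
    return similarity + w_first * first + w_second * second


def best_alignment(pair_scores: List[List[float]]):
    """Enumerate every labeling (exponential; toy inputs only) and keep the
    highest-scoring alignment, mirroring the final act of claim 1."""
    n_src = len(pair_scores)
    n_tgt = len(pair_scores[0])
    labels: List[Optional[int]] = [None] + list(range(n_tgt))
    candidates = product(labels, repeat=n_src)
    return max(((score(a, pair_scores), a) for a in candidates),
               key=lambda item: item[0])
```

With a toy similarity matrix such as `[[1.0, 0.0], [0.0, 1.0]]`, `best_alignment` selects the monotone alignment `(0, 1)`, because any reordering or repeated target incurs a distortion penalty. A practical system would replace the exhaustive enumeration with dynamic programming over a sequence model.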

9. A computing apparatus, comprising:
- a processor; and
- a memory that is configured with components that are executable by the processor, the components comprising:
  - a receiver component that receives:
    - a first electronic document that comprises a first set of word sequences; and
    - a second electronic document that comprises a second set of word sequences and a hyperlink to the first electronic document, wherein the first electronic document is automatically correlated with the second electronic document based at least in part upon the hyperlink to the first electronic document in the second electronic document;
  - a feature extractor component that extracts a plurality of features based on the first electronic document and the second electronic document, the plurality of features comprising:
    - a first distortion feature that is indicative of a difference between a position of a previously aligned word sequence and a currently aligned word sequence with respect to at least one word sequence in the first set of word sequences and the respective word sequences in the second set of word sequences or an empty word sequence; and
    - a second distortion feature that is indicative of a difference between:
      - an actual position of the currently aligned word sequence in the second electronic document relative to the previously aligned word sequence in the second electronic document; and
      - an expected position of the currently aligned word sequence in the second electronic document, the expected position being adjacent to the previously aligned word sequence; and
  - a ranker component that outputs a ranked list of word sequence pairs, wherein the word sequence pairs comprise a word sequence in the first set of word sequences and a word sequence in the second set of word sequences, wherein the ranked list of word sequence pairs are ranked in an order based at least in part upon the first distortion feature and the second distortion feature and that is indicative of an amount of parallelism between word sequences in the word sequence pairs.
- Dependent claims (10, 11, 12, 13, 14, 15):
  - 10. The computing apparatus of claim 9, wherein the components further comprise a comparer component that compares scores assigned to the word sequence pairs with a threshold score and selects the word sequence pair with a highest score as including word sequences that are parallel to one another.
  - 11. The computing apparatus of claim 9, wherein the ranker component outputs the ranked list of word sequence pairs based at least in part upon word sequence alignment between the first electronic document and the second electronic document.
  - 12. The computing apparatus of claim 9, wherein the first electronic document and the second electronic document are a first web page and a second web page, respectively.
  - 13. The computing apparatus of claim 12, wherein the first web page and the second web page belong to an online collaborative encyclopedia.
  - 14. The computing apparatus of claim 13, wherein the first set of word sequences in the first web page is in a first language and the second set of word sequences in the second web page is in a second language.
  - 15. The computing apparatus of claim 9, wherein the plurality of features further comprise one or more of a feature derived from word alignments between the first electronic document and the second electronic document, a feature derived from a markup of the first electronic document or the second electronic document, or a word-level induced lexicon feature.
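
The ranker component of claim 9 and the comparer component of claim 10 can be sketched together as follows. The `ScoredPair` type, the function names, and the example scores are hypothetical; the claims do not specify how scores are represented, only that pairs are ranked and that the highest-scoring pair is compared with a threshold.

```python
from typing import Iterable, List, NamedTuple, Optional


class ScoredPair(NamedTuple):
    """A candidate word sequence pair with an already-computed score."""
    source: str   # word sequence from the first document
    target: str   # word sequence from the second document
    score: float  # higher means more likely parallel (assumed convention)


def rank_pairs(pairs: Iterable[ScoredPair]) -> List[ScoredPair]:
    """Ranker sketch: order pairs from most to least parallel."""
    return sorted(pairs, key=lambda p: p.score, reverse=True)


def select_parallel(pairs: Iterable[ScoredPair],
                    threshold: float) -> Optional[ScoredPair]:
    """Comparer sketch: return the top-ranked pair only if its score
    exceeds the threshold; otherwise report that no pair is parallel."""
    ranked = rank_pairs(pairs)
    if ranked and ranked[0].score > threshold:
        return ranked[0]
    return None


if __name__ == "__main__":
    candidates = [
        ScoredPair("The cat sleeps.", "Die Katze schläft.", 0.92),
        ScoredPair("The cat sleeps.", "Der Hund bellt.", 0.11),
    ]
    print(select_parallel(candidates, threshold=0.5))
```

Returning `None` when no score clears the threshold keeps non-parallel sentences out of any downstream training data, which matters for comparable corpora where many sentences have no translation on the other side.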

16. A computer-readable data storage device comprising instructions that, when executed by a processor, cause the processor to perform acts, comprising:
- receiving a first web page that comprises a first set of word sequences in a first language;
- receiving a second web page that comprises a second set of word sequences in a second language, wherein the first web page and the second web page are web pages in a collaborative encyclopedia, wherein the first web page and the second web page are directed toward same subject matter, wherein the first web page comprises a first hyperlink to the second web page and the second web page comprises a second hyperlink to the first web page;
- automatically correlating the first web page and the second web page based at least in part upon the first hyperlink and the second hyperlink;
- assigning a score to a plurality of word sequence pairs, wherein a word sequence pair comprises a word sequence in the first set of word sequences and a word sequence in the second set of word sequences, wherein the score is indicative of an amount of parallelism between word sequences in the word sequence pairs, and wherein the score is assigned based at least in part upon a plurality of features comprising:
  - a first distortion feature that is indicative of a difference between a position of a previously aligned word sequence and a currently aligned word sequence with respect to at least one word sequence in the first set of word sequences and the respective word sequences in the second set of word sequences or an empty word sequence; and
  - a second distortion feature that is indicative of a difference between:
    - an actual position of the currently aligned word sequence in the second electronic document relative to the previously aligned word sequence in the second electronic document; and
    - an expected position of the currently aligned word sequence in the second electronic document, the expected position being adjacent to the previously aligned word sequence; and
- comparing a highest score assigned to a word sequence pair amongst all scores assigned to word sequence pairs that comprise a first word sequence to a threshold value; and
- if the highest score is above the threshold value, outputting data that indicates that the word sequences in the word sequence pair that has been assigned the highest score are parallel to one another.
- Dependent claims (18):
  - 18. The computer-readable data storage device of claim 16, wherein the plurality of features further comprises one or more of a feature derived from word alignments between the first electronic document and the second electronic document, a feature derived from a markup of the first electronic document or the second electronic document, or a word-level induced lexicon feature.
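
The correlation act in claim 16, where each web page carries a hyperlink to the other, can be sketched as a mutual-link check. The `links` mapping, the page identifiers, and the function name are illustrative assumptions; how hyperlinks are actually extracted from pages is outside this sketch.

```python
from typing import Dict, Set, Tuple


def correlate_pages(links: Dict[str, Set[str]]) -> Set[Tuple[str, str]]:
    """Pair up pages that link to each other.

    `links` maps a page identifier to the set of page identifiers it
    hyperlinks to (a hypothetical pre-extracted structure). Each mutual
    pair is returned once, with its two identifiers in sorted order.
    """
    pairs: Set[Tuple[str, str]] = set()
    for page, targets in links.items():
        for other in targets:
            if other == page:
                continue  # ignore self-links
            if page in links.get(other, set()):
                first, second = sorted((page, other))
                pairs.add((first, second))
    return pairs


if __name__ == "__main__":
    example = {
        "en/Cat": {"de/Katze", "en/Dog"},
        "de/Katze": {"en/Cat"},
        "en/Dog": set(),
    }
    print(correlate_pages(example))
```

Requiring links in both directions is stricter than the single hyperlink recited in claim 1; it trades recall for fewer spurious document pairs.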

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Quirk, Christopher Brian; Toutanova, Kristina N.; Smith, Jason Robert
Primary Examiner(s): Desir, Pierre-Louis
Assistant Examiner(s): Sharma, Neeraj
Application Number: US12/794,778
Publication Number: US 20110301935A1
Time in Patent Office: 1,226 Days
Field of Search: 704/2, 704/3, 704/4, 704/8, 704/9, 715/205, 382/229
US Class Current: 704/2
CPC Class Codes:
G06F 40/295 Named entity recognition
G06F 40/45 Example-based machine trans...
