Extracting sentence translations from translated documents

US 7,054,803 B2
Filed: 12/19/2000
Issued: 05/30/2006
Est. Priority Date: 12/19/2000
Status: Expired due to Fees

Abstract

A system extracts translations from translated texts, such as sentence translations from translated versions of documents. A first and a second text are accessed and divided into a plurality of textual elements. From these textual elements, a sequence of pairs of text portions is formed, and a pair score is calculated for each pair, using weighted features. Then, an alignment score of the sequence is calculated using the pair scores, and the sequence is systematically varied to identify a sequence that optimizes the alignment score. The invention allows for fast, reliable and robust alignment of sentences within large translated documents. Further, it allows a broad variety of existing knowledge sources to be exploited in a flexible way, without a performance penalty. Finally, a general implementation of dynamic programming search with online memory allocation and garbage collection allows very long documents to be processed with a limited memory footprint.
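
The pair score mentioned in the abstract is spelled out in claim 1: for each feature occurring in a pair, take the smaller of its two occurrence counts, multiply by the feature's weight, and sum the products. The following is a minimal Python sketch of that formula only. The function name `pair_score`, the `default_weight` parameter, and the sample features and weights are illustrative assumptions, not taken from the patent.

```python
from collections import Counter

def pair_score(src_features, tgt_features, weights, default_weight=0.0):
    """Sum over shared features of min(count in src, count in tgt) * weight.

    src_features / tgt_features: lists of feature occurrences (here, tokens).
    weights: dict mapping feature -> weight (hypothetical values in the demo).
    """
    src_counts = Counter(src_features)
    tgt_counts = Counter(tgt_features)
    score = 0.0
    # A feature missing on either side has a minimum count of zero,
    # so only features present on both sides can contribute.
    for feature in src_counts.keys() & tgt_counts.keys():
        shared = min(src_counts[feature], tgt_counts[feature])
        score += shared * weights.get(feature, default_weight)
    return score

# Hypothetical weights: a punctuation mark counts far less than a name or number.
weights = {"xerox": 3.0, "2000": 2.0, ",": 0.1}
src = ["xerox", "2000", ",", ","]
tgt = ["xerox", ",", "2000", "2000"]
print(pair_score(src, tgt, weights))
```

In this example each of the three features is shared exactly once, so the score is about 5.1 (3.0 + 2.0 + 0.1). Features that appear on only one side contribute nothing.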

20 Claims

1. Method of extracting translations from translated texts, the method comprising the steps of:
- accessing a first text in a first language;
  
  accessing a second text in a second language, the second language being different from the first language, the second text being a translation of the first text;
  
  dividing the first text and the second text each into a plurality of textual elements;
  
  forming a sequence of pairs of text portions from said plurality of textual elements, each pair comprising a text portion of the first text and a text portion of the second text, each text portion comprising zero or more adjacent textual elements, each textual element of the first and the second text being comprised in a text portion of the sequence;
  
  calculating a pair score of each pair in the sequence using a number of occurrences of each of a plurality of features in the text portions of the respective pair and using a plurality of weights, each weight being assigned to one feature of said plurality of features, wherein the pair scores are calculated by taking, for each feature occurring in the pair, a minimum number of the number of occurrences of the respective feature in the paired text portions, taking a product of the minimum number and the weight assigned to the respective feature, and summing up all the products of all features;
  
  calculating an alignment score of the sequence using said pair scores, said alignment score indicating the translation quality of the sequence; and
  
  optimizing said alignment score by systematically searching through the space of alternatives and combining optimal alignments for subsequences into optimal alignments for longer sequences.
- - 2. The method of claim 1, wherein the dividing step includes a monolingual pre-processing step;
    - the monolingual pre-processing includes performing normalization of the textual elements, said normalization including lemmatization, case normalization, or truncation;
      
      the monolingual pre-processing further includes counting the frequencies of the normalized textual elements that occur in the texts, and storing the frequencies;
      
      the step of forming said sequence of pairs of text portions includes the steps of retrieving the stored frequencies and pairing text elements having at least similar frequencies; and
      
      the method further comprises the step of reducing at least one weight assigned to a feature occurring in a text element pair if the difference between the frequencies of the paired textual elements exceeds a certain amount.
  - 3. The method of claim 1, wherein:
    - the alignment score is calculated by summing up all the pair scores; and
      
      the alignment score is optimized by selecting the maximum alignment score.
  - 4. The method of claim 1, wherein said plurality of features include lexical information.
  - 5. The method of claim 1, wherein said plurality of features include document structure and formatting information.
  - 6. The method of claim 1, wherein said plurality of features include any character within the text, the weights of such features being lower than the weights of other features.
  - 7. The method of claim 1, further comprising the step of generating pairs of textual elements, each pair comprising a textual element of the first text and a textual element of the second text;
    - wherein the step of generating pairs of textual elements comprises the step of normalizing textual elements; and
      
      wherein the normalizing step includes removing accents, inessential non-alphanumeric characters, or case normalization.
  - 8. The method of claim 1, further comprising the step of generating pairs of textual elements, each pair comprising a textual element of the first text and a textual element of the second text;
    - wherein the step of generating pairs of textual elements comprises the step of accessing at least one bilingual resource.
  - 9. The method of claim 1, wherein the first and second texts are provided in the form of a first and second document, the first and second languages being natural languages, and wherein the method is used for extracting sentence translations.
  - 10. The method of claim 1, wherein the first and second texts are provided in the form of speech signals and a transcript thereof.
  - 11. The method of claim 1, wherein the first language is a first set of characters identifying a first DNA sequence and the second language is a second set of characters identifying a second DNA sequence.
  - 12. The method of claim 1, wherein the forming, calculating and optimizing steps are performed in a dynamic programming process comprising the steps of:
    - accessing a set of nodes, each node being a pair of positions in the first and second texts, each node being annotated with a node score;
      
      for each node, generating a set of successor nodes by applying a set of node transitions; and
      
      for each successor node, calculating a node score using the node score of the node accessed for generating the successor nodes.
  - 13. The method of claim 12, wherein said node score is the score of the best alignment that led to the respective node;
    - and wherein:
      
      each node has assigned a pointer to a predecessor node that took part in the best alignment that led to the respective node; and
      
      the process further comprises the step of deleting each node which has no successor node that points to the node as its predecessor node.
  - 14. The method of claim 12, wherein the process further comprises a pruning step of comparing the score of each successor node with the scores of competing nodes spanning a similar part of the first and second texts, and deleting those successor nodes having scores being considerably worse than the scores of the competing nodes.
  - 15. The method of claim 14, further comprising the steps of:
    - estimating the number of matches that can be achieved in the alignment of the remaining parts of the texts; and
      
      using said estimate in comparing the competing nodes.
  - 16. The method of claim 14, further comprising the steps of:
    - computing an approximate alignment before performing the forming, calculating and optimizing steps; and
      
      using said approximate alignment in estimating the number of matches.
  - 17. The method of claim 14, wherein the step of estimating the number of matches includes the step of accessing an index for determining for each feature occurrence where in the respective text the feature occurs.
  - 18. The method of claim 14, further comprising the step of performing a backward run of the Hunt/Szymanski algorithm and recording the intermediate results sequentially in a stack such that they can be retrieved in reverse order.
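
Claims 1, 3, 12 and 13 together describe a dynamic programming search over nodes (pairs of positions in the two texts), where each node keeps the score of the best alignment leading to it and a pointer to its predecessor. The sketch below is one possible reading of that scheme, under stated assumptions: the transition set `TRANSITIONS`, the function names, and the toy data are hypothetical. It omits the node deletion of claim 13 and the pruning of claims 14 to 18.

```python
from collections import Counter

def pair_score(src_tokens, tgt_tokens, weights):
    """Claim 1 style score: sum of min(count) * weight over shared features."""
    src_counts, tgt_counts = Counter(src_tokens), Counter(tgt_tokens)
    return sum(min(src_counts[f], tgt_counts[f]) * weights.get(f, 0.0)
               for f in src_counts.keys() & tgt_counts.keys())

# Assumed node transitions: (source elements consumed, target elements consumed).
# A zero means the text portion on that side is empty.
TRANSITIONS = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]

def align(src, tgt, weights):
    """src, tgt: lists of textual elements, each a list of tokens.

    Returns (alignment score, pairs), where each pair is a tuple of
    half-open index ranges ((i_start, i_end), (j_start, j_end)).
    """
    n, m = len(src), len(tgt)
    # node -> (best score of any alignment reaching it, predecessor node)
    best = {(0, 0): (0.0, None)}
    # Row-major order guarantees every predecessor is final before it is expanded.
    for i in range(n + 1):
        for j in range(m + 1):
            if (i, j) not in best:
                continue
            score = best[(i, j)][0]
            for di, dj in TRANSITIONS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_tokens = [t for element in src[i:ni] for t in element]
                t_tokens = [t for element in tgt[j:nj] for t in element]
                candidate = score + pair_score(s_tokens, t_tokens, weights)
                # Keep the maximum summed pair score per node (claim 3).
                if (ni, nj) not in best or candidate > best[(ni, nj)][0]:
                    best[(ni, nj)] = (candidate, (i, j))
    # Follow predecessor pointers back from the final node.
    pairs, node = [], (n, m)
    while best[node][1] is not None:
        prev = best[node][1]
        pairs.append(((prev[0], node[0]), (prev[1], node[1])))
        node = prev
    pairs.reverse()
    return best[(n, m)][0], pairs

# Toy data: the second target element covers source elements 1 and 2.
src = [["a"], ["b"], ["c"]]
tgt = [["a"], ["b", "c"]]
score, pairs = align(src, tgt, {"a": 1.0, "b": 1.0, "c": 1.0})
print(score, pairs)
```

On the toy data the best alignment pairs source element 0 with target element 0 and merges source elements 1 and 2 against target element 1. This version keeps every node in memory, so it does not reproduce the limited memory footprint that the abstract attributes to online garbage collection.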

19. A computer readable storage medium storing instructions for performing a method comprising the steps of:
- accessing a first text in a first language;
  
  accessing a second text in a second language, the second language being different from the first language, the second text being a translation of the first text;
  
  dividing the first text and the second text each into a plurality of textual elements;
  
  forming a sequence of pairs of text portions from said plurality of textual elements, each pair comprising a text portion of the first text and a text portion of the second text, each text portion comprising zero or more adjacent textual elements, each textual element of the first and the second text being comprised in a text portion of the sequence;
  
  calculating a pair score of each pair in the sequence using a number of occurrences of each of a plurality of features in the text portions of the respective pair and using a plurality of weights, each weight being assigned to one feature of said plurality of features, wherein the pair scores are calculated by taking, for each feature occurring in the pair, a minimum number of the number of occurrences of the respective feature in the paired text portions, taking a product of the minimum number and the weight assigned to the respective feature, and summing up all the products of all features;
  
  calculating an alignment score of the sequence using said pair scores, said alignment score indicating the translation quality of the sequence; and
  
  optimizing said alignment score by systematically searching through the space of alternatives and combining optimal alignments for subsequences into optimal alignments for longer sequences.

20. A system for extracting translations from translated texts, comprising:
- a pre-processor for accessing a first text in a first language, accessing a second text in a second language, the second language being different from the first language, the second text being a translation of the first text, and dividing the first and the second text each into a plurality of textual elements; and
  
  a processor for forming a sequence of pairs of text portions from said pluralities of textual elements, each pair comprising a text portion of the first text and a text portion of the second text, each text portion comprising zero or more adjacent textual elements, each textual element of the first and the second text being comprised in a text portion of the sequence, the processor being further arranged for calculating a pair score of each pair in the sequence using a number of occurrences of each of a plurality of features in the text portions of the respective pair and using a plurality of weights, each weight being assigned to one feature of said plurality of features, calculating an alignment score of the sequence using said pair scores, said alignment score indicating the translation quality of the sequence, and optimizing said alignment score by repeating said forming and calculating steps, wherein the pair scores are calculated by taking, for each feature occurring in the pair, a minimum number of the number of occurrences of the respective feature in the paired text portions, taking a product of the minimum number and the weight assigned to the respective feature, and summing up all the products of all features.
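
The pre-processing side of the claims (the pre-processor of claim 20, with the normalization and frequency steps of claims 2 and 7) can be illustrated with a short Python sketch. Everything named here is an assumption made for illustration: the function names, the `truncate_to`, `max_diff` and `factor` parameters, and the choice to treat every non-alphanumeric character as inessential.

```python
import unicodedata
from collections import Counter

def normalize(token, truncate_to=None):
    """Case normalization, accent removal, and optional truncation."""
    token = token.lower()
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", token)
    token = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: all non-alphanumeric characters are treated as inessential.
    token = "".join(ch for ch in token if ch.isalnum())
    if truncate_to is not None:
        token = token[:truncate_to]
    return token

def frequencies(tokens):
    """Count normalized textual elements, skipping ones that normalize to ''."""
    return Counter(t for t in map(normalize, tokens) if t)

def adjusted_weight(weight, freq_src, freq_tgt, max_diff=2, factor=0.5):
    """Reduce a weight when the paired elements' frequencies differ too much.

    max_diff and factor are hypothetical; the claims only say the weight is
    reduced if the frequency difference exceeds a certain amount.
    """
    if abs(freq_src - freq_tgt) > max_diff:
        return weight * factor
    return weight

print(normalize("Événement,"))
print(frequencies(["Café", "cafe", "CAFE!", "--"]))
print(adjusted_weight(2.0, 10, 3))
```

Lemmatization, which claim 2 also lists, is left out because it needs a language-specific resource.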

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Eisele, Andreas
Primary Examiner(s): Hudspeth, David
Assistant Examiner(s): Sked, Matthew J.
Application Number: US09/738,990
Publication Number: US 20020107683A1
Time in Patent Office: 1,988 Days
Field of Search: 704/2, 704/277, 704/9
US Class Current: 704/2
CPC Class Codes:
G06F 40/211 Syntactic parsing, e.g. bas...
G06F 40/45 Example-based machine trans...
