Discovery of parallel text portions in comparable collections of corpora and training using comparable texts

US 8,296,127 B2
Filed: 03/22/2005
Issued: 10/23/2012
Est. Priority Date: 03/23/2004
Status: Active Grant

Abstract

A translation training device which extracts from two nonparallel corpora a set of parallel sentences. The system finds parameters between different sentences or phrases, in order to find parallel sentences. The parallel sentences are then used for training a data-driven machine translation system. The process can be applied repetitively until sufficient data is collected or until the performance of the translation system stops improving.

363 Citations

29 Claims

1. A method, comprising:
- obtaining, via a processing module that is executable by a processor, a collection of texts which are not parallel texts;
  
  determining sentences within the collection of texts, whose meaning is substantially the same, by comparing a plurality of sentences within the collection of texts, and determining at least one parameter indicative of a sentence in the first document and a sentence in the second document, and using said at least one parameter to determine sentences which have similar meanings; and
  
  using said sentences which have similar meanings to create training data for a machine translation system.
  - 2. The method recited in claim 1, further comprising using said training data to train a machine translation system.
  - 3. The method recited in claim 1, further comprising, after training said machine translation system, comparing again said sentences in said first document with said sentences in said second document.
  - 4. The method recited in claim 1, wherein said parameter includes dates of texts.
  - 5. The method recited in claim 1, wherein said parameter includes a number of words in common in a specified word phrase.
  - 6. The method recited in claim 1, wherein said parameter includes alignment of words in two specified word phrases.
  - 7. The method recited in claim 1, wherein said parameter includes a fertility representing a number of words to which another word is connected.
  - 8. The method recited in claim 1, wherein said parameter includes a number of words in one sentence which have no corresponding words in the other sentence.
  - 9. The method recited in claim 1, wherein said determining sentences comprises using a first parameter to select a pair of texts which are similar, and determining possible sentence pairs within said pair of texts.
  - 10. The method recited in claim 9, wherein said first parameter comprises dates of the texts.
  - 11. The method recited in claim 9, wherein said determining possible sentence pairs comprises using a word overlap filter to determine likely overlapping sentences.
  - 12. The method recited in claim 11, wherein said word overlap filter verifies that a ratio of the lengths of the sentences is no greater than two, and that at least half the words in each sentence have a translation in the other sentence.
  - 13. The method recited in claim 1, wherein said determining comprises determining a coarse correspondence between two texts, and further filtering said texts to determine sentence pairings within the two texts.
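
The word overlap filter of claims 11 and 12 can be illustrated with a minimal sketch. It assumes tokenized sentences and two hypothetical bilingual dictionaries (`src_to_tgt`, `tgt_to_src`) that map each word to a set of candidate translations; the function name, the parameter names, and the dictionary layout are assumptions for illustration and do not come from the patent text. Only the two thresholds (length ratio no greater than two, at least half the words translated) are taken from claim 12.

```python
def passes_word_overlap_filter(src_words, tgt_words, src_to_tgt, tgt_to_src,
                               max_length_ratio=2.0, min_covered=0.5):
    """Return True if a candidate sentence pair survives the overlap filter.

    src_to_tgt / tgt_to_src are hypothetical dictionaries mapping a word to
    a set of possible translations in the other language.
    """
    if not src_words or not tgt_words:
        return False

    # Condition 1: the longer sentence is at most max_length_ratio times
    # as long as the shorter one.
    longer = max(len(src_words), len(tgt_words))
    shorter = min(len(src_words), len(tgt_words))
    if longer / shorter > max_length_ratio:
        return False

    def covered_fraction(words, other_words, dictionary):
        # Fraction of `words` with at least one translation in the other sentence.
        other = set(other_words)
        hits = sum(1 for w in words if dictionary.get(w, set()) & other)
        return hits / len(words)

    # Condition 2: at least min_covered of the words in EACH sentence
    # have a translation in the other sentence.
    return (covered_fraction(src_words, tgt_words, src_to_tgt) >= min_covered
            and covered_fraction(tgt_words, src_words, tgt_to_src) >= min_covered)
```

A filter of this kind is cheap, so it would typically run before any more expensive pairwise comparison of the surviving candidates.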

14. A method, comprising:
- obtaining, via a processing module that is executable by a processor, a first amount of parallel training data for a learning component of a machine translation system;
  
  using the learning component of the machine translation system trained using said parallel data to determine translation parameters, including at least one probabilistic word dictionary;
  
  using said translation parameters to extract parallel sentences from a second corpus of nonparallel data, where said second corpus is larger than a database of said parallel training data;
  
  using said parallel sentences to create training data for said learning component of said machine translation system;
  
  training said learning component using said training data, and iteratively re-analyzing said comparable corpus using the system thus trained;
  
  continuing said iterative re-analyzing until training reaches a specified level.
  - 15. The method recited in claim 14, wherein said continuing comprises terminating the iterative process when a sufficiently large corpus of training data is obtained.
  - 16. The method recited in claim 14, wherein said continuing comprises terminating the iterative process when a translation system trained on the data stops improving.
  - 17. The method recited in claim 14, wherein said the iteratively reanalyzing is continued until an improvement less than a specified amount is obtained.
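
The iterative procedure of claims 14 to 17 can be sketched as a generic loop. The callables `train`, `extract`, and `evaluate` are hypothetical placeholders supplied by the caller, and the parameters `min_gain`, `target_size`, and `max_rounds` are illustrative assumptions; the claims do not specify any of these interfaces or values. The sketch only shows how the stopping conditions (enough data, or improvement below a specified amount) might fit together.

```python
def bootstrap(seed_pairs, comparable_corpus, train, extract, evaluate,
              min_gain=0.01, target_size=None, max_rounds=10):
    """Iteratively extract parallel sentences and retrain.

    train(pairs) -> model, extract(model, corpus) -> sentence pairs, and
    evaluate(model) -> float are hypothetical callables, not patent APIs.
    """
    training_pairs = list(seed_pairs)
    model = train(training_pairs)
    best_score = evaluate(model)

    for _ in range(max_rounds):
        # Re-analyze the nonparallel corpus with the current model.
        extracted = extract(model, comparable_corpus)
        candidate_pairs = list(seed_pairs) + list(extracted)
        candidate_model = train(candidate_pairs)
        score = evaluate(candidate_model)

        # Stop when the improvement is less than the specified amount
        # (the rounds that did improve are kept).
        if score - best_score < min_gain:
            break
        training_pairs, model, best_score = candidate_pairs, candidate_model, score

        # Stop when a sufficiently large training corpus has been collected.
        if target_size is not None and len(training_pairs) >= target_size:
            break

    return model, training_pairs
```

Passing the three steps in as functions keeps the loop independent of any particular translation model or evaluation metric.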

18. A computer system, comprising:
- a database, storing a first collection of texts in a first language, and a second collection of texts, which are not parallel to said first collection of texts, that are in a second language;
  
  a training processor, that processes said texts to determine portions in the first collection of texts whose meaning is substantially the same as portions within the second collection of texts, by comparing a plurality of sentences within the collection of texts, and determining at least one parameter indicative of a first portion within the first collection and a second portion within the second collection, and using said at least one parameter to determine portions which have similar meanings; and
  
  a translation processor, using training data based on said portions which have similar meanings to translate input text between said first and second languages.
  - 19. The system recited in claim 18, wherein said training processor iteratively operates, by training a dictionary, and then comparing again said first collection and said second collection.
  - 20. The system recited in claim 18, wherein said training processor uses dates of texts as said parameter.
  - 21. The system recited in claim 18, wherein said training processor determines a number of words in common in a specified word phrase, and uses said number of words as said parameter.
  - 22. The system recited in claim 18, wherein said training processor determines alignment of words in two specified word phrases.
  - 23. The system recited in claim 18, wherein said training processor uses a word overlap filter that verifies that a ratio of the lengths of the sentences is no greater than a first specified value, and verifies that at least a second specified number of the words in each sentence have a translation in the other sentence.
  - 24. The system recited in claim 18, wherein said training processor determines a coarse correspondence between two texts, and further filtering said texts to determine sentence pairings within the two texts.
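
Claims 20 and 24 describe using dates of texts as a parameter and establishing a coarse correspondence between texts before pairing sentences. One plausible reading, sketched below, is to treat two documents as a candidate pair when their publication dates fall within a fixed window. The function name, the `(doc_id, date)` input layout, and the five-day default are assumptions made for illustration; the claims do not state a window size.

```python
from datetime import date, timedelta


def candidate_document_pairs(src_docs, tgt_docs, window_days=5):
    """Pair documents whose publication dates lie within window_days.

    src_docs and tgt_docs are lists of (doc_id, datetime.date) tuples;
    this layout and the default window are illustrative assumptions.
    """
    window = timedelta(days=window_days)
    pairs = []
    for src_id, src_date in src_docs:
        for tgt_id, tgt_date in tgt_docs:
            # Keep the pair only if the two dates are close enough.
            if abs(src_date - tgt_date) <= window:
                pairs.append((src_id, tgt_id))
    return pairs
```

Sentence-level filtering would then run only inside the document pairs returned here, which narrows the search considerably compared with comparing every sentence in both collections.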

25. A system, comprising:
- a database including a first amount of parallel training data for a learning component of a machine translation system, and a second corpus of non parallel data, said second amount greater than said first amount;
  
  a learning processor, forming at least one probabilistic word dictionary using said parallel data and also forming training data, and using said probabilistic word dictionary to determine portions within said second corpus which have comparable meanings, and to refine said training data to form refined training data, based on said second corpus, and to re-train again, based on said second corpus and said refined training data.
  - 26. The system recited in claim 25, further comprising a machine translation system that uses said training data to translate a document.
  - 27. The system recited in claim 25, wherein said learning processor continues said iterative re-analyzing until training reaches a specified level.
  - 28. The system recited in claim 25, wherein said first amount of parallel training data is less than 100 k tokens of information.
  - 29. The system recited in claim 25, wherein said second corpus of data is 10 times greater than said first amount of parallel training data.
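
Claim 25 recites forming a probabilistic word dictionary from the parallel data but does not say how it is estimated. As an illustration only, the sketch below uses IBM Model 1 expectation-maximization, a standard technique chosen here as an assumption rather than taken from the claims. The function name and the input layout (a list of tokenized source/target sentence pairs) are likewise hypothetical, and the NULL word is omitted for brevity.

```python
from collections import defaultdict


def train_word_dictionary(sentence_pairs, iterations=10):
    """Estimate t(target_word | source_word) with IBM Model 1 EM.

    sentence_pairs is a list of (source_tokens, target_tokens) tuples.
    Returns a dict keyed by (target_word, source_word).
    """
    tgt_vocab = {w for _, tgt in sentence_pairs for w in tgt}
    if not tgt_vocab:
        return {}

    # Start from a uniform distribution over the target vocabulary.
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)   # expected co-occurrence counts
        total = defaultdict(float)   # expected counts per source word
        for src, tgt in sentence_pairs:
            for tw in tgt:
                # E-step: spread each target word over the source words
                # in proportion to the current translation probabilities.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize the expected counts per source word.
        t = defaultdict(float, {(tw, sw): c / total[sw]
                                for (tw, sw), c in count.items()})
    return dict(t)
```

On a small seed corpus, the resulting probabilities could serve as the lookup behind a word overlap check on sentences from the larger nonparallel corpus, for example by keeping only translations above some probability threshold.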

Current Assignee: University of Southern California
Original Assignee: University of Southern California
Inventors: Marcu, Daniel; Munteanu, Dragos Stefan
Primary Examiner(s): Smits, Talivaldis Ivars
Application Number: US11/087,376
Publication Number: US 20050228643A1
Time in Patent Office: 2,772 Days
Field of Search: None
US Class Current: 704/5
CPC Class Codes: G06F 40/42 Data-driven translation
