Discovery of parallel text portions in comparable collections of corpora and training using comparable texts

US 20050228643A1
Filed: 03/22/2005
Published: 10/13/2005
Est. Priority Date: 03/23/2004
Status: Active Grant

Abstract

A translation training device extracts a set of parallel sentences from two nonparallel corpora. The system computes parameters relating pairs of sentences or phrases and uses them to find parallel sentences. The parallel sentences are then used to train a data-driven machine translation system. The process can be repeated until sufficient data is collected or until the performance of the translation system stops improving.

29 Claims

1. A method, comprising:
- obtaining a collection of texts which are not parallel texts;
  
  determining sentence portions within the collection of texts, whose meaning is substantially the same, by comparing a plurality of sentence portions within the collection of texts, and determining at least one parameter indicative of a sentence portion in the first document and a sentence portion in the second document, and using said at least one parameter to determine sentence portions which have similar meanings; and
  
  using said sentence portions which have similar meanings to create training data for a machine translation system.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
- - 2. The method as in claim 1, further comprising using said training data to train a machine translation system.
  - 3. A method as in claim 1, further comprising, after training said machine translation system, comparing again said sentence portions in said first document with said sentence portions in said second document.
  - 4. A method as in claim 1, wherein said parameter includes dates of texts.
  - 5. A method as in claim 1, wherein said parameter includes a number of words in common in a specified word phrase.
  - 6. A method as in claim 1, wherein said parameter includes alignment of words in two specified word phrases.
  - 7. A method as in claim 1, wherein said parameter includes a fertility representing a number of words to which another word is connected.
  - 8. A method as in claim 1, wherein said parameter includes a number of words in one sentence portion which have no corresponding words in the other sentence portion.
  - 9. A method as in claim 1, wherein said determining sentence portions comprises using a first parameter to select a pair of texts which are similar, and determining possible sentence portion pairs within said pair of texts.
  - 10. A method as in claim 9, wherein said first parameter comprises dates of the texts.
  - 11. A method as in claim 9, wherein said determining possible sentence portion pairs comprises using a word overlap filter to determine likely overlapping sentence portions.
  - 12. A method as in claim 11, wherein said word overlap filter verifies that a ratio of the lengths of the sentence portions is no greater than two, and that at least half the words in each sentence portion have a translation in the other sentence portion.
  - 13. A method as in claim 1, wherein said determining comprises determining a coarse correspondence between two texts, and further filtering said texts to determine sentence portion pairings within the two texts.
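
The word overlap filter of claims 11 and 12 can be illustrated with a minimal Python sketch. The claims fix only the two thresholds (a length ratio no greater than two, and at least half the words in each portion having a translation in the other). Everything else here is an assumption made for illustration: the function names, the whitespace-tokenized input, and the representation of the dictionary as a mapping from a word to a set of candidate translations.

```python
def covered_fraction(tokens, other_tokens, dictionary):
    """Fraction of `tokens` with at least one dictionary translation in `other_tokens`.

    `dictionary` is a hypothetical word -> set-of-translations mapping.
    """
    other = set(other_tokens)
    covered = sum(1 for tok in tokens if dictionary.get(tok, set()) & other)
    return covered / len(tokens)


def passes_word_overlap_filter(src_tokens, tgt_tokens, src_to_tgt, tgt_to_src,
                               max_length_ratio=2.0, min_covered_fraction=0.5):
    """Return True if the pair satisfies both conditions stated in claim 12."""
    if not src_tokens or not tgt_tokens:
        return False
    # Condition 1: the ratio of the lengths is no greater than two.
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    if longer / shorter > max_length_ratio:
        return False
    # Condition 2: at least half the words on each side have a translation
    # on the other side.
    return (covered_fraction(src_tokens, tgt_tokens, src_to_tgt) >= min_covered_fraction
            and covered_fraction(tgt_tokens, src_tokens, tgt_to_src) >= min_covered_fraction)
```

Pairs that pass a filter of this kind are only candidates; claim 13 describes a further filtering step to determine the final sentence portion pairings.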

14. A method, comprising:
- obtaining a first amount of parallel training data for a learning component of a machine translation system;
  
  using the learning component of the machine translation system trained using said parallel data to determine translation parameters, including at least one probabilistic word dictionary;
  
  using said translation parameters to extract parallel sentences from a second corpus of nonparallel data, where said second corpus is larger than a database of said parallel training data;
  
  using said parallel sentences to create training data for said learning component of said machine translation system;
  
  training said learning component using said training data, and iteratively re-analyzing said comparable corpus using the system thus trained;
  
  continuing said iterative re-analyzing when training reaches a specified level.
- View Dependent Claims (15, 16, 17)
- - 15. A method as in claim 14, wherein said continuing comprises terminating the iterative process until a sufficiently large corpus of training data is obtained.
  - 16. A method as in claim 14, wherein said continuing comprises terminating the iterative process when a translation system trained on the data stops improving.
  - 17. A method as in claim 14, wherein said the iteratively reanalyzing is continued until an improvement less than a specified amount is obtained.
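
The iterative procedure of claims 14 to 17 can be sketched as a bootstrapping loop. This is a hedged outline, not the patented implementation: `train`, `extract`, and `evaluate` are hypothetical callables supplied by the caller, and `min_gain`, `target_size`, and `max_rounds` are illustrative parameters standing in for the "specified amount", the "sufficiently large corpus", and a safety cap that the claims do not mention.

```python
def bootstrap(seed_pairs, comparable_corpus, train, extract, evaluate,
              min_gain=0.0, target_size=None, max_rounds=10):
    """Alternate between training and extraction until a stopping rule fires.

    train(pairs) -> model                  (hypothetical)
    extract(model, corpus) -> new pairs    (hypothetical)
    evaluate(model) -> score, higher is better (hypothetical)
    """
    training_pairs = list(seed_pairs)
    model = train(training_pairs)
    best_score = evaluate(model)

    for _ in range(max_rounds):
        # Re-analyze the comparable corpus with the current model.
        extracted = extract(model, comparable_corpus)
        candidate_pairs = list(seed_pairs) + list(extracted)
        candidate_model = train(candidate_pairs)
        score = evaluate(candidate_model)

        # Stop when the gain does not exceed the specified amount
        # (compare claims 16 and 17).
        if score - best_score <= min_gain:
            break
        model, training_pairs, best_score = candidate_model, candidate_pairs, score

        # Stop once enough training data has been collected (compare claim 15).
        if target_size is not None and len(training_pairs) >= target_size:
            break

    return model, training_pairs
```

Passing the three steps in as callables keeps the loop independent of any particular translation system or extraction method.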

18. A computer system, comprising:
- a database, storing a first collection of texts in a first language, and a second collection of texts, which are not parallel to said first collection of texts, that are in a second language;
  
  a training processor, that processes said texts to determine portions in the first collection of texts whose meaning is substantially the same as portions within the second collection of texts, by comparing a plurality of sentences within the collection of texts, and determining at least one parameter indicative of a first portion within the first collection and a second portion within the second collection, and using said at least one parameter to determine portions which have similar meanings; and
  
  a translation processor, using training data based on said portions which have similar meanings to translate input text between said first and second languages.
- View Dependent Claims (19, 20, 21, 22, 23, 24)
- - 19. A system as in claim 18, wherein said training processor iteratively operates, by training a dictionary, and then comparing again said first collection and said second collection.
  - 20. A system as in claim 18, wherein said training processor uses dates of texts as said parameter.
  - 21. A system as in claim 18, wherein said training processor determines a number of words in common in a specified word phrase, and uses said number of words as said parameter.
  - 22. A system as in claim 18, wherein said training processor determines alignment of words in two specified word phrases.
  - 23. A system as in claim 18, wherein said training processor uses a word overlap filter that verifies that a ratio of the lengths of the sentences is no greater than a first specified value, and verifies that at least a second specified number of the words in each sentence have a translation in the other sentence.
  - 24. A system as in claim 18, wherein said training processor determines a coarse correspondence between two texts, and further filtering said texts to determine sentence pairings within the two texts.
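
Claims 20 and 24 describe a coarse-to-fine search: first pick text pairs that plausibly correspond (for example by date), then look for sentence pairings inside them. The following Python sketch shows one way this could look. The document representation (a dict with `date` and `sentences` keys), the function names, and the five-day window are all assumptions made for illustration; the claims do not specify them.

```python
from datetime import date, timedelta
from itertools import product


def select_document_pairs(src_docs, tgt_docs, window_days=5):
    """Coarse step: keep document pairs whose dates lie within a window.

    `window_days` is an illustrative value, not one taken from the claims.
    """
    window = timedelta(days=window_days)
    return [(s, t) for s, t in product(src_docs, tgt_docs)
            if abs(s["date"] - t["date"]) <= window]


def candidate_sentence_pairs(doc_pairs, keep):
    """Fine step: yield sentence pairs from selected documents that pass `keep`.

    `keep(src_sentence, tgt_sentence)` is a hypothetical predicate, such as a
    word overlap filter.
    """
    for s_doc, t_doc in doc_pairs:
        for s_sent, t_sent in product(s_doc["sentences"], t_doc["sentences"]):
            if keep(s_sent, t_sent):
                yield s_sent, t_sent
```

Restricting the sentence-level comparison to date-matched documents avoids comparing every sentence in one collection against every sentence in the other.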

25. A system, comprising:
- a database including a first amount of parallel training data for a learning component of a machine translation system, and a second corpus of non parallel data, said second amount greater than said first amount;
  
  a learning processor, forming at least one probabilistic word dictionary using said parallel data and also forming training data, and using said probabilistic word dictionary to determine portions within said second corpus which have comparable meanings, and to refine said training data to form refined training data, based on said second corpus, and to re-train again, based on said second corpus and said refined training data.
- View Dependent Claims (26, 27, 28, 29)
- - 26. A system as in claim 25, further comprising a machine translation system that uses said training data to translate a document.
  - 27. A system as in claim 25, wherein said learning processor continues said iterative re-analyzing until training reaches a specified level.
  - 28. A system as in claim 25, wherein said first amount of parallel training data is less than 100 k tokens of information.
  - 29. A system as in claim 25, wherein said second corpus of data is 10 times greater than said first amount of parallel training data.

Current Assignee: University of Southern California
Original Assignee: University of Southern California
Inventors: Marcu, Daniel; Munteanu, Dragos Stefan
Granted Patent: US 8,296,127 B2
US Class (Current): 704/9
CPC Class Codes: G06F 40/42 (Data-driven translation)
