Identifying documents which form translated pairs, within a document collection

US 7,813,918 B2
Filed: 08/03/2005
Issued: 10/12/2010
Est. Priority Date: 08/03/2005
Status: Active Grant

Abstract

A training system for text to text application. The training system finds groups of documents, and identifies automatically similar documents in the groups which are similar. The automatically identified documents can then be used for training of the text to text application. The comparison uses reduced size versions of the documents in order to minimize the amount of processing.

19 Claims

1. A method for identifying documents that represent similar information to train a text-to-text application, the method comprising:
- obtaining a group of documents;
  
  determining reduced size versions of the documents, wherein the reduced size versions summarize information about words contained in the documents and the determining is performed by a processor;
  
  changing an order of information within the reduced size versions;
  
  sorting the reduced size versions;
  
  comparing the reduced size versions to determine documents that represent similar information, wherein the comparing is performed by a processor; and
  
  using the documents that represent similar information for training for the text-to-text application.
- Dependent claims (2, 3, 4):
  - 2. The method of claim 1, wherein the text-to-text application is a machine translation system.
  - 3. The method of claim 1, further comprising:
    - carrying out a rough translation to a second language of documents in the group to form a group of translated documents; and
      
      comparing the group of translated documents to other documents prior to the determining.
  - 4. The method of claim 1, wherein determining the reduced size versions comprises:
    - forming vectors indicative of the documents; and
      
      comparing the vectors.
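
Claim 4 describes forming vectors indicative of the documents and comparing them. The following is a minimal sketch of one way this could look, assuming bag-of-words count vectors and cosine similarity; the claims specify neither choice, and the names `to_vector`, `cosine`, `similar_pairs`, and the `threshold` value are hypothetical.

```python
import math
from collections import Counter


def to_vector(text: str) -> Counter:
    """Reduce a document to a word-count vector (an assumed representation)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_pairs(docs: dict[str, str], threshold: float = 0.8) -> list[tuple[str, str]]:
    """Return name pairs whose vectors meet the (assumed) similarity threshold."""
    vectors = {name: to_vector(text) for name, text in docs.items()}
    names = sorted(vectors)
    return [
        (x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
        if cosine(vectors[x], vectors[y]) >= threshold
    ]


docs = {
    "a": "the cat sat on the mat",
    "b": "the cat sat on a mat",
    "c": "stock prices fell sharply today",
}
pairs = similar_pairs(docs)
```

The vectors are much smaller than the documents they stand for, which is the point of comparing reduced size versions rather than full text.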

5. A method for identifying documents that represent similar information to train a text-to-text application, the method comprising:
- obtaining a group of documents;
  
  determining reduced size versions of the documents, wherein the reduced size versions summarize information about words contained in the documents and the determining is performed by a processor;
  
  comparing the reduced size versions to determine documents that represent similar information, wherein the comparing is performed by a processor; and
  
  using the documents that represent similar information for training for the text-to-text application, wherein determining the reduced size versions includes comparing words in the documents to specified dictionaries of words and defining the documents in terms of information about the words in the dictionaries.
- Dependent claims (6, 7):
  - 6. The method of claim 5, wherein the reduced size versions include keys representing positions of words in the dictionaries.
  - 7. The method of claim 6, further comprising changing an order of the keys prior to comparing the reduced size versions.
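
Claims 5 to 7 describe reduced size versions built from keys that represent word positions in a dictionary, with the key order changed before comparison. The sketch below is one possible reading, not the patented implementation: it assumes a single dictionary, treats "changing an order" as a rotation of the key tuple, and uses sorting to bring similar signatures next to each other. All function names and the Jaccard overlap measure are assumptions made for illustration.

```python
def reduce_document(text: str, dictionary: list[str]) -> tuple[int, ...]:
    """Return the sorted, unique dictionary positions (keys) of words in the text."""
    position = {word: i for i, word in enumerate(dictionary)}
    return tuple(sorted({position[w] for w in text.lower().split() if w in position}))


def rotate(keys: tuple[int, ...], shift: int) -> tuple[int, ...]:
    """Change the order of the keys by rotating them left by `shift` places."""
    if not keys:
        return keys
    shift %= len(keys)
    return keys[shift:] + keys[:shift]


def candidate_pairs(signatures: dict[str, tuple[int, ...]], shift: int = 0) -> list[tuple[str, str]]:
    """Sort documents by their rotated keys and pair each with its neighbor."""
    ordered = sorted(signatures, key=lambda name: rotate(signatures[name], shift))
    return list(zip(ordered, ordered[1:]))


def overlap(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """Jaccard overlap of two key sets (an assumed comparison measure)."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


dictionary = ["bank", "cat", "loan", "mat", "rate", "sat"]
signatures = {
    "d1": reduce_document("The cat sat on the mat", dictionary),
    "d2": reduce_document("A cat sat on a mat", dictionary),
    "d3": reduce_document("bank loan rate", dictionary),
}
neighbors = candidate_pairs(signatures)
```

Sorting means only adjacent signatures need to be compared, and repeating the sort with a different `shift` gives documents that differ in their leading keys another chance to land next to each other.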

8. A system for identifying documents that represent similar information to train a text-to-text application, the system comprising:
- a database including a group of documents;
  
  a processor that determines reduced size versions of the documents and compares the reduced size versions to determine documents within the group that represent similar information, wherein the reduced size versions summarize information about words contained in the documents; and
  
  a text-to-text application module stored in memory and executable to use the documents that represent similar information for training a text-to-text application, wherein the text-to-text application is executable to carry out a rough translation to a second language of documents in the group to form a group of translated documents, and to compare the group of translated documents to other documents prior to determining the documents that represent similar information.
- Dependent claims (9, 10, 11, 12):
  - 9. The system of claim 8, wherein the text-to-text application is a machine translation system.
  - 10. The system of claim 8, wherein the text-to-text application module is executable to change an order of information within the reduced size versions prior to the comparing.
  - 11. The system of claim 10, wherein the text-to-text application module is executable to sort the reduced size versions.
  - 12. The system of claim 8 wherein the text-to-text application is executable to form vectors indicative of the documents, and compares the vectors.

13. A system for identifying documents that represent similar information to train a text-to-text application, the system comprising:
- a database including a group of documents;
  
  a processor that determines reduced size versions of the documents and compares the reduced size versions to determine documents within the group that represent similar information, wherein the reduced size versions summarize information about words contained in the documents;
  
  a text-to-text application module stored in memory and executable to use the documents that represent similar information for training a text-to-text application; and
  
  a plurality of word dictionaries each having a plurality of words therein, and wherein the reduced size versions are determined at least in part by comparing words in the documents to words in the dictionaries.
- Dependent claims (14, 15):
  - 14. The system of claim 13, wherein the reduced size versions include keys representing positions of words in the dictionaries.
  - 15. The system of claim 14, wherein the text-to-text application module is executable to change an order of said keys prior to said comparing.

16. A method for identifying documents that represent similar information, the method comprising:
- obtaining a first group of documents in a first language, and a second group of documents in a second language;
  
  carrying out a rough translation to the first language of the second group of documents to form a third group of translated documents, the carrying out of the rough translation performed by a machine translation system;
  
  determining reduced size versions of the first and third groups of documents, wherein the reduced size versions summarize information about words contained in the first and third groups of documents, and the determining is performed by a processor; and
  
  comparing the reduced size versions to determine documents that represent similar information, the comparing performed by a processor.
- Dependent claims (17, 18, 19):
  - 17. The method of claim 16, further comprising using the documents that represent similar information to train a text-to-text application system.
  - 18. The method of claim 16, further comprising changing an order of information within the reduced size versions prior to determining the documents that represent similar information.
  - 19. The method of claim 18, further comprising sorting the reduced size versions.
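
The flow of claim 16 (rough translation of the second group into the first language, reduction of both groups, then comparison) can be sketched as follows. This is illustrative only: a word-by-word glossary lookup stands in for the machine translation system, word sets stand in for the reduced size versions, and `rough_translate`, `match_translations`, the glossary contents, and the threshold are all hypothetical.

```python
def rough_translate(text: str, glossary: dict[str, str]) -> str:
    """Word-by-word lookup; a stand-in for a real machine translation system."""
    return " ".join(glossary.get(w, w) for w in text.lower().split())


def word_set(text: str) -> frozenset[str]:
    """Reduced size version of a document: the set of its lowercase words."""
    return frozenset(text.lower().split())


def match_translations(
    first_group: dict[str, str],
    second_group: dict[str, str],
    glossary: dict[str, str],
    threshold: float = 0.5,
) -> list[tuple[str, str]]:
    """Pair first-language documents with roughly translated second-language ones."""
    # Third group: the second group translated into the first language.
    third_group = {n: rough_translate(t, glossary) for n, t in second_group.items()}
    reduced_first = {n: word_set(t) for n, t in first_group.items()}
    reduced_third = {n: word_set(t) for n, t in third_group.items()}
    pairs = []
    for name1, set1 in reduced_first.items():
        for name3, set3 in reduced_third.items():
            union = set1 | set3
            score = len(set1 & set3) / len(union) if union else 0.0
            if score >= threshold:
                pairs.append((name1, name3))
    return pairs


glossary = {"el": "the", "gato": "cat", "duerme": "sleeps"}
first_group = {"en1": "the cat sleeps", "en2": "prices rose today"}
second_group = {"es1": "El gato duerme"}
matched = match_translations(first_group, second_group, glossary)
```

The rough translation only needs to be good enough for word overlap to be meaningful; the matched pairs, not the rough translations themselves, are what claim 17 then uses for training.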

Current Assignee: SDL Inc (RWS Holdings Plc)
Original Assignee: Language Weaver, Inc. (RWS Holdings Plc)
Inventors: Kevin Knight, Ion Muslea, Daniel Marcu
Primary Examiner(s): Matthew J. Sked
Application Number: US11/197,744
Publication Number: US 20070033001A1
Time in Patent Office: 1,896 Days
Field of Search: None
US Class Current: 704/9
CPC Class Codes: G06F 40/45 Example-based machine trans...
