Identifying documents which form translated pairs, within a document collection
Abstract
A training system for a text-to-text application. The training system finds groups of documents and automatically identifies similar documents within those groups. The automatically identified documents can then be used to train the text-to-text application. The comparison uses reduced-size versions of the documents in order to minimize the amount of processing.
19 Claims
1. A method for identifying documents that represent similar information to train a text-to-text application, the method comprising:
- obtaining a group of documents;
- determining reduced size versions of the documents, wherein the reduced size versions summarize information about words contained in the documents and the determining is performed by a processor;
- changing an order of information within the reduced size versions;
- sorting the reduced size versions;
- comparing the reduced size versions to determine documents that represent similar information, wherein the comparing is performed by a processor; and
- using the documents that represent similar information for training for the text-to-text application.
Dependent claims: 2, 3, 4
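The steps of claim 1 can be illustrated with a minimal sketch. This is one plausible reading of the claim language, not the patented implementation. It assumes the reduced size version is a tuple of 0/1 flags over a small vocabulary, that "changing an order" means permuting flag positions, and that comparison uses Hamming distance between neighbours in sorted order. The names `signature`, `permute`, `hamming`, and `candidate_pairs`, and the parameters `rounds`, `window`, and `max_distance`, are all hypothetical.

```python
import random


def signature(text, vocab):
    """Reduced size version (assumed form): one 0/1 flag per vocabulary word."""
    words = set(text.lower().split())
    return tuple(1 if w in words else 0 for w in vocab)


def permute(sig, order):
    """Change the order of information within a signature."""
    return tuple(sig[i] for i in order)


def hamming(a, b):
    """Number of positions at which two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def candidate_pairs(docs, vocab, rounds=4, window=1, max_distance=1, seed=0):
    """Permute, sort, and compare only neighbours in the sorted order."""
    rng = random.Random(seed)
    sigs = {name: signature(text, vocab) for name, text in docs.items()}
    pairs = set()
    for _ in range(rounds):
        order = list(range(len(vocab)))
        rng.shuffle(order)  # a fresh ordering of the flags for each round
        ranked = sorted(sigs, key=lambda n: permute(sigs[n], order))
        for i, name in enumerate(ranked):
            for other in ranked[i + 1:i + 1 + window]:
                if hamming(sigs[name], sigs[other]) <= max_distance:
                    pairs.add(tuple(sorted((name, other))))
    return pairs


vocab = ["bank", "river", "loan", "water", "money", "fish"]
docs = {
    "a": "the bank approved the loan and sent the money",
    "b": "money from the bank loan",
    "c": "fish swim in the river water",
}
print(candidate_pairs(docs, vocab))  # {('a', 'b')}
```

Sorting places similar signatures close together, so each document is compared only with a few neighbours rather than with every other document. Repeating the process under different permutations gives near-duplicates that differ in an early position another chance to land next to each other.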
5. A method for identifying documents that represent similar information to train a text-to-text application, the method comprising:
- obtaining a group of documents;
- determining reduced size versions of the documents, wherein the reduced size versions summarize information about words contained in the documents and the determining is performed by a processor;
- comparing the reduced size versions to determine documents that represent similar information, wherein the comparing is performed by a processor; and
- using the documents that represent similar information for training for the text-to-text application, wherein determining the reduced size versions includes comparing words in the documents to specified dictionaries of words and defining the documents in terms of information about the words in the dictionaries.
Dependent claims: 6, 7
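The dictionary-based reduction in claim 5 might look like the sketch below. It assumes each dictionary is a plain set of words and that the "information about the words in the dictionaries" is a per-dictionary hit count; the claim does not fix either choice. `reduce_document` and `distance` are hypothetical names.

```python
import re


def reduce_document(text, dictionaries):
    """Describe a document by how many of its words fall in each dictionary."""
    words = re.findall(r"[a-z]+", text.lower())
    return tuple(sum(w in d for w in words) for d in dictionaries)


def distance(a, b):
    """L1 distance between two reduced size versions (an assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


finance = {"bank", "loan", "money"}
nature = {"river", "fish", "water"}

reduced = reduce_document("The bank gave a loan near the river.", [finance, nature])
print(reduced)  # (2, 1)
```

Each document shrinks to one number per dictionary, so comparing two documents costs time proportional to the number of dictionaries rather than to document length.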
8. A system for identifying documents that represent similar information to train a text-to-text application, the system comprising:
- a database including a group of documents;
- a processor that determines reduced size versions of the documents and compares the reduced size versions to determine documents within the group that represent similar information, wherein the reduced size versions summarize information about words contained in the documents; and
- a text-to-text application module stored in memory and executable to use the documents that represent similar information for training a text-to-text application, wherein the text-to-text application is executable to carry out a rough translation to a second language of documents in the group to form a group of translated documents, and to compare the group of translated documents to other documents prior to determining the documents that represent similar information.
Dependent claims: 9, 10, 11, 12
13. A system for identifying documents that represent similar information to train a text-to-text application, the system comprising:
- a database including a group of documents;
- a processor that determines reduced size versions of the documents and compares the reduced size versions to determine documents within the group that represent similar information, wherein the reduced size versions summarize information about words contained in the documents;
- a text-to-text application module stored in memory and executable to use the documents that represent similar information for training a text-to-text application; and
- a plurality of word dictionaries each having a plurality of words therein, and wherein the reduced size versions are determined at least in part by comparing words in the documents to words in the dictionaries.
Dependent claims: 14, 15
16. A method for identifying documents that represent similar information, the method comprising:
- obtaining a first group of documents in a first language, and a second group of documents in a second language;
- carrying out a rough translation to the first language of the second group of documents to form a third group of translated documents, the carrying out of the rough translation performed by a machine translation system;
- determining reduced size versions of the first and third groups of documents, wherein the reduced size versions summarize information about words contained in the first and third groups of documents, and the determining is performed by a processor; and
- comparing the reduced size versions to determine documents that represent similar information, the comparing performed by a processor.
Dependent claims: 17, 18, 19
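The flow of claim 16 can be sketched as follows, under loudly stated assumptions. A word-for-word glossary lookup stands in for the machine translation system. The reduced size version is taken to be the set of distinct non-stopword tokens, and similarity is measured by Jaccard overlap. None of these choices come from the claim itself, and `GLOSSARY`, `STOPWORDS`, `rough_translate`, `reduced`, `jaccard`, and `match` are hypothetical names.

```python
# Toy French-to-English glossary: a stand-in for a real machine translation system.
GLOSSARY = {"le": "the", "chat": "cat", "dort": "sleeps",
            "chien": "dog", "mange": "eats"}
STOPWORDS = {"the", "a", "an"}


def rough_translate(text, glossary):
    """Word-for-word rough translation; unknown words pass through unchanged."""
    return " ".join(glossary.get(w, w) for w in text.lower().split())


def reduced(text):
    """Reduced size version (assumed form): distinct non-stopword tokens."""
    return frozenset(w for w in text.lower().split() if w not in STOPWORDS)


def jaccard(a, b):
    """Overlap between two token sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match(first_group, second_group, glossary, threshold=0.5):
    """Pair each first-language document with its best translated counterpart."""
    third_group = {n: rough_translate(t, glossary) for n, t in second_group.items()}
    third_reduced = {n: reduced(t) for n, t in third_group.items()}
    if not third_reduced:
        return []
    pairs = []
    for name, text in first_group.items():
        r1 = reduced(text)
        best = max(third_reduced, key=lambda n: jaccard(r1, third_reduced[n]))
        if jaccard(r1, third_reduced[best]) >= threshold:
            pairs.append((name, best))
    return pairs


first_group = {"en1": "the cat sleeps", "en2": "the dog eats"}
second_group = {"fr1": "le chien mange", "fr2": "le chat dort"}
print(match(first_group, second_group, GLOSSARY))
# [('en1', 'fr2'), ('en2', 'fr1')]
```

The translation only needs to be good enough to bring both groups into a shared vocabulary, since the comparison operates on coarse summaries rather than on full text.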
Specification