Machine assisted translation tools
Abstract
A translation memory for computer assisted translation based upon an aligned file having a number of source language text strings paired with target language text strings. A posting vector file includes a posting vector associated with each source language text string in the aligned file. Each posting vector includes a document identification number corresponding to a selected one of the source language text strings in the aligned file and a number of entropy weight values, each of the number of weight values corresponding to a unique letter n-gram that appears in the selected source language text string. Preferably, the translation memory further includes an inverted index comprising a listing of source language letter n-grams and a pointer to each of the posting vectors including an entry for the listed letter n-gram.
18 Claims
1. A translation memory comprising:
a computer usable medium having computer readable data embodied therein, the computer readable data further comprising:
an aligned file having a number of source language text segments encoded in a computer readable format, each of the source language text segments positioned at a unique address in the computer usable medium and paired with a target language text segment encoded in computer readable format;
an inverted index comprising a listing of source language letter n-grams, wherein each listed n-gram includes an associated entry for an entropy weight for the listed letter n-gram, a count of the number of source language text segments in the aligned file that include an entry for the listed letter n-gram, and a pointer to a unique location in the computer usable memory; and
a posting vector file having a posting vector associated with each listed n-gram in the inverted index, each posting vector positioned at one of the unique locations pointed to in the inverted index, each posting vector including:
i) a plurality of document identification numbers each corresponding to a selected one of the source language text strings in the aligned file, and
ii) a number of entropy weight values, each of the number of weight values associated with one document identification number.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.
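The three structures recited in claim 1 can be pictured with a small in-memory sketch. This is an illustration only, not the patented implementation: the class and method names (`IndexEntry`, `PostingVector`, `TranslationMemory`, `add_ngram`, `segments_for`) are hypothetical, the "pointer" is modeled as a list position rather than a memory address, and the weight values are made-up placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class IndexEntry:
    """One inverted-index row for a letter n-gram (hypothetical layout)."""
    weight: float   # entropy weight of the n-gram
    count: int      # number of source segments containing the n-gram
    pointer: int    # position of the posting vector (stands in for an address)


@dataclass
class PostingVector:
    """Parallel lists: one weight value per document identification number."""
    doc_ids: List[int]
    weights: List[float]


@dataclass
class TranslationMemory:
    aligned: List[Tuple[str, str]] = field(default_factory=list)   # (source, target)
    index: Dict[str, IndexEntry] = field(default_factory=dict)
    postings: List[PostingVector] = field(default_factory=list)

    def add_ngram(self, ngram, weight, doc_ids, doc_weights):
        """Store a posting vector and point the index entry at it."""
        self.postings.append(PostingVector(list(doc_ids), list(doc_weights)))
        self.index[ngram] = IndexEntry(weight, len(doc_ids), len(self.postings) - 1)

    def segments_for(self, ngram):
        """Follow the pointer and return ((source, target), weight) pairs."""
        entry = self.index.get(ngram)
        if entry is None:
            return []
        vector = self.postings[entry.pointer]
        return [(self.aligned[d], w) for d, w in zip(vector.doc_ids, vector.weights)]


# Placeholder data: two aligned segments and two indexed trigrams.
tm = TranslationMemory(aligned=[
    ("open the file", "ouvrir le fichier"),
    ("close the file", "fermer le fichier"),
])
tm.add_ngram("ope", 0.9, [0], [0.9])
tm.add_ngram("fil", 0.4, [0, 1], [0.4, 0.4])
```

Keeping the posting vectors in a separate list mirrors the claim's separation between the index (small, searched first) and the posting vector file (reached only through a pointer).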
10. A method for creating a translation memory used to translate from a source language to a target language, the method comprising the steps of:
providing a weighted letter n-gram file for the source language;
providing an aligned file comprising a plurality of source language text strings, each source language text string paired with a target language document; and
creating an inverted index from the aligned file, the inverted index comprising a listing of unique letter n-grams appearing in the aligned file wherein each listing of a unique letter n-gram in the inverted index is associated with a set of document identifications that point to source language text strings in the aligned file that contain the associated letter n-gram.
Dependent claims: 11, 12, 13, 14, 15.
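The index-building step of claim 10 can be sketched as follows. This is a minimal illustration under several assumptions: the weighted n-gram file is represented as a plain dictionary, n-grams are trigrams over lowercased text, document identifications are list positions, and only n-grams present in the weighted file are indexed. The function names `letter_ngrams` and `build_inverted_index` are hypothetical.

```python
from collections import defaultdict


def letter_ngrams(text, n=3):
    """Return the set of letter n-grams in text (lowercasing is an assumption)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_inverted_index(aligned, weights, n=3):
    """Map each weighted n-gram to the ids of source strings that contain it.

    aligned: list of (source, target) pairs; the list position is the doc id.
    weights: dict of n-gram -> entropy weight, standing in for the weighted file.
    """
    index = defaultdict(list)
    for doc_id, (source, _target) in enumerate(aligned):
        for gram in letter_ngrams(source, n):
            if gram in weights:          # skip n-grams absent from the weighted file
                index[gram].append(doc_id)
    return dict(index)


# Placeholder data for illustration only.
aligned = [
    ("open the file", "ouvrir le fichier"),
    ("close the file", "fermer le fichier"),
]
weights = {"ope": 0.9, "fil": 0.4}
index = build_inverted_index(aligned, weights)
```

Because the source strings are visited in order, each list of document ids comes out sorted, which would make later merging of several lists straightforward.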
16. A method for analyzing language comprising the steps of:
providing a plurality of text strings to be analyzed;
identifying the text codeset applying to the plurality of text strings;
selecting each of the text strings in turn;
tokenizing the selected text segment to determine a set of letter n-grams appearing in the selected text document;
converting each of the set of letter n-grams to Unicode values;
defining a set of unique letter n-grams by adding each letter n-gram determined in the tokenizing step to the set of unique letter n-grams so long as each of the letter n-grams appear in the set of unique letter n-grams only once;
computing a frequency of occurrence of each letter n-gram in the set of unique letter n-grams among the plurality of text strings;
calculating an entropy weighting for each letter n-gram in the set of unique letter n-grams;
filtering the set of unique letter n-grams to remove letter n-grams having an entropy weight below a preselected threshold;
pairing each letter n-gram remaining in the set of unique letter n-grams with the calculated entropy weight; and
saving the letter n-gram:entropy weight pairs to a weighted n-gram file.
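The steps of claim 16 can be sketched end to end. The claim does not state the entropy formula, so the sketch below assumes a common log-entropy weighting, w = 1 + Σ_j p_j log(p_j) / log(N), where p_j is the share of an n-gram's occurrences that fall in string j and N is the number of strings; an n-gram spread evenly over all strings scores near 0 and one confined to a single string scores 1. The codeset identification and Unicode conversion steps are skipped because Python strings are already Unicode. All function names and the tab-separated output format are hypothetical.

```python
import io
import math
from collections import Counter


def letter_ngrams(text, n=3):
    """List every letter n-gram in text (trigrams and lowercasing are assumptions)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def entropy_weights(strings, n=3):
    """Assumed log-entropy weight per unique n-gram; not the patent's stated formula."""
    per_string = [Counter(letter_ngrams(s, n)) for s in strings]
    totals = Counter()
    for counts in per_string:
        totals.update(counts)            # frequency of each n-gram over all strings
    log_n = math.log(len(strings)) if len(strings) > 1 else 1.0
    weights = {}
    for gram, total in totals.items():
        entropy = 0.0
        for counts in per_string:
            if gram in counts:
                p = counts[gram] / total
                entropy += p * math.log(p)
        weights[gram] = 1.0 + entropy / log_n
    return weights


def filter_weights(weights, threshold):
    """Drop n-grams whose weight falls below the threshold."""
    return {g: w for g, w in weights.items() if w >= threshold}


def save_pairs(weights, out):
    """Write one 'n-gram<TAB>weight' line per pair to a file-like object."""
    for gram in sorted(weights):
        out.write(f"{gram}\t{weights[gram]:.6f}\n")


# Placeholder strings: "aaa" occurs in two of three strings, "abc" in one.
strings = ["aaa", "aaa", "abc"]
weights = entropy_weights(strings)
kept = filter_weights(weights, threshold=0.5)
buffer = io.StringIO()                   # stands in for the weighted n-gram file
save_pairs(kept, buffer)
```

In this toy input the trigram shared by two strings falls below the 0.5 threshold and is dropped, while the trigram unique to one string keeps the maximum weight of 1.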
17. A method for creating an inverted index to a file containing a plurality of text strings comprising the steps of:
creating a list of letter n-grams wherein each listed letter n-gram has an entropy weight greater than a preselected threshold;
for each listed letter n-gram, determining a set of the plurality of text strings in which the letter n-gram appears.
18. A method for retrieving a target subset of text strings from a plurality of text strings comprising the steps of:
inputting a text query;
tokenizing the text query to determine a set of letter n-grams appearing in the text query;
filtering the tokenized text query to remove letter n-grams having an entropy weight below a preselected threshold;
pairing each remaining letter n-gram in the tokenized text query with an entropy weight; and
using an inverted index to the plurality of text strings, determining the target subset of the plurality of text strings in which any of the paired letter n-grams appear.
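The retrieval steps of claim 18 can be sketched as a single function. This is an illustration under assumptions: the inverted index and the weights are plain dictionaries, n-grams are trigrams over lowercased text, and the function name `retrieve` is hypothetical. Summing the weights into a score per string goes beyond what the claim recites; the keys of the returned dictionary are the target subset.

```python
def letter_ngrams(text, n=3):
    """Return the set of letter n-grams in text (lowercasing is an assumption)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def retrieve(query, index, weights, threshold=0.0, n=3):
    """Return {doc_id: score} for strings sharing any retained n-gram with query.

    index:   dict of n-gram -> list of doc ids (the inverted index).
    weights: dict of n-gram -> entropy weight.
    The summed score is an added convenience, not part of the claimed steps.
    """
    scores = {}
    for gram in letter_ngrams(query, n):
        weight = weights.get(gram)
        if weight is None or weight < threshold:
            continue                     # filtered out: unknown or low-weight n-gram
        for doc_id in index.get(gram, ()):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return scores


# Placeholder index and weights for illustration only.
index = {"ope": [0], "fil": [0, 1], "the": [0, 1]}
weights = {"ope": 0.9, "fil": 0.4, "the": 0.05}
scores = retrieve("open file", index, weights, threshold=0.1)
```

Filtering before the index lookup means that very common n-grams never trigger a scan of their long posting lists.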
Specification