Machine assisted translation tools utilizing an inverted index and list of letter n-grams
Abstract
A translation memory for computer assisted translation based upon an aligned file having a number of source language text strings paired with target language text strings. A posting vector file includes a posting vector associated with each source language text string in the aligned file. Each posting vector includes a document identification number corresponding to a selected one of the source language text strings in the aligned file and a number of entropy weight values, each of the number of weight values corresponding to a unique letter n-gram that appears in the selected source language text string. Preferably, the translation memory further includes an inverted index comprising a listing of source language letter n-grams and a pointer to each of the posting vectors including an entry for the listed letter n-gram.
18 Claims
1. A translation memory comprising:
an aligned file having a number of source language text segments encoded in a computer readable format, each of the source language text segments positioned at a unique address and paired with a target language text segment encoded in the computer readable format;
an inverted index comprising a listing of source language letter n-grams, wherein each listed letter n-gram includes an associated entry for an entropy weight for the listed letter n-gram, a count of the number of source language text segments in the aligned file that include an entry for the listed letter n-gram, and a pointer to a unique location in the translation memory; and
a posting vector file having a posting vector associated with each listed letter n-gram in the inverted index, each posting vector positioned at one of the unique locations pointed to in the inverted index, each posting vector including:
i) a plurality of document identification numbers each corresponding to a selected one of the source language text strings in the aligned file, and
ii) a number of entropy weight values, each of the number of entropy weight values associated with one document identification number.
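As a rough illustration of how the three parts of claim 1 fit together, the following Python sketch models the aligned file, the inverted index, and the posting vector file in memory. The class and field names (`TranslationMemory`, `IndexEntry`, `PostingVector`, `lookup`) are hypothetical, list positions stand in for the unique addresses and pointers, and the weight values are invented for the example; the claim does not prescribe any of these choices.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PostingVector:
    # Parallel lists: one entropy weight value per document identification number.
    doc_ids: List[int]
    weights: List[float]


@dataclass
class IndexEntry:
    entropy_weight: float  # weight of the listed letter n-gram
    segment_count: int     # number of source segments containing the n-gram
    pointer: int           # position of the posting vector (assumed: a list offset)


@dataclass
class TranslationMemory:
    aligned: List[Tuple[str, str]]   # (source segment, target segment) pairs
    index: Dict[str, IndexEntry]     # inverted index keyed by letter n-gram
    postings: List[PostingVector]    # posting vector file

    def lookup(self, ngram: str) -> List[Tuple[str, str, float]]:
        """Follow the pointer for an n-gram and return the aligned pairs it posts to."""
        entry = self.index.get(ngram)
        if entry is None:
            return []
        vector = self.postings[entry.pointer]
        return [
            (*self.aligned[doc_id], weight)
            for doc_id, weight in zip(vector.doc_ids, vector.weights)
        ]


# Tiny hand-built example; all values are illustrative only.
tm = TranslationMemory(
    aligned=[
        ("open the file", "ouvrir le fichier"),
        ("close the file", "fermer le fichier"),
    ],
    index={
        "ope": IndexEntry(entropy_weight=1.0, segment_count=1, pointer=0),
        "fil": IndexEntry(entropy_weight=0.0, segment_count=2, pointer=1),
    },
    postings=[
        PostingVector(doc_ids=[0], weights=[1.0]),
        PostingVector(doc_ids=[0, 1], weights=[0.0, 0.0]),
    ],
)
```

Calling `tm.lookup("ope")` follows the index entry's pointer to its posting vector and returns the single aligned pair whose source segment contains that trigram, together with the stored weight.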
10. A method for translating a source language to a target language, the method comprising the steps of:
providing a weighted letter n-gram file for the source language;
providing an aligned file comprising a plurality of source language text strings, each source language text string paired with a target language document;
creating an inverted index from the aligned file, the inverted index comprising a listing of unique letter n-grams appearing in the aligned file wherein each listing of a unique letter n-gram in the inverted index is associated with a set of document identifications that point to source language text strings in the aligned file that contain the associated letter n-gram; and
retrieving the target language corresponding to the pointed source language.
16. A method for analyzing language comprising the steps of:
providing a plurality of text strings to be analyzed;
identifying the text codeset applying to the plurality of text strings;
selecting each of the text strings in turn;
tokenizing the selected text string to determine a set of letter n-grams appearing in the selected text string;
converting each letter n-gram to codeset values;
defining a set of unique letter n-grams by adding each letter n-gram determined in the tokenizing step to the set of unique letter n-grams;
computing a frequency of occurrence of each letter n-gram in the selected text sequence;
calculating an entropy weighting for each letter n-gram in the set of unique letter n-grams;
filtering the set of unique letter n-grams to remove letter n-grams having an entropy weight below a preselected threshold;
pairing each letter n-gram remaining in the set of unique letter n-grams with the calculated entropy weight; and
saving the letter n-gram:entropy weight pairs to a weighted n-gram file.
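The tokenizing, frequency counting, weighting, and filtering steps of claim 16 can be sketched in Python as follows. The claim does not state a formula for the entropy weight, so this sketch assumes the log-entropy weighting commonly used in information retrieval, 1 + Σ p·log(p) / log(N), where p is the share of an n-gram's occurrences that fall in one text string and N is the number of text strings. The function names, the trigram length, and the default threshold are likewise hypothetical. Python strings are already sequences of Unicode code points, so the codeset conversion step is implicit here, and writing the pairs to a file is left out.

```python
import math
from collections import Counter
from typing import Dict, List


def letter_ngrams(text: str, n: int = 3) -> List[str]:
    """Tokenize a text string into overlapping letter n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def entropy_weights(segments: List[str], n: int = 3,
                    threshold: float = 0.0) -> Dict[str, float]:
    """Return {n-gram: weight} for n-grams whose weight is >= threshold.

    Assumed weighting (not taken from the claim): 1 - H / log(N), where H is
    the entropy of the n-gram's distribution over the N text strings.
    """
    # Frequency of occurrence of each n-gram within each text string.
    per_segment = [Counter(letter_ngrams(s, n)) for s in segments]

    # Set of unique n-grams with their total frequency over all strings.
    totals: Counter = Counter()
    for counts in per_segment:
        totals.update(counts)

    num_segments = len(segments)
    weights: Dict[str, float] = {}
    for gram, total in totals.items():
        entropy = 0.0
        for counts in per_segment:
            if gram in counts:
                p = counts[gram] / total
                entropy -= p * math.log(p)
        # With a single text string the normalizer log(1) is zero; use weight 1.
        weight = 1.0 - entropy / math.log(num_segments) if num_segments > 1 else 1.0
        if weight >= threshold:  # filter out low-weight n-grams
            weights[gram] = weight
    return weights
```

Under this assumed formula, an n-gram confined to one text string gets weight 1, while an n-gram spread evenly over every string gets weight 0 and is the first to be removed as the threshold rises.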
17. A method for creating an inverted index to a file containing a plurality of text strings comprising the steps of:
creating a list of letter n-grams wherein each listed letter n-gram has an entropy weight in a preselected threshold range;
for each listed letter n-gram, determining a set of the plurality of text strings in which the letter n-gram appears.
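A minimal Python sketch of the two steps in claim 17, assuming the entropy weights have already been computed and are supplied as a dictionary. The function name `build_inverted_index`, the `low` and `high` bounds standing in for the preselected threshold range, and the use of list positions as text string identifiers are all illustrative assumptions.

```python
from typing import Dict, List, Set


def build_inverted_index(segments: List[str], weights: Dict[str, float],
                         low: float, high: float, n: int = 3) -> Dict[str, Set[int]]:
    """Map each listed n-gram to the set of text strings in which it appears."""
    # Step 1: list only the n-grams whose weight lies in the threshold range.
    index: Dict[str, Set[int]] = {
        gram: set() for gram, w in weights.items() if low <= w <= high
    }
    # Step 2: for each listed n-gram, collect the strings that contain it.
    for doc_id, segment in enumerate(segments):
        for i in range(len(segment) - n + 1):
            gram = segment[i:i + n]
            if gram in index:
                index[gram].add(doc_id)
    return index
```

For example, with the strings `["open the file", "close the file"]`, weights `{"ope": 1.0, "clo": 1.0, "the": 0.0}`, and a range of 0.5 to 1.0, the trigram `"the"` is never listed, so the index holds entries only for `"ope"` and `"clo"`.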
18. A method for retrieving a target subset of text strings from a plurality of target text strings comprising the steps of:
inputting a source text query;
tokenizing the source text query to determine a set of letter n-grams appearing in the text query;
pairing each letter n-gram in the tokenized text query with an entropy weight;
filtering the paired tokenized text query to remove letter n-grams having an entropy weight below a preselected threshold; and
using an inverted index to the plurality of target text strings to determine the target subset of the plurality of text strings in which any of the paired letter n-grams appear.
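The retrieval steps of claim 18 might look like the following in Python. This is a sketch under stated assumptions: the function name `retrieve`, the trigram length, and the default threshold are hypothetical; the weights and inverted index are passed in as plain dictionaries; and query n-grams with no stored weight are simply dropped, which the claim does not specify.

```python
from typing import Dict, Set


def retrieve(query: str, weights: Dict[str, float], index: Dict[str, Set[int]],
             threshold: float = 0.5, n: int = 3) -> Set[int]:
    """Return identifiers of text strings containing any surviving query n-gram."""
    # Tokenize the query into its set of letter n-grams.
    grams = {query[i:i + n] for i in range(len(query) - n + 1)}
    # Pair each n-gram with its entropy weight (assumption: unknown n-grams are dropped).
    paired = {g: weights[g] for g in grams if g in weights}
    # Filter out n-grams whose weight is below the threshold.
    kept = [g for g, w in paired.items() if w >= threshold]
    # Union the posting sets of the remaining n-grams.
    hits: Set[int] = set()
    for g in kept:
        hits |= index.get(g, set())
    return hits
```

Because the result is a plain union, any single surviving n-gram is enough to select a text string; ranking the hits, for instance by summed weight, would be a separate step that this sketch does not attempt.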
Specification