Machine assisted translation tools utilizing an inverted index and list of letter n-grams

US 6,131,082 A
Filed: 12/01/1997
Issued: 10/10/2000
Est. Priority Date: 06/07/1995
Status: Expired due to Term

10 Assignments

0 Petitions

Abstract

A translation memory for computer assisted translation based upon an aligned file having a number of source language text strings paired with target language text strings. A posting vector file includes a posting vector associated with each source language text string in the aligned file. Each posting vector includes a document identification number corresponding to a selected one of the source language text strings in the aligned file and a number of entropy weight values, each of the number of weight values corresponding to a unique letter n-gram that appears in the selected source language text string. Preferably, the translation memory further includes an inverted index comprising a listing of source language letter n-grams and a pointer to each of the posting vectors including an entry for the listed letter n-gram.
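
The layout described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions: the class name `PostingVector`, the helper `targets_for`, the toy strings, and the weight values are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass
class PostingVector:
    doc_id: int    # identifies one source string in the aligned file
    weights: dict  # letter n-gram -> entropy weight for that string


# Aligned file: source strings paired with target strings (toy data).
aligned = [
    ("open file", "ouvrir le fichier"),
    ("close file", "fermer le fichier"),
]

# One posting vector per source string; the weight values are made up.
postings = [
    PostingVector(0, {"ope": 0.9, "fil": 0.2}),
    PostingVector(1, {"clo": 0.8, "fil": 0.2}),
]

# Inverted index: letter n-gram -> positions of the posting vectors
# that contain an entry for it.
inverted = {}
for pos, vec in enumerate(postings):
    for gram in vec.weights:
        inverted.setdefault(gram, []).append(pos)


def targets_for(gram):
    """Follow index -> posting vector -> aligned pair for one n-gram."""
    return [aligned[postings[pos].doc_id][1] for pos in inverted.get(gram, [])]
```

In this toy instance the n-gram "fil" leads to both target strings, while "ope" leads only to the first.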

270 Citations

18 Claims

1. A translation memory comprising:
- an aligned file having a number of source language text segments encoded in a computer readable format, each of the source language text segments positioned at a unique address and paired with a target language text segment encoded in the computer readable format;
  
  an inverted index comprising a listing of source language letter n-grams, wherein each listed letter n-gram includes an associated entry for an entropy weight for the listed letter n-gram, a count of the number of source language text segments in the aligned file that include an entry for the listed letter n-gram, and a pointer to a unique location in the translation memory; and
  
  a posting vector file having a posting vector associated with each listed letter n-gram in the inverted index, each posting vector positioned at one of the unique locations pointed to in the inverted index, each posting vector including:
  
  i) a plurality of document identification numbers each corresponding to a selected one of the source language text strings in the aligned file, and
  
  ii) a number of entropy weight values, each of the number of entropy weight values associated with one document identification number.
  - 2. The translation memory of claim 1 further comprising:
    - a correlation file having a listing of each of the unique addresses for the source language text strings, wherein each of the unique addresses in the correlation file is identified with one document identification number.
  - 3. The translation memory of claim 1 wherein the letter n-grams have a length in the range of two to three source language characters.
  - 4. The translation memory of claim 1 wherein the entropy weight is calculated from:
    - [Equation EQU3 is not reproduced in this text], where:
      
      Entropy_i = entropy weight for a letter n-gram i;
      
      freq_i = frequency of term i in text segment k;
      
      tfreq_i = total frequency of term i in all text segments; and
      
      N = total number of text segments.
  - 5. The translation memory of claim 1 wherein the computer readable data is compressed.
  - 6. The translation memory of claim 1 wherein the postings vector file is compressed using sparse vector coding.
  - 7. The translation memory of claim 1 wherein the entropy weights in each posting vector are normalized.
  - 8. The translation memory of claim 1 wherein the listing of letter n-grams is provided in Unicode format.
  - 9. The translation memory of claim 1 wherein each of the listings of letter n-grams include only letter n-grams having an entropy value within a predetermined range.
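
Claim 4's equation is not reproduced in this text, so the sketch below assumes the widely used log-entropy weighting, Entropy_i = 1 + (sum over k of p_ik * log2(p_ik)) / log2(N), with p_ik = freq_i in segment k divided by tfreq_i. That formula is an assumption chosen because it fits the listed variables; it may differ from the patent's own equation. The function names `letter_ngrams`, `entropy_weights`, and `normalize` are hypothetical, and the unit-length scaling is only one possible reading of claim 7.

```python
import math
from collections import Counter


def letter_ngrams(text, n=3):
    """Overlapping letter n-grams of length n (claim 3 allows two to three)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def entropy_weights(segments, n=3):
    """Assumed log-entropy weight: 1 + sum_k p_ik*log2(p_ik) / log2(N)."""
    if len(segments) < 2:
        raise ValueError("need at least two segments so that log2(N) > 0")
    per_segment = [Counter(letter_ngrams(s, n)) for s in segments]
    total = Counter()
    for counts in per_segment:
        total.update(counts)
    log_n = math.log2(len(segments))
    weights = {}
    for gram, tfreq in total.items():
        acc = 0.0
        for counts in per_segment:
            freq = counts.get(gram, 0)
            if freq:
                p = freq / tfreq
                acc += p * math.log2(p)
        weights[gram] = 1.0 + acc / log_n
    return weights


def normalize(vector):
    """Scale a {n-gram: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {g: w / norm for g, w in vector.items()} if norm else dict(vector)
```

Under this assumed formula, an n-gram spread evenly over every segment gets a weight near 0, and an n-gram confined to a single segment gets a weight of 1.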

10. A method for translating a source language to a target language, the method comprising the steps of:
- providing a weighted letter n-gram file for the source language;
  
  providing an aligned file comprising a plurality of source language text strings, each source language text string paired with a target language document;
  
  creating an inverted index from the aligned file, the inverted index comprising a listing of unique letter n-grams appearing in the aligned file wherein each listing of a unique letter n-gram in the inverted index is associated with a set of document identifications that point to source language text strings in the aligned file that contain the associated letter n-gram; and
  
  retrieving the target language corresponding to the pointed source language.
  - 11. The method of claim 10 wherein the step of providing a weighted letter n-gram file for the source language comprises the steps of:
    - providing a quantity of source language text;
      
      tokenizing the quantity of source language text to identify a set of unique letter n-grams in the source language;
      
      calculating an entropy weight for each of the identified unique letter n-grams in the source language;
      
      filtering the set of unique letter n-grams to remove letter n-grams having an entropy weight below a preselected threshold weight;
      
      filtering the set of unique letter n-grams to remove letter n-grams having a frequency of occurrence below a preselected threshold frequency; and
      
      saving the filtered set of unique letter n-grams with the associated entropy weight for each letter n-gram to the weighted letter n-gram file for the source language.
  - 12. The method of claim 11 further comprising the step of converting each unique letter n-gram in the tokenized source language text to Unicode before the step of calculating.
  - 13. The method of claim 10 wherein the step of creating an inverted index comprises:
    - selecting each source language document from the aligned file in turn;
      
      tokenizing the selected source language document to determine each letter n-gram contained therein;
      
      filtering the tokenized document to remove letter n-grams not appearing in the weighted n-gram file for the source language; and
      
      pairing each letter n-gram remaining in the selected document after the filtering with its associated entropy weight from the weighted letter n-gram file for the source language.
  - 14. The method of claim 11 further comprising the step of converting each n-gram in the tokenized document to Unicode before filtering.
  - 15. The method of claim 11 wherein the step of creating an inverted index from the document vectors comprises applying a FAST-INV algorithm to the document vectors for each source language document.
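
The steps of claim 13 (tokenize, filter against the weighted n-gram file, pair with weights) followed by an inversion can be sketched as below. This is a minimal illustration, not the patented implementation: the inversion is a plain in-memory dictionary pass rather than the FAST-INV algorithm named in claim 15, and the function names and the `weighted_ngrams` dictionary (standing in for the weighted letter n-gram file) are assumptions.

```python
def letter_ngrams(text, n=3):
    """Set of overlapping letter n-grams in one source string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def document_vectors(aligned, weighted_ngrams, n=3):
    """Tokenize each source string, drop unlisted n-grams, attach weights."""
    vectors = []
    for doc_id, (source, _target) in enumerate(aligned):
        kept = {
            gram: weighted_ngrams[gram]
            for gram in letter_ngrams(source, n)
            if gram in weighted_ngrams
        }
        vectors.append((doc_id, kept))
    return vectors


def invert(vectors):
    """Plain in-memory inversion (not FAST-INV): n-gram -> [(doc_id, weight)]."""
    index = {}
    for doc_id, kept in vectors:
        for gram, weight in kept.items():
            index.setdefault(gram, []).append((doc_id, weight))
    return index
```

Because documents are visited in order, each n-gram's list comes out sorted by document identification number, which keeps later lookups and merges simple.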

16. A method for analyzing language comprising the steps of:
- providing a plurality of text strings to be analyzed;
  
  identifying the text codeset applying to the plurality of text strings;
  
  selecting each of the text strings in turn;
  
  tokenizing the selected text string to determine a set of letter n-grams appearing in the selected text string;
  
  converting each letter n-gram to codeset values;
  
  defining a set of unique letter n-grams by adding each letter n-gram determined in the tokenizing step to the set of unique letter n-grams;
  
  computing a frequency of occurrence of each letter n-gram in the selected text sequence;
  
  calculating an entropy weighting for each letter n-gram in the set of unique letter n-grams;
  
  filtering the set of unique letter n-grams to remove letter n-grams having an entropy weight below a preselected threshold;
  
  pairing each letter n-gram remaining in the set of unique letter n-grams with the calculated entropy weight; and
  
  saving the letter n-gram:entropy weight pairs to a weighted n-gram file.
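
The tokenizing, codeset conversion, frequency counting, and saving steps of claim 16 might look like the sketch below. It assumes Unicode code points as the codeset and a one-pair-per-line text layout for the weighted n-gram file; both choices, and every function name, are hypothetical. The entropy calculation itself is left out, so `save_pairs` takes already computed weights.

```python
import io
from collections import Counter


def to_codeset(gram):
    """Represent a letter n-gram by its Unicode code points (assumed codeset)."""
    return tuple(ord(ch) for ch in gram)


def tokenize(text, n=2):
    """Letter n-grams of one text string, already converted to code points."""
    return [to_codeset(text[i:i + n]) for i in range(len(text) - n + 1)]


def count_ngrams(strings, n=2):
    """Return (set of unique n-grams, list of per-string frequency counters)."""
    unique, per_string = set(), []
    for text in strings:
        counts = Counter(tokenize(text, n))
        unique.update(counts)
        per_string.append(counts)
    return unique, per_string


def save_pairs(weights, threshold, out):
    """Write 'codepoints:weight' lines for n-grams at or above the threshold."""
    for gram in sorted(weights):
        if weights[gram] >= threshold:
            codes = " ".join(f"U+{c:04X}" for c in gram)
            out.write(f"{codes}:{weights[gram]:.4f}\n")
```

Here `out` can be any writable text stream, such as an open file or an `io.StringIO` buffer.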

17. A method for creating an inverted index to a file containing a plurality of text strings comprising the steps of:
- creating a list of letter n-grams wherein each listed letter n-gram has an entropy weight in a preselected threshold range;
  
  for each listed letter n-gram, determining a set of the plurality of text strings in which the letter n-gram appears.

18. A method for retrieving a target subset of text strings from a plurality of target text strings comprising the steps of:
- inputting a source text query;
  
  tokenizing the source text query to determine a set of letter n-grams appearing in the text query;
  
  pairing each letter n-gram in the tokenized text query with an entropy weight;
  
  filtering the paired tokenized text query to remove letter n-grams having an entropy weight below a preselected threshold; and
  
  using an inverted index to the plurality of target text strings to determine the target subset of the plurality of text strings in which any of the paired letter n-grams appear.
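
The retrieval steps of claim 18 can be sketched as follows. The claim only requires finding the text strings in which any of the kept letter n-grams appear; ranking the hits by summed weight is an added assumption of this sketch, as are the function names and the shape of `weights` (n-gram to entropy weight) and `index` (n-gram to list of document identification numbers).

```python
def letter_ngrams(text, n=3):
    """Set of overlapping letter n-grams in the query text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def retrieve(query, weights, index, threshold=0.0, n=3):
    """Return doc ids sharing any kept n-gram with the query, best first."""
    # Pair each query n-gram with its entropy weight; unknown n-grams drop out.
    paired = {g: weights[g] for g in letter_ngrams(query, n) if g in weights}
    # Remove n-grams whose weight falls below the threshold.
    kept = {g: w for g, w in paired.items() if w >= threshold}
    # Look up each remaining n-gram in the inverted index and sum the weights.
    scores = {}
    for gram, weight in kept.items():
        for doc_id in index.get(gram, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Raising `threshold` shrinks the candidate set, because low-weight n-grams that occur in many strings no longer pull those strings in.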

Current Assignee: Lionbridge US, Inc. (Lionbridge Technologies Incorporated)
Original Assignee: Lionbridge US, Inc. (Lionbridge Technologies Incorporated)
Inventors: Savourel, Yves I.; Hargrave, James E. III
Primary Examiner(s): Thomas, Joseph
Application Number: US08/980,630
Time in Patent Office: 1,044 Days
Field of Search: 704/2, 704/7, 704/3, 704/4, 704/1, 704/9, 704/10, 707/109, 707/530, 707/531, 707/532, 707/533, 709/257, 709/270, 709/277, 382/228, 382/229, 382/230, 382/231
US Class Current: 704/7
CPC Class Codes:
G06F 40/45   Example-based machine trans...
G06F 40/47   Machine-assisted translatio...
Y10S 707/99935   Query augmenting and refini...
