Computerized cross-language document retrieval using latent semantic indexing
Abstract
A methodology for retrieving textual data objects in a multiplicity of languages is disclosed. The data objects are treated in the statistical domain by presuming that there is an underlying, latent semantic structure in the usage of words in each language under consideration. Estimates to this latent structure are utilized to represent and retrieve objects. A user query is recouched in the new statistical domain and then processed in the computer system to extract the underlying meaning to respond to the query.
674 Citations
8 Claims
-
1. A multi-language information retrieval method for operating a computer system, including an information file of stored data objects, to retrieve selected data objects based on a user query, the method comprising the steps of
selecting a set of training data objects from the stored data objects, said set of training data objects selected to satisfy predetermined retrieval criteria,
translating each of said data objects in said set of training data objects into multiple languages to produce multiple translations and to generate a set of multi-language training data objects corresponding to said set of training data objects, and storing said translations corresponding to each of said multi-language training data objects in the information file,
for each of said multi-language training data objects, merging all of said translations into a single merged data object composed of terms contained in all of said translations, thereby generating a set of merged data objects corresponding to said set of multi-language training data objects,
parsing each said merged data object to extract distinct ones of said terms and generating a lexicon database from said distinct terms,
generating a joint term-by-data object matrix by processing said translations as stored in the information file, wherein said matrix has t rows in correspondence to said distinct terms in said lexicon database and d columns in correspondence to the number of said merged data objects in said set of merged data objects, and wherein each (i,j) cell of said matrix registers a tabulation of the occurrence of the ith distinct term in the jth merged data object,
decomposing said matrix into a reduced singular value representation composed of a distinct term file and a data object file to create a semantic space,
generating a pseudo-object, in response to the user query, by parsing the user query to obtain query terms and applying a given mathematical algorithm to said distinct terms and said query terms, and inserting said pseudo-object into said semantic space,
examining the similarity between said pseudo-object and the stored data objects in said semantic space to generate the selected data objects corresponding to said pseudo-object, and
generating a report of the selected data objects.
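The merging, parsing, and matrix-generation steps of claim 1 can be illustrated with a minimal Python sketch. The toy sentences, the whitespace tokenizer, and the names `training_objects`, `merge_translations`, and `build_term_matrix` are assumptions made for illustration; the claim does not prescribe any of them.

```python
from collections import Counter

# Hypothetical toy training set: each training object is stored with its translations.
training_objects = [
    {"en": "the dog barks", "fr": "le chien aboie"},
    {"en": "the cat sleeps", "fr": "le chat dort"},
    {"en": "the dog sleeps", "fr": "le chien dort"},
]

def merge_translations(obj):
    """Merge all translations of one object into a single list of terms."""
    terms = []
    for lang in sorted(obj):  # fixed language order, for reproducibility only
        terms.extend(obj[lang].lower().split())  # assumed tokenizer: whitespace split
    return terms

def build_term_matrix(objects):
    """Return (lexicon, matrix); matrix[i][j] counts term i in merged object j."""
    merged = [merge_translations(o) for o in objects]
    lexicon = sorted({term for doc in merged for term in doc})  # distinct terms
    counts = [Counter(doc) for doc in merged]
    matrix = [[c[term] for c in counts] for term in lexicon]    # t rows, d columns
    return lexicon, matrix

lexicon, matrix = build_term_matrix(training_objects)
```

In this toy data the rows for "dog" and "chien" come out identical because the two words always occur in the same merged objects. That co-occurrence is the signal the later decomposition step relies on to place translations near one another.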
-
7. A method for retrieving information from a multi-language information file stored in a computer system based on a user query, the file including stored data objects, the method comprising the steps of
selecting a set of training data objects from the stored data objects, said set of training data objects selected to satisfy predetermined retrieval criteria,
translating each of said data objects in said set of training data objects into multiple languages to produce multiple translations and to generate a set of multi-language training data objects corresponding to said set of training data objects, and storing said translations corresponding to each of said multi-language training data objects in the information file,
for each of said multi-language training data objects, merging all of said translations into a single merged data object composed of terms contained in all of said translations, thereby generating a set of merged data objects corresponding to said set of multi-language training data objects,
parsing each said merged data object to extract distinct ones of said terms and generating a lexicon database from said distinct terms,
generating a joint term-by-data object matrix by processing said translations as stored in the information file, wherein said matrix has t rows in correspondence to said distinct terms in said lexicon database and d columns in correspondence to the number of said merged data objects in said set of merged data objects, and wherein each (i,j) cell of said matrix registers a tabulation of the occurrence of the ith distinct term in the jth merged data object,
decomposing said matrix into a reduced singular value representation composed of a distinct term file and a data object file to create a semantic space,
folding into said semantic space other data objects excluded from said set of training data objects by parsing each of said other data objects to obtain data object query terms and applying a mathematical transformation to said data object query terms and said distinct terms to create an augmented semantic space to serve as said semantic space,
generating a pseudo-object, in response to the user query, by parsing the user query to obtain query terms and applying a given mathematical algorithm to said distinct terms and said query terms, and inserting said pseudo-object into said augmented semantic space,
examining the similarity between said pseudo-object and the stored data objects in said augmented semantic space to generate the selected data objects corresponding to said pseudo-object, and
generating a report of the selected data objects.
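The decomposition, folding-in, and similarity steps of claim 7 can be sketched with NumPy. The claim refers only to "a mathematical transformation" and "a given mathematical algorithm"; the sketch assumes the conventional latent semantic indexing projection, in which a term-count vector q is mapped to q^T U_k S_k^{-1}, and it assumes cosine similarity for the comparison. The toy matrix, the value k = 2, and the names `fold_in` and `cosine` are hypothetical.

```python
import numpy as np

# Hypothetical joint term-by-object matrix: t = 6 terms, d = 3 merged objects.
lexicon = ["cat", "chat", "chien", "dog", "dort", "sleeps"]
X = np.array([
    [0, 1, 0],   # cat
    [0, 1, 0],   # chat
    [1, 0, 1],   # chien
    [1, 0, 1],   # dog
    [0, 1, 1],   # dort
    [0, 1, 1],   # sleeps
], dtype=float)

k = 2  # number of singular values retained (an assumed choice)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]      # reduced term file and singular values
doc_vectors = Vt[:k, :].T       # reduced data object file, one row per object

def fold_in(text):
    """Project a bag of terms into the k-dimensional space as q^T U_k S_k^{-1}."""
    tokens = text.lower().split()
    counts = np.array([tokens.count(term) for term in lexicon], dtype=float)
    return counts @ U_k / s_k

def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Fold in an object excluded from training (French only) to augment the space.
augmented = np.vstack([doc_vectors, fold_in("chien dort")])

# Treat an English query as a pseudo-object and rank every object against it.
query = fold_in("dog sleeps")
scores = [cosine(query, vec) for vec in augmented]
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

Under these assumptions the English query and the French-only object project to the same point, because each query term has the same row in the toy matrix as its translation. That is the cross-language behavior the claim describes: no translation of the query or of the folded-in object is needed at retrieval time.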
-
8. A multi-language information retrieval method for operating a computer system, including an information file of stored data objects, to retrieve selected data objects based on a user query, the method comprising the steps of
selecting a set of training data objects from the stored data objects, said set of training data objects selected to satisfy predetermined retrieval criteria,
translating each of said data objects in said set of training data objects into multiple languages to produce multiple translations and to generate a set of multi-language training data objects corresponding to said translations corresponding to each of said training data objects, and storing said set of multi-language training data objects in the information file,
for each of said multi-language training data objects, merging all of said translations into a single merged data object composed of terms contained in all of said translations, thereby generating a set of merged data objects corresponding to said set of multi-language training data objects,
parsing each said merged data object to extract distinct ones of said terms and generating a lexicon database from said distinct terms,
generating a joint term-by-data object matrix by processing said translations as stored in the information file, wherein said matrix has t rows in correspondence to said distinct terms in said lexicon database and d columns in correspondence to the number of said merged data objects in said set of merged data objects, and wherein each (i,j) cell of said matrix registers a tabulation of the occurrence of the ith distinct term in the jth merged data object,
decomposing said matrix into a reduced singular value representation composed of a distinct term file and a data object file to create a semantic space,
generating a pseudo-object, in response to the user query, by parsing the user query to obtain query terms and applying a given mathematical algorithm to said distinct terms and said query terms, and inserting said pseudo-object into said semantic space,
examining the similarity between said pseudo-object and the stored data objects in said semantic space to generate the selected data objects corresponding to said pseudo-object,
processing the selected data objects to produce a coded representation of the selected data objects and storing said coded representation in the computer system in a form accessible by the user for later recall so that the user query requires no repetition, and
generating a report of the selected data objects.
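The additional step in claim 8, storing a coded representation of the selected data objects for later recall, might look like the following sketch. The claim does not say what the coded representation is; here it is assumed to be a list of object identifiers kept in a JSON file keyed by the query text. The function names `save_results` and `recall_results` and the file layout are hypothetical.

```python
import json
from pathlib import Path

def save_results(store_path, query, selected_ids):
    """Record the identifiers selected for a query so they can be recalled later."""
    path = Path(store_path)
    store = json.loads(path.read_text()) if path.exists() else {}
    store[query] = list(selected_ids)  # assumed coded form: a list of identifiers
    path.write_text(json.dumps(store, indent=2))

def recall_results(store_path, query):
    """Return the stored identifiers for a query, or None if none were stored."""
    path = Path(store_path)
    if not path.exists():
        return None
    return json.loads(path.read_text()).get(query)
```

A caller would check `recall_results` before running the retrieval steps again, and call `save_results` once the selected data objects are known.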
Specification