Extended functionality for an inverse inference engine based web search
First Claim
1. A multi-language information retrieval method for retrieving information from a plurality of target documents using at least one reference document, the target documents and at least one reference document stored as electronic information files in a computer system, comprising:
generating a term-document matrix to represent the electronic information files, each element in the term-document matrix indicating a measure of a number of occurrences of a term within a respective one of the electronic information files, the term-document matrix including a first partition of entries that represent a first version of the at least one reference document comprising content in a first natural language and a second version of the at least one reference document comprising content in a second natural language such that the first and second versions of the reference document can be used to semantically link documents between the first and second natural languages, the term-document matrix including a second partition of entries that represent the target documents, the target documents comprising content in the first natural language or the second natural language;
generating a term-spread matrix that is a weighted autocorrelation of the generated term-document matrix, the term-spread matrix indicating an amount of variation in term usage in the information files and an extent to which terms are correlated;
receiving a query consisting of at least one term;
in response to receiving the query, generating a query vector having as many elements as rows of the generated term-spread matrix;
formulating, based upon the generated term-spread matrix and query vector, a constrained optimization problem description for determining a degree of correlation between the query vector and the target documents, wherein the choice of a stabilization parameter determines the extent of a trade-off between a degree of fit and stability of all solutions to the constrained optimization problem description;
determining a solution vector to the constrained optimization problem description, the vector including a plurality of document weights, each weight corresponding to one of the target documents and reflecting a degree of correlation between the query and the corresponding target document; and
providing a response to the received query that reflects the document weights.
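The steps recited in the claim can be walked through on a toy corpus. The sketch below is only an illustration under stated assumptions, not the patented formulation: it assumes the term-spread matrix is S = D Dᵀ with uniform document weights, and it uses Tikhonov-regularized (ridge) least squares as a stand-in for the constrained optimization problem, with `lam` playing the role of the stabilization parameter. All function names are hypothetical, and the reference/target partitioning of the matrix is left out for brevity.

```python
from collections import Counter

def term_document_matrix(docs):
    """Rows are terms, columns are documents, entries are raw term counts."""
    vocab = sorted({t for d in docs for t in d.lower().split()})
    counts = [Counter(d.lower().split()) for d in docs]
    return vocab, [[c[t] for c in counts] for t in vocab]

def term_spread_matrix(D, doc_weights=None):
    """Weighted autocorrelation S = D W D^T, W diagonal (uniform by default)."""
    n_docs = len(D[0])
    w = doc_weights or [1.0] * n_docs
    return [[sum(w[k] * ri[k] * rj[k] for k in range(n_docs)) for rj in D]
            for ri in D]

def query_vector(vocab, query):
    """One element per row of the term-spread matrix."""
    counts = Counter(query.lower().split())
    return [float(counts[t]) for t in vocab]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def document_weights(D, S, q, lam):
    """Assumed stand-in: w = D^T (S + lam*I)^-1 q.

    With S = D D^T this equals the ridge solution of
    min ||D w - q||^2 + lam * ||w||^2 (fit versus stability).
    """
    n_terms = len(S)
    A = [[S[i][j] + (lam if i == j else 0.0) for j in range(n_terms)]
         for i in range(n_terms)]
    c = solve(A, q)
    return [sum(D[i][k] * c[i] for i in range(n_terms))
            for k in range(len(D[0]))]

docs = ["cat cat milk", "dog bone", "cat dog"]   # toy corpus (assumption)
vocab, D = term_document_matrix(docs)
S = term_spread_matrix(D)
q = query_vector(vocab, "cat")
w = document_weights(D, S, q, lam=1.0)
ranking = sorted(range(len(docs)), key=lambda k: -w[k])
print(ranking, [round(x, 3) for x in w])
```

In this toy run the document that mentions "cat" most often ranks first. Raising `lam` shrinks the weight vector (more stable) at the cost of a larger residual (worse fit), which is the trade-off the claim attributes to the stabilization parameter.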
Abstract
An extension of an inverse inference search engine is disclosed that provides cross-language document retrieval. The information matrix used as input to the inverse inference engine is organized into rows of blocks corresponding to languages within a predetermined set of natural languages, and is further organized into two column-wise partitions. The first partition consists of blocks of entries representing fully translated documents, while the second partition is a matrix of blocks of entries representing documents for which translations are not available in all of the predetermined languages. In the second partition, entries in blocks outside the main diagonal of blocks are zero. Another disclosed extension to the inverse inference document retrieval system supports automatic, knowledge-based training. This approach applies the idea of a training set to the problem of searching databases in which information is diluted or not reliable enough to allow the creation of robust semantic links. To address this situation, the disclosed system loads the left-hand partition of the input matrix for the inverse inference engine with information from reliable sources.
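The block layout described in the abstract can be made concrete with a small sketch. It assumes whitespace tokenization, raw term counts, and one vocabulary per language, so each language gets its own block of rows; the function name `build_information_matrix` and the sample texts are hypothetical. Fully translated documents fill every language block in the left partition, while each untranslated document only touches the row block of its own language, so the off-diagonal blocks of the right partition stay zero.

```python
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_information_matrix(languages, translated_docs, monolingual_docs):
    """Return (matrix, row_labels, col_labels); rows are (language, term)."""
    # One vocabulary per language, so each language owns a block of rows.
    vocab = {lang: [] for lang in languages}

    def register(lang, text):
        for term in tokenize(text):
            if term not in vocab[lang]:
                vocab[lang].append(term)

    for doc in translated_docs:
        for lang in languages:
            register(lang, doc[lang])
    for lang in languages:
        for text in monolingual_docs.get(lang, []):
            register(lang, text)

    row_labels = [(lang, term) for lang in languages for term in vocab[lang]]
    row_index = {label: i for i, label in enumerate(row_labels)}

    def column(texts_by_lang):
        col = [0] * len(row_labels)
        for lang, text in texts_by_lang.items():
            for term, n in Counter(tokenize(text)).items():
                col[row_index[(lang, term)]] = n
        return col

    columns, col_labels = [], []
    # Left partition: documents available in every language.
    for doc in translated_docs:
        columns.append(column({lang: doc[lang] for lang in languages}))
        col_labels.append("translated")
    # Right partition: one column block per language, block-diagonal.
    for lang in languages:
        for text in monolingual_docs.get(lang, []):
            columns.append(column({lang: text}))
            col_labels.append(lang)

    matrix = [[col[i] for col in columns] for i in range(len(row_labels))]
    return matrix, row_labels, col_labels

languages = ["en", "es"]
translated = [{"en": "cat milk", "es": "gato leche"}]      # sample data
monolingual = {"en": ["cat cat"], "es": ["perro hueso"]}   # sample data
matrix, row_labels, col_labels = build_information_matrix(
    languages, translated, monolingual)
for label, row in zip(row_labels, matrix):
    print(label, row)
```

Keying rows by (language, term) keeps a string that happens to occur in two languages from being merged into one row; whether the disclosed system does the same is not stated in the abstract.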
36 Claims
13. A computer-readable memory medium containing instructions that control a computer processor to retrieve information from a plurality of target documents using at least one reference document, the target documents and at least one reference document stored as electronic information files in a computer system, by:
generating a term-document matrix to represent the electronic information files, each element in the term-document matrix indicating a measure of a number of occurrences of a term within a respective one of the electronic information files, the term-document matrix including a first partition of entries that represent a first version of the at least one reference document comprising content in a first natural language and a second version of the at least one reference document comprising content in a second natural language such that the first and second versions of the reference document can be used to semantically link documents between the first and second natural languages, the term-document matrix including a second partition of entries that represent the target documents, the target documents comprising content in the first natural language or the second natural language;
generating a term-spread matrix that is a weighted autocorrelation of the generated term-document matrix, the term-spread matrix indicating an amount of variation in term usage in the information files and an extent to which terms are correlated;
receiving a query consisting of at least one term;
in response to receiving the query, generating a query vector having as many elements as rows of the generated term-spread matrix;
formulating, based upon the generated term-spread matrix and query vector, a constrained optimization problem description for determining a degree of correlation between the query vector and the target documents, wherein the choice of a stabilization parameter determines the extent of a trade-off between a degree of fit and stability of all solutions to the constrained optimization problem description;
determining a solution vector to the constrained optimization problem description, the vector including a plurality of document weights, each weight corresponding to one of the target documents and reflecting a degree of correlation between the query and the corresponding target document; and
providing a response to the received query that reflects the document weights.
25. An information retrieval system having a plurality of target documents and at least one reference document stored as electronic information files, comprising:
a memory;
an information file processing component stored on the memory that is configured to, when executed, generate a term-document matrix to represent the electronic information files, each element in the term-document matrix indicating a measure of a number of occurrences of a term within a respective one of the electronic information files, the term-document matrix including a first partition of entries that represent a first version of the at least one reference document comprising content in a first natural language and a second version of the at least one reference document comprising content in a second natural language such that the first and second versions of the reference document can be used to semantically link documents between the first and second natural languages, the term-document matrix including a second partition of entries that represent the target documents, the target documents comprising content in the first natural language or the second natural language; and generate a term-spread matrix that is a weighted autocorrelation of the generated term-document matrix, the term-spread matrix indicating an amount of variation in term usage in the information files and an extent to which terms are correlated;
a query mechanism stored on the memory that is configured to, when executed, receive a query of at least one term and to generate a query vector having as many elements as the rows of the generated term-spread matrix; and
an inverse inference engine stored on the memory that is configured to, when executed, formulate, based upon the generated term-spread matrix and the query vector, a constrained optimization problem description for determining a degree of correlation between the query vector and the target documents, wherein the choice of a stabilization parameter determines the extent of a trade-off between a degree of fit and stability of all solutions to the constrained optimization problem description; determine a solution vector to the constrained optimization problem description, the solution vector including a plurality of document weights, each weight corresponding to one of the target documents and reflecting a degree of correlation between the query and the corresponding target document; and provide a response to the received query that reflects the document weights.
Specification