Extended functionality for an inverse inference engine based web search

US 20050021517A1
Filed: 05/26/2004
Published: 01/27/2005
Est. Priority Date: 03/22/2000
Status: Active Grant

Abstract

An extension of an inverse inference search engine is disclosed that provides cross-language document retrieval. The information matrix used as input to the inverse inference engine is organized into rows of blocks corresponding to languages within a predetermined set of natural languages, and is further organized into two column-wise partitions. The first partition consists of blocks of entries representing fully translated documents, while the second partition is a matrix of blocks of entries representing documents for which translations are not available in all of the predetermined languages. In the second partition, entries in blocks outside the main diagonal of blocks are zero. Another disclosed extension to the inverse inference document retrieval system supports automatic, knowledge-based training. This approach applies the idea of a training set to the problem of searching databases in which information is diluted or not reliable enough to allow the creation of robust semantic links. To address this situation, the disclosed system loads the left-hand partition of the input matrix for the inverse inference engine with information from reliable sources.
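
The block layout described in the abstract can be pictured with a small sketch. This is only an illustration under assumptions: the two languages, the toy counts, and the helper name `build_information_matrix` are hypothetical and do not come from the patent. It stacks the translated reference blocks on the left and places the untranslated documents in a block-diagonal right-hand partition whose off-diagonal blocks are zero.

```python
import numpy as np

# Hypothetical toy counts; rows are terms, columns are documents.
# Left partition: two reference documents available in both languages.
ref_en = np.array([[2, 0],
                   [1, 1],
                   [0, 3]])      # 3 English terms x 2 reference documents
ref_fr = np.array([[1, 0],
                   [0, 2]])      # 2 French terms x the same 2 documents, translated

# Right partition: documents that exist in one language only.
only_en = np.array([[1, 0, 2],
                    [0, 1, 0],
                    [0, 0, 1]])  # 3 English terms x 3 English-only documents
only_fr = np.array([[3],
                    [1]])        # 2 French terms x 1 French-only document


def build_information_matrix(ref_blocks, mono_blocks):
    """Assemble [left | right] with one row of blocks per language."""
    left = np.vstack(ref_blocks)
    rows = []
    for i, block in enumerate(mono_blocks):
        row = []
        for j, other in enumerate(mono_blocks):
            if i == j:
                row.append(block)  # diagonal block keeps the counts
            else:
                # off-diagonal block: this language's terms never occur
                # in the other language's untranslated documents
                row.append(np.zeros((block.shape[0], other.shape[1]), dtype=block.dtype))
        rows.append(row)
    right = np.block(rows)
    return np.hstack([left, right])


info = build_information_matrix([ref_en, ref_fr], [only_en, only_fr])
print(info)
```

In this sketch the first two columns carry entries in both row blocks, which is what lets terms of the two languages become correlated, while the remaining columns each touch a single language.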

60 Claims

1-12. (Canceled)

13. A multi-language information retrieval method for retrieving information from a plurality of target documents using at least one reference document, the target documents and at least one reference document stored as electronic information files in a computer system, comprising:
- generating a term-document matrix to represent the electronic information files, each element in the term-document matrix indicating a measure of a number of occurrences of a term within a respective one of the electronic information files, the term-document matrix including a first partition of entries that represent a first version of the at least one reference document comprising content in a first natural language and a second version of the at least one reference document comprising content in a second natural language such that the first and second versions of the reference document can be used to semantically link documents between the first and second natural languages, the term-document matrix including a second partition of entries that represent the target documents, the target documents comprising content in the first natural language or the second natural language;
  
  generating a term-spread matrix that is a weighted autocorrelation of the generated term-document matrix, the term-spread matrix indicating an amount of variation in term usage in the information files and the extent to which terms are correlated;
  
  receiving a query consisting of at least one term;
  
  in response to receiving the query, generating a query vector having as many elements as rows of the generated term-spread matrix;
  
  formulating, based upon the generated term-spread matrix and query vector, a constrained optimization problem description for determining a degree of correlation between the query vector and the target documents, wherein the choice of a stabilization parameter determines the extent of a trade-off between a degree of fit and stability of all solutions to the constrained optimization problem description;
  
  determining a solution vector to the constrained optimization problem description, the vector including a plurality of document weights, each weight corresponding to one of the target documents and reflecting a degree of correlation between the query and the corresponding target document; and
  
  providing a response to the received query that reflects the document weights.
- Dependent Claims (14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)
  - 14. The method of claim 13 wherein at least one of the document weights in the determined solution vector is positive and at least one of the document weights in the determined solution vector is negative, wherein the positive document weights represent the relevance of the corresponding target documents in the first natural language to the query, and wherein absolute values of the negative document weights represent the relevance of the corresponding target documents in the second natural language to the query.
  - 15. The method of claim 13, the providing the response further comprising:
    - organizing, according to the sign of each document weight, display objects that represent the target documents that correspond to the document weights, thereby displaying the objects that represent documents comprising content in the first natural language in proximity to each other and displaying the objects that represent documents comprising content in the second natural language in proximity to each other.
  - 16. The method of claim 15, the providing the response further comprising:
    - organizing the display objects according to the absolute value of each document weight, such that the display objects are displayed in decreasing absolute value of the corresponding document weights.
  - 17. The method of claim 13 wherein each row of the term-document matrix is associated with a respective term, and wherein a first set of the rows are associated with terms in the first natural language and a second set of the rows are associated with terms in the second natural language.
  - 18. The method of claim 13 wherein the second version of the reference document comprises terms that are a translation into the second natural language of terms of the first version of the reference document.
  - 19. The method of claim 13 wherein the second version of the reference document is topically related to the first version of the reference document.
  - 20. The method of claim 19 wherein the second version of the reference document is a translation into the second natural language of the first version of the reference document comprising content in the first natural language.
  - 21. The method of claim 13 wherein the first version and the second version of the reference document are used to find semantic links from terms in the first natural language to terms in the second natural language.
  - 22. The method of claim 13, wherein the term-document matrix is one of a plurality of term-document matrices, each term-document matrix having a first partition similar to the first partition of the term-document matrix and having entries that represent content in a first natural language and content in a second natural language, each term-document matrix associated with a translation from a source language to a different target foreign language, wherein, in each term-document matrix, the first natural language comprises the source language and the second natural language comprises the target foreign natural language.
  - 23. The method of claim 13, the first partition further comprising entries that represent a third version of the at least one reference document comprising content in a third natural language, such that the first, second, and third versions of the at least one reference document can be used to semantically link documents between the first, second, and third natural languages.
  - 24. The method of claim 23 wherein the first and second versions of the at least one reference document are used to translate terms between the first and second natural language and the first and third versions of the at least one reference document are used to translate terms between the first and third natural language.
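
The steps recited in claim 13 can be sketched numerically. The following is a minimal illustration under stated assumptions, not the patented formulation: the weighted autocorrelation is taken as S = D W Dᵀ with uniform weights by default, and Tikhonov (ridge) regularization stands in for the constrained optimization problem, with `lam` playing the role of the stabilization parameter. The function names, vocabulary, and counts are hypothetical. For brevity every column is treated as a document to be weighted; in the claimed layout only the columns of the second partition would be reported.

```python
import numpy as np


def term_spread(D, doc_weights=None):
    """Weighted autocorrelation S = D W D^T (W is the identity if no weights are given)."""
    D = np.asarray(D, dtype=float)
    w = np.ones(D.shape[1]) if doc_weights is None else np.asarray(doc_weights, dtype=float)
    return (D * w) @ D.T  # scales column j of D by w[j], then multiplies by D^T


def query_vector(query_terms, vocabulary):
    """One element per row of the term-spread matrix, counting query terms."""
    q = np.zeros(len(vocabulary))
    for term in query_terms:
        if term in vocabulary:
            q[vocabulary.index(term)] += 1.0
    return q


def document_weights(D, q, lam):
    """Ridge stand-in: minimize ||D w - q||^2 + lam * ||w||^2.

    Solved through the term-spread matrix: w = D^T (S + lam I)^-1 q.
    """
    D = np.asarray(D, dtype=float)
    S = term_spread(D)
    z = np.linalg.solve(S + lam * np.eye(S.shape[0]), q)
    return D.T @ z


vocabulary = ["engine", "search", "matrix"]   # hypothetical terms (rows)
D = [[1, 0, 2],
     [0, 1, 1],
     [1, 1, 0]]                               # term counts, one column per document
q = query_vector(["search"], vocabulary)
weights = document_weights(D, q, lam=0.1)
ranking = np.argsort(-np.abs(weights))        # a response that reflects the weights
print(weights, ranking)
```

A larger `lam` shrinks the weights and makes the solution less sensitive to small changes in the matrix, at the cost of a looser fit to the query; a smaller `lam` does the opposite.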

25. A computer-readable memory medium containing instructions that control a computer processor to retrieve information from a plurality of target documents using at least one reference document, the target documents and at least one reference document stored as electronic information files in a computer system, by:
- generating a term-document matrix to represent the electronic information files, each element in the term-document matrix indicating a measure of a number of occurrences of a term within a respective one of the electronic information files, the term-document matrix including a first partition of entries that represent a first version of the at least one reference document comprising content in a first natural language and a second version of the at least one reference document comprising content in a second natural language such that the first and second versions of the reference document can be used to semantically link documents between the first and second natural languages, the term-document matrix including a second partition of entries that represent the target documents, the target documents comprising content in the first natural language or the second natural language;
  
  generating a term-spread matrix that is a weighted autocorrelation of the generated term-document matrix, the term-spread matrix indicating an amount of variation in term usage in the information files and the extent to which terms are correlated;
  
  receiving a query consisting of at least one term;
  
  in response to receiving the query, generating a query vector having as many elements as rows of the generated term-spread matrix;
  
  formulating, based upon the generated term-spread matrix and query vector, a constrained optimization problem description for determining a degree of correlation between the query vector and the target documents, wherein the choice of a stabilization parameter determines the extent of a trade-off between a degree of fit and stability of all solutions to the constrained optimization problem description;
  
  determining a solution vector to the constrained optimization problem description, the vector including a plurality of document weights, each weight corresponding to one of the target documents and reflecting a degree of correlation between the query and the corresponding target document; and
  
  providing a response to the received query that reflects the document weights.
- Dependent Claims (26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36)
  - 26. The memory medium of claim 25 wherein at least one of the document weights in the determined solution vector is positive and at least one of the document weights in the determined solution vector is negative, wherein the positive document weights represent the relevance of the corresponding target documents in the first natural language to the query, and wherein absolute values of the negative document weights represent the relevance of the corresponding target documents in the second natural language to the query.
  - 27. The memory medium of claim 25, the response further comprising:
    - organizing, according to the sign of each document weight, display objects that represent the target documents that correspond to the document weights, thereby displaying the objects that represent documents comprising content in the first natural language in proximity to each other and displaying the objects that represent documents comprising content in the second natural language in proximity to each other.
  - 28. The memory medium of claim 27, the response further comprising:
    - organizing the display objects according to the absolute value of each document weight, such that the display objects are displayed in decreasing absolute value of the corresponding document weights.
  - 29. The memory medium of claim 25 wherein each row of the term-document matrix is associated with a respective term, and wherein a first set of the rows are associated with terms in the first natural language and a second set of the rows are associated with terms in the second natural language.
  - 30. The memory medium of claim 25 wherein the second version of the reference document comprises terms that are a translation into the second natural language of terms of the first version of the reference document.
  - 31. The memory medium of claim 25 wherein the second version of the reference document is topically related to the first version of the reference document.
  - 32. The memory medium of claim 31 wherein the second version of the reference document is a translation into the second natural language of the first version of the reference document comprising content in the first natural language.
  - 33. The memory medium of claim 25 wherein the first version and the second version of the reference document are used to find semantic links from terms in the first natural language to terms in the second natural language.
  - 34. The memory medium of claim 25 wherein the term-document matrix is one of a plurality of term-document matrices, each term-document matrix having a first partition similar to the first partition of the term-document matrix and having entries that represent content in a first natural language and content in a second natural language, each term-document matrix associated with a translation from a source language to a different target foreign language, wherein, in each term-document matrix, the first natural language comprises the source language and the second natural language comprises the target foreign natural language.
  - 35. The memory medium of claim 25, the first partition further comprising entries that represent a third version of the at least one reference document comprising content in a third natural language, such that the first, second, and third versions of the at least one reference document can be used to semantically link documents between the first, second, and third natural languages.
  - 36. The memory medium of claim 35 wherein the first and second versions of the at least one reference document are used to translate terms between the first and second natural language and the first and third versions of the at least one reference document are used to translate terms between the first and third natural language.
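
The sign-based organization of results described in the dependent claims (grouping by the sign of each document weight, then ordering by decreasing absolute value) might look like the sketch below. The function name, document identifiers, and weights are hypothetical, and dropping zero-weight documents is an assumption of this example rather than something the claims state.

```python
def organize_results(doc_ids, weights):
    """Group documents by weight sign, then order each group by decreasing |weight|."""
    scored = list(zip(doc_ids, weights))
    positive = sorted((p for p in scored if p[1] > 0), key=lambda p: -abs(p[1]))
    negative = sorted((p for p in scored if p[1] < 0), key=lambda p: -abs(p[1]))
    return {
        "first_language": [doc for doc, _ in positive],   # positive weights
        "second_language": [doc for doc, _ in negative],  # negative weights
    }


groups = organize_results(["en-1", "fr-1", "en-2", "fr-2"], [0.2, -0.7, 0.9, -0.1])
print(groups)
```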

37. An information retrieval system having a plurality of target documents and at least one reference document stored as electronic information files, comprising:
- an information file processing component that is structured to generate a term-document matrix to represent the electronic information files, each element in the term-document matrix indicating a measure of a number of occurrences of a term within a respective one of the electronic information files, the term-document matrix including a first partition of entries that represent a first version of the at least one reference document comprising content in a first natural language and a second version of the at least one reference document comprising content in a second natural language such that the first and second versions of the reference document can be used to semantically link documents between the first and second natural languages, the term-document matrix including a second partition of entries that represent the target documents, the target documents comprising content in the first natural language or the second natural language; and
  
  generate a term-spread matrix that is a weighted autocorrelation of the generated term-document matrix, the term-spread matrix indicating an amount of variation in term usage in the information files and the extent to which terms are correlated;
  
  a query mechanism that is structured to receive a query of at least one term and to generate a query vector having as many elements as the rows of the generated term-spread matrix; and
  
  an inverse inference engine that is structured to formulate, based upon the generated term-spread matrix and the query vector, a constrained optimization problem description for determining a degree of correlation between the query vector and the target documents, wherein the choice of a stabilization parameter determines the extent of a trade-off between a degree of fit and stability of all solutions to the constrained optimization problem description;
  
  determine a solution vector to the constrained optimization problem description, the solution vector including a plurality of document weights, each weight corresponding to one of the target documents and reflecting a degree of correlation between the query and the corresponding target document; and
  
  provide a response to the received query that reflects the document weights.
- Dependent Claims (38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48)
  - 38. The information retrieval system of claim 37 wherein at least one of the document weights in the determined solution vector is positive and at least one of the document weights in the determined solution vector is negative, wherein the positive document weights represent the relevance of the corresponding target documents in the first natural language to the query, and wherein absolute values of the negative document weights represent the relevance of the corresponding target documents in the second natural language to the query.
  - 39. The information retrieval system of claim 37, the response further comprising:
    - display objects that each represent a target document that corresponds to one of the document weights and are organized according to the sign of each document weight, thereby causing the objects that represent documents comprising content in the first natural language to be displayed in proximity to each other and the objects that represent documents comprising content in the second natural language to be displayed in proximity to each other.
  - 40. The information retrieval system of claim 38, the objects further structured to be organized according to the absolute value of each document weight, thereby causing the objects to be displayed in decreasing absolute value of the corresponding document weights.
  - 41. The information retrieval system of claim 37 wherein each row of the term-document matrix is associated with a respective term, and wherein a first set of the rows are associated with terms in the first natural language and a second set of the rows are associated with terms in the second natural language.
  - 42. The information retrieval system of claim 37 wherein the second version of the reference document comprises terms that are a translation into the second natural language of terms of the first version of the reference document.
  - 43. The information retrieval system of claim 37 wherein the second version of the reference document is topically related to the first version of the reference document.
  - 44. The information retrieval system of claim 43 wherein the second version of the reference document is a translation into the second natural language of the first version of the reference document comprising content in the first natural language.
  - 45. The information retrieval system of claim 37 wherein the first version and the second version of the reference document are used to find semantic links from terms in the first natural language to terms in the second natural language.
  - 46. The information retrieval system of claim 37 wherein the term-document matrix is one of a plurality of term-document matrices, each term-document matrix having a first partition similar to the first partition of the term-document matrix and having entries that represent content in a first natural language and content in a second natural language, each term-document matrix associated with a translation from a source language to a different target foreign language, wherein, in each term-document matrix, the first natural language comprises the source language and the second natural language comprises the target foreign natural language.
  - 47. The information retrieval system of claim 37, the first partition further comprising entries that represent a third version of the at least one reference document comprising content in a third natural language, such that the first, second, and third versions of the at least one reference document can be used to semantically link documents between the first, second, and third natural languages.
  - 48. The information retrieval system of claim 47 wherein the first and second versions of the at least one reference document are used to translate terms between the first and second natural language and the first and third versions of the at least one reference document are used to translate terms between the first and third natural language.

49. A computer-implemented method for retrieving information from a plurality of search documents using at least one reference document, the search documents and the at least one reference document stored as electronic information files in a computer system, comprising:
- generating a term-document matrix to represent the files, each element in the term-document matrix indicating a measure of a number of occurrences of a term within a respective one of the files, the term-document matrix including a first partition of entries that represent the at least one reference document, wherein the reference document is predetermined to contain reliable information, the term-document matrix further including a second partition of entries that represent the plurality of search documents, wherein the search documents contain potentially insufficient information for establishing semantic links between terms of the search documents;
  
  generating a term-spread matrix that is a weighted autocorrelation of the term-document matrix, the term-spread matrix indicating an amount of variation in term usage in the files and the extent to which terms are correlated;
  
  receiving a query of at least one term;
  
  in response to receiving the query, generating a query vector having as many elements as the rows of the generated term-spread matrix;
  
  formulating, based upon the generated term-spread matrix and query vector, a constrained optimization problem description for determining a degree of correlation between the query vector and the search documents, wherein the choice of a stabilization parameter determines the extent of a trade-off between a degree of fit and stability of all solutions to the constrained optimization problem description;
  
  determining a solution vector to the constrained optimization problem description, the vector including a plurality of document weights, each weight corresponding to one of the search documents and reflecting a degree of correlation between the query and the corresponding search document; and
  
  providing a response to the received query that reflects the document weights.
- Dependent Claims (50, 51, 52)
  - 50. The method of claim 49, further comprising periodically accumulating information from multiple sources and adding the information to the second partition of search documents.
  - 51. The method of claim 49 wherein the at least one reference document comprises an encyclopedia.
  - 52. The method of claim 49 wherein the at least one reference document comprises a collection of news reports.
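
The role of the reliable reference partition in claim 49 can be illustrated with a toy term-spread matrix. This sketch assumes an unweighted autocorrelation S = D Dᵀ, and the vocabulary and counts are hypothetical. The search documents alone never use two related terms together, so the terms appear uncorrelated; adding one reference document that uses both creates the link.

```python
import numpy as np

vocabulary = ["car", "automobile", "banana"]   # hypothetical terms (rows)

# First partition: one reliable reference document using "car" and "automobile" together.
reference = np.array([[1],
                      [1],
                      [0]])

# Second partition: sparse search documents, each using a single term.
search = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])

D = np.hstack([reference, search])             # [reference | search]

spread_search_only = search @ search.T         # no cross-term correlation
spread_with_reference = D @ D.T                # reference document links the terms

car, automobile = vocabulary.index("car"), vocabulary.index("automobile")
print(spread_search_only[car, automobile], spread_with_reference[car, automobile])
```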

53. A computer-readable memory medium containing instructions for controlling a computer processor to retrieve information from a plurality of search documents using at least one reference document, the search documents and the at least one reference document stored as electronic information files in a computer system, by:
- generating a term-document matrix to represent the files, each element in the term-document matrix indicating a measure of a number of occurrences of a term within a respective one of the files, the term-document matrix including a first partition of entries that represent the at least one reference document, wherein the reference document is predetermined to contain reliable information, the term-document matrix further including a second partition of entries that represent the plurality of search documents, wherein the search documents contain potentially insufficient information for establishing semantic links between terms of the search documents;
  
  generating a term-spread matrix that is a weighted autocorrelation of the term-document matrix, the term-spread matrix indicating an amount of variation in term usage in the files and the extent to which terms are correlated;
  
  receiving a query of at least one term;
  
  in response to receiving the query, generating a query vector having as many elements as the rows of the generated term-spread matrix;
  
  formulating, based upon the generated term-spread matrix and query vector, a constrained optimization problem description for determining a degree of correlation between the query vector and the search documents, wherein the choice of a stabilization parameter determines the extent of a trade-off between a degree of fit and stability of all solutions to the constrained optimization problem description;
  
  determining a solution vector to the constrained optimization problem description, the vector including a plurality of document weights, each weight corresponding to one of the search documents and reflecting a degree of correlation between the query and the corresponding search document; and
  
  providing a response to the received query that reflects the document weights.
- Dependent Claims (54, 55, 56)
  - 54. The memory medium of claim 53, further comprising instructions that control the computer processor by periodically accumulating information from multiple sources and adding the information to the second partition of search documents.
  - 55. The memory medium of claim 53 wherein the at least one reference document comprises an encyclopedia.
  - 56. The memory medium of claim 53 wherein the at least one reference document comprises a collection of news reports.
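
Adding newly accumulated documents to the second partition, as in the dependent claims above, can be sketched as a column append. This assumes the reference partition occupies the leftmost columns, so appending on the right leaves it untouched; the helper name `add_search_documents` and the counts are hypothetical.

```python
import numpy as np


def add_search_documents(matrix, new_columns):
    """Append newly gathered documents as extra columns of the search partition."""
    matrix = np.asarray(matrix)
    new_columns = np.asarray(new_columns)
    if new_columns.shape[0] != matrix.shape[0]:
        raise ValueError("new documents must use the same term rows as the matrix")
    return np.hstack([matrix, new_columns])


current = np.array([[1, 0],
                    [1, 1],
                    [0, 2]])        # 1 reference column followed by 1 search column
batch = np.array([[0, 3],
                  [1, 0],
                  [0, 0]])          # 2 newly accumulated search documents
updated = add_search_documents(current, batch)
print(updated.shape)
```

The term-spread matrix would then need to be regenerated from the updated matrix before further queries are answered.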

57. An information retrieval system having a plurality of search documents and at least one reference document stored as electronic information files, comprising:
- an information file processing component that is structured to generate a term-document matrix to represent the files, each element in the term-document matrix indicating a number of occurrences of a term within a respective one of the files, the term-document matrix including a first partition of entries that represent the at least one reference document, wherein the reference document is predetermined to contain reliable information, the term-document matrix further including a second partition of entries that represent the plurality of search documents, wherein the search documents contain potentially insufficient information for establishing semantic links between terms of the search documents; and
  
  generate a term-spread matrix that is a weighted autocorrelation of the term-document matrix, the term-spread matrix indicating an amount of variation in term usage in the information files and the extent to which terms are correlated;
  
  a query mechanism that is structured to receive a query of at least one term and to generate a query vector having as many elements as the rows of the generated term-spread matrix; and
  
  an inverse inference engine that is structured to formulate, based upon the generated term-spread matrix and query vector, a constrained optimization problem description for determining a degree of correlation between the query vector and the search documents, wherein the choice of a stabilization parameter determines the extent of a trade-off between a degree of fit and stability of all solutions to the constrained optimization problem description;
  
  determine a solution vector to the constrained optimization problem description, the vector including a plurality of document weights, each weight corresponding to one of the search documents and reflecting a degree of correlation between the query and the corresponding search document; and
  
  provide a response to the received query that reflects the document weights.
- Dependent Claims (58, 59, 60)
  - 58. The information retrieval system of claim 57 wherein the information processing component is further structured to periodically accumulate information from multiple sources and add the information to the second partition of search documents.
  - 59. The information retrieval system of claim 57 wherein the at least one reference document comprises an encyclopedia.
  - 60. The information retrieval system of claim 57 wherein the at least one reference document comprises a collection of news reports.

Current Assignee: Fiver LLC
Original Assignee: Insightful Corporation (Cloud Software Group)
Inventors: Marchisio, Giovanni B.
Granted Patent: US 7,269,598 B2
US Class Current: 1/1
CPC Class Codes

G06F 16/30   of unstructured textual dat...

G06F 16/334   Query execution G06F16/335 ...

G06F 16/954   Navigation, e.g. using cate...

G06F 40/169   Annotation, e.g. comment da...

G06F 40/216   using statistical methods

G06F 40/268   Morphological analysis

G06F 40/279   Recognition of textual enti...

G06F 40/30   Semantic analysis

G06F 40/58   Use of machine translation,...

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99939   Privileged access

Y10S 707/99943   Generating database or data...
