SYSTEM, METHOD AND COMPUTER PROGRAM PRODUCT FOR INFORMATION SORTING AND RETRIEVAL USING A LANGUAGE-MODELING KERNAL FUNCTION

US 20110270829A1
Filed: 06/22/2011
Published: 11/03/2011
Est. Priority Date: 12/20/2005
Status: Active Grant

Abstract

Various embodiments provide a system, method, and computer program product for sorting and/or selectively retrieving a plurality of documents in response to a user query. More particularly, embodiments are provided that convert each document into a corresponding document language model and convert the user query into a corresponding query language model. The language models are used to define a vector space having dimensions corresponding to terms in the documents and in the user query. The language models are mapped in the vector space. Each of the documents is then ranked, wherein the ranking is based at least in part on a position of the mapped language models in the vector space, so as to determine a relative relevance of each of the plurality of documents to the user query.
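The pipeline summarized in the abstract can be illustrated with a minimal sketch. The details here are assumptions rather than the patented method: documents and the query become smoothed unigram language models over the collection vocabulary (one dimension per term), and documents are ranked by the query-weighted log-probability their model assigns to each term. The function names and the smoothing weights `doc_lam` and `query_lam` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would normalize further.
    return text.lower().split()


def collection_model(docs):
    # Term distribution over the whole collection (the shared vocabulary).
    counts = Counter(w for d in docs for w in tokenize(d))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smoothed_model(text, coll, lam):
    # Mix the text's own term distribution with the collection distribution
    # (linear interpolation), so every vocabulary term gets nonzero mass.
    counts = Counter(w for w in tokenize(text) if w in coll)
    n = sum(counts.values())
    if n == 0:
        return dict(coll)  # no in-vocabulary terms: fall back to the collection
    return {w: (1 - lam) * counts[w] / n + lam * p for w, p in coll.items()}


def rank(query, docs, doc_lam=0.5, query_lam=0.1):
    # Assumed scoring rule: sum over terms of q(w) * log p(w | d).
    coll = collection_model(docs)
    q = smoothed_model(query, coll, query_lam)

    def score(i):
        d = smoothed_model(docs[i], coll, doc_lam)
        return sum(q[w] * math.log(d[w]) for w in coll)

    return sorted(range(len(docs)), key=score, reverse=True)


docs = ["cats chase mice", "dogs chase cats and cats", "stock markets fell today"]
ranking = rank("cats", docs)  # document indices, most relevant first
```

The scoring rule is a stand-in for the "position in the vector space" language of the abstract, which does not fix a particular similarity measure.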

40 Claims

1. A system for sorting a plurality of documents based at least in part on a relationship between each of the plurality of documents and a user query, relevance feedback, and relations among the plurality of documents, the system comprising:
- a data source comprising the plurality of documents; and
  
  a host computing element in communication with said data source and configured to receive an initial user input comprising the user query;
  
  wherein said host computing element is further configured to convert each of the plurality of documents into a corresponding document language model, each document language model being associated with a distribution of a plurality of document terms present in the plurality of documents and a distribution of a plurality of document terms present in each of the plurality of documents;
  
  wherein said host computing element is further configured to convert the user query into a corresponding query language model, the query language model being associated with a distribution of a plurality of query terms present in the user query and the distribution of the plurality document terms present in the plurality of documents;
  
  wherein said host computing element is further configured to define a kernel function configured to evaluate a similarity relationship between two document language models under the influence of the query language model;
  
  wherein said host computing element is further configured to automatically obtain via the defined kernel function a first vector space having a plurality of dimensions associated with at least two of the distribution of the plurality of document terms present in the plurality of documents, the distribution of the plurality of document terms present in each of the plurality of documents, and the distribution of the plurality of query terms present in the user query;
  
  wherein said host computing element is further configured to map via the defined kernel function each of the plurality of the document language models and the query language model in the first vector space; and
  
  wherein said host computing element is further configured to rank each of the plurality of documents based at least in part on a similarity relationship between each of the document language models and the query language model in the first vector space to determine a relative relevance of each of the plurality of documents to the user query.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
  - 2. A system according to claim 1, further comprising a user interface in communication with said host computing element and configured to receive the initial user input, the user interface being further configured to display the ranked plurality of documents.
  - 3. A system according to claim 1, wherein the plurality of documents comprises relevant documents and non-relevant documents relative to the user query, and wherein said host computing element further receives a positive feedback input comprising a selection of at least one of the relevant documents;
    - wherein said host computing element is further configured to receive a negative feedback input comprising a selection of at least one of the non-relevant documents;
      
      wherein said host computing element is further configured to refine the query language model based on the initial user input and at least one of the positive feedback input and the negative feedback input;
      
      wherein said host computing element is further configured to re-compute the language-modeling kernel as an integration of the query language model and the document language models based at least in part upon replacing the query language model component of the language modeling kernel with the refined query language model;
      
      wherein said host computing element is further configured to generate a decision boundary in a new vector space determined by the re-computed language-modeling kernel between the document language models corresponding to the selected relevant documents and the document language models corresponding to the selected non-relevant documents such that the decision boundary is substantially equidistant from the document language models corresponding to the relevant documents and the document language models corresponding to the non-relevant documents; and
      
      wherein said host computing element is further configured to re-rank each of the plurality of documents based at least in part according to the generated boundary in the new vector space.
  - 4. A system according to claim 3, wherein said host computing element is further configured to receive a positive feedback input by estimating a positive feedback input comprising a selection of at least one of the relevant documents.
  - 5. A system according to claim 3, wherein said host computing element is further configured to receive a negative feedback input by estimating a negative feedback input comprising a selection of at least one of the relevant documents.
  - 6. A system according to claim 3, further comprising a user interface in communication with said host computing element and configured to receive the positive feedback input and the negative feedback input.
  - 7. A system according to claim 3, further comprising a user interface in communication with said host computing element and configured to estimate the positive feedback input from at least one of user browsing activities detected via the user interface, user reading activities detected via the user interface, and user printing activities detected via the user interface.
  - 8. A system according to claim 3, wherein said host computing element is further configured to refine the query language model by analyzing a distribution of the plurality of document terms present in the selection of relevant documents in the positive feedback input and a distribution of the plurality of query terms present in the selection of relevant documents in the positive feedback input.
  - 9. A system according to claim 3, wherein said host computing element is further configured to re-compute the language-modeling kernel by replacing the query language model with the refined query language model.
  - 10. A system according to claim 3, wherein said host computing element is further configured to determine the new vector space using the re-computed language-modeling kernel to automatically determine the dimensions of the new vector space based in part upon at least one of a plurality of document statistics, document collection statistics, and relevance statistics.
  - 11. A system according to claim 3, wherein said host computing element is further configured to generate the decision boundary in the new vector space determined by the re-computed language-modeling kernel by applying a kernel based learning algorithm to the received positive feedback input and the received negative feedback input.
  - 12. A system according to claim 3, wherein the re-computed language modeling kernel integrates a query probability distribution expressed by the query language model corresponding to the user query and the positive feedback input with a similarity measure corresponding to a document probability distribution across the plurality of documents, the language modeling kernel providing a similarity measure between each of the plurality of documents biased at least in part by a user information need, the language modeling kernel being configured for modeling at least one of a plurality of document statistics, a plurality of collection statistics, and a plurality of relevance statistics.
  - 13. A system according to claim 3, wherein said host computing element is further configured to re-rank each of the plurality of documents based at least in part on the computed language modeling kernel.
  - 14. A system according to claim 11, wherein the kernel based learning algorithm applied by said host computing element comprises a support vector machine.
  - 15. A system according to claim 1, wherein said host computing element is further configured to convert each of the plurality of documents into a corresponding document language model by analyzing the distribution of the plurality of document terms present in the plurality of documents to determine a statistical measure of at least one of a prevalence of at least one of the plurality of document terms present in each of the plurality of documents and a prevalence of at least one of the plurality of document terms present in the plurality of documents.
  - 16. A system according to claim 1, wherein said host computing element is further configured to convert the user query into a corresponding query language model by analyzing the distribution of the plurality of query terms present in the user query relative to the distribution of the plurality of document terms present in the plurality of documents.
  - 17. A system according to claim 1, wherein said host computing element comprises a memory device configured for storing a plurality of pre-computed document language models and at least a portion of the plurality of documents.
  - 18. A system according to claim 3, wherein the new vector space comprises a high dimensional vector space, which is systematically and dynamically determined by the re-computed language-modeling kernel using a language modeling technique selected from the group consisting of:
    - term frequency determinations, term-term co-occurrence relationship determinations, term distribution determinations in the positive feedback input, term distribution determinations in a pre-defined user profile, term distribution determinations in a dynamically generated user profile, and combinations thereof.
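The feedback loop of claims 3 to 14 (a query-biased kernel, a decision boundary learned from selected relevant and non-relevant documents, then re-ranking) can be sketched as follows. Several choices are assumptions, not taken from the claims: the kernel is a query-weighted inner product of smoothed document models, and a kernel nearest-centroid rule stands in for the support vector machine of claim 14 because it has a closed form. Its boundary is equidistant from the two class centroids in the kernel's feature space, which is a weaker property than the maximum-margin boundary a support vector machine would produce. All function names are hypothetical.

```python
from collections import Counter


def unigram(texts):
    # Maximum-likelihood term distribution over one or more texts.
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smoothed(text, background, lam):
    # Interpolate a text's term distribution with the background model.
    local = unigram([text])
    return {w: (1 - lam) * local.get(w, 0.0) + lam * p
            for w, p in background.items()}


def kernel(a, b, q):
    # Assumed kernel: inner product of two document models, with each
    # term dimension weighted by the query model's probability.
    return sum(q[w] * a[w] * b[w] for w in q)


def centroid_scorer(models, q, pos, neg):
    # Signed score of a document relative to the boundary that is
    # equidistant from the positive and negative class centroids.
    def mean_k(rows, cols):
        total = sum(kernel(models[i], models[j], q) for i in rows for j in cols)
        return total / (len(rows) * len(cols))

    offset = (mean_k(pos, pos) - mean_k(neg, neg)) / 2
    return lambda x: mean_k([x], pos) - mean_k([x], neg) - offset


def rerank(docs, query, pos, neg):
    background = unigram(docs)
    models = [smoothed(d, background, 0.5) for d in docs]
    q = smoothed(query, background, 0.1)
    score = centroid_scorer(models, q, pos, neg)
    return sorted(range(len(docs)), key=score, reverse=True)


docs = [
    "jaguar car engine speed",
    "jaguar cat jungle prey",
    "car engine repair shop",
    "jungle cat hunts prey",
    "stock market report",
]
# The user marks document 1 as relevant and document 0 as non-relevant.
order = rerank(docs, "jaguar", pos=[1], neg=[0])
```

In this toy collection the feedback pulls the animal-related document 3 ahead of the car-related document 2, even though neither contains the query term.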

19. A method for sorting a plurality of documents based at least in part on a relationship between each of the plurality of documents and a user query, relevance feedback, and relationships among the plurality of documents, the method comprising:
- converting each of the plurality of documents into a corresponding document language model, each document language model being associated with a distribution of a plurality of document terms present in the plurality of documents and a plurality of document terms present in each of the plurality of documents;
  
  converting the user query into a corresponding query language model, the query language model being associated with a distribution of a plurality of query terms present in the user query and the distribution of the plurality of document terms present in the plurality of documents;
  
  defining a kernel function configured to evaluate a similarity relationship between two document language models under the influence of the query language model;
  
  obtaining automatically via the defined kernel function a first vector space having a plurality of dimensions associated with at least two of the distribution of the plurality of document terms present in the plurality of documents, the distribution of the plurality of document terms present in each of the plurality of documents, and the distribution of the plurality of query terms present in the user query;
  
  mapping via the defined kernel function each of the document language models and the query language model in the first vector space; and
  
  ranking each of the plurality of documents based at least in part on a similarity relationship between each of the document language models and the query language model in the first vector space to determine a relative relevance of each of the plurality of documents to the user query.
- View Dependent Claims (20, 21, 22, 23, 24, 25, 26, 27, 28, 29)
  - 20. A method according to claim 19, wherein the plurality of documents comprises relevant documents and non-relevant documents relative to the user query, the method further comprising:
    - receiving a positive feedback input comprising a selection of at least one of the relevant documents;
      
      receiving a negative feedback input comprising a selection of at least one of the non-relevant documents;
      
      refining the query language model based at least in part on the initial query and at least one of the positive feedback input and the negative feedback input;
      
      re-computing a language-modeling kernel as an integration of the query language model and the document language models based at least in part upon replacing the query language model component of the language-modeling kernel with the refined query language model;
      
      generating a decision boundary in a new vector space determined at least in part by the re-computed language-modeling kernel between the document language models corresponding to the selected relevant documents and the document language models corresponding to the selected non-relevant documents such that the decision boundary is substantially equidistant from the document language models corresponding to the relevant documents and the document language models corresponding to the non-relevant documents; and
      
      re-ranking each of the plurality of documents based at least in part according to the generated boundary in the new vector space.
  - 21. A method according to claim 20, wherein receiving a positive feedback input comprises estimating a positive feedback input comprising a selection of at least one of the relevant documents.
  - 22. A method according to claim 20, wherein receiving a negative feedback input comprises estimating a negative feedback input comprising a selection of at least one of the non-relevant documents.
  - 23. A method according to claim 20, wherein refining the query language model comprises analyzing a distribution of the plurality of document terms present in the selection of relevant documents in the positive feedback input and a distribution of the plurality of query terms present in the selection of relevant documents in the positive feedback input.
  - 24. A method according to claim 20, wherein re-computing the language-modeling kernel comprises replacing the query language model with the refined query language model.
  - 25. A method according to claim 20, wherein determining the new vector space comprises using the re-computed language-modeling kernel to automatically determine the dimensions of the new vector space based in part upon at least one of a plurality of document statistics, document collection statistics, and relevance statistics.
  - 26. A method according to claim 20, wherein generating the decision boundary in the new vector space comprises applying a kernel based learning algorithm to the received positive feedback input and the received negative input.
  - 27. A method according to claim 26, wherein the kernel based learning algorithm comprises a support vector machine.
  - 28. A method according to claim 19, wherein converting each of the plurality of documents into a corresponding document language model further comprises analyzing the distribution of the plurality of document terms present in the plurality of documents to determine a statistical measure of at least one of a prevalence of at least one of the plurality of document terms present in each of the plurality of documents and a prevalence of at least one of the plurality of document terms present in the plurality of documents.
  - 29. A method according to claim 19, wherein converting the user query into a corresponding query language model further comprises analyzing the distribution of the plurality of query terms present in the user query relative to the distribution of the plurality of document terms present in the plurality of documents to determine a statistical measure of the relative relevance of each of the plurality of documents to the user query.
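The query-refinement step of claims 20, 23 and 24 can be sketched as an interpolation between the current query model and the term distributions of the feedback documents. The formula below is an assumption in the spirit of Rocchio-style feedback; the claims do not specify one. The function names and the weights `alpha` and `beta` are hypothetical.

```python
from collections import Counter


def term_distribution(texts):
    # Maximum-likelihood term distribution over the given texts
    # (an empty dict when no texts are supplied).
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def refine_query_model(query_model, positive_docs, negative_docs=(),
                       alpha=0.5, beta=0.25):
    # Move probability mass toward terms seen in relevant documents and
    # away from terms seen in non-relevant ones, then renormalize.
    pos = term_distribution(positive_docs)
    neg = term_distribution(negative_docs)
    vocab = set(query_model) | set(pos) | set(neg)
    raw = {
        w: max(0.0, (1 - alpha) * query_model.get(w, 0.0)
               + alpha * pos.get(w, 0.0)
               - beta * neg.get(w, 0.0))
        for w in vocab
    }
    total = sum(raw.values())
    if total == 0:
        return dict(query_model)  # nothing usable in the feedback
    return {w: v / total for w, v in raw.items() if v > 0}


initial = {"jaguar": 1.0}
refined = refine_query_model(
    initial,
    positive_docs=["jaguar cat jungle"],
    negative_docs=["jaguar car engine"],
)
```

Under this sketch, "re-computing the language-modeling kernel" in the sense of claim 24 would amount to passing `refined` wherever the original query model was used to evaluate the kernel.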

30. A computer program product for sorting a plurality of documents based at least in part on a relationship between each of the plurality of documents and a user query, relevance feedback, interest, and relations among the plurality of documents, the computer program product comprising a computer-readable storage medium having computer-readable program code instructions stored therein comprising:
- a first set of computer instructions for converting each of the plurality of documents into a corresponding document language model, each document language model being associated with a distribution of a plurality of document terms present in the plurality of documents and a plurality of document terms present in each of the plurality of documents;
  
  a second set of computer instructions for converting the user query into a corresponding query language model, the query language model being associated with a distribution of a plurality of query terms present in the user query and the distribution of the plurality of document terms present in the plurality of documents;
  
  a third set of computer instructions for defining a kernel function configured to evaluate a similarity relationship between two document language models under the influence of the query language model;
  
  a fourth set of computer instructions for automatically obtaining via the defined kernel function of the third set of computer instructions a first vector space having a plurality of dimensions associated with at least two of the distribution of the plurality of document terms present in the plurality of documents, the distribution of the plurality of document terms present in each of the plurality of documents, and the distribution of the plurality of query terms present in the user query;
  
  a fifth set of computer instructions for mapping via the defined kernel function each of the document language models and the query language model in the first vector space; and
  
  a sixth set of computer instructions for ranking each of the plurality of documents based at least in part on a similarity relationship between each of the document language models and the query language model in the first vector space to determine a relative relevance of each of the plurality of documents to the user query.
- View Dependent Claims (31, 32, 33, 34, 35, 36, 37, 38)
  - 31. A computer program product according to claim 30, wherein the plurality of documents comprises relevant documents and non-relevant documents relative to the user query, the computer program product further comprising:
    - a seventh set of computer instructions for receiving a positive feedback input comprising a selection of at least one of the relevant documents;
      
      an eighth set of computer instructions for receiving a negative feedback input comprising a selection of at least one of the non-relevant documents;
      
      a ninth set of computer instructions for refining the query language model based on the initial user input and at least one of the positive feedback input and the negative feedback input; and
      
      a tenth set of computer instructions for re-computing a language-modeling kernel as an integration of the query language model and the document language models based at least in part upon replacing the query language model component of the language-modeling kernel with the refined query language model;
      
      an eleventh set of computer instructions for generating a decision boundary in a new vector space automatically determined at least in part by the re-computed language-modeling kernel between the document language models corresponding to the selected relevant documents and the document language models corresponding to the selected non-relevant documents such that the decision boundary is substantially equidistant from the document language models corresponding to the relevant documents and the document language models corresponding to the non-relevant documents; and
      
      a twelfth set of computer instructions for re-ranking each of the plurality of documents based at least in part according to the generated boundary in the new vector space.
  - 32. A computer program product according to claim 31, wherein the seventh set of computer instructions for receiving a positive feedback input comprises computer instructions for estimating a positive feedback input comprising a selection of at least one of the relevant documents.
  - 33. A computer program product according to claim 31, wherein the eighth set of computer instructions for receiving a negative feedback input comprises computer instructions for estimating a negative feedback input comprising a selection of at least one of the relevant documents.
  - 34. A computer program product according to claim 31, wherein the tenth set of computer instructions generating the decision boundary comprises applying a kernel based learning algorithm to the received positive feedback input and the received negative input to generate the decision boundary.
  - 35. A computer program product according to claim 34, wherein the kernel based learning algorithm comprises a support vector machine.
  - 36. A computer program product according to claim 30, wherein the first set of computer instructions for converting each of the plurality of documents into a corresponding document language model further comprises analyzing the distribution of the plurality of document terms present in the plurality of documents to determine a statistical measure of at least one of a prevalence of at least one of the plurality of document terms present in each of the plurality of documents and a prevalence of at least one of the plurality of document terms present in the plurality of documents.
  - 37. A computer program product according to claim 30, wherein the second set of computer instructions for converting the user query into a corresponding query language model further comprises analyzing the distribution of the plurality of query terms present in the user query relative to the distribution of the plurality of document terms present in the plurality of documents to determine a statistical measure of the relative relevance of each of the plurality of documents to the user query.
  - 38. A computer program product according to claim 31, wherein the eleventh set of computer instructions comprises computer instructions for determining the dimensions of the new vector space based in part upon at least one of a plurality of document statistics, document collection statistics, and relevance statistics.

39. A system adapted to interface with a search engine for sorting a plurality of documents retrieved and ranked by the search engine based at least in part on a relationship between each of the plurality of documents and a user query received via the search engine, relevance feedback, and relations among the plurality of documents, the system comprising:
- a host computing element configured to receive a user relevance feedback via the search engine, the user relevance feedback comprising a selection of at least a portion of the retrieved plurality of documents, the selection comprising one or more relevant document samples;
  
  wherein said host computing element is further configured to generate a plurality of document language models corresponding to each of the plurality of documents, the document language models corresponding at least in part to a plurality of terms present in each of the retrieved plurality of documents;
  
  wherein said host computing element is further configured to estimate a query language model based at least in part on the one or more selected relevant document samples, the query language model being associated with a distribution of a plurality of document terms present in the one or more selected relevant document samples in the user relevance feedback and a distribution of a plurality of query terms present in the user query;
  
  wherein said host computing element is further configured to compute a language-modeling kernel based at least in part on the query language model, the language-modeling kernel configured to evaluate a similarity relationship between two document language models under the influence of the query language model;
  
  wherein said host computing element is further configured to map the document language models to a high dimensional vector space automatically determined by the computed language-modeling kernel;
  
  wherein said host computing element is further configured to generate a decision boundary in the high-dimensional vector space between the document language models corresponding to the selected relevant document samples and the document language models corresponding to a plurality of non-relevant documents; and
  
  wherein said host computing element is further configured to re-rank the plurality of documents retrieved from the search engine based at least in part on a location of the decision boundary in the high dimensional vector space to refine a rank of the retrieved plurality of documents based at least in part on the query language model and the plurality of document language models.
- View Dependent Claims (40)
  - 40. A system according to claim 39, wherein said host computing element is further configured to estimate the query language model based on a user information need selected from the group consisting of:
    - a user profile, the user relevance feedback, the user access log, the user query, and combinations thereof, such that the language-modeling kernel is computed based at least in part on the user information need and such that the high-dimensional vector space is further determined by the user information need, and such that the system is configured for a substantially personalized information retrieval process.
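Claim 40 lets the query language model draw on several sources of user information need. One simple way to picture a "combination thereof" is a weighted mixture of per-source term distributions. This is an illustrative assumption; the claim does not say how sources are combined, and the function name, source names and weights below are hypothetical.

```python
def mix_information_need(sources, weights):
    # sources: source name -> term distribution (term -> probability)
    # weights: source name -> non-negative importance of that source
    total_weight = sum(weights[name] for name in sources)
    mixed = {}
    for name, dist in sources.items():
        share = weights[name] / total_weight
        for term, prob in dist.items():
            mixed[term] = mixed.get(term, 0.0) + share * prob
    return mixed


sources = {
    "query": {"jaguar": 1.0},
    "profile": {"cat": 0.5, "zoo": 0.5},
}
# The typed query counts three times as much as the stored profile.
need = mix_information_need(sources, {"query": 3, "profile": 1})
```

Because each input is a probability distribution and the shares sum to one, the mixture is itself a distribution and can serve as the query model from which the kernel is computed.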

Current Assignee: Etsy Incorporated
Original Assignee: Araicom Research LLC
Inventors: Raghavan, Vijay A.; Xie, Ying
Granted Patent: US 9,177,047 B2
US Class Current: 707/728
CPC Class Codes:
G06F 16/3326   using relevance feedback fr...
G06F 16/3347   using vector based model
G06F 16/3349   Reuse of stored results of ...
G06F 16/58   Retrieval characterised by ...
