System and method of data caching for compliance storage systems with keyword query based access
First Claim
1. A method of data caching for compliance and storage systems that provides keyword search query based access to documents, the method comprising:
searching documents from a storage device by a keyword based interface;
staging in a cache documents that are read and that are expected to be needed again from the storage device;
computing a document weight for each of the documents read and expected to be needed again, wherein the document weight is based on a document information retrieval (IR) relevancy metric for user keyword queries and a recency and a frequency of each query and the document weight models a probability of a particular document being accessed again through a query, and wherein the document weight is based on a relevance of each document for queries in a query history;
placing a processor and a disk in data communication with a First In First Out queue and a cache; and
if the document being accessed again was not already in the cache, evicting another document from the cache to make room for the document being accessed again to be placed in the cache by packing elements in the order of a document weight-to-size ratio, highest to smallest, and evicting documents with a smallest document weight-to-size ratio first;
maintaining a query history of recent queries from a user in a query history first-in first-out queue;
assigning each query from a user a query weight based on a position of the query from a user in the First In First Out queue, wherein the query weight models a probability of a query or a related query being invoked again;
wherein each of the document weights is recomputed by the processor when a document to be retrieved was not previously cached;
updating the query history First-in First-Out queue and each of the document weights when a new query has been entered;
adapting each of the document weights to changing query frequencies and popularities; and
selecting and evicting documents from the cache according to a knapsack solution.
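The weight computation recited in the claim can be sketched as follows. This is a minimal illustration under assumed names (`DECAY`, `HISTORY_SIZE`, and `ir_score` are hypothetical), not the patented implementation: query weights decay with FIFO position, and a document's weight is the weighted sum of its IR relevancy over the query history.

```python
from collections import deque

DECAY = 0.8        # hypothetical per-position decay for query weights
HISTORY_SIZE = 8   # hypothetical length of the query-history FIFO

query_history = deque(maxlen=HISTORY_SIZE)  # newest query at the right

def query_weight(position_from_newest):
    # Recent queries get higher weight, modeling the probability that
    # the query (or a related one) is invoked again.
    return DECAY ** position_from_newest

def document_weight(doc, ir_score):
    # Weighted sum of the IR relevancy metric over all queries in the
    # history: recent and frequent queries dominate the document's weight.
    return sum(query_weight(i) * ir_score(doc, q)
               for i, q in enumerate(reversed(query_history)))
```

Any IR relevancy function can be plugged in for `ir_score` (e.g. a TF-IDF cosine score); query frequency is captured implicitly, since a repeated query occupies several FIFO positions.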
1 Assignment
0 Petitions
Abstract
A method of data caching for compliance and storage systems that provide keyword search query based access to documents computes a value for each data document based on a document information-retrieval relevancy metric for user keyword queries and on the recency and frequency of each query. The values are adapted to changing query frequencies and popularities. Documents are then selected and evicted from a cache, based on the values, according to a knapsack solution. A weight is computed for each query such that recent, more frequent queries get a higher weight. An information-retrieval metric measures the relevancy of a document for a query, and a document's value is the weighted sum of that metric times the query weight over all queries.
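Numerically, the weighted sum described in the abstract works like this (all numbers below are made up for illustration):

```python
# Two queries in the history, newest first; recent queries weigh more.
query_weights = [1.0, 0.5]
# Hypothetical IR relevancy of one document for each of those queries.
ir_scores = [0.9, 0.2]

# Document value = sum over all queries of (query weight * IR relevancy).
value = sum(w * s for w, s in zip(query_weights, ir_scores))
print(value)  # 1.0*0.9 + 0.5*0.2 = 1.0
```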
14 Citations
7 Claims
1. A method of data caching for compliance and storage systems that provides keyword search query based access to documents, the method comprising:
searching documents from a storage device by a keyword based interface;
staging in a cache documents that are read and that are expected to be needed again from the storage device;
computing a document weight for each of the documents read and expected to be needed again, wherein the document weight is based on a document information retrieval (IR) relevancy metric for user keyword queries and a recency and a frequency of each query and the document weight models a probability of a particular document being accessed again through a query, and wherein the document weight is based on a relevance of each document for queries in a query history;
placing a processor and a disk in data communication with a First In First Out queue and a cache; and
if the document being accessed again was not already in the cache, evicting another document from the cache to make room for the document being accessed again to be placed in the cache by packing elements in the order of a document weight-to-size ratio, highest to smallest, and evicting documents with a smallest document weight-to-size ratio first;
maintaining a query history of recent queries from a user in a query history first-in first-out queue;
assigning each query from a user a query weight based on a position of the query from a user in the First In First Out queue, wherein the query weight models a probability of a query or a related query being invoked again;
wherein each of the document weights is recomputed by the processor when a document to be retrieved was not previously cached;
updating the query history First-in First-Out queue and each of the document weights when a new query has been entered;
adapting each of the document weights to changing query frequencies and popularities; and
selecting and evicting documents from the cache according to a knapsack solution.
- View Dependent Claims (2, 3, 4)
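The eviction step in the claim is essentially a greedy knapsack approximation: keep documents in order of weight-to-size ratio and evict the smallest-ratio documents first until the new document fits. A sketch, assuming a simple `dict`-based cache and externally supplied weight and size tables (all names hypothetical):

```python
def evict_for(cache, capacity, new_doc_id, new_size, weights, sizes):
    """cache: dict of doc_id -> size. Returns the list of evicted doc ids.

    Greedy knapsack step: documents with the smallest document
    weight-to-size ratio are evicted first, until new_doc_id fits.
    """
    used = sum(cache.values())
    # Eviction candidates, smallest weight-to-size ratio first.
    victims = sorted(cache, key=lambda d: weights[d] / sizes[d])
    evicted = []
    while used + new_size > capacity and victims:
        victim = victims.pop(0)
        used -= cache.pop(victim)
        evicted.append(victim)
    if used + new_size <= capacity:
        cache[new_doc_id] = new_size
    return evicted
```

A greedy pass over the ratio ordering is the standard polynomial-time approximation for the (NP-hard) knapsack selection the claim refers to.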
5. A document search system, comprising:
a keyword based interface that searches documents from a storage device;
a cache that stages documents that are read and that are expected to be needed again from said storage device, wherein said cache further includes a document weight that is maintained for each document, said document weight models a probability of a particular document being accessed again through a query, said document weight is based on a relevance of each document for queries in a query history, and if said document being accessed again was not already in said cache, another document is evicted from said cache to make room for said document being accessed again to be placed in said cache by packing elements in the order of a document weight-to-size ratio, highest to smallest, and documents with a smallest document weight-to-size ratio are evicted first;
a query history first-in first-out (FIFO) queue that maintains a query history of recent queries from a user, wherein each query is assigned a query weight based on its position in said FIFO queue, wherein the query weight models a probability of a query or a related query being invoked again;
a processor connected to said query history FIFO queue, wherein said processor computes a value for each data document based on a document information retrieval (IR) relevancy metric for user keyword queries and a recency and a frequency of each query, and said processor recomputes each document weight (Dw) for each data document when a document to be retrieved was not previously cached;
an updating system that updates said query history FIFO queue, each said query weight, and each said document weight when a new query has been entered;
a mechanism that adapts each said document weight for each data document to changing query frequencies and popularities; and
a mechanism selecting and evicting documents from said cache based on said document weight for each data document according to a knapsack solution.
- View Dependent Claims (6, 7)
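Putting the elements of claim 5 together, a toy version of the system might look like this. The class, its decay constant, and the substring-matching relevancy function used in the test are illustrative assumptions, not the patent's implementation:

```python
from collections import deque

class WeightedCache:
    """Toy sketch: FIFO query history, document weights recomputed on a
    cache miss, and eviction in order of smallest weight-to-size ratio."""

    def __init__(self, capacity, ir_score, decay=0.8, history_len=8):
        self.capacity = capacity
        self.ir_score = ir_score        # IR relevancy: (doc, query) -> float
        self.decay = decay              # hypothetical per-position decay
        self.history = deque(maxlen=history_len)  # query-history FIFO
        self.cache = {}                 # doc_id -> size

    def doc_weight(self, doc):
        # Weighted sum of IR relevancy over the query history;
        # recent queries (low FIFO position) get higher query weights.
        return sum(self.decay ** i * self.ir_score(doc, q)
                   for i, q in enumerate(reversed(self.history)))

    def access(self, query, doc, size):
        self.history.append(query)      # update the FIFO on every new query
        if doc in self.cache:
            return                      # hit: document already staged
        # Miss: recompute weights, then evict smallest ratios first.
        victims = sorted(self.cache,
                         key=lambda d: self.doc_weight(d) / self.cache[d])
        used = sum(self.cache.values())
        while used + size > self.capacity and victims:
            used -= self.cache.pop(victims.pop(0))
        if used + size <= self.capacity:
            self.cache[doc] = size
```

On a hit only the history changes; on a miss the weights are refreshed against the current history, so the eviction order adapts to changing query frequencies and popularities.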
Specification