Method of generating a distributed text index for parallel query processing

US 7,324,988 B2
Filed: 03/19/2004
Issued: 01/29/2008
Est. Priority Date: 07/07/2003
Status: Active Grant

First Claim

Patent Images

1. A method for parallel query processing, the method comprising the steps of:

providing a set of node indices for text indexing a set of documents, each node text index covering a subset of the documents, each term of the node indices having an assigned precalculated global frequency measure, the global frequency measure being expressive of a frequency of documents containing the term in the set of documents and each node text index having an assigned precalculated quality measure, the quality measure being expressive of a difference between the global frequency measure of a term and the local frequency measure of the term within the subset of documents covered by the node (102, 104, 106, 108);

determining if a quality of the distributed text index is sufficient on the basis of the quality measure of the node indices by performing the steps of;

for each node text index;

calculation of a difference of the quality measure and the precalculated quality measure;

calculating a mean value of the differences;

if the mean value of the differences is above a user-defined threshold level, recalculation of the global frequency measures;

using of the precalculated global frequency measures for calculating of rank scores, if the quality is sufficient; and

recalculating of the global frequency measures and of the quality measure of the nodes, if the quality is not sufficient; and

calculating of rank scores on the basis of the global frequency measures.

View all claims

4 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

The present invention relates to a method of generating a distributed text index for parallel query processing by a number of nodes. A set of node indices is generated for text indexing a set of documents, each node text index covering a subset of the documents. For each node text index, a local frequency measure for each term of the node text index is calculated on the basis of a frequency of documents containing the term in the subset of the documents of the node. A global frequency measure for each term is calculated on the basis of a frequency of documents containing the term in the set of documents. A quality measure for each node text index is calculated on the basis of the local frequency measures of the terms of the node and the global frequency measure of the terms of the node.

10 Citations

View as Search Results

8 Claims

1. A method for parallel query processing, the method comprising the steps of:
- providing a set of node indices for text indexing a set of documents, each node text index covering a subset of the documents, each term of the node indices having an assigned precalculated global frequency measure, the global frequency measure being expressive of a frequency of documents containing the term in the set of documents and each node text index having an assigned precalculated quality measure, the quality measure being expressive of a difference between the global frequency measure of a term and the local frequency measure of the term within the subset of documents covered by the node (102, 104, 106, 108);
  
  determining if a quality of the distributed text index is sufficient on the basis of the quality measure of the node indices by performing the steps of;
  
  for each node text index;
  
  calculation of a difference of the quality measure and the precalculated quality measure;
  
  calculating a mean value of the differences;
  
  if the mean value of the differences is above a user-defined threshold level, recalculation of the global frequency measures;
  
  using of the precalculated global frequency measures for calculating of rank scores, if the quality is sufficient; and
  
  recalculating of the global frequency measures and of the quality measure of the nodes, if the quality is not sufficient; and
  
  calculating of rank scores on the basis of the global frequency measures.

2. A method for updating a distributed text index for parallel query processing by a number of nodes, the distributed text index having a set of node text indices indexing a set of documents, each node text index covering a subset of the documents, the method comprising the steps of:
- for each node text index;
  
  calculating a local frequency measure for each term on the basis of a frequency of documents containing the term in the subset of the documents of the node (102, 104, 106, 108);
  
  calculating a quality measure on the basis of the local frequency measures and on the basis of precalculated global frequency measures of the terms which have been calculated before the updating of the text index, the global frequency measure of a term being expressive of a frequency of documents containing the term in the entire set of documents;
  
  determining if a quality of the distributed text index is sufficient on the basis of the quality measure of the node indices by performing the steps of;
  
  for each node text index;
  
  calculation of a difference of the quality measure and the precalculated quality measure;
  
  calculating a mean value of the differences;
  
  if the mean value of the differences is above a user-defined threshold level,recalculation of the global frequency measures;
  
  using of the precalculated global frequency measures for calculating of rank scores, if the quality is sufficient; and
  
  recalculating of the global frequency measures and of the quality measure of the nodes, if the quality is not sufficient.

3. The method of claim 2 whereby only differences which are above a predefined second threshold level are used for the calculation of the mean value.

4. The method of claim 2, the calculation of the mean value being performed by a rank broker node (108), whereby the difference is communicated from one of the nodes to the rank broker node only if the difference surpasses the second predefined threshold.

5. A computer program product, in particular digital storage medium, comprising program means for performing a method in accordance with any one of the preceding claims 1, 2, 3, or 4.

6. A data processing system having a number of nodes (102, 104, 106, 108) for parallel query processing, each node having a node text index for text indexing a subset of documents of a set of documents being covered by a distributed text index, and each term of a node text index having an assigned precalculated global frequency measure, the global frequency measure being expressive of a frequency of documents containing the term in the entire set of documents, and each of the node indices having an assigned precalculated quality measure depending on the global frequency measure of the terms and the local frequency measure of the terms of the node, the local frequency measures of the terms being expressive of a frequency of documents containing a term in the subset of the documents covered by the node text index, and means (110) for updating of the distributed text index, the means for updating being adapted to perform the steps of:
- for each node text index;
  
  calculating a local frequency measure for each term on the basis of a frequency of documents containing the term in the subset of the documents of the node;
  
  calculating a quality measure on the basis of the local frequency measures and on the basis of precalculated global frequency measures of the terms which have been calculated before the updating of the text index, the global frequency measure of a term being expressive of a frequency of documents containing the term in the entire set of documents;
  
  determining if a quality of the distributed text index is sufficient on the basis of the quality measure of the node indices;
  
  using of precalculated global frequency measures for calculating of rank scores, if the quality is sufficient by performing the steps of;
  
  for each node text index;
  
  calculation of a difference of the quality measure and the precalculated quality measure;
  
  calculating a mean value of the differences;
  
  if the mean value of the differences is above a user-defined threshold level, recalculation of the global frequency measures; and
  
  recalculating of the global frequency measures and of the quality measure of the nodes, if the quality is not sufficient.

7. The data processing system of claim 6, the data processing system further comprising a rank broker node for calculating the mean value.

8. The data processing system of claim 6, further comprising a rank broker node (108) for calculating the mean value.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Meta Platforms, Inc. (f/k/a Facebook, Inc.)
Original Assignee
International Business Machines Corporation
Inventors
Altevogt, Peter, Nitzsche, Raiko
Primary Examiner(s)
Wong; Don
Assistant Examiner(s)
Lie; Angela M

Application Number

US10/805,402
Publication Number

US 20050010560A1
Time in Patent Office

1,411 Days
Field of Search

707 1- 3
US Class Current

1/1
CPC Class Codes

G06F 16/316   Indexing structures

G06F 16/3347   using vector based model

Y10S 707/99933   Query processing, i.e. sear...

Method of generating a distributed text index for parallel query processing

First Claim

4 Assignments

0 Petitions

Accused Products

Abstract

10 Citations

8 Claims

Specification

Solutions

Use Cases

Quick Links

Method of generating a distributed text index for parallel query processing

First Claim

4 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

10 Citations

8 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links