Method of generating a distributed text index for parallel query processing
First Claim
1. A method for parallel query processing, the method comprising the steps of:
- providing a set of node indices for text indexing a set of documents, each node text index covering a subset of the documents, each term of the node indices having an assigned precalculated global frequency measure, the global frequency measure being expressive of a frequency of documents containing the term in the set of documents and each node text index having an assigned precalculated quality measure, the quality measure being expressive of a difference between the global frequency measure of a term and the local frequency measure of the term within the subset of documents covered by the node (102, 104, 106, 108);
determining if a quality of the distributed text index is sufficient on the basis of the quality measure of the node indices by performing the steps of;
for each node text index;
calculation of a difference of the quality measure and the precalculated quality measure;
calculating a mean value of the differences;
if the mean value of the differences is above a user-defined threshold level, recalculation of the global frequency measures;
using of the precalculated global frequency measures for calculating of rank scores, if the quality is sufficient; and
recalculating of the global frequency measures and of the quality measure of the nodes, if the quality is not sufficient; and
calculating of rank scores on the basis of the global frequency measures.
4 Assignments
0 Petitions
Accused Products
Abstract
The present invention relates to a method of generating a distributed text index for parallel query processing by a number of nodes. A set of node indices is generated for text indexing a set of documents, each node text index covering a subset of the documents. For each node text index, a local frequency measure for each term of the node text index is calculated on the basis of a frequency of documents containing the term in the subset of the documents of the node. A global frequency measure for each term is calculated on the basis of a frequency of documents containing the term in the set of documents. A quality measure for each node text index is calculated on the basis of the local frequency measures of the terms of the node and the global frequency measure of the terms of the node.
10 Citations
8 Claims
-
1. A method for parallel query processing, the method comprising the steps of:
-
providing a set of node indices for text indexing a set of documents, each node text index covering a subset of the documents, each term of the node indices having an assigned precalculated global frequency measure, the global frequency measure being expressive of a frequency of documents containing the term in the set of documents and each node text index having an assigned precalculated quality measure, the quality measure being expressive of a difference between the global frequency measure of a term and the local frequency measure of the term within the subset of documents covered by the node (102, 104, 106, 108); determining if a quality of the distributed text index is sufficient on the basis of the quality measure of the node indices by performing the steps of; for each node text index;
calculation of a difference of the quality measure and the precalculated quality measure;calculating a mean value of the differences; if the mean value of the differences is above a user-defined threshold level, recalculation of the global frequency measures; using of the precalculated global frequency measures for calculating of rank scores, if the quality is sufficient; and recalculating of the global frequency measures and of the quality measure of the nodes, if the quality is not sufficient; and calculating of rank scores on the basis of the global frequency measures.
-
-
2. A method for updating a distributed text index for parallel query processing by a number of nodes, the distributed text index having a set of node text indices indexing a set of documents, each node text index covering a subset of the documents, the method comprising the steps of:
-
for each node text index;
calculating a local frequency measure for each term on the basis of a frequency of documents containing the term in the subset of the documents of the node (102, 104, 106, 108);calculating a quality measure on the basis of the local frequency measures and on the basis of precalculated global frequency measures of the terms which have been calculated before the updating of the text index, the global frequency measure of a term being expressive of a frequency of documents containing the term in the entire set of documents; determining if a quality of the distributed text index is sufficient on the basis of the quality measure of the node indices by performing the steps of; for each node text index;
calculation of a difference of the quality measure and the precalculated quality measure;calculating a mean value of the differences; if the mean value of the differences is above a user-defined threshold level, recalculation of the global frequency measures; using of the precalculated global frequency measures for calculating of rank scores, if the quality is sufficient; and recalculating of the global frequency measures and of the quality measure of the nodes, if the quality is not sufficient.
-
-
3. The method of claim 2 whereby only differences which are above a predefined second threshold level are used for the calculation of the mean value.
-
4. The method of claim 2, the calculation of the mean value being performed by a rank broker node (108), whereby the difference is communicated from one of the nodes to the rank broker node only if the difference surpasses the second predefined threshold.
-
5. A computer program product, in particular digital storage medium, comprising program means for performing a method in accordance with any one of the preceding claims 1, 2, 3, or 4.
-
6. A data processing system having a number of nodes (102, 104, 106, 108) for parallel query processing, each node having a node text index for text indexing a subset of documents of a set of documents being covered by a distributed text index, and each term of a node text index having an assigned precalculated global frequency measure, the global frequency measure being expressive of a frequency of documents containing the term in the entire set of documents, and each of the node indices having an assigned precalculated quality measure depending on the global frequency measure of the terms and the local frequency measure of the terms of the node, the local frequency measures of the terms being expressive of a frequency of documents containing a term in the subset of the documents covered by the node text index, and means (110) for updating of the distributed text index, the means for updating being adapted to perform the steps of:
-
for each node text index;
calculating a local frequency measure for each term on the basis of a frequency of documents containing the term in the subset of the documents of the node;calculating a quality measure on the basis of the local frequency measures and on the basis of precalculated global frequency measures of the terms which have been calculated before the updating of the text index, the global frequency measure of a term being expressive of a frequency of documents containing the term in the entire set of documents; determining if a quality of the distributed text index is sufficient on the basis of the quality measure of the node indices; using of precalculated global frequency measures for calculating of rank scores, if the quality is sufficient by performing the steps of; for each node text index;
calculation of a difference of the quality measure and the precalculated quality measure;calculating a mean value of the differences; if the mean value of the differences is above a user-defined threshold level, recalculation of the global frequency measures; and recalculating of the global frequency measures and of the quality measure of the nodes, if the quality is not sufficient.
-
-
7. The data processing system of claim 6, the data processing system further comprising a rank broker node for calculating the mean value.
-
8. The data processing system of claim 6, further comprising a rank broker node (108) for calculating the mean value.
Specification