Efficient bundle sorting
First Claim
1. A method of sorting data sets including a predetermined number of distinct keys, comprising the steps of:
- bundling the data sets where substantially identical keys having substantially identical key values are bundled together; and
ordering the bundles in a predeteined order with respect to the order defined by the substantially identical key values for each bundle, and wherein said method is performed using an external memory.
0 Assignments
0 Petitions
Accused Products
Abstract
Many data sets to be sorted consist of a limited number of distinct keys. Sorting such data sets can be thought of as bundling together identical keys and having the bundles placed in order; we therefore denote this as bundle sorting. We describe an efficient algorithm for bundle sorting in external memory that requires at most c(N/B)logM/Bk disk accesses, where N is the number of keys, M is the size of internal memory, k is the number of distinct keys, B is the transfer block size, and 2<c<4. For moderately sized k, this bound circumvents the Θ((N/B)logM/B(N/B)) I/O lower bound known for general sorting. We show that our algorithm is optimal by proving a matching lower bound for bundle sorting. The improved nning time of bundle sorting over general sorting can be significant in practice, as demonstrated by experimentation. An important feature of the new algorithm is that it is executed “in-place”, requiring no additional disk space.
-
Citations
8 Claims
-
1. A method of sorting data sets including a predetermined number of distinct keys, comprising the steps of:
-
bundling the data sets where substantially identical keys having substantially identical key values are bundled together; and
ordering the bundles in a predeteined order with respect to the order defined by the substantially identical key values for each bundle, and wherein said method is performed using an external memory. - View Dependent Claims (2, 3)
-
-
4. A system sorting data sets including a predetermined number of distinct keys, comprising the steps of:
-
bundling means for bundling the data sets where substantially identical keys having substantially identical key values are bundled together; and
ordering means for ordering the bundles in a predetermined order with respect to the order defined by the substantially identical key values for each bundle, wherein said method is performed using an external memory.
-
-
5. A method for sorting large data sets that reside on external memory, given that the available memory is of size M and that the transfer block size is B, said method comprising the steps of:
-
defining a function that maps input keys into about M/B bundles (groups);
sorting the data set according to the bundles, resulting with about M/B sub-sequences, including the steps of;
counting the Dumber of input keys that belong to each bundle;
computing the range of disk blocks in which each bundle will reside upon termination of the sorting step;
loading the first block from each of the said ranges into main memory;
scanning the loaded blocks, while swapping keys so that each block is filled only with keys belonging to its bundle;
writing every block that is filled with the appropriate keys back to its location within its range on disk and loading the next block in its range; and
re-iterating the above steps for each sub-sequence, until each bundle consists of one key only.
-
-
6. A method for sorting large data sets that reside on external memory, given that the available memory is of size M, that the transfer block size is B, and that the number of distinct keys is at most about M/B, said method comprising the steps of:
-
counting the number of input keys that belong to each bundle;
computing the range of disk blocks in which each bundle will reside upon termination of the sorting step;
loading the first block from each of the said ranges into main memory;
scanning the loaded blocks, while swapping keys so that each block is filled only with keys belonging to its bundle;
writing every block that is filled with the appropriate keys back to its location within its range on disk and loading the next block in its range.
-
-
7. A method for sorting large data sets that reside on external memory, given that the available memory is of size M and that the transfer block size is B, said method comprising the steps of:
-
defining a function that maps input keys into about M/B bundles (groups);
sorting the data set according to the bundles, resulting with about M/B sub-sequences, including the steps of;
estimating the number of input keys that belong to each bundle;
computing the range of disk blocks in which each bundle will reside upon termination of the sorting step;
loading the first block from each of the said ranges into main memory;
scanning the loaded blocks, while swapping keys so that each block is filled only with keys belonging to its bundle;
writing every block that is filled with the appropriate keys back to its location within its range on disk and loading the next block in its range, and re-iterating the above steps for each sub-sequence, until each bundle consists of one key only.
-
-
8. A method for sorting large data sets that reside on external memory, given that the available memory is of size M, that the transfer block size is B, and that the number of distinct keys is at most about M/B, said method comprising the steps of:
-
estimating the number of input keys that belong to each bundle;
computing the range of disk blocks in which each bundle will reside upon termination of the sorting step;
loading the first block from each of the said ranges into main memory;
scanning the loaded blocks, while swapping keys so that each block is filled only with keys belonging to its bundle;
writing every block that is filled with the appropriate keys back to its location within its range on disk and loading the next block in its range.
-
Specification