Distributed data set indexing
First Claim
1. An apparatus comprising a processor of a first node device of multiple node devices, and a storage of the first node device to store instructions that, when executed by the processor, cause the processor to perform operations comprising:
- store, at the first node device, a super cell of multiple super cells into which a data set is divided from a data file maintained by at least one data device, wherein;
the multiple super cells are distributed among the multiple node devices;
each super cell comprises multiple data cells;
each data cell of the multiple data cells comprises multiple data records; and
each data record of the multiple data records comprises a set of data fields at which data values of the data set are stored;
index, at the first node device, the multiple data records within a first data cell of a super cell by a first data field and by a second data field of the set of data fields in a single read pass through the first data cell, wherein for each data record within the first data cell, the processor is caused to;
retrieve data values from the first data field and the second data field;
perform operations to generate a first binary tree of unique data values of the first data field of the first data cell to determine whether the data value retrieved from the first data field comprises a unique value or a duplicate value, the operations comprising;
search the first binary tree to determine whether the data value retrieved from the first data field comprises a duplicate data value that is already present within the first binary tree or a unique value that is not already present within the first binary tree; and
add the data value retrieved from the first data field to the first binary tree in response to a determination that the data value retrieved from the first data field is a unique value; and
perform operations to generate a second binary tree of unique data values of the second data field of the first data cell to determine whether the data value retrieved from the second data field comprises a unique value or a duplicate value, the operations comprising;
search the second binary tree to determine whether the data value retrieved from the second data field comprises a duplicate data value that is already present within the second binary tree or a unique value that is not already present within the second binary tree; and
add the data value retrieved from the second data field to the second binary tree in response to a determination that the data value retrieved from the second data field is a unique value;
generate a first unique values index of identifiers of the data records associated with the unique data values within the first binary tree based on an in-order traversal of the first binary tree to enable use of the first unique values index to perform a binary search of the data values within the first data field of the data records of the first data cell;
generate a second unique values index of identifiers of the data records associated with the unique data values within the second binary tree based on an in-order traversal of the second binary tree to enable use of the second unique values index to perform a binary search of the data values within the second data field of the data records of the first data cell;
generate, within a super cell index corresponding to the super cell, and from the in-order traversals of the first and second binary trees, an indication of a range of the data values of the first data field within the data records of the first data cell, and an indication of a range of the data values of the second data field within the data records of the first data cell; and
provide the super cell index to a control device to enable use of the super cell index by the control device.
0 Assignments
0 Petitions
Accused Products
Abstract
An apparatus including a processor to: index multiple data records within a data cell by first and second data fields in a single read pass through the data cell; wherein for each data record within the first data cell, the processor is to retrieve data values from the first and second data fields, search a first binary tree to determine whether the data value from the first data field comprises a unique value, and add the data value to the first binary tree if it is unique, and search a second binary tree to determine whether the data value from the second data field comprises a unique value, and add the data value to the second binary tree if it is unique; and generate a first and second unique values indexes of identifiers of the data records associated with the unique data values within the first and second binary trees.
-
Citations
30 Claims
-
1. An apparatus comprising a processor of a first node device of multiple node devices, and a storage of the first node device to store instructions that, when executed by the processor, cause the processor to perform operations comprising:
-
store, at the first node device, a super cell of multiple super cells into which a data set is divided from a data file maintained by at least one data device, wherein; the multiple super cells are distributed among the multiple node devices; each super cell comprises multiple data cells; each data cell of the multiple data cells comprises multiple data records; and each data record of the multiple data records comprises a set of data fields at which data values of the data set are stored; index, at the first node device, the multiple data records within a first data cell of a super cell by a first data field and by a second data field of the set of data fields in a single read pass through the first data cell, wherein for each data record within the first data cell, the processor is caused to; retrieve data values from the first data field and the second data field; perform operations to generate a first binary tree of unique data values of the first data field of the first data cell to determine whether the data value retrieved from the first data field comprises a unique value or a duplicate value, the operations comprising; search the first binary tree to determine whether the data value retrieved from the first data field comprises a duplicate data value that is already present within the first binary tree or a unique value that is not already present within the first binary tree; and add the data value retrieved from the first data field to the first binary tree in response to a determination that the data value retrieved from the first data field is a unique value; and perform operations to generate a second binary tree of unique data values of the second data field of the first data cell to determine whether the data value retrieved from the second data field comprises a unique value or a duplicate value, the operations comprising; search the second binary tree to determine whether the data value retrieved from the second data field comprises a duplicate data value that is already present within the second binary tree or a unique value that is not already present within the second binary tree; and add the data value retrieved from the second data field to the second binary tree in response to a determination that the data value retrieved from the second data field is a unique value; generate a first unique values index of identifiers of the data records associated with the unique data values within the first binary tree based on an in-order traversal of the first binary tree to enable use of the first unique values index to perform a binary search of the data values within the first data field of the data records of the first data cell; generate a second unique values index of identifiers of the data records associated with the unique data values within the second binary tree based on an in-order traversal of the second binary tree to enable use of the second unique values index to perform a binary search of the data values within the second data field of the data records of the first data cell; generate, within a super cell index corresponding to the super cell, and from the in-order traversals of the first and second binary trees, an indication of a range of the data values of the first data field within the data records of the first data cell, and an indication of a range of the data values of the second data field within the data records of the first data cell; and provide the super cell index to a control device to enable use of the super cell index by the control device. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
-
-
11. A computer-program product tangibly embodied in a non-transitory machine-readable storage medium, the computer-program product including instructions operable to cause a processor of a first node device of multiple node devices to perform operations comprising:
-
store, at the first node device, a super cell of multiple super cells into which a data set is divided from a data file maintained by at least one data device, wherein; the multiple super cells are distributed among the multiple node devices; each super cell comprises multiple data cells; each data cell of the multiple data cells comprises multiple data records; and each data record of the multiple data records comprises a set of data fields at which data values of the data set are stored; index, at the first node device, the multiple data records within a first data cell of a super cell by a first data field and by a second data field of the set of data fields in a single read pass through the first data cell, wherein for each data record within the first data cell, the processor is caused to; retrieve data values from the first data field and the second data field; perform operations to generate a first binary tree of unique data values of the first data field of the first data cell to determine whether the data value retrieved from the first data field comprises a unique value or a duplicate value, the operations comprising; search the first binary tree to determine whether the data value retrieved from the first data field comprises a duplicate data value that is already present within the first binary tree or a unique value that is not already present within the first binary tree; and add the data value retrieved from the first data field to the first binary tree in response to a determination that the data value retrieved from the first data field is a unique value; and perform operations to generate a second binary tree of unique data values of the second data field of the first data cell to determine whether the data value retrieved from the second data field comprises a unique value or a duplicate value, the operations comprising; search the second binary tree to determine whether the data value retrieved from the second data field comprises a duplicate data value that is already present within the second binary tree or a unique value that is not already present within the second binary tree; and add the data value retrieved from the second data field to the second binary tree in response to a determination that the data value retrieved from the second data field is a unique value; generate a first unique values index of identifiers of the data records associated with the unique data values within the first binary tree based on an in-order traversal of the first binary tree to enable use of the first unique values index to perform a binary search of the data values within the first data field of the data records of the first data cell; generate a second unique values index of identifiers of the data records associated with the unique data values within the second binary tree based on an in-order traversal of the second binary tree to enable use of the second unique values index to perform a binary search of the data values within the second data field of the data records of the first data cell; generate, within a super cell index corresponding to the super cell, and from the in-order traversals of the first and second binary trees, an indication of a range of the data values of the first data field within the data records of the first data cell, and an indication of a range of the data values of the second data field within the data records of the first data cell; and provide the super cell index to a control device to enable use of the super cell index by the control device. - View Dependent Claims (12, 13, 14, 15, 16, 17, 18, 19, 20)
-
-
21. A computer-implemented method comprising:
-
storing, at the first node device, a super cell of multiple super cells into which a data set is divided from a data file maintained by at least one data device, wherein; the multiple super cells are distributed among the multiple node devices; each super cell comprises multiple data cells; each data cell of the multiple data cells comprises multiple data records; and each data record of the multiple data records comprises a set of data fields at which data values of the data set are stored; indexing, at the first node device, the multiple data records within a first data cell of a super cell by a first data field and by a second data field of the set of data fields in a single read pass through the first data cell, wherein the operations comprise, for each data record within the first data cell; retrieving data values from the first data field and the second data field; performing operations to generate a first binary tree of unique data values of the first data field of the first data cell to determine whether the data value retrieved from the first data field comprises a unique value or a duplicate value, the operations comprising; searching the first binary tree to determine whether the data value retrieved from the first data field comprises a duplicate data value that is already present within the first binary tree or a unique value that is not already present within the first binary tree; and adding the data value retrieved from the first data field to the first binary tree in response to a determination that the data value retrieved from the first data field is a unique value; and perform operations to generate a second binary tree of unique data values of the second data field of the first data cell to determine whether the data value retrieved from the second data field comprises a unique value or a duplicate value, the operations comprising; searching the second binary tree to determine whether the data value retrieved from the second data field comprises a duplicate data value that is already present within the second binary tree or a unique value that is not already present within the second binary tree; and adding the data value retrieved from the second data field to the second binary tree in response to a determination that the data value retrieved from the second data field is a unique value; generating a first unique values index of identifiers of the data records associated with the unique data values within the first binary tree based on an in-order traversal of the first binary tree to enable use of the first unique values index to perform a binary search of the data values within the first data field of the data records of the first data cell; generating a second unique values index of identifiers of the data records associated with the unique data values within the second binary tree based on an in-order traversal of the second binary tree to enable use of the second unique values index to perform a binary search of the data values within the second data field of the data records of the first data cell; generating, within a super cell index corresponding to the super cell, and from the in-order traversals of the first and second binary trees, an indication of a range of the data values of the first data field within the data records of the first data cell, and an indication of a range of the data values of the second data field within the data records of the first data cell; and providing the super cell index to a control device to enable use of the super cell index by the control device. - View Dependent Claims (22, 23, 24, 25, 26, 27, 28, 29, 30)
-
Specification