Distributed data set indexing
First Claim
1. An apparatus comprising a processor of a first node device of multiple node devices, and a storage of the first node device to store instructions that, when executed by the processor, cause the processor to perform operations comprising:
- receive, at the first node device, a super cell of multiple super cells into which a data set is divided from a data file maintained by at least one data device, wherein:
the multiple super cells are distributed among the multiple node devices;
each super cell comprises multiple data cells;
each data cell of the multiple data cells comprises multiple data records; and
each data record of the multiple data records comprises a set of fields at which data values of the data set are stored;
index, at the first node device, and at least partially in parallel with other node devices of the multiple node devices, the multiple data records within each data cell of the multiple data cells by a first data field and by a second data field of the set of fields in a single read pass through each data cell of the multiple data cells, wherein for each data record within a first data cell of the received super cell, the processor is caused to:
retrieve a data value from the first data field and a data value from the second data field;
determine, based on the data value retrieved from the first data field, whether the first data field of the data record stores a unique data value, wherein the data value has not yet been retrieved by the processor from the first data field of any data record of the first data cell;
in response to a determination that the first data field of the data record stores a unique data value, add an identifier of the data record to a first unique values index of a first cell index corresponding to the first data cell, wherein identifiers of data records within the first unique values index are ordered based on the corresponding unique data values in the first data field to enable use of the first unique values index to perform a search of the data values within the first data field of the data records of the first data cell;
determine, based on the data value retrieved from the second data field, whether the second data field of the data record stores a unique data value, wherein the data value has not yet been retrieved by the processor from the second data field of any data record of the first data cell; and
in response to a determination that the second data field of the data record stores a unique data value, add an identifier of the data record to a second unique values index of the first cell index, wherein identifiers of data records within the second unique values index are ordered based on the corresponding unique data values in the second data field to enable use of the second unique values index to perform a search of the data values within the second data field of the data records of the first data cell;
generate, within a super cell index corresponding to the received super cell, an indication of a range of the data values of the first data field within the data records of the first data cell, and an indication of a range of the data values of the second data field within the data records of the first data cell, to enable use of the super cell index to determine whether a value specified in search criteria is present within one of the first and second data fields of any data record of the first data cell;
provide, to a control device, a request for a first pointer to a location within the data file at which to store the super cell, the super cell index and the first cell index;
receive, at the first node device and from the control device, the first pointer; and
transmit, to the at least one data device and at least partially in parallel with other node devices of the multiple node devices, the super cell, the super cell index and the first cell index with an instruction to store the super cell, the super cell index and the first cell index with the super cell stored in the data file starting at the location pointed to by the first pointer, with the super cell index and the first cell index stored in the data file at a location after the super cell.
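The single-pass indexing recited above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `index_cell`, the use of list positions as record identifiers, and the in-memory dict records are all assumptions made for illustration. Each record is read once, both fields are checked for a not-yet-seen value, and the identifier of the first record holding each unique value is kept in value order. The minimum and maximum of each field are returned as the range that a super cell index could hold for this cell.

```python
from bisect import bisect_left

def index_cell(records, fields):
    """Index one data cell by several fields in a single read pass.

    records: list of dicts, one per data record (hypothetical layout);
             a record's position in the list serves as its identifier.
    fields:  names of the fields to index, e.g. the first and second field.
    Returns (unique_indexes, ranges).
    """
    # Per field: sorted unique values, and in parallel the identifier of
    # the first record in which each unique value was encountered.
    keys = {f: [] for f in fields}
    ids = {f: [] for f in fields}

    for record_id, record in enumerate(records):  # the single read pass
        for f in fields:
            value = record[f]
            pos = bisect_left(keys[f], value)
            seen = pos < len(keys[f]) and keys[f][pos] == value
            if not seen:  # value is unique so far within this cell
                keys[f].insert(pos, value)
                ids[f].insert(pos, record_id)

    # Record identifiers ordered by their unique field values.
    unique_indexes = {f: ids[f] for f in fields}
    # (min, max) per field, as a range indication for the super cell index.
    ranges = {f: (keys[f][0], keys[f][-1]) for f in fields if keys[f]}
    return unique_indexes, ranges
```

Keeping the unique values sorted as they are inserted means no separate sort pass is needed once the cell has been read. A production system would likely use a more compact structure than Python lists.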
Abstract
An apparatus including a processor to index data records within a data cell, wherein for each data record, the processor retrieves data values from first and second data fields; determines whether the first and second data fields store unique data values; in response to the first data field storing a unique data value, adds an identifier of the data record to a first unique values index; in response to the second data field storing a unique data value, adds the identifier to a second unique values index, wherein identifiers of data records within the unique values indexes are ordered based on corresponding unique data values; and generates an indication of ranges of data values of the first and second data fields to enable a determination of whether a data value specified in search criteria is present within at least the data cell.
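How the two structures might be used together at search time can be sketched as follows. This is an illustrative assumption rather than anything the abstract specifies: the names `cell_may_contain` and `find_first_record`, and the dict-based records, are hypothetical. The range check can only rule a cell out; when the value lies inside the range, the ordered unique values index is binary-searched to confirm whether the value is actually present.

```python
def cell_may_contain(value_range, value):
    """Range check: False means the cell can be skipped entirely."""
    low, high = value_range
    return low <= value <= high

def find_first_record(records, field, ordered_ids, value_range, value):
    """Return the identifier of a record whose `field` equals `value`,
    or None if the cell holds no such record.

    ordered_ids: record identifiers ordered by their unique field values
                 (a hypothetical form of a unique values index).
    """
    if not cell_may_contain(value_range, value):
        return None  # ruled out by the range alone, no index lookup needed

    # Binary search over the ordered identifiers, comparing stored values.
    lo, hi = 0, len(ordered_ids)
    while lo < hi:
        mid = (lo + hi) // 2
        if records[ordered_ids[mid]][field] < value:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(ordered_ids) and records[ordered_ids[lo]][field] == value:
        return ordered_ids[lo]
    return None
```

A value inside the range may still be absent from the cell, which is why the sketch falls through to the index lookup instead of treating the range check as conclusive.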
37 Citations
30 Claims
11. A computer-program product tangibly embodied in a non-transitory machine-readable storage medium, the computer-program product including instructions operable to cause a processor of a first node device of multiple node devices to perform operations comprising:
- receive, at the first node device, a super cell of multiple super cells into which a data set is divided from a data file maintained by at least one data device, wherein:
the multiple super cells are distributed among the multiple node devices;
each super cell comprises multiple data cells;
each data cell of the multiple data cells comprises multiple data records; and
each data record of the multiple data records comprises a set of fields at which data values of the data set are stored;
index, at the first node device, and at least partially in parallel with other node devices of the multiple node devices, the multiple data records within each data cell of the multiple data cells by a first data field and by a second data field of the set of fields in a single read pass through each data cell of the multiple data cells, wherein for each data record within a first data cell of the received super cell, the processor is caused to:
retrieve a data value from the first data field and a data value from the second data field;
determine, based on the data value retrieved from the first data field, whether the first data field of the data record stores a unique data value, wherein the data value has not yet been retrieved by the processor from the first data field of any data record of the first data cell;
in response to a determination that the first data field of the data record stores a unique data value, add an identifier of the data record to a first unique values index of a first cell index corresponding to the first data cell, wherein identifiers of data records within the first unique values index are ordered based on the corresponding unique data values in the first data field to enable use of the first unique values index to perform a search of the data values within the first data field of the data records of the first data cell;
determine, based on the data value retrieved from the second data field, whether the second data field of the data record stores a unique data value, wherein the data value has not yet been retrieved by the processor from the second data field of any data record of the first data cell; and
in response to a determination that the second data field of the data record stores a unique data value, add an identifier of the data record to a second unique values index of the first cell index, wherein identifiers of data records within the second unique values index are ordered based on the corresponding unique data values in the second data field to enable use of the second unique values index to perform a search of the data values within the second data field of the data records of the first data cell;
generate, within a super cell index corresponding to the received super cell, an indication of a range of the data values of the first data field within the data records of the first data cell, and an indication of a range of the data values of the second data field within the data records of the first data cell, to enable use of the super cell index to determine whether a value specified in search criteria is present within one of the first and second data fields of any data record of the first data cell;
provide, to a control device, a request for a first pointer to a location within the data file at which to store the super cell, the super cell index and the first cell index;
receive, at the first node device and from the control device, the first pointer; and
transmit, to the at least one data device and at least partially in parallel with other node devices of the multiple node devices, the super cell, the super cell index and the first cell index with an instruction to store the super cell, the super cell index and the first cell index with the super cell stored in the data file starting at the location pointed to by the first pointer, with the super cell index and the first cell index stored in the data file at a location after the super cell.
- View Dependent Claims (12, 13, 14, 15, 16, 17, 18, 19, 20)
21. A computer-implemented method comprising:
- receiving, at a first node device of multiple node devices, a super cell of multiple super cells into which a data set is divided from a data file maintained by at least one data device, wherein:
the multiple super cells are distributed among the multiple node devices;
each super cell comprises multiple data cells;
each data cell of the multiple data cells comprises multiple data records; and
each data record of the multiple data records comprises a set of fields at which data values of the data set are stored;
indexing, at the first node device, and at least partially in parallel with other node devices of the multiple node devices, the multiple data records within each data cell of the multiple data cells by a first data field and by a second data field of the set of fields in a single read pass through each data cell of the multiple data cells, wherein for each data record within a first data cell of the received super cell, the operations include:
retrieving a data value from the first data field and a data value from the second data field;
determining, based on the data value retrieved from the first data field, whether the first data field of the data record stores a unique data value, wherein the data value has not yet been retrieved from the first data field of any data record of the first data cell;
in response to a determination that the first data field of the data record stores a unique data value, adding an identifier of the data record to a first unique values index of a first cell index corresponding to the first data cell, wherein identifiers of data records within the first unique values index are ordered based on the corresponding unique data values in the first data field to enable use of the first unique values index to perform a search of the data values within the first data field of the data records of the first data cell;
determining, based on the data value retrieved from the second data field, whether the second data field of the data record stores a unique data value, wherein the data value has not yet been retrieved from the second data field of any data record of the first data cell; and
in response to a determination that the second data field of the data record stores a unique data value, adding an identifier of the data record to a second unique values index of the first cell index, wherein identifiers of data records within the second unique values index are ordered based on the corresponding unique data values in the second data field to enable use of the second unique values index to perform a search of the data values within the second data field of the data records of the first data cell;
generating, within a super cell index corresponding to the received super cell, an indication of a range of the data values of the first data field within the data records of the first data cell, and an indication of a range of the data values of the second data field within the data records of the first data cell, to enable use of the super cell index to determine whether a value specified in search criteria is present within one of the first and second data fields of any data record of the first data cell;
providing, to a control device, a request for a first pointer to a location within the data file at which to store the super cell, the super cell index and the first cell index;
receiving, at the first node device and from the control device, the first pointer; and
transmitting, to the at least one data device and at least partially in parallel with other node devices of the multiple node devices, the super cell, the super cell index and the first cell index with an instruction to store the super cell, the super cell index and the first cell index with the super cell stored in the data file starting at the location pointed to by the first pointer, with the super cell index and the first cell index stored in the data file at a location after the super cell.
- View Dependent Claims (22, 23, 24, 25, 26, 27, 28, 29, 30)
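The pointer request and storage steps at the end of each independent claim can be pictured with a small Python sketch. Everything here is a hypothetical stand-in: `ControlDevice`, `request_pointer`, and `store_super_cell` are invented names, an in-memory `io.BytesIO` plays the role of the data file, and passing a byte count with the request is an assumption the claims do not state. The sketch shows only the layout idea, namely that the super cell starts at the granted pointer and its indexes follow it.

```python
import io
import threading

class ControlDevice:
    """Hypothetical coordinator that hands out non-overlapping offsets."""

    def __init__(self):
        self._next_offset = 0
        self._lock = threading.Lock()

    def request_pointer(self, size):
        # Assumption: the requester states how many bytes it will store.
        with self._lock:  # node devices may ask concurrently
            pointer = self._next_offset
            self._next_offset += size
            return pointer

def store_super_cell(data_file, control, super_cell, super_cell_index, cell_index):
    """Write a super cell at its granted pointer, followed by its indexes."""
    total = len(super_cell) + len(super_cell_index) + len(cell_index)
    pointer = control.request_pointer(total)
    data_file.seek(pointer)
    data_file.write(super_cell)        # starts at the location pointed to
    data_file.write(super_cell_index)  # stored after the super cell
    data_file.write(cell_index)
    return pointer
```

Because each request reserves a disjoint byte range, several node devices could in principle write their super cells to the same file in parallel without overlapping one another.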
Specification