Systems and methods for high-speed searching and filtering of large datasets
First Claim
1. A computer-implemented method comprising:
- receiving from one or more computer-readable storage media electronic indicia of a multitude of alphanumeric data records of a hierarchical dataset, each data record including alphanumeric data strings for multiple corresponding defined first-level and second-level data fields; and
using one or more computer processors programmed therefor and operatively coupled to the one or more storage media, generating and storing binary indicia of the multitude of alphanumeric data records as one or more binary data files stored on one or more computer-readable storage media operatively coupled to the one or more computer processors,wherein;
the one or more data files form a single, continuous binary string including multiple first-level binary string segments and, following each first-level binary string segment, one or more corresponding second-level binary string segments;
for the first-level data fields, each range of data strings for the first-level data fields is divided into multiple corresponding first-level subranges, and the multitude of data records comprises multiple first-level subsets of the data records, wherein each first-level subset includes only those data records for which each data string of the first-level data fields falls within a corresponding one of the first-level subranges;
each first-level binary string segment encodes the data strings of the first-level data fields of a corresponding one of the first-level subsets of the data records;
each second-level binary string segment encodes data strings of the second-level data fields of that first-level subset of the data records which corresponds to the immediately preceding first-level binary string; and
arrangement of the one or more binary files enables access to the first-level and second-level binary strings in only the order in which each appears in the single, continuous binary string, without enabling random access to the binary indicia of the data strings of the multitude of data records.
0 Assignments
0 Petitions
Accused Products
Abstract
A binary data file embodies an inline tree data structure storing fields of a hierarchical dataset. The inline tree comprises first-level binary string segments, each comprising substantially contiguous second-level binary string segments, corresponding to subranges of first and second subsets of data fields. Size is reduced by substituting: binary string indices for alphanumeric strings; a data clump index for a set of correlated/anticorrelated strings; field masks for unoccupied data fields. A dedicated conversion program generates the inline tree from conventional database formats, which is read entirely into RAM to be searched/filtered by a dedicated search/filter program. Small size (<2 bytes/field/record) and contiguous arrangement enables searching/filtering of >106 records (>100 data fields) in <500 nanoseconds/record/core. Recursive subdivision of selection field ranges can guide searches that include those selection fields. One example includes geographic searching/filtering of records that include latitude and longitude fields.
70 Citations
31 Claims
-
1. A computer-implemented method comprising:
-
receiving from one or more computer-readable storage media electronic indicia of a multitude of alphanumeric data records of a hierarchical dataset, each data record including alphanumeric data strings for multiple corresponding defined first-level and second-level data fields; and using one or more computer processors programmed therefor and operatively coupled to the one or more storage media, generating and storing binary indicia of the multitude of alphanumeric data records as one or more binary data files stored on one or more computer-readable storage media operatively coupled to the one or more computer processors, wherein; the one or more data files form a single, continuous binary string including multiple first-level binary string segments and, following each first-level binary string segment, one or more corresponding second-level binary string segments; for the first-level data fields, each range of data strings for the first-level data fields is divided into multiple corresponding first-level subranges, and the multitude of data records comprises multiple first-level subsets of the data records, wherein each first-level subset includes only those data records for which each data string of the first-level data fields falls within a corresponding one of the first-level subranges; each first-level binary string segment encodes the data strings of the first-level data fields of a corresponding one of the first-level subsets of the data records; each second-level binary string segment encodes data strings of the second-level data fields of that first-level subset of the data records which corresponds to the immediately preceding first-level binary string; and arrangement of the one or more binary files enables access to the first-level and second-level binary strings in only the order in which each appears in the single, continuous binary string, without enabling random access to the binary indicia of the data strings of the multitude of data records. - View Dependent Claims (2, 3, 4, 5)
-
-
6. An article comprising a tangible, non-transitory computer-readable medium encoded to store one or more binary data files generated and stored by a method comprising:
-
receiving from one or more computer-readable storage media electronic indicia of a multitude of alphanumeric data records of a hierarchical dataset, each data record including alphanumeric data strings for multiple corresponding defined first-level and second-level data fields; and using one or more computer processors programmed therefor and operatively coupled to the one or more storage media, generating and storing binary indicia of the multitude of alphanumeric data records as one or more binary data files stored on the tangible, non-transitory computer-readable medium, wherein; the one or more data files form a single, continuous binary string including multiple first-level binary string segments and, following each first-level binary string segment, one or more corresponding second-level binary string segments; for the first-level data fields, each range of data strings for the first-level data fields is divided into multiple corresponding first-level subranges, and the multitude of data records comprises multiple first-level subsets of the data records, wherein each first-level subset includes only those data records for which each data string of the first-level data fields falls within a corresponding one of the first-level subranges; each first-level binary string segment encodes the data strings of the first-level data fields of a corresponding one of the first-level subsets of the data records; each second-level binary string segment encodes data strings of the second-level data fields of that first-level subset of the data records which corresponds to the immediately preceding first-level binary string; and arrangement of the one or more binary files enables access to the first-level and second-level binary strings in only the order in which each appears in the single, continuous binary string, without enabling random access to the binary indicia of the data strings of the multitude of data records. - View Dependent Claims (7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21)
-
-
22. A computer-implemented method comprising:
-
receiving from one or more computer-readable storage media electronic indicia of multitudes of alphanumeric data records of a hierarchical dataset, each data record including alphanumeric data strings for multiple corresponding defined first-level and second-level data fields; for each of one or more designated selection data fields of each data record, using one or more computer processors programmed therefor and operatively coupled to the one or more storage media, automatically recursively subdividing into multiple selection field subranges a corresponding range of data strings of the selection data fields, wherein each different combination of selection field subranges indicates a different corresponding selected multitude of data records of the hierarchical dataset, wherein each corresponding selected multitude includes only those data records for which each data string of the selection data fields falls within a corresponding selection field subrange; for each one of the selected multitudes of data records, using one or more computer processors programmed therefor and operatively coupled to the one or more storage media, generating and storing binary indicia of the selected multitude of alphanumeric data records as one or more binary data files stored on one or more computer-readable storage media operatively coupled to the one or more computer processors; and using one or more computer processors programmed therefor and operatively coupled to the one or more storage media, generating and storing binary indicia of a binary header string in one or more of the binary data files, wherein; for each one of the selected multitudes of data records, the one or more data files form a single, continuous binary string including multiple first-level binary string segments and, following each first-level binary string segment, one or more corresponding second-level binary string segments; for the first-level data fields, each range of data strings for the first-level data fields is divided into multiple corresponding first-level subranges, and the multitude of data records comprises multiple first-level subsets of the data records, wherein each first-level subset includes only those data records for which each data string of the first-level data fields falls within a corresponding one of the first-level subranges; each first-level binary string segment encodes the data strings of the first-level data fields of a corresponding one of the first-level subsets of the data records; each second-level binary string segment encodes data strings of the second-level data fields of that first-level subset of the data records which corresponds to the immediately preceding first-level binary string; and arrangement of the one or more binary files enables access to the first-level and second-level binary strings in only the order in which each appears in the single, continuous binary string, without enabling random access to the binary indicia of the data strings of the multitude of data records; the binary selection header string comprises a linked list of indicia of multiple combinations of selection field subranges, and the linked list is linked in a manner that reflects the recursive subdivision of the selection field subranges; and each terminal record in the linked list indicates a corresponding one of the selected multitudes of data records. - View Dependent Claims (23, 24)
-
-
25. An article comprising a tangible, non-transitory computer-readable medium encoded to store one or more binary data files generated and stored by a method comprising:
-
receiving from one or more computer-readable storage media electronic indicia of multitudes of alphanumeric data records of a hierarchical dataset, each data record including alphanumeric data strings for multiple corresponding defined first-level and second-level data fields; for each of one or more designated selection data fields of each data record, using one or more computer processors programmed therefor and operatively coupled to the one or more storage media, automatically recursively subdividing into multiple selection field subranges a corresponding range of data strings of the selection data fields, wherein each different combination of selection field subranges indicates a different corresponding selected multitude of data records of the hierarchical dataset, wherein each corresponding selected multitude includes only those data records for which each data string of the selection data fields falls within a corresponding selection field subrange; for each one of the selected multitudes of data records, using one or more computer processors programmed therefor and operatively coupled to the one or more storage media, generating and storing binary indicia of the selected multitude of alphanumeric data records as one or more binary data files stored on one or more computer-readable storage media operatively coupled to the one or more computer processors; and using one or more computer processors programmed therefor and operatively coupled to the one or more storage media, generating and storing binary indicia of a binary header string in one or more of the binary data files, wherein; for each one of the selected multitudes of data records, the one or more data files form a single, continuous binary string including multiple first-level binary string segments and, following each first-level binary string segment, one or more corresponding second-level binary string segments; for the first-level data fields, each range of data strings for the first-level data fields is divided into multiple corresponding first-level subranges, and the multitude of data records comprises multiple first-level subsets of the data records, wherein each first-level subset includes only those data records for which each data string of the first-level data fields falls within a corresponding one of the first-level subranges; each first-level binary string segment encodes the data strings of the first-level data fields of a corresponding one of the first-level subsets of the data records; each second-level binary string segment encodes data strings of the second-level data fields of that first-level subset of the data records which corresponds to the immediately preceding first-level binary string; and arrangement of the one or more binary files enables access to the first-level and second-level binary strings in only the order in which each appears in the single, continuous binary string, without enabling random access to the binary indicia of the data strings of the multitude of data records; the binary selection header string comprises a linked list of indicia of multiple combinations of selection field subranges, and the linked list is linked in a manner that reflects the recursive subdivision of the selection field subranges; and each terminal record in the linked list indicates a corresponding one of the selected multitudes of data records. - View Dependent Claims (26, 27, 28, 29)
-
-
30. A computer system comprising one or more computer processors and one or more computer-readable storage media operatively coupled to one or more of the computer processors, wherein the computer system is structured, connected, and programmed to perform a method comprising:
-
receiving from one or more of the computer-readable storage media electronic indicia of a multitude of alphanumeric data records of a hierarchical dataset, each data record including alphanumeric data strings for multiple corresponding defined first-level and second-level data fields; and using one or more of the computer processors, generating and storing binary indicia of the multitude of alphanumeric data records as one or more binary data files stored on one or more of the computer-readable storage media, wherein; the one or more data files form a single, continuous binary string including multiple first-level binary string segments and, following each first-level binary string segment, one or more corresponding second-level binary string segments; for the first-level data fields, each range of data strings for the first-level data fields is divided into multiple corresponding first-level subranges, and the multitude of data records comprises multiple first-level subsets of the data records, wherein each first-level subset includes only those data records for which each data string of the first-level data fields falls within a corresponding one of the first-level subranges; each first-level binary string segment encodes the data strings of the first-level data fields of a corresponding one of the first-level subsets of the data records; each second-level binary string segment encodes data strings of the second-level data fields of that first-level subset of the data records which corresponds to the immediately preceding first-level binary string; and arrangement of the one or more binary files enables access to the first-level and second-level binary strings in only the order in which each appears in the single, continuous binary string, without enabling random access to the binary indicia of the data strings of the multitude of data records.
-
-
31. An article comprising a tangible, non-transitory medium encoding computer-readable instructions that, when applied to a computer system comprising at least one processor, instruct the computer system to perform a method comprising:
-
receiving from one or more computer-readable storage media electronic indicia of a multitude of alphanumeric data records of a hierarchical dataset, each data record including alphanumeric data strings for multiple corresponding defined first-level and second-level data fields; and using one or more computer processors programmed therefor and operatively coupled to the one or more storage media, generating and storing binary indicia of the multitude of alphanumeric data records as one or more binary data files stored on one or more computer-readable storage media operatively coupled to the one or more computer processors, wherein; the one or more data files form a single, continuous binary string including multiple first-level binary string segments and, following each first-level binary string segment, one or more corresponding second-level binary string segments; for the first-level data fields, each range of data strings for the first-level data fields is divided into multiple corresponding first-level subranges, and the multitude of data records comprises multiple first-level subsets of the data records, wherein each first-level subset includes only those data records for which each data string of the first-level data fields falls within a corresponding one of the first-level subranges; each first-level binary string segment encodes the data strings of the first-level data fields of a corresponding one of the first-level subsets of the data records; each second-level binary string segment encodes data strings of the second-level data fields of that first-level subset of the data records which corresponds to the immediately preceding first-level binary string; and arrangement of the one or more binary files enables access to the first-level and second-level binary strings in only the order in which each appears in the single, continuous binary string, without enabling random access to the binary indicia of the data strings of the multitude of data records.
-
Specification