Method of partitioning data records
First Claim
1. A method of performing a retrieval operation in a database comprising a tree of nodes, wherein the tree of nodes comprises a root node which is connected to two or more branches originating at the root node, wherein each branch terminates at a node, wherein each node other than the root node may be a non-terminal node or a leaf node, wherein each non-terminal node is connected to two or more branches originating at the non-terminal node and terminating at a node, wherein each leaf node comprises one or more data records of the database, wherein a test associated with each non-terminal node defines a partition of data records based upon one of entropy/adjacency partition assignment and data clustering using multivariate statistical analysis, wherein a current node is initially set to the root node, said method comprising the steps of:
- (a) receiving input of a search request providing a retrieval operation and information necessary to perform the retrieval operation;
(b) performing the test associated with a current node responsive to the search request, said test resulting in identification of zero or more distal nodes connected to the current node, wherein said identified distal nodes can, according to the test, contain the data record;
(c) repeating step (b) using an untested distal node which is a non-terminal node as the current node; and
(d) performing the retrieval operation on each identified node that is a leaf node.
0 Assignments
0 Petitions
Accused Products
Abstract
A tree-structured index to multidimensional data is created using naturally occurring patterns and clusters within the data which permit efficient search and retrieval strategies in a database of DNA profiles. A search engine utilizes hierarchical decomposition of the database by identifying clusters of similar DNA profiles and maps to parallel computer architecture, allowing scale up past previously feasible limits. Key benefits of the new method are logarithmic scale up and parallelization. These benefits are achieved by identification and utilization of naturally occurring patterns and clusters within stored data. The patterns and clusters enable the stored data to be partitioned into subsets of roughly equal size. The method can be applied recursively, resulting in a database tree that is balanced, meaning that all paths or branches through the tree have roughly the same length. The method achieves high performance by exploiting the natural structure of the data in a manner that maintains balanced trees. Implementation of the method maps naturally to parallel computer architectures, allowing scale up to very large databases.
56 Citations
41 Claims
-
1. A method of performing a retrieval operation in a database comprising a tree of nodes, wherein the tree of nodes comprises a root node which is connected to two or more branches originating at the root node, wherein each branch terminates at a node, wherein each node other than the root node may be a non-terminal node or a leaf node, wherein each non-terminal node is connected to two or more branches originating at the non-terminal node and terminating at a node, wherein each leaf node comprises one or more data records of the database, wherein a test associated with each non-terminal node defines a partition of data records based upon one of entropy/adjacency partition assignment and data clustering using multivariate statistical analysis, wherein a current node is initially set to the root node, said method comprising the steps of:
-
(a) receiving input of a search request providing a retrieval operation and information necessary to perform the retrieval operation;
(b) performing the test associated with a current node responsive to the search request, said test resulting in identification of zero or more distal nodes connected to the current node, wherein said identified distal nodes can, according to the test, contain the data record;
(c) repeating step (b) using an untested distal node which is a non-terminal node as the current node; and
(d) performing the retrieval operation on each identified node that is a leaf node. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A method of partitioning data records in a computer into groups of roughly equal size, comprising the steps of:
-
(a) defining a function of the probability distribution of the values of a designated variable associated with the data records, wherein the function comprises a linear combination of measures of entropy and adjacency;
(b) partitioning the values of the designated variable into two or more groups, wherein the value of the function is minimized; and
(c) assigning each data record to a group according to the value of the designated variable. - View Dependent Claims (9, 10, 11, 12)
-
-
13. A method of creating a tree-structured index for a database in a computer, wherein the database comprises a tree of nodes;
- wherein the tree of nodes comprises a root node which is connected to two or more branches originating at the root node, wherein each branch terminates at a node, wherein each node other than the root node may be a non-terminal node or a leaf node, wherein each non-terminal node is connected to two or more branches originating at the non-terminal node and terminating at a node, wherein each leaf node comprises one or more data records of the database, wherein the tree-structured index comprises one or more tests associated with each non-terminal node, said method comprising the steps of;
(a) identifying naturally occurring sets of clusters in the data records of the database;
(b) defining for each identified set of clusters a test that assigns each data record to a cluster within the set of clusters; and
(c) associating each test defined in step (b) with a non-terminal node and an associated set of clusters defined in step (a), and associating with each cluster within the set of clusters one branch originating at the non-terminal node, said branch forming part of one or more paths leading to leaf nodes comprising the data records assigned to the cluster by the test. - View Dependent Claims (14, 15, 16, 17, 18)
- wherein the tree of nodes comprises a root node which is connected to two or more branches originating at the root node, wherein each branch terminates at a node, wherein each node other than the root node may be a non-terminal node or a leaf node, wherein each non-terminal node is connected to two or more branches originating at the non-terminal node and terminating at a node, wherein each leaf node comprises one or more data records of the database, wherein the tree-structured index comprises one or more tests associated with each non-terminal node, said method comprising the steps of;
-
19. A method of organizing the data records of a database into clusters, comprising the steps of:
-
(a) representing one or more variables in each data record in a binary form, whereby the value of each bit is assigned based on the value of a variable;
(b) choosing a set of variables from those represented in all of the data records, whereby principal component analysis of the set of variables yields distinct clusters of the data records;
(c) applying principal component analysis to a sample of the data records, whereby two or more principal component vectors are identified, wherein the scores of the sample data records along these vectors form distinct clusters;
(d) formulating a test based on the identified principal component vectors which assigns each data record to a cluster; and
(e) performing the test formulated in step (d) on each data record, whereby the data records are organized into clusters. - View Dependent Claims (20, 21, 22, 23, 24, 25, 26, 27, 28, 29)
-
-
30. A parallel data processing architecture for search, storage, and retrieval of data responsive to queries, comprising:
-
a root host processor, responsive to client queries, for creating a search client object and establishing an initial search queue for a query;
a plurality of host processors accessible by said root host processor, each of said root and host processors maintaining a list of available host processors, query queue length, and processing capacity for each processor;
a bus system coupling said host processors; and
a memory for storing a database tree comprising nodes and data of a database accessible via said nodes, said processors capable of executing a set of tests, associating one test with each non-terminal node of a database tree,
-
-
31. A method for search, storage and retrieval of data from a database, comprising the steps of:
-
defining a set of tests;
associating one test with each non-terminal node of a database tree, each test for defining a partition of data of the database according to one of entropy/adjacency partition assignment or data clustering using multivariable statistical analysis; and
outputting a test result in response to a query by evaluation of one of a Boolean expression or a decision tree.
-
-
32. A method of organizing the data records of a database into clusters, comprising the steps of:
-
(a) representing one or more variables in each data record in a binary form, whereby the value of each bit is assigned based on the value of a variable;
(b) choosing a set of variables from those represented in all of the data records, whereby multivariate statistical analysis of the set of variables yields distinct clusters of the data records;
(c) applying multivariate statistical analysis to a sample of the data records, whereby two or more vectors are identified, wherein the vectors of inner products of the sample data records with the identified vectors form distinct clusters;
(d) formulating a test based on the identified vectors which assigns each data record to a cluster; and
(e) performing the test formulated in step (d) on each data record, whereby the data records are organized into clusters. - View Dependent Claims (33, 34, 35, 36, 37, 38, 39, 40, 41)
-
Specification