SYSTEM AND METHOD FOR MACHINE LEARNING AND CLASSIFYING DATA
Abstract
The present invention relates in general to the field of parallel data processing, and more particularly to machine learning and classification of extremely large volumes of unstructured gene sequence data using Collaborative Analytics Gene Sequence Classification Learning Systems and Methods.
39 Claims
1. A computerized method for classifying data comprising:
(a) receiving the data;
(b) dividing the received data into two or more chunks;
(c) mapping each chunk into a token and storing the token in a token collection;
(d) hashing each token using two or more local sensitivity hashing functions, wherein each local sensitivity hashing function contains two or more random hashing seed numbers, determining a minimum hash value for each local sensitivity hashing function, and storing the minimum hash value for each local sensitivity hashing function in a minimum hash set collection;
(e) classifying the data using the minimum hash values for the tokens; and
wherein the foregoing steps are performed by one or more processors.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
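The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Several details are assumptions made only for illustration: the fixed chunk size, the overlapping fixed-length tokens, the `(a * x + b) % PRIME` hash family with two seeds per function, the use of `zlib.crc32` as a base hash, and the nearest-signature rule in `classify`. All function names are hypothetical.

```python
import random
import zlib

PRIME = 2_147_483_647  # 2**31 - 1; hypothetical modulus for the hash family

def make_hash_functions(count, seed=42):
    """Build `count` hash functions, each defined by two random seed numbers (a, b)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(count)]

def tokenize(data, chunk_size=16, token_len=4):
    """Steps (b)-(c): divide data into chunks, map chunks to tokens, collect tokens."""
    token_collection = set()
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        for i in range(max(len(chunk) - token_len + 1, 1)):
            token_collection.add(chunk[i:i + token_len])
    return token_collection

def minhash_set(token_collection, hash_functions):
    """Step (d): keep the minimum hash value seen for each hash function."""
    base = [zlib.crc32(t.encode("utf-8")) for t in token_collection]
    return [min((a * x + b) % PRIME for x in base) for a, b in hash_functions]

def similarity(set_a, set_b):
    """Fraction of hash functions whose minimum values agree."""
    return sum(x == y for x, y in zip(set_a, set_b)) / len(set_a)

def classify(data, labelled, hash_functions):
    """Step (e): return the label whose stored minimum hash set is most similar."""
    query = minhash_set(tokenize(data), hash_functions)
    return max(labelled, key=lambda label: similarity(query, labelled[label]))

hash_functions = make_hash_functions(32)
labelled = {
    "family_A": minhash_set(tokenize("ACGTACGTACGTACGTACGTACGT"), hash_functions),
    "family_B": minhash_set(tokenize("TTGGCCAATTGGCCAATTGGCCAA"), hash_functions),
}
print(classify("ACGTACGTACGTACGTACGTACGT", labelled, hash_functions))
```

Because the query in the usage lines is identical to the `family_A` example, its minimum hash set matches that entry exactly. The labels and sequences are made-up illustration data.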
13. A system for classifying data comprising:
at least one input/output interface;
a data storage;
one or more processors communicably coupled to the at least one input/output interface and the data storage; and
the one or more processors perform the steps of
(a) receiving the data from the at least one input/output interface,
(b) dividing the received data into two or more chunks,
(c) mapping each chunk into a token and storing the token in a token collection within the data storage,
(d) hashing each token using two or more local sensitivity hashing functions, wherein each local sensitivity hashing function contains two or more random hashing seed numbers, determining a minimum hash value for each local sensitivity hashing function, and storing the minimum hash value for each local sensitivity hashing function in a minimum hash set collection within the data storage, and
(e) classifying the data using the minimum hash values for the tokens.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24.
25. A computer program embodied on a non-transitory computer readable storage medium that is executed using one or more processors for classifying data comprising:
(a) a code segment for receiving the data;
(b) a code segment for dividing the received data into two or more chunks;
(c) a code segment for mapping each chunk into a token and storing the token in a token collection;
(d) a code segment for hashing each token using two or more local sensitivity hashing functions, wherein each local sensitivity hashing function contains two or more random hashing seed numbers, determining a minimum hash value for each local sensitivity hashing function, and storing the minimum hash value for each local sensitivity hashing function in a minimum hash set collection; and
(e) a code segment for classifying the data using the minimum hash values for the tokens.
26. A system for large-scale and rapid parallel processing, learning, and classification of extremely large volumes of gene sequence data comprising:
a plurality of processes operational on a plurality of interconnected processors;
the plurality of processes including a master process for coordinating the processing of a set of at least one gene sequence input data, a set of at least one application-independent Map Reduction Aggregation Methods, a set of at least one application-independent Classification Metric Functions, a set of at least one application-independent category managers, a set of at least one application-independent nested categorical key value pair collections, a set of at least one application-independent reduction operations, a set of at least one application-independent Blocking Mechanisms, a set of at least one transitional outputs collection and worker processes;
the master process performing the learning and/or classification processing coordination in response to a request to perform the learning and/or classification processing job, allocating portions of the gene sequences input data containing gene sequence text to at least one of the Map Reduction Aggregation Methods and allocating portions of the gene sequence input data containing gene sequence text and category associations to at least one of the category managers;
each of the Map Reduction Aggregation Methods including at least one Chunking Operations module comprising a first plurality of worker processes for receiving and mapping portions of the gene sequences input data into individual, independent units of transitional Sequence Chunk work comprising a consistently mapped data key and optional values that are conducive to simultaneous parallel Mapping Operations processing, wherein at least two of the first plurality of the worker processes perform Chunking Operations simultaneously in parallel;
each of the Map Reduction Aggregation Methods including at least one Mapping Operations modules comprising a second plurality of worker processes for receiving and mapping transitional Sequence Chunk outputs into individual, independent units of transitional Sequence Token work comprising a consistently mapped data key and optional values that are conducive to simultaneous parallel Locality Sensitive Hashing Operations processing, wherein at least two of the second plurality of the worker processes perform Mapping Operations simultaneously in parallel;
each of the Map Reduction Aggregation Methods including at least one Locality Sensitive Hashing modules comprising a third plurality of worker processes for receiving and performing Locality Sensitive Hashing operations on transitional Sequence Token outputs producing individual, independent units of transitional MinHash Set Items work comprising a collection of minimum hash value keys produced from a plurality of unique hashing functions and optional values that are conducive to simultaneous Reduction Operations processing and/or Classification Metric Functions, wherein at least two of the third plurality of the worker processes perform Locality Sensitive Hashing Operations simultaneously in parallel;
each of the Map Reduction Aggregation Methods including at least one transitional outputs collection allowing for shared thread-safe accesses by at least one worker process acting as a transitional outputs producer and at least one worker process acting as a transitional outputs consumer, wherein a Blocking Mechanism is utilized for managing accesses to the transitional outputs collection, wherein at least two of the worker processes perform production and consumption operations simultaneously in parallel;
each of the Map Reduction Aggregation Methods including at least one Blocking Mechanism that in response to notification when transitional outputs production has started manages the potential differences in the production and consumption speeds between at least one worker process acting as a transitional outputs producer and at least one worker process acting as a transitional outputs consumer, wherein each Blocking Mechanism in response to a complete depletion of transitional outputs by consumption worker processes before the production worker processes have completed production allows consumption worker processes to “block” or wait until additional transitional outputs are produced, wherein each Blocking Mechanism in response to an over-production of transitional outputs by the production worker processes exceeding a pre-defined transitional outputs capacity threshold allows consumption worker processes to “block” or wait until additional transitional outputs are consumed and the transitional outputs capacity threshold is no longer exceeded, wherein at least two of the worker processes perform Blocking Mechanism production and consumption operations simultaneously in parallel;
each reduction operations including at least one application specific Reduction Operations modules comprising a fourth plurality of worker processes for receiving and aggregating transitional MinHash Set Items output by reducing the minimum hash value keys and optional values eliminating the matching keys and aggregating the optional values into at least one nested categorical key value pair collection, wherein at least two of the fourth plurality of the worker processes perform Reduction Operations simultaneously in parallel;
each Classification Metric Function including one or more Classification Metric Function Operations modules comprising a fifth plurality of worker processes for performing frequency, similarity, or distance calculations, wherein each calculation is performed using the items within at least one Map Reduction Aggregation transitional output collection and/or using Map Reduction Aggregation outputs and associated categories consolidated into the Nested Categorical Key Value Pair Collection, wherein at least two of the fifth plurality of worker processes simultaneously perform the Classification Metric Functions Operations in parallel;
each Category Manager including one or more category and gene sequence management functions comprising at least one set of unique categories and Category IDs, gene sequences and Sequence IDs, default frequencies including all Sequence ID and Category ID associations, and category totals including relevant totals for all categories;
the Map Reduction Aggregation Methods applying Chunking Operations, Mapping Operations, and Locality Sensitive Hashing Operations to the retrieved input data to produce transitional MinHash Set Item outputs corresponding to a reduced set of minimum hash values representing the unique characteristics of each individual gene sequence text provided within the gene sequence input data; and
the Classification Metric Functions applying Classification Metric Functions Operations to the transitional MinHash Set Item outputs to produce Classification Totals and/or Penetration Totals corresponding to the similarity, distance, or classification between each individual gene sequence text provided within the gene sequence input data for classification and at least one other MinHash Set Item output and/or using Map Reduction Aggregation outputs and associated categories consolidated into the Nested Categorical Key Value Pair Collection.
Dependent claims: 27, 28, 29, 30, 31.
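The transitional outputs collection and Blocking Mechanism of claim 26 resemble a bounded producer/consumer buffer. The sketch below is one possible reading, assuming Python threads and the standard library `queue.Queue` as the thread-safe collection; the claim does not name a concrete data structure. In this sketch `get` waits when the collection is empty and `put` waits when the capacity threshold is reached. The names `producer`, `consumer`, `run_pipeline`, and the `SENTINEL` end marker are hypothetical.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-production marker

def producer(chunks, outputs):
    """Produce transitional outputs; put() waits while the collection is full."""
    for chunk in chunks:
        outputs.put(chunk)
    outputs.put(SENTINEL)

def consumer(outputs, results, work):
    """Consume transitional outputs; get() waits while the collection is empty."""
    while True:
        item = outputs.get()
        if item is SENTINEL:
            break
        results.append(work(item))

def run_pipeline(chunks, capacity=2, work=str.upper):
    """Run one producer and one consumer in parallel over a bounded collection."""
    outputs = queue.Queue(maxsize=capacity)  # capacity threshold
    results = []
    threads = [
        threading.Thread(target=producer, args=(chunks, outputs)),
        threading.Thread(target=consumer, args=(outputs, results, work)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(run_pipeline(["acgt", "ttga", "ccag"], capacity=1))
```

With a single consumer the results keep the production order. A system with several consumers per stage, as the claim describes, would need one end marker per consumer or another shutdown signal.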
32. A system for large-scale and rapid parallel Locality Sensitive Hashing of gene sequence Sequence Tokens input data comprising:
a plurality of processes operational on a plurality of interconnected processors;
the plurality of processes including a master process for coordinating the processing of at least one set of gene sequence's Sequence Tokens input data, a set of at least one MinHash Initialization modules, a set of at least one MinHash Producer modules, a set of at least one MinHash Set Item outputs, and worker processes;
the master process performing the Locality Sensitive Hashing processing coordination in response to a request to perform the Locality Sensitive Hashing operations, allocating each of the Sequence Token inputs containing small variable length samples of gene sequence text to at least one of the MinHash producer's worker processes;
each of the Locality Sensitive Hashing operations including at least one MinHash Initialization module using a predefined Universe Size value to generate non-negative random numbers up to the specified Universe Size used during the creation of a specified number of unique hashing functions each containing a plurality of random numbers used as hashing seeds, wherein each unique hashing function is stored in at least one MinHash delegates collection;
each of the Locality Sensitive Hashing operations including at least one MinHash Producer module comprising a first plurality of worker processes for receiving and hashing each Sequence Token's text or a pre-defined hash value for each Sequence Token's text one time for each unique hashing function contained within the MinHash Delegates collection, wherein at least two of the first plurality of the worker processes perform Locality Sensitive Hashing operations simultaneously in parallel;
each of the Locality Sensitive Hashing operations including at least one SkipDups collection for ensuring that duplicate text values contained within a Gene Token's text are hashed only one time for each unique gene sequence by each of the unique hashing functions contained within the MinHash Delegates collection;
each of the SkipDups collections using SkipDup composite keys comprising a Sequence Token's Sequence ID and text or a pre-defined hash value for a Sequence Token's text to identify duplicate Sequence Tokens within the same gene sequence;
each of the Locality Sensitive Hashing operations including a collection of MinHash Set Items for retaining the minimum hash values produced by each of the unique hashing functions contained within the MinHash delegates collection for each unique gene sequence's Gene Tokens that Locality Sensitive Hashing operations are performed against;
each MinHash Set Item containing one entry for each unique hashing function contained within the MinHash delegates collection or in alternative embodiments only a smaller predefined number of hashing operations can be performed to identify candidate matches before full MinHash Sets are produced; and
each MinHash Producer in response to receiving a Sequence Token input creating a SkipDup key, determining if MinHashing has been performed, hashing each Sequence Token's text or a pre-defined hash value for a Sequence Token's text one time for each hashing function contained within the MinHash delegates collection or a candidate set of hashing functions in alternative embodiments, and retaining the minimum hash values produced for each unique hashing function within each unique gene sequence in the collection of MinHash Set Item outputs.
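A single-threaded Python sketch of the MinHash Initialization, SkipDups, and MinHash Set Item roles in claim 32 follows. It is an illustration under stated assumptions, not the claimed system: each delegate is assumed to be a function `(a * x + b) % universe_size` with two random seeds, `zlib.crc32` stands in for the "pre-defined hash value" of a token's text, and parallel worker processes are omitted. `init_minhash_delegates` and `MinHashProducer` are hypothetical names.

```python
import random
import zlib

def init_minhash_delegates(num_functions, universe_size, seed=0):
    """Create hash functions whose random seeds lie below the Universe Size."""
    rng = random.Random(seed)
    delegates = []
    for _ in range(num_functions):
        a = rng.randrange(1, universe_size)
        b = rng.randrange(0, universe_size)
        # Bind a and b as defaults so each delegate keeps its own seeds.
        delegates.append(lambda x, a=a, b=b: (a * x + b) % universe_size)
    return delegates

class MinHashProducer:
    def __init__(self, delegates):
        self.delegates = delegates
        self.skip_dups = set()        # composite keys: (sequence_id, token_text)
        self.minhash_set_items = {}   # sequence_id -> one minimum per delegate

    def consume(self, sequence_id, token_text):
        """Hash a token once per delegate; return False if it was a duplicate."""
        key = (sequence_id, token_text)
        if key in self.skip_dups:
            return False
        self.skip_dups.add(key)
        base = zlib.crc32(token_text.encode("utf-8"))
        item = self.minhash_set_items.setdefault(
            sequence_id, [None] * len(self.delegates))
        for i, delegate in enumerate(self.delegates):
            value = delegate(base)
            if item[i] is None or value < item[i]:
                item[i] = value  # retain the minimum for this hashing function
        return True

producer = MinHashProducer(init_minhash_delegates(4, universe_size=1_000_003))
for token in ["ACG", "CGT", "ACG"]:   # the second "ACG" is skipped
    producer.consume(1, token)
print(producer.minhash_set_items[1])
```

Keying the duplicate check on both the Sequence ID and the token text means the same token is still hashed for a different sequence, which matches the per-sequence wording of the claim.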
33. A method for mapping gene sequences into Gene Sequence Chunks containing portions of the gene sequence's text broken into individual, independent units of transitional work conducive to simultaneous parallel downstream processing comprising:
receiving gene sequences input data containing at least one gene sequence text and possibly one or more categories associated with the gene sequence text;
creating one unique Sequence ID per gene sequence means identifying each gene sequence's text with a unique number operable for referring to the gene sequence in one or more associations throughout the system while maintaining only one copy of the gene sequence's text and possibly many copies of the Sequence ID utilized in many associations;
creating one unique Category ID per unique category associated with any of the gene sequence's text means identifying each category with a unique number operable for referring to the category in one or more associations throughout the system while maintaining only one copy of the category's text and possibly many copies of the Category ID utilized in many associations; and
creating a set of default frequencies means a set of key value pairs comprising one entry for each unique Category ID associated with a unique Sequence ID, wherein a Sequence ID key and at least one value containing a Category ID or nested key value pair including an optional value that contains a default frequency value and/or other relevant values for performing Classification Metric Functions.
Dependent claims: 34, 35, 36.
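The ID and default-frequency bookkeeping of claim 33 can be sketched with plain dictionaries. This is a hedged illustration only: the class name `CategoryManager` is borrowed from claim 26, while the `register` method, the sequential integer IDs, the choice to key Sequence IDs on the sequence text, and the default frequency of 1 are all assumptions.

```python
class CategoryManager:
    def __init__(self):
        self.sequence_ids = {}         # sequence text -> Sequence ID (one copy of text)
        self.category_ids = {}         # category text -> Category ID (one copy of text)
        self.default_frequencies = {}  # Sequence ID -> {Category ID: frequency}

    @staticmethod
    def _intern(table, text):
        """Return the existing ID for `text`, or assign the next sequential ID."""
        if text not in table:
            table[text] = len(table) + 1
        return table[text]

    def register(self, sequence_text, categories=(), default_frequency=1):
        """Assign IDs and record one default-frequency entry per associated category."""
        sequence_id = self._intern(self.sequence_ids, sequence_text)
        nested = self.default_frequencies.setdefault(sequence_id, {})
        for category in categories:
            category_id = self._intern(self.category_ids, category)
            nested[category_id] = default_frequency
        return sequence_id

manager = CategoryManager()
manager.register("ACGTACGT", ["kinase", "human"])
manager.register("TTGACCAG", ["kinase"])
print(manager.default_frequencies)
```

Downstream structures can then refer to sequences and categories by their small integer IDs, so the text of each is stored once, as the claim describes.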
37. A method for mapping gene sequences into Gene Sequence Tokens containing small samples of a gene sequence's text broken into individual, independent units of transitional work conducive to simultaneous parallel downstream processing comprising:
receiving gene sequence input data containing at least one gene sequence text or the starting and ending positions of a gene sequence's text, and a gene's Sequence ID;
setting the Start Position and Ending Position equal to the first position within the gene sequence's text, or setting the Ending Position equal to the Minimum Token Length, if a Minimum Token Length is used;
creating a Sequence Token equal to the text between the current Start and End Positions;
setting the End Position equal to the End Position+1; and
repeating the specified operations until the Maximum Token Length or the end of gene sequence's text is reached, whichever occurs first.
Dependent claims: 38.
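The token loop of claim 37 can be read as emitting progressively longer substrings from a fixed Start Position. The Python sketch below follows that reading. It assumes zero-based, half-open slice positions and treats the Minimum and Maximum Token Length as optional parameters; `sequence_tokens` is a hypothetical name, and the claim itself does not fix these conventions.

```python
def sequence_tokens(text, start=0, min_len=1, max_len=None):
    """Yield tokens text[start:end], growing the End Position by one each step.

    Stops at the Maximum Token Length or the end of the text, whichever
    occurs first.
    """
    limit = len(text) - start          # longest token that fits in the text
    if max_len is not None:
        limit = min(limit, max_len)
    end = start + min_len              # End Position set from Minimum Token Length
    while end - start <= limit:
        yield text[start:end]
        end += 1                       # End Position = End Position + 1

print(list(sequence_tokens("ACGTA", min_len=2, max_len=4)))
# prints ['AC', 'ACG', 'ACGT']
```

With no Minimum Token Length the first token is the single character at the Start Position, and with no Maximum Token Length the loop runs to the end of the text.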
39. A method for the locality sensitive hashing of gene sequences comprising:
receiving gene sequences input data containing at least one gene sequence text;
mapping gene sequences into transitional Gene Sequence Chunks by (a) receiving gene sequences input data containing at least one gene sequence text and possibly one or more categories associated with the gene sequence text, (b) creating one unique Sequence ID per gene sequence means identifying each gene sequence's text with a unique number operable for referring to the gene sequence in one or more associations throughout the system while maintaining only one copy of the gene sequence's text and possibly many copies of the Sequence ID utilized in many associations, (c) creating one unique Category ID per unique category associated with any of the gene sequence's text means identifying each category with a unique number operable for referring to the category in one or more associations throughout the system while maintaining only one copy of the category's text and possibly many copies of the Category ID utilized in many associations, and (d) creating a set of default frequencies means a set of key value pairs comprising one entry for each unique Category ID associated with a unique Sequence ID, wherein a Sequence ID key and at least one value containing a Category ID or nested key value pair includes an optional value that contains a default frequency value and/or other relevant values for performing Classification Metric Functions;
mapping Gene Sequence Chunks into transitional Gene Sequence Tokens by (a) receiving gene sequence input data containing at least one gene sequence text or the starting and ending positions of a gene sequence's text, and a gene's Sequence ID, (b) setting the Start Position and Ending Position equal to the first position within the gene sequence's text, or setting the Ending Position equal to the Minimum Token Length, if a Minimum Token Length is used, (c) creating a Sequence Token equal to the text between the current Start and End Positions, (d) setting the End Position equal to the End Position+1, and (e) repeating the specified operations until the Maximum Token Length or the end of gene sequence's text is reached, whichever occurs first;
receiving each transitional Gene Sequence Token;
creating a set of unique hashing functions used for the locality sensitive hashing operations;
creating a SkipDup key comprising the Gene Sequence Token's Sequence ID and text or pre-determined text hash value;
maintaining a SkipDup key set ensuring that only unique Gene Sequence Token's text or pre-determined text hash values are hashed one time for each unique gene Sequence ID and one time for each of the unique hashing functions used in the locality sensitive hashing operations;
maintaining a MinHash Set Item for each unique Sequence ID for which locality sensitive hashing operations are performed;
each time a unique Gene Sequence Token and SkipDup key are encountered, determining if a MinHash Set Item exists for the Gene Sequence Token's Sequence ID and creating a new MinHash Set Item for Sequence IDs when needed;
each time a unique hashing function produces a minimum hash value, retaining the value within the MinHash Set Item for each unique hashing function; and
providing a MinHash Set Item as locality sensitive hashing operations output for each unique Gene Sequence ID for which locality sensitive hashing operations are performed.
Specification