Data loading systems and methods

US 9,336,263 B2
Filed: 02/22/2011
Issued: 05/10/2016
Est. Priority Date: 06/04/2010
Status: Active Grant

Abstract

System, method, and computer program product for processing data are disclosed. The system is configured to perform transfer of data from a file system to a database system. Such transfer is accomplished through receiving a request for loading data into a database system, wherein the data includes a plurality of attributes, determining at least one attribute of the data for loading into the database system, and loading the at least one attribute of the data into the database system while continuing to process remaining attributes of the data.

58 Claims

1. A method for processing and transferring data from a file system to a database system, the method comprising the steps of:
- receiving a query containing a request for accessing data from a file system, wherein the request for accessing data identifies a plurality of attributes, each attribute being associated with an object identifier;
  
  determining, based on the query, whether at least one partition of at least one attribute of the data has been previously loaded into the database system;
  
  incrementally loading, based on a determination that the at least one partition of at least one attribute of the data has not been previously loaded into the database system, the at least one partition of the at least one attribute of the data into the database system while continuing to process the query without loading all attributes in the plurality of attributes identified by the request at the time of receiving the query, and without loading the at least one partition of at least one attribute of the data into the database system upon determination that the at least one partition has been previously loaded into the database system, the determination is being made based on a catalog containing a mapping of a portion of the plurality of attributes that has been previously loaded into the database system, the at least one loaded partition of the at least one attribute is being stored together with the object identifier associated with the at least one attribute; and
  
  joining the at least one loaded partition and at least another loaded partition of at least another attribute using the object identifier associated with the at least one attribute and another object identifier associated with the at least another attribute, to generate a dataset responsive to the received query;
  
  wherein the incremental loading is performed during a map phase of a MapReduce processing task.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 22, 56)
  - 2. The method according to claim 1, wherein the receiving further comprises parsing the attributes of the data as a result of the received request.
  - 3. The method according to claim 2, further comprising incrementally loading, based on the request, at least one partition of each attribute of the data into the database system as soon as each parsed attribute is processed.
  - 4. The method according to claim 3, further comprising indexing each processed parsed attribute in the database system.
  - 5. The method according to claim 4, wherein the indexing of the attributes is configured to be performed incrementally as soon as each processed parsed attribute is loaded into the database system.
  - 6. The method according to claim 5, further comprising:
    - accessing the received data loaded into the database system;
    - determining attributes of the data that have not been loaded into the database system;
    - parsing the attributes that have not been loaded into the database system;
    - processing the parsed unloaded attributes to determine at least one partition of at least one unloaded attribute for loading into the database system; and
    - loading the at least one partition of the at least one processed parsed unloaded attribute of the data into the database system while continuing to process the remaining unloaded attributes of the data.
  - 7. The method according to claim 6, further comprising indexing each loaded processed parsed unloaded attribute in the database system.
  - 8. The method according to claim 1, wherein the loading further includes loading the attributes on the column-store basis, whereby each different column of the received data is loaded independently into the database system.
  - 9. The method according to claim 1, wherein the loading is selected from a group consisting of:
    - direct loading, wherein parsed attributes of the received data are loaded into the database system as soon as the attributes are processed, and delayed loading, wherein parsed attributes of the received data are initially temporarily stored in a temporary storage and then loaded into the database system.
  - 10. The method according to claim 1, wherein the loading further comprises:
    - (a) dividing a column of loaded attributes into a plurality of portions;
    - (b) sorting the plurality of portions to obtain a first order of attributes within the column of loaded attributes;
    - (c) dividing the plurality of sorted portions of attributes into further plurality of portions;
    - (d) sorting the further plurality of portions to obtain a second order of attributes within the column of loaded attributes;
    - (e) repeating steps (c)-(d) to obtain a final order of attributes of the data.
  - 22. The method according to claim 1, wherein the receiving and the determining are performed during the map phase of the MapReduce processing task.
  - 56. The method according to claim 1, wherein the identified plurality of attributes include at least one attribute of data that has been previously loaded into the database system and at least one attribute of data that has not been previously loaded into the database system.
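
The determining, loading, and joining steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Database` class, the `ensure_loaded` and `run_query` functions, and the in-memory dictionary standing in for the file system are all hypothetical names chosen for illustration. The sketch assumes a catalog that is simply a set of (attribute, partition) pairs and a join that matches values on a shared object identifier.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in, not a name from the patent: the "file system"
# maps (attribute, partition) to raw (object_id, value) pairs.
FileSystem = Dict[Tuple[str, int], List[Tuple[int, str]]]


class Database:
    """Toy column store with a catalog of loaded (attribute, partition) pairs."""

    def __init__(self) -> None:
        self.partitions: Dict[Tuple[str, int], Dict[int, str]] = {}
        self.catalog: Set[Tuple[str, int]] = set()
        self.loads = 0  # number of partition loads actually performed

    def ensure_loaded(self, fs: FileSystem, attribute: str, partition: int) -> bool:
        """Load one partition of one attribute unless the catalog lists it."""
        key = (attribute, partition)
        if key in self.catalog:
            return False  # previously loaded, so nothing is loaded again
        # Each value is stored together with its object identifier.
        self.partitions[key] = dict(fs[key])
        self.catalog.add(key)
        self.loads += 1
        return True


def run_query(db: Database, fs: FileSystem,
              attributes: List[str], partition: int) -> List[Tuple]:
    """Load only the attributes the query names, then join them on object id."""
    for attribute in attributes:
        db.ensure_loaded(fs, attribute, partition)
    columns = [db.partitions[(a, partition)] for a in attributes]
    shared_ids = set(columns[0]).intersection(*columns[1:])
    return [(oid, *(col[oid] for col in columns)) for oid in sorted(shared_ids)]
```

Under these assumptions, a first query naming two attributes loads two partitions, and a later query that reuses one of those attributes loads only the attribute that is still missing from the catalog.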

11. A data processing system for transferring data having a plurality of attributes from a file system to a database system, the system comprising:
- a processor configured to receive a query containing a request for accessing data from a file system, wherein the request for accessing data identifies a plurality of attributes, each attribute being associated with an object identifier;
  
  determine, based on the query, whether at least one partition of at least one attribute of the data has been previously loaded into the database system;
  
  a data loader module configured to incrementally load, based on a determination that the at least one partition of at least one attribute of the data has not been previously loaded into the database system, the at least one partition of at least one attribute of the data into the database system while the processor continues to process the query without loading all attributes in the plurality of attributes at the time of receiving the query and without loading the at least one partition of at least one attribute of the data into the database system upon determination that the at least one partition has been previously loaded into the database system, the determination is being made based on a catalog containing a mapping of a portion of the plurality of attributes that has been previously loaded into the database system, the at least one loaded partition of the at least one attribute is being stored together with the object identifier associated with the at least one attribute; and
  
  a join module configured to join the at least one loaded partition and at least another loaded partition of at least another attribute using the object identifier associated with the at least one attribute and another object identifier associated with the at least another attribute, to generate a dataset responsive to the received query;
  
  wherein the incremental loading is performed during a map phase of a MapReduce processing task.
- Dependent claims (12, 13, 14, 15, 16, 17, 18, 19, 20, 23, 57)
  - 12. The system according to claim 11, further comprising a parsing module configured to parse the attributes of the data as a result of the request.
  - 13. The system according to claim 12, wherein at least one partition of each parsed attribute of the data is configured to be incrementally loaded, based on the request, into the database system as soon as each parsed attribute is processed.
  - 14. The system according to claim 13, further comprising an indexing module configured to index each processed parsed attribute in the database system.
  - 15. The system according to claim 14, wherein the indexing of the attributes is configured to be performed incrementally as soon as each processed parsed attribute is loaded into the database system.
  - 16. The system according to claim 15, wherein upon accessing the received data loaded into the database system, attributes of the data that have not been loaded into the database system are determined; wherein:
    - the parsing module is configured to parse the attributes that have not been loaded into the database system;
    - the processor is configured to process the parsed unloaded attributes to determine at least one partition of at least one unloaded attribute for loading into the database system; and
    - the data loader is configured to load the at least one partition of the at least one processed parsed unloaded attribute of the data into the database system while continuing to process the remaining unloaded attributes of the data.
  - 17. The system according to claim 16, wherein each loaded processed parsed unloaded attribute is configured to be indexed in the database system.
  - 18. The system according to claim 11, wherein the loading further includes loading the attributes on the column-store basis, whereby each different column of the received data is loaded independently into the database system.
  - 19. The system according to claim 11, wherein the data loader is configured to perform loading selected from a group consisting of:
    - direct loading, wherein parsed attributes of the received data are loaded into the database system as soon as the attributes are processed, and delayed loading, wherein parsed attributes of the received data are initially temporarily stored in a temporary storage and then loaded into the database system.
  - 20. The system according to claim 11, wherein the data loader is configured to:
    - (a) divide a column of loaded attributes into a plurality of portions;
    - (b) sort the plurality of portions to obtain a first order of attributes within the column of loaded attributes;
    - (c) divide the plurality of sorted portions of attributes into further plurality of portions;
    - (d) sort the further plurality of portions to obtain a second order of attributes within the column of loaded attributes;
    - (e) repeat (c)-(d) to obtain a final order of attributes of the data.
  - 23. The system according to claim 11, wherein the processor is configured to perform the receiving and the determining operations during the map phase of the MapReduce processing task.
  - 57. The system according to claim 11, wherein the identified plurality of attributes include at least one attribute of data that has been previously loaded into the database system and at least one attribute of data that has not been previously loaded into the database system.
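
Steps (a) through (e) of claims 10 and 20 describe a column that is sorted in stages rather than all at once. The following Python sketch shows one possible reading of those steps, not necessarily the one intended by the claims: each pass re-divides the column into portions twice as large as in the previous pass and sorts every portion, so the column becomes fully ordered after the last pass. The function names `sort_pass` and `incremental_sort`, and the doubling of the portion size, are assumptions made for illustration.

```python
from typing import Iterator, List


def sort_pass(column: List[int], portion_size: int) -> List[int]:
    """Divide the column into fixed-size portions and sort each portion."""
    result: List[int] = []
    for start in range(0, len(column), portion_size):
        result.extend(sorted(column[start:start + portion_size]))
    return result


def incremental_sort(column: List[int],
                     initial_portion_size: int = 2) -> Iterator[List[int]]:
    """Yield the column after each pass; the last yielded list is fully sorted.

    Assumption: portions double in size between passes. The claims only
    require repeated dividing and sorting, not this particular schedule.
    """
    if initial_portion_size < 1:
        raise ValueError("initial_portion_size must be at least 1")
    current = list(column)
    size = initial_portion_size
    while True:
        current = sort_pass(current, size)
        yield list(current)
        if size >= len(current):
            break  # one portion covered the whole column: final order reached
        size *= 2
```

Because each pass is a separate step of the generator, a caller could run one pass per query and leave the column partially ordered in between, which is the point of doing the work incrementally.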

21. A non-transitory computer program product, tangibly embodied in a non-transitory computer-readable medium, the computer program product causing a data processing system for transferring data from a file system to a database system, to perform operations comprising:
- receiving a query containing a request for accessing data from a file system, wherein the request for accessing data identifies a plurality of attributes, each attribute being associated with an object identifier;
  
  determining, based on the query, whether at least one partition of at least one attribute of the data has been previously loaded into the database system;
  
  incrementally loading, based on a determination that the at least one partition of at least one attribute of the data has not been previously loaded into the database system, the at least one partition of the at least one attribute of the data into the database system while continuing to process the query without loading all attributes in the plurality of attributes identified by the request at the time of receiving the query, and without loading the at least one partition of at least one attribute of the data into the database system upon determination that the at least one partition has been previously loaded into the database system, the determination is being made based on a catalog containing a mapping of a portion of the plurality of attributes that has been previously loaded into the database system, the at least one loaded partition of the at least one attribute is being stored together with the object identifier associated with the at least one attribute; and
  
  joining the at least one loaded partition and at least another loaded partition of at least another attribute using the object identifier associated with the at least one attribute and another object identifier associated with the at least another attribute, to generate a dataset responsive to the received query;
  
  wherein the incremental loading is performed during a map phase of a MapReduce processing task.
- Dependent claims (24, 58)
  - 24. The computer program product according to claim 21, wherein the receiving and the determining operations are performed during the map phase of the MapReduce processing task.
  - 58. The computer program product according to claim 21, wherein the identified plurality of attributes include at least one attribute of data that has been previously loaded into the database system and at least one attribute of data that has not been previously loaded into the database system.

25. A computer-implemented method for processing and transferring data from a file system to a database system, the method comprising the steps of:
- receiving a query containing a request for accessing data from a file system, wherein the request for accessing data identifies a plurality of attributes, each attribute being associated with an object identifier;
  
  parsing at least one attribute in the plurality of attributes from the data;
  
  incrementally loading at least one partition of the at least one parsed attribute;
  
  processing, based on the query, to determine whether the at least one partition of the at least one parsed attribute of the data has been previously loaded into the database system;
  
  loading the data containing the at least one partition of the at least one parsed attribute of the data into the database system while continuing to process the query without loading all attributes in the plurality of attributes identified by the request at the time of receiving the query, and without loading the at least one partition of at least one parsed attributed of the data into the database system upon determination that the at least one partition has been previously loaded into the database system, the determination is being made based on a catalog containing a mapping of a portion of the plurality of attributes that has been previously loaded into the database system, the at least one loaded partition of the at least one attribute is being stored together with the object identifier associated with the at least one attribute; and
  
  joining the at least one loaded partition and at least another loaded partition of at least another attribute using the object identifier associated with the at least one attribute and another object identifier associated with the at least another attribute, to generate a dataset responsive to the received query;
  
  wherein the loading is performed during a map phase of a MapReduce processing task.
- Dependent claims (26, 27, 28, 29, 30, 31, 32, 33, 34)
  - 26. The method according to claim 25, further comprising incrementally loading, based on the request, at least one partition of each attribute of the data into the database system as soon as each parsed attribute is processed.
  - 27. The method according to claim 26, further comprising indexing each processed parsed attribute in the database system.
  - 28. The method according to claim 27, wherein the indexing of the attributes is configured to be performed incrementally as soon as each processed parsed attribute is loaded into the database system.
  - 29. The method according to claim 28, further comprising:
    - accessing the received data loaded into the database system;
    - determining attributes of the data that have not been loaded into the database system;
    - parsing the attributes that have not been loaded into the database system;
    - processing the parsed unloaded attributes to determine at least one partition of the at least one unloaded attribute for loading into the database system; and
    - loading the at least one partition of the at least one processed parsed unloaded attribute of the data into the database system while continuing to process the remaining unloaded attributes of the data.
  - 30. The method according to claim 29, further comprising indexing each loaded processed parsed unloaded attribute in the database system.
  - 31. The method according to claim 25, wherein the incrementally loading further comprises loading the attributes on the column-store basis, whereby each different column of the received data is loaded independently into the database system.
  - 32. The method according to claim 25, wherein the incrementally loading is selected from a group consisting of:
    - direct loading, wherein parsed attributes of the received data are loaded into the database system as soon as the attributes are processed, and delayed loading, wherein parsed attributes of the received data are initially temporarily stored in a temporary storage and then loaded into the database system.
  - 33. The method according to claim 25, wherein the incrementally loading further comprises:
    - (a) dividing a column of loaded attributes into a plurality of portions;
    - (b) sorting the plurality of portions to obtain a first order of attributes within the column of loaded attributes;
    - (c) dividing the plurality of sorted portions of attributes into further plurality of portions;
    - (d) sorting the further plurality of portions to obtain a second order of attributes within the column of loaded attributes; and
    - (e) repeating steps (c)-(d) to obtain a final order of attributes of the data.
  - 34. The method according to claim 25, wherein parsing and processing are performed during a map processing task of a MapReduce processing task.
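
The distinction between direct loading and delayed loading in claims 9, 19, and 32 can be sketched as two loader strategies with the same interface. This is an illustrative Python sketch under stated assumptions: the class names `DirectLoader` and `DelayedLoader`, the `add` and `finish` methods, and the use of a plain list as the temporary storage are hypothetical and do not come from the patent.

```python
from typing import Dict, List, Tuple

# Hypothetical toy database: attribute name -> {object_id: value}.
Database = Dict[str, Dict[int, str]]


class DirectLoader:
    """Writes each parsed attribute value to the database as soon as it arrives."""

    def __init__(self, db: Database) -> None:
        self.db = db

    def add(self, attribute: str, object_id: int, value: str) -> None:
        self.db.setdefault(attribute, {})[object_id] = value

    def finish(self) -> None:
        pass  # nothing is staged, so there is nothing left to flush


class DelayedLoader:
    """Stages parsed values in temporary storage and loads them on finish()."""

    def __init__(self, db: Database) -> None:
        self.db = db
        self.staged: List[Tuple[str, int, str]] = []  # stand-in for temporary storage

    def add(self, attribute: str, object_id: int, value: str) -> None:
        self.staged.append((attribute, object_id, value))

    def finish(self) -> None:
        for attribute, object_id, value in self.staged:
            self.db.setdefault(attribute, {})[object_id] = value
        self.staged.clear()
```

With the direct loader a value is visible in the database immediately after `add`, while with the delayed loader the database stays unchanged until `finish` is called.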

35. A system comprising:
- at least one processor;
  
  at least one memory coupled to the at least one processor, the at least one memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising;
  
  receiving a query containing a request for accessing data from a file system, wherein the request for accessing data identifies a plurality of attributes, each attribute being associated with an object identifier;
  
  parsing at least one attribute in the plurality of attributes from the data;
  
  incrementally loading at least one partition of the at least one parsed attribute;
  
  processing, based on the query, to determine whether the at least one partition of the at least one parsed attribute of the data has been previously loaded into the database system;
  
  loading the data containing the at least one partition of the at least one parsed attribute of the data into the database system while continuing to process the query without loading all attributes in the plurality of attributes identified by the request at the time of receiving the query, and without loading the at least one partition of at least one parsed attributed of the data into the database system upon determination that the at least one partition has been previously loaded into the database system, the determination is being made based on a catalog containing a mapping of a portion of the plurality of attributes that has been previously loaded into the database system, the at least one loaded partition of the at least one attribute is being stored together with the object identifier associated with the at least one attribute; and
  
  joining the at least one loaded partition and at least another loaded partition of at least another attribute using the object identifier associated with the at least one attribute and another object identifier associated with the at least another attribute, to generate a dataset responsive to the received query;
  
  wherein the loading is performed during a map phase of a MapReduce processing task.
- Dependent claims (36, 37, 38, 39, 40, 41, 42, 43, 44)
  - 36. The system according to claim 35, wherein the operations further comprise incrementally loading, based on the request, at least one partition of each attribute of the data into the database system as soon as each parsed attribute is processed.
  - 37. The system according to claim 36, wherein the operations further comprise indexing each processed parsed attribute in the database system.
  - 38. The system according to claim 37, wherein the indexing of the attributes is configured to be performed incrementally as soon as each processed parsed attribute is loaded into the database system.
  - 39. The system according to claim 38, wherein the operations further comprise:
    - accessing the received data loaded into the database system;
    - determining attributes of the data that have not been loaded into the database system;
    - parsing the attributes that have not been loaded into the database system;
    - processing the parsed unloaded attributes to determine at least one partition of at least one unloaded attribute for loading into the database system; and
    - loading the at least one partition of the at least one processed parsed unloaded attribute of the data into the database system while continuing to process the remaining unloaded attributes of the data.
  - 40. The system according to claim 39, wherein the operations further comprise indexing each loaded processed parsed unloaded attribute in the database system.
  - 41. The system according to claim 35, wherein the incrementally loading further comprises loading the attributes on the column-store basis, whereby each different column of the received data is loaded independently into the database system.
  - 42. The system according to claim 35, wherein the incrementally loading the at least one attribute is selected from a group consisting of:
    - direct loading, wherein parsed attributes of the received data are loaded into the database system as soon as the attributes are processed, and delayed loading, wherein parsed attributes of the received data are initially temporarily stored in a temporary storage and then loaded into the database system.
  - 43. The system according to claim 35, wherein the incrementally loading further comprises:
    - (a) dividing a column of loaded attributes into a plurality of portions;
    - (b) sorting the plurality of portions to obtain a first order of attributes within the column of loaded attributes;
    - (c) dividing the plurality of sorted portions of attributes into further plurality of portions;
    - (d) sorting the further plurality of portions to obtain a second order of attributes within the column of loaded attributes; and
    - (e) repeating steps (c)-(d) to obtain a final order of attributes of the data.
  - 44. The system according to claim 35, wherein parsing and processing are performed during a map processing task of a MapReduce processing task.

45. A non-transitory computer program product comprising non-transitory machine-readable medium storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising:
- receiving a query containing a request for accessing data from a file system, wherein the request for accessing data identifies a plurality of attributes, each attribute being associated with an object identifier;
  
  parsing at least one attribute in the plurality of attributes from the data;
  
  incrementally loading at least one partition of the at least one parsed attribute;
  
  processing, based on the query, to determine whether the at least one partition of the at least one parsed attribute of the data has been previously loaded into the database system;
  
  loading the data containing the at least one partition of the at least one parsed attribute of the data into the database system while continuing to process the query without loading all attributes in the plurality of attributes identified by the request at the time of receiving the query, and without loading the at least one partition of at least one parsed attributed of the data into the database system upon determination that the at least one partition has been previously loaded into the database system, the determination is being made based on a catalog containing a mapping of a portion of the plurality of attributes that has been previously loaded into the database system, the at least one loaded partition of the at least one attribute is being stored together with the object identifier associated with the at least one attribute; and
  
  joining the at least one loaded partition and at least another loaded partition of at least another attribute using the object identifier associated with the at least one attribute and another object identifier associated with the at least another attribute, to generate a dataset responsive to the received query;
  
  wherein the loading is performed during a map phase of a MapReduce processing task.
- Dependent claims (46, 47, 48, 49, 50, 51, 52, 53, 54)
  - 46. The computer program product according to claim 45, wherein the operations further comprise incrementally loading, based on the request, at least one partition of each attribute of the data into the database system as soon as each parsed attribute is processed.
  - 47. The computer program product according to claim 46, wherein the operations further comprise indexing each processed parsed attribute in the database system.
  - 48. The computer program product according to claim 47, wherein the indexing of the attributes is configured to be performed incrementally as soon as each processed parsed attribute is loaded into the database system.
  - 49. The computer program product according to claim 48, wherein the operations further comprise:
    - accessing the received data loaded into the database system;
    - determining attributes of the data that have not been loaded into the database system;
    - parsing the attributes that have not been loaded into the database system;
    - processing the parsed unloaded attributes to determine at least one partition of at least one unloaded attribute for loading into the database system; and
    - loading the at least one partition of the at least one processed parsed unloaded attribute of the data into the database system while continuing to process the remaining unloaded attributes of the data.
  - 50. The computer program product according to claim 49, wherein the operations further comprise indexing each loaded processed parsed unloaded attribute in the database system.
  - 51. The computer program product according to claim 45, wherein the incrementally loading further comprises loading the attributes on the column-store basis, whereby each different column of the received data is loaded independently into the database system.
  - 52. The computer program product according to claim 45, wherein the incrementally loading is selected from a group consisting of:
    - direct loading, wherein parsed attributes of the received data are loaded into the database system as soon as the attributes are processed, and delayed loading, wherein parsed attributes of the received data are initially temporarily stored in a temporary storage and then loaded into the database system.
  - 53. The computer program product according to claim 45, wherein the incrementally loading further comprises:
    - (a) dividing a column of loaded attributes into a plurality of portions;
    - (b) sorting the plurality of portions to obtain a first order of attributes within the column of loaded attributes;
    - (c) dividing the plurality of sorted portions of attributes into further plurality of portions;
    - (d) sorting the further plurality of portions to obtain a second order of attributes within the column of loaded attributes; and
    - (e) repeating steps (c)-(d) to obtain a final order of attributes of the data.
  - 54. The computer program product according to claim 45, wherein parsing and processing are performed during a map processing task of a MapReduce processing task.

55. A method for processing and transferring data from a file system to a database system, the method comprising the steps of:
- receiving a processing task containing a request for accessing data from a file system, wherein the data includes a plurality of attributes identified by the received processing task, each attribute being associated with an object identifier;
  
  determining, based on the processing task, whether at least one partition of at least one attribute of the data has been previously loaded into the database system;
  
  incrementally loading, based on a determination that the at least one partition of at least one attribute of the data has not been previously loaded into the database system, the at least one partition of the at least one attribute of the data into the database system while continuing to process the processing task without loading all attributes in the plurality of attributes identified by the received processing task at the time of receiving the processing task and without loading the at least one partition of at least one attribute of the data into the database system upon determination that the at least one partition has been previously loaded into the database system, the determination is being made based on a catalog containing a mapping of a portion of the plurality of attributes that has been previously loaded into the database system, the at least one loaded partition of the at least one attribute is being stored together with the object identifier associated with the at least one attribute; and
  
  joining the at least one loaded partition and at least another loaded partition of at least another attribute using the object identifier associated with the at least one attribute and another object identifier associated with the at least another attribute, to generate a dataset responsive to the received query;
  
  wherein the loading is performed during a map phase of a MapReduce processing task.
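
The requirement that loading happen during the map phase of a MapReduce task can be pictured with a small simulation in plain Python. It does not use Hadoop or any real MapReduce API, and it is only a sketch: the record layout (`object_id,city,age`), the `SCHEMA` list, and the functions `parse`, `map_task`, and `reduce_task` are hypothetical. The map function parses each record of its input split for the query and, as a side effect, stores any requested attribute that the catalog does not yet list for that split.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Iterable, Iterator, List, Set, Tuple

SCHEMA = ["city", "age"]  # hypothetical attribute names following the object id


def parse(line: str) -> Tuple[int, Dict[str, str]]:
    """Split one 'object_id,city,age' record into its id and attributes."""
    object_id, *values = line.split(",")
    return int(object_id), dict(zip(SCHEMA, values))


def map_task(split_id: int, lines: Iterable[str], wanted: List[str],
             catalog: Set[Tuple[str, int]],
             database: DefaultDict[Tuple[str, int], Dict[int, str]],
             ) -> Iterator[Tuple[str, int]]:
    """Emit (value, 1) pairs for the first wanted attribute and load the
    wanted attributes of this split that the catalog does not list yet."""
    to_load = [a for a in wanted if (a, split_id) not in catalog]
    for line in lines:
        object_id, fields = parse(line)
        for attribute in to_load:
            # Values are stored together with their object identifier.
            database[(attribute, split_id)][object_id] = fields[attribute]
        yield fields[wanted[0]], 1
    # The catalog is updated only after the whole split has been processed.
    catalog.update((attribute, split_id) for attribute in to_load)


def reduce_task(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum the counts emitted by all map tasks."""
    counts: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)
```

In this sketch a query that counts records per city loads only the `city` attribute of each split it touches, and `age` stays in the file system until some later task asks for it.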

Current Assignee: Yale University
Original Assignee: Yale University
Inventors: Abadi, Daniel; Abouzied, Azza
Primary Examiner(s): Reyes, Mariela
Assistant Examiner(s): Almani, Mohsen

Application Number: US13/032,538
Publication Number: US 20110302226A1
Time in Patent Office: 1,904 Days
Field of Search: 707/825
US Class Current: 1/1
CPC Class Codes:
G06F 16/2386 Bulk updating operations da...
G06F 16/258 Data format conversion from...
