Techniques for Reading From and Writing to Distributed Data Stores
First Claim
1. A system for writing files to a distributed file system, comprising:
one or more processors; and
a non-transitory computer readable storage medium including instructions that, when executed by the one or more processors, cause the one or more processors to perform operations including:
receiving a request to write a file to a distributed file system, wherein the distributed file system corresponds to a plurality of data blocks distributed across a plurality of nodes;
partitioning the file into a plurality of file-parts;
assigning each of the plurality of file-parts to a file-part queue;
instantiating, at each of multiple nodes, a plurality of write tasks for completing the request to write the file to the distributed file system, wherein write tasks correspond to processes for writing data blocks to the distributed file system using pluralities of threads, and wherein data blocks include multiple data records; and
processing, in parallel, each plurality of write tasks, wherein processing each write task includes:
instantiating, for the write task, a plurality of threads for writing file-parts to the distributed file system; and
processing each of the plurality of threads in parallel, wherein processing each thread includes:
retrieving a file-part assignment from the file-part queue, wherein the file-part assignment corresponds to a particular file-part;
obtaining a data record from a data buffer associated with the file, wherein the data record corresponds to a portion of the particular file-part; and
writing the data record to a data block associated with local storage of a particular node on which the thread is processing.
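The steps recited in the claim can be illustrated with a minimal single-process sketch. It is not the patented implementation: Python threads stand in for both write tasks and their worker threads, each "node" is simulated by an in-memory dictionary acting as its local data block, and the names `partition`, `writer_thread`, `write_task`, `write_file`, and `RECORD_SIZE` are hypothetical, chosen only for illustration.

```python
import queue
import threading

RECORD_SIZE = 4  # bytes per data record (illustrative assumption)


def partition(data, part_size):
    """Split the file into file-parts, each described as (offset, length)."""
    return [(off, min(part_size, len(data) - off))
            for off in range(0, len(data), part_size)]


def writer_thread(buffer, part_queue, local_blocks, node_lock):
    """Pull file-part assignments until the shared queue is empty."""
    while True:
        try:
            offset, length = part_queue.get_nowait()  # retrieve an assignment
        except queue.Empty:
            return
        part_end = offset + length
        for rec_off in range(offset, part_end, RECORD_SIZE):
            # Obtain one data record of the file-part from the data buffer.
            record = buffer[rec_off:min(rec_off + RECORD_SIZE, part_end)]
            with node_lock:
                # Write the record to the simulated local block of this node.
                local_blocks[rec_off] = record


def write_task(buffer, part_queue, local_blocks, node_lock, threads_per_task):
    """One write task: start several threads and wait for all of them."""
    threads = [
        threading.Thread(target=writer_thread,
                         args=(buffer, part_queue, local_blocks, node_lock))
        for _ in range(threads_per_task)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def write_file(data, nodes=2, tasks_per_node=2, threads_per_task=3, part_size=16):
    """Partition the file, queue the parts, and run write tasks on every node."""
    part_queue = queue.Queue()
    for part in partition(data, part_size):
        part_queue.put(part)

    node_blocks = [dict() for _ in range(nodes)]          # simulated local storage
    node_locks = [threading.Lock() for _ in range(nodes)]
    tasks = []
    for blocks, lock in zip(node_blocks, node_locks):
        for _ in range(tasks_per_node):
            tasks.append(threading.Thread(
                target=write_task,
                args=(data, part_queue, blocks, lock, threads_per_task)))
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return node_blocks
```

Because every thread pulls its next assignment from the shared queue, each file-part is written exactly once, and the node that ends up storing it is whichever node hosts the thread that dequeued it.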
Abstract
Described herein are techniques for reading data from a distributed storage system and for writing data to a distributed storage system. The disclosed techniques make use of efficient computing task and thread usage to minimize or reduce overhead and improve read or write efficiency. For example, read or write tasks may handle multiple read or write operations instead of just a single operation, which may reduce overhead associated with task creation and termination. Additionally, operations within a single task may be processed in parallel. For example, the disclosed techniques provide MapReduce implementations useful in Apache Hadoop that perform better than previous MapReduce implementations.
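The overhead argument in the abstract can be made concrete with a small sketch that counts task creations. It rests on assumptions: Python threads stand in for tasks, the operations are arbitrary callables, and the function names `run_one_task_per_operation` and `run_reused_tasks` are hypothetical. It does not reproduce the Hadoop behavior the abstract refers to.

```python
import queue
import threading


def run_one_task_per_operation(operations):
    """Baseline: create and tear down one task for every operation."""
    results = []
    lock = threading.Lock()

    def run(op):
        value = op()
        with lock:
            results.append(value)

    tasks = [threading.Thread(target=run, args=(op,)) for op in operations]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return results, len(tasks)


def run_reused_tasks(operations, num_tasks):
    """Alternative: a fixed number of tasks, each handling many operations."""
    work = queue.Queue()
    for op in operations:
        work.put(op)
    results = []
    lock = threading.Lock()

    def task():
        while True:
            try:
                op = work.get_nowait()
            except queue.Empty:
                return  # no more operations; the task terminates once
            value = op()
            with lock:
                results.append(value)

    tasks = [threading.Thread(target=task) for _ in range(num_tasks)]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return results, len(tasks)
```

With 20 operations and `num_tasks=4`, the baseline creates 20 tasks while the reused variant creates 4 and still completes every operation; any real saving depends on how expensive task creation and termination are on the platform in question.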
30 Claims
11. A computer-program product for writing files to a distributed file system, the computer-program product tangibly embodied in a non-transitory computer readable storage medium comprising instructions configured to, when executed by one or more processors, cause the one or more processors to perform operations including:
receiving a request to write a file to a distributed file system, wherein the distributed file system corresponds to a plurality of data blocks distributed across a plurality of nodes;
partitioning the file into a plurality of file-parts;
assigning each of the plurality of file-parts to a file-part queue;
instantiating, at each of multiple nodes, a plurality of write tasks for completing the request to write the file to the distributed file system, wherein write tasks correspond to processes for writing data blocks to the distributed file system using pluralities of threads, and wherein data blocks include multiple data records; and
processing, in parallel, each plurality of write tasks, wherein processing each write task includes:
instantiating, for the write task, a plurality of threads for writing file-parts to the distributed file system; and
processing each of the plurality of threads in parallel, wherein processing each thread includes:
retrieving a file-part assignment from the file-part queue, wherein the file-part assignment corresponds to a particular file-part;
obtaining a data record from a data buffer associated with the file, wherein the data record corresponds to a portion of the particular file-part; and
writing the data record to a data block associated with local storage of a particular node on which the thread is processing.
21. A computer implemented method for writing files to a distributed file system, comprising:
receiving a request to write a file to a distributed file system, wherein the distributed file system corresponds to a plurality of data blocks distributed across a plurality of nodes;
partitioning the file into a plurality of file-parts;
assigning each of the plurality of file-parts to a file-part queue;
instantiating, at each of multiple nodes, a plurality of write tasks for completing the request to write the file to the distributed file system, wherein write tasks correspond to processes for writing data blocks to the distributed file system using pluralities of threads, and wherein data blocks include multiple data records; and
processing, in parallel, each plurality of write tasks, wherein processing each write task includes:
instantiating, for the write task, a plurality of threads for writing file-parts to the distributed file system; and
processing each of the plurality of threads in parallel, wherein processing each thread includes:
retrieving a file-part assignment from the file-part queue, wherein the file-part assignment corresponds to a particular file-part;
obtaining a data record from a data buffer associated with the file, wherein the data record corresponds to a portion of the particular file-part; and
writing the data record to a data block associated with local storage of a particular node on which the thread is processing.
Specification