Method and apparatus for improving write performance in a cluster-based file system
Abstract
A method of writing to cache in a clustered environment. A first node in a storage cluster receives a request to write data from a user application. The first node determines if the data is owned by a remote node. If the data is owned by a remote node, the data in the remote node may be invalidated, if necessary. Such invalidation may not be necessary if a global cache directory is utilized. Thereafter, the data is written in a cache of the first node. Additionally, the data is written in a cache of a partner node of the first node. Confirmation of the cache write in the partner node is then received in the first node.
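The write path summarized in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Node` class, the `owners` dictionary standing in for ownership tracking, and the `"ack"` return value are all assumptions made for the example, and the remote copy is invalidated unconditionally rather than only "if necessary".

```python
class Node:
    """Hypothetical storage node holding a primary cache and a mirror."""

    def __init__(self, name):
        self.name = name
        self.cache = {}      # primary copies written through this node
        self.mirror = {}     # secondary copies held on behalf of another node
        self.partner = None  # node that mirrors this node's cache

    def receive_mirror(self, block, data):
        # Store the secondary copy and confirm the cache write.
        self.mirror[block] = data
        return "ack"

    def invalidate(self, block):
        # Drop a stale copy after another node takes over the block.
        self.cache.pop(block, None)


def write(first, block, data, owners):
    """Write `data` for `block` via `first`; `owners` maps block -> Node."""
    owner = owners.get(block)
    if owner is not None and owner is not first:
        owner.invalidate(block)          # data was owned by a remote node
    first.cache[block] = data            # write in the first node's cache
    owners[block] = first
    return first.partner.receive_mirror(block, data)  # partner's response
```

The caller treats the write as complete once the partner's response arrives, which is what lets the disk write happen later.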
48 Claims
-
1. A method of writing to cache in a clustered environment comprising:
(a) receiving a request to write data in a first node of a storage cluster from a user application;
(b) determining if the data is owned by a remote node;
(c) if the data is owned by the remote node, causing an invalidation of the data in the remote node if necessary;
(d) writing the data in a cache of the first node;
(e) causing the data to be written in a cache of a partner node of the first node, wherein the partner node maintains, in the partner node's cache, a secondary data copy of the first node's cached data;
(f) receiving, in the first node, a response from the partner node; and
(g) removing the first node from the storage cluster by:
(i) ensuring that data in the cache of the first node is safely stored;
(ii) establishing an owner-partner relationship between the partner node and a second node for which the first node was a partner; and
(iii) removing the first node.
the first node observing a read-intensive workload; and
decreasing the upper bound.
-
-
7. The method of claim 5 further comprising:
-
the first node observing a write-intensive workload; and
increasing the upper bound.
-
-
8. The method of claim 5 further comprising:
-
determining if the upper bound has been reached;
waiting until data has been flushed to disk prior to writing to the cache of the partner node.
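Claims 6 through 8 adjust and enforce an upper bound that is defined in claim 5, which is not reproduced in this excerpt. The sketch below therefore assumes that the bound limits the number of mirrored blocks not yet flushed to disk. The class `MirrorThrottle`, its method names, and the fixed step size are hypothetical.

```python
class MirrorThrottle:
    """Hypothetical tracker for an adaptive upper bound on mirrored data."""

    def __init__(self, upper_bound, step=1, floor=1):
        self.upper_bound = upper_bound
        self.step = step
        self.floor = floor
        self.dirty = 0  # mirrored blocks not yet flushed to disk (assumed unit)

    def observe(self, reads, writes):
        # Read-intensive workload: decrease the bound.
        if reads > writes:
            self.upper_bound = max(self.floor, self.upper_bound - self.step)
        # Write-intensive workload: increase the bound.
        elif writes > reads:
            self.upper_bound += self.step

    def must_wait(self):
        # True when the bound is reached; the writer should wait for a flush
        # before mirroring more data to the partner's cache.
        return self.dirty >= self.upper_bound

    def mirrored(self):
        self.dirty += 1

    def flushed(self, n=1):
        self.dirty = max(0, self.dirty - n)
```

A writer would call `must_wait()` before each mirror operation and block until `flushed()` has been reported for enough blocks.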
-
-
9. The method of claim 1 further comprising:
-
determining if the first node crashes; and
recovering data using the data stored in the cache of the partner node.
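One way to picture the recovery step of claim 9 is to replay the partner's mirrored copy to disk. The claim only states that data is recovered using the partner's cache. Writing it to disk, the function name, and the dictionary representation are assumptions of this sketch.

```python
def recover_from_partner(partner_mirror, disk):
    """Replay the partner's mirror of a crashed node's cache onto disk.

    partner_mirror: dict block -> data held by the partner (assumed layout)
    disk:           dict block -> data standing in for stable storage
    Returns the list of recovered block numbers in ascending order.
    """
    recovered = []
    for block, data in sorted(partner_mirror.items()):
        disk[block] = data        # the mirrored copy supersedes stale disk data
        recovered.append(block)
    partner_mirror.clear()        # mirror entries are no longer needed
    return recovered
```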
-
-
10. The method of claim 1 further comprising:
-
writing data in the cache of the first node to disk;
causing any new write requests to the first node to be synchronously written to disk;
causing the second node to write data in a cache of the second node to disk;
causing the partner node to remove mirrored cache entries for the first node when the writing of the data in the cache of the first node to disk is complete; and
removing the first node.
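The flush-based removal of claim 10, combined with the re-pairing of claim 1(g)(ii), might look like the following. This is a single-threaded sketch under stated assumptions: the `Node` fields, the `write_through` flag, the list used as the cluster, and the choice to enable synchronous writes before flushing are all illustrative.

```python
class Node:
    """Hypothetical storage node for the removal sketch."""

    def __init__(self, name):
        self.name = name
        self.cache = {}             # dirty primary copies
        self.mirror = {}            # copies mirrored for another node
        self.partner = None         # node that mirrors this node's cache
        self.write_through = False  # when True, new writes go straight to disk


def flush(node, disk):
    # Write the node's cached data to disk and empty the cache.
    disk.update(node.cache)
    node.cache.clear()


def remove_node(first, second, disk, cluster):
    """Remove `first`; `second` is the node for which `first` was a partner."""
    partner = first.partner
    # Set first so nothing new becomes dirty during the flush
    # (an ordering choice of this sketch).
    first.write_through = True
    flush(first, disk)           # first node's cache is now safely stored
    flush(second, disk)          # second node's mirror on `first` is going away
    partner.mirror.clear()       # partner drops mirrored entries for `first`
    second.partner = partner     # new owner-partner relationship
    cluster.remove(first)
```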
-
-
11. The method of claim 1 further comprising a global cache directory manager ensuring that directory information is consistent with information stored in the cache of the partner node and a cache of the second node, said ensuring comprising:
-
removing directory entries for mirrored cache in the partner node that are owned by the first node so that subsequent requests can find data from disk, wherein the first node continues to accept invalidation messages until the global cache directory manager ensures consistent directory states;
removing mirrored cache entries in the partner node that are owned by the first node;
removing directory entries that are owned by the first node; and
informing the first node that it may be removed.
-
-
12. The method of claim 1 further comprising:
-
the first node notifying the partner node of the removal of the first node;
causing the partner node to read mirrored cache data in the first node;
causing the partner node to write the mirrored cache data to the cache of the partner node, wherein the write causes a replication of the data to a cache of a third node; and
removing the first node.
-
-
13. The method of claim 1 further comprising:
-
storing additional information on who a node's partner is in a phase number; and
determining a node's partner based on an indirect lookup table and the phase number.
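The claim does not say how the phase number and the indirect lookup table are laid out. As one possible reading, the table could map each phase number to the node ordering in force during that phase, with a node's partner being its successor in that ordering. The function name, the table layout, and the successor rule below are all assumptions.

```python
def partner_of(node, phase, table):
    """Look up a node's partner for a given phase.

    table: dict phase number -> list of node names in ring order (assumed).
    The partner is taken to be the next node in the ring, wrapping around.
    """
    ring = table[phase]
    i = ring.index(node)
    return ring[(i + 1) % len(ring)]


# Hypothetical table: phase 0 has three nodes; in phase 1 node "a" is gone.
table = {
    0: ["a", "b", "c"],
    1: ["b", "c"],
}
```

Under this reading, changing cluster membership only requires adding a table entry and advancing the phase number, rather than rewriting partner information on every node.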
-
-
14. The method of claim 1 further comprising:
-
receiving a node removal command in the second node;
identifying the partner node as a partner of the second node;
flushing dirty cache from the second node to disk;
flushing dirty cache from the first node to disk;
invalidating entries in a global cache directory based on the flushing;
removing cache entries corresponding to the flushed cache lines from the global cache directory;
notifying the first node when the flushing has been completed in the second node; and
removing the first node.
-
-
15. The method of claim 14 wherein block addresses of written data are inserted into a hash table that is used to identify data that has been written to disk.
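The hash table of claim 15 can be illustrated with a Python `set`, which is hash-based. The sketch flushes dirty blocks, records their addresses, and then uses the recorded addresses to prune the global cache directory as in claim 14. The function name and the dictionary shapes are assumptions.

```python
def flush_and_prune(dirty, disk, directory):
    """Flush dirty blocks and drop their entries from the directory.

    dirty:     dict block address -> data awaiting flush
    disk:      dict block address -> data standing in for stable storage
    directory: dict block address -> owning node name (assumed layout)
    Returns the set of block addresses written to disk.
    """
    written = set()                  # hash table of flushed block addresses
    for addr, data in dirty.items():
        disk[addr] = data
        written.add(addr)
    for addr in list(directory):     # copy keys so deletion is safe
        if addr in written:          # constant-time average membership test
            del directory[addr]
    return written
```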
-
16. The method of claim 1 further comprising causing the data to be asynchronously written to disk.
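Asynchronous writing to disk, as in claim 16, is commonly done with a background writer that drains a queue. The sketch below uses Python's standard `queue` and `threading` modules. The `WriteBehind` class and its methods are hypothetical, and a dictionary stands in for the disk.

```python
import queue
import threading


class WriteBehind:
    """Hypothetical write-behind cache: cache first, disk later."""

    def __init__(self, disk):
        self.disk = disk
        self.cache = {}
        self._q = queue.Queue()
        # Daemon thread so the sketch does not block interpreter exit.
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, block, data):
        # Returns as soon as the data is cached and queued for the disk.
        self.cache[block] = data
        self._q.put((block, data))

    def _drain(self):
        # Background loop that performs the deferred disk writes.
        while True:
            block, data = self._q.get()
            self.disk[block] = data
            self._q.task_done()

    def wait_flushed(self):
        # Block until every queued write has reached the disk.
        self._q.join()
```

In the claimed method the mirrored copy in the partner's cache is what protects the data during the window before the background write completes.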
-
17. An apparatus for writing cache in a clustered environment comprising:
-
(a) a cache;
(b) a first storage node and a partner storage node organized in a storage cluster, each storage node having an interface for connecting to a host and a storage disk, wherein each storage node maintains cache, wherein the partner storage node maintains, in the partner storage node's cache, a secondary data copy of the first storage node's cached data and wherein at least one of the storage nodes is configured to:
(i) receive a request to write data from a user application;
(ii) determine if the data is owned by a remote node;
(iii) if the data is owned by the remote node, cause an invalidation of the data in the remote node if necessary;
(iv) write the data in a cache of the first node;
(v) cause the data to be written in a cache of a partner node of the first node; and
(vi) receive, in the first node, a response from the partner node;
(vii) remove the first node from the storage cluster by:
(1) ensuring that data in the cache of the first node is safely stored;
(2) establishing an owner-partner relationship between the partner node and a second node for which the first node was a partner; and
(3) removing the first node.
observe a read-intensive workload; and
decrease the upper bound.
-
-
23. The apparatus of claim 21 wherein at least one of the nodes is further configured to:
-
observe a write-intensive workload; and
increase the upper bound.
-
-
24. The apparatus of claim 21 wherein at least one of the nodes is further configured to:
-
determine if the upper bound has been reached;
wait until data has been flushed to disk prior to writing to the cache of the partner node.
-
-
25. The apparatus of claim 17 wherein at least one of the nodes is further configured to:
-
determine if the first node crashes; and
recover data using the data stored in the cache of the partner node.
-
-
26. The apparatus of claim 17 wherein at least one of the nodes is further configured to:
-
write data in the cache of the first node to disk;
cause any new write requests to the first node to be synchronously written to disk;
cause the second node to write data in a cache of the second node to disk;
cause the partner node to remove mirrored cache entries for the first node when the writing of the data in the cache of the first node to disk is complete; and
remove the first node.
-
-
27. The apparatus of claim 17 further comprising a global cache directory manager configured to ensure that directory information is consistent with information stored in the cache of the partner node and a cache of the second node, said manager configured to ensure by:
-
removing directory entries for mirrored cache in the partner node that are owned by the first node so that subsequent requests can find data from disk, wherein the first node continues to accept invalidation messages until the global cache directory manager ensures consistent directory states;
removing mirrored cache entries in the partner node that are owned by the first node;
removing directory entries that are owned by the first node; and
informing the first node that it may be removed.
-
-
28. The apparatus of claim 17 wherein at least one of the nodes is configured to:
-
notify the partner node of the removal of the first node;
cause the partner node to read mirrored cache data in the first node;
cause the partner node to write the mirrored cache data to the cache of the partner node, wherein the write causes a replication of the data to a cache of a third node; and
remove the first node.
-
-
29. The apparatus of claim 17 wherein at least one of the nodes is further configured to:
-
store additional information on who a node's partner is in a phase number; and
determine a node's partner based on an indirect lookup table and the phase number.
-
-
30. The apparatus of claim 17 wherein at least one of the nodes is further configured to:
-
receive a node removal command in the second node;
identify the partner node as a partner of the second node;
flush dirty cache from the second node to disk;
flush dirty cache from the first node to disk;
invalidate entries in a global cache directory based on the flushing;
remove cache entries corresponding to the flushed cache lines from the global cache directory;
notify the first node when the flushing has been completed in the second node; and
remove the first node.
-
-
31. The apparatus of claim 30 wherein at least one of the nodes is further configured to insert block addresses of written data into a hash table that is used to identify data that has been written to disk.
-
32. The apparatus of claim 17 wherein at least one of the nodes is further configured to cause the data to be asynchronously written to disk.
-
33. An article of manufacture, embodying logic to perform a method of writing cache in a clustered environment, the method comprising:
-
(a) receiving a request to write data in a first node of a storage cluster from a user application;
(b) determining if the data is owned by a remote node;
(c) if the data is owned by the remote node, causing an invalidation of the data in the remote node if necessary;
(d) writing the data in a cache of the first node;
(e) causing the data to be written in a cache of a partner node of the first node, wherein the partner node maintains, in the partner node's cache, a secondary data copy of the first node's cached data;
(f) receiving, in the first node, a response from the partner node; and
(g) removing the first node from the storage cluster by:
(i) ensuring that data in the cache of the first node is safely stored;
(ii) establishing an owner-partner relationship between the partner node and a second node for which the first node was a partner; and
(iii) removing the first node.
the first node observing a read-intensive workload; and
decreasing the upper bound.
-
-
39. The article of manufacture of claim 37, the method further comprising:
-
the first node observing a write-intensive workload; and
increasing the upper bound.
-
-
40. The article of manufacture of claim 37, the method further comprising:
-
determining if the upper bound has been reached;
waiting until data has been flushed to disk prior to writing to the cache of the partner node.
-
-
41. The article of manufacture of claim 33, the method further comprising:
-
determining if the first node crashes; and
recovering data using the data stored in the cache of the partner node.
-
-
42. The article of manufacture of claim 33, the method further comprising:
-
writing data in the cache of the first node to disk;
causing any new write requests to the first node to be synchronously written to disk;
causing the second node to write data in a cache of the second node to disk;
causing the partner node to remove mirrored cache entries for the first node when the writing of the data in the cache of the first node to disk is complete; and
removing the first node.
-
-
43. The article of manufacture of claim 33, the method further comprising a global cache directory manager ensuring that directory information is consistent with information stored in the cache of the partner node and a cache of the second node, said ensuring comprising:
-
removing directory entries for mirrored cache in the partner node that are owned by the first node so that subsequent requests can find data from disk, wherein the first node continues to accept invalidation messages until the global cache directory manager ensures consistent directory states;
removing mirrored cache entries in the partner node that are owned by the first node;
removing directory entries that are owned by the first node; and
informing the first node that it may be removed.
-
-
44. The article of manufacture of claim 33, the method further comprising:
-
the first node notifying the partner node of the removal of the first node;
causing the partner node to read mirrored cache data in the first node;
causing the partner node to write the mirrored cache data to the cache of the partner node, wherein the write causes a replication of the data to a cache of a third node; and
removing the first node.
-
-
45. The article of manufacture of claim 33, the method further comprising:
-
storing additional information on who a node'"'"'s partner is in a phase number; and
determining a node'"'"'s partner based on an indirect lookup table and the phase number.
-
-
46. The article of manufacture of claim 33, the method further comprising:
-
receiving a node removal command in the second node;
identifying the partner node as a partner of the second node;
flushing dirty cache from the second node to disk;
flushing dirty cache from the first node to disk;
invalidating entries in a global cache directory based on the flushing;
removing cache entries corresponding to the flushed cache lines from the global cache directory;
notifying the first node when the flushing has been completed in the second node; and
removing the first node.
-
-
47. The article of manufacture of claim 46 wherein block addresses of written data are inserted into a hash table that is used to identify data that has been written to disk.
-
48. The article of manufacture of claim 33, the method further comprising causing the data to be asynchronously written to disk.
Specification