Techniques for improving storage space efficiency with variable compression size unit
Abstract
Techniques for processing a data set may comprise: performing first processing that forms a first compression unit, wherein the first compression unit includes a plurality of data chunks including a first data chunk having a first entropy value less than an entropy threshold, the first processing including: receiving a second data chunk; determining, in accordance with criteria, whether to add the second data chunk to the first compression unit; and responsive to determining to add the second data chunk to the first compression unit, adding the second data chunk to the first compression unit; and compressing the first compression unit as a single compressible unit. The second data chunk may be added if its entropy value is less than the entropy threshold and if the entropy values of the first and second data chunks are similar. The second data chunk may also be added if the resulting compression unit provides sufficient storage or compression benefit.
21 Claims
-
1. A method of data processing for a data set comprising:
-
performing first processing that forms a first compression unit, wherein the first compression unit includes a first plurality of data chunks including a first data chunk having a first entropy value less than an entropy threshold, the first processing including: receiving a second data chunk; determining, in accordance with criteria, whether to add the second data chunk to the first compression unit; and responsive to determining to add the second data chunk to the first compression unit, adding the second data chunk to the first compression unit; and compressing the first compression unit as a single compressible unit, and wherein said determining whether to add the second data chunk to the first compression unit further includes: determining whether a revised estimated compression ratio associated with adding the second data chunk to the first compression unit is larger than an estimated compression ratio associated with the first compression unit without the second data chunk; and responsive to determining the revised estimated compression ratio is larger than the estimated compression ratio, determining to add the second data chunk to the first compression unit.
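The ratio test recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim only requires an estimated compression ratio, and this sketch assumes that actually compressing the bytes with zlib is an acceptable stand-in estimator. The names `estimated_ratio`, `should_add`, `repetitive`, and `noisy` are hypothetical.

```python
import hashlib
import zlib
from typing import Sequence


def estimated_ratio(data: bytes) -> float:
    """Assumed estimator: uncompressed size divided by zlib-compressed size."""
    if not data:
        return 0.0
    return len(data) / len(zlib.compress(data))


def should_add(unit_chunks: Sequence[bytes], candidate: bytes) -> bool:
    """Add the candidate only if the revised ratio exceeds the current ratio."""
    unit = b"".join(unit_chunks)
    return estimated_ratio(unit + candidate) > estimated_ratio(unit)


# Illustrative data: a highly repetitive chunk and a pseudo-random chunk.
repetitive = b"abcd" * 1024
noisy = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
```

With these inputs, `should_add([repetitive], repetitive)` accepts the chunk because the combined unit compresses better per byte, while `should_add([repetitive], noisy)` rejects it. A production estimator would likely avoid a full compression pass on every decision.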
-
-
2. The method of claim 1, wherein the criteria specifies to add the second data chunk to the first compression unit if adding the second data chunk to the first compression unit is estimated to provide at least a specified storage savings benefit.
-
3. The method of claim 1, wherein the first data chunk and the second data chunk are located at consecutive sequential logical addresses of the data set.
-
4. The method of claim 3, wherein the first data chunk is written by a first I/O operation and the second data chunk is written by a second I/O operation.
-
5. The method of claim 4, wherein the first processing is performed as part of inline processing of an I/O path when processing the first I/O operation and the second I/O operation.
-
6. The method of claim 4, wherein the first processing is performed offline and not part of inline processing of an I/O path when processing the first I/O operation and the second I/O operation.
-
7. The method of claim 1, wherein the first compression unit is a first size and includes a first number of data chunks of the data set and wherein the method includes:
-
forming a second compression unit that is a second size and includes a second number of data chunks of the data set, the first number being different than the second number; and compressing the second compression unit as a single compressible unit.
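Claim 7 describes compression units that hold different numbers of chunks. One way this could arise is a greedy pass that keeps extending the current unit while a criterion accepts the next chunk. The sketch below assumes that approach; the `max_chunks` cap, the `form_units` and `compress_units` helpers, and the toy `same_first_byte` criterion are all hypothetical and are not taken from the claims.

```python
import zlib
from typing import Callable, Iterable, List

# A criterion receives the chunks already in the unit and the candidate chunk.
Criterion = Callable[[List[bytes], bytes], bool]


def form_units(chunks: Iterable[bytes], should_add: Criterion,
               max_chunks: int = 8) -> List[List[bytes]]:
    """Greedily group consecutive chunks into variable-size compression units."""
    units: List[List[bytes]] = []
    current: List[bytes] = []
    for chunk in chunks:
        if current and len(current) < max_chunks and should_add(current, chunk):
            current.append(chunk)
        else:
            if current:
                units.append(current)  # close the unit and start a new one
            current = [chunk]
    if current:
        units.append(current)
    return units


def compress_units(units: List[List[bytes]]) -> List[bytes]:
    """Compress each unit as a single compressible unit."""
    return [zlib.compress(b"".join(unit)) for unit in units]


def same_first_byte(unit: List[bytes], chunk: bytes) -> bool:
    """Toy criterion for illustration only."""
    return unit[-1][:1] == chunk[:1]


chunks = [b"a" * 16, b"a" * 16, b"a" * 16, b"b" * 16, b"c" * 16, b"c" * 16]
units = form_units(chunks, same_first_byte)
compressed = compress_units(units)
```

Here the six chunks fall into units of three, one, and two chunks, so the units have different sizes yet each is compressed as one blob. Any of the claimed criteria could be passed in place of the toy one.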
-
-
8. A system comprising:
-
a processor; and a memory comprising code stored thereon that, when executed, performs a method of data processing for a data set comprising: performing first processing that forms a first compression unit, wherein the first compression unit includes a first plurality of data chunks including a first data chunk having a first entropy value less than an entropy threshold, the first processing including: receiving a second data chunk; determining, in accordance with criteria, whether to add the second data chunk to the first compression unit; and responsive to determining to add the second data chunk to the first compression unit, adding the second data chunk to the first compression unit; and compressing the first compression unit as a single compressible unit, and wherein said determining whether to add the second data chunk to the first compression unit further includes: determining whether a revised estimated compression ratio associated with adding the second data chunk to the first compression unit is larger than an estimated compression ratio associated with the first compression unit without the second data chunk; and responsive to determining the revised estimated compression ratio is larger than the estimated compression ratio, determining to add the second data chunk to the first compression unit.
-
-
9. A non-transitory computer readable medium comprising code stored thereon that, when executed, performs a method of data processing for a data set comprising:
-
performing first processing that forms a first compression unit, wherein the first compression unit includes a first plurality of data chunks including a first data chunk having a first entropy value less than an entropy threshold, the first processing including: receiving a second data chunk; determining, in accordance with criteria, whether to add the second data chunk to the first compression unit; and responsive to determining to add the second data chunk to the first compression unit, adding the second data chunk to the first compression unit; and compressing the first compression unit as a single compressible unit, and wherein said determining whether to add the second data chunk to the first compression unit further includes: determining whether a revised estimated compression ratio associated with adding the second data chunk to the first compression unit is larger than an estimated compression ratio associated with the first compression unit without the second data chunk; and responsive to determining the revised estimated compression ratio is larger than the estimated compression ratio, determining to add the second data chunk to the first compression unit.
-
-
10. The non-transitory computer readable medium of claim 9, wherein the criteria specifies to add the second data chunk to the first compression unit if adding the second data chunk to the first compression unit is estimated to provide at least a specified storage savings benefit.
-
11. The non-transitory computer readable medium of claim 9, wherein the first data chunk and the second data chunk are located at consecutive sequential logical addresses of the data set, wherein the first data chunk is written by a first I/O operation and the second data chunk is written by a second I/O operation.
-
12. The non-transitory computer readable medium of claim 11, wherein the first I/O operation and the second I/O operation are sequentially issued I/O operations.
-
13. The non-transitory computer readable medium of claim 11, wherein the first I/O operation and the second I/O operation are not sequentially issued I/O operations.
-
14. A method of data processing for a data set comprising:
-
performing first processing that forms a first compression unit, wherein the first compression unit includes a first plurality of data chunks including a first data chunk having a first entropy value less than an entropy threshold, the first processing including: receiving a second data chunk; determining, in accordance with criteria, whether to add the second data chunk to the first compression unit; and responsive to determining to add the second data chunk to the first compression unit, adding the second data chunk to the first compression unit; and compressing the first compression unit as a single compressible unit, wherein said determining whether to add the second data chunk to the first compression unit includes: determining whether a cumulative entropy value associated with adding the second data chunk to the first compression unit is smaller than another entropy value associated with the first compression unit without the second data chunk; and responsive to determining the cumulative entropy value associated with adding the second data chunk to the first compression unit is smaller than the another entropy value associated with the first compression unit without the second data chunk, determining to add the second data chunk to the first compression unit.
-
-
15. The method of claim 14, wherein the second data chunk is added to the first compression unit, wherein the first compression unit includes at least two data chunks prior to adding the second data chunk, wherein the cumulative entropy value is an entropy value determined based on cumulative frequencies of symbols in the at least two data chunks combined with the second data chunk, and wherein the another entropy value is a second cumulative entropy value determined based on cumulative frequencies of the symbols in the at least two data chunks without the second data chunk.
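The cumulative entropy comparison of claims 14 and 15 can be sketched as follows. This is an illustration under stated assumptions, not the patented method: it treats each byte as a symbol and uses Shannon entropy in bits per symbol, neither of which the claims specify. The names `entropy_from_counts` and `should_add_by_entropy` are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def entropy_from_counts(counts: Counter) -> float:
    """Shannon entropy (bits per symbol) from cumulative symbol frequencies."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def should_add_by_entropy(unit_chunks: Sequence[bytes], candidate: bytes) -> bool:
    """Add the candidate only if it lowers the unit's cumulative entropy."""
    counts: Counter = Counter()
    for chunk in unit_chunks:
        counts.update(chunk)          # iterating bytes yields integer symbols
    before = entropy_from_counts(counts)
    counts.update(candidate)          # cumulative frequencies with the candidate
    after = entropy_from_counts(counts)
    return after < before
```

For a unit holding `b"aabb"` and `b"abab"` (entropy 1.0 bit per symbol), the candidate `b"aaaaaaaa"` skews the distribution and lowers the entropy, so it is accepted. The candidate `b"cdef"` introduces new symbols, raises the entropy, and is rejected.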
-
16. A system comprising:
-
a processor; and a memory comprising code stored thereon that, when executed, performs a method of data processing for a data set comprising: performing first processing that forms a first compression unit, wherein the first compression unit includes a first plurality of data chunks including a first data chunk having a first entropy value less than an entropy threshold, the first processing including: receiving a second data chunk; determining, in accordance with criteria, whether to add the second data chunk to the first compression unit; and responsive to determining to add the second data chunk to the first compression unit, adding the second data chunk to the first compression unit; and compressing the first compression unit as a single compressible unit, wherein said determining whether to add the second data chunk to the first compression unit includes: determining whether a cumulative entropy value associated with adding the second data chunk to the first compression unit is smaller than another entropy value associated with the first compression unit without the second data chunk; and responsive to determining the cumulative entropy value associated with adding the second data chunk to the first compression unit is smaller than the another entropy value associated with the first compression unit without the second data chunk, determining to add the second data chunk to the first compression unit.
-
-
17. A non-transitory computer readable medium comprising code stored thereon that, when executed, performs a method of data processing for a data set comprising:
-
performing first processing that forms a first compression unit, wherein the first compression unit includes a first plurality of data chunks including a first data chunk having a first entropy value less than an entropy threshold, the first processing including: receiving a second data chunk; determining, in accordance with criteria, whether to add the second data chunk to the first compression unit; and responsive to determining to add the second data chunk to the first compression unit, adding the second data chunk to the first compression unit; and compressing the first compression unit as a single compressible unit, wherein said determining whether to add the second data chunk to the first compression unit includes: determining whether a cumulative entropy value associated with adding the second data chunk to the first compression unit is smaller than another entropy value associated with the first compression unit without the second data chunk; and responsive to determining the cumulative entropy value associated with adding the second data chunk to the first compression unit is smaller than the another entropy value associated with the first compression unit without the second data chunk, determining to add the second data chunk to the first compression unit.
-
-
18. A method of data processing for a data set comprising:
-
performing first processing that forms a first compression unit, wherein the first compression unit includes a first plurality of data chunks including a first data chunk having a first entropy value less than an entropy threshold, the first processing including: receiving a second data chunk; determining, in accordance with criteria, whether to add the second data chunk to the first compression unit; and responsive to determining to add the second data chunk to the first compression unit, adding the second data chunk to the first compression unit; and compressing the first compression unit as a single compressible unit, wherein the second data chunk is added to the first compression unit, wherein a first set of one or more data chunks includes at least the first data chunk, wherein the first compression unit includes the first set of one or more data chunks prior to adding the second data chunk, and wherein the criteria specifies to add the second data chunk to the first compression unit if the second data chunk has an associated entropy value less than the entropy threshold, and if the second data chunk and the first set of one or more data chunks are determined, in accordance with second criteria, to have similar entropy values.
-
-
19. The method of claim 18, wherein the second criteria includes determining whether entropy values of the second data chunk and the first set of data chunks all fall within a specified range or are no more than a threshold numerical distance from one another.
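The entropy threshold and similarity test of claims 18 and 19 can be sketched in Python. The sketch assumes byte-level Shannon entropy (0 to 8 bits per byte) and interprets "no more than a threshold numerical distance from one another" as a maximum spread between the largest and smallest entropy values. The default values `entropy_threshold=6.0` and `max_spread=1.0`, and all function names, are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def chunk_entropy(chunk: bytes) -> float:
    """Shannon entropy of one chunk in bits per byte (assumed symbol size)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(chunk).values())


def similar_entropy(values: Sequence[float], max_spread: float) -> bool:
    """True when all entropy values lie within max_spread of one another."""
    return max(values) - min(values) <= max_spread


def should_add_similar(unit_chunks: Sequence[bytes], candidate: bytes,
                       entropy_threshold: float = 6.0,
                       max_spread: float = 1.0) -> bool:
    """Require a compressible candidate whose entropy resembles the unit's."""
    candidate_entropy = chunk_entropy(candidate)
    if candidate_entropy >= entropy_threshold:
        return False  # candidate is treated as not compressible enough
    values = [chunk_entropy(c) for c in unit_chunks] + [candidate_entropy]
    return similar_entropy(values, max_spread)
```

With a unit containing `b"aabb" * 8` (1.0 bit per byte), the candidate `b"abab" * 8` has the same entropy and is accepted. The candidate `bytes(range(256))` fails the threshold at 8.0 bits per byte, and `b"abcdefgh" * 4` passes the threshold at 3.0 bits per byte but fails the similarity test.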
-
20. A system comprising:
-
a processor; and a memory comprising code stored thereon that, when executed, performs a method of data processing for a data set comprising: performing first processing that forms a first compression unit, wherein the first compression unit includes a first plurality of data chunks including a first data chunk having a first entropy value less than an entropy threshold, the first processing including: receiving a second data chunk; determining, in accordance with criteria, whether to add the second data chunk to the first compression unit; and responsive to determining to add the second data chunk to the first compression unit, adding the second data chunk to the first compression unit; and compressing the first compression unit as a single compressible unit, wherein the second data chunk is added to the first compression unit, wherein a first set of one or more data chunks includes at least the first data chunk, wherein the first compression unit includes the first set of one or more data chunks prior to adding the second data chunk, and wherein the criteria specifies to add the second data chunk to the first compression unit if the second data chunk has an associated entropy value less than the entropy threshold, and if the second data chunk and the first set of one or more data chunks are determined, in accordance with second criteria, to have similar entropy values.
-
-
21. A non-transitory computer readable medium comprising code stored thereon that, when executed, performs a method of data processing for a data set comprising:
-
performing first processing that forms a first compression unit, wherein the first compression unit includes a first plurality of data chunks including a first data chunk having a first entropy value less than an entropy threshold, the first processing including: receiving a second data chunk; determining, in accordance with criteria, whether to add the second data chunk to the first compression unit; and responsive to determining to add the second data chunk to the first compression unit, adding the second data chunk to the first compression unit; and compressing the first compression unit as a single compressible unit, wherein the second data chunk is added to the first compression unit, wherein a first set of one or more data chunks includes at least the first data chunk, wherein the first compression unit includes the first set of one or more data chunks prior to adding the second data chunk, and wherein the criteria specifies to add the second data chunk to the first compression unit if the second data chunk has an associated entropy value less than the entropy threshold, and if the second data chunk and the first set of one or more data chunks are determined, in accordance with second criteria, to have similar entropy values.
-
Specification