Access method data compression with system-built generic dictionaries
Abstract
A computer system constructs a compression dictionary for compressing a character string by interrogating an initial substring portion to determine input string characteristics that are used to select one or more dictionary segments from a library of predetermined dictionary segments individually adapted for compressing strings with particular characteristics. The initial substring portion is dynamically determined during the interrogation. A first set of dictionary segments that meet predetermined automatic selection criteria are selected and a second set of candidate dictionary segments that meet second-level selection criteria are identified for a sampling phase. During the sampling phase, the candidate dictionary segments are alternately used to compress the initial substring portion and determine compression performance statistics. The performance of the dictionary segments in the sampling phase determines which candidate dictionary segments will be added to the first selected dictionary segments, within dictionary total size limits. The first selected dictionary segments and the identified segments constitute a system-built compression dictionary that is used to compress the remainder of the input string. In this way, predetermined compression dictionaries are selected for maximum efficiency in accordance with the data actually being compressed and compression can be carried out quickly and efficiently as input data is received.
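
The selection flow described in the abstract can be sketched roughly as follows. This is an illustrative Python sketch, not the patented implementation: the `Segment` type, the `sample_savings` cost model (each phrase occurrence is assumed to shrink to a single byte), and the size accounting are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    ident: str                 # hypothetical segment identifier
    phrases: Tuple[bytes, ...]  # strings the segment can encode as one symbol
    size: int                  # dictionary entries the segment occupies


def sample_savings(segment: Segment, sample: bytes) -> int:
    """Toy sampling statistic: bytes saved if every phrase occurrence
    in the sample were replaced by a one-byte symbol (an assumption)."""
    return sum(sample.count(p) * (len(p) - 1) for p in segment.phrases)


def build_dictionary(auto_selected: List[Segment],
                     candidates: List[Segment],
                     sample: bytes,
                     size_limit: int) -> List[Segment]:
    """Start from the automatically selected segments, then add the
    best-performing candidates while the total size limit allows."""
    chosen = list(auto_selected)
    used = sum(s.size for s in chosen)
    ranked = sorted(candidates,
                    key=lambda s: sample_savings(s, sample),
                    reverse=True)
    for seg in ranked:
        if sample_savings(seg, sample) > 0 and used + seg.size <= size_limit:
            chosen.append(seg)
            used += seg.size
    return chosen


auto = [Segment("A", (b"the ",), 2)]
cands = [Segment("B", (b"ing ",), 2),
         Segment("C", (b"zzz",), 2),
         Segment("D", (b"tion",), 2)]
sample = b"compressing the string during the interrogation "
print([s.ident for s in build_dictionary(auto, cands, sample, size_limit=4)])
```

Here the candidate that performs best on the sample is added, and the remaining candidates are dropped because the assumed size limit is already reached.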
89 Claims
1. A computer-implemented method of forming system-built dictionaries used in the compression of uninterrupted character byte strings in the management of catalog-based data sets comprising records located in external storage, said catalogs containing information identifying the format of each data set and its location in the external storage, each data set assuming a closed status until subject to access commands, and thereupon assuming an open status, the method comprising the steps of:
(a) providing a library of dictionary segments;
(b) scanning the records during the initial recordation of records to the external storage and collecting characteristic attributes and statistics from the scanning;
(c) selecting dictionary segments from the library of dictionary segments in accordance with the attributes and statistics; and
(d) combining the dictionary segments to form a system-built dictionary.
Dependent claims: 9

2. A computer-implemented method of forming system-built dictionaries used in the compression of uninterrupted character byte strings in the management of catalog-based data sets comprising records located in external storage, said catalogs containing information identifying the format of each data set and its location in the external storage, each data set assuming a closed status until subject to access commands, and thereupon assuming an open status, the method comprising the steps of:
(a) providing a library of dictionary segments;
(b) scanning the records during the initial recordation of records to the external storage and collecting characteristic attributes and statistics from the scanning;
(c) selecting dictionary segments from the library of dictionary segments in accordance with the attributes and statistics;
(d) combining the dictionary segments to form a system-built dictionary;
(e) selecting a dictionary token comprising one or more predetermined symbols that represent the system-built dictionary such that the token identifies the selected dictionary segments; and
(f) storing the dictionary token in the system catalog and discarding the system-built dictionary such that, after the data set is closed, the data set can be opened again, the token can be retrieved from the system catalog, and the corresponding selected dictionary segments it represents can be retrieved from the library of dictionary segments.
Dependent claims: 3
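
The token round trip in steps (e) and (f) can be illustrated with a small Python sketch. Everything named here is an assumption made for illustration: the `LIBRARY` contents, the one-character segment identifiers, the `dictionary_token` catalog field, and the data set name.

```python
from typing import Dict, List

# Hypothetical library: one display character identifies each segment.
LIBRARY: Dict[str, List[bytes]] = {
    "A": [b"the ", b"and "],
    "N": [b"00", b"19"],
    "P": [b". ", b", "],
}


def close_data_set(catalog: dict, name: str, selected_ids: List[str]) -> str:
    """Store only the token in the catalog entry; the built dictionary
    itself is discarded and never written to the catalog."""
    token = "".join(selected_ids)
    catalog.setdefault(name, {})["dictionary_token"] = token
    return token


def open_data_set(catalog: dict, name: str) -> List[bytes]:
    """Retrieve the token and rebuild the dictionary from the library."""
    token = catalog[name]["dictionary_token"]
    dictionary: List[bytes] = []
    for ident in token:  # assumes single-character identifiers
        dictionary.extend(LIBRARY[ident])
    return dictionary


catalog: dict = {}
close_data_set(catalog, "PAYROLL.DATA", ["A", "P"])
print(catalog)
print(open_data_set(catalog, "PAYROLL.DATA"))
```

Because the catalog holds only a few identifier characters, the cost of remembering which dictionary was used stays small regardless of how large the segments are.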

4. A computer-implemented method of forming system-built dictionaries used in the compression of uninterrupted character byte strings in the management of catalog-based data sets comprising records located in external storage, said catalogs containing information identifying the format of each data set and its location in the external storage, each data set assuming a closed status until subject to access commands, and thereupon assuming an open status, the method comprising the steps of:
(a) providing a library of dictionary segments;
(b) scanning the records during the initial recordation of records to the external storage and collecting characteristic attributes and statistics from the scanning;
(c) selecting dictionary segments from the library of dictionary segments in accordance with the attributes and statistics; and
(d) combining the dictionary segments to form a system-built dictionary;
wherein the data characteristic statistics include the occurrence frequency of characters found in the scanned records or a raw count of occurrences for each character in the scanned records; and
the step of scanning records and collecting statistics comprises the steps of:
selecting a first record block for scanning and generating a first set of the data characteristic statistics;
selecting a second record block, different from the first record block, for scanning and generating a second set of the data characteristic statistics;
comparing like statistics of the first set and second set of data characteristic statistics and storing the second block in an interrogation buffer if the difference between the like statistics is below a predetermined stabilization threshold and otherwise selecting a new first record block and a new second record block for scanning and generating new respective sets of statistics; and
repeating the steps of selecting and comparing until the difference between the like statistics is below the predetermined stabilization threshold or until the size of the first record block equals a predetermined scanning limit value.
Dependent claims: 5, 6, 7, 8
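
The block-comparison loop in claim 4 can be sketched in Python as below. The claim does not say how blocks are chosen or how the difference is measured, so several choices here are assumptions: the second block is the data immediately following the first, the block size doubles on each retry, the difference is the sum of absolute differences in relative byte frequency, and the function returns `None` when the limit is reached without stabilizing.

```python
from collections import Counter
from typing import Dict, Optional


def char_frequencies(block: bytes) -> Dict[int, float]:
    """Relative occurrence frequency of each byte value in a non-empty block."""
    total = len(block)
    return {byte: count / total for byte, count in Counter(block).items()}


def frequency_difference(f1: Dict[int, float], f2: Dict[int, float]) -> float:
    """Sum of absolute differences between like statistics (assumed metric)."""
    return sum(abs(f1.get(b, 0.0) - f2.get(b, 0.0)) for b in set(f1) | set(f2))


def find_stable_block(data: bytes, start_size: int = 4,
                      scan_limit: int = 64,
                      threshold: float = 0.25) -> Optional[bytes]:
    """Return the second block once its statistics match the first block's,
    or None if the scanning limit (or the data) runs out first."""
    size = start_size
    while True:
        first = data[:size]
        second = data[size:2 * size]
        if not first or not second:
            return None  # ran out of data before the statistics stabilized
        diff = frequency_difference(char_frequencies(first),
                                    char_frequencies(second))
        if diff < threshold:
            return second  # contents of the interrogation buffer
        if size >= scan_limit:
            return None  # scanning limit reached without stabilizing
        size *= 2  # assumed growth rule: retry with larger blocks


print(find_stable_block(b"aaaabbbb" + b"abababab"))
```

In the usage line, the first comparison ("aaaa" against "bbbb") fails, the blocks grow to eight bytes, and the second comparison succeeds because both blocks are half "a" and half "b".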

10. A computer-implemented method of forming system-built dictionaries used in the compression of uninterrupted character byte strings in the management of catalog-based data sets comprising records located in external storage, said catalogs containing information identifying the format of each data set and its location in the external storage, each data set assuming a closed status until subject to access commands, and thereupon assuming an open status, the method comprising the steps of:
(a) providing a library of dictionary segments;
(b) scanning the records during the initial recordation of records to the external storage and collecting characteristic attributes and statistics from the scanning;
(c) selecting dictionary segments from the library of dictionary segments in accordance with the attributes and statistics; and
(d) combining the dictionary segments to form a system-built dictionary;
wherein the step of providing a library of dictionary segments further comprises providing dictionary segments that are adapted for text compression of characters including alphabetic letters, numerals, and spacing and punctuation characters; and
wherein the step of scanning records includes keeping statistics indicating the occurrences of characters in the scanned record block and the step of selecting dictionary segments for the system-built dictionary comprises selecting a dictionary segment adapted for compression of one or more characters in accordance with the statistics for those characters in the scanned record block.
Dependent claims: 11, 12, 13, 14, 15, 16, 17
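
One way to picture the character-driven selection in claim 10 is the Python sketch below. The three character classes follow the claim (letters, numerals, spacing and punctuation), but the segment names, the ASCII-only classification, and the `min_share` selection rule are hypothetical.

```python
import string
from collections import Counter
from typing import Dict, List

# Hypothetical mapping from character class to a library segment name.
SEGMENT_FOR_CLASS = {
    "alpha": "ALPHA_SEG",
    "digit": "NUM_SEG",
    "punct": "PUNCT_SEG",
}

_ALPHA = set(string.ascii_letters.encode())
_DIGIT = set(string.digits.encode())
_PUNCT = set((string.whitespace + string.punctuation).encode())


def classify(byte: int) -> str:
    """Assign a byte value to one of the character classes (ASCII only)."""
    if byte in _ALPHA:
        return "alpha"
    if byte in _DIGIT:
        return "digit"
    if byte in _PUNCT:
        return "punct"
    return "other"


def class_shares(block: bytes) -> Dict[str, float]:
    """Fraction of the scanned block that falls in each character class."""
    counts = Counter(classify(b) for b in block)
    total = len(block) or 1  # avoid dividing by zero on an empty block
    return {c: counts.get(c, 0) / total for c in SEGMENT_FOR_CLASS}


def select_segments(block: bytes, min_share: float = 0.2) -> List[str]:
    """Select each segment whose character class is common enough
    in the scanned block (the threshold rule is an assumption)."""
    shares = class_shares(block)
    return [seg for c, seg in SEGMENT_FOR_CLASS.items() if shares[c] >= min_share]


print(select_segments(b"Invoice 2024-0001: 1500.00"))
```

For the sample record, letters and numerals each exceed the assumed 20 percent share, while spacing and punctuation fall just short, so only the first two segments are selected.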

18. A method of forming compression dictionaries in a computer system that includes data sets with corresponding data catalogs that contain information identifying the format of each data set and its location, the data sets containing records of uninterrupted byte strings that will be compressed by the computer system using the compression dictionaries, the method comprising the steps of:
(a) providing a library of dictionary segments that are adapted for use by the computer system to compress the data sets;
(b) scanning one or more records of a data set during an initial recordation of the data set records to one or more external storage devices and generating predetermined data characteristic statistics as a result of the scanning;
(c) selecting dictionary segments from the library of dictionary segments according to the generated data characteristic statistics; and
(d) combining the selected dictionary segments to form a system-built dictionary that is used by the computer system to compress the remaining data records of the data set as the data set is recorded to the one or more external storage devices.

19. A method of forming compression dictionaries in a computer system that includes data sets with corresponding data catalogs that contain information identifying the format of each data set and its location, the data sets containing records of uninterrupted byte strings that will be compressed by the computer system using the compression dictionaries, the method comprising the steps of:
(a) providing a library of dictionary segments that are adapted for use by the computer system to compress the data sets;
(b) scanning one or more records of a data set during an initial recordation of the data set records to one or more external storage devices and generating predetermined data characteristic statistics as a result of the scanning;
(c) selecting dictionary segments from the library of dictionary segments according to the generated data characteristic statistics; and
(d) combining the selected dictionary segments to form a system-built dictionary that is used by the computer system to compress the remaining data records of the data set as the data set is recorded to the one or more external storage devices,
wherein the step of combining further includes representing the selected dictionary segments of the system-built dictionary by tokens and storing the dictionary tokens in the system data catalog entry for the data set from which the system-built dictionary was formed.
Dependent claims: 20

21. A method of forming compression dictionaries in a computer system that includes data sets with corresponding data catalogs that contain information identifying the format of each data set and its location, the data sets containing records of uninterrupted byte strings that will be compressed by the computer system using the compression dictionaries, the method comprising the steps of:
(a) providing a library of dictionary segments that are adapted for use by the computer system to compress the data sets;
(b) scanning one or more records of a data set during an initial recordation of the data set records to one or more external storage devices and generating predetermined data characteristic statistics as a result of the scanning;
(c) selecting dictionary segments from the library of dictionary segments according to the generated data characteristic statistics; and
(d) combining the selected dictionary segments to form a system-built dictionary that is used by the computer system to compress the remaining data records of the data set as the data set is recorded to the one or more external storage devices,
wherein the generated data characteristic statistics include a frequency distribution of characters found in the scanned record, a count of occurrences for each character in the scanned record, and an entropy value for the scanned record; and
the step of scanning records comprises:
completing a stabilization procedure comprising repeating the steps of selecting a current first block of the record for scanning and generating a first set of the data characteristic statistics, selecting a current second block of the record, different from the first block, for scanning and generating a second set of data characteristic statistics, comparing like statistics of the first set of data characteristic statistics and the second set of data characteristic statistics, and selecting a new first block and a new second block for scanning and generating new respective sets of data characteristic statistics until the step of comparing indicates that the statistical difference between the first and second sets of statistics is below a predetermined stabilization threshold value, or until the size of the second block equals a predetermined scanning limit value, whereupon the combination of the first block and second block is stored in an interrogation buffer if the entropy value of the second set of statistics is below a predetermined entropy threshold value.
Dependent claims: 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33
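
The entropy gate at the end of claim 21 can be illustrated as follows. The claim only says "an entropy value", so the use of Shannon entropy in bits per byte is an assumption, as are the function names and the threshold of 7.0.

```python
import math
from collections import Counter
from typing import Optional


def entropy_bits_per_byte(block: bytes) -> float:
    """Shannon entropy of the byte distribution, from 0.0 to 8.0 bits per byte."""
    if not block:
        return 0.0
    total = len(block)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(block).values())


def buffer_if_compressible(first: bytes, second: bytes,
                           entropy_threshold: float = 7.0) -> Optional[bytes]:
    """Store the combined blocks in the interrogation buffer only when the
    second block's entropy is below the threshold (assumed value)."""
    if entropy_bits_per_byte(second) < entropy_threshold:
        return first + second
    return None


print(entropy_bits_per_byte(b"abab"))
print(buffer_if_compressible(b"abab", b"abab"))
print(buffer_if_compressible(b"abab", bytes(range(256))))
```

A block in which every byte value is equally likely sits at the 8-bit maximum and offers a dictionary little to exploit, which is presumably why a high entropy value keeps the blocks out of the interrogation buffer.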

34. A method of forming compression dictionaries in a computer system that includes data sets with corresponding data catalogs that contain information identifying the format of each data set and its location, the data sets containing records of uninterrupted byte strings that will be compressed by the computer system using the compression dictionaries, the method comprising the steps of:
(a) providing a library of dictionary segments that are adapted for use by the computer system to compress the data sets;
(b) scanning one or more records of a data set during an initial recordation of the data set records to one or more external storage devices and generating predetermined data characteristic statistics as a result of the scanning;
(c) selecting dictionary segments from the library of dictionary segments according to the generated data characteristic statistics; and
(d) combining the selected dictionary segments to form a system-built dictionary that is used by the computer system to compress the remaining data records of the data set as the data set is recorded to the one or more external storage devices,
wherein the step of combining the dictionary segments comprises the steps of:
selecting dictionary identifiers that comprise display code symbols representing each of the selected dictionary segments; and
concatenating the dictionary identifiers that represent the selected dictionary segments to form the system-built dictionary token.
Dependent claims: 35, 36, 37, 38

39. A method of storing and retrieving data sets in a computer system, the computer system maintaining data catalogs containing information identifying the format of each data set and its storage location in the computer system, the data sets comprising records of uninterrupted byte strings and assuming either a closed status in which the data records cannot be accessed or an open status in which the data records are subject to access commands for retrieval of records, the method comprising the steps of:
(a) providing a library of dictionary segments that are adapted for use by the computer system to create a Ziv-Lempel tree structure for a data compression process;
(b) scanning the records of a data set during an initial recordation of the records to their stored location and generating predetermined data characteristic statistics representative of the scanned records;
(c) selecting dictionary segments from the library of dictionary segments according to the generated data characteristic statistics and compressing the scanned record with the selected dictionary segments to generate compression statistics relating to the scanned record and the data compression outcome;
(d) combining the selected dictionary segments to form a system-built dictionary; and
(e) compressing the data set with the data compression process using the system-built dictionary and storing the compressed data set in their stored location.
Dependent claims: 40
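
To make the idea of building a Ziv-Lempel tree from dictionary segments concrete, here is a simplified Python stand-in. It is not the compression facility the claims refer to: it assumes a static dictionary whose first 256 entries are the single bytes, treats each segment as a plain list of phrases, and encodes by greedy longest match through a trie. All names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple


class Node:
    """One node of the dictionary tree; `code` is set when a phrase ends here."""
    __slots__ = ("children", "code")

    def __init__(self) -> None:
        self.children: Dict[int, "Node"] = {}
        self.code: Optional[int] = None


def build_tree(segments: Sequence[Sequence[bytes]]) -> Tuple[Node, List[bytes]]:
    """Combine segments into one tree. Codes 0-255 are the single bytes,
    so any input can be encoded; segment phrases get the codes after that."""
    root, entries = Node(), []

    def add(phrase: bytes) -> None:
        node = root
        for b in phrase:
            node = node.children.setdefault(b, Node())
        if phrase and node.code is None:
            node.code = len(entries)
            entries.append(phrase)

    for b in range(256):
        add(bytes([b]))
    for segment in segments:
        for phrase in segment:
            add(phrase)
    return root, entries


def compress(data: bytes, root: Node) -> List[int]:
    """Greedy longest-match encoding against the static tree."""
    codes, i = [], 0
    while i < len(data):
        node, best_code, best_len, j = root, None, 0, i
        while j < len(data) and data[j] in node.children:
            node = node.children[data[j]]
            j += 1
            if node.code is not None:
                best_code, best_len = node.code, j - i
        codes.append(best_code)
        i += best_len  # always >= 1 because every single byte has a code
    return codes


def expand(codes: List[int], entries: List[bytes]) -> bytes:
    """Reverse the compression by looking each code up in the dictionary."""
    return b"".join(entries[c] for c in codes)


root, entries = build_tree([[b"the ", b"and "], [b"ing"]])
codes = compress(b"the king and the ring", root)
print(len(codes), expand(codes, entries))
```

Because the dictionary is fixed before compression starts, the same tree rebuilt from the same segments is enough to expand the data later, which is what makes storing only segment identifiers workable.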

41. A method of storing and retrieving data sets in a computer system, the computer system maintaining data catalogs containing information identifying the format of each data set and its storage location in the computer system, the data sets comprising records of uninterrupted byte strings and assuming either a closed status in which the data records cannot be accessed or an open status in which the data records are subject to access commands for retrieval of records, the method comprising the steps of:
(a) providing a library of dictionary segments that are adapted for use by the computer system to create a Ziv-Lempel tree structure for a data compression process;
(b) scanning the records of a data set during an initial recordation of the records to their stored location and generating predetermined data characteristic statistics representative of the scanned records;
(c) selecting dictionary segments from the library of dictionary segments according to the generated data characteristic statistics and compressing the scanned record with the selected dictionary segments to generate compression statistics relating to the scanned record and the data compression outcome;
(d) combining the selected dictionary segments to form a system-built dictionary; and
(e) compressing the data set with the data compression process using the system-built dictionary and storing the compressed data set in their stored location,
wherein the step of providing a library of dictionary segments includes providing text-based dictionary segments adapted for compressing predetermined alphanumeric character sequences.
Dependent claims: 42

43. A combination for use in a computer system that receives user requests to compress uninterrupted byte strings of data sets recorded into storage locations of the computer system and that maintains data catalogs containing information identifying the format of each data set and its storage location, the data sets comprising records of data and assuming either a closed status in which the data set cannot be accessed or an open status in which the data set is subject to access commands, the combination comprising:
(a) a library of dictionary segments that are adapted for use by the computer system to create a Ziv-Lempel tree for a data compression process;
(b) an interrogation processor that initiates scanning of a plurality of the data set records during an initial recordation of the records to the storage locations and generates predetermined data characteristic statistics relating to the scanned records for selecting dictionary segments from the library according to the generated data characteristic statistics and combining the selected dictionary segments to form a system-built dictionary.
Dependent claims: 44, 45

46. A combination for use in a computer system that receives user requests to compress uninterrupted byte strings of data sets recorded into storage locations of the computer system and that maintains data catalogs containing information identifying the format of each data set and its storage location, the data sets comprising records of data and assuming either a closed status in which the data set cannot be accessed or an open status in which the data set is subject to access commands, the combination comprising:
(a) a library of dictionary segments that are adapted for use by the computer system to create a Ziv-Lempel tree for a data compression process;
(b) an interrogation processor that initiates scanning of a plurality of the data set records during an initial recordation of the records to the storage locations and generates predetermined data characteristic statistics relating to the scanned records for selecting dictionary segments from the library according to the generated data characteristic statistics and combining the selected dictionary segments to form a system-built dictionary;
wherein the interrogation processor optionally designates one or more dictionary segments for a sampling process; and
the combination further includes:
(c) a sampling processor that retrieves a designated sample data record and compresses the sample record with the dictionary segments designated by the interrogation processor and generates compression statistics relating to the sample data record and the results of the compression, and wherein the interrogation processor and the sampling processor designate the selected dictionary segments with identifiers representing the dictionary segments and the combination further comprises:
(d) a dictionary build processor that receives the dictionary identifiers representing the dictionary segments of the system-built dictionary and retrieves the corresponding dictionary segments representing the system-built dictionary from the library of dictionary segments for use by the computer system in creating a Ziv-Lempel tree for reversing the data compression process;
(e) an access method processor that, upon receiving the dictionary identifiers from the compression management processor, stores the dictionary identifiers in the system data catalog entry associated with a data set such that the data set can be closed and retrieves the identifiers upon the data set being opened; and
(f) a process router that controls the operation of the combination to initiate operation of either the interrogation processor, the sampling processor, the access method processor, or the dictionary build processor.
Dependent claims: 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69

70. A computer system comprising:
a central processing unit including a data compression facility and a file access method interface;
one or more users that communicate with the central processing unit;
one or more applications that are controlled by the users and generate uninterrupted byte strings of datasets;
external data storage devices that receive data sets from the file access method interface;
memory containing data catalogs having information identifying the format of each data set and its location in external storage, the data sets comprising records of byte strings and assuming either a closed status in which the data set cannot be accessed or an open status in which the data set is subject to access commands;
a library of dictionary segments that are adapted for use by the central processing unit to create a Ziv-Lempel tree for the data compression facility;
an interrogation processor that initiates scanning the records during an initial recordation of a data set to the external storage and generating predetermined interrogation data characteristic statistics for selecting dictionary segments according to the generated interrogation statistics and either combining the selected dictionary segments to form a system-built dictionary or designating one or more dictionary segments for a sampling process; and
a sampling processor that retrieves designated sample byte strings from the records and scans the sample byte strings with the dictionary segments designated by the interrogation processor to generate predetermined sampling statistics.

71. A computer system comprising:
a central processing unit including a data compression facility and a file access method interface;
one or more users that communicate with the central processing unit;
one or more applications that are controlled by the users and generate uninterrupted byte strings of datasets;
external data storage devices that receive data sets from the file access method interface;
memory containing data catalogs having information identifying the format of each data set and its location in external storage, the data sets comprising records of byte strings and assuming either a closed status in which the data set cannot be accessed or an open status in which the data set is subject to access commands;
a library of dictionary segments that are adapted for use by the central processing unit to create a Ziv-Lempel tree for the data compression facility;
an interrogation processor that initiates scanning the records during an initial recordation of a data set to the external storage and generating predetermined interrogation data characteristic statistics for selecting dictionary segments according to the generated interrogation statistics and either combining the selected dictionary segments to form a system-built dictionary or designating one or more dictionary segments for a sampling process; and
a sampling processor that retrieves designated sample byte strings from the records and scans the sample byte strings with the dictionary segments designated by the interrogation processor to generate predetermined sampling statistics,
wherein the interrogation processor and sampling processor designate the selected dictionary segments with a dictionary identifier representing each dictionary segment and the combination further comprises:
a dictionary build processor that receives the dictionary identifiers representing the dictionary segments of the system-built dictionary and retrieves the corresponding dictionary segments from the library upon the data set being opened to create the system-built dictionary for use during future access commands;
an access method processor that stores the dictionary identifiers in the system data catalog upon the data set being closed and retrieves the identifiers upon the data set being opened; and
a process router that controls the operation of the combination to initiate operation of either the interrogation processor, the sampling processor, the access method processor, or the dictionary build processor.
Dependent claims: 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89
Specification