Strategies for sanitizing data items
Abstract
Strategies are described for sanitizing a data set, having the effect of obscuring restricted data in the data set to maintain its secrecy. The strategies operate by providing a production data set to a sanitizer. The sanitizer applies a data directory table to identify the location of restricted data items in the data set and to identify the respective sanitization tools to be applied to the restricted data items. The sanitizer then applies the identified sanitization tools to the identified restricted data items to produce a sanitized data set. A test environment receives the sanitized data set and performs testing, data mining, or some other application on the basis of the sanitized data set. Performing these applications on a sanitized version of the production data set is advantageous because it preserves the state of the production data set. The data directory table also provides a flexible mechanism for applying sanitization tools to the production data set.
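The lookup role that the data directory table plays can be sketched in Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `DirectoryEntry`, `DATA_DIRECTORY`, `TOOLS`, `mask_name`, and `mask_digits`, and the (store, table, column) location key, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical location key: (data store, table, column).
Location = Tuple[str, str, str]


@dataclass(frozen=True)
class DirectoryEntry:
    """One row of an illustrative data directory table."""
    location: Location
    restricted: bool
    tool: str = ""  # stored reference to a sanitizing tool; empty if not restricted


def mask_name(value: str) -> str:
    # Placeholder tool: replaces every character with "X".
    return "X" * len(value)


def mask_digits(value: str) -> str:
    # Placeholder tool: replaces digits with "0" and keeps separators.
    return "".join("0" if ch.isdigit() else ch for ch in value)


# Registry that resolves a stored tool reference to a callable.
TOOLS: Dict[str, Callable[[str], str]] = {
    "mask_name": mask_name,
    "mask_digits": mask_digits,
}

# The directory is kept separate from the data it describes.
DATA_DIRECTORY: List[DirectoryEntry] = [
    DirectoryEntry(("crm", "customers", "name"), True, "mask_name"),
    DirectoryEntry(("billing", "accounts", "phone"), True, "mask_digits"),
    DirectoryEntry(("crm", "customers", "region"), False),
]


def restricted_locations(
    directory: List[DirectoryEntry],
) -> Dict[Location, Callable[[str], str]]:
    """Map each restricted location to the tool the directory links it to."""
    return {e.location: TOOLS[e.tool] for e in directory if e.restricted}


plan = restricted_locations(DATA_DIRECTORY)
```

Under these assumptions, changing which tool handles a column means editing one directory row, which is one way to read the "flexible mechanism" the abstract mentions.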
59 Citations
17 Claims
1. A method for sanitizing restricted data items in a data set to prevent the revelation of the restricted data items, comprising:
transferring an original data set from a production environment to a sanitizer, the original data set being characterized by a state and including a plurality of data items stored in a plurality of different locations corresponding to separate data stores;
generating a data directory table, which is separate from the original data set, by:
identifying a plurality of different data items within the original data set;
classifying the data items as having a restricted or non-restricted status; and
mapping the restricted data items to their respective locations within the original data set;
sanitizing at least a portion of the original data set using the sanitizer, while preserving the state of the original data set, the sanitizing comprising:
identifying the locations of the restricted data items in the original data set, wherein identifying the locations of the restricted data items comprises using the data directory table to identify the locations of the restricted data items in the original data set;
identifying at least one sanitizing tool, from a plurality of sanitizing tools, to apply to the restricted data items which have been located in the original data set, wherein identifying the at least one sanitizing tool utilizes a stored reference in the data directory table which links the restricted data items to the at least one sanitizing tool, wherein each of the plurality of sanitizing tools:
modify the restricted data items by transforming the restricted data items so that at least one statistical feature of the restricted data items is preserved;
apply different randomizing algorithms transforming different types of restricted data items, wherein the different algorithms assign characters to text strings as a substitution to the restricted data items; and
produce sanitized data items that remain functional such that one or more applications in a testing environment can interact with the sanitized data items such that analysis and testing can be realized;
applying said at least one sanitizing tool to the restricted data items which have been located in the original data set to provide a sanitized data set, wherein applying said at least one sanitizing tool to the restricted data items comprises a bulk sanitization operation;
forwarding the sanitized data set to a target environment; and
in an event that the original data set has changed subsequent to the bulk sanitization operation, performing a delta sanitization operation to sanitize the restricted data items in the original data set which have changed subsequent to the bulk sanitization operation.
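The bulk and delta sanitization steps recited in claim 1 can be illustrated with a short Python sketch. The claim does not say how changes to the original data set are detected; comparing against a snapshot taken at bulk time is an assumption made here, and `DIRECTORY`, `bulk_sanitize`, and `delta_sanitize` are hypothetical names.

```python
import copy
from typing import Callable, Dict, Tuple

Record = Dict[str, str]
DataSet = Dict[Tuple[str, int], Record]  # (store name, record id) -> record

# Hypothetical directory: restricted field name -> sanitizing tool.
DIRECTORY: Dict[str, Callable[[str], str]] = {
    "name": lambda v: "X" * len(v),
}


def sanitize_record(record: Record) -> Record:
    """Return a sanitized copy of one record; the input is left untouched."""
    return {f: DIRECTORY[f](v) if f in DIRECTORY else v for f, v in record.items()}


def bulk_sanitize(original: DataSet) -> DataSet:
    """Sanitize every record of the original data set into a new data set."""
    return {key: sanitize_record(rec) for key, rec in original.items()}


def delta_sanitize(original: DataSet, snapshot: DataSet, sanitized: DataSet) -> DataSet:
    """Re-sanitize only records that differ from the snapshot taken at bulk time."""
    updated = dict(sanitized)
    for key, rec in original.items():
        if snapshot.get(key) != rec:
            updated[key] = sanitize_record(rec)
    for key in snapshot.keys() - original.keys():
        updated.pop(key, None)  # drop records deleted from the original
    return updated


original: DataSet = {
    ("crm", 1): {"name": "Alice", "region": "EU"},
    ("billing", 7): {"name": "Bob", "region": "US"},
}
snapshot = copy.deepcopy(original)  # assumed change-detection baseline
sanitized = bulk_sanitize(original)

original[("crm", 1)]["name"] = "Alicia"  # the original changes after the bulk run
sanitized = delta_sanitize(original, snapshot, sanitized)
```

Because both operations write into a separate data set, the original records are never modified, which mirrors the claim's requirement that the state of the original data set be preserved.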
11. A system for sanitizing restricted data items in a data set to prevent the revelation of the restricted data items, comprising:
one or more processors;
a memory coupled to the one or more processors, the memory having computer-executable instructions embodied thereon, the computer-executable instructions, when executed by the one or more processors, configuring the system to sanitize the restricted data items;
a production environment which relies on a production data set to perform its allotted functions;
a data directory table; and
a sanitizer configured to receive an original data set based on the production data set and to sanitize the original data set by:
using the data directory table to identify locations of the restricted data items in the original data set;
using the data directory table to identify at least one sanitizing tool, from a plurality of sanitizing tools, to apply to at least one of the restricted data items which have been located in the original data set, wherein the data directory table identifies the at least one sanitizing tool by storing a reference which links the at least one of the restricted data items to a corresponding sanitizing tool, wherein each of the plurality of sanitizing tools:
modify the restricted data items by transforming the restricted data items so that at least one statistical feature of the restricted data items is preserved;
apply different randomizing algorithms transforming different types of the restricted data items, wherein the different algorithms assign characters to text strings as a substitution to the restricted data items; and
produce sanitized data items that remain functional such that one or more applications in a testing environment can interact with the sanitized data items such that analysis and testing can be realized;
applying said at least one sanitizing tool to the at least one of the restricted data items which have been located in the original data set to provide a sanitized data set wherein the applying said at least one sanitizing tool to the at least one of the restricted data items comprises a bulk sanitization operation;
forwarding the sanitized data set to a target environment configured to receive the sanitized data set; and
in an event that the original data set has changed subsequent to the bulk sanitization operation, performing a delta sanitization operation to sanitize the restricted data items in the original data set which have changed subsequent to the bulk sanitization operation; and
wherein the sanitizing preserves a state of the original data set.
12. A sanitizer for sanitizing restricted data items in a data set to prevent the revelation of the restricted data items, comprising:
one or more processors;
a memory coupled to the one or more processors, the memory having computer-executable instructions embodied thereon, the computer-executable instructions, when executed by the one or more processors, configuring the sanitizer to sanitize the restricted data items;
a data directory table;
a sanitizing module configured to receive an original data set based on a production data set used in a production environment, and to sanitize the original data set by:
using the data directory table to identify locations of the restricted data items in the original data set;
using the data directory table to identify at least one sanitizing tool, from a plurality of sanitizing tools, to apply to at least one of the restricted data items which have been located in the original data set, wherein the data directory table identifies the at least one sanitizing tool by storing a reference which links the at least one of the restricted data items to a corresponding sanitizing tool, wherein each of the plurality of sanitizing tools:
modify the restricted data items by transforming the restricted data items so that at least one statistical feature of the restricted data items is preserved;
apply different randomizing algorithms transforming different types of the restricted data items, wherein the different algorithms assign characters to text strings as a substitution to the restricted data items; and
produce sanitized data items that remain functional such that one or more applications in a testing environment can interact with the sanitized data items such that analysis and testing can be realized; and
applying said at least one sanitizing tool to the at least one of the restricted data items which have been located in the original data set to provide a sanitized data set, wherein the applying said at least one sanitizing tool to the at least one of the restricted data items comprises a bulk sanitization operation;
forwarding the sanitized data set to a target environment configured to receive the sanitized data set; and
in an event that the original data set has changed subsequent to the bulk sanitization operation, performing a delta sanitization operation to sanitize the restricted data items in the original data set which have changed subsequent to the bulk sanitization operation; and
wherein the sanitizing module is configured to sanitize the original data set while preserving a state of the original data set.
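The claims describe sanitizing tools that randomize restricted data items while keeping some statistical feature and leaving the data usable by test applications. A hedged Python sketch of two such tools follows. Which features are preserved is an assumption chosen for illustration: `randomize_text` keeps string length and character class, and `shuffle_column` keeps a numeric column's mean and range. Both function names are hypothetical, and the second tool does not assign characters to text strings; it is included only to show a different algorithm for a different data type.

```python
import random
import string
from typing import List


def randomize_text(value: str, rng: random.Random) -> str:
    """Substitute random characters, keeping length and character class."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isupper():
            out.append(rng.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # keep separators so format checks still pass
    return "".join(out)


def shuffle_column(values: List[float], rng: random.Random) -> List[float]:
    """Reassign values across records; the column's mean and range are unchanged."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
masked_phone = randomize_text("555-0142", rng)
masked_salaries = shuffle_column([41000.0, 52000.0, 67000.0], rng)
```

Python's `random` module is not designed for security purposes, so this sketch shows only the shape of the transformation, not a vetted way to protect restricted data.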
Specification