22
De-identify SQL: Transforming Production Data
Perhaps you've noticed similar comments where you work:
Sales:
"This sales demo needs updated data to demonstrate a new feature."
QA:
"Regression testing wasn't possible because the database is stale."
Development:
"Our staging environment doesn't contain the necessary data."
With data-driven applications, our production environment often contains the answer to Sales, QA & Development needs but crossing our fingers and hoping it all works can be problematic.
I need a tool that transforms a production database, modifies identifiable fields and creates de-identified data to share with prospective customers, create a development database or populate QA environments.
- Replace data with a fabricated value.
- Redact data with a constant value.
- Generate new data with JavaScript.
De-identify SQL accepts a passed-in file or piped input and creates an output .sql or pipes data to another process. So I can wire this into an automated process or manually create data as needed.
To begin I need a JSON file for each table I want to modify. Each strategy file is named after the table it's applied to. Here's example input SQL for a user table:
CREATE TABLE `user` (
`id` int NOT NULL DEFAULT '0',
`email` varchar(256),
`full_name` varchar(100),
`birth_date` date,
`gov_id` varchar(20),
PRIMARY KEY (`id`)
);
INSERT INTO `user` VALUES (1,'[email protected]','Florida Gibson','2000-12-10','111-22-3333'),(2,'[email protected]','Godfrey Mosciski','1980-01-01','333-22-1111'),(3,'[email protected]','Tommie Quitzon','1960-06-07','444-55-6666');
Any field I list in my strategy JSON file will be modified, all other fields will pass-through. To modify the above SQL I can use the following file:
{
"columns": [
{
"columnKey": "full_name",
"redactWith": "NAME REMOVED",
"tracked": false
},
{
"columnKey": "email",
"redactWith": "internet.email",
"tracked": false
},
{
"columnKey": "birth_date",
"redactWith": "createBirthday",
"tracked": false
},
{
"columnKey": "gov_id",
"redactWith": "{{datatype.number({\"min\":100,\"max\":999})}}-{{datatype.number({\"min\":10,\"max\":99})}}-{{datatype.number({\"min\":1000,\"max\":9999})}}",
"tracked": false
}
]
}
Running De-identify SQL with the above input and JSON strategy:
cat user.sql | de-identify-sql > user-de-identified.sql
I'll receive back (output generated randomly, your results may vary):
INSERT INTO `user` VALUES
(1,'[email protected]','NAME REMOVED','1976-05-17','420-15-1747'),
(2,'[email protected]','NAME REMOVED','1961-10-17','327-49-5054'),
(3,'[email protected]','NAME REMOVED','2002-11-10','106-57-2546');
I have a lot of flexibility when modifying SQL statements:
- In the above example,
full_name
is redacted with a constant:NAME_REMOVED
. - For the
email
column, we're calling a faker function to generate a new email address. - To generate a
gov_id
, we use a mustache-like template of faker functions. - Lastly, to create birth_date, we're calling custom JavaScript. This is coming from
user.js
:
const faker = require('faker')
module.exports = {
createBirthday: () => {
return faker.date.past(80).toISOString().slice(0, 10);
}
}
I may also need to work with denormalized or repeating data which needs to be consistently processed to retained the relationships between that data while removing identifying details. Here I can change the tracked
parameter in the JSON strategy file from false
to true
this will replace an repeated fields with the same fabricated values. For example, this order SQL:
INSERT INTO `order` VALUES
(1000,'[email protected]',12.99),
(1001,'[email protected]',123.45),
(1002,'[email protected]',70.00),
(1003,'[email protected]',542.01),
(1004,'[email protected]',67.52);
With this JSON strategy:
{
"columns": [
{
"columnKey": "order_email",
"redactWith": "internet.email",
"tracked": true
}
]
}
Could become:
INSERT INTO `order` VALUES
(1000,'[email protected]',12.99),
(1001,'[email protected]',123.45),
(1002,'[email protected]',70),
(1003,'[email protected]',542.01),
(1004,'[email protected]',67.52);
This allows us to remove identifying fields while preserving relationships between data elements that the application may require. This also makes our fabricated data appear more realistic: in certain tables we expect to see fields repeat.
- De-identify-SQL is currently compatible with MySQL and MariaDB. Feel free to request a different database.
-
mysqldump
should be used to create input SQL. - I'm a contributor for De-identify SQL.
22