De-identify SQL: Transforming Production Data

Perhaps you've noticed similar comments where you work:

Sales:

"This sales demo needs updated data to demonstrate a new feature."

QA:

"Regression testing wasn't possible because the database is stale."

Development:

"Our staging environment doesn't contain the necessary data."

With data-driven applications, our production environment often contains the answer to Sales, QA & Development needs but crossing our fingers and hoping it all works can be problematic.

I need a tool that transforms a production database, modifies identifiable fields and creates de-identified data to share with prospective customers, create a development database or populate QA environments.

De-identifying SQL: Creating Data

Replace data with a fabricated value.

Redact data with a constant value.

Generate new data with JavaScript.

De-identify SQL accepts a passed-in file or piped input and creates an output .sql or pipes data to another process. So I can wire this into an automated process or manually create data as needed.

Example Ahead: Creating Fake Users

To begin I need a JSON file for each table I want to modify. Each strategy file is named after the table it's applied to. Here's example input SQL for a user table:

CREATE TABLE `user` (
  `id` int NOT NULL DEFAULT '0',
  `email` varchar(256),
  `full_name` varchar(100),
  `birth_date` date,
  `gov_id` varchar(20),
PRIMARY KEY (`id`)
);

INSERT INTO `user` VALUES (1,'Florida.Gibson@hotmail.com','Florida Gibson','2000-12-10','111-22-3333'),(2,'godfreymo44@hotmail.com','Godfrey Mosciski','1980-01-01','333-22-1111'),(3,'Tommie1Quitzon@gmail.com','Tommie Quitzon','1960-06-07','444-55-6666');

Any field I list in my strategy JSON file will be modified, all other fields will pass-through. To modify the above SQL I can use the following file:

{
  "columns": [
    {
      "columnKey": "full_name",
      "redactWith": "NAME REMOVED",
      "tracked": false
    },
    {
      "columnKey": "email",
      "redactWith": "internet.email",
      "tracked": false
    },
    {
      "columnKey": "birth_date",
      "redactWith": "createBirthday",
      "tracked": false
    },
    {
      "columnKey": "gov_id",
      "redactWith": "{{datatype.number({\"min\":100,\"max\":999})}}-{{datatype.number({\"min\":10,\"max\":99})}}-{{datatype.number({\"min\":1000,\"max\":9999})}}",
      "tracked": false
    }
  ]
}

Running De-identify SQL with the above input and JSON strategy:

cat user.sql | de-identify-sql > user-de-identified.sql

I'll receive back (output generated randomly, your results may vary):

INSERT INTO `user` VALUES 
(1,'Titus1@yahoo.com','NAME REMOVED','1976-05-17','420-15-1747'),
(2,'Jeremy.Bechtelar@hotmail.com','NAME REMOVED','1961-10-17','327-49-5054'),
(3,'Emiliano.Grimes@gmail.com','NAME REMOVED','2002-11-10','106-57-2546');

Under the Hood

I have a lot of flexibility when modifying SQL statements:

In the above example, full_name is redacted with a constant: NAME_REMOVED.

For the email column, we're calling a faker function to generate a new email address.

To generate a gov_id, we use a mustache-like template of faker functions.

Lastly, to create birth_date, we're calling custom JavaScript. This is coming from user.js:

const faker = require('faker')

module.exports = {
  createBirthday: () => {
    return faker.date.past(80).toISOString().slice(0, 10);
  }
}

Preserving Data Relationships

I may also need to work with denormalized or repeating data which needs to be consistently processed to retained the relationships between that data while removing identifying details. Here I can change the tracked parameter in the JSON strategy file from false to true this will replace an repeated fields with the same fabricated values. For example, this order SQL:

INSERT INTO `order` VALUES 
(1000,'Dorthy.OKeefe@hotmail.com',12.99),
(1001,'Belle10@hotmail.com',123.45),
(1002,'Dorthy.OKeefe@hotmail.com',70.00),
(1003,'Garfield.Renner89@hotmail.com',542.01),
(1004,'Garfield.Renner89@hotmail.com',67.52);

With this JSON strategy:

{
  "columns": [
    {
      "columnKey": "order_email",
      "redactWith": "internet.email",
      "tracked": true
    }
  ]
}

Could become:

INSERT INTO `order` VALUES 
(1000,'Demario26@yahoo.com',12.99),
(1001,'Audra_Stroman@yahoo.com',123.45),
(1002,'Demario26@yahoo.com',70),
(1003,'Rogelio_OHara1@yahoo.com',542.01),
(1004,'Rogelio_OHara1@yahoo.com',67.52);

This allows us to remove identifying fields while preserving relationships between data elements that the application may require. This also makes our fabricated data appear more realistic: in certain tables we expect to see fields repeat.

Disclaimers

De-identify-SQL is currently compatible with MySQL and MariaDB. Feel free to request a different database.

mysqldump should be used to create input SQL.

I'm a contributor for De-identify SQL.