De-identify SQL: Transforming Production Data

Perhaps you've noticed similar comments where you work:

Sales:

"This sales demo needs updated data to demonstrate a new feature."

QA:

"Regression testing wasn't possible because the database is stale."

Development:

"Our staging environment doesn't contain the necessary data."

With data-driven applications, our production environment often contains the answer to Sales, QA & Development needs but crossing our fingers and hoping it all works can be problematic.

I need a tool that transforms a production database, modifies identifiable fields and creates de-identified data to share with prospective customers, create a development database or populate QA environments.

De-identifying SQL: Creating Data

  • Replace data with a fabricated value.
  • Redact data with a constant value.
  • Generate new data with JavaScript.

De-identify SQL accepts a passed-in file or piped input and creates an output .sql or pipes data to another process. So I can wire this into an automated process or manually create data as needed.

Example Ahead: Creating Fake Users

To begin I need a JSON file for each table I want to modify. Each strategy file is named after the table it's applied to. Here's example input SQL for a user table:

CREATE TABLE `user` (
  `id` int NOT NULL DEFAULT '0',
  `email` varchar(256),
  `full_name` varchar(100),
  `birth_date` date,
  `gov_id` varchar(20),
PRIMARY KEY (`id`)
);

INSERT INTO `user` VALUES (1,'[email protected]','Florida Gibson','2000-12-10','111-22-3333'),(2,'[email protected]','Godfrey Mosciski','1980-01-01','333-22-1111'),(3,'[email protected]','Tommie Quitzon','1960-06-07','444-55-6666');

Any field I list in my strategy JSON file will be modified, all other fields will pass-through. To modify the above SQL I can use the following file:

{
  "columns": [
    {
      "columnKey": "full_name",
      "redactWith": "NAME REMOVED",
      "tracked": false
    },
    {
      "columnKey": "email",
      "redactWith": "internet.email",
      "tracked": false
    },
    {
      "columnKey": "birth_date",
      "redactWith": "createBirthday",
      "tracked": false
    },
    {
      "columnKey": "gov_id",
      "redactWith": "{{datatype.number({\"min\":100,\"max\":999})}}-{{datatype.number({\"min\":10,\"max\":99})}}-{{datatype.number({\"min\":1000,\"max\":9999})}}",
      "tracked": false
    }
  ]
}

Running De-identify SQL with the above input and JSON strategy:

cat user.sql | de-identify-sql > user-de-identified.sql

I'll receive back (output generated randomly, your results may vary):

INSERT INTO `user` VALUES 
(1,'[email protected]','NAME REMOVED','1976-05-17','420-15-1747'),
(2,'[email protected]','NAME REMOVED','1961-10-17','327-49-5054'),
(3,'[email protected]','NAME REMOVED','2002-11-10','106-57-2546');

Under the Hood

I have a lot of flexibility when modifying SQL statements:

  • In the above example, full_name is redacted with a constant: NAME_REMOVED.
  • For the email column, we're calling a faker function to generate a new email address.
  • To generate a gov_id, we use a mustache-like template of faker functions.
  • Lastly, to create birth_date, we're calling custom JavaScript. This is coming from user.js:
const faker = require('faker')

module.exports = {
  createBirthday: () => {
    return faker.date.past(80).toISOString().slice(0, 10);
  }
}

Preserving Data Relationships

I may also need to work with denormalized or repeating data which needs to be consistently processed to retained the relationships between that data while removing identifying details. Here I can change the tracked parameter in the JSON strategy file from false to true this will replace an repeated fields with the same fabricated values. For example, this order SQL:

INSERT INTO `order` VALUES 
(1000,'[email protected]',12.99),
(1001,'[email protected]',123.45),
(1002,'[email protected]',70.00),
(1003,'[email protected]',542.01),
(1004,'[email protected]',67.52);

With this JSON strategy:

{
  "columns": [
    {
      "columnKey": "order_email",
      "redactWith": "internet.email",
      "tracked": true
    }
  ]
}

Could become:

INSERT INTO `order` VALUES 
(1000,'[email protected]',12.99),
(1001,'[email protected]',123.45),
(1002,'[email protected]',70),
(1003,'[email protected]',542.01),
(1004,'[email protected]',67.52);

This allows us to remove identifying fields while preserving relationships between data elements that the application may require. This also makes our fabricated data appear more realistic: in certain tables we expect to see fields repeat.

Disclaimers

23