De-identify SQL: Transforming Production Data

Perhaps you've noticed similar comments where you work:
Sales:

"This sales demo needs updated data to demonstrate a new feature."

QA:

"Regression testing wasn't possible because the database is stale."

Development:

"Our staging environment doesn't contain the necessary data."

With data-driven applications, our production environment often contains the answer to Sales, QA & Development needs but crossing our fingers and hoping it all works can be problematic.
I need a tool that transforms a production database, modifies identifiable fields and creates de-identified data to share with prospective customers, create a development database or populate QA environments.
De-identifying SQL: Creating Data
  • Replace data with a fabricated value.
  • Redact data with a constant value.
  • Generate new data with JavaScript.
  • De-identify SQL accepts a passed-in file or piped input and creates an output .sql or pipes data to another process. So I can wire this into an automated process or manually create data as needed.
    Example Ahead: Creating Fake Users
    To begin I need a JSON file for each table I want to modify. Each strategy file is named after the table it's applied to. Here's example input SQL for a user table:
    CREATE TABLE `user` (
      `id` int NOT NULL DEFAULT '0',
      `email` varchar(256),
      `full_name` varchar(100),
      `birth_date` date,
      `gov_id` varchar(20),
    PRIMARY KEY (`id`)
    );
    
    INSERT INTO `user` VALUES (1,'Florida.Gibson@hotmail.com','Florida Gibson','2000-12-10','111-22-3333'),(2,'godfreymo44@hotmail.com','Godfrey Mosciski','1980-01-01','333-22-1111'),(3,'Tommie1Quitzon@gmail.com','Tommie Quitzon','1960-06-07','444-55-6666');
    Any field I list in my strategy JSON file will be modified, all other fields will pass-through. To modify the above SQL I can use the following file:
    {
      "columns": [
        {
          "columnKey": "full_name",
          "redactWith": "NAME REMOVED",
          "tracked": false
        },
        {
          "columnKey": "email",
          "redactWith": "internet.email",
          "tracked": false
        },
        {
          "columnKey": "birth_date",
          "redactWith": "createBirthday",
          "tracked": false
        },
        {
          "columnKey": "gov_id",
          "redactWith": "{{datatype.number({\"min\":100,\"max\":999})}}-{{datatype.number({\"min\":10,\"max\":99})}}-{{datatype.number({\"min\":1000,\"max\":9999})}}",
          "tracked": false
        }
      ]
    }
    Running De-identify SQL with the above input and JSON strategy:
    cat user.sql | de-identify-sql > user-de-identified.sql
    I'll receive back (output generated randomly, your results may vary):
    INSERT INTO `user` VALUES 
    (1,'Titus1@yahoo.com','NAME REMOVED','1976-05-17','420-15-1747'),
    (2,'Jeremy.Bechtelar@hotmail.com','NAME REMOVED','1961-10-17','327-49-5054'),
    (3,'Emiliano.Grimes@gmail.com','NAME REMOVED','2002-11-10','106-57-2546');
    Under the Hood
    I have a lot of flexibility when modifying SQL statements:
  • In the above example, full_name is redacted with a constant: NAME_REMOVED.
  • For the email column, we're calling a faker function to generate a new email address.
  • To generate a gov_id, we use a mustache-like template of faker functions.
  • Lastly, to create birth_date, we're calling custom JavaScript. This is coming from user.js:
  • const faker = require('faker')
    
    module.exports = {
      createBirthday: () => {
        return faker.date.past(80).toISOString().slice(0, 10);
      }
    }
    Preserving Data Relationships
    I may also need to work with denormalized or repeating data which needs to be consistently processed to retained the relationships between that data while removing identifying details. Here I can change the tracked parameter in the JSON strategy file from false to true this will replace an repeated fields with the same fabricated values. For example, this order SQL:
    INSERT INTO `order` VALUES 
    (1000,'Dorthy.OKeefe@hotmail.com',12.99),
    (1001,'Belle10@hotmail.com',123.45),
    (1002,'Dorthy.OKeefe@hotmail.com',70.00),
    (1003,'Garfield.Renner89@hotmail.com',542.01),
    (1004,'Garfield.Renner89@hotmail.com',67.52);
    With this JSON strategy:
    {
      "columns": [
        {
          "columnKey": "order_email",
          "redactWith": "internet.email",
          "tracked": true
        }
      ]
    }
    Could become:
    INSERT INTO `order` VALUES 
    (1000,'Demario26@yahoo.com',12.99),
    (1001,'Audra_Stroman@yahoo.com',123.45),
    (1002,'Demario26@yahoo.com',70),
    (1003,'Rogelio_OHara1@yahoo.com',542.01),
    (1004,'Rogelio_OHara1@yahoo.com',67.52);
    This allows us to remove identifying fields while preserving relationships between data elements that the application may require. This also makes our fabricated data appear more realistic: in certain tables we expect to see fields repeat.
    Disclaimers
  • De-identify-SQL is currently compatible with MySQL and MariaDB. Feel free to request a different database.
  • mysqldump should be used to create input SQL.
  • I'm a contributor for De-identify SQL.
  • 35

    This website collects cookies to deliver better user experience

    De-identify SQL: Transforming Production Data