I’ve been working on a little gadget for a while now, and today I finally got around to completing it and so now I’ve published it for everyone to try out. It’s a web API (wait, wait, don’t go away – it’s for database people!) that creates a randomized list of names, addresses, etc.
In this post, I’ll show you how easy it is to use this service to anonymize a development or test database so you don’t have all that personally identifiable information floating around.
Getting some anonymous records
Check out anonymize.strd.co, where you can design your simple web API call. You can use that URL in your browser, with Powershell, or with any custom application that can request a URL and read JSON data.
I’ve rate-limited the service to only return a handful of rows for every 60 seconds, so if you want more rows, you’ll need a free API key.
I’m going to start by fetching a number of anonymized user records using Powershell and putting them in a variable:
$rowset = (Invoke-WebRequest ` -Uri "https://anonymize.strd.co/json?culture=US&rows=100").Content ` | ConvertFrom-Json
Once we have that in place, let’s use the absolutely outstanding DBATools Powershell module to load them into a staging table in your database, with just one easy command.
Write-DbaDataTable ` -InputObject $rowset ` -SqlInstance SERVER01 ` -Table "DatabaseName.dbo.CustomerMap"
I’m assuming that this table already exists, but if you want to automatically create it, just add “-AutoCreateTable” to the command. There are a bunch of other ways to load data into a table, including ETL tools; to each their own.
Replacing actual names with the fake ones
From here on, it’s just a simple T-SQL command to replace your sensitive customer information with the made-up data in the staging table.
WITH cust AS ( SELECT ROW_NUMBER() OVER (ORDER BY ID) AS _rownum, * FROM dbo.Customers) UPDATE cust SET cust.FirstName=fake.FirstName, cust.LastName=fake.FirstName, cust.[Address]=fake.[Address], cust.ZipCode=fake.ZipCode FROM cust INNER JOIN dbo.CustomerMap AS fake ON cust._rownum=fake.ID;
The ID column of the API dataset is an incremental key that starts with 1, so irrespective of what primary key you use in your live table, you can apply a ROW_NUMBER() function to get a matching key that you can join the two on.
Needless to say, the actual names and table definitions of your live customer table(s) will vary, so you’ll need to plan accordingly. You’ll also need to have enough records in the staging table to match all of your customer records!
Call for volunteers!
I’ve compiled statistically accurate data from official census sources for the US and Sweden, so that you can get realistic-looking data for those two countries, but if you want your country’s names and places featured here, please send me as detailed information as you can gather on your region’s population. Here’s an idea what I’m looking for:
- distributions by age and gender of:
- the 100 most common women’s and men’s names
- population per state/region/postal code/similar
- how national IDs are formatted/created, including checksums
- how streets are named
And, of course, if there’s something else that you think might be an easy fix, please let me know.
6 thoughts on “Anonymize your dev databases!”
Pingback: Generating Anonymous Data – Curated SQL
This is a great idea! Have you looked at Mockaroo?
They have a very nice interface to create custom random datasets with most kinds of data, both through web interface and through an API. Also free for a limited number of items in the dataset. I don’t think they offer a Swedish dataset though :)
Interesting, I didn’t know about Mockaroo. Also, my dataset is reasonably accurate statistically, with regards to distribution of age, state/zip, etc.
Daniel, this looks promising, I’ll try it out.
One question: the column “Personnummer” is a mockup, right?
Could the Personnummer-mockup be a real one (by accident)?
Yeah, the “personnummer” is a mockup, and it may very well overlap existing real-world ones.
Another online tool to generate random data: