Just for laughs, we’re going to take a look at a slightly-too-complex yet elegant solution to a seemingly simple task: creating a list of 100 fictional company names for a demo database.
What I need
I’m working on a demo database where I need some fictional customer records (company names). I could obviously just call them “Customer 1”, “Customer 2”, and so on, and then go to lunch like ordinary people. But where’s the sport? And besides, when you’re showing off a cool demo to future clients, you’d rather the data looked “authentic”.
What I tried
You could build a dynamic, randomized string generator function that jumbles just the right proportions of vowels and consonants together to form nonsensical, name-sounding strings, but this turns out to be actually really hard. Building T-SQL code to create a string that actually looks like something you could pronounce is a lot of work, the details of which I won’t go into right here.
Let’s just say I know this from experience.
What if we had a random mass of text?
So, ideally I would like a large repository of text to work with and generate my names from. I could probably build a script to extract random strings from some database that I have lying around, but there’s always the obvious risk that even a scrap of text could contain/reveal sensitive customer information. But wait! There’s the sys.messages DMV, where SQL Server keeps all its error messages.
How I invented 100 cool tech startup names
First off, let’s collect some random strings of text and dump it into a table variable:
DECLARE @strings TABLE ( message_id int NOT NULL, [text] nvarchar(2048) NOT NULL, PRIMARY KEY CLUSTERED (message_id) ) INSERT INTO @strings (message_id, [text]) SELECT TOP (500) message_id, [text] FROM sys.messages WHERE language_id=1033 ORDER BY NEWID();
Three things happen here:
- The sys.messages DMV contains over 12 000 rows on my SQL Server 2016 SP1, and those are just the us_english ones, which means that for performance reasons, I’m going to just pick 500 messages at random.
- Also, the language_id column identifies the language of the message (us_english is 1033, german is 1031, 1036 is french, 3082 is spanish, and so on), so you could easily have the script generate names in your preferred language if you want to.
- Finally, to randomize the selection, I’m sorting by NEWID() (which creates a “random” uniqueidentifier column).
Now, let’s parse all those strings into words. And by “words”, I specifically mean contiguous sequences of the characters a-z, to weed out all the whitespaces and just keep the “real words”. We’ll do this using a recursive common table expression:
DECLARE @words TABLE ( word nvarchar(100) NOT NULL, PRIMARY KEY CLUSTERED (word) ); WITH cte AS ( --- The anchor: SELECT CAST(NULL AS nvarchar(100)) AS word, [text] AS remain FROM @strings UNION ALL --- The recursion: SELECT CAST( LEFT(remain, PATINDEX(N'%[^a-z]%', remain+N' ')-1 ) AS nvarchar(100)), CAST(SUBSTRING(remain, PATINDEX(N'%[^a-z]%', remain+N' ')+1, LEN(remain)) AS nvarchar(2048)) FROM cte WHERE remain!=N'' ) INSERT INTO @words SELECT DISTINCT LOWER(word) FROM cte WHERE LEN(word)>=3 OPTION (MAXRECURSION 0);
The anchor part of the CTE sets up one line for each string. For every recursion, we’ll check the offset to the first non-alphabetic character (including spaces and numbers). That’s what the PATINDEX() function is for. Then we’ll put what’s to the left of that offset in the “word” column and what’s to the right of the offset in the “remain” column for the next iteration.
As we iterate through this, we’re gradually truncating the “remain” column until it’s a just a blank string, which is our stop condition for the recursion. The output of the CTE will be one row for each word of each string.
For good measure, we’ll eliminate all words that are shorter than three characters and then save a DISTINCT set of the remaining words in the @words table variable.
Now, the home stretch: putting the words together.
SELECT TOP (100) UPPER(LEFT(a.word, 1))+SUBSTRING(a.word, 2, LEN(a.word))+ ISNULL(N' '+b.word, '')+ ISNULL(N' '+c.word, '') AS CoolStartupName --- First word: FROM @words AS a --- Second word: OUTER APPLY ( SELECT TOP (1) UPPER(LEFT(x.word, 1))+ SUBSTRING(x.word, 2, LEN(x.word)) AS word FROM @words AS x WHERE LEFT(a.word, 3)!=LEFT(x.word, 3) AND RIGHT(a.word, 3)!=RIGHT(x.word, 3) ORDER BY NEWID() ) AS b --- Third word (if there's space): OUTER APPLY ( SELECT TOP (1) UPPER(LEFT(x.word, 1))+ SUBSTRING(x.word, 2, LEN(x.word)) AS word FROM @words AS x WHERE LEFT(a.word, 3)!=LEFT(x.word, 3) AND RIGHT(a.word, 3)!=RIGHT(x.word, 3) AND LEFT(b.word, 3)!=LEFT(x.word, 3) AND RIGHT(b.word, 3)!=RIGHT(x.word, 3) AND LEN(a.word)+LEN(b.word)<=12 ORDER BY NEWID() ) AS c ORDER BY NEWID();
This query is actually a lot simpler than it first appears. Here’s how it breaks down:
- Pick 100 words at random (table “a”)
- For each word in “a”, if possible, pick a single random word (“b”) that doesn’t start or end with the same three letters as the “a” word.
- For each word in “b”, if possible, pick a single random word (“c”) that doesn’t start or end with the same three letters as the “a” nor the “b” word.
- The UPPER(), LEFT() and SUBSTRING() stuff is just to turn the names into title case.
- As before, the ORDER BY NEWID() randomizes the order in which the TOP (1) row is returned.
The APPLY operator runs a subquery once for each “calling” parent row, which is what we want here, as we want to randomize the sort order each time. In the execution plan, APPLY shows up as a Nested Loops operator:
Obviously, a real performance killer here is the randomized sort order, which puts no less than three Sort operations in our plan.
With a reasonably simple text parsing script and the built-in messages in sys.messages, we can create a list of 100 cool tech startup company names, including brilliant names such as “Intended Incoming”, “Impacted Algorithm”, “Candidate Terminator” or “Granted Cleanup”.
Would you name your tech startup “Impacted Algorithm”? Let me know in the comments.