Fun with random names

Just for laughs, we’re going to take a look at a slightly-too-complex yet elegant solution to a seemingly simple task: creating a list of 100 fictional company names for a demo database.

What I need

I’m working on a demo database where I need some fictional customer records (company names). I could obviously just call them “Customer 1”, “Customer 2”, and so on, and then go to lunch like ordinary people. But where’s the sport? And besides, when you’re showing off a cool demo to future clients, you’d rather the data looked “authentic”.

What I tried

You could build a dynamic, randomized string generator function that jumbles just the right proportions of vowels and consonants together to form nonsensical, name-sounding strings, but this turns out to be actually really hard. Building T-SQL code to create a string that actually looks like something you could pronounce is a lot of work, the details of which I won’t go into right here.

Let’s just say I know this from experience.

What if we had a random mass of text?

So, ideally I would like a large repository of text to work with and generate my names from. I could probably build a script to extract random strings from some database that I have lying around, but there’s always the obvious risk that even a scrap of text could contain/reveal sensitive customer information. But wait! There’s the sys.messages DMV, where SQL Server keeps all its error messages.

How I invented 100 cool tech startup names

First off, let’s collect some random strings of text and dump it into a table variable:

DECLARE @strings TABLE (
    message_id    int NOT NULL,
            nvarchar(2048) NOT NULL,
    PRIMARY KEY CLUSTERED (message_id)
)

INSERT INTO @strings (message_id, )
SELECT TOP (500) message_id, 
FROM sys.messages
WHERE language_id=1033
ORDER BY NEWID();

Three things happen here:

  • The sys.messages DMV contains over 12 000 rows on my SQL Server 2016 SP1, and those are just the us_english ones, which means that for performance reasons, I’m going to just pick 500 messages at random.
  • Also, the language_id column identifies the language of the message (us_english is 1033, german is 1031, 1036 is french, 3082 is spanish, and so on), so you could easily have the script generate names in your preferred language if you want to.
  • Finally, to randomize the selection, I’m sorting by NEWID() (which creates a “random” uniqueidentifier column).

Now, let’s parse all those strings into words. And by “words”, I specifically mean contiguous sequences of the characters a-z, to weed out all the whitespaces and just keep the “real words”. We’ll do this using a recursive common table expression:

DECLARE @words TABLE (
    word       nvarchar(100) NOT NULL,
    PRIMARY KEY CLUSTERED (word)
);

WITH cte AS (
    --- The anchor:
    SELECT CAST(NULL AS nvarchar(100)) AS word,  AS remain
    FROM @strings

    UNION ALL

    --- The recursion:
    SELECT CAST(     LEFT(remain, PATINDEX(N'%[^a-z]%', remain+N' ')-1             ) AS nvarchar(100)),
           CAST(SUBSTRING(remain, PATINDEX(N'%[^a-z]%', remain+N' ')+1, LEN(remain)) AS nvarchar(2048))
    FROM cte WHERE remain!=N''
)

INSERT INTO @words
SELECT DISTINCT LOWER(word)
FROM cte
WHERE LEN(word)>=3
OPTION (MAXRECURSION 0);

The anchor part of the CTE sets up one line for each string. For every recursion, we’ll check the offset to the first non-alphabetic character (including spaces and numbers). That’s what the PATINDEX() function is for. Then we’ll put what’s to the left of that offset in the “word” column and what’s to the right of the offset in the “remain” column for the next iteration.

As we iterate through this, we’re gradually truncating the “remain” column until it’s a just a blank string, which is our stop condition for the recursion. The output of the CTE will be one row for each word of each string.

For good measure, we’ll eliminate all words that are shorter than three characters and then save a DISTINCT set of the remaining words in the @words table variable.

Now, the home stretch: putting the words together.

SELECT TOP (100)
    UPPER(LEFT(a.word, 1))+SUBSTRING(a.word, 2, LEN(a.word))+
    ISNULL(N' '+b.word, '')+
    ISNULL(N' '+c.word, '') AS CoolStartupName

--- First word:
FROM @words AS a

--- Second word:
OUTER APPLY (
    SELECT TOP (1)
           UPPER(LEFT(x.word, 1))+
           SUBSTRING(x.word, 2, LEN(x.word)) AS word
    FROM @words AS x
    WHERE LEFT(a.word, 3)!=LEFT(x.word, 3) AND RIGHT(a.word, 3)!=RIGHT(x.word, 3)
    ORDER BY NEWID()
    ) AS b

--- Third word (if there's space):
OUTER APPLY (
    SELECT TOP (1)
           UPPER(LEFT(x.word, 1))+
           SUBSTRING(x.word, 2, LEN(x.word)) AS word
    FROM @words AS x
    WHERE LEFT(a.word, 3)!=LEFT(x.word, 3) AND RIGHT(a.word, 3)!=RIGHT(x.word, 3)
      AND LEFT(b.word, 3)!=LEFT(x.word, 3) AND RIGHT(b.word, 3)!=RIGHT(x.word, 3)
      AND LEN(a.word)+LEN(b.word)<=12
    ORDER BY NEWID()
    ) AS c

ORDER BY NEWID();

This query is actually a lot simpler than it first appears. Here’s how it breaks down:

  • Pick 100 words at random (table “a”)
  • For each word in “a”, if possible, pick a single random word (“b”) that doesn’t start or end with the same three letters as the “a” word.
  • For each word in “b”, if possible, pick a single random word (“c”) that doesn’t start or end with the same three letters as the “a” nor the “b” word.
  • The UPPER(), LEFT() and SUBSTRING() stuff is just to turn the names into title case.
  • As before, the ORDER BY NEWID() randomizes the order in which the TOP (1) row is returned.

The APPLY operator runs a subquery once for each “calling” parent row, which is what we want here, as we want to randomize the sort order each time. In the execution plan, APPLY shows up as a Nested Loops operator:

Obviously, a real performance killer here is the randomized sort order, which puts no less than three Sort operations in our plan.

Bottom line

With a reasonably simple text parsing script and the built-in messages in sys.messages, we can create a list of 100 cool tech startup company names, including brilliant names such as “Intended Incoming”, “Impacted Algorithm”, “Candidate Terminator” or “Granted Cleanup”.

Would you name your tech startup “Impacted Algorithm”? Let me know in the comments.

25 comments

  1. Seconds away from buying the domain for ‘Impacted Algorithm’. A pretty funny solution to a problem I came up against just recently.

  2. “Applications Deleted”, a subdivision of “Floating Slow Startup”.
    I like this script. It’ll bring some fun into the demo’s.
    And I have a new user name too. πŸ˜‰

  3. This is my new favourite game for a Friday afternoon!
    Favourites to date:
    Replicated Illegal
    Using Construction
    Dispatch Return
    The Descriptor
    and the slightly creepy: Population Processing

  4. Haha! OK, my list of faves:

    Problems Valid
    Error Targeted
    Delete Operator
    End Transactions
    Result Signal You
    Nvarchar Violates
    Tolerant Permissible
    Certificate Forceseek
    Total Sensitivity

  5. Altering Per Browser
    Explicitly Unicode
    Commit Impersonation
    Atomic Identify

    and my favorite: Because Memory

  6. I think these also make good Band names
    Source Tabular
    Attempted Versions
    Decode Some Specifies
    Explicit Otherwise
    Connect Performance
    Corruption Subset
    Object Fashion
    Residual Beyond
    Redefinition Won
    …

  7. And some possible lines of business (I assume coming in v2?):

    Body Alter Form — Personal Training
    Semantics Possible — Consultant philosophers
    Argument Resumed — Dating agency
    Deprecated Tables — Reclaimed furnishings

  8. Negative Bytes
    Query Coherency
    Empty Particle
    Showmemo Password
    Dumped Indexed Namespace

    and my all-time favorite:

    Deadlock Pool

  9. And some more :

    ” Temporarily Foreign” – US Immigration Agency
    ” Snapshots Works” – A trendy Photo studio
    ” Deleted Siblings” – Population Control Agency for Brothers & Sisters
    ” Damaged Environment ” – The local Branch of EPA
    ” Commit Used Reader” – Local Library’s Law Enforcement branch
    “Skipped Password” – EQUIFAX’s Security Policy
    ” Roll Initialize” – Pillsbury’s Instruction’s Manual
    ” Capture Full Sender ” – Post Office’s Law Enforcement Agency
    ” Making Succeeded” – In-Vitro Fertilization Lab ?
    “Weekly Parent Received” – Amazon’s New Service for Mom & Dad Shipments
    “Uses Block Soap” – Public Bath’s Moto

  10. I “like” the following:
    Parameter Protection
    Troubleshooting Document
    Previously Distributed
    Four Settings Compliance
    Regenerating Protocol
    Bulk Permission
    Cascading Security
    Literal Interval

    And the deeply personal:
    Inconsistent Stream

  11. Here are my favorites πŸ™‚ I think some of them can even be great book or movie titles.

    Replaced Problem
    Last Declared Session
    Clone Durability
    Agent Delivered
    Exposing Avoid

Leave a comment

Your email address will not be published. Required fields are marked *