I made a new demo/workshop database

For a couple of years now, I’ve taught my SQL Server workshops using a demo database that I constructed way back in 2014. It was modelled on a fictional airline, but the data was made up, and the modelling… wasn’t great.

Chicago, the city of complicated parking. Photo credit: Sawyer Bengtson on unsplash.com

So I spent a couple of hundred hours building a new database, based on a public dataset from the City of Chicago with parking tickets issued between 1997 and 2017. I’ve also refined this dataset with details from other public sources: ZIP codes, socioeconomic indicators, weather information, and more.

I’m not saying you should ditch the Stack Overflow database that Brent Ozar provides, but I wanted an alternative. If you want a much larger database, absolutely go for the Stack database, which is at least 10 times as big. But if you want some more avenues to explore real-world data, maybe check this one out.

How to download

The download is an 8 GB compressed SQL Server backup file and restores to about 25 GB.

It’s built on a SQL Server 2019 instance, and since backups can’t be restored to older versions, you’ll need SQL Server 2019 or newer to restore it.
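If you’re new to restoring backups over T-SQL, it looks something like this. The file paths, database name and logical file names below are placeholders, so check the logical names with RESTORE FILELISTONLY first:

-- Inspect the logical file names inside the backup:
RESTORE FILELISTONLY
FROM DISK=N'C:\Temp\ChicagoParkingTickets.bak';

-- Restore the database, moving the files to your own data/log directories:
RESTORE DATABASE ChicagoParkingTickets
FROM DISK=N'C:\Temp\ChicagoParkingTickets.bak'
WITH MOVE N'ChicagoParkingTickets' TO N'D:\Data\ChicagoParkingTickets.mdf',
     MOVE N'ChicagoParkingTickets_log' TO N'D:\Log\ChicagoParkingTickets.ldf',
     STATS=10;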

Overview of the data model

The database collation is Latin1_General_100_CI_AS_SC.

I’ve divided the information into schemas based on their respective sources. The “ReferenceData” schema has mixed sources, all of them publicly available.

Because the data is so geographically bound, many of the tables have geo data as well, though I technically put it in a geometry type and not a geography type, just because it was easier. This can make for some cool map visuals in SSMS if you want.
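For example, returning a geometry column in a result set lights up the “Spatial results” tab in SSMS. Here’s one of the boundary tables described further down, assuming its geometry column is included in the output:

-- Any geometry column in the result set renders in SSMS's "Spatial results" tab:
SELECT *
FROM Chicago.Community_Areas;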

Most columns and tables are annotated using the extended property MS_description, so if you view the extended properties in SSMS, or if you use my sp_ctrl3 utility, a brief description will show up for each object.
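If you’d rather query the annotations directly, something like this works against the standard catalog views:

-- List the MS_description annotations for tables and their columns:
SELECT s.name AS [schema], o.name AS [object], c.name AS [column],
       ep.[value] AS [description]
FROM sys.extended_properties AS ep
INNER JOIN sys.objects AS o ON o.[object_id]=ep.major_id
INNER JOIN sys.schemas AS s ON s.schema_id=o.schema_id
LEFT JOIN sys.columns AS c ON c.[object_id]=ep.major_id AND c.column_id=ep.minor_id
WHERE ep.class=1            -- object- and column-level properties
  AND ep.name='MS_description';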

Schema “Tickets”

The “Tickets” schema contains data from the base dataset from the City of Chicago, modelled roughly as a star schema with a fact table in the center.

Tickets.Parking_Violations

Fact table with about 54 million rows, one row for every parking ticket.

  • Ticket_number, bigint
  • Issued_date, datetime2(0)
  • Location_ID, int
  • Plate_ID, int
  • Unit_ID, int
  • Location_Text_ID, int
  • Violation_ID, int
  • Officer_ID, int
  • Ticket_Status_ID, tinyint
  • Ticket_Status_date, date
  • Notice_Level_ID, tinyint
  • Notice_Number, bigint
  • Hearing_Outcome_is_Liable, bit
  • Fine_amount, numeric(10, 2)
  • Late_fee, numeric(10, 2)
  • Collection_fee, numeric(10, 2)
  • Amount_paid, numeric(10, 2)
  • Amount_due, numeric(10, 2)

The dataset looks like it’s based on hand-written or OCR’ed text, or in some cases perhaps input from a portable device with a tiny keyboard. There are a lot of spelling issues in the original data, particularly in the location information. For this reason, I’ve tried to create a “normalized” dataset where I’ve interpreted and mapped each location to an existing location in the City of Chicago. That’s the Location_ID column. If you want to review the original data, that information is in the Location_Text_ID column.

The ticket status can be seen in Ticket_Status_ID (as well as Ticket_Status_date).

If there was a notice sent out to the owner of the vehicle, this would be recorded in the Notice_Number and Notice_Level_ID columns.

If the ticket was appealed, the outcome of the resulting hearing (found liable or not) is in the Hearing_Outcome_is_Liable column.

Tickets.Violations

One row per type of parking violation.

  • Violation_ID, int
  • Violation_Code, varchar(15)
  • Violation_Description, varchar(150)
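For instance, joining the fact table to this dimension gives you the most common violation types. A quick sketch using just the documented columns:

SELECT TOP (10) v.Violation_Description, COUNT(*) AS Ticket_count
FROM Tickets.Parking_Violations AS pv
INNER JOIN Tickets.Violations AS v ON v.Violation_ID=pv.Violation_ID
GROUP BY v.Violation_Description
ORDER BY Ticket_count DESC;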

Tickets.Fine_amount

The set amount you’d be fined for each respective type of parking violation. This data is derived from the dataset, so it contains a couple of qualified guesses to make up for inconsistencies in the original data.

  • Violation_ID, int
  • From_Date, date
  • To_Date, date
  • Fine_amount, numeric(10, 2)
  • Fine_including_fees, numeric(10, 2)

The Fine_including_fees column indicates the total fine including late fees if not paid by the due date. Normally, the fine doubles, but there are exceptions.
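To look up the fine that applied when a given ticket was issued, you can range-join on the validity dates. Whether To_Date is inclusive, and how an open-ended current period is represented, are assumptions you’ll want to verify:

SELECT pv.Ticket_number, pv.Issued_date, f.Fine_amount, f.Fine_including_fees
FROM Tickets.Parking_Violations AS pv
INNER JOIN Tickets.Fine_amount AS f ON f.Violation_ID=pv.Violation_ID
    AND CAST(pv.Issued_date AS date)>=f.From_Date
    AND CAST(pv.Issued_date AS date)<=f.To_Date;    -- assumes To_Date is inclusive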

Tickets.Locations

One row for every unique location, about 2.2 million rows.

  • Location_ID, int
  • geocoded_lat, numeric(12, 7)
  • geocoded_lng, numeric(12, 7)
  • Street, varchar(200)
  • Street_number, int
  • ZIP, numeric(5, 0)
  • Ward, tinyint
  • Tract, char(6)
  • Community, tinyint
  • Neighborhood, varchar(3)

The “natural key” of this table is (Street, Street_number). Every location in Chicago fits into a ZIP code, one of 50 wards, a census tract, one of 77 community areas, and one of 98 neighborhoods. These attributes are modelled in separate tables and connected with foreign keys.

Tickets.Location_Text

Plaintext locations, about 4.7 million unique rows, as entered in the original data set. I put them in a separate table to keep the size of the fact table down. You won’t realistically need this table unless you want to dig into unmapped or strangely mapped locations.

  • Location_Text_ID, int
  • Location_Text, varchar(50)

Tickets.License_Plates

One row for each license plate, about 17 million rows with 13.6 million unique plates. Plate numbers are hashed and can theoretically be reused on multiple vehicles over time, for instance if the owner has a personalized plate and gets a new vehicle. For that reason, every row has a From_Date and To_Date.

  • Plate_ID, int
  • License_Plate_Hash, binary(32)
  • From_Date, datetime2(0)
  • To_Date, datetime2(0)
  • ZIP, numeric(5, 0)
  • License_Plate_State, char(2)
  • Plate_Type, varchar(5)
  • Vehicle_Make, varchar(10)

Almost all plates have an issuing state, and most have a registered ZIP code.
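As a quick sanity check, you can tally the plates by issuing state using only the documented columns:

SELECT License_Plate_State, COUNT(*) AS Plate_count
FROM Tickets.License_Plates
GROUP BY License_Plate_State
ORDER BY Plate_count DESC;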

Tickets.Ticket_Status

The “current” status of a ticket, 8 rows. For example: paid, contested, dismissed, part of a bankruptcy case, etc.

Tickets.Notice_Level

If a notice was sent out, and what kind, 6 rows.

Tickets.Officers

Badge number of the officer who issued the ticket. 46 000 rows.

Tickets.Units

The “agency” (department, division, unit, etc) that issued the ticket, 383 rows.

Tickets.Accounting

This is a derived fact table, based on the base dataset. As such, it contains assumptions and guesses with regard to dates and things like notice levels and late fees, but the amounts should match up with the original dataset. 123 million rows.

  • Date, date
  • Ticket_number, bigint
  • Notice_Level_ID, tinyint
  • Plate_ID, int
  • Location_ID, int
  • Violation_ID, int
  • Amount_issued, numeric(10, 2)
  • Amount_written_off, numeric(12, 2)
  • Amount_paid, numeric(10, 2)
  • Amount_due, numeric(12, 2)

This table was designed to mimic a ledger-style table.
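Because it’s ledger-style, a running balance per ticket falls out of a windowed sum. A minimal sketch, assuming each row represents one accounting event:

SELECT Ticket_number, [Date],
       SUM(Amount_issued-Amount_paid-Amount_written_off) OVER (
           PARTITION BY Ticket_number ORDER BY [Date]
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Running_balance
FROM Tickets.Accounting;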

Schema “Chicago”

This schema contains publicly available data from the City of Chicago. These data sources vary in age, so they won’t match perfectly in all respects: geographical boundaries change over time, and buildings and streets are constructed and torn down.

  • Chicago.Buildings, 488 692 rows
  • Chicago.Census_Tracts, 801 rows
  • Chicago.Community_Areas, 77 rows
  • Chicago.Neighborhoods, 98 rows
  • Chicago.Population_by_Census_Block, 46 291 rows
  • Chicago.Population_by_Zip_Code, 118 rows
  • Chicago.Socioeconomic_Indicators_per_Community_Area, 77 rows
  • Chicago.StreetNames, 2 582 rows
  • Chicago.Wards, 50 rows
  • Chicago.ZipCodes, 59 rows

Schema “ReferenceData”

This “catch-all” schema contains other information that was sourced publicly, but not from the City of Chicago.

  • ReferenceData.License_Plate_Types, 55 rows
  • ReferenceData.Months, 276 rows
  • ReferenceData.Vehicle_Makes, 6 253 rows
  • ReferenceData.ZipCodes, 41 098 rows

Schema “Weather”

Think people are worse at parking when it’s raining? Or that officers are grumpier on a scorching summer’s day or a blistering winter day? Just for fun, I added monthly average temperatures and precipitation, so you can investigate a possible correlation between weather and parking tickets.

  • Weather.Precipitation, 276 rows
  • Weather.Temperature, 276 rows
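A starting point could be to count tickets per month and join to the monthly averages. The Weather column names below are guesses on my part, so check the actual table definitions (or the extended properties) first:

-- Hypothetical column names ([Year], [Month], Avg_temperature):
SELECT YEAR(pv.Issued_date) AS [Year], MONTH(pv.Issued_date) AS [Month],
       COUNT(*) AS Ticket_count, MIN(t.Avg_temperature) AS Avg_temperature
FROM Tickets.Parking_Violations AS pv
INNER JOIN Weather.Temperature AS t ON t.[Year]=YEAR(pv.Issued_date)
    AND t.[Month]=MONTH(pv.Issued_date)
GROUP BY YEAR(pv.Issued_date), MONTH(pv.Issued_date);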

“But Daniel, how do you pay for the bandwidth?”

I’ve placed the backup file on Cloudflare R2 storage, which is their flavor of S3 buckets, except there are no egress costs. Hosting the 8 GB file will cost me about $0.20 per year, and the downloads won’t cost me anything.

Watch out for Merge Interval with date range Index Seeks

In my last post, I found that DATEDIFF, DATEADD and the other date functions in SQL Server are not as datatype agnostic as the documentation would have you believe. Those functions would perform an implicit datatype conversion to either datetimeoffset or datetime (!), which would noticeably affect the CPU time of a query.

Well, today I was building a query on an indexed date range, and the execution plan contained a Merge Interval operator. Turns out, this operator brings a few unexpected surprises to your query performance. The good news is, it’s a relatively simple fix.

Continue reading

DATEDIFF performs implicit conversions

As I was performance tuning a query, I found that a number of date calculation functions in SQL Server appear to be forcing a conversion of their date parameters to a specific datatype, adding computational work to a query that uses them. In programming terms, it seems that these functions do not have “overloads”, i.e. different code paths depending on the incoming datatype.

So let’s take a closer look at how this manifests itself.

Continue reading

Computing the modulus from very large numbers

… and what all of this has to do with IBAN numbers.

The modulus is the remainder of a division of two integers*. Divide 12 by 4, and the result is 3. But divide 11 by 4, and the result is 2.75, which could also be expressed by saying that 11/4 is 2 with a remainder of 3. Computing that 3 is the work of the modulo operation, which in T-SQL is represented by the % operator.
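In T-SQL, that looks like this:

SELECT 11 % 4;   -- returns 3
SELECT 12 % 4;   -- returns 0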

Let’s explore how to compute the modulus of large numbers in SQL Server, and how this is useful in the real world.

Continue reading

T-SQL template parser

Just for the heck of it, I scratched together a template parser for T-SQL, available on GitHub. The usage of this function is similar to the STRING_SPLIT() function, except that instead of splitting a string by a delimiter character, it splits the string according to a defined template.

SELECT *
FROM dbo.Template_Split(
    'Wingtip Toys (Bethel Acres, OK)',
    '% (%, %)'
);

… will generate the following output, with one row for each “%” wildcard:

Wingtip Toys
Bethel Acres
OK

Notice how the “%” wildcard character denotes how the string is split. Unlike the fancy stuff you can do with regular expressions, T-SQL wildcards don’t allow you to define capture groups, so this function is unfortunately constrained to just using “%”. I hope it will still come in handy to someone out there.

That’s it, that’s the post. Enjoy!

Turn your list into human-readable intervals

If you’ve worked with reporting, you’ve probably come across the following problem. You have a list of values, say “A, B, C, D, K, L, M, N, R, S, T, U, Z” that you want to display in a more user-friendly, condensed manner, “A-D, K-N, R-U, Z”.

Today, we’re going to look at how you can accomplish this in T-SQL, and what this has to do with window functions and gaps and islands.
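If you want to peek ahead, the core trick is a classic gaps-and-islands pattern. Here’s a minimal sketch, using an inline list of single-character values:

-- Consecutive letters belong to the same "island" when their code point
-- minus a running row number is constant:
WITH islands AS (
    SELECT v, ASCII(v)-ROW_NUMBER() OVER (ORDER BY v) AS island
    FROM (VALUES ('A'), ('B'), ('C'), ('D'), ('K'), ('L'), ('M'), ('N'),
                 ('R'), ('S'), ('T'), ('U'), ('Z')) AS x(v)
)
SELECT CASE WHEN MIN(v)=MAX(v) THEN MIN(v)
            ELSE MIN(v)+'-'+MAX(v) END AS [interval]
FROM islands
GROUP BY island
ORDER BY MIN(v);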

Continue reading

Use Ctrl+F1 as a “preview” button in SSMS

Every time I set up SQL Server Management Studio, I take the time to add a shortcut to the “Query Shortcuts” section of the options:

Tools -> Options -> Environment -> Keyboard -> Query Shortcuts

On the surface, these query shortcuts are just what the name implies – a key combination that you can press to run a command or execute a stored procedure. But there’s a hidden super power: whatever text you’ve selected in SSMS when you press the keyboard combination gets appended to the shortcut statement. In my case, I’ve mapped Ctrl+F1 to the text “SELECT TOP (1000) * FROM ” (note the trailing space).

So if you select the name of a table, for instance “dbo.Votes”, and press Ctrl+F1, SSMS will run:

SELECT TOP (1000) * FROM dbo.Votes

Preview the contents of a table

This allows you to create a keyboard shortcut to instantly preview the contents of a table or view.

And you can select not just the name of one table, but any other query text you want to tack on:

Preview the contents of two joined tables.

Because we’ve selected both the name of a table and the next line, pressing Ctrl+F1 in SSMS will effectively run the following command:

SELECT TOP (1000) * FROM dbo.Votes AS v
INNER JOIN dbo.VoteTypes AS vt ON v.VoteTypeId=vt.Id

You can go on to include as many joins, WHERE clauses and ORDER BY clauses as you want, as long as the resulting syntax makes sense.

Remember that query shortcuts only apply to new windows, so if you change them, you’ll have to open a new window for the change to take effect.

About wildcards and data type precedence

Implicit conversions in SQL Server follow a specific, predictable order, called data type precedence. This means that if you compare or add/concatenate two values, a and b, with different data types, you can predict which one will be implicitly converted to the data type of the other in order to complete the operation.
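For example, int takes precedence over varchar, so in a mixed expression the varchar value is the one that gets converted:

SELECT 1 + '2';     -- the varchar '2' is implicitly converted to int: returns 3
SELECT 1 + 'two';   -- fails with a conversion error, rather than concatenating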

I stumbled on an interesting exception to this rule the other day.

Continue reading