I made a new demo/workshop database

For a couple of years now, I’ve taught my SQL Server workshops using a demo database that I constructed way back in 2014. The old demo database I used was modelled on a fictional airline, but the data was made up, and the modelling.. wasn’t great.

Chicago, the city of complicated parking. Photo credit: Sawyer Bengtson on unsplash.com

So I spent a couple of hundred hours to build a new database based on a public dataset from the City of Chicago with parking tickets issued between 1997 and 2017. I’ve also refined this dataset with more details from other public sources – from ZIP codes, to socioeconomic indicators, to weather information and more.

I’m not saying you should ditch the Stack Overflow database that Brent Ozar provides, but I wanted an alternative. If you want a much larger database, absolutely go for the Stack database, which is at least 10 times as big. But if you want some more avenues to explore real-world data, maybe check this one out.

Sources and usage

The dataset has been compiled from a range of public datasets. I haven’t been able to maintain a clear record of all sources throughout my work, and while I’ve made an effort to keep the quality high, there are cases where data is an aggregate or an intersection of multiple sources, or the original data has been refined. This particularly applies to plaintext interpretations and geocoding.

Sources include, but may not be limited to:

This data is probably not suitable for research or other applications where you need 100% accuracy and accountability. The intended use for this data is to learn to work with data platform tools.

How to download

The download is an 8 GB compressed SQL Server backup file and restores to about 25 GB.

It’s built on a 2019 instance, so you should probably be able to restore it on a SQL Server 2019-or-newer instance.

Overview of the data model

The database collation is Latin1_General_100_CI_AS_SC.

I’ve divided information into schemas based on their respective sources. The “ReferenceData” schema will have mixed sources, all of them publicly available.

Because the data is so geographically bound, many of the tables with have geo data as well, though I technically put it in a geometry type and not a geography type – just because it was easier. This can make for some cool map visuals in SSMS if you want.

Most columns and tables are annotated using the extended property MS_description, so if you view the extended properties in SSMS, or if you use my sp_ctrl3 utility, a brief description will show up for each object.

Schema “Tickets”

The “Tickets” schema contains data from the base dataset from the City of Chicago, modelled roughly as a star schema with a fact table in the center.

Tickets.Parking_Violations

Fact table with about 54 million rows, one row for every parking ticket.

  • Ticket_number, bigint
  • Issued_date, datetime2(0)
  • Location_ID, int
  • Plate_ID, int
  • Unit_ID, int
  • Location_Text_ID, int
  • Violation_ID, int
  • Officer_ID, int
  • Ticket_Status_ID, tinyint
  • Ticket_Status_date, date
  • Notice_Level_ID, tinyint
  • Notice_Number, bigint
  • Hearing_Outcome_is_Liable, bit
  • Fine_amount, numeric(10, 2)
  • Late_fee, numeric(10, 2)
  • Collection_fee, numeric(10, 2)
  • Amount_paid, numeric(10, 2)
  • Amount_due, numeric(10, 2)

The dataset looks like it’s based on either hand-written, OCR’ed text, or in some cases perhaps a portable device with a tiny keyboard. There are a lot of spelling issues in the original data, particularly for the location information. For this reason, I’ve tried to create a “normalized” dataset where I’ve interpreted and mapped the location to an existing location in the City of Chicago. That’s the Location_ID column. If you want to review the original data, that information is in the Location_Text_ID column.

The ticket status can be seen in Ticket_Status_ID (as well as Ticket_Status_date).

If there was a notice sent out to the owner of the vehicle, this would be recorded in the Notice_Number and Notice_Level_ID columns.

If the ticket was appealed, the outcome of the resulting hearing (found liable or not) is in the Hearing_Outcome_is_Liable column.

Tickets.Violations

One row per type of parking violation.

  • Violation_ID, int
  • Violation_Code, varchar(15)
  • Violation_Description, varchar(150)

Tickets.Fine_amount

The set amount you’d be fined for each respective type of parking violation. This data is derived from the dataset, so it contains a couple of qualified guesses to make up for inconsistencies in the original data.

  • Violation_ID, int
  • From_Date, date
  • To_Date, date
  • Fine_amount, numeric(10, 2)
  • Fine_including_fees, numeric(10, 2)

The Fine_including_fees column indicates the total fine including late fees if not paid by the due date. Normally, the fine doubles, but there are exceptions.

Tickets.Locations

One row for every unique location, about 2.2 million rows.

  • Location_ID, int
  • geocoded_lat, numeric(12, 7)
  • geocoded_lng, numeric(12, 7)
  • Street, varchar(200)
  • Street_number, int
  • ZIP, numeric(5, 0)
  • Ward, tinyint
  • Tract, char(6)
  • Community, tinyint
  • Neighborhood, varchar(3)
  • Police_District, tinyint

The “natural key” of this table is (Street, Street_number). Every location in Chicago fits into a ZIP code, one of 50 wards, a census tract, one of 77 community areas, one of 98 neighborhoods, and one of about 23 districts. These attributes are modelled in separate tables and connected with foreign keys.

Tickets.Location_Text

Plaintext locations, about 4.7 million unique rows, as entered in the original data set. I put them in a separate table to keep the size of the fact table down. You won’t realistically need this table unless you want to dig into unmapped or strangely mapped locations.

  • Location_Text_ID, int
  • Location_Text, varchar(50)

Tickets.License_Plates

One row for each license plate, about 17 million rows with 13.6 million unique plates. Plate numbers are hashed and can theoretically be reused on multiple vehicles over time, for instance if the owner has a personalized plate and gets a new vehicle. For that reason, every row has a From_Date and To_Date.

  • Plate_ID, int
  • License_Plate_Hash, binary(32)
  • From_Date, datetime2(0)
  • To_Date, datetime2(0)
  • ZIP, numeric(5, 0)
  • License_Plate_State, char(2)
  • Plate_Type, varchar(5)
  • Vehicle_Make, varchar(10)

Almost all plates have an issuing state, and most have a registered ZIP code.

Tickets.Ticket_Status

The “current” status of a ticket, 8 rows. For example: paid, contested, dismissed, part of a bankruptcy case, etc.

Tickets.Notice_Level

If a notice was sent out, and what kind, 6 rows.

Tickets.Officers

Badge number of the officer who issued the ticket. 46 000 rows.

Tickets.Units

The “agency” (police district, department, division, unit, etc) that issued the ticket, 187 rows.

Tickets.Officer_assignments

What unit each officer was assigned to, including from- and to dates (computed), 111 000 rows.

Tickets.Accounting

This is a derived fact table, based on the base dataset. As such, it contains assumptions and guesses with regards to dates and things like notice levels and late fees, but the amount should match up with the original dataset. 123 million rows.

  • Date, date
  • Ticket_number, bigint
  • Notice_Level_ID, tinyint
  • Plate_ID, int
  • Location_ID, int
  • Violation_ID, int
  • Amount_issued, numeric(10, 2)
  • Amount_written_off, numeric(12, 2)
  • Amount_paid, numeric(10, 2)
  • Amount_due, numeric(12, 2)

This table was designed to mimic a ledger-style table.

Schema “Chicago”

This schema contains publicly available data from the City of Chicago. The age varies between these data sources, so they won’t match perfectly in all respects. Geographical boundaries change over time, buildings and streets are constructed and torn down.

  • Chicago.Buildings, 488 692 rows
  • Chicago.Census_Tracts, 801 rows
  • Chicago.Community_Areas, 77 rows
  • Chicago.Intersections, 36 000 rows (computed)
  • Chicago.Neighborhoods, 98 rows
  • Chicago.Population_by_Census_Block, 46 291 rows
  • Chicago.Population_by_Zip_Code, 118 rows
  • Chicago.Socioeconomic_Indicators_per_Community_Area, 77 rows
  • Chicago.StreetNames, 2 582 rows
  • Chicago.Wards, 50 rows
  • Chicago.ZipCodes, 59 rows
  • Chicago.Police_Districts, 22 rows
  • Chicago.Police_Stations, 22 rows

Schema “ReferenceData”

This “catch-all” schema contains other information that was sourced publicly, but not from the City of Chicago.

  • ReferenceData.License_Plate_Types, 55 rows
  • ReferenceData.Months, 276 rows
  • ReferenceData.Population_by_Zip_Codes, 264 000 rows
  • ReferenceData.Population_by_State, 510 rows
  • ReferenceData.Vehicle_Makes, 6 253 rows
  • ReferenceData.States, 51 rows
  • ReferenceData.ZipCodes, 41 098 rows

Schema “Weather”

Think people are worse at parking when it’s raining? Or officers are grumpier when on a scorching summer’s day or a blistering winter day? Just for fun, I added monthly average temperatures and precipitation, so you could perhaps investigate a correlation between weather and parking tickets.

  • Weather.Precipitation, 276
  • Weather.Temperature, 276

“But Daniel, how do you pay for the bandwidth?”

I’ve placed the backup file on Cloudflare R2 storage, which is their flavor of S3 buckets – except, there are no egress costs. So hosting the 8 GB file will cost me about $0.20 per year, but the downloads won’t cost me anything.

21 comments

Leave a comment

Your email address will not be published. Required fields are marked *