The best ten years for movies were 1994-2003, and I have proof

Stay with me. This is not clickbait!

I made you a script so you can create your own SQL Server database using the public IMDB dataset. All you need is a SQL Server 2022, Powershell 7, and a little time.

If you didn’t know, the IMDB dataset is public, and you can download it for free. There are some license restrictions, but as long as you use the data for your own, non-commercial purpose, you’ll be fine. I’ve always claimed that the years 1994 thru 2003 were a true modern golden age of Hollywood movies, and I wanted to find out if I could prove it.

It has to be acknowledged, though, that the IMDB movie ratings are pretty far from perfect. Look, I’m not saying that reviewers have poor taste, but please consider the fact that “Avengers: Infinity War” and “Spider-Man: Into the Spider-Verse” both have an impressive average rating of 8.4, while the epic crime noir masterpiece that is “Heat” comes in at a meager 8.3, and the genre-defining cyberpunk classic “Bladerunner” gets just a 8.1 average.

I’m sure Spider-Man is a fine movie, but come on. Anyway, I digress.

How to create your own IMDB database in SQL Server

You’ll need a couple of scripts that I’ve put in a Github repository, a SQL Server 2022, and about 20 GB of free disk space to do this.

  • Create a new SQL Server database.
  • Create the database schema by running the Create IMDB-schema.sql script in the database.
  • Download and import the IMDB dataset using the Load-ImdbStaging.ps1 script. The script will download the data files to your current working directory and perform the ingest from there. The download is just a few GB, but inserting the data may take a few hours depending on your hardware.
  • Populate the relational tables using the Load IMDB relational tables.sql script.
  • Once this is done, you can drop the staging tables in the Raw schema if you want.

The proof

I’ve chosen the following highly scientific method to determine the best consecutive ten years of moviemaking, not because I know anything about statistical analysis, but because it kind of confirms my bias what already I knew to be true.

--- Movies
--- * with at least 50,000 votes
--- * at least 1 hours long
WITH movies AS (
    SELECT YEAR(t.startYear) AS startYear, t.averageRating, t.voteCount,
           ROW_NUMBER() OVER (
                PARTITION BY t.startYear
                ORDER BY t.averageRating DESC, t.voteCount DESC
                ) AS ranking,
           tn.title
    FROM dbo.Titles AS t
    INNER JOIN dbo.TitleTypes AS tt ON t.titleTypeId=tt.titleTypeId
    LEFT JOIN dbo.TitleNames AS tn ON t.titleId=tn.titleId AND tn.isOriginal=1
    WHERE t.voteCount>=50000
      AND tt.titleType='movie'
      AND t.runtime>='01:00:00'),

--- Limit to the top-10 each year to filter out the low-quality stuff
--- you would never have watched anyway
top10 AS (
    SELECT *
    FROM movies
    WHERE ranking<=10),

--- Create a rolling aggregate of 10 years. The average is weighted using
--- the vote count, so a movie with a 9.0 and 50,000 votes does not impact
--- the average as much as a movie with a 7.0 with a million votes.
yearAggregate AS (
    SELECT startYear,
           --- Rolling 10-year vote count
           SUM(SUM(voteCount)) OVER (ORDER BY startYear ROWS BETWEEN CURRENT ROW AND 9 FOLLOWING) AS voteCount_10yrs,
           --- Rolling 10-year average rating
           SUM(SUM(averageRating*voteCount)) OVER (ORDER BY startYear ROWS BETWEEN CURRENT ROW AND 9 FOLLOWING)
          /SUM(SUM(voteCount)) OVER (ORDER BY startYear ROWS BETWEEN CURRENT ROW AND 9 FOLLOWING) AS averageRating_10yrs,
           --- Collect the top 1 movie for the starting year
           MIN((CASE WHEN ranking=1 THEN title END)) AS top1_title
    FROM top10
    GROUP BY startYear)

--- Finally, DENSE_RANK() the results to see which year(s) come out on top.
SELECT startYear,
       startYear+9 AS endYear,
       top1_title,
       voteCount_10yrs,
       averageRating_10yrs,
       DENSE_RANK() OVER (ORDER BY averageRating_10yrs DESC) AS [rank]
FROM yearAggregate AS y
ORDER BY startYear;

If you’ve made it this far, I’m sure you have opinions. Let me hear them in the comments.

Leave a comment

Your email address will not be published. Required fields are marked *