Data behind parking tickets in New York City

Here, we shall analyse the "NYC Parking Tickets" dataset from Kaggle. To dig deeper into the data and uncover various insights, I will introduce such tools as Docker and Metabase, and explain the setup step by step.

Live demo of the resulting dashboard can be found here.

About the dataset

The NYC Department of Finance collects data on every parking ticket issued in NYC (~10M per year!). This data is made publicly available to aid in ticket resolution and to guide policymakers.

This dataset is quite large - it contains about 42.3M rows, and covers the period from Aug 2013 to June 2017. Columns include information about the vehicle ticketed, the ticket issued, location, and time.

Here are some questions that might be interesting to address

When are tickets most likely to be issued? Any seasonality?
Where are tickets most commonly issued?
What are the most common years and types of cars to be ticketed?
Geographical distribution of the parking violations
What time of the day are you most likely to get a ticket?
What police precincts issue the most tickets?

Tools

The raw data is provided via multiple CSV files, the combined size of which exceeds 8GB. This is hard to analyse directly. My approach will involve importing the data into a Postgres database first, so I can run SQL and iterate faster.

Then, I will connect an open-source BI solution called Metabase and will build the resulting queries and the dashboard using Metabase's UI. Metabase is a great BI tool for individuals and small startups and is an ideal solution for prototyping.

In addition, it is open-source and has solid visualisation capabilities, and will be ideal for small teams that might not even have a data analyst.

The above will be packed into a Docker container using Compose and deployed to AWS Cloud for demo purposes. I'd assume that you have some basic understanding of containers, Docker, and Compose, but just in case, here's a link to the "Get Started" section of Docker documentation.

So, here's the summary of the steps we're going to take:

Download CSV files from Kaggle
Create Postgres container and import the files into a database
Create Metabase container and attach it to our Postgres database
Run analysis using SQL queries and build the resulting dashboard using Metabase

Importing data into Postgres

Container Definition

Our Postgres container definition will look like this

FROM postgres:13.3
COPY pg.conf /etc/postgresql/postgresql.conf
COPY init.sql /docker-entrypoint-initdb.d/

# importing data using 'copy' command
COPY ./data/* /docker-entrypoint-initdb.d/

CMD ["-c", "config_file=/etc/postgresql/postgresql.conf"]

We are using postgres:13.3 image from the official Postgres repository, and we are providing our own Postgres config, which is almost identical to the default one, with only a few adjustments in ./containers/postgres/pg.conf:

statement_timeout = 1200000 # in milliseconds (20 minutes) to allow COPY command to finish
max_wal_size = 3GB

Let's download the data files and put them in ./containers/postgres/data folder under the following names:

./containers/postgres/data/year_2014.csv
./containers/postgres/data/year_2015.csv
./containers/postgres/data/year_2016.csv
./containers/postgres/data/year_2017.csv

Database Initialisation

Now, let's look at the init.sql file in the postgres folder. There's an SQL definition of the table nyc_tickets that will contain all our data. I will create a few indices on the table, to speed up the future queries:

CREATE INDEX "IDX_tickets_vehicle_make" ON "nyc_tickets" ("vehicle_make");
CREATE INDEX "IDX_tickets_issue_date" ON "nyc_tickets" ("issue_date");

⚠️Cleaning and Preparing the Data

As per the dataset description on Kaggle, I am using summons_number as the primary key. However, in the input CSV files, there is a number of collisions and duplicates (i.e. multiple rows with identical summons_number values).

Normally, when inserting a possible duplicate into a table in Postgres, I would use something like

INSERT ... ON CONFLICT DO NOTHING;

or use a rule

CREATE RULE ... AS ON INSERT TO ... WHERE EXISTS ... DO INSTEAD NOTHING;

but, unfortunately, COPY command won't allow that syntax. So instead, we'll apply the following trick:

First, we will create a temporary table temp_tickets that is identical to the original but doesn't contain any primary keys or indices.
Then, we will COPY the data to that temporary table.
And finally, we will copy the data from the temporary table to nyc_tickets using ON CONFLICT DO NOTHING rule.

One thing to note though is that the 2017 CSV file doesn't have 8 additional columns, which are present in the other files (2013-2016). To address that issue, I will specify which exact fields I need to copy, while the rest will be replaced with default values:

COPY "temp_tickets" ("summons_number", "plate_id", "registration_state", "plate_type", "issue_date", "violation_code",
                     "vehicle_body_type", "vehicle_make", "issuing_agency", "street_code_1", "street_code_2",
                     "street_code_3", "vehicle_expiration_date", "violation_location", "violation_precinct",
                     "issuer_precinct", "issuer_code", "issuer_command", "issuer_squad", "violation_time",
                     "time_first_observed", "violation_county", "violation_in_front_of_or_opposite", "house_number",
                     "street_name", "intersecting_street", "date_first_observed", "law_section", "sub_division",
                     "violation_legal_code", "days_parking_in_effect", "from_hours_in_effect", "to_hours_in_effect",
                     "vehicle_color", "unregistered_vehicle", "vehicle_year", "meter_number", "feet_from_curb",
                     "violation_post_code", "violation_description", "no_standing_or_stopping_violation",
                     "hydrant_violation", "double_parking_violation")
FROM '/docker-entrypoint-initdb.d/year_2017.csv' DELIMITER ',' CSV HEADER;

Here is what the final version of init.sql looks like:

-- creating Metabase DB first
CREATE DATABASE "metabase";
GRANT ALL PRIVILEGES ON DATABASE "metabase" TO "analyst";

BEGIN;
GRANT ALL PRIVILEGES ON DATABASE "tickets" TO "analyst";
CREATE TABLE "nyc_tickets"
(
    "summons_number"                    BIGINT NOT NULL,
    "plate_id"                          TEXT,
    "registration_state"                TEXT,
    "plate_type"                        TEXT,
    "issue_date"                        DATE,
    "violation_code"                    INTEGER,
    "vehicle_body_type"                 TEXT,
    "vehicle_make"                      TEXT,
    "issuing_agency"                    TEXT,
    "street_code_1"                     INTEGER,
    "street_code_2"                     INTEGER,
    "street_code_3"                     INTEGER,
    "vehicle_expiration_date"           TEXT,
    "violation_location"                TEXT,
    "violation_precinct"                INTEGER,
    "issuer_precinct"                   INTEGER,
    "issuer_code"                       INTEGER,
    "issuer_command"                    TEXT,
    "issuer_squad"                      TEXT,
    "violation_time"                    TEXT,
    "time_first_observed"               TEXT,
    "violation_county"                  TEXT,
    "violation_in_front_of_or_opposite" TEXT,
    "house_number"                      TEXT,
    "street_name"                       TEXT,
    "intersecting_street"               TEXT,
    "date_first_observed"               TEXT,
    "law_section"                       INTEGER,
    "sub_division"                      TEXT,
    "violation_legal_code"              TEXT,
    "days_parking_in_effect"            TEXT,
    "from_hours_in_effect"              TEXT,
    "to_hours_in_effect"                TEXT,
    "vehicle_color"                     TEXT,
    "unregistered_vehicle"              TEXT,
    "vehicle_year"                      INTEGER,
    "meter_number"                      TEXT,
    "feet_from_curb"                    DECIMAL,
    "violation_post_code"               TEXT,
    "violation_description"             TEXT,
    "no_standing_or_stopping_violation" TEXT,
    "hydrant_violation"                 TEXT,
    "double_parking_violation"          TEXT,
    "latitude"                          DECIMAL,
    "longitude"                         DECIMAL,
    "community_board"                   TEXT,
    "community_council"                 TEXT,
    "census_tract"                      TEXT,
    "bin"                               TEXT,
    "bbl"                               TEXT,
    "nta"                               TEXT,
    CONSTRAINT "PK_tickets" PRIMARY KEY ("summons_number")
);

CREATE INDEX "IDX_tickets_vehicle_make" ON "nyc_tickets" ("vehicle_make");
CREATE INDEX "IDX_tickets_issue_date" ON "nyc_tickets" ("issue_date");

COMMIT;

BEGIN;

CREATE TEMP TABLE "temp_tickets"
(
    LIKE "nyc_tickets"
) ON COMMIT DROP;

COPY "temp_tickets" FROM '/docker-entrypoint-initdb.d/year_2014.csv' DELIMITER ',' CSV HEADER;
COPY "temp_tickets" FROM '/docker-entrypoint-initdb.d/year_2015.csv' DELIMITER ',' CSV HEADER;
COPY "temp_tickets" FROM '/docker-entrypoint-initdb.d/year_2016.csv' DELIMITER ',' CSV HEADER;
COPY "temp_tickets" ("summons_number", "plate_id", "registration_state", "plate_type", "issue_date", "violation_code",
                     "vehicle_body_type", "vehicle_make", "issuing_agency", "street_code_1", "street_code_2",
                     "street_code_3", "vehicle_expiration_date", "violation_location", "violation_precinct",
                     "issuer_precinct", "issuer_code", "issuer_command", "issuer_squad", "violation_time",
                     "time_first_observed", "violation_county", "violation_in_front_of_or_opposite", "house_number",
                     "street_name", "intersecting_street", "date_first_observed", "law_section", "sub_division",
                     "violation_legal_code", "days_parking_in_effect", "from_hours_in_effect", "to_hours_in_effect",
                     "vehicle_color", "unregistered_vehicle", "vehicle_year", "meter_number", "feet_from_curb",
                     "violation_post_code", "violation_description", "no_standing_or_stopping_violation",
                     "hydrant_violation", "double_parking_violation")
    FROM '/docker-entrypoint-initdb.d/year_2017.csv' DELIMITER ',' CSV HEADER;

INSERT INTO "nyc_tickets"
SELECT *
FROM "temp_tickets"
ON CONFLICT DO NOTHING;

COMMIT;

Metabase container

Metabase dockerfile doesn't contain anything except for the official image and environment files

FROM metabase/metabase:v0.39.4

Docker Compose

Now, we are ready to combine both containers into a docker-compose setup:

version: "3.0"

services:
  postgres:
    build: ./postgres
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_USER=analyst
      - POSTGRES_PASSWORD=tickets
      - POSTGRES_DB=tickets
    volumes:
      - pg-data:/var/lib/postgresql/data

  metabase:
    # Database env variables are set in environment specific env files in ../containers/metabase
    build: ./metabase
    ports:
      - "3000:3000"

volumes:
  pg-data:

Let’s now launch the containers:

cd path/to/project
docker-compose -f containers/compose.yml up

Note, that you should probably give enough system resources to Docker, and it can take up to 10 minutes to import all 42M records into the database. Once it’s done, it time now to open Metabase at localhost:3000 and after a quick Metabase setup, start building our dashboard.

Analysis and Visualisation

This is probably the most interesting part of this tutorial, as we will be building SQL queries here to uncover different kinds of insights and visualising them on a dashboard.

Our first question is very simple: what types of cars are the most likely to receive a parking ticket

SELECT COUNT(*) AS "vehicle_count", "vehicle_make"
FROM "nyc_tickets" 
GROUP BY "vehicle_make" 
ORDER BY "vehicle_count" DESC

Let's group further by the make and year:

SELECT COUNT(*) AS "vehicle_count", concat("vehicle_make", '_', "vehicle_year") as make_year
FROM "nyc_tickets"
GROUP BY "vehicle_make", "vehicle_year" 
ORDER BY "vehicle_count" DESC

When are tickets most likely to be issued? Any seasonality?

For this one, we will group our results using a Postgres function [date_trunc](https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC) which truncates the input date to the specified date precision, in our case month

SELECT date_trunc('month', issue_date) AS "ticket_month", COUNT(*) AS "tickets_count" 
FROM "nyc_tickets"
WHERE issue_date BETWEEN '2013-07-01 00:00:00+00' AND '2017-07-01 00:00:00+00'
GROUP BY "ticket_month" 
ORDER BY "ticket_month" ASC

What time of the day are you most likely to get a ticket:

SELECT COUNT(*) AS "vehicle_count", concat(left("violation_time", 2), right("violation_time", 1)) as vtime
FROM "nyc_tickets"
GROUP BY vtime 
ORDER BY "vehicle_count" DESC

Distribution by month

SELECT to_char("issue_date", 'month') AS "ticket_month", COUNT(*) AS "tickets_count" 
FROM "nyc_tickets"
WHERE issue_date BETWEEN '2013-07-01 00:00:00+00' AND '2017-07-01 00:00:00+00'
GROUP BY "ticket_month" 
ORDER BY "ticket_month" DESC

July is the most quit months, while October, March and May are when more tickets are issued.

Tickets issued by Police precinct

SELECT COUNT(*) AS "vehicle_count", "violation_precinct"
FROM "nyc_tickets"
WHERE "violation_precinct" != 0
GROUP BY "violation_precinct"
ORDER BY "vehicle_count" DESC

Visualising geographical distribution is a bit hard because the initial dataset doesn’t have geodata in it. Enriching the data by geocoding the address would not be possible either, as geocoding of 42M rows will cost approximately ~$20,000.

Instead, we can use the police precinct id as the geolocation source and get the geodata from here.