Here's how you can upload tons of data to a PostgreSQL database in a snap

Prerequisites

  • Python
  • A CSV file containing data

1. Install psycopg2

$ pip install psycopg2
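If that build fails because the PostgreSQL development headers aren't installed on your machine, the pre-built binary package usually works as a drop-in for development setups:

$ pip install psycopg2-binary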

2. Imports

Spin up a text editor and add these imports to your Python file. We will use the built-in csv module to read data from a CSV file.

import psycopg2.sql as psql
import psycopg2.extras
import psycopg2
import csv

If you're curious, psycopg2.sql contains SQL-related helpers for safely generating SQL queries from Python, and psycopg2.extras contains the most important function of this tutorial - execute_values() - which we will use to finally execute the import query.
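To get a feel for these helpers, here's a tiny sketch that isn't part of the import script itself; it assumes conn is an already-open connection (we'll create one in the next step):

# Compose a query with safely quoted identifiers, then render it to a string
q = psql.SQL("SELECT {} FROM {}").format(
    psql.Identifier("column1"),
    psql.Identifier("your_table")
)
print(q.as_string(conn))  # SELECT "column1" FROM "your_table"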

3. Create a database connection object and cursor

These objects can be treated as singletons: they hold the connection to the database and can be reused throughout the script.

DB = psycopg2.connect(
    dbname='your_database',
    user='postgres',
    host='localhost',
    password='password'
)
db_cursor = DB.cursor()
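As a side note, psycopg2.connect() also accepts a single libpq-style connection string instead of keyword arguments, if you prefer that form; this sketch uses the same placeholder credentials as above:

# Equivalent connection using a single DSN string
DB = psycopg2.connect("dbname=your_database user=postgres password=password host=localhost")
db_cursor = DB.cursor()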

4. Load the data from your CSV file

  • Place your CSV file in the same directory as your Python script (or just update the file path to point to wherever it lives)
  • You can use the following code to read the CSV data
with open('your_csv_file.csv') as f:
    reader = csv.reader(f, delimiter=',')
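    # If your CSV file has a header row, skip it before the loop with: next(reader)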
    allrows = []

    for line in reader:
        allrows.append({ "column1": line[0], "column2": line[1] })
  • line in the for loop is a list of all the values in the CSV row currently being looped through. line[0] would be the first field, line[1] the second, and so on.
  • For each line/row in the CSV that is looped over, we create a dictionary (map) where the keys are the field/column names and the values are... the values.
  • The allrows variable will store a list of all the created dictionaries (each representing a row that we will insert).

PS: It's important that the keys exactly match the column names of the table you want to insert your data into; the rest of the steps rely on that. You can of course choose different names, but then you'd need to update the steps below accordingly.
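If your CSV file has a header row whose names already match the table's column names, you can let csv.DictReader build these dictionaries for you; a minimal sketch, assuming the same file name:

# csv.DictReader uses the header row as the keys, so each row is already a dict
with open('your_csv_file.csv', newline='') as f:
    allrows = list(csv.DictReader(f))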

5. Generate re-usable SQL

In this step, we will generate some SQL using the following function:

def create_sql(table_name, column_names):
    # Full INSERT statement with safely quoted table and column identifiers;
    # the %s after VALUES will be filled in by execute_values() later.
    query = psql.SQL("INSERT INTO {} ({}) VALUES %s").format(
        psql.Identifier(table_name),
        psql.SQL(', ').join(map(psql.Identifier, column_names))
    )
    # Row template with one named placeholder per column
    query2 = psql.SQL("({})").format(
        psql.SQL(', ').join(map(psql.Placeholder, column_names)),
    )

    return query.as_string(DB), query2.as_string(DB)
  • The helper functions used here make it easy to dynamically create SQL queries based on how many, and which, columns we want to insert into
  • We are not really running two queries; query2 is just the row template, the part that comes after VALUES in a typical INSERT query.

Call the function with the name of the table you want to insert into, as well as the column names:

insert_sql, template_sql = create_sql('your_table', ['column1', 'column2'])

I encourage you to add print statements or breakpoints to inspect what the different values look like, for example by printing allrows, insert_sql and template_sql to see how your dummy data will get inserted.
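For reference, with the dummy table and column names used above, the two generated strings should come out roughly like this (psycopg2 double-quotes the identifiers and uses named placeholders that match the dictionary keys):

# Approximate result of create_sql('your_table', ['column1', 'column2'])
insert_sql   = 'INSERT INTO "your_table" ("column1", "column2") VALUES %s'
template_sql = '(%(column1)s, %(column2)s)'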

6. Execute the query

psycopg2.extras.execute_values(db_cursor, insert_sql, allrows, template_sql)

Note that the changes are not yet committed. This is a good checkpoint to test for errors (if you get no errors up to this point, you're good to perform the final step).
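If you want to make that checkpoint explicit, here's a minimal sketch (using the same objects from the steps above) that rolls back the transaction if the insert fails:

# Roll back on failure so the connection stays usable, then re-raise to see the error
try:
    psycopg2.extras.execute_values(db_cursor, insert_sql, allrows, template_sql)
except psycopg2.Error:
    DB.rollback()
    raise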

7. Commit the changes 🎉

DB.commit()
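Once the commit succeeds, you can optionally close the cursor and the connection:

db_cursor.close()
DB.close()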

Congratulations! You can now re-use this script to import huge amounts of data with a single command 🪄

If you have questions about any part or thoughts you want to share, feel free to mention those in the comments below :)
