Adventures in Tabbing

Beginnings

Recently, I decided that I would do some data sciencey stuff at home. I have a great new job, but it doesn't afford me time to work with reams of data or SQL. Since I am doing this on the cheap, I decided that I would learn Postgres. I found an old motherboard (AMD A10-7500K) that has eight SATA-3 connectors. I bought some cheap SSDs, installed CentOS 8 Stream, made myself a software RAID, and installed Postgres onto it. That was an adventure unto itself and a story for another day.

The Problem

The data for my experiments came to me in tab-delimited format. I am surely coming from a place of ignorance, and there are other ways I could have solved the problem I am about to describe, but I decided to take a direct approach. I have previously worked with MS SQL Server, and its import/export facilities are top notch. I was happy to discover that Postgres has a built-in import facility, but in the end I was not satisfied with the results, of which there were none. My data has empty strings in the place of null values, while Postgres expects \N to denote a null. This proved problematic for numeric and date columns. Since I am just learning Python, I decided to write a tab-delimited import facility that satisfied my needs and might be generally useful to the open source community at large.

The Approach

Here are the packages that were required to write this program:

import argparse
import getpass
import psycopg2
import psycopg2.extras

I wanted a generic utility that would be configurable from the command line, so Python's argparse package was a natural fit. I wanted the username and password to be accepted on the command line, with the program prompting for them if they were not supplied. The getpass module asks for a password without echoing characters. The psycopg2 package is what is used to talk to Postgres.
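Here is a rough sketch of that argument handling and the connection setup. The option names and defaults are my illustration here, not necessarily what the finished utility uses:

parser = argparse.ArgumentParser(description='Load a tab delimited file into a Postgres table.')
parser.add_argument('--host', default='localhost')
parser.add_argument('--dbname', required=True)
parser.add_argument('--schema', required=True)
parser.add_argument('--tablename', required=True)
parser.add_argument('--username')
parser.add_argument('--password')
parser.add_argument('filename')
args = parser.parse_args()

# Prompt for anything not supplied on the command line; getpass suppresses echoing.
username = args.username or input('Username: ')
password = args.password or getpass.getpass('Password: ')

db_connection = psycopg2.connect(host=args.host, dbname=args.dbname,
                                 user=username, password=password)
db_cursor = db_connection.cursor()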

For the program to work the way I wanted, it would need to be aware of the types of the data being imported. It is easy to get this information from the catalog. Here is an example of how this works.

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = %s and table_name = %s;
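In psycopg2, the %s placeholders are filled in by passing a tuple of parameters to execute. Assuming the query above is held in a string I'll call type_query, that looks something like this:

db_cursor.execute(type_query, (args.schema, args.tablename))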

The schema and table name are passed on the command line, and the results of the query are turned into a dictionary of column_name:data_type.

# Map each column_name to its data_type.
data_types = db_cursor.fetchall()
column_def = dict((x, y) for x, y in data_types)

Since I have to look up these types, the first row of the file must be a header row with the proper column names. Casing does not have to match. The remainder of the program consists of spinning through the rows of the tab-delimited flat file and replacing the empty strings that should be nulls with the Python None value.
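Roughly, the row handling looks something like the sketch below; the file handling and variable names are my own illustration of the idea rather than the program verbatim:

with open(args.filename, newline='') as flat_file:
    header = flat_file.readline().rstrip('\n').split('\t')
    # Lower-case the header so it lines up with the catalog's column names.
    column_names = [name.lower() for name in header]

    values = []
    for line in flat_file:
        fields = line.rstrip('\n').split('\t')
        # Empty strings become None so Postgres stores a real NULL,
        # which matters for numeric and date columns.
        values.append(tuple(field if field != '' else None for field in fields))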

A Note on Performance

Committing individual rows is not the most performant approach. The psycopg2.extras package provides execute_values, which allows multiple tuples to be inserted at one time. Here is the SQL.

insert_sql = 'INSERT INTO ' + args.schema + '.' + args.tablename + ' (' + ','.join(column_names) + ') VALUES %s'

When the group of rows is ready to be added, we use execute_values.

psycopg2.extras.execute_values(db_cursor, insert_sql, values)

There is a command line argument that specifies the number of rows to include in an individual commit. The batch size is limited only by the memory available to hold the tuples that will be loaded into Postgres in one commit.
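A minimal sketch of that batching, assuming a hypothetical --batchsize argument and the values list from the loop above:

# Inside the row loop: flush a batch once it reaches the requested size.
if len(values) >= args.batchsize:
    psycopg2.extras.execute_values(db_cursor, insert_sql, values)
    db_connection.commit()
    values = []

# After the loop: flush whatever is left over from the last partial batch.
if values:
    psycopg2.extras.execute_values(db_cursor, insert_sql, values)
    db_connection.commit()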

Conclusion

I am pleased with the results and am overall happy with this, my second Python program ever. I am sure I am not doing certain things the Pythonic way, but in the end it's results that count. I have other plans, such as being able to load a JSON file that maps column names in the file to different columns in the database (at this time they must match). Currently it isn't very verbose (okay, it is quite silent), but I plan on adding levels of verbosity. I've been looking for years for an open source project that I could create and am pleased to have finally found it. The code can be found here:

If nothing else, there is code that shows how to add many rows to Postgres quickly. I hope that someone besides me finds this useful. Lastly, please be kind; it's my first time putting code in the public sphere. Thank you for reading.
