(Slightly) Quicker PySpark Tests

First of all, let me start by saying this isn't a post about fancy new technology, but rather about how to slightly improve the timing of automated pipelines that need to run PySpark tests.

Recently I needed to implement an automated workflow, used in multiple git repositories, which required PySpark tests to run on every commit made to an open pull/merge request, as well as after merge (because, you know... we love tests).

As usual, the team was in a rush to put together a lot of different elements simultaneously but, in the spirit of Agile development, we wanted to make sure that (1) everybody understood how to run the tests and (2) the experience was (or could be) the same whether running them on a local machine or in our CI/CD pipeline.

If you have experience with (Py)Spark, there are a lot of common mishaps you'll know how to handle, but newcomers to the technology sometimes find it hard to identify them: wrong Java version installed/selected, wrong Python version, an input test file that doesn't exist, etc. If there's a small change to introduce to the business logic, it should be easy to know if (and where) the intended use of the logic was broken.
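To give an idea of what these tests look like, here's a minimal sketch of a session-scoped SparkSession fixture and a test. The file layout, names and the transformation are illustrative, not our actual business logic:

# conftest.py - illustrative sketch, not our actual project code
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # one local SparkSession shared by the whole test session;
    # failures here usually point at a wrong Java/Python setup
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("pyspark-tests")
        .getOrCreate()
    )
    yield session
    session.stop()

# test_logic.py - a trivial transformation, standing in for real business logic
from pyspark.sql import functions as F

def test_uppercase_names(spark):
    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])
    result = df.select(F.upper("name").alias("name"))
    assert [row.name for row in result.collect()] == ["ALICE", "BOB"]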

Originally, we ended up with this very practical Dockerfile and docker-compose.yaml, which installed whatever requirements.txt we placed in the root of the repo and ran all the tests.

FROM openjdk:8
ARG PYTHON_VERSION=3.7
ENV PATH="/root/miniconda3/bin:${PATH}"

# provision python with miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
    && mkdir /root/.conda \
    && bash Miniconda3-latest-Linux-x86_64.sh -b \
    && rm -f Miniconda3-latest-Linux-x86_64.sh \
    && conda install python=$PYTHON_VERSION -y \
    && conda init bash

RUN . /root/.bashrc

We placed it in tests/Dockerfile, and the following docker-compose.yaml in the root of the repo.

version: "3.0"

services:
  test:
    build:
      context: .
      dockerfile: ./tests/Dockerfile
      args:
        PYTHON_VERSION: 3.7
    working_dir: /app
    volumes:
      - .:/app/
    command:
      - /bin/bash
      - -c
      - |
        pip install pytest
        pip install -r ./requirements.txt # including pyspark==2.4.2
        pytest ./tests

Then, you can just run docker-compose up test, either locally or in the designated CI/CD pipeline step, and see the results. At least there's no excuse for anyone to say they can't run the tests.
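For reference, the CI/CD side can be as small as the hypothetical GitHub Actions job below (any other CI engine works much the same); note the --exit-code-from flag, which makes the step fail when the test container fails:

# .github/workflows/tests.yaml - hypothetical example, adapt to your CI engine
name: tests

on: [push, pull_request]

jobs:
  pyspark-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run PySpark tests
        run: docker-compose up --exit-code-from test test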

But waiting for this on every commit...?

If you try to run all of this, you'll notice that just provisioning Java, Python and then PySpark itself takes somewhere between 2 and 4 minutes, depending on your CI/CD engine's muscle and network bandwidth. That doesn't seem very significant, but if you run it on every commit and it's required to pass before a pull request can be merged, it adds up over time.

For this reason, we created a public image on Docker Hub, which you can use in your project. It's by no means rocket science, but it will hopefully be as convenient for you as it was for us.

Yeah, but there are Jupyter PySpark notebook images out there with a similar pre-installed stack.

Right, right... But we wanted (1) something stripped of unneeded packages, (2) the ability to choose specific combinations of Python and PySpark versions per Docker image and (3) a smaller download footprint - our images are almost half the size when decompressed (1.7 GB vs 3.3 GB).

Slightly quicker

The quick improvement is then to simply reference the mklabsio/pyspark:py37-spark242 image directly in your docker-compose.yaml, get rid of the Dockerfile and leave everything else as before.

version: "3.0"

services:
  test:
    image: mklabsio/pyspark:py37-spark242
    working_dir: /app
    volumes:
      - .:/app/
    command:
      - /bin/bash
      - -c
      - |
        pip install pytest
        pip install -r ./requirements.txt # including pyspark==2.4.2
        pytest ./tests

We also took the opportunity to create a simple GitHub Actions workflow to build all the required combinations of Python, Java and PySpark.
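The gist of it is a build matrix. A simplified sketch could look like the following (the matrix entries, any tag other than py37-spark242, and the build args are illustrative, not the exact workflow in the repository):

# Simplified sketch of a matrix build - not the exact workflow from the repository
name: build-images

on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        include:
          - { python: "3.7", spark: "2.4.2", tag: py37-spark242 }
          - { python: "3.7", spark: "2.4.5", tag: py37-spark245 }
    steps:
      - uses: actions/checkout@v2
      - name: Build image
        run: >
          docker build
          --build-arg PYTHON_VERSION=${{ matrix.python }}
          --build-arg PYSPARK_VERSION=${{ matrix.spark }}
          -t mklabsio/pyspark:${{ matrix.tag }} .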

Hope this helps you in some way. If you see room for improvement, let us know in the comments below or feel free to open an issue in the dedicated GitHub repository:
