(Slightly) Quicker PySpark Tests
First of all, let me start by stating this isn't a post about fancy new technology, but rather about sharing how to slightly improve the timing of automated pipelines that run PySpark tests.
Recently I needed to implement an automated workflow, used in multiple git repositories, which required PySpark tests to run on every commit made to an open pull/merge request as well as after merge (because, you know... we love tests).
As usual, the team was in a rush to put together a lot of different elements simultaneously but, in the spirit of Agile development, we wanted to make sure that (1) everybody understood how to run the tests and (2) the experience was (or could be) the same whether running them on a local machine or in our CI/CD pipeline.
If you have experience with (Py)Spark, there are a lot of common mishaps you'll know how to handle, but newcomers to the technology sometimes have a hard time identifying them: wrong Java version installed/selected, wrong Python version, input test file doesn't exist, etc. If there's a small change to introduce to the business logic, it should be easy to know if (and where) the intended use of that logic was broken.
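To make this concrete, here's the kind of test we're talking about: a session-scoped SparkSession fixture plus a tiny pytest case. Note that the add_one transformation and the column names are made up for illustration; they're not from our actual codebase.

```python
# tests/test_transformations.py -- a minimal sketch, not the real test suite.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


@pytest.fixture(scope="session")
def spark():
    # One local SparkSession shared across the whole test session.
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("pyspark-tests")
        .getOrCreate()
    )
    yield session
    session.stop()


def add_one(df, column):
    # Hypothetical piece of business logic under test.
    return df.withColumn(column, F.col(column) + 1)


def test_add_one(spark):
    df = spark.createDataFrame([(1,), (2,)], ["value"])
    result = add_one(df, "value").collect()
    assert [row["value"] for row in result] == [2, 3]
```

If any of the environment prerequisites (Java, Python, PySpark versions) are off, this is exactly where things blow up, which is why we wanted the environment itself to be reproducible.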
Originally, we ended up with this very practical Dockerfile and docker-compose.yaml, which installed whatever requirements.txt we placed in the root of the repo and ran all the tests.
FROM openjdk:8
ARG PYTHON_VERSION=3.7
ENV PATH="/root/miniconda3/bin:${PATH}"
# provision python with miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
    && mkdir /root/.conda \
    && bash Miniconda3-latest-Linux-x86_64.sh -b \
    && rm -f Miniconda3-latest-Linux-x86_64.sh \
    && conda install python=$PYTHON_VERSION -y \
    && conda init bash
RUN . /root/.bashrc
We placed it in tests/Dockerfile and then put the following docker-compose.yaml in the root of the repo.
version: "3.0"
services:
  test:
    build:
      context: .
      dockerfile: ./tests/Dockerfile
      args:
        PYTHON_VERSION: 3.7
    working_dir: /app
    volumes:
      - .:/app/
    command:
      - /bin/bash
      - -c
      - |
        pip install pytest
        pip install -r ./requirements.txt # including pyspark==2.4.2
        pytest ./tests
Then, you can just run docker-compose up test, either locally or in the designated CI/CD pipeline step, and see the results. At the very least, there's no excuse for anyone to say they can't run the tests.
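As a reference, here's a sketch of what that pipeline step could look like as a GitHub Actions workflow. The workflow name and triggers are illustrative (adapt them to your CI engine of choice), and we add --exit-code-from so the job actually fails when the tests fail:

```yaml
# .github/workflows/tests.yml -- an illustrative workflow, not a prescribed setup.
name: pyspark-tests
on: [pull_request, push]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # Runs the same compose service we run locally, so the experience is identical.
      - name: Run tests in the compose service
        run: docker-compose up --exit-code-from test test
```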
If you try to run all of this, you'll notice that just provisioning Java, Python and then PySpark itself takes somewhere between 2-4 minutes, depending on your CI/CD engine muscle and network bandwidth. That doesn't seem very significant, but if such logic runs on every commit and is required to pass before a pull request can be merged, it might become relevant over time.
For this reason, we created a public image on Docker Hub, which you can use in your project and which is by no means rocket science, but will hopefully be as convenient for you as it was for us.
Yeah, but there are jupyter pyspark notebook images out there with a similar pre-installed stack.
Right, right... But we wanted (1) something stripped of unneeded packages, (2) the ability to choose specific combinations of Python and PySpark versions per Docker image and (3) a smaller download footprint: the images are almost half the size when decompressed (1.7 GB vs 3.3 GB).
The quick improvement would then be to simply reference the mklabsio/pyspark:py37-spark242 image directly in your docker-compose.yaml, get rid of the Dockerfile and leave everything else as before.
version: "3.0"
services:
  test:
    image: mklabsio/pyspark:py37-spark242
    working_dir: /app
    volumes:
      - .:/app/
    command:
      - /bin/bash
      - -c
      - |
        pip install pytest
        pip install -r ./requirements.txt # including pyspark==2.4.2
        pytest ./tests
We also took the opportunity to create a simple GitHub Actions workflow to build all the required combinations of Python, Java and PySpark.
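We won't reproduce it here in full, but a simplified sketch of such a matrix build is shown below. The build args, tag scheme and secret names are assumptions for the example, not necessarily what the repository uses:

```yaml
# .github/workflows/build.yml -- a simplified sketch of a matrix image build.
name: build-images
on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        include:
          - python: "3.7"
            spark: "2.4.2"
            tag: py37-spark242
          # ...one entry per supported Python/Java/PySpark combination
    steps:
      - uses: actions/checkout@v2
      - name: Build image
        run: |
          docker build \
            --build-arg PYTHON_VERSION=${{ matrix.python }} \
            --build-arg SPARK_VERSION=${{ matrix.spark }} \
            -t mklabsio/pyspark:${{ matrix.tag }} .
      - name: Push image
        run: |
          echo "${{ secrets.DOCKERHUB_TOKEN }}" | docker login -u "${{ secrets.DOCKERHUB_USER }}" --password-stdin
          docker push mklabsio/pyspark:${{ matrix.tag }}
```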
Hope this helps you in some way, and if you see room for improvement, let us know in the comments below or feel free to open an issue in the dedicated GitHub repository: