Parameterizing and automating Jupyter notebooks with papermill
Have you ever created a Jupyter notebook and wished you could generate the notebook with a different set of parameters? If so, you’ve probably done at least one of the following:
- Edited the variables in a cell and reran the notebook, saving off a copy as needed
- Saved a copy of the notebook and maybe hacked up code to edit the values directly in the .ipynb files and reran notebooks
- Built some custom code to set the variables with data loaded from a database or configuration file, then reran the notebook
It turns out that there is a good solution for this problem that parameterizes interactive notebooks and coexists well with automated jobs: papermill.
Many notebook authors use the standard practice of designating a cell near the top of their notebooks for global variables. The author or other users of the notebook then modify the values in the cell and run the entire notebook to obtain different results. To persist the output, the author will manually download the notebook in another format or save it as a different notebook file. But using only a notebook server and these manual methods can quickly become messy and difficult to track, not to mention error-prone. Which notebook is the one you edit? Papermill helps solve this problem. In this article, I’ll introduce papermill and basic usage, walk through an example of parameterization, and finally talk about ways to fully schedule and automate notebook execution using cron.
With papermill, a special cell in the notebook is designated for parameters. When papermill executes a parameterized notebook, either via the command line interface (CLI) or using the Python API, parameters are passed in and executed in a subsequent cell. This allows the notebook to be run multiple times with different parameters quickly. The resulting executed notebook can then be saved in a variety of places, including local or cloud storage.
To install papermill, use pip. I’d recommend using a virtual environment using virtualenv or conda. I often recommend using pyenv to install a recent Python version and for creating a virtualenv. But use whatever you are most comfortable with.
pip install papermill
If you would like to use the various input and output options (like Amazon’s s3 or Microsoft’s azure), you can install all the dependencies. I won’t get into the detail here, but the documentation covers those options, and you can even extend papermill to add other handlers for input/output (I/O) of notebooks.
pip install papermill[all]
The first thing most users will want to do with papermill is parameterize a notebook. I made a simple example notebook that you can download and follow along. Once you have Jupyter running and have opened a notebook, all you need to do is add a parameters tag to the cell with parameters in it.
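If you are curious what the tag looks like on disk, a notebook file is just JSON, and the tag lives in the cell's metadata under a tags list. Here is a minimal sketch that finds tagged cells using only the standard library. The helper names and the tiny in-memory notebook are my own for illustration; they are not part of papermill.

```python
import json


def load_notebook(path: str) -> dict:
    """Read an .ipynb file, which is plain JSON, into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_tagged_cells(notebook: dict, tag: str = "parameters") -> list:
    """Return the indexes of cells whose metadata carries the given tag."""
    return [
        index
        for index, cell in enumerate(notebook.get("cells", []))
        if tag in cell.get("metadata", {}).get("tags", [])
    ]


# A tiny in-memory notebook (hypothetical content) with one tagged cell.
example_notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Hello"},
        {
            "cell_type": "code",
            "metadata": {"tags": ["parameters"]},
            "source": 'name = "Joe"',
            "outputs": [],
            "execution_count": None,
        },
    ],
}

print(find_tagged_cells(example_notebook))
```

To check a real file, pass the result of load_notebook to find_tagged_cells. An empty list means the tag has not been added (or saved) yet.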
Save the notebook, and now you are ready to execute it using papermill. For the example notebook, use the CLI to run the notebook, supplying your own name.
papermill -p name Matt papermill_example1.ipynb papermill_matt.ipynb
This command is telling papermill to execute the input notebook papermill_example1.ipynb and write the output to papermill_matt.ipynb, while setting the parameter name to the value Matt. If you open the resulting notebook, the contents will now include a new cell after the parameters-tagged one with an injected-parameters tag like this.
You should now see how you can add as many parameters as you need to make new notebooks from an existing notebook. Think of the main notebook (in our case, papermill_example1.ipynb) as a template that you can use to make as many copies as you want by quickly injecting parameters.
You may want to fetch or build your injected parameters using Python code, and so a Python API is also available to execute papermill. We can achieve the exact same result as above in a Python script (or in a notebook, where it works great as well and will show you the progress dynamically).
import papermill as pm
name = "Matt"
res = pm.execute_notebook(
    'papermill_example1.ipynb',
    # f-string so the name is substituted into the output filename
    f'papermill_{name.lower()}.ipynb',
    parameters=dict(name=name)
)
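Since the main notebook acts as a template, the same API call can be repeated for a list of values. The following is a rough sketch, assuming the example notebook above; build_jobs and run_jobs are hypothetical helper names of mine, and the only papermill call used is pm.execute_notebook.

```python
def build_jobs(names, template="papermill_example1.ipynb"):
    """Pair each name with an output path and a parameters dict."""
    return [
        {
            "input_path": template,
            "output_path": f"papermill_{name.lower()}.ipynb",
            "parameters": {"name": name},
        }
        for name in names
    ]


def run_jobs(jobs):
    """Execute the template notebook once per job."""
    # Imported here so the job list can be built without papermill installed.
    import papermill as pm

    for job in jobs:
        pm.execute_notebook(
            job["input_path"],
            job["output_path"],
            parameters=job["parameters"],
        )


# Inspect the plan before running anything.
for job in build_jobs(["Matt", "Ana", "Joe"]):
    print(job["output_path"], job["parameters"])
```

Calling run_jobs(build_jobs(["Matt", "Ana", "Joe"])) would then write one executed notebook per name. Keeping the planning step separate makes it easy to check the output filenames first.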
So far we’ve passed only one parameter, and have used the -p option to do this. You can pass parameters in several ways.
You can run these all using the example notebook, then view the results yourself. First, you can specify multiple parameters from the CLI. Even if a parameter doesn’t exist in the notebook yet, it can be passed in and created. In that case, papermill will create an injected-parameters cell and execute it at the top of the notebook.
Here’s an example.
papermill -p name Matt -p level 5 -p factor 0.33 -p alive True papermill_example1.ipynb papermill_matt.ipynb
or with long options instead…
papermill --parameters name Matt --parameters level 5 --parameters factor 0.33 --parameters alive True papermill_example1.ipynb papermill_matt.ipynb
Note that the -p or --parameters option will try to parse integers and floats, so if you want them to be interpreted as strings, use the -r or --parameters_raw option to pass all values in as strings.
papermill -r name Matt -r level 5 -r factor 0.33 -r alive True papermill_example1.ipynb papermill_matt.ipynb
You can also use yaml for specifying parameters. This can be passed in via a file (-f or --parameters_file), a string (-y or --parameters_yaml) or a base64 encoded string (-b or --parameters_base64). This allows you to pass in more complex data, including lists and dictionaries.
papermill papermill_example1.ipynb papermill_matt.ipynb -y "
name: Matt
level: 5
factor: 0.33
alive: True
sizes:
  - 1.0
  - 2.5
  - 3.7
params:
  x: 3
  y: 4"
You can also save the yaml to a file and base64 encode it pretty easily. (Run this in your shell on Mac, Linux, or Windows WSL, in the directory where the notebook file is.)
echo "
name: Matt
level: 5
factor: 0.33
alive: True
sizes:
  - 1.0
  - 2.5
  - 3.7
params:
  x: 3
  y: 4" > params.yaml
Now you can run the file version.
papermill papermill_example1.ipynb papermill_matt.ipynb -f params.yaml
Or the base64 version
PARAMS=$(base64 < params.yaml | tr -d '\n') # makes the base64 version of the yaml file, with any line wrapping removed
papermill papermill_example1.ipynb papermill_matt.ipynb -b "$PARAMS"
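If you are building the parameters in Python rather than in the shell, the base64 step can be done with the standard library. This is only a sketch: encode_parameters is a hypothetical helper of mine, the yaml text is a shortened version of the example above, and the command is built as a list but not executed here.

```python
import base64

# A shortened version of the yaml parameters used above.
PARAMS_YAML = """\
name: Matt
level: 5
sizes:
  - 1.0
  - 2.5
params:
  x: 3
  y: 4
"""


def encode_parameters(yaml_text: str) -> str:
    """Return the yaml text as a single-line base64 string."""
    return base64.b64encode(yaml_text.encode("utf-8")).decode("ascii")


encoded = encode_parameters(PARAMS_YAML)

# The argument list one might hand to subprocess.run (not executed here).
command = [
    "papermill",
    "papermill_example1.ipynb",
    "papermill_matt.ipynb",
    "-b",
    encoded,
]
print(command)
```

Unlike some command line base64 tools, base64.b64encode does not wrap its output across lines, so the encoded value can be passed as a single argument.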
Either way, you should get the idea that you can pass complex data into your notebook from the command line, and also via the API. These examples all use the local filesystem for input and output of notebooks, but note that you can read and write notebooks from Amazon s3, Azure, Google Cloud Storage, or web servers.
You can also inspect the available parameters of a notebook, from the CLI.
$ papermill --help-notebook papermill_example1.ipynb
Usage: papermill [OPTIONS] NOTEBOOK_PATH [OUTPUT_PATH]
Parameters inferred for notebook 'papermill_example1.ipynb':
name: Unknown type (default "Joe")
Or using the Python API.
pm.inspect_notebook('papermill_example1.ipynb')
{'name': {'name': 'name',
'inferred_type_name': 'None',
'default': '"Joe"',
'help': ''}}
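The dictionary returned by the API is handy if you want to check a notebook's parameters in your own scripts before running it. As a small sketch, here is a hypothetical helper (summarize_parameters is my own name, not a papermill function) that turns a result shaped like the one above into readable lines. The sample dict is copied from the output shown above.

```python
def summarize_parameters(inspected: dict) -> list:
    """Format an inspect_notebook-style dict as one line per parameter."""
    lines = []
    for name, info in sorted(inspected.items()):
        type_name = info.get("inferred_type_name", "None")
        if type_name == "None":
            # No type could be inferred for this parameter.
            type_name = "unknown type"
        default = info.get("default", "")
        lines.append(f"{name}: {type_name}, default {default}")
    return lines


# Same shape as the inspect_notebook output shown above.
sample = {
    "name": {
        "name": "name",
        "inferred_type_name": "None",
        "default": '"Joe"',
        "help": "",
    }
}

for line in summarize_parameters(sample):
    print(line)
```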
A typical workflow for papermill is to have a parameterized notebook, run it with multiple values, then convert the resulting notebooks into another format for review or reporting. Let’s walk through an example of how this might be set up.
First, we have a parameterized notebook that uses the Yahoo! finance API to fetch stock prices and plot data with the all-time high price of the stock (or at least the high for the last two years, since I’m only fetching that much data at this point).
If you want to run this example, you will need to ensure you have the yfinance package installed as well as matplotlib. You can install both with pip if needed.
We can use the papermill CLI to inspect the parameters.
$ papermill --help-notebook papermill_example2.ipynb
Usage: papermill [OPTIONS] NOTEBOOK_PATH [OUTPUT_PATH]
Parameters inferred for notebook 'papermill_example2.ipynb':
symbol: Unknown type (default 'AAPL')
We’ll run this notebook with several symbols. I’ve chosen to use a shell script for this so that I can run it through a scheduled cron job. If desired, this could just as easily be done using a simple Python script. However, if you are using a virtual environment you may end up needing a script anyway to ensure the virtualenv is loaded properly. In that case, it might just be easier to use a shell script for the entire process.
I’m also going to use the jupyter nbconvert (or you can run it as jupyter-nbconvert) command to convert the notebook into an html file for viewing via a web browser. Just like papermill, nbconvert is available via the command line or using the Python API.
#!/bin/bash
set -eux
# activate our virtualenv (this was created using pyenv-virtualenv, yours will be elsewhere)
source /Users/mcw/.pyenv/versions/3.8.6/envs/pandas/bin/activate
# get to the script directory if running via cron
cd "$(dirname "${BASH_SOURCE[0]}")"
# run the notebook once per symbol, then convert each result to html
for S in AAPL MSFT GOOG FB
do
    papermill -p symbol "$S" papermill_example2.ipynb "papermill_${S}.ipynb"
    jupyter-nbconvert --no-input --to html "papermill_${S}.ipynb"
done
You can run this command from your shell (after adjusting the line that activates the virtual environment to reflect your own setup). You can also schedule it to run regularly in cron pretty easily. For example, you can run this report every weekday at 4 PM like this (with your own path).
00 16 * * mon-fri /Users/mcw/projects/python_blogposts/tools/run_papermill.sh
With just a little more creativity (and software configuration on nbconvert), you can output the notebooks to PDF or other formats, send them via email, or upload them to a server to have nice-looking reports updated on a daily basis.
Note that the per-symbol notebooks are saved to the local disk. They can be opened in a Jupyter server and re-executed easily if debugging or further work is required. Just know that if you have an automated job running, the notebooks will be replaced each time it runs. Ideally, you want to work on your main template notebook, then generate new versions for each symbol with automation.
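As mentioned earlier, the same loop could be written as a simple Python script instead of a shell script. Here is one way it might look, as a sketch only: build_commands and run_reports are hypothetical names of mine, and the commands are the same papermill and jupyter-nbconvert invocations used in the shell script. You would still need the right virtual environment active when it runs.

```python
import subprocess

SYMBOLS = ["AAPL", "MSFT", "GOOG", "FB"]


def build_commands(symbol, template="papermill_example2.ipynb"):
    """Return the papermill and nbconvert commands for one symbol."""
    output = f"papermill_{symbol}.ipynb"
    return [
        ["papermill", "-p", "symbol", symbol, template, output],
        ["jupyter-nbconvert", "--no-input", "--to", "html", output],
    ]


def run_reports(symbols=SYMBOLS):
    """Execute and convert the notebook for each symbol in turn."""
    for symbol in symbols:
        for command in build_commands(symbol):
            # check=True stops at the first failing command, like `set -e`.
            subprocess.run(command, check=True)


# Show what would be run for one symbol without executing anything.
for command in build_commands("AAPL"):
    print(" ".join(command))
```

Calling run_reports() performs the actual work. Passing the commands as lists avoids shell quoting issues with the symbol names.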
One other tip is that papermill can read and write to standard input and output. This means that if you have other tools that take notebook files as input, you don’t have to write the files out to disk. For example, in our shell script above, we could prevent writing out each individual notebook file per symbol and do the following inside our loop instead.
papermill -p symbol $S papermill_example2.ipynb | jupyter-nbconvert --stdin --no-input --to html --output report_${S}.html
Note that if you do this, you’ll need to open the main notebook (papermill_example2.ipynb) and edit your parameters to debug issues. But maybe that’s preferable if you need to save disk space and don’t need the ability to debug each notebook separately.
Papermill is a useful library to parameterize and execute Jupyter notebooks. You can use it to automate execution of your notebooks with any sets of parameters you can dream up. Follow this up with a conversion of the notebook using nbconvert to provide readable and useful versions of your notebooks.
There is much more that can be done with notebook automation, but starting with papermill as a tool to execute and parameterize notebooks is a good platform to build on.
The post Parameterizing and automating Jupyter notebooks with papermill appeared first on wrighters.io.