The Good Research Code Handbook is written by Patrick Mineault and can be viewed here.
This is my abridged version of the handbook and a reference point for me if I need one (though the goal is for all of this to be second-nature). The sections of notes will follow the sections in the book.
The following ideas are not my own. Rather, they are those of Patrick Mineault. If an idea is my own, it will be denoted with a dagger, †.
The broad steps to setting up an organized project are: create a project directory and initialize a git repository*, set up a virtual environment, create a standard folder structure, and make the project a locally pip-installable package.
*† I tend to find it easier to create a git repository first, clone it locally (git clone https://<repo_url>), and then proceed. This takes care of the first step as well.
$ mkdir project_name
$ cd project_name
$ echo "# Project Name" >> README.md
$ git init
$ git add README.md
$ git commit -m "First commit -- adds README"
$ git branch -M main
$ git remote add origin https://<repo_url>
$ git push -u origin main
There are many tools for managing environments and dependencies, including:
• conda
• pipenv
• poetry
• venv
• virtualenv
• asdf
• docker
Below is how to set up an environment with the conda package manager.
Creating and activating an environment
~/project_name$ conda create --name project_name python=3.12
~/project_name$ conda activate project_name
Installing packages
(project_name) ~/project_name$ conda install pandas numpy scipy matplotlib seaborn
Export your environment
(project_name) ~/project_name$ conda env export > environment.yml
Then, you may consider committing this environment file:
~/project_name$ git add environment.yml
~/project_name$ git commit environment.yml -m "Adds conda environment file"
~/project_name$ git push
This file can then be used to recreate this environment:
$ conda env create --name recoveredenv --file environment.yml
Note that this file documents low-level, OS-specific packages, so it will only work on the same operating system; it is not portable to a different one. Manually adjust the file if portability is necessary.
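If you need a more portable file, conda can export only the packages you explicitly asked for, without the OS-specific build details (note that this variant omits pip-installed packages):

(project_name) ~/project_name$ conda env export --from-history > environment.yml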
Add dependencies to your environment
(project_name) ~/project_name$ conda env update --name project_name --file environment.yml --prune
Below is how to do the same with venv and pip.

Creating and activating an environment:
~/project_name$ python -m venv project_name-env
~/project_name$ source project_name-env/bin/activate
Installing packages
(project_name-env) ~/project_name$ pip install pandas numpy scipy matplotlib seaborn
Export your environment
(project_name-env) ~/project_name$ pip freeze > requirements.txt
Then, you may consider committing this requirements file:
~/project_name$ git add requirements.txt
~/project_name$ git commit requirements.txt -m "Adds pip requirements file"
~/project_name$ git push
This file can then be used to recreate this environment:
$ python -m venv recovered-env
$ source recovered-env/bin/activate
(recovered-env) $ pip install -r requirements.txt
Note that requirements.txt only captures Python packages visible to pip; non-Python dependencies are not recorded. If those need to be documented, consider using additional tools or documentation.
You can use pip inside of a conda environment. A big point of confusion is how conda relates to pip.

For conda:
• conda is both a package manager and a virtual environment manager
• conda can install big, complicated-to-install, non-Python software, like gcc
• not all Python packages can be installed through conda

For pip:
• pip is just a package manager
• pip only installs Python packages
• pip can install every package on PyPI, in addition to local packages

Conda tracks which packages are pip-installed and will include a special section in environment.yml for pip packages. However, installing pip packages may negatively affect conda's ability to install conda packages correctly after the first pip install. Therefore, people generally recommend installing big conda packages first, then installing small pip packages second.
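For reference, here is a hand-written sketch of what an environment.yml with a pip section looks like (the package names are placeholders):

name: project_name
dependencies:
  - python=3.12
  - numpy
  - pandas
  - pip
  - pip:
      - some-pip-only-package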
There is no default standard structure for a Python project. The following is as good as any, but feel free to tweak it to your specific needs:
.
├── data/
├── docs/
├── results/
├── scripts/
├── src/
├── tests/
├── .gitignore
├── environment.yml (or requirements.txt)
└── README.md
$ mkdir {data,docs,results,scripts,src,tests}
data/
A place for raw data. This doesn't typically get added to source control unless the datasets are small.
docs/
Where documentation goes. Naming it docs makes publishing it through, say, GitHub Pages, easier.
results/
Where you put results, including checkpoints, HDF5 files, and pickle files, as well as figures and tables. Don't add large files to source control.
scripts/
Python and bash scripts, Jupyter notebooks, etc.
src/
Reusable Python modules for the project. Code you would consider importing.
tests/
Where tests for your code go.
.gitignore
A list of files that git should ignore.
README.md
A description of your project, including installation instructions. What people will see on the top level of the repository.
environment.yml (or requirements.txt)
Description of your environment.
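As a sketch, a .gitignore consistent with the advice above (keeping data and results out of source control) might contain:

data/
results/
__pycache__/
*.pyc
*.egg-info/
.ipynb_checkpoints/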
Making the project locally pip installable only involves a few steps.

First, create a setup.py file:
from setuptools import find_packages, setup

setup(
    name='src',
    packages=find_packages()
)
This should be done in the root (top level) of your project.
Next, create an empty __init__.py file in the src directory. This allows find_packages to find the package:

~/project_name$ touch src/__init__.py
Finally, pip install your package:

(env) ~/project_name$ pip install -e .

The . indicates that the package in the current directory should be installed, and -e indicates the package should be editable: if you change the files inside the src folder, you don't need to re-install the package for your changes to be picked up by Python.

(env) ~/project_name$ echo "print('hello world')" > src/hello_world.py
(env) ~/project_name$ cd scripts
(env) ~/project_name/scripts$ python
>>> import src.hello_world
hello world
>>> exit()
(env) ~/project_name/scripts$ cd ~
(env) ~$ python
>>> import src.hello_world
hello world
If you want to rename src to, say, project_name, simply:
(env) ~/project_name$ mv src project_name
(env) ~/project_name$ pip install -e .
An alternative is a shortcut to make your code accessible to other files in different directories, but it is not the most future-proof method. By adding the project folder (the one containing src) to your Python path, you should be able to access the code anywhere:

import os
import sys

sys.path.append(os.path.expanduser('~/project_name'))  # the folder containing src/
from src.lib import cool_function
You can skip everything we just went over using the cookiecutter tool. As an example, to do exactly what we did (there are other cookiecutter flavors, including the robust Data Science cookiecutter):
(env) ~/project_name$ pip install cookiecutter
(env) ~/project_name$ cookiecutter gh:patrickmineault/true-neutral-cookiecutter
A quick note on naming conventions:
• variables, functions, and modules use snake_case: variable_name, cool_module.py
• classes use CamelCase: CoolClass
• Jupyter notebooks can have long, descriptive names: Cool Jupyter Notebook.ipynb
You could study the style guide. Or you could consider using a linter and/or a code formatter, such as:
• flake8 (linter)
• pylint (linter)
• black (code formatter)
• ruff (linter and code formatter; † personal favorite; see the example below)

Simply clean up dead code, at least from the main branch. Consider using Vulture if there is a lot of dead code.
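The ruff workflow referenced above looks like, for example:

(env) ~/project_name$ pip install ruff
(env) ~/project_name$ ruff check .     # lint the project
(env) ~/project_name$ ruff format .    # format files in place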
This figure from the book adequately summarizes a good Jupyter approach.
Additionally, you should enable autoreload in notebooks, so that edited modules are re-imported automatically:

%load_ext autoreload
%autoreload 2

jupytext pairs each notebook with a plain-text version, which makes version control and diffs easier:

(env) ~/project_name$ pip install jupytext
(env) ~/project_name$ jupytext --set-formats ipynb,py:percent notebook.ipynb
Code smells should be avoided. They include, for example, spaghetti code: code so tightly wound that when you pull on one strand, the entire thing unravels.
Pure functions follow the canonical data flow from arguments to return statement, and are considered stateless and deterministic. They are easy to reason about.
A side effect is anything that happens outside the canonical data flow from arguments to return, including printing to the console, writing to files, or mutating variables outside the function's own scope.
Not every function with side effects is problematic, however.
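A minimal illustration of the difference (the function names are hypothetical):

counter = 0

def count_words_pure(text):
    # Pure: the output depends only on the argument; nothing outside is touched.
    return len(text.split())

def count_words_impure(text):
    # Side effects: mutates a global and prints, outside the arguments-to-return flow.
    global counter
    counter += 1
    print(f"called {counter} times")
    return len(text.split())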
To have well-behaved side effects, encapsulate state and flag private state with a leading underscore, _. For example, self._x denotes a class member _x which should be managed by the class itself.

assert († Good for notebooks!) throws an error whenever the statement is false.
a. In the interpreter:
>>> assert 1 == 0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError
b. In a script, fib.py:
def fib(x):
    if x <= 2:
        return 1
    else:
        return fib(x-1) + fib(x-2)

if __name__ == '__main__':
    assert fib(0) == 0
    assert fib(1) == 1
    assert fib(2) == 1
    assert fib(6) == 8
    assert fib(40) == 102334155
    print("Tests passed")
$ python fib.py
Traceback (most recent call last):
File "fib.py", line 8, in <module> assert fib(0) == 0
AssertionError
Once you have a lot of tests, it makes sense to group them into a test suite that gets run with a test runner. You can use pytest or unittest. The book focuses on the more common pytest.
This is most easily shown through example. After installing pytest by running pip install -U pytest, create a file in tests/ called test_fib.py that looks like:
from src.fib import fib
import pytest

def test_typical():
    assert fib(1) == 1
    assert fib(2) == 1
    assert fib(6) == 8
    assert fib(40) == 102334155

def test_edge_case():
    assert fib(0) == 0

def test_raises():
    with pytest.raises(NotImplementedError):  # checks that this error is raised
        fib(-1)
    with pytest.raises(NotImplementedError):
        fib(1.5)
Then, run the test suite:
$ pytest test_fib.py
...
    def fib_inner(x):
        nonlocal cache
        if x in cache:
            return cache[x]
>       if x == 0:
E       RecursionError: maximum recursion depth exceeded in comparison

../src/fib.py:7: RecursionError
============================ short test summary info ===========================
FAILED test_fib.py::test_raises - RecursionError: maximum recursion depth exceeded in comparison
========================= 1 failed, 2 passed in 1.18s ==========================
At this point the code can be fixed and the test suite run again, until an output like
test_fib.py ... [100%]
================================ 3 passed in 0.02s =============================
is seen.
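For concreteness, here is one sketch of a src/fib.py that would pass this suite (the memoization and input checks are my additions; the book's fixed version may differ):

from functools import lru_cache

@lru_cache(maxsize=None)
def _fib(x):
    if x == 0:
        return 0
    if x <= 2:
        return 1
    return _fib(x - 1) + _fib(x - 2)

def fib(x):
    # Reject inputs the function isn't defined for, as test_raises expects.
    if not isinstance(x, int) or x < 0:
        raise NotImplementedError("fib is only defined for non-negative integers")
    return _fib(x)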
There are more types of tests than just unit tests: integration tests, smoke tests, and regression tests, to name a few.
For many people, documenting code is synonymous with commenting code. That's a narrow view of documentation.
Raise errors in your code so that you know they will get read -- docstrings often aren't. Some common and generic errors to consider mixing in are NotImplementedError, ValueError, and NameError. You can also mix in the aforementioned asserts to make a block of code fail more gracefully if, for example, an input isn't in the correct format.
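For example, validating an input at the top of a function (a hypothetical function):

def mean_rate(rates):
    # Fail loudly and early if the input isn't what we expect.
    if len(rates) == 0:
        raise ValueError("rates is empty; cannot compute a mean")
    if any(r < 0 for r in rates):
        raise ValueError("rates must be non-negative")
    return sum(rates) / len(rates)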
Optionally enforced type checking using decorators is another option. These are a matter of preference (to some Python purists they are sacrilegious), but they do improve documentation.
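As one sketch of this approach, the typeguard package provides a @typechecked decorator that enforces type hints at call time (assuming typeguard is installed; the exact error type varies by version):

from typeguard import typechecked

@typechecked
def scale(values: list[float], factor: float) -> list[float]:
    return [v * factor for v in values]

scale([1.0, 2.0], 2.0)  # fine
scale("oops", 2.0)      # raises a type-checking error at call time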
The three prevailing styles of docstrings are reST, Google, and NumPy. Pick one and stick to it (along with your colleagues). IDEs can parse and display these docstrings, which tends to be very helpful.
Docstrings can age poorly. When your arguments change, it’s easy to forget to change the docstring accordingly. I prefer to wait until later in the development process when function interfaces are stable to start writing docstrings.
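For reference, a NumPy-style docstring looks like this (a hypothetical function):

import math

def gaussian(x, mu=0.0, sigma=1.0):
    """Evaluate a Gaussian density at x.

    Parameters
    ----------
    x : float
        Point at which to evaluate the density.
    mu : float, optional
        Mean of the distribution.
    sigma : float, optional
        Standard deviation of the distribution.

    Returns
    -------
    float
        The density at x.
    """
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))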
Generating docs is easy with Sphinx:

$ pip install sphinx
$ cd docs
$ sphinx-quickstart
$ make html
You can then publish your docs to Read the Docs by linking your repository. You can, of course, publish to other places as well, such as GitHub Pages or Netlify.
Sphinx works with reST docstrings by default. It can also handle Google and NumPy docstrings with a plugin (sphinx.ext.napoleon).
Instead of commenting and un-commenting code, we can have different code paths execute depending on flags passed as command line arguments. argparse makes this easy. Here is an example of a use of argparse:
import argparse

def main(args):
    # TODO: Implement a neural net here.
    print(args.model)  # Prints the model type.

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Train a neural net")
    parser.add_argument("--model", required=True, help="Model type (resnet or alexnet)")
    parser.add_argument("--niter", type=int, default=1000, help="Number of iterations")
    parser.add_argument("--in_dir", required=True, help="Input directory with images")
    parser.add_argument("--out_dir", required=True, help="Output directory with trained model")

    args = parser.parse_args()
    main(args)
$ python train_net.py -h
usage: train_net.py [-h] --model MODEL [--niter NITER] --in_dir IN_DIR --out_dir
                    OUT_DIR

Train a neural net

optional arguments:
  -h, --help         show this help message and exit
  --model MODEL      Model type (resnet or alexnet)
  --niter NITER      Number of iterations
  --in_dir IN_DIR    Input directory with images
  --out_dir OUT_DIR  Output directory with trained model
Once your code takes configuration as command line flags, keep a record of the flags used each time you invoke it. An easy way to do this is a shell file that contains multiple shell commands, run one after the other.
Here is an example shell file. Notice that it not only runs the code but it is also documentation of the pipeline:
#!/bin/bash
# This will cause bash to stop executing the script if there’s an error
set -e
# Download files
aws s3 cp s3://codebook-testbucket/images/ data/images --recursive
# Train network
python scripts/train_net.py --model resnet --niter 100000 --in_dir data/images \
--out_dir results/trained_model
# Create output directory
mkdir results/figures/
# Generate plots
python scripts/generate_plots.py --in_dir data/images \
--out_dir results/figures/ --ckpt results/trained_model/model.ckpt
There are a lot of make-like tools out there; consider adopting one as your pipeline grows. A makefile specifies both the inputs to each pipeline step and its outputs, so only out-of-date steps need to be re-run, as sketched below.
† I have personally worked with and like snakemake for machine-learning projects.
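As a sketch, a Makefile for the pipeline above might look like this (paths reuse the shell file's hypothetical names; recipe lines must be indented with tabs):

# Each rule names an output, the inputs it depends on, and the command
# that produces the output from the inputs.
results/trained_model/model.ckpt: data/images
	python scripts/train_net.py --model resnet --niter 100000 \
		--in_dir data/images --out_dir results/trained_model

results/figures: results/trained_model/model.ckpt
	mkdir -p results/figures
	python scripts/generate_plots.py --in_dir data/images \
		--out_dir results/figures --ckpt results/trained_model/model.ckpt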
Here is a nice shortcut to having figures and tables versioned with your codebase:
import git  # from the gitpython package (pip install gitpython)
import matplotlib.pyplot as plt

repo = git.Repo(search_parent_directories=True)
short_hash = repo.head.object.hexsha[:10]

# Plotting code goes here...
plt.savefig(f'figure.{short_hash}.png')
This reduces ambiguity about how and when this figure was generated. The same can be done with any other type of results file.
There are other, more full-featured tools that can do this in a much more robust fashion if that better meets your needs. For example, consider Wandb, Neptune, Gigantum, or datalad, to name a few.
README.md file: at minimum, consider including a description of the project, installation instructions, and an example of how to run the code, and keep the file up-to-date.
Writing notes in Markdown allows them to be rendered on a wide variety of platforms for others to read. How much you need to document varies by environment, but writing in Markdown saves your future self work.
An effective method of sharing knowledge through active practice: two programmers collaborate actively on a programming task. Traditionally, there is a driver, who physically types the code and thinks about micro-issues in the code (tactics), and a navigator, who tells the driver what to write and focuses on macro-issues, e.g., what a function should accomplish.
† I think code reviews are essential for a team to work well.
Code review is the practice of peer reviewing other people's code. Common ways to do this are through pull/merge requests. Alternatively, a code review meeting can be organized where everyone reads code and comments on it at once.
This is the end of the document. I hope you found, and continue to find, it helpful.