Reproducible scientific Python environments with conda
This is a brief explanation of a workflow that I’ve been using for research/data-science projects in Python. It makes use of conda
environments co-located with project files. This meets several key criteria that I was looking for:
- Environments are easily recreatable, meaning less worry about ever borking things
- Reproducible workspace across different machines
- Dependencies placed under version control for open-science and collaboration
- “Portable” environments that are easy to move around like normal folders
Below are the key steps to use this setup.
Note This post replaces an earlier version I was drafting using Craft. I took down that post from this site, but you can still access the draft version here. The commands below were run on macOS and should work similarly on other *nix-y systems (e.g. Ubuntu and Windows Subsystem for Linux).
Setting up Anaconda or Miniconda
You can use this link to grab the latest Miniconda (on macOS you want the bash script). Or you can use the link at the bottom of this page to download Anaconda instead. Miniconda is a bit faster because it's more bare-bones, while Anaconda includes some default packages. Once you've downloaded either of those files you'll need to open up a Terminal and `cd` to the location of the file (probably `Downloads` or `Desktop`). From there you'll need to run the installer and follow the prompts, which you can do by typing `bash fileYouDownloaded.sh`.
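For example, with a recent Miniconda installer (the actual filename will differ depending on the version you grabbed and your CPU architecture):
# Move to wherever the installer ended up
cd ~/Downloads
# Run the installer and follow the prompts
bash Miniconda3-latest-MacOSX-x86_64.sh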
You can verify the installation worked by asking your system what Python it sees now using `which python`. If everything worked it should point to the Python installed inside your `anaconda3` or `miniconda3` directory.
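For example, with Miniconda installed in the default location you'd see something along these lines (the exact path depends on your username and where you installed it):
which python
# e.g. /Users/yourname/miniconda3/bin/python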
Creating a new environment for each project
Most guides will tell you to create a new named environment using the `-n/--name` flag to `conda create`. But a more reproducible setup is to create a local environment within your project folder. Let's say you have a project folder called `myproject/`. The command below creates a new environment in a sub-directory called `env`. It installs Python 3.8 and `pip` (for libraries on PyPI), and specifically uses the conda-forge channel for grabbing them:
# From within myproject/
conda create -p ./env python=3.8 pip -c conda-forge
You should now see a new `env/` directory. You can activate the environment by pointing to it: `conda activate ./env`. You should not commit this folder to version control as it can be quite large depending on how complex your project requirements get, so make sure to `echo 'env/' >> .gitignore`.
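Putting that together:
# Activate the project-local environment
conda activate ./env
# Keep the actual env contents out of version control
echo 'env/' >> .gitignore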
Backing up and restoring your environment
This is the critical piece that makes this setup work: the `environment.yml` file. This file is a recipe for rebuilding your environment in a platform-independent way. To create this recipe run the following command:
# Export the environment recipe to a file called environment.yml
conda env export --no-builds -f environment.yml
You should rerun this command any time you install or uninstall new libraries and packages. You should also commit those changes to version control:
git add environment.yml
git commit -m "saved environment"
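For reference, the exported file looks roughly like the sketch below; the name, channels, pinned versions, and prefix path will all depend on your particular setup:
# environment.yml (illustrative only)
name: env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip
  - numpy=1.22.3
  - pip:
      - somepackage==0.1.0
prefix: /path/to/myproject/env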
To restore an environment from this file (e.g. if you or a collaborator are working on another machine or you break something) just do:
# Make sure the env isn't active
conda deactivate
# Delete the env folder
rm -r env
# Create a new env using the spec in environment.yml
conda env create -p ./env -f environment.yml
You can also make sure your environment is in sync with `environment.yml` by running the following command, which will install and uninstall dependencies as needed:
conda env update -p ./env -f environment.yml --prune
A few closing suggestions
Make sure you know what environment is active before running `conda install` commands
Make sure that whenever you want to add or remove a package from a project environment (e.g. `myproject`), you `conda activate ./env` first. If you don't, you'll accidentally be adding and removing things from your `base` environment!
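A quick way to double-check before installing anything:
# The active environment is marked with a * in this list
conda env list
# Or inspect the variable conda sets on activation
echo $CONDA_DEFAULT_ENV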
Be careful about mixing and matching conda channels!
In the snippet above I used the `-c` flag so that conda installs from the `conda-forge` channel. By default `conda` will install from the `defaults` channel, which points to anaconda.org, whereas `conda-forge` points to conda-forge.org.
In general, you can save yourself a lot of headaches by simply sticking with the same channel for installing everything. For example, if I installed numpy using `conda install -c conda-forge numpy`, then it's a good idea to keep using `conda-forge` for other packages I want like `pandas`: `conda install -c conda-forge pandas`, rather than `conda install pandas`, which is equivalent to `conda install -c defaults pandas`.
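If you want to make that preference stick, one option is to record it in the project environment itself so conda prefers conda-forge without the `-c` flag every time (this writes a `.condarc` scoped to the currently active env):
# Run these with ./env active
conda config --env --add channels conda-forge
conda config --env --set channel_priority strict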
There might be times when you can't avoid mixing and matching, but it's a good heuristic to help avoid the dreaded "environment conflict" messages that you might encounter otherwise. I have been a long-time `defaults` user because of one or two less-than-pleasant experiences with `conda-forge` several years ago. Plus there used to be a significant speed difference between builds that use Intel's MKL (Math Kernel Library) and those that use OpenBLAS (an open implementation of the Basic Linear Algebra Subprograms) to power libraries like `numpy`. But lately most of that seems to have changed, and for the last few projects I've been exclusively preferring `conda-forge` for its breadth and especially for any R-related dependencies.
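If you're curious which backend a given environment actually ended up with, numpy can tell you:
# Prints the BLAS/LAPACK libraries numpy was built against (MKL vs OpenBLAS)
python -c "import numpy; numpy.show_config()"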
Optional extra automation
If you're interested in automating this workflow a bit, I made a few bash functions and aliases that you should be able to drop into a `.zshrc` or `.bashrc` file. Specifically:
- `envinit()` for creating a brand new `env/` with a few basic libs and exporting its `environment.yml`
- `envsave()` which exports the current `env` to `environment.yml` and also appends `pip` packages installed from something other than PyPI, e.g. GitHub or locally from source. That's because currently conda doesn't install these packages when recreating an environment, so you'll have to install them manually with `pip`
- `envcheck()` which checks if `envsave` needs to be run and does so. Useful as a git pre-commit hook
- `envactivate()` which basically "overloads" `conda activate` to prefer a `./env` if one exists in the current directory
- `newproject.py` a Python script that bootstraps a folder structure I often use while also setting up a VSCode Workspace and git repo
When starting a new project I’ll usually do something like this:
# Create new project scaffold using the `newproject` script at the end of this post
newproject --name coolscience
# Create a new conda env using an alias for the `envinit()` bash function
cd coolscience
ie
# Set up version control for the env recipe
echo 'env' >> .gitignore
git add environment.yml
git commit -m "saved environment"
This creates the following project structure and gives me a Python environment ready to go with some reasonable defaults and editor setup:
# contents of coolscience/
|-- analysis/
|-- code/
|-- data/
|-- docs/
|-- env/ # actual env contents
|-- figures/
|-- presentations/
|-- LICENSE
|-- README.md
|-- environment.yml # env recipe
|-- .vscode/
|-- .gitignore
As I continue working I’ll do the following:
- whenever I first `cd` into `coolscience` I'll use `ca` to activate the environment installed in `coolscience/env`
- I'll `ce` to check if my `environment.yml` needs to be updated with any new packages I `conda install`-ed or removed
- If I remember I'll use `se` whenever I `conda install` something. But honestly I forget a lot so end up using `ce`
Note As mentioned above, the `environment.yml` generated by `envsave()` and `envcheck()` will also output installed `pip` packages for convenience. However, any pip packages installed from source or after `git clone`-ing something will need to be reinstalled manually whenever you recreate the environment. The aliases will print out a message indicating whenever you're in that situation.
And the aliases/functions/scripts are here:
#!/usr/bin/env python
"""
Scaffold a project directory quickly and put it under version control with reasonable defaults.
Assumes python and git are available.
"""
import os
from subprocess import check_output, STDOUT
import argparse


def sys_call(cmd):
    # Run a shell command and return its combined stdout/stderr
    return check_output(cmd, shell=True, stderr=STDOUT)


parser = argparse.ArgumentParser(
    description="script to auto-generate project structure."
)
parser.add_argument(
    "--base_dir", help="Where to initialize project. Defaults to cwd.", required=False
)
parser.add_argument("--name", help="Project name.", required=True)
args = parser.parse_args()

if not args.base_dir:
    base_dir = os.getcwd()
else:
    base_dir = args.base_dir
if not args.name:
    parser.error("Need a project name!")
else:
    project_name = args.name
abs_project_path = os.path.join(base_dir, project_name)

# NOTE: these toggles are referenced below but weren't defined in the original script;
# flip them to True to copy a pre-commit hook and print the extra environment message
setup_precommit = False
setup_py = False

# Make dir structure
print("Setting up project...")
call_string = (
    "mkdir -p "
    + abs_project_path
    + "/{analysis,data,papers,presentations,code,figures,paradigms,.vscode};"
)
sys_call(call_string)

# Create a README file
readme = open(os.path.join(abs_project_path, "README.md"), "w")
readme.write(
    f"""
# {project_name}
---
## Python environment setup
The environment.yml file in this repo can be used to bootstrap a conda environment for
reproducibility:\n
`conda env create -p ./env -f environment.yml`\n
To update the environment file after installing/removing packages: `conda env export --no-builds -f environment.yml`\n\n
To update the environment itself after editing the `environment.yml` file: `conda env update --file environment.yml --prune`\n
"""
)
readme.close()

# Create a gitignore file
ignore_file = open(os.path.join(abs_project_path, ".gitignore"), "w")
ignore_file.write("*.DS_Store\n")
ignore_file.write("*.ipynb_checkpoints\n")
ignore_file.write("*.pyc\n")
ignore_file.write("# Don't commit actual conda env\n")
ignore_file.write("env\n")
ignore_file.write("# Don't commit data, figs, or presentations by default\n")
ignore_file.write("*.csv\n")
ignore_file.write("*.txt\n")
ignore_file.write("*.png\n")
ignore_file.write("*.jpg\n")
ignore_file.write("*.jpeg\n")
ignore_file.write("*.ppt*\n")
ignore_file.write("*.key\n")
ignore_file.close()

# Vscode settings file
vscode_file = open(os.path.join(abs_project_path, ".vscode", "settings.json"), "w")
vscode_file.write("{\n")
vscode_file.write(
    '\t"python.defaultInterpreterPath": "${workspaceFolder}/env/bin/python",\n'
)
vscode_file.write('\t"python.terminal.activateEnvironment": true,\n')
vscode_file.write('\t"editor.formatOnSave": true,\n')
vscode_file.write('\t"python.analysis.extraPaths": ["${workspaceFolder}/code"]\n')
vscode_file.write("}")
vscode_file.close()

# CD to project dir
os.chdir(abs_project_path)

# Create gitkeep files to commit empty dir structure
call_string = r"find * -type d -not -path '*/\.*' -exec touch {}/.gitkeep \;"
sys_call(call_string)

# Initialize repo
sys_call("git init")

# Add stuff
git_add_call_string = "git add .gitignore README.md .vscode analysis code data figures papers paradigms presentations"
sys_call(git_add_call_string)

# Perform initial commit
call_string = "git commit -m 'Initial project commit.'"
sys_call(call_string)

if setup_precommit:
    print("Setting up pre-commit hook ...")
    sys_call(
        "cp /Users/Esh/Documents/cmd_programs/check_env.sh ./.git/hooks/pre-commit"
    )

# Messages
print(f"New project folder and repo created in:\n\n {abs_project_path}")
if setup_py:
    print(
        """\nNew python environment created in ./env with environment.yml packages\n\nYou can activate this environment in a terminal using: conda activate ./env\n"""
    )
# Just copy these to your .zshrc, .bashrc, etc, or source this file from them

savepackagesfromgit() {
  # Get pip installed packages from github
  PIP_FROM_SOURCE=$(conda run -p ./env pip freeze | grep git)
  if [ -n "$PIP_FROM_SOURCE" ]; then
    echo 'saving pip packages installed from github...'
    # Get package names from PIP_FROM_SOURCE
    package_names=$(echo "$PIP_FROM_SOURCE" | sed 's/ @.*//')
    # Read the PIP_FROM_SOURCE variable and replace @ - git+..
    modified_pip_from_source=$(echo "$PIP_FROM_SOURCE" | sed 's/.* @/ -/')
    # Find the line number of the prefix line in environment.yml
    prefix_line_num=$(grep -n "^prefix:" environment.yml | cut -d: -f1)
    # Split the environment.yml file into two parts: before and after the prefix line
    head -n $(($prefix_line_num - 1)) environment.yml > environment_part1.yml
    tail -n +$prefix_line_num environment.yml > environment_part2.yml
    # Remove lines containing the same names as in the package_names variable
    while read -r package_name; do
      grep -vE "^ -* *$package_name(==| @)" environment_part1.yml > environment_part1_filtered.yml
      command mv -f environment_part1_filtered.yml environment_part1.yml
    done <<< "$package_names"
    # Combine the parts with the modified content from PIP_FROM_SOURCE
    cat environment_part1.yml > environment_new.yml
    echo "$modified_pip_from_source" >> environment_new.yml
    cat environment_part2.yml >> environment_new.yml
    # Replace the original environment.yml with the new file
    command mv -f environment_new.yml environment.yml
    # Delete intermediate files
    command rm -f environment_part1.yml
    command rm -f environment_part2.yml
  fi
}

envsave() {
  # Export to environment.yml rewriting it
  # Also adds pip packages installed via git
  CURRENT_ENV="$CONDA_DEFAULT_ENV"
  if [ "$CURRENT_ENV" = "base" ]; then
    echo "conda base env is active...activate your environment and then rerun envsave"
    return
  fi
  conda env export --no-builds -f environment.yml
  printf "successfully saved environment.yml\n"
  savepackagesfromgit
}

envinit() {
  # Bootstrap from environment.yml if it exists otherwise create a vanilla env and export it to environment.yml
  if [ -f "environment.yml" ]; then
    if [ -d "./env" ]; then
      echo "Existing ./env folder found...remove and try again"
      return
    fi
    echo "environment.yml found...bootstrapping environment"
    conda env create --file environment.yml -p ./env
    echo "activating environment..."
    envactivate
  else
    echo "no environment.yml found...creating basic environment with python 3.8 and conda-forge"
    conda create -y -p ./env python=3.8 pip pycodestyle black ipykernel -c conda-forge
    echo "activating environment..."
    envactivate
    echo "saving created environment to environment.yml..."
    envsave
  fi
}

# Note this always rewrites environment.yml if there are git+ packages as it doesn't
# check for those!
envcheck() {
  ENV_FILE='environment.yml'
  # Ensure the user doesn't accidentally run this from a shell in which the base environment is active
  CURRENT_ENV="$CONDA_DEFAULT_ENV"
  if [ "$CURRENT_ENV" = "base" ]; then
    echo "conda base env is active...activate your environment and then rerun envcheck"
    return
  fi
  if [ -f "$ENV_FILE" ]; then
    echo "Checking $ENV_FILE against environment for changes..."
    # Get current environment list
    CONDA_YAML=$(conda env export --no-builds)
    # Get diff with existing file
    DIFF=$(echo "$CONDA_YAML" | git diff --no-index -- "$ENV_FILE" -)
    if [ "$DIFF" != "" ]; then
      echo "Changes found...updating $ENV_FILE"
      envsave
    else
      echo "No changes found...$ENV_FILE is up-to-date"
    fi
  else
    echo "no $ENV_FILE file found creating..."
    envsave
  fi
}

envactivate() {
  if [ "$1" != "" ]; then
    conda activate "$1"
  else
    if [ -d "./env" ]; then
      echo "Activating ./env"
      conda activate ./env
    else
      echo "No ./env found. Need to specify which env to activate"
      return
    fi
  fi
}

# Setup some aliases
alias ca='envactivate'
alias cdd='conda deactivate'
alias checkenv='envcheck'
alias initenv='envinit'
alias saveenv='envsave'
alias ce="envcheck"
alias se="envsave"
alias ie="envinit"
alias ae="envactivate"