Eshin Jolly

Reproducible scientific Python environments with conda

This is a brief explanation of a workflow that I’ve been using for research/data-science projects in Python. It makes use of conda environments co-located with project files. This meets several key criteria that I was looking for:

  • Environments are easily recreatable, meaning less worry about ever borking things
  • Reproducible workspace across different machines
  • Dependencies placed under version control for open-science and collaboration
  • “Portable” environments that are easy to move around like normal folders

Below are the key steps to use this setup.

Note This post replaces an earlier version I was drafting using Craft. I took down that post from this site, but you can still access the draft version here. The commands below were run on macOS and should be broadly similar on other *nix-y systems (e.g. Ubuntu and Windows Subsystem for Linux).

Setting up Anaconda or Miniconda

You can use this link to grab the latest Miniconda (on macOS you want the bash script), or use the link at the bottom of this page to download Anaconda instead. Miniconda installs faster because it’s more bare-bones, while Anaconda bundles a set of default packages. Once you’ve downloaded either installer, open a Terminal and cd to the location of the file (probably Downloads or Desktop). From there, run the installer and follow the prompts by typing bash fileYouDownloaded.sh.

You can verify the installation worked by asking your system what Python it sees now using which python. If everything worked it should point to the Python installed inside your anaconda3 or miniconda3 directory.

Creating a new environment for each project

Most guides will tell you to create a new named environment using the -n/--name flag to conda create. But a more reproducible setup is to create a local environment within your project folder. Let’s say you have a project folder called myproject/. The command below creates a new environment in a sub-directory called env. It installs Python 3.8 and pip (for installing libraries from PyPI), and specifically uses the conda-forge channel to grab them:

# From within myproject/
conda create -p ./env python=3.8 pip -c conda-forge

Change your Python version to what makes most sense for your project. You can also omit -c conda-forge to just install from the normal defaults channel.

You should now see a new env/ directory. You can activate the environment by pointing to it: conda activate ./env. You should not commit this folder to version control as it can be quite large depending on how complex your project requirements get. So make sure to echo 'env/' >> .gitignore.

Backing up and restoring your environment

This is the critical piece that makes this setup work: the environment.yml file. This file is a recipe for rebuilding your environment in a platform-independent way. To create this recipe run the following command:

# Export the environment recipe to a file called environment.yml
conda env export --no-builds -f environment.yml

The --no-builds flag exports the current environment in a platform-independent way. This means you should be able to use the same environment.yml across different operating systems (e.g. macOS, Windows, etc).
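For reference, the exported file looks something like this (the package list, versions, and prefix path below are illustrative, not what you’ll necessarily see):

```yaml
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip
  - numpy=1.21.2
  - pip:
      - requests==2.26.0
prefix: /Users/you/myproject/env
```

The prefix: line records the absolute path on the machine that did the export; in my experience conda ignores it when you recreate the env with -p, so it’s fine to commit.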

You should rerun this command any time you install or uninstall new libraries and packages. You should also commit those changes to version control:

git add environment.yml
git commit -m "saved environment"

To restore an environment from this file (e.g. if you or a collaborator are working on another machine or you break something) just do:

# Make sure the env isn't active 
conda deactivate
# Delete the env folder
rm -r env
# Create a new env using the spec in environment.yml
conda env create -p ./env -f environment.yml

You can also make sure your environment is in sync with environment.yml by running the following command, which will install and uninstall dependencies as needed:

conda env update -p ./env -f environment.yml --prune
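This also works in the other direction: you can edit environment.yml by hand and then sync the env to match. For example, adding a line for seaborn (a hypothetical addition) to the dependencies section:

```yaml
dependencies:
  - python=3.8
  - pip
  - seaborn
```

Re-running the conda env update command above will then install seaborn into ./env (and --prune will remove anything you deleted from the file).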

A few closing suggestions

Make sure you know what environment is active before running conda install commands

Whenever you want to add or remove a package in a project environment (e.g. myproject), make sure you conda activate ./env first. If you don’t, you’ll accidentally be adding and removing things in your base environment!
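One quick way to check is the CONDA_DEFAULT_ENV variable that conda sets when an environment is activated (you can also run conda env list, which marks the active env with a *):

```shell
# Print the currently active conda env; "base" means your project env
# is NOT active, and "none" means no conda env is active at all
echo "${CONDA_DEFAULT_ENV:-none}"
```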

Be careful about mixing and matching conda channels!

In the snippet above I used the -c flag to conda install from the conda-forge channel. By default conda installs from the defaults channel (hosted at repo.anaconda.com), whereas conda-forge is a separate community-maintained channel (with its own site at conda-forge.org).

In general, you can save yourself a lot of headaches by simply sticking with the same channel for installing everything. For example if I installed numpy using conda install -c conda-forge numpy, then it’s a good idea to keep using conda-forge for other packages I want like pandas: conda install -c conda-forge pandas, rather than conda install pandas, which is equivalent to conda install -c defaults pandas.
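If you do settle on conda-forge for everything, you can make it the default rather than typing -c conda-forge each time. One way is a ~/.condarc along these lines (a sketch of a common setup, not the only option):

```yaml
# ~/.condarc
channels:
  - conda-forge
channel_priority: strict
```

With channel_priority: strict, conda won’t silently mix in builds from lower-priority channels, which sidesteps exactly the kind of conflicts described here.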

There might be times when you can’t avoid mixing and matching, but it’s a good heuristic to help avoid the dreaded “environment conflict” messages that you might encounter otherwise. I had been a long-time defaults user because of one or two less-than-pleasant experiences with conda-forge several years ago. Plus there used to be a significant speed difference between numpy builds powered by the Intel-compiled MKL (Math Kernel Library) and those using OpenBLAS (Basic Linear Algebra Subprograms). But lately most of that seems to have changed, and for the last few projects I’ve been exclusively preferring conda-forge for its breadth, and especially for any R-related dependencies.

Optional extra automation

If you’re interested in automating this workflow a bit, I made a few bash functions and aliases that you should be able to drop into a .zshrc or .bashrc file. Specifically:

  • envinit() for creating a brand new env/ with a few basic libs and exporting its environment.yml
  • envsave() which exports the current env to environment.yml and also appends pip packages installed from somewhere other than PyPI, e.g. GitHub or locally from source. That’s because conda currently doesn’t install these packages when recreating an environment, so you’ll have to install them manually with pip
  • envcheck() which checks if envsave needs to be run and does so. Useful as a git pre-commit hook.
  • envactivate() which basically “overloads” conda activate to prefer a ./env if one exists in the current directory
  • newproject.py a python script that bootstraps a folder structure I often use while also setting up a VSCode Workspace and git repo

When starting a new project I’ll usually do something like this:

# Create new project scaffold using the `newproject` script at the end of this post
newproject --name coolscience

# Create a new conda env using an alias for the `envinit()` bash function
cd coolscience
ie

# Set up version control for the env recipe
echo 'env' >> .gitignore
git add environment.yml
git commit -m "saved environment"

This creates the following project structure and gives me a Python environment ready to go with some reasonable defaults and editor setup:

# contents of coolscience/
|-- analysis/
|-- code/
|-- data/
|-- env/                  # actual env contents
|-- figures/
|-- papers/
|-- paradigms/
|-- presentations/
|-- LICENSE
|-- README.md
|-- environment.yml       # env recipe
|-- .vscode/
|-- .gitignore

As I continue working I’ll do the following:

  • whenever I first cd into coolscience I’ll use ca to activate the environment installed in coolscience/env
  • I’ll ce to check if my environment.yml needs to be updated with any new packages I conda install-ed or removed
  • If I remember I’ll use se whenever I conda install something. But honestly I forget a lot, so I end up using ce

Note As mentioned above, the environment.yml generated by envsave() and envcheck() will also include installed pip packages for convenience. However, any pip packages installed from source or after git clone-ing something will need to be reinstalled manually whenever you recreate the environment. The aliases will print out a message indicating when you’re in that situation.
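Concretely, pip freeze reports source installs with an @ marker, and the sed calls in savepackagesfromgit() rewrite them into plain yaml list entries appended to environment.yml (the package name and URL here are made up for illustration):

```yaml
# a `pip freeze` line like:
#   mypackage @ git+https://github.com/someuser/mypackage.git
# becomes an entry like this in environment.yml:
- git+https://github.com/someuser/mypackage.git
```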

And the aliases/functions/scripts are here:

#!/usr/bin/env python
"""
Scaffold a project directory quickly and put it under version control with reasonable defaults.
Assumes python and git are available.
"""
import os
from subprocess import check_output, STDOUT
import argparse
def sys_call(cmd):
    return check_output(cmd, shell=True, stderr=STDOUT)
parser = argparse.ArgumentParser(
    description="script to auto-generate project structure."
)
parser.add_argument(
    "--base_dir", help="Where to initialize project. Defaults to cwd.", required=False
)
parser.add_argument("--name", help="Project name.", required=True)
args = parser.parse_args()
if not args.base_dir:
    base_dir = os.getcwd()
else:
    base_dir = args.base_dir
if not args.name:
    parser.error("Need a project name!")
else:
    project_name = args.name
abs_project_path = os.path.join(base_dir, project_name)
# Make dir structure
print("Setting up project...")
call_string = (
    "mkdir -p "
    + abs_project_path
    + "/{analysis,data,papers,presentations,code,figures,paradigms,.vscode};"
)
sys_call(call_string)
# Create a README file
readme = open(os.path.join(abs_project_path, "README.md"), "w")
readme.write(
    f"""
# {project_name}
---
## Python environment setup
The environment.yml file in this repo can be used to bootstrap a conda environment for
reproducibility:\n
`conda env create -p ./env -f environment.yml`\n
To update the environment file after installing/removing packages: `conda env export --no-builds -f environment.yml`\n\n
To update the environment itself after editing the `environment.yml` file: `conda env update --file environment.yml --prune`\n
"""
)
readme.close()
# Create a gitignore file
ignore_file = open(os.path.join(abs_project_path, ".gitignore"), "w")
ignore_file.write("*.DS_Store\n")
ignore_file.write("*.ipynb_checkpoints\n")
ignore_file.write("*.pyc\n")
ignore_file.write("#Don't commit actual conda env\n")
ignore_file.write("env\n")
ignore_file.write("#Don't commit data, figs, or presentations by default\n")
ignore_file.write("*.csv\n")
ignore_file.write("*.txt\n")
ignore_file.write("*.png\n")
ignore_file.write("*.jpg\n")
ignore_file.write("*.jpeg\n")
ignore_file.write("*.ppt*\n")
ignore_file.write("*.key\n")
ignore_file.close()
# Vscode settings file
vscode_file = open(os.path.join(abs_project_path, ".vscode", "settings.json"), "w")
vscode_file.write("{\n")
vscode_file.write(
    '\t"python.defaultInterpreterPath": "${workspaceFolder}/env/bin/python",\n'
)
vscode_file.write('\t"python.terminal.activateEnvironment": true,\n')
vscode_file.write('\t"editor.formatOnSave": true,\n')
vscode_file.write('\t"python.analysis.extraPaths": ["${workspaceFolder}/code"]\n')
vscode_file.write("}")
vscode_file.close()
# CD to project dir
os.chdir(abs_project_path)
# Create gitkeep files to commit empty dir structure
call_string = r"find * -type d -not -path '*/\.*' -exec touch {}/.gitkeep \;"
sys_call(call_string)
# Initialize repo
sys_call("git init")
# Add stuff
git_add_call_string = "git add .gitignore README.md .vscode analysis code data figures papers paradigms presentations"
# Perform initial commit
sys_call(git_add_call_string)
call_string = "git commit -m 'Initial project commit.'"
sys_call(call_string)
# Optionally install an env-checking pre-commit hook
# (the source path below is machine-specific; flip to True and adjust for your setup)
setup_precommit = False
if setup_precommit:
    print("Setting up pre-commit hook ...")
    sys_call(
        "cp /Users/Esh/Documents/cmd_programs/check_env.sh ./.git/hooks/pre-commit"
    )
# Messages
print(f"New project folder and repo created in:\n\n {abs_project_path}")
print(
    "\nAfter creating a conda environment in ./env (e.g. with envinit) you can"
    " activate it in a terminal using: conda activate ./env\n"
)
# Just copy these to your .zshrc, .bashrc, etc, or source this file from them
savepackagesfromgit() {
    # Get pip packages that were installed from git (pip freeze marks them with git+ URLs)
    PIP_FROM_SOURCE=$(conda run -p ./env pip freeze | grep git)
    if [ -n "$PIP_FROM_SOURCE" ]; then
        echo 'saving pip packages installed from github...'
        # Get just the package names from PIP_FROM_SOURCE
        package_names=$(echo "$PIP_FROM_SOURCE" | sed 's/ @.*//')
        # Rewrite each "name @ git+..." line into a "- git+..." yaml entry
        modified_pip_from_source=$(echo "$PIP_FROM_SOURCE" | sed 's/.* @/ -/')
        # Find the line number of the prefix line in environment.yml
        prefix_line_num=$(grep -n "^prefix:" environment.yml | cut -d: -f1)
        # Split the environment.yml file into two parts: before and after the prefix line
        head -n $(($prefix_line_num - 1)) environment.yml > environment_part1.yml
        tail -n +$prefix_line_num environment.yml > environment_part2.yml
        # Remove lines referring to the same packages as in package_names
        while read -r package_name; do
            grep -vE "^ -* *$package_name(==| @)" environment_part1.yml > environment_part1_filtered.yml
            command mv -f environment_part1_filtered.yml environment_part1.yml
        done <<< "$package_names"
        # Combine the parts with the modified content from PIP_FROM_SOURCE
        cat environment_part1.yml > environment_new.yml
        echo "$modified_pip_from_source" >> environment_new.yml
        cat environment_part2.yml >> environment_new.yml
        # Replace the original environment.yml with the new file
        command mv -f environment_new.yml environment.yml
        # Delete intermediate files
        command rm -f environment_part1.yml
        command rm -f environment_part2.yml
    fi
}
envsave() {
    # Export the active env to environment.yml, rewriting it
    # Also appends pip packages installed via git
    CURRENT_ENV="$CONDA_DEFAULT_ENV"
    if [ "$CURRENT_ENV" = "base" ]; then
        echo "conda base env is active...activate your environment and then rerun envsave"
        return
    fi
    conda env export --no-builds -f environment.yml
    printf "successfully saved environment.yml\n"
    savepackagesfromgit
}
envinit() {
    # Bootstrap from environment.yml if it exists, otherwise create a vanilla env and export it to environment.yml
    if [ -f "environment.yml" ]; then
        if [ -d "./env" ]; then
            echo "Existing ./env folder found...remove and try again"
            return
        fi
        echo "environment.yml found...bootstrapping environment"
        conda env create --file environment.yml -p ./env
        echo "activating environment..."
        envactivate
    else
        echo "no environment.yml found...creating basic environment with python 3.8 and conda-forge"
        conda create -y -p ./env python=3.8 pip pycodestyle black ipykernel -c conda-forge
        echo "activating environment..."
        envactivate
        echo "saving created environment to environment.yml..."
        envsave
    fi
}
# Note this always rewrites environment.yml if there are git+ packages as it doesn't
# check for those!
envcheck() {
    ENV_FILE='environment.yml'
    # Ensure the user doesn't accidentally run this from a shell in which the base environment is active
    CURRENT_ENV="$CONDA_DEFAULT_ENV"
    if [ "$CURRENT_ENV" = "base" ]; then
        echo "conda base env is active...activate your environment and then rerun envcheck"
        return
    fi
    if [ -f "$ENV_FILE" ]; then
        echo "Checking $ENV_FILE against environment for changes..."
        # Get the current environment spec
        CONDA_YAML=$(conda env export --no-builds)
        # Diff it against the existing file
        DIFF=$(echo "$CONDA_YAML" | git diff --no-index -- "$ENV_FILE" -)
        if [ "$DIFF" != "" ]; then
            echo "Changes found...updating $ENV_FILE"
            envsave
        else
            echo "No changes found...$ENV_FILE is up-to-date"
        fi
    else
        echo "no $ENV_FILE file found...creating"
        envsave
    fi
}
envactivate() {
    if [ "$1" != "" ]; then
        conda activate "$1"
    else
        if [ -d "./env" ]; then
            echo "Activating ./env"
            conda activate ./env
        else
            echo "No ./env found. Need to specify which env to activate"
            return
        fi
    fi
}
# Setup some aliases
alias ca='envactivate'
alias cdd='conda deactivate'
alias checkenv='envcheck'
alias initenv='envinit'
alias saveenv='envsave'
alias ce="envcheck"
alias se="envsave"
alias ie="envinit"
alias ae="envactivate"