Reproducible scientific Python environments with conda
This is a brief explanation of a workflow that I’ve been using for research/data-science projects in Python. It makes use of conda
environments co-located with project files. This meets several key criteria that I was looking for:
- Environments are easily recreatable, meaning less worry about ever borking things
- Reproducible workspace across different machines
- Dependencies placed under version control for open-science and collaboration
- “Portable” environments that are easy to move around like normal folders
Below are the key steps to use this setup.
Note This post replaces an earlier version I was drafting using Craft. I took down that post from this site, but you can still access the draft version here. The commands below were run on macOS and should work similarly on other *nix-y systems (e.g. Ubuntu and Windows Subsystem for Linux).
Setting up Anaconda or Miniconda
You can use this link to grab the latest Miniconda (on macOS you want the bash script). Or you can use the link at the bottom of this page to download Anaconda instead. Miniconda is a bit faster because it's more bare-bones, while Anaconda includes some default packages. Once you've downloaded either of those files you'll need to open up a Terminal and `cd` to the location of the file (probably `Downloads` or `Desktop`). From there you'll need to run the installer and follow the prompts, which you can do by typing `bash fileYouDownloaded.sh`.
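For example, with a recent Miniconda installer (the actual filename will differ depending on the version you grabbed and your CPU architecture):
# Move to wherever the installer ended up
cd ~/Downloads
# Run the installer and follow the prompts
bash Miniconda3-latest-MacOSX-x86_64.sh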
You can verify the installation worked by asking your system what Python it sees now using `which python`. If everything worked it should point to the Python installed inside your `anaconda3` or `miniconda3` directory.
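For example, with Miniconda installed in the default location you'd see something along these lines (the exact path depends on your username and where you installed it):
which python
# e.g. /Users/yourname/miniconda3/bin/python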
Creating a new environment for each project
Most guides will tell you to create a new named environment using the `-n/--name` flag to `conda create`. But a more reproducible setup is to create a local environment within your project folder. Let's say you have a project folder called `myproject/`. The command below creates a new environment in a sub-directory called `env`. It installs Python 3.8 and `pip` (for libraries on PyPI), and specifically uses the conda-forge channel for grabbing them:
# From within myproject/
conda create -p ./env python=3.8 pip -c conda-forge
You should now see a new `env/` directory. You can activate the environment by pointing to it: `conda activate ./env`. You should not commit this folder to version control as it can be quite large depending on how complex your project requirements get, so make sure to `echo 'env/' >> .gitignore`.
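Putting that together:
# Activate the project-local environment
conda activate ./env
# Keep the actual env contents out of version control
echo 'env/' >> .gitignore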
Backing up and restoring your environment
This is the critical piece that makes this setup work: the `environment.yml` file. This file is a recipe for rebuilding your environment in a platform-independent way. To create this recipe run the following command:
# Export the environment recipe to a file called environment.yml
conda env export --no-builds -f environment.yml
You should rerun this command any time you install or uninstall new libraries and packages. You should also commit those changes to version control:
git add environment.yml
git commit -m "saved environment"
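For reference, the exported file looks roughly like the sketch below; the name, channels, pinned versions, and prefix path will all depend on your particular setup:
# environment.yml (illustrative only)
name: env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip
  - numpy=1.22.3
  - pip:
      - somepackage==0.1.0
prefix: /path/to/myproject/env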
To restore an environment from this file (e.g. if you or a collaborator are working on another machine or you break something) just do:
# Make sure the env isn't active
conda deactivate
# Delete the env folder
rm -r env
# Create a new env using the spec in environment.yml
conda env create -p ./env -f environment.yml
You can also make sure your environment is in sync with `environment.yml` by running the following command, which will install and uninstall dependencies as needed:
conda env update -p ./env -f environment.yml --prune
A few closing suggestions
Make sure you know what environment is active before running `conda install` commands
Make sure that whenever you want to add or remove a package from a project environment (e.g. `myproject`), you `conda activate ./env` first. If you don't, you'll accidentally be adding and removing things from your `base` environment!
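A quick way to double-check before installing anything:
# The active environment is marked with a * in this list
conda env list
# Or inspect the variable conda sets on activation
echo $CONDA_DEFAULT_ENV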
Be careful about mixing and matching conda channels!
In the snippet above I used the `-c` flag so that conda installs from the `conda-forge` channel. By default `conda` will install from the `defaults` channel, which points to anaconda.org, whereas `conda-forge` points to conda-forge.org.
In general, you can save yourself a lot of headaches by simply sticking with the same channel for installing everything. For example, if I installed numpy using `conda install -c conda-forge numpy`, then it's a good idea to keep using `conda-forge` for other packages I want like `pandas`: `conda install -c conda-forge pandas`, rather than `conda install pandas`, which is equivalent to `conda install -c defaults pandas`.
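If you want to make that preference stick, one option is to record it in the project environment itself so conda prefers conda-forge without the `-c` flag every time (this writes a `.condarc` scoped to the currently active env):
# Run these with ./env active
conda config --env --add channels conda-forge
conda config --env --set channel_priority strict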
There might be times when you can't avoid mixing and matching, but it's a good heuristic to help avoid the dreaded "environment conflict" messages that you might encounter otherwise. I have been a long-time `defaults` user because of one or two less-than-pleasant experiences with `conda-forge` several years ago. Plus there used to be a significant speed difference between builds that use Intel's MKL (Math Kernel Library) and those that use OpenBLAS (an open implementation of the Basic Linear Algebra Subprograms) to power libraries like `numpy`. But lately most of that seems to have changed, and for the last few projects I've been exclusively preferring `conda-forge` for its breadth and especially for any R-related dependencies.
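If you're curious which backend a given environment actually ended up with, numpy can tell you:
# Prints the BLAS/LAPACK libraries numpy was built against (MKL vs OpenBLAS)
python -c "import numpy; numpy.show_config()"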
Optional extra automation
If you're interested in automating this workflow a bit, I made a few bash functions and aliases that you should be able to drop into a `.zshrc` or `.bashrc` file. Specifically:
- `envinit()` for creating a brand new `env/` with a few basic libs and exporting its `environment.yml`
- `envsave()` which exports the current `env` to `environment.yml` and also appends `pip` packages installed from something other than PyPI, e.g. GitHub or locally from source. That's because currently conda doesn't install these packages when recreating an environment, so you'll have to install them manually with `pip`
- `envcheck()` which checks if `envsave` needs to be run and does so. Useful as a git pre-commit hook
- `envactivate()` which basically "overloads" `conda activate` to prefer a `./env` if one exists in the current directory
- `newproject.py` a Python script that bootstraps a folder structure I often use while also setting up a VSCode Workspace and git repo
When starting a new project I’ll usually do something like this:
# Create new project scaffold using the `newproject` script at the end of this post
newproject --name coolscience
# Create a new conda env using an alias for the `envinit()` bash function
cd coolscience
ie
# Set up version control for the env recipe
echo 'env' >> .gitignore
git add environment.yml
git commit -m "saved environment"
This creates the following project structure and gives me a Python environment ready to go with some reasonable defaults and editor setup:
# contents of coolscience/
|-- analysis/
|-- code/
|-- data/
|-- docs/
|-- env/ # actual env contents
|-- figures/
|-- presentations/
|-- LICENSE
|-- README.md
|-- environment.yml # env recipe
|-- .vscode/
|-- .gitignore
As I continue working I’ll do the following:
- whenever I first `cd` into `coolscience` I'll use `ca` to activate the environment installed in `coolscience/env`
- I'll `ce` to check if my `environment.yml` needs to be updated with any new packages I `conda install`-ed or removed
- If I remember I'll use `se` whenever I `conda install` something. But honestly I forget a lot so end up using `ce`
Note As mentioned above, the `environment.yml` generated by `envsave()` and `envcheck()` will also output installed `pip` packages for convenience. However, any pip packages installed from source or after `git clone`-ing something will need to be reinstalled manually whenever you recreate the environment. The aliases will print out a message indicating whenever you're in that situation.
And the aliases/functions/scripts are here:
#!/usr/bin/env python
"""
Scaffold a project directory quickly and put it under version control with reasonable defaults.
Assumes python and git are available.
"""
import os
from subprocess import check_output, STDOUT
import argparse


def sys_call(cmd):
    # Run a shell command and return its combined stdout/stderr
    return check_output(cmd, shell=True, stderr=STDOUT)


parser = argparse.ArgumentParser(
    description="script to auto-generate project structure."
)
parser.add_argument(
    "--base_dir", help="Where to initialize project. Defaults to cwd.", required=False
)
parser.add_argument("--name", help="Project name.", required=True)
args = parser.parse_args()

if not args.base_dir:
    base_dir = os.getcwd()
else:
    base_dir = args.base_dir
if not args.name:
    parser.error("Need a project name!")
else:
    project_name = args.name
abs_project_path = os.path.join(base_dir, project_name)

# NOTE: these toggles are referenced below but weren't defined in the original script;
# flip them to True to copy a pre-commit hook and print the extra environment message
setup_precommit = False
setup_py = False

# Make dir structure
print("Setting up project...")
call_string = (
    "mkdir -p "
    + abs_project_path
    + "/{analysis,data,papers,presentations,code,figures,paradigms,.vscode};"
)
sys_call(call_string)

# Create a README file
readme = open(os.path.join(abs_project_path, "README.md"), "w")
readme.write(
    f"""
# {project_name}
---
## Python environment setup
The environment.yml file in this repo can be used to bootstrap a conda environment for
reproducibility:\n
`conda env create -p ./env -f environment.yml`\n
To update the environment file after installing/removing packages: `conda env export --no-builds -f environment.yml`\n\n
To update the environment itself after editing the `environment.yml` file: `conda env update --file environment.yml --prune`\n
"""
)
readme.close()

# Create a gitignore file
ignore_file = open(os.path.join(abs_project_path, ".gitignore"), "w")
ignore_file.write("*.DS_Store\n")
ignore_file.write("*.ipynb_checkpoints\n")
ignore_file.write("*.pyc\n")
ignore_file.write("# Don't commit actual conda env\n")
ignore_file.write("env\n")
ignore_file.write("# Don't commit data, figs, or presentations by default\n")
ignore_file.write("*.csv\n")
ignore_file.write("*.txt\n")
ignore_file.write("*.png\n")
ignore_file.write("*.jpg\n")
ignore_file.write("*.jpeg\n")
ignore_file.write("*.ppt*\n")
ignore_file.write("*.key\n")
ignore_file.close()

# Vscode settings file
vscode_file = open(os.path.join(abs_project_path, ".vscode", "settings.json"), "w")
vscode_file.write("{\n")
vscode_file.write(
    '\t"python.defaultInterpreterPath": "${workspaceFolder}/env/bin/python",\n'
)
vscode_file.write('\t"python.terminal.activateEnvironment": true,\n')
vscode_file.write('\t"editor.formatOnSave": true,\n')
vscode_file.write('\t"python.analysis.extraPaths": ["${workspaceFolder}/code"]\n')
vscode_file.write("}")
vscode_file.close()

# CD to project dir
os.chdir(abs_project_path)

# Create gitkeep files to commit empty dir structure
call_string = r"find * -type d -not -path '*/\.*' -exec touch {}/.gitkeep \;"
sys_call(call_string)

# Initialize repo
sys_call("git init")

# Add stuff
git_add_call_string = "git add .gitignore README.md .vscode analysis code data figures papers paradigms presentations"
sys_call(git_add_call_string)

# Perform initial commit
call_string = "git commit -m 'Initial project commit.'"
sys_call(call_string)

if setup_precommit:
    print("Setting up pre-commit hook ...")
    sys_call(
        "cp /Users/Esh/Documents/cmd_programs/check_env.sh ./.git/hooks/pre-commit"
    )

# Messages
print(f"New project folder and repo created in:\n\n {abs_project_path}")
if setup_py:
    print(
        """\nNew python environment created in ./env with environment.yml packages\n\nYou can activate this environment in a terminal using: conda activate ./env\n"""
    )
# Just copy these to your .zshrc, .bashrc, etc, or source this file from them

savepackagesfromgit() {
  # Get pip installed packages from github
  PIP_FROM_SOURCE=$(conda run -p ./env pip freeze | grep git)
  if [ -n "$PIP_FROM_SOURCE" ]; then
    echo 'saving pip packages installed from github...'
    # Get package names from PIP_FROM_SOURCE
    package_names=$(echo "$PIP_FROM_SOURCE" | sed 's/ @.*//')
    # Read the PIP_FROM_SOURCE variable and replace @ - git+..
    modified_pip_from_source=$(echo "$PIP_FROM_SOURCE" | sed 's/.* @/ -/')
    # Find the line number of the prefix line in environment.yml
    prefix_line_num=$(grep -n "^prefix:" environment.yml | cut -d: -f1)
    # Split the environment.yml file into two parts: before and after the prefix line
    head -n $(($prefix_line_num - 1)) environment.yml > environment_part1.yml
    tail -n +$prefix_line_num environment.yml > environment_part2.yml
    # Remove lines containing the same names as in the package_names variable
    while read -r package_name; do
      grep -vE "^ -* *$package_name(==| @)" environment_part1.yml > environment_part1_filtered.yml
      command mv -f environment_part1_filtered.yml environment_part1.yml
    done <<< "$package_names"
    # Combine the parts with the modified content from PIP_FROM_SOURCE
    cat environment_part1.yml > environment_new.yml
    echo "$modified_pip_from_source" >> environment_new.yml
    cat environment_part2.yml >> environment_new.yml
    # Replace the original environment.yml with the new file
    command mv -f environment_new.yml environment.yml
    # Delete intermediate files
    command rm -f environment_part1.yml
    command rm -f environment_part2.yml
  fi
}

envsave() {
  # Export to environment.yml rewriting it
  # Also adds pip packages installed via git
  CURRENT_ENV="$CONDA_DEFAULT_ENV"
  if [ "$CURRENT_ENV" = "base" ]; then
    echo "conda base env is active...activate your environment and then rerun envsave"
    return
  fi
  conda env export --no-builds -f environment.yml
  printf "successfully saved environment.yml\n"
  savepackagesfromgit
}

envinit() {
  # Bootstrap from environment.yml if it exists otherwise create a vanilla env and export it to environment.yml
  if [ -f "environment.yml" ]; then
    if [ -d "./env" ]; then
      echo "Existing ./env folder found...remove and try again"
      return
    fi
    echo "environment.yml found...bootstrapping environment"
    conda env create --file environment.yml -p ./env
    echo "activating environment..."
    envactivate
  else
    echo "no environment.yml found...creating basic environment with python 3.8 and conda-forge"
    conda create -y -p ./env python=3.8 pip pycodestyle black ipykernel -c conda-forge
    echo "activating environment..."
    envactivate
    echo "saving created environment to environment.yml..."
    envsave
  fi
}

# Note this always rewrites environment.yml if there are git+ packages as it doesn't
# check for those!
envcheck() {
  ENV_FILE='environment.yml'
  # Ensure the user doesn't accidentally run this from a shell in which the base environment is active
  CURRENT_ENV="$CONDA_DEFAULT_ENV"
  if [ "$CURRENT_ENV" = "base" ]; then
    echo "conda base env is active...activate your environment and then rerun envcheck"
    return
  fi
  if [ -f "$ENV_FILE" ]; then
    echo "Checking $ENV_FILE against environment for changes..."
    # Get current environment list
    CONDA_YAML=$(conda env export --no-builds)
    # Get diff with existing file
    DIFF=$(echo "$CONDA_YAML" | git diff --no-index -- "$ENV_FILE" -)
    if [ "$DIFF" != "" ]; then
      echo "Changes found...updating $ENV_FILE"
      envsave
    else
      echo "No changes found...$ENV_FILE is up-to-date"
    fi
  else
    echo "no $ENV_FILE file found creating..."
    envsave
  fi
}

envactivate() {
  if [ "$1" != "" ]; then
    conda activate "$1"
  else
    if [ -d "./env" ]; then
      echo "Activating ./env"
      conda activate ./env
    else
      echo "No ./env found. Need to specify which env to activate"
      return
    fi
  fi
}

# Setup some aliases
alias ca='envactivate'
alias cdd='conda deactivate'
alias checkenv='envcheck'
alias initenv='envinit'
alias saveenv='envsave'
alias ce="envcheck"
alias se="envsave"
alias ie="envinit"
alias ae="envactivate"