Functional/Verb-base Data Manipulation¶
utilz.dfverbs
provides a module that offers dplyr-like functionality on top of pandas dataframes. The convention is to import it as an alias: import utilz.dfverbs as _
. In recent versions of the ipython kernel, _
isn't overwritten with recent interaction evaluations so its safe to use this alias.
Most pandas methods are accessible and function exactly like calling them directly on a dataframe, i.e.
_.head() == df.head()
However there are additional functions that mimic behavior from the tidyverse. These include:
_.query()
for filtering/selecting rows based on their values*_.sort()
for re-ordering rows_.select()
for picking columns by name_.mutate()
for creating new columns from existing columns with functions_.transmute()
just like mutate but only returns the new columns
_.summarize()
for collapsing many values down to single summary_.groupby
for splitting data into sub-datasets
*_.query
is the only function differs in its behavior even though it has the same name as df.query
. It supports everything df.query
does and df.loc
. That means you can pass in a (1) query string, (2) boolean array/series (3) index (4) callable function that accepts the entire dataframe as an argument
from utilz import pipe, tweak
import utilz.dfverbs as _
df = _.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/raw/mpg.csv')
pipe(
df,
_.head(),
# pipe both returns AND shows output by default, since we're not assigning the
# output of pipe to another variable we don't need to show it as well
show = False
)
mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name | |
---|---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
We can do all the fun data manipulation stuff that R
folks can like easily creating and extracting new columns. utilz
always returns dataframes for consistency even if the output contains a single column:
out = pipe(
df,
_.groupby('cylinders'),
_.transmute(displacement_centered_by_cylinders='displacement - displacement.mean()')
)
displacement_centered_by_cylinders | |
---|---|
0 | -38.009709 |
1 | 4.990291 |
2 | -27.009709 |
3 | -41.009709 |
4 | -43.009709 |
... | ... |
393 | 30.203431 |
394 | -12.796569 |
395 | 25.203431 |
396 | 10.203431 |
397 | 9.203431 |
398 rows × 1 columns
Here's a fun one: Plotting the relationship between a car's weight and average acceleration and stratifying on whether the car was made before or after 1976:
out = pipe(
df,
_.groupby("model_year"),
_.summarize(avg_accel="acceleration.mean()", avg_weight="weight.mean()"),
_.mutate(
model_year=lambda model_year: model_year.apply(
lambda year: "early" if year < 76 else "late"
)
),
_.scatterplot(x="avg_weight", y="avg_accel", hue="model_year"),
# Swiss army knife plot tweaking function
tweak(
despine=True,
title="Relationship between car weight and acceleration",
ylabel = 'Acceration (avg mph)',
xlabel = 'Weight (avg lbs)',
ylim=(14.0, 17.0),
legend_title='Model Year',
legend_frame=False,
legend_labels=['Early', 'Late'],
title_fontsize = 18,
legend_title_fontsize = 14,
xlabel_fontsize= 14,
ylabel_fontsize= 14,
xticklabel_fontsize=12,
yticklabel_fontsize=12,
title_y=1.05,
legend_labels_fontsize=12,
),
show=True,
)
model_year | avg_accel | avg_weight | |
---|---|---|---|
0 | early | 12.948276 | 3372.793103 |
1 | early | 15.142857 | 2995.428571 |
2 | early | 15.125000 | 3237.714286 |
3 | early | 14.312500 | 3419.025000 |
4 | early | 16.203704 | 2877.925926 |
5 | early | 16.050000 | 3176.800000 |
6 | late | 15.941176 | 3078.735294 |
7 | late | 15.435714 | 2997.357143 |
8 | late | 15.805556 | 2861.805556 |
9 | late | 15.813793 | 3055.344828 |
10 | late | 16.934483 | 2436.655172 |
11 | late | 16.306897 | 2522.931034 |
12 | late | 16.638710 | 2453.548387 |