Overview¶
1. Functional tools¶
Operating on single objects¶
from utilz import do, many, randdf
You can use do to apply a single function or method to an object:
df = randdf()
do(lambda df: df.head(), df)
do('head', df) # syntactic sugar
|   | A1 | B1 | C1 |
|---|---|---|---|
0 | 0.076077 | 0.990881 | 0.021679 |
1 | 0.914026 | 0.688789 | 0.698269 |
2 | 0.635191 | 0.337502 | 0.327470 |
3 | 0.942193 | 0.767003 | 0.852347 |
4 | 0.178692 | 0.494257 | 0.507263 |
|   | A1 | B1 | C1 |
|---|---|---|---|
0 | 0.076077 | 0.990881 | 0.021679 |
1 | 0.914026 | 0.688789 | 0.698269 |
2 | 0.635191 | 0.337502 | 0.327470 |
3 | 0.942193 | 0.767003 | 0.852347 |
4 | 0.178692 | 0.494257 | 0.507263 |
You can pass function or method arguments as well:
do('head', df, 10)
|   | A1 | B1 | C1 |
|---|---|---|---|
0 | 0.076077 | 0.990881 | 0.021679 |
1 | 0.914026 | 0.688789 | 0.698269 |
2 | 0.635191 | 0.337502 | 0.327470 |
3 | 0.942193 | 0.767003 | 0.852347 |
4 | 0.178692 | 0.494257 | 0.507263 |
5 | 0.999359 | 0.056832 | 0.254085 |
6 | 0.802454 | 0.160224 | 0.843747 |
7 | 0.602545 | 0.840196 | 0.007152 |
8 | 0.801355 | 0.937513 | 0.052925 |
9 | 0.218896 | 0.757961 | 0.057891 |
Use many to apply a sequence of functions independently to an object:
results = many(['head', 'tail'], df)
results[0]
results[1]
|   | A1 | B1 | C1 |
|---|---|---|---|
0 | 0.076077 | 0.990881 | 0.021679 |
1 | 0.914026 | 0.688789 | 0.698269 |
2 | 0.635191 | 0.337502 | 0.327470 |
3 | 0.942193 | 0.767003 | 0.852347 |
4 | 0.178692 | 0.494257 | 0.507263 |
|   | A1 | B1 | C1 |
|---|---|---|---|
5 | 0.999359 | 0.056832 | 0.254085 |
6 | 0.802454 | 0.160224 | 0.843747 |
7 | 0.602545 | 0.840196 | 0.007152 |
8 | 0.801355 | 0.937513 | 0.052925 |
9 | 0.218896 | 0.757961 | 0.057891 |
Use compose or pipe to apply a sequence of functions in a row:
from utilz import compose, pipe
bottom_head = compose(lambda df: df.head(10), lambda df: df.tail(3))
bottom_head(df)
|   | A1 | B1 | C1 |
|---|---|---|---|
7 | 0.602545 | 0.840196 | 0.007152 |
8 | 0.801355 | 0.937513 | 0.052925 |
9 | 0.218896 | 0.757961 | 0.057891 |
pipe(df,
    lambda df: df.head(10),
    lambda df: df.tail(3)
)
|   | A1 | B1 | C1 |
|---|---|---|---|
7 | 0.602545 | 0.840196 | 0.007152 |
8 | 0.801355 | 0.937513 | 0.052925 |
9 | 0.218896 | 0.757961 | 0.057891 |
Use iffy to apply a function only if a predicate is true. It takes a predicate function, followed by a function to apply (or a value to return) when that predicate is true:
from utilz import iffy
# Apply function
iffy(lambda df: len(df) > 3, lambda df: df.head(3), df)
# Return arbitrary object
iffy(lambda df: len(df) > 3, 'big df', df)
# If check fails just returns the original object
iffy(lambda df: len(df) < 3, lambda df: df.head(3), df)
|   | A1 | B1 | C1 |
|---|---|---|---|
0 | 0.076077 | 0.990881 | 0.021679 |
1 | 0.914026 | 0.688789 | 0.698269 |
2 | 0.635191 | 0.337502 | 0.327470 |
'big df'
|   | A1 | B1 | C1 |
|---|---|---|---|
0 | 0.076077 | 0.990881 | 0.021679 |
1 | 0.914026 | 0.688789 | 0.698269 |
2 | 0.635191 | 0.337502 | 0.327470 |
3 | 0.942193 | 0.767003 | 0.852347 |
4 | 0.178692 | 0.494257 | 0.507263 |
5 | 0.999359 | 0.056832 | 0.254085 |
6 | 0.802454 | 0.160224 | 0.843747 |
7 | 0.602545 | 0.840196 | 0.007152 |
8 | 0.801355 | 0.937513 | 0.052925 |
9 | 0.218896 | 0.757961 | 0.057891 |
Operating on iterables¶
from utilz import map, mapcat
map is just sugar for list(map()):
def myfunc(x):
    return x * 2
map(myfunc, range(10))
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
mapcat will concatenate/flatten results:
def myfunc(x):
    return [x * 2]
map(myfunc, range(10))
mapcat(myfunc, range(10))
[[0], [2], [4], [6], [8], [10], [12], [14], [16], [18]]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
If myfunc is None, mapcat can be used to flatten nested lists (max 2 levels deep):
mapcat(None, [[1,2,3], [4,5,6], [7]])
[1, 2, 3, 4, 5, 6, 7]
If myfunc returns a dataframe, mapcat will try to concatenate the results by row:
from utilz import randdf
def myfunc(f):
    """simulate loading a 2x3 dataframe from file"""
    return randdf(size=(2,3))
mapcat(myfunc, range(4))
|   | A1 | B1 | C1 |
|---|---|---|---|
0 | 0.723497 | 0.397081 | 0.477959 |
1 | 0.981362 | 0.465690 | 0.505523 |
2 | 0.254038 | 0.692296 | 0.589320 |
3 | 0.076432 | 0.229396 | 0.183292 |
4 | 0.317140 | 0.187555 | 0.451125 |
5 | 0.613190 | 0.191327 | 0.634255 |
6 | 0.678660 | 0.456217 | 0.492318 |
7 | 0.217005 | 0.730834 | 0.310409 |
If your myfunc returns an array, mapcat will also try to concatenate the results by default, while preserving the output shape. Because myfunc returns a 1d array, the final result is 2d:
import numpy as np
def myfunc(f):
    """Function that returns 1d array"""
    return np.arange(3)
mapcat(myfunc, range(4))
array([[0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2]])
This is equivalent to passing concat_axis=1:
mapcat(myfunc, range(4), concat_axis=1)
array([[0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2]])
You can instead flatten the array by passing concat_axis=0:
mapcat(myfunc, range(4), concat_axis=0)
array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2])
Or stack it in a 3rd dimension by passing concat_axis=2:
mapcat(myfunc, range(4), concat_axis=2)
array([[0, 0, 0, 0], [1, 1, 1, 1], [2, 2, 2, 2]])
Both map and mapcat support easy parallel looping just by changing the n_jobs argument:
from time import sleep
def myfunc(x):
    """Simulate expensive function"""
    sleep(1)
    return x * 2
map(myfunc, range(10), n_jobs=2)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
You can easily pass the loop index to myfunc by setting enum=True:
# myfunc needs to accept an 'idx' argument
def myfunc(x, idx):
    """Simulate expensive function"""
    sleep(1)
    return x * idx
mapcat(myfunc, range(10), n_jobs=2, enum=True)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Likewise, if your function uses randomization, you can set the random_state to reproduce parallel runs:
# myfunc needs to accept a 'random_state' argument
def myfunc(x, random_state=None):
    """Simulate expensive function"""
    from utilz import check_random_state
    rng = check_random_state(random_state)
    sleep(1)
    return x * rng.random()
map(myfunc, range(10), n_jobs=2, random_state=1)
[0.0, 0.7026449924443589, 1.3671148797485828, 2.491197036621863, 2.8137601674196255, 4.388425659603134, 5.805394778570049, 4.548025497425891, 2.529163623883595, 4.026815575448365]
Now this second run reproduces the same values:
map(myfunc, range(10), n_jobs=2, random_state=1)
[0.0, 0.7026449924443589, 1.3671148797485828, 2.491197036621863, 2.8137601674196255, 4.388425659603134, 5.805394778570049, 4.548025497425891, 2.529163623883595, 4.026815575448365]
2. Decorators¶
utilz decorators can be added to any function to provide some convenient information or checks before or after execution. Currently these include (a short usage sketch follows the list):

- expensive: cache a function result to disk and load it on reruns
- log: print shape, size, len of an arg before and after function execution
- maybe: run a function only if a file doesn't exist or a dir isn't empty
- show: print the result of a function in addition to returning it
- timeit: print how long a function took to evaluate
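As a quick illustration, here's a minimal sketch of stacking a couple of these on a function. It assumes show and timeit can be imported directly from utilz and applied as plain decorators; check the API reference for the exact arguments each decorator accepts.

from utilz import show, timeit

# Assumed usage (sketch): utilz decorators stack like ordinary Python decorators
@timeit  # print how long the call took
@show    # print the result in addition to returning it
def double(x):
    return x * 2

doubled = double(21)  # prints timing info and the result, then returns 42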
3. Dataframe tools¶
Utilz makes working with dataframes a bit easier by offering extra methods without altering core pandas functionality. You don't need to import anything to use these methods. They're automatically available after importing anything from utilz. Currently these include (a rough sketch of the assert helpers follows the list):

- norm_by_group: center, scale, or z-score separately by group
- assert_same_nunique: make sure groups have the same number of unique values in a particular column
- assert_balanced_groups: make sure groups have the same size
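The exact signatures of the assert helpers aren't shown in this overview, so the following is only a rough sketch under the assumption that, like norm_by_group in the example below, they are called as dataframe methods that take a grouping column (plus a target column for assert_same_nunique):

df = randdf()
df['group'] = ['A'] * 5 + ['B'] * 5

# Hypothetical call: raise if the groups defined by 'group' differ in size
df.assert_balanced_groups('group')

# Hypothetical call: raise if groups differ in the number of unique 'A1' values
df.assert_same_nunique('group', 'A1')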
Example usage¶
# No need to import anything!
# Add a group col
df = randdf()
df['group'] = ['A'] * 5 + ['B'] * 5
# This is a new method!
new_df = df.norm_by_group('group', 'A1')
new_df
|   | A1 | B1 | C1 | group | A1_normed_by_group |
|---|---|---|---|---|---|
| 0 | 0.045894 | 0.093716 | 0.932221 | A | … |
| 1 | 0.738293 | 0.249943 | 0.518687 | A | … |
| 2 | 0.357182 | 0.454217 | 0.575472 | A | … |
| 3 | 0.289010 | 0.453426 | 0.211871 | A | … |
| 4 | 0.328628 | 0.396641 | 0.041587 | A | … |
| 5 | 0.481833 | 0.394005 | 0.503150 | B | … |
| 6 | 0.430750 | 0.769627 | 0.838887 | B | … |
| 7 | 0.882731 | 0.122181 | 0.393370 | B | … |
| 8 | 0.622302 | 0.943480 | 0.715790 | B | … |
| 9 | 0.419627 | 0.882003 | 0.629938 | B | … |
You can use the scale and center args to control whether mean-centering and dividing by standard deviation are done (both default to True). This will also change the generated column name appropriately:
new_df.norm_by_group('group', 'A1', scale=False)
|   | A1 | B1 | C1 | group | A1_normed_by_group | A1_centered_by_group |
|---|---|---|---|---|---|---|
| 0 | 0.045894 | 0.093716 | 0.932221 | A | … | … |
| 1 | 0.738293 | 0.249943 | 0.518687 | A | … | … |
| 2 | 0.357182 | 0.454217 | 0.575472 | A | … | … |
| 3 | 0.289010 | 0.453426 | 0.211871 | A | … | … |
| 4 | 0.328628 | 0.396641 | 0.041587 | A | … | … |
| 5 | 0.481833 | 0.394005 | 0.503150 | B | … | … |
| 6 | 0.430750 | 0.769627 | 0.838887 | B | … | … |
| 7 | 0.882731 | 0.122181 | 0.393370 | B | … | … |
| 8 | 0.622302 | 0.943480 | 0.715790 | B | … | … |
| 9 | 0.419627 | 0.882003 | 0.629938 | B | … | … |