Overview
1. Functional tools
Operating on single objects
from utilz import do, many, randdf
You can use do to apply a single function or method to an object
df = randdf()
do(lambda df: df.head(), df)
do('head', df)  # syntactic sugar
| | A1 | B1 | C1 |
|---|---|---|---|
| 0 | 0.076077 | 0.990881 | 0.021679 |
| 1 | 0.914026 | 0.688789 | 0.698269 |
| 2 | 0.635191 | 0.337502 | 0.327470 |
| 3 | 0.942193 | 0.767003 | 0.852347 |
| 4 | 0.178692 | 0.494257 | 0.507263 |
| | A1 | B1 | C1 |
|---|---|---|---|
| 0 | 0.076077 | 0.990881 | 0.021679 |
| 1 | 0.914026 | 0.688789 | 0.698269 |
| 2 | 0.635191 | 0.337502 | 0.327470 |
| 3 | 0.942193 | 0.767003 | 0.852347 |
| 4 | 0.178692 | 0.494257 | 0.507263 |
You can pass function or method arguments as well
do('head', df, 10)
| | A1 | B1 | C1 |
|---|---|---|---|
| 0 | 0.076077 | 0.990881 | 0.021679 |
| 1 | 0.914026 | 0.688789 | 0.698269 |
| 2 | 0.635191 | 0.337502 | 0.327470 |
| 3 | 0.942193 | 0.767003 | 0.852347 |
| 4 | 0.178692 | 0.494257 | 0.507263 |
| 5 | 0.999359 | 0.056832 | 0.254085 |
| 6 | 0.802454 | 0.160224 | 0.843747 |
| 7 | 0.602545 | 0.840196 | 0.007152 |
| 8 | 0.801355 | 0.937513 | 0.052925 |
| 9 | 0.218896 | 0.757961 | 0.057891 |
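To make the two calling styles concrete, here is a minimal sketch of how a do-style helper could dispatch on a callable versus a method name. The name do_sketch is hypothetical, and this is not utilz's actual implementation; it only mirrors the behavior shown above and uses plain strings so it runs without pandas.

```python
def do_sketch(func, obj, *args, **kwargs):
    """Apply a callable, or a method named by a string, to obj."""
    if callable(func):
        return func(obj, *args, **kwargs)
    # Assumption: a string is treated as the name of a method on obj
    return getattr(obj, func)(*args, **kwargs)


upper_via_callable = do_sketch(lambda s: s.upper(), "abc")
upper_via_name = do_sketch("upper", "abc")
# Extra positional arguments are forwarded to the method
parts = do_sketch("split", "a,b,c", ",")
```

Both styles return the same result, and any extra arguments are simply forwarded, which is the pattern do('head', df, 10) relies on.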
Use many to apply a sequence of functions independently to an object
results = many(['head', 'tail'], df)
results[0]
results[1]
| | A1 | B1 | C1 |
|---|---|---|---|
| 0 | 0.076077 | 0.990881 | 0.021679 |
| 1 | 0.914026 | 0.688789 | 0.698269 |
| 2 | 0.635191 | 0.337502 | 0.327470 |
| 3 | 0.942193 | 0.767003 | 0.852347 |
| 4 | 0.178692 | 0.494257 | 0.507263 |
| | A1 | B1 | C1 |
|---|---|---|---|
| 5 | 0.999359 | 0.056832 | 0.254085 |
| 6 | 0.802454 | 0.160224 | 0.843747 |
| 7 | 0.602545 | 0.840196 | 0.007152 |
| 8 | 0.801355 | 0.937513 | 0.052925 |
| 9 | 0.218896 | 0.757961 | 0.057891 |
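The idea behind many can be sketched in a few lines: each function sees the original object, and the results come back in the same order as the functions. The helper many_sketch below is a hypothetical stand-in, not the library's code.

```python
def many_sketch(funcs, obj):
    """Apply each function (or method name) to obj independently."""
    results = []
    for func in funcs:
        if callable(func):
            results.append(func(obj))
        else:
            # Assumption: strings name zero-argument methods on obj
            results.append(getattr(obj, func)())
    return results


results = many_sketch(["upper", "lower", len], "Hello")
```

Because no function receives another function's output, the order of funcs only affects the order of the returned list.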
Use compose or pipe to apply a sequence of functions one after another, each receiving the previous function's result
from utilz import compose, pipe
bottom_head = compose(lambda df: df.head(10), lambda df: df.tail(3))
bottom_head(df)
| | A1 | B1 | C1 |
|---|---|---|---|
| 7 | 0.602545 | 0.840196 | 0.007152 |
| 8 | 0.801355 | 0.937513 | 0.052925 |
| 9 | 0.218896 | 0.757961 | 0.057891 |
pipe(
    df,
    lambda df: df.head(10),
    lambda df: df.tail(3),
)
| | A1 | B1 | C1 |
|---|---|---|---|
| 7 | 0.602545 | 0.840196 | 0.007152 |
| 8 | 0.801355 | 0.937513 | 0.052925 |
| 9 | 0.218896 | 0.757961 | 0.057891 |
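As a rough sketch of the chaining shown above, both helpers can be written with functools.reduce. The names pipe_sketch and compose_sketch are hypothetical, and the left-to-right order is an assumption inferred from the outputs above (head(10) runs before tail(3)).

```python
from functools import reduce


def pipe_sketch(obj, *funcs):
    """Feed obj through funcs from left to right."""
    return reduce(lambda acc, func: func(acc), funcs, obj)


def compose_sketch(*funcs):
    """Build a reusable function that pipes its input through funcs."""
    return lambda obj: pipe_sketch(obj, *funcs)


data = list(range(20))
piped = pipe_sketch(data, lambda x: x[:10], lambda x: x[-3:])
bottom_head = compose_sketch(lambda x: x[:10], lambda x: x[-3:])
composed = bottom_head(data)
```

The difference is only in when the data is supplied: pipe takes it immediately, while compose returns a function to call later.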
Use iffy to apply a function only if a predicate function returns true. It takes a checking function, then a function to apply (or a value to return) when the check passes. If the check fails, the original object is returned unchanged
from utilz import iffy
# Apply function
iffy(lambda df: len(df) > 3, lambda df: df.head(3), df)
# Return arbitrary object
iffy(lambda df: len(df) > 3, 'big df', df)
# If check fails just returns the original object
iffy(lambda df: len(df) < 3, lambda df: df.head(3), df)
| | A1 | B1 | C1 |
|---|---|---|---|
| 0 | 0.076077 | 0.990881 | 0.021679 |
| 1 | 0.914026 | 0.688789 | 0.698269 |
| 2 | 0.635191 | 0.337502 | 0.327470 |
'big df'
| | A1 | B1 | C1 |
|---|---|---|---|
| 0 | 0.076077 | 0.990881 | 0.021679 |
| 1 | 0.914026 | 0.688789 | 0.698269 |
| 2 | 0.635191 | 0.337502 | 0.327470 |
| 3 | 0.942193 | 0.767003 | 0.852347 |
| 4 | 0.178692 | 0.494257 | 0.507263 |
| 5 | 0.999359 | 0.056832 | 0.254085 |
| 6 | 0.802454 | 0.160224 | 0.843747 |
| 7 | 0.602545 | 0.840196 | 0.007152 |
| 8 | 0.801355 | 0.937513 | 0.052925 |
| 9 | 0.218896 | 0.757961 | 0.057891 |
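The three cases above (apply a function, return a value, fall through) can be captured in a small sketch. iffy_sketch is a hypothetical name and only illustrates the control flow, not the library's implementation.

```python
def iffy_sketch(predicate, action, obj):
    """Return action(obj) or action itself if predicate(obj) holds, else obj."""
    if not predicate(obj):
        return obj  # check failed: hand back the original object
    return action(obj) if callable(action) else action


values = list(range(10))
applied = iffy_sketch(lambda v: len(v) > 3, lambda v: v[:3], values)
replaced = iffy_sketch(lambda v: len(v) > 3, "big list", values)
untouched = iffy_sketch(lambda v: len(v) < 3, lambda v: v[:3], values)
```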
Operating on iterables
from utilz import map, mapcat
utilz's map is just sugar for list(map(...)). Note that importing it shadows the built-in map:
def myfunc(x):
    return x * 2

map(myfunc, range(10))
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
mapcat will concatenate/flatten the results:
def myfunc(x):
    return [x * 2]

map(myfunc, range(10))
mapcat(myfunc, range(10))
[[0], [2], [4], [6], [8], [10], [12], [14], [16], [18]]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
If None is passed instead of a function, mapcat can be used to flatten nested lists (max 2 levels deep):
mapcat(None, [[1,2,3], [4,5,6], [7]])
[1, 2, 3, 4, 5, 6, 7]
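For list results, the flattening step can be sketched with itertools.chain. mapcat_sketch is a hypothetical name, and it removes exactly one level of nesting, which is an assumption consistent with the "max 2 levels deep" note above.

```python
from itertools import chain


def mapcat_sketch(func, iterable):
    """Map func over iterable (if given), then flatten one level."""
    results = iterable if func is None else map(func, iterable)
    return list(chain.from_iterable(results))


doubled = mapcat_sketch(lambda x: [x * 2], range(5))
flat = mapcat_sketch(None, [[1, 2, 3], [4, 5, 6], [7]])
# Deeper nesting is left intact because only one level is removed
partly_flat = mapcat_sketch(None, [[1, [2, 3]], [4]])
```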
If myfunc returns a dataframe, mapcat will try to concatenate the results by row:
from utilz import randdf
def myfunc(f):
    """simulate loading a 2x3 dataframe from file"""
    return randdf(size=(2,3))

mapcat(myfunc, range(4))
| | A1 | B1 | C1 |
|---|---|---|---|
| 0 | 0.723497 | 0.397081 | 0.477959 |
| 1 | 0.981362 | 0.465690 | 0.505523 |
| 2 | 0.254038 | 0.692296 | 0.589320 |
| 3 | 0.076432 | 0.229396 | 0.183292 |
| 4 | 0.317140 | 0.187555 | 0.451125 |
| 5 | 0.613190 | 0.191327 | 0.634255 |
| 6 | 0.678660 | 0.456217 | 0.492318 |
| 7 | 0.217005 | 0.730834 | 0.310409 |
If myfunc returns an array, mapcat will also try to concatenate the results by default while preserving the output shape. Because myfunc returns a 1d array here, the final result is 2d:
import numpy as np
def myfunc(f):
    """Function that returns 1d array"""
    return np.arange(3)

mapcat(myfunc, range(4))
array([[0, 1, 2],
       [0, 1, 2],
       [0, 1, 2],
       [0, 1, 2]])
This is equivalent to passing concat_axis=1:
mapcat(myfunc, range(4), concat_axis=1)
array([[0, 1, 2],
       [0, 1, 2],
       [0, 1, 2],
       [0, 1, 2]])
You can instead flatten the array by passing concat_axis=0:
mapcat(myfunc, range(4), concat_axis=0)
array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2])
Or stack it along another dimension by passing concat_axis=2, so that each call's result becomes a column of the output:
mapcat(myfunc, range(4), concat_axis=2)
array([[0, 0, 0, 0],
       [1, 1, 1, 1],
       [2, 2, 2, 2]])
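The three output shapes above can be reproduced with plain NumPy calls, which may help when reasoning about what each concat_axis value does. Which NumPy functions utilz uses internally is not stated here, so treat this as an illustration of the shapes only.

```python
import numpy as np

# Four 1d results, as if myfunc had been called four times
pieces = [np.arange(3) for _ in range(4)]

stacked_rows = np.stack(pieces, axis=0)  # one row per result: shape (4, 3)
flattened = np.concatenate(pieces)       # end to end: shape (12,)
stacked_cols = np.stack(pieces, axis=1)  # one column per result: shape (3, 4)
```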
Both map and mapcat support easy parallel looping just by changing the n_jobs argument:
from time import sleep
def myfunc(x):
    """Simulate expensive function"""
    sleep(1)
    return x * 2

map(myfunc, range(10), n_jobs=2)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
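For intuition, an n_jobs-style switch can be sketched with the standard library's thread pool. This is an assumption-laden stand-in: parallel_map_sketch is a hypothetical name, and the backend utilz actually uses is not described here.

```python
from concurrent.futures import ThreadPoolExecutor
from time import sleep


def slow_double(x):
    """Simulate an expensive function with a short sleep."""
    sleep(0.01)
    return x * 2


def parallel_map_sketch(func, iterable, n_jobs=1):
    """Run func over iterable, optionally across n_jobs worker threads."""
    if n_jobs == 1:
        return [func(x) for x in iterable]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Executor.map yields results in input order, not completion order
        return list(pool.map(func, iterable))


results = parallel_map_sketch(slow_double, range(10), n_jobs=2)
```

Threads suit functions that mostly wait (sleep, I/O); CPU-bound work in Python generally needs processes instead.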
You can easily pass the loop index to myfunc by setting enum=True:
# myfunc needs to accept an 'idx' argument
def myfunc(x, idx):
    """Simulate expensive function"""
    sleep(1)
    return x * idx

mapcat(myfunc, range(10), n_jobs=2, enum=True)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
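Conceptually, enum=True behaves like wrapping the loop in enumerate and handing the counter to the function. The sketch below assumes the index is passed as a keyword argument named idx, matching the comment above; map_enum_sketch is a hypothetical name.

```python
def map_enum_sketch(func, iterable):
    """Call func(item, idx=position) for each item in iterable."""
    return [func(item, idx=position) for position, item in enumerate(iterable)]


def times_index(x, idx):
    return x * idx


squares = map_enum_sketch(times_index, range(10))
```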
Likewise if your function uses randomization, you can set the random_state to reproduce parallel runs:
# myfunc needs to accept a 'random_state' argument
def myfunc(x, random_state=None):
    """Simulate expensive function"""
    from utilz import check_random_state
    rng = check_random_state(random_state)
    sleep(1)
    return x * rng.random()

map(myfunc, range(10), n_jobs=2, random_state=1)
[0.0, 0.7026449924443589, 1.3671148797485828, 2.491197036621863, 2.8137601674196255, 4.388425659603134, 5.805394778570049, 4.548025497425891, 2.529163623883595, 4.026815575448365]
Now this second run reproduces the same values:
map(myfunc, range(10), n_jobs=2, random_state=1)
[0.0, 0.7026449924443589, 1.3671148797485828, 2.491197036621863, 2.8137601674196255, 4.388425659603134, 5.805394778570049, 4.548025497425891, 2.529163623883595, 4.026815575448365]
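One common way to get this kind of reproducibility is to derive a separate seed for every task from a single master seed, so results do not depend on which worker runs which task. The sketch below shows that idea with the standard library's random module; it is an assumption about the general technique, not a description of how utilz seeds its workers, and the helper names are hypothetical.

```python
import random


def noisy_double(x, random_state=None):
    """Multiply x by a random factor drawn from a seeded generator."""
    rng = random.Random(random_state)
    return x * rng.random()


def map_seeded_sketch(func, iterable, random_state):
    """Give each task its own seed derived from one master seed."""
    items = list(iterable)
    master = random.Random(random_state)
    seeds = [master.randrange(2**32) for _ in items]
    return [func(item, random_state=seed) for item, seed in zip(items, seeds)]


first_run = map_seeded_sketch(noisy_double, range(10), random_state=1)
second_run = map_seeded_sketch(noisy_double, range(10), random_state=1)
```

Because the per-task seeds are fixed before any work starts, the same approach carries over unchanged when the tasks are dispatched to parallel workers.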
2. Decorators
utilz decorators can be added to any function to provide some convenient information or checks before or after execution. Currently these include:
- expensive: cache a function result to disk and load it on reruns
- log: print shape, size, len of an arg before and after function execution
- maybe: run a function only if a file doesn't exist or a dir isn't empty
- show: print the result of a function in addition to returning it
- timeit: print how long a function took to evaluate
3. Dataframe tools
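To show the general pattern behind decorators like these, here is a minimal sketch of a timing decorator and a result-printing decorator built with functools.wraps. The names timeit_sketch and show_sketch are hypothetical, and the printed format is an assumption; the library's own decorators may behave differently.

```python
import functools
import time


def timeit_sketch(func):
    """Print how long func took, then return its result."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f}s")
        return result
    return wrapper


def show_sketch(func):
    """Print func's result in addition to returning it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        print(result)
        return result
    return wrapper


@timeit_sketch
@show_sketch
def add(a, b):
    return a + b


total = add(2, 3)
```

Decorators stack from the bottom up, so here the result is printed first and the timing line follows.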
Utilz makes working with dataframes a bit easier by offering extra methods without altering core pandas functionality. You don't need to import anything to use these methods. They're automatically available after importing anything from utilz. Currently these include:
- norm_by_group: center, scale, or z-score separately by group
- assert_same_nunique: make sure groups have the same number of unique values in a particular column
- assert_balanced_groups: make sure groups have the same size
Example usage
# No need to import anything!
# Add a group col
df = randdf()
df['group'] = ['A'] * 5 + ['B'] * 5
# This is a new method!
new_df = df.norm_by_group('group', 'A1')
new_df
| | A1 | B1 | C1 | group | A1_normed_by_group |
|---|---|---|---|---|---|
| 0 | 0.897455 | 0.329248 | 0.190562 | A | 1.310156 |
| 1 | 0.411200 | 0.151263 | 0.204226 | A | -1.391970 |
| 2 | 0.670361 | 0.213199 | 0.398662 | A | 0.048193 |
| 3 | 0.590188 | 0.940737 | 0.826784 | A | -0.397329 |
| 4 | 0.739239 | 0.175956 | 0.304016 | A | 0.430950 |
| 5 | 0.708524 | 0.960608 | 0.286470 | B | 0.103200 |
| 6 | 0.851708 | 0.004294 | 0.302206 | B | 0.635292 |
| 7 | 0.309853 | 0.954225 | 0.954408 | B | -1.378318 |
| 8 | 0.535253 | 0.212095 | 0.627933 | B | -0.540699 |
| 9 | 0.998427 | 0.934565 | 0.602804 | B | 1.180524 |
You can use the scale and center args to control whether mean-centering and dividing by the standard deviation are done (both default to True). This will also change the generated column name appropriately:
new_df.norm_by_group('group', 'A1', scale=False)
| | A1 | B1 | C1 | group | A1_normed_by_group | A1_centered_by_group |
|---|---|---|---|---|---|---|
| 0 | 0.897455 | 0.329248 | 0.190562 | A | 1.310156 | 0.235766 |
| 1 | 0.411200 | 0.151263 | 0.204226 | A | -1.391970 | -0.250488 |
| 2 | 0.670361 | 0.213199 | 0.398662 | A | 0.048193 | 0.008672 |
| 3 | 0.590188 | 0.940737 | 0.826784 | A | -0.397329 | -0.071500 |
| 4 | 0.739239 | 0.175956 | 0.304016 | A | 0.430950 | 0.077551 |
| 5 | 0.708524 | 0.960608 | 0.286470 | B | 0.103200 | 0.027771 |
| 6 | 0.851708 | 0.004294 | 0.302206 | B | 0.635292 | 0.170955 |
| 7 | 0.309853 | 0.954225 | 0.954408 | B | -1.378318 | -0.370900 |
| 8 | 0.535253 | 0.212095 | 0.627933 | B | -0.540699 | -0.145500 |
| 9 | 0.998427 | 0.934565 | 0.602804 | B | 1.180524 | 0.317675 |
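The arithmetic behind these columns can be checked by hand. The sketch below recomputes them from the A1 values in the table using only the standard library; norm_by_group_sketch is a hypothetical helper, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption that happens to match the numbers shown. In pandas, a comparable result can be obtained with groupby(...).transform(...).

```python
from collections import defaultdict
from statistics import mean, stdev


def norm_by_group_sketch(values, groups, center=True, scale=True):
    """Center and/or scale each value using its own group's statistics."""
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    # stdev is the sample standard deviation (assumption, see lead-in)
    stats = {g: (mean(v), stdev(v)) for g, v in by_group.items()}

    normed = []
    for value, group in zip(values, groups):
        group_mean, group_sd = stats[group]
        if center:
            value = value - group_mean
        if scale:
            value = value / group_sd
        normed.append(value)
    return normed


# A1 column and group labels copied from the table above
a1 = [0.897455, 0.411200, 0.670361, 0.590188, 0.739239,
      0.708524, 0.851708, 0.309853, 0.535253, 0.998427]
groups = ["A"] * 5 + ["B"] * 5

z_scored = norm_by_group_sketch(a1, groups)
centered = norm_by_group_sketch(a1, groups, scale=False)
```

The first entries come out close to 1.310156 and 0.235766, agreeing with A1_normed_by_group and A1_centered_by_group up to rounding of the displayed inputs.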