Overview¶
1. Functional tools¶
Operating on single objects¶
from utilz import do, many, randdf
You can use do to apply a single function or method to an object:
df = randdf()
do(lambda df: df.head(), df)
do('head', df) # syntactic sugar
|   | A1 | B1 | C1 |
|---|---|---|---|
0 | 0.076077 | 0.990881 | 0.021679 |
1 | 0.914026 | 0.688789 | 0.698269 |
2 | 0.635191 | 0.337502 | 0.327470 |
3 | 0.942193 | 0.767003 | 0.852347 |
4 | 0.178692 | 0.494257 | 0.507263 |
|   | A1 | B1 | C1 |
|---|---|---|---|
0 | 0.076077 | 0.990881 | 0.021679 |
1 | 0.914026 | 0.688789 | 0.698269 |
2 | 0.635191 | 0.337502 | 0.327470 |
3 | 0.942193 | 0.767003 | 0.852347 |
4 | 0.178692 | 0.494257 | 0.507263 |
You can pass function or method arguments as well:
do('head', df, 10)
|   | A1 | B1 | C1 |
|---|---|---|---|
0 | 0.076077 | 0.990881 | 0.021679 |
1 | 0.914026 | 0.688789 | 0.698269 |
2 | 0.635191 | 0.337502 | 0.327470 |
3 | 0.942193 | 0.767003 | 0.852347 |
4 | 0.178692 | 0.494257 | 0.507263 |
5 | 0.999359 | 0.056832 | 0.254085 |
6 | 0.802454 | 0.160224 | 0.843747 |
7 | 0.602545 | 0.840196 | 0.007152 |
8 | 0.801355 | 0.937513 | 0.052925 |
9 | 0.218896 | 0.757961 | 0.057891 |
Use many to apply a sequence of functions independently to an object:
results = many(['head', 'tail'], df)
results[0]
results[1]
|   | A1 | B1 | C1 |
|---|---|---|---|
0 | 0.076077 | 0.990881 | 0.021679 |
1 | 0.914026 | 0.688789 | 0.698269 |
2 | 0.635191 | 0.337502 | 0.327470 |
3 | 0.942193 | 0.767003 | 0.852347 |
4 | 0.178692 | 0.494257 | 0.507263 |
|   | A1 | B1 | C1 |
|---|---|---|---|
5 | 0.999359 | 0.056832 | 0.254085 |
6 | 0.802454 | 0.160224 | 0.843747 |
7 | 0.602545 | 0.840196 | 0.007152 |
8 | 0.801355 | 0.937513 | 0.052925 |
9 | 0.218896 | 0.757961 | 0.057891 |
Use compose or pipe to apply a sequence of functions in a row:
from utilz import compose, pipe
bottom_head = compose(lambda df: df.head(10), lambda df: df.tail(3))
bottom_head(df)
|   | A1 | B1 | C1 |
|---|---|---|---|
7 | 0.602545 | 0.840196 | 0.007152 |
8 | 0.801355 | 0.937513 | 0.052925 |
9 | 0.218896 | 0.757961 | 0.057891 |
pipe(df,
    lambda df: df.head(10),
    lambda df: df.tail(3)
)
|   | A1 | B1 | C1 |
|---|---|---|---|
7 | 0.602545 | 0.840196 | 0.007152 |
8 | 0.801355 | 0.937513 | 0.052925 |
9 | 0.218896 | 0.757961 | 0.057891 |
Use iffy to apply a function only if a predicate is true. It takes a predicate function, followed by a function to apply (or a value to return) when that predicate is true:
from utilz import iffy
# Apply function
iffy(lambda df: len(df) > 3, lambda df: df.head(3), df)
# Return arbitrary object
iffy(lambda df: len(df) > 3, 'big df', df)
# If check fails just returns the original object
iffy(lambda df: len(df) < 3, lambda df: df.head(3), df)
|   | A1 | B1 | C1 |
|---|---|---|---|
0 | 0.076077 | 0.990881 | 0.021679 |
1 | 0.914026 | 0.688789 | 0.698269 |
2 | 0.635191 | 0.337502 | 0.327470 |
'big df'
|   | A1 | B1 | C1 |
|---|---|---|---|
0 | 0.076077 | 0.990881 | 0.021679 |
1 | 0.914026 | 0.688789 | 0.698269 |
2 | 0.635191 | 0.337502 | 0.327470 |
3 | 0.942193 | 0.767003 | 0.852347 |
4 | 0.178692 | 0.494257 | 0.507263 |
5 | 0.999359 | 0.056832 | 0.254085 |
6 | 0.802454 | 0.160224 | 0.843747 |
7 | 0.602545 | 0.840196 | 0.007152 |
8 | 0.801355 | 0.937513 | 0.052925 |
9 | 0.218896 | 0.757961 | 0.057891 |
Operating on iterables¶
from utilz import map, mapcat
map is just sugar for list(map()):
def myfunc(x):
    return x * 2
map(myfunc, range(10))
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
mapcat will concatenate/flatten results:
def myfunc(x):
    return [x * 2]
map(myfunc, range(10))
mapcat(myfunc, range(10))
[[0], [2], [4], [6], [8], [10], [12], [14], [16], [18]]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
If myfunc is None, mapcat can be used to flatten nested lists (max 2 levels deep):
mapcat(None, [[1,2,3], [4,5,6], [7]])
[1, 2, 3, 4, 5, 6, 7]
If myfunc returns a dataframe, mapcat will try to concatenate the results by row:
from utilz import randdf
def myfunc(f):
    """simulate loading a 2x3 dataframe from file"""
    return randdf(size=(2,3))
mapcat(myfunc, range(4))
|   | A1 | B1 | C1 |
|---|---|---|---|
0 | 0.723497 | 0.397081 | 0.477959 |
1 | 0.981362 | 0.465690 | 0.505523 |
2 | 0.254038 | 0.692296 | 0.589320 |
3 | 0.076432 | 0.229396 | 0.183292 |
4 | 0.317140 | 0.187555 | 0.451125 |
5 | 0.613190 | 0.191327 | 0.634255 |
6 | 0.678660 | 0.456217 | 0.492318 |
7 | 0.217005 | 0.730834 | 0.310409 |
If your myfunc returns an array, mapcat will also try to concatenate the results by default, while preserving the output shape. Because myfunc returns a 1d array, the final result is 2d:
import numpy as np
def myfunc(f):
    """Function that returns 1d array"""
    return np.arange(3)
mapcat(myfunc, range(4))
array([[0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2]])
This is equivalent to passing concat_axis=1:
mapcat(myfunc, range(4), concat_axis=1)
array([[0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2]])
You can instead flatten the array by passing concat_axis=0:
mapcat(myfunc, range(4), concat_axis=0)
array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2])
Or stack it in a 3rd dimension by passing concat_axis=2:
mapcat(myfunc, range(4), concat_axis=2)
array([[0, 0, 0, 0], [1, 1, 1, 1], [2, 2, 2, 2]])
Both map and mapcat support easy parallel looping just by changing the n_jobs argument:
from time import sleep
def myfunc(x):
    """Simulate expensive function"""
    sleep(1)
    return x * 2
map(myfunc, range(10), n_jobs=2)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
You can easily pass the loop index to myfunc by setting enum=True:
# myfunc needs to accept an 'idx' argument
def myfunc(x, idx):
    """Simulate expensive function"""
    sleep(1)
    return x * idx
mapcat(myfunc, range(10), n_jobs=2, enum=True)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Likewise, if your function uses randomization, you can set the random_state to reproduce parallel runs:
# myfunc needs to accept a 'random_state' argument
def myfunc(x, random_state=None):
    """Simulate expensive function"""
    from utilz import check_random_state
    rng = check_random_state(random_state)
    sleep(1)
    return x * rng.random()
map(myfunc, range(10), n_jobs=2, random_state=1)
[0.0, 0.7026449924443589, 1.3671148797485828, 2.491197036621863, 2.8137601674196255, 4.388425659603134, 5.805394778570049, 4.548025497425891, 2.529163623883595, 4.026815575448365]
Now this second run reproduces the same values:
map(myfunc, range(10), n_jobs=2, random_state=1)
[0.0, 0.7026449924443589, 1.3671148797485828, 2.491197036621863, 2.8137601674196255, 4.388425659603134, 5.805394778570049, 4.548025497425891, 2.529163623883595, 4.026815575448365]
2. Decorators¶
utilz decorators can be added to any function to provide some convenient information or checks before or after execution. Currently these include (a short usage sketch follows the list):

- expensive: cache a function result to disk and load it on reruns
- log: print shape, size, len of an arg before and after function execution
- maybe: run a function only if a file doesn't exist or a dir isn't empty
- show: print the result of a function in addition to returning it
- timeit: print how long a function took to evaluate
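As a quick illustration, here's a minimal sketch of stacking a couple of these on a function. It assumes show and timeit can be imported directly from utilz and applied as plain decorators; check the API reference for the exact arguments each decorator accepts.

from utilz import show, timeit

# Assumed usage (sketch): utilz decorators stack like ordinary Python decorators
@timeit  # print how long the call took
@show    # print the result in addition to returning it
def double(x):
    return x * 2

doubled = double(21)  # prints timing info and the result, then returns 42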
3. Dataframe tools¶
Utilz makes working with dataframes a bit easier by offering extra methods without altering core pandas functionality. You don't need to import anything to use these methods. They're automatically available after importing anything from utilz. Currently these include (a rough sketch of the assert helpers follows the list):

- norm_by_group: center, scale, or z-score separately by group
- assert_same_nunique: make sure groups have the same number of unique values in a particular column
- assert_balanced_groups: make sure groups have the same size
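The exact signatures of the assert helpers aren't shown in this overview, so the following is only a rough sketch under the assumption that, like norm_by_group in the example below, they are called as dataframe methods that take a grouping column (plus a target column for assert_same_nunique):

df = randdf()
df['group'] = ['A'] * 5 + ['B'] * 5

# Hypothetical call: raise if the groups defined by 'group' differ in size
df.assert_balanced_groups('group')

# Hypothetical call: raise if groups differ in the number of unique 'A1' values
df.assert_same_nunique('group', 'A1')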
Example usage¶
# No need to import anything!
# Add a group col
df = randdf()
df['group'] = ['A'] * 5 + ['B'] * 5
# This is a new method!
new_df = df.norm_by_group('group', 'A1')
new_df
|   | A1 | B1 | C1 | group | A1_normed_by_group |
|---|---|---|---|---|---|
| 0 | 0.045894 | 0.093716 | 0.932221 | A | … |
| 1 | 0.738293 | 0.249943 | 0.518687 | A | … |
| 2 | 0.357182 | 0.454217 | 0.575472 | A | … |
| 3 | 0.289010 | 0.453426 | 0.211871 | A | … |
| 4 | 0.328628 | 0.396641 | 0.041587 | A | … |
| 5 | 0.481833 | 0.394005 | 0.503150 | B | … |
| 6 | 0.430750 | 0.769627 | 0.838887 | B | … |
| 7 | 0.882731 | 0.122181 | 0.393370 | B | … |
| 8 | 0.622302 | 0.943480 | 0.715790 | B | … |
| 9 | 0.419627 | 0.882003 | 0.629938 | B | … |
You can use the scale and center args to control whether mean-centering and dividing by standard deviation are done (both default to True). This will also change the generated column name appropriately:
new_df.norm_by_group('group', 'A1', scale=False)
|   | A1 | B1 | C1 | group | A1_normed_by_group | A1_centered_by_group |
|---|---|---|---|---|---|---|
| 0 | 0.045894 | 0.093716 | 0.932221 | A | … | … |
| 1 | 0.738293 | 0.249943 | 0.518687 | A | … | … |
| 2 | 0.357182 | 0.454217 | 0.575472 | A | … | … |
| 3 | 0.289010 | 0.453426 | 0.211871 | A | … | … |
| 4 | 0.328628 | 0.396641 | 0.041587 | A | … | … |
| 5 | 0.481833 | 0.394005 | 0.503150 | B | … | … |
| 6 | 0.430750 | 0.769627 | 0.838887 | B | … | … |
| 7 | 0.882731 | 0.122181 | 0.393370 | B | … | … |
| 8 | 0.622302 | 0.943480 | 0.715790 | B | … | … |
| 9 | 0.419627 | 0.882003 | 0.629938 | B | … | … |