Maps (operations on sequences)

utilz.maps

The maps module generalizes several functions from utilz.ops so that they operate on each element of a sequence. Here are the parallels:

| map function(s) | op function(s) | description |
| --- | --- | --- |
| map | do | apply one function |
| mapcompose | pipe/do(compose()) | apply multiple functions in sequence |
| mapmany | many | apply multiple functions in parallel |
| mapif | iffy | apply one function if a predicate function is true, otherwise no-op |
| mapacross | None | apply multiple functions to multiple inputs in pairs |
| mapcat | None | apply one multi-output function and flatten the results |
| mapwith | None | map a two-argument function to an iterable and a fixed arg or two iterables |

All members of the map family expect an iterable as their last argument; each element of that iterable is passed to the mapped function(s) as the first argument. Except for mapcat, all map* functions return a sequence the same length as the input they received.

check_random_state(seed=None)

Turn seed into a np.random.RandomState instance. Note: credit for this code goes entirely to sklearn.utils.check_random_state. Using the source here simply avoids an unnecessary dependency.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| seed | None, int, np.RandomState | if seed is None, return the RandomState singleton used by np.random. If seed is an int, return a new RandomState instance seeded with seed. If seed is already a RandomState instance, return it. Otherwise raise ValueError. | None |

Source code in utilz/maps.py
def check_random_state(seed=None):
    """Turn seed into a np.random.RandomState instance. Note: credit for this code goes entirely to `sklearn.utils.check_random_state`. Using the source here simply avoids an unnecessary dependency.

    Args:
        seed (None, int, np.RandomState): if seed is None, return the RandomState singleton used by np.random. If seed is an int, return a new RandomState instance seeded with seed. If seed is already a RandomState instance, return it. Otherwise raise ValueError.
    """

    import numbers

    if seed is None or seed is np.random:
        return np.random.mtrand._rand
    if isinstance(seed, (numbers.Integral, np.integer)):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError(
        "%r cannot be used to seed a numpy.random.RandomState" " instance" % seed
    )
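
To make the three accepted seed types concrete, here is a minimal sketch that assumes NumPy is installed. `as_rng` is a hypothetical local stand-in that mirrors the branches of `check_random_state` so the snippet runs on its own, and `noisy` is an illustrative function, not part of utilz.

```python
import numbers

import numpy as np


def as_rng(seed=None):
    """Hypothetical stand-in mirroring the branches of check_random_state."""
    if seed is None or seed is np.random:
        return np.random.mtrand._rand  # global singleton used by np.random
    if isinstance(seed, (numbers.Integral, np.integer)):
        return np.random.RandomState(seed)  # fresh generator seeded with the int
    if isinstance(seed, np.random.RandomState):
        return seed  # an existing generator is passed through untouched
    raise ValueError(f"{seed!r} cannot be used to seed a RandomState instance")


def noisy(x, random_state=None):
    """Illustrative function that draws from the generator, not from np.random."""
    rng = as_rng(random_state)
    return x + rng.rand()


# The same integer seed reproduces the same draw
same_seed_matches = noisy(1, random_state=0) == noisy(1, random_state=0)

# An existing RandomState instance is returned as-is
existing = np.random.RandomState(7)
passthrough = as_rng(existing) is existing

# Anything else is rejected
try:
    as_rng("not a seed")
    rejected = False
except ValueError:
    rejected = True
```

Drawing from the returned generator, rather than calling `np.random` directly, is what lets a caller control reproducibility by passing a seed.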

filter(how, iterme, invert=False, substr_match=True, assert_notempty=True)

Filter an iterable and concatenate the output to a list instead of a generator like the standard filter in python. By default always returns the matching elements from iterme. This can be inverted using invert=True or split using invert='split' which will return matches, nomatches. Filtering can be done by passing a function, a single str/int/float, or an iterable of str/int/float. Filtering by an iterable checks if any of the values in the iterable exist in each item of iterme.

Parameters:

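| Name | Type | Description | Default |
| --- | --- | --- | --- |
| how | Union[Callable, Iterable, str, int, float] | if a function is passed it must return True or False; otherwise each element in iterme is compared to the value passed for how. String comparisons check whether how is in an element of iterme, while float/integer comparisons check for value equality. If an iterable is passed, an element matches if any of the iterable's values match it | required |
| iterme | Iterable | iterable to filter | required |
| invert | bool/str, optional | if True, drops items where how resolves to True rather than keeping them. If passed the string 'split', returns both matching and inverted results | False |
| substr_match | bool | if True, string values of how match when they are a substring of str(element); if False, they must equal the element | True |
| assert_notempty | bool | raise an error if the returned output is empty | True |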

Returns:

| Type | Description |
| --- | --- |
| list | filtered version of iterme |

Source code in utilz/maps.py
@curry
def filter(
    how: Union[Callable, Iterable, str, int, float],
    iterme: Iterable,
    invert: Union[str, bool] = False,
    substr_match: bool = True,
    assert_notempty: bool = True,
):
    """
    Filter an iterable and concatenate the output to a list instead of a generator like
    the standard `filter` in python. By default always returns the *matching* elements
    from iterme. This can be inverted using invert=True or split using invert='split'
    which will return matches, nomatches. Filtering can be done by passing a function, a
    single `str/int/float`, or an iterable of `str/int/float`. Filtering by an iterable
    checks if `any` of the values in the iterable exist in each item of `iterme`.

    Args:
        how (Union[Callable, Iterable, str, int, float]): if a function is passed it
            must return `True` or `False`, otherwise will compare each element in
            `iterme` to the value passed for `how`. String comparisons check if `how`
            is `in` an element of `iterme` while float/integer comparisons check for
            value equality. If an iterable is passed, filtering is performed for `any`
            of the elements in the iterable
        iterme (Iterable): iterable to filter
        invert (bool/str, optional): if `True`, drops items where `how` resolves to
            `True` rather than keeping them. If passed the string `'split'` will
            return both matching and inverted results
        substr_match (bool, optional): if `True`, string values of `how` match when
            they are a substring of `str(elem)`; if `False` they must equal the
            element; Default True
        assert_notempty (bool, optional): raise an error if the returned output is
            empty; Default True

    Returns:
        list: filtered version of `iterme`
    """

    if isinstance(how, Callable):
        func = how
    elif isinstance(how, str):
        if substr_match:
            func = lambda elem: how in str(elem)
        else:
            func = lambda elem: how == elem
    elif isinstance(how, (float, int)):
        func = lambda elem: how == elem
    elif isinstance(how, Iterable):
        if isinstance(how[0], str):
            if substr_match:
                func = lambda elem: any(map(lambda h: h in str(elem), how))
            else:
                func = lambda elem: any(map(lambda h: h == elem, how))
        elif isinstance(how[0], (float, int)):
            func = lambda elem: any(map(lambda h: h == elem, how))
        else:
            raise TypeError(
                "If an iterable is passed it must contain strings, ints or floats"
            )
    else:
        raise TypeError(
            "Must pass a function, iterable, string, int, or float to filter by"
        )

    if invert == "split":
        inverts = list(filterfalse(func, iterme))
        matches = list(_filter(func, iterme))

        if assert_notempty and (len(inverts) == 0 or len(matches) == 0):
            raise AssertionError("Filtered data is empty!")
        return matches, inverts

    elif isinstance(invert, bool):
        filtfunc = filterfalse if invert is True else _filter
        out = list(filtfunc(func, iterme))
        if assert_notempty and len(out) == 0:
            raise AssertionError("Filtered data is empty!")
        return out
    else:
        raise TypeError("invert must be True, False, or 'split'")
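
The three `invert` modes are easiest to see side by side. The sketch below is a simplified, hypothetical stand-in (`filter_list`) written in plain Python; it covers only the function, string and number cases of `how` and omits the iterable case, `substr_match` and `assert_notempty`. The file names are made up for illustration.

```python
from itertools import filterfalse


def filter_list(how, items, invert=False):
    """Hypothetical, simplified stand-in for the filtering logic above."""
    if callable(how):
        pred = how
    elif isinstance(how, str):
        pred = lambda elem: how in str(elem)  # substring match
    else:
        pred = lambda elem: how == elem  # value equality

    if invert == "split":
        # Return (matches, non-matches)
        return list(filter(pred, items)), list(filterfalse(pred, items))
    keep = filterfalse if invert else filter
    return list(keep(pred, items))


files = ["sub01_run1.csv", "sub01_run2.csv", "sub02_run1.csv"]

matches = filter_list("sub01", files)               # keep items containing "sub01"
dropped = filter_list("sub01", files, invert=True)  # keep everything else
run1, other = filter_list("run1", files, invert="split")
evens = filter_list(lambda n: n % 2 == 0, [1, 2, 3, 4])
```

Unlike the built-in `filter`, every mode here returns a list (or a pair of lists) rather than a lazy iterator.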

map(func, iterme, **kwargs)

Super-power your for loops with a progress-bar and optional reproducible parallelization!

maps func to iterme. Includes a progress-bar powered by tqdm.

Supports parallelization with joblib.Parallel multi-processing by setting n_jobs > 1. Progress-bar accurately tracks parallel jobs!

iterme can be a list of elements, list of DataFrames, list of arrays, or list of lists. List of lists up to 2 deep will be flattened to single list when func = None

See the examples below for interesting use cases beyond standard looping!

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| func | callable | function to map | required |
| iterme | iterable | an iterable for which each element will be passed to func | required |
| enum/enumerate | bool | whether the value of the current iteration should be passed to func as the special kwarg idx. Make sure func can handle a kwarg named idx | False |
| random_state | bool/int | whether a randomly initialized seed should be passed to func as the special kwarg random_state. The function should pass this seed to the utilz.check_random_state helper to generate a random number generator for all computations rather than relying on np.random | False |
| n_jobs | int | number of cpus/threads; 1 means no parallelization | 1 |
| backend | str | only applies if n_jobs > 1. See joblib.Parallel for options; None uses loky | None |
| pbar | bool | whether to use tqdm to show a progress bar | False |
| verbose | int | joblib.Parallel verbosity | 0 |
| **kwargs | dict | optional keyword arguments to pass to func | {} |

Examples:

>>> # Just like map
>>> out = map(lambda x: x * 2, [1, 2, 3, 4])
>>> # Concatenating nested lists
>>> data = [[1, 2], [3, 4]]
>>> out = mapcat(None, data)
>>> # Load multiple files into a single dataframe
>>> out = mapcat(pd.read_csv, ["file1.txt", "file2.txt", "file3.txt"])
>>> # Parallelization with randomness
>>> def f_random(x, random_state=None):
>>>     random_state = check_random_state(random_state)
>>>     sleep(0.5)
>>>     # Use the random state's number generator rather than np.random
>>>     return x + random_state.rand()
>>>
>>> # Now set a random_state in mapcat to reproduce the parallel runs
>>> # It doesn't pass the value, but rather generates a reproducible list
>>> # of seeds that are passed to each function execution
>>> out = mapcat(f_random, [1, 1, 1, 1, 1], n_jobs=2, random_state=1)
Source code in utilz/maps.py
@curry
def map(
    func: Union[Callable, None],
    iterme: Iterable,
    **kwargs,
):
    """
    Super-power your `for` loops with a progress-bar and optional *reproducible*
    parallelization!

    **map**s `func` to `iterme`. Includes a progress-bar powered by `tqdm`.

    Supports parallelization with `joblib.Parallel` multi-processing by setting `n_jobs > 1`. Progress-bar *accurately* tracks parallel jobs!

    `iterme` can be a list of elements, list of DataFrames, list of arrays, or list of
    lists. List of lists up to 2 deep will be flattened to single list when `func = None`

    See the examples below for interesting use cases beyond standard looping!

    Args:
        func (callable): function to map
        iterme (iterable): an iterable for which each element will be passed to func
        enum/enumerate (bool, optional): whether the value of the current iteration
            should be passed to `func` as the special kwarg `idx`. Make sure `func` can
            handle a kwarg named `idx`. Default False
        random_state (bool/int, optional): whether a randomly initialized seed should be
            passed to `func` as the special kwarg `random_state`. The function should pass
            this seed to the `utilz.check_random_state` helper to generate a random number
            generator for all computations rather than relying on `np.random`. Default False
        n_jobs (int, optional): number of cpus/threads; Default 1 (no parallel)
        backend (str, optional): Only applies if `n_jobs > 1`. See `joblib.Parallel` for
            options; Default None which uses `loky`
        pbar (bool, optional): whether to use tqdm to show a progress bar; Default
            False
        verbose (int): `joblib.Parallel` verbosity. Default 0
        **kwargs (dict, optional): optional keyword arguments to pass to func

    Examples:
        >>> # Just like map
        >>> out = map(lambda x: x * 2, [1, 2, 3, 4])

        >>> # Concatenating nested lists
        >>> data = [[1, 2], [3, 4]]
        >>> out = mapcat(None, data)

        >>> # Load multiple files into a single dataframe
        >>> out = mapcat(pd.read_csv, ["file1.txt", "file2.txt", "file3.txt"])

        >>> # Parallelization with randomness
        >>> def f_random(x, random_state=None):
        >>>     random_state = check_random_state(random_state)
        >>>     sleep(0.5)
        >>>     # Use the random state's number generator rather than np.random
        >>>     return x + random_state.rand()
        >>>
        >>> # Now set a random_state in mapcat to reproduce the parallel runs
        >>> # It doesn't pass the value, but rather generates a reproducible list
        >>> # of seeds that are passed to each function execution
        >>> out = mapcat(f_random, [1, 1, 1, 1, 1], n_jobs=2, random_state=1)

    """

    enum = kwargs.pop("enum", False) or kwargs.pop("enumerate", False)
    random_state = kwargs.pop("random_state", False)
    n_jobs = kwargs.pop("n_jobs", 1)
    backend = kwargs.pop("backend", None)
    pbar = kwargs.pop("pbar", False)
    verbose = kwargs.pop("verbose", 0)

    if func is None:
        # No-op if no function
        op = iterme
    else:
        if isinstance(func, (str, dict, int, float, tuple)):
            func_args = []
        else:
            try:
                func_args = list(signature(func).parameters.keys())
                inspectable = True
            except ValueError:
                # Some funcs, like numpy C funcs, are not inspectable, so we have to
                # skip these checks
                func_args = []
                inspectable = False

            # Validate outside the try block so that these errors are not swallowed
            # by the except clause above
            if inspectable and enum and "idx" not in func_args:
                raise ValueError(
                    "Function must accept a keyword argument named 'idx' that accepts an integer if enum is True"
                )
            if (
                inspectable
                and random_state is not False
                and "random_state" not in func_args
            ):
                raise ValueError(
                    "Function must have a keyword argument called 'random_state' if random_state is not False"
                )

        if random_state is not False:
            # User can pass True instead of a number for non-reproducible
            # parallelization
            random_state = None if random_state is True else random_state

            # Generate a list of random ints, that themselves are seeded by random_state
            # and passed to func
            seeds = check_random_state(random_state).randint(MAX_INT, size=len(iterme))
        else:
            seeds = None

        # Loop; runs in parallel if n_jobs < 1 or n_jobs > 1
        op = _pmap(func, iterme, enum, seeds, n_jobs, backend, pbar, verbose, kwargs)

    return op
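
The `enum` option and the signature check can be illustrated without joblib or tqdm. The following is a serial sketch under stated assumptions: `serial_map`, `accepts_kwarg` and `label` are hypothetical names, and because the `_pmap` helper is not shown in the source, passing the index as `idx=` is inferred from the docstring rather than copied from the implementation.

```python
from inspect import signature


def accepts_kwarg(func, name):
    """True/False if func is inspectable, None if its signature is unavailable."""
    try:
        return name in signature(func).parameters
    except (ValueError, TypeError):
        return None


def serial_map(func, items, enum=False, **kwargs):
    """Hypothetical serial stand-in for the mapping loop."""
    if enum and accepts_kwarg(func, "idx") is False:
        raise ValueError("func must accept a keyword argument named 'idx'")
    out = []
    for idx, item in enumerate(items):
        extra = {"idx": idx} if enum else {}
        # Remaining keyword arguments are forwarded to func unchanged
        out.append(func(item, **extra, **kwargs))
    return out


def label(x, idx=None, prefix=""):
    return f"{prefix}{idx}:{x}"


labelled = serial_map(label, ["a", "b"], enum=True, prefix="#")
doubled = serial_map(lambda x: x * 2, [1, 2, 3])

# A function without an 'idx' parameter is rejected when enum=True
try:
    serial_map(lambda x: x, [1], enum=True)
    raised = False
except ValueError:
    raised = True
```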

mapacross(*args)

Map multiple functions to an iterable in a matched-pair fashion. The number of functions needs to equal the length of the iterable.

Source code in utilz/maps.py
@curry
def mapacross(*args):
    """Map multiple functions to an iterable in a matched-pair fashion. The
    number of functions needs to equal the length of the iterable."""

    def call(data):
        if not isinstance(data, (list, tuple)):
            raise TypeError(
                f"Expected a list/tuple of input, but received a single {type(data)}. If you want to apply a function to a single input either use a lambda or do()"
            )
        if len(data) != len(args):
            raise ValueError(
                f"The number of functions passed must equal the length of the previous output, but {len(data)} data and {len(args)} functions don't match. To run the same set of functions over the previous inputs see separate()"
            )
        return [f(a) for f, a in zip(args, data)]

    return call
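
The matched-pair behavior can be shown with a plain-Python sketch. `across` is a hypothetical stand-in that leaves out the `@curry` decorator and the type check; it applies the i-th function to the i-th input.

```python
def across(*funcs):
    """Hypothetical stand-in: pair each function with the input at the same position."""

    def call(data):
        if len(data) != len(funcs):
            raise ValueError("number of functions must equal number of inputs")
        return [f(a) for f, a in zip(funcs, data)]

    return call


# Three functions, three inputs: upper-case the first, measure the second, sum the third
result = across(str.upper, len, sum)(["abc", [1, 2, 3], [4, 5]])
```

Here `result` is `['ABC', 3, 9]`; passing two inputs to three functions raises `ValueError`.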

mapcat(func, iterme, **kwargs)

Call map and concatenate results after. Particularly useful to ensure results are numpy arrays

Source code in utilz/maps.py
@curry
def mapcat(func: Union[Callable, None], iterme: Iterable, **kwargs):
    """Call map and concatenate results after.
    Particularly useful to ensure results are numpy arrays"""

    concat_axis = kwargs.pop("concat_axis", None)
    ignore_index = kwargs.pop("ignore_index", True)
    out = map(func, iterme, **kwargs)

    return _concat(out, iterme, concat_axis, ignore_index)
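
As a rough illustration of "map, then concatenate" for plain lists, the sketch below flattens one level of nesting with `itertools.chain`. `map_then_flatten` is a hypothetical name, and the `_concat` helper is not shown in the source, so its handling of arrays, DataFrames, `concat_axis` and `ignore_index` is not reproduced here.

```python
from itertools import chain


def map_then_flatten(func, items):
    """Hypothetical stand-in: apply func (if any), then flatten one level."""
    results = items if func is None else [func(item) for item in items]
    return list(chain.from_iterable(results))


# With no function, nested lists are simply concatenated
flat = map_then_flatten(None, [[1, 2], [3, 4]])

# With a multi-output function, the per-element outputs are joined into one list
neighbours = map_then_flatten(lambda x: [x - 1, x + 1], [10, 20])
```

This is why mapcat is the one member of the family whose output length can differ from its input length.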

mapcompose(*args, **kwargs)

Compose multiple functions together and map them over a sequence, i.e. a mini-pipe per element. Returns a list the same length as the input iterable containing the final function evaluation for each element.

Source code in utilz/maps.py
@curry
def mapcompose(*args, **kwargs):
    """Compose multiple functions together and map them over a sequence, i.e. a
    mini-pipe per element. Returns a list the same length as the input iterable
    containing the final function evaluation for each element."""

    def call(data):
        if not isinstance(data, (list, tuple)):
            raise TypeError(
                f"All map* funcs expect a list/tuple of input, but received a single {type(data)}."
            )
        if len(args) <= 1:
            raise ValueError(
                f"mapcompose applies *multiple* function calls in sequence but only received {len(args)} function. Use mapcat() to apply a single function."
            )

        composed = compose(*args)
        return map(composed, data, **kwargs)

    return call
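
A "mini-pipe per element" can be sketched with `functools.reduce`. The example assumes the functions are applied left to right, as in a pipe; the `compose` helper used above is not shown in the source, so that ordering is an assumption, and `pipe_each` is a hypothetical name.

```python
from functools import reduce


def pipe_each(*funcs):
    """Hypothetical stand-in: run every element through funcs, left to right."""

    def call(data):
        return [reduce(lambda acc, f: f(acc), funcs, item) for item in data]

    return call


# Strip whitespace first, then title-case each name
names = pipe_each(str.strip, str.title)(["  ada lovelace ", "alan TURING"])

# Order matters: add one, then multiply by ten
numbers = pipe_each(lambda x: x + 1, lambda x: x * 10)([1, 2])
```

The output has one entry per input element, holding only the result of the last function in the chain.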

mapif(func, predicate_func, iterme, **kwargs)

Apply func to each element of iterme if predicate_func is True for that element; otherwise return the element unchanged.

Source code in utilz/maps.py
@curry
def mapif(func, predicate_func, iterme, **kwargs):
    """Apply func to each element of iterme if predicate_func is True for that element
    otherwise return the element"""

    return map(iffy(predicate_func, func), iterme, **kwargs)
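
In plain Python the same idea is a conditional expression inside a list comprehension. `map_if` below is a hypothetical stand-in that keeps the `(func, predicate, iterable)` argument order of mapif but none of its keyword options.

```python
def map_if(func, predicate, items):
    """Hypothetical stand-in: transform only the elements that satisfy predicate."""
    return [func(x) if predicate(x) else x for x in items]


# Scale negative values by 100 and leave everything else untouched
result = map_if(lambda x: x * 100, lambda x: x < 0, [3, -1, 0, -2])
```

Here `result` is `[3, -100, 0, -200]`, the same length as the input.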

mapmany(*args, **kwargs)

Map multiple functions separately to each element in an iterable. Returns a list of nested lists containing the output of each function evaluation on each element in iterme

Source code in utilz/maps.py
@curry
def mapmany(*args, **kwargs):
    """Map multiple functions separately to each element in an iterable. Returns a list
    of nested lists containing the output of each function evaluation on each element in
    iterme"""

    def call(data):
        if not isinstance(data, (list, tuple)):
            raise TypeError(
                f"All map* funcs expect a list/tuple of input, but received a single {type(data)}."
            )
        if len(args) <= 1:
            raise ValueError(
                f"mapmany applies *multiple* function calls separately but only received {len(args)} function. Use mapcat() to apply a single function."
            )

        together = _many(*args)
        return map(together, data, **kwargs)

    return call
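
The nested output shape is the main thing to keep in mind. A minimal plain-Python sketch, with `many_each` as a hypothetical stand-in and assuming the inner lists follow the order in which the functions were passed:

```python
def many_each(*funcs):
    """Hypothetical stand-in: apply every function to every element."""

    def call(data):
        return [[f(x) for f in funcs] for x in data]

    return call


# One inner list per input element, one entry per function
stats = many_each(min, max, sum)([[1, 2, 3], [10, 20]])
```

Here `stats` is `[[1, 3, 6], [10, 20, 30]]`: two inputs, three results each.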

mapwith(func, iterwith, iterme, **kwargs)

Just like map but accepts a second arg that can also be an iterator. In a pipe iterme is always the last input to mapwith, but the first input to func. If copy=True is passed and iterwith is not an iterator, an iterator is built with guaranteed copies of iterwith.

Source code in utilz/maps.py
@curry
def mapwith(func, iterwith, iterme, **kwargs):
    """Just like map but accepts a second arg that can also be an iterator. In a pipe
    iterme is always the *last* input to mapwith, but the *first* input to func. If
    `copy=True` is passed and iterwith is not an iterator, an iterator is built with
    guaranteed copies of iterwith."""

    copy = kwargs.pop("copy", False)

    if not isinstance(iterwith, (list, tuple)):
        if copy:
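            hascopy = getattr(iterwith, "copy", None)
            # Build one independent copy per element; multiplying a one-item list
            # would repeat a reference to the same copy
            if callable(hascopy):
                iterwith = [iterwith.copy() for _ in range(len(iterme))]
            else:
                iterwith = [deepcopy(iterwith) for _ in range(len(iterme))]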
        else:
            iterwith = [iterwith] * len(iterme)

    if len(iterme) != len(iterwith):
        raise TypeError(
            f"mapwith received an iterable but its length ({len(iterwith)}) doesn't match the length of the input iterable ({len(iterme)})"
        )

    return map(lambda tup: func(*tup), zip(iterme, iterwith), **kwargs)
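
The two calling styles, a fixed second argument or a paired second iterable, can be sketched as follows. `map_with` is a hypothetical plain-Python stand-in that ignores the `copy` option and other keyword arguments; as in the source, each element of the mapped iterable is the first argument to `func`.

```python
def map_with(func, iterwith, items):
    """Hypothetical stand-in: call func(item, other) for each pair."""
    if not isinstance(iterwith, (list, tuple)):
        iterwith = [iterwith] * len(items)  # broadcast the fixed argument
    if len(iterwith) != len(items):
        raise TypeError("iterwith and items must have the same length")
    return [func(item, other) for item, other in zip(items, iterwith)]


# Fixed second argument: every element is multiplied by 10
scaled = map_with(lambda x, k: x * k, 10, [1, 2, 3])

# Paired second argument: elements are combined position by position
paired = map_with(lambda x, y: x + y, [100, 200, 300], [1, 2, 3])
```

Here `scaled` is `[10, 20, 30]` and `paired` is `[101, 202, 303]`.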