Maps (operations on sequences)

`utilz.maps`

The maps module is designed to work with sequences. It generalizes several functions in utilz.ops so that they operate on each element of a sequence. Here are the parallels:

map function(s)	op function(s)	description
`map`	`do`	apply one function
`mapcompose`	`pipe`/`do(compose())`	apply multiple functions in sequence
`mapmany`	`many`	apply multiple functions in parallel
`mapif`	`iffy`	apply one function if a predicate function returns True, otherwise no-op
`mapacross`	`None`	apply multiple functions to multiple inputs in pairs
`mapcat`	`None`	apply one multi-output function and flatten the results
`mapwith`	`None`	map a two-argument function over an iterable paired with either a fixed argument or a second iterable

All members of the map family expect an iterable as their last argument; each element of that iterable is passed to the function(s) as the first argument. Except for mapcat, every map* function returns a sequence of the same length as the input it received.
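
The parallels in the table can be sketched in plain Python. This is only an illustration of the described semantics, not the library's implementation: `compose_left`, `double`, `inc`, and `is_even` are hypothetical helpers, left-to-right composition is an assumption based on the "mini-pipe" description, and the inner container type returned by `mapmany` is assumed to be a list.

```python
from functools import reduce


def compose_left(*funcs):
    """Hypothetical helper: chain funcs left to right, like a tiny pipe."""
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)


def double(x):
    return x * 2


def inc(x):
    return x + 1


def is_even(x):
    return x % 2 == 0


data = [1, 2, 3, 4]

# map: one function, one result per element
mapped = [double(x) for x in data]  # [2, 4, 6, 8]

# mapcompose: several functions chained per element (double, then inc)
composed = [compose_left(double, inc)(x) for x in data]  # [3, 5, 7, 9]

# mapmany: every function applied separately to every element
many = [[double(x), inc(x)] for x in data]  # [[2, 2], [4, 3], [6, 4], [8, 5]]

# mapif: apply the function only where the predicate holds, else pass through
conditional = [double(x) if is_even(x) else x for x in data]  # [1, 4, 3, 8]

# mapacross: the i-th function is applied to the i-th element
across = [f(x) for f, x in zip([double, inc, double, inc], data)]  # [2, 3, 6, 5]
```

Every result above has the same length as `data`, which matches the length guarantee stated for all map* functions other than mapcat.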

`check_random_state(seed=None)`

Turn seed into a np.random.RandomState instance. Note: credit for this code goes entirely to sklearn.utils.check_random_state. Using the source here simply avoids an unnecessary dependency.

Parameters:

Name	Type	Description	Default
`seed`	`None, int, np.RandomState`	if seed is None, return the RandomState singleton used by np.random. If seed is an int, return a new RandomState instance seeded with seed. If seed is already a RandomState instance, return it. Otherwise raise ValueError.	`None`

Source code in utilz/maps.py

def check_random_state(seed=None):
    """Turn seed into a np.random.RandomState instance. Note: credit for this code goes entirely to `sklearn.utils.check_random_state`. Using the source here simply avoids an unnecessary dependency.

    Args:
        seed (None, int, np.RandomState): if seed is None, return the RandomState singleton used by np.random. If seed is an int, return a new RandomState instance seeded with seed. If seed is already a RandomState instance, return it. Otherwise raise ValueError.
    """

    import numbers

    if seed is None or seed is np.random:
        return np.random.mtrand._rand
    if isinstance(seed, (numbers.Integral, np.integer)):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError(
        "%r cannot be used to seed a numpy.random.RandomState" " instance" % seed
    )

`filter(how, iterme, invert=False, substr_match=True, assert_notempty=True)`

Filter an iterable and concatenate the output to a list instead of a generator like the standard filter in python. By default always returns the matching elements from iterme. This can be inverted using invert=True or split using invert='split' which will return matches, nomatches. Filtering can be done by passing a function, a single str/int/float, or an iterable of str/int/float. Filtering by an iterable checks if any of the values in the iterable exist in each item of iterme.

Parameters:

Name	Type	Description	Default
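`how`	`Union[Callable, Iterable, str, int, float]`	if a function is passed it must return `True` or `False`; otherwise each element in `iterme` is compared to the value passed for `how`. String comparisons check whether `how` is `in` an element of `iterme`, while float/integer comparisons check for value equality. If an iterable is passed, an element matches when `any` of the iterable's values match	required
`iterme`	`Iterable`	iterable to filter	required
`invert`	`bool/str, optional`	if `True`, drops items where `how` resolves to `True` rather than keeping them. If passed the string `'split'`, returns both matching and inverted results	`False`
`substr_match`	`bool`	if `True`, a string `how` matches any element whose `str()` contains it; if `False`, the element must equal `how`	`True`
`assert_notempty`	`bool`	raise an error if the returned output is empty	`True`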

Returns:

Name	Type	Description
`list`		filtered version of `iterme`

Source code in utilz/maps.py

@curry
def filter(
    how: Union[Callable, Iterable, str, int, float],
    iterme: Iterable,
    invert: Union[str, bool] = False,
    substr_match: bool = True,
    assert_notempty: bool = True,
):
    """
    Filter an iterable and concatenate the output to a list instead of a generator like
    the standard `filter` in python. By default always returns the *matching* elements
    from iterme. This can be inverted using invert=True or split using invert='split'
    which will return matches, nomatches. Filtering can be done by passing a function, a
    single `str/int/float`, or an iterable of `str/int/float`. Filtering by an iterable
    checks if `any` of the values in the iterable exist in each item of `iterme`.

    Args:
        how (Union[Callable, Iterable, str, int, float]): if a function is passed it
            must return `True` or `False`, otherwise will compare each element in
            `iterme` to the value passed for `how`. String comparisons check if `how`
            is `in` an element of `iterme` while float/integer comparisons check for
            value equality. If an iterable is passed, filtering is performed for `any`
            of the elements in the iterable
        iterme (Iterable): iterable to filter
        invert (bool/str, optional): if `True`, drops items where `how` resolves to
            `True` rather than keeping them. If passed the string `'split'` will return
            both matching and inverted results
        substr_match (bool, optional): if `True`, a string `how` matches any element
            whose `str()` contains it; if `False`, the element must equal `how`;
            Default True
        assert_notempty (bool, optional): raise an error if the returned output is
            empty; Default True

    Returns:
        list: filtered version of `iterme`
    """

    if isinstance(how, Callable):
        func = how
    elif isinstance(how, str):
        if substr_match:
            func = lambda elem: how in str(elem)
        else:
            func = lambda elem: how == elem
    elif isinstance(how, (float, int)):
        func = lambda elem: how == elem
    elif isinstance(how, Iterable):
        if isinstance(how[0], str):
            if substr_match:
                func = lambda elem: any(map(lambda h: h in str(elem), how))
            else:
                func = lambda elem: any(map(lambda h: h == elem, how))
        elif isinstance(how[0], (float, int)):
            func = lambda elem: any(map(lambda h: h == elem, how))
        else:
            raise TypeError(
                "If an iterable is passed it must contain strings, ints or floats"
            )
    else:
        raise TypeError(
            "Must pass a function, iterable, string, int, or float to filter by"
        )

    if invert == "split":
        inverts = list(filterfalse(func, iterme))
        matches = list(_filter(func, iterme))

        if assert_notempty and (len(inverts) == 0 or len(matches) == 0):
            raise AssertionError("Filtered data is empty!")
        return matches, inverts

    elif isinstance(invert, bool):
        filtfunc = filterfalse if invert is True else _filter
        out = list(filtfunc(func, iterme))
        if assert_notempty and len(out) == 0:
            raise AssertionError("Filtered data is empty!")
        return out
    else:
        raise TypeError("invert must be True, False, or 'split'")
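
The three `invert` modes described above can be illustrated with a small standard-library sketch. `sketch_filter` is a hypothetical stand-in that covers only the substring-matching string branch; it is not the library function and omits the callable, numeric, iterable, and `assert_notempty` handling.

```python
from itertools import filterfalse


def sketch_filter(how, items, invert=False):
    """Hypothetical stand-in: substring match on str(elem), with invert modes."""

    def pred(elem):
        return how in str(elem)

    matches = list(filter(pred, items))
    misses = list(filterfalse(pred, items))
    # Check for 'split' first, because the string itself is truthy
    if invert == "split":
        return matches, misses
    return misses if invert else matches


files = ["sub01_run1.csv", "sub01_run2.csv", "sub02_run1.csv"]

keep = sketch_filter("sub01", files)  # elements containing "sub01"
drop = sketch_filter("sub01", files, invert=True)  # everything else
both = sketch_filter("run1", files, invert="split")  # (matches, nomatches)
```

With `invert="split"` the result is a two-item tuple, so it can be unpacked directly as `matches, nomatches = ...`.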

`map(func, iterme, **kwargs)`

Super-power your for loops with a progress-bar and optional reproducible parallelization!

Maps func to iterme. Includes a progress-bar powered by tqdm.

Supports parallelization with joblib.Parallel multi-processing by setting n_jobs > 1. Progress-bar accurately tracks parallel jobs!

iterme can be a list of elements, list of DataFrames, list of arrays, or list of lists. List of lists up to 2 deep will be flattened to single list when func = None

See the examples below for interesting use cases beyond standard looping!

Parameters:

Name	Type	Description	Default
`func`	`callable`	function to map	required
`iterme`	`iterable`	an iterable for which each element will be passed to func	required
`enum/enumerate`	`bool`	whether the value of the current iteration should be passed to `func` as the special kwarg `idx`. Make sure `func` can handle a kwarg named `idx`	`False`
`random_state`	`bool/int`	whether a randomly initialized seed should be passed to `func` as the special kwarg `random_state`. The function should pass this seed to the `utilz.check_random_state` helper to generate a random number generator for all computations rather than relying on `np.random`	`False`
`n_jobs`	`int`	number of cpus/threads; 1 means no parallelization	`1`
`backend`	`str`	Only applies if `n_jobs > 1`. See `joblib.Parallel` for options; `None` uses `loky`	`None`
`pbar`	`bool`	whether to use tqdm to show a progress bar	`False`
`verbose`	`int`	`joblib.Parallel` verbosity	`0`
`**kwargs`	`dict`	optional keyword arguments to pass to func	`{}`

Examples:

>>> # Just like map
>>> out = map(lambda x: x * 2, [1, 2, 3, 4])

>>> # Concatenating nested lists
>>> data = [[1, 2], [3, 4]]
>>> out = mapcat(None, data)

>>> # Load multiple files into a single dataframe
>>> out = mapcat(pd.read_csv, ["file1.txt", "file2.txt", "file3.txt"])

>>> # Parallelization with randomness
>>> def f_random(x, random_state=None):
>>>     random_state = check_random_state(random_state)
>>>     sleep(0.5)
>>>     # Use the random state's number generator rather than np.random
>>>     return x + random_state.rand()
>>>
>>> # Now set a random_state in mapcat to reproduce the parallel runs
>>> # It doesn't pass the value, but rather generates a reproducible list
>>> # of seeds that are passed to each function execution
>>> out = mapcat(f_random, [1, 1, 1, 1, 1], n_jobs=2, random_state=1)

Source code in utilz/maps.py

@curry
def map(
    func: Union[Callable, None],
    iterme: Iterable,
    **kwargs,
):
    """
    Super-power your `for` loops with a progress-bar and optional *reproducible*
    parallelization!

    **map**s `func` to `iterme`. Includes a progress-bar powered by `tqdm`.

    Supports parallelization with `joblib.Parallel` multi-processing by setting `n_jobs > 1`. Progress-bar *accurately* tracks parallel jobs!

    `iterme` can be a list of elements, list of DataFrames, list of arrays, or list of
    lists. List of lists up to 2 deep will be flattened to single list when `func = None`

    See the examples below for interesting use cases beyond standard looping!

    Args:
        func (callable): function to map
        iterme (iterable): an iterable for which each element will be passed to func
        enum/enumerate (bool, optional): whether the value of the current iteration should be passed to `func` as the special kwarg `idx`. Make sure `func` can handle a kwarg named `idx`. Default False
        random_state (bool/int, optional): whether a randomly initialized seed should be
            passed to `func` as the special kwarg `random_state`. The function should pass
            this seed to the `utilz.check_random_state` helper to generate a random number
            generator for all computations rather than relying on `np.random`
        n_jobs (int, optional): number of cpus/threads; Default 1 (no parallel)
        backend (str, optional): Only applies if `n_jobs > 1`. See `joblib.Parallel` for
            options; Default None which uses `loky`
        pbar (bool, optional): whether to use tqdm to show a progressbar; Default
            False
        verbose (int): `joblib.Parallel` verbosity. Default 0
        **kwargs (dict, optional): optional keyword arguments to pass to func

    Examples:
        >>> # Just like map
        >>> out = map(lambda x: x * 2, [1, 2, 3, 4])

        >>> # Concatenating nested lists
        >>> data = [[1, 2], [3, 4]]
        >>> out = mapcat(None, data)

        >>> # Load multiple files into a single dataframe
        >>> out = mapcat(pd.read_csv, ["file1.txt", "file2.txt", "file3.txt"])

        >>> # Parallelization with randomness
        >>> def f_random(x, random_state=None):
        >>>     random_state = check_random_state(random_state)
        >>>     sleep(0.5)
        >>>     # Use the random state's number generator rather than np.random
        >>>     return x + random_state.rand()
        >>>
        >>> # Now set a random_state in mapcat to reproduce the parallel runs
        >>> # It doesn't pass the value, but rather generates a reproducible list
        >>> # of seeds that are passed to each function execution
        >>> out = mapcat(f_random, [1, 1, 1, 1, 1], n_jobs=2, random_state=1)

    """

    enum = kwargs.pop("enum", False) or kwargs.pop("enumerate", False)
    random_state = kwargs.pop("random_state", False)
    n_jobs = kwargs.pop("n_jobs", 1)
    backend = kwargs.pop("backend", None)
    pbar = kwargs.pop("pbar", False)
    verbose = kwargs.pop("verbose", 0)

    if func is None:
        # No-op if no function
        op = iterme
    else:
        if isinstance(func, (str, dict, int, float, tuple)):
            func_args = []
        else:
            try:
                func_args = list(signature(func).parameters.keys())
            except ValueError:
                # some funcs like numpy c funcs are not inspectable so we have to skip
                # the checks below
                func_args = None

            # Run the checks outside the try block so that their ValueErrors are not
            # swallowed by the except clause above
            if func_args is None:
                func_args = []
            else:
                if enum and "idx" not in func_args:
                    raise ValueError(
                        "Function must accept a keyword argument named 'idx' that accepts an integer if enum is True"
                    )
                if random_state is not False and "random_state" not in func_args:
                    raise ValueError(
                        "Function must have a keyword argument called 'random_state' if random_state is not False"
                    )

        if random_state is not False:
            # User can pass True instead of a number for non-reproducible
            # parallelization
            random_state = None if random_state is True else random_state

            # Generate a list of random ints, that themselves are seeded by random_state
            # and passed to func
            seeds = check_random_state(random_state).randint(MAX_INT, size=len(iterme))
        else:
            seeds = None

        # Loop; parallel if n_jobs < 1 or > 1
        op = _pmap(func, iterme, enum, seeds, n_jobs, backend, pbar, verbose, kwargs)

    return op
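
The seeding step above derives one seed per element from a single master seed, so each call gets its own generator and the run is reproducible regardless of which worker executes it. The sketch below shows that idea using the standard library's `random.Random` as a stand-in for `np.random.RandomState`; it is not the library's code. `make_seeds`, `noisy`, and `run` are hypothetical names, and the value of `MAX_INT` is an assumption because its definition is not shown in the source.

```python
import random

MAX_INT = 2**31 - 1  # assumption: the module's actual MAX_INT is not shown


def make_seeds(random_state, n):
    """Derive n per-call seeds from one master seed (None means unseeded)."""
    master = random.Random(random_state)
    return [master.randrange(MAX_INT) for _ in range(n)]


def noisy(x, random_state=None):
    # Build a private generator from the seed instead of using global state
    rng = random.Random(random_state)
    return x + rng.random()


def run(data, random_state):
    seeds = make_seeds(random_state, len(data))
    # Each element is paired with its own seed, so execution order is irrelevant
    return [noisy(x, random_state=s) for x, s in zip(data, seeds)]


first = run([1, 1, 1], random_state=1)
second = run([1, 1, 1], random_state=1)
# first and second are identical because both runs derive the same seed list
```

Passing `None` as the master seed corresponds to the `random_state=True` case above: each call still receives its own seed, but the run is not reproducible.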

`mapacross(*args)`

Map multiple functions to an iterable in a matched-pair fashion. The number of functions needs to equal the length of the iterable.

Source code in utilz/maps.py

@curry
def mapacross(*args):
    """Map multiple functions to an iterable in a matched-pair fashion. The
    number of functions needs to equal the length of the iterable."""

    def call(data):
        if not isinstance(data, (list, tuple)):
            raise TypeError(
                f"Expected a list/tuple of input, but received a single {type(data)}. If you want to apply a function to a single input either use a lambda or do()"
            )
        if len(data) != len(args):
            raise ValueError(
                f"The number of functions passed must equal the length of the previous output, but {len(data)} data and {len(args)} functions don't match. To run the same set of functions over the previous inputs see separate()"
            )
        return [f(a) for f, a in zip(args, data)]

    return call

`mapcat(func, iterme, **kwargs)`

Call map and then concatenate the results. Particularly useful for ensuring that results are numpy arrays.

Source code in utilz/maps.py

@curry
def mapcat(func: Union[Callable, None], iterme: Iterable, **kwargs):
    """Call map and concatenate results after.
    Particularly useful to ensure results are numpy arrays"""

    concat_axis = kwargs.pop("concat_axis", None)
    ignore_index = kwargs.pop("ignore_index", True)
    out = map(func, iterme, **kwargs)

    return _concat(out, iterme, concat_axis, ignore_index)
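
For the list-of-lists case, the "map, then concatenate" idea can be sketched with `itertools.chain`. This is an assumption-laden illustration only: the `_concat` helper is not shown in the source, `sketch_mapcat` is a hypothetical name, and arrays or DataFrames are presumably handled differently from what is shown here.

```python
from itertools import chain


def sketch_mapcat(func, items):
    """Hypothetical stand-in: map func (or nothing), then flatten one level."""
    out = items if func is None else [func(x) for x in items]
    return list(chain.from_iterable(out))


# func=None: just flatten a nested list
flat = sketch_mapcat(None, [[1, 2], [3, 4]])  # [1, 2, 3, 4]

# A multi-output function: each element yields several results
expanded = sketch_mapcat(lambda s: s.split(","), ["a,b", "c"])  # ['a', 'b', 'c']
```

Because results are flattened, the output length can differ from the input length, which is why mapcat is the one exception to the same-length guarantee of the map family.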

`mapcompose(*args, **kwargs)`

Compose multiple functions together and map them over a sequence, i.e. a mini-pipe per element. Returns a list the same length as the input iterable containing the final function evaluation for each element.

Source code in utilz/maps.py

@curry
def mapcompose(*args, **kwargs):
    """Compose multiple functions together and map them over a sequence, i.e. a
    mini-pipe per element. Returns a list the same length as the input iterable
    containing the final function evaluation for each element."""

    def call(data):
        if not isinstance(data, (list, tuple)):
            raise TypeError(
                f"All map* funcs expect a list/tuple of input, but received a single {type(data)}."
            )
        if len(args) <= 1:
            raise ValueError(
                f"mapcompose applies *multiple* function calls in sequence but only received {len(args)} function. Use mapcat() to apply a single function."
            )

        composed = compose(*args)
        return map(composed, data, **kwargs)

    return call

`mapif(func, predicate_func, iterme, **kwargs)`

Apply func to each element of iterme if predicate_func is True for that element; otherwise return the element unchanged.

Source code in utilz/maps.py

@curry
def mapif(func, predicate_func, iterme, **kwargs):
    """Apply func to each element of iterme if predicate_func is True for that element
    otherwise return the element"""

    return map(iffy(predicate_func, func), iterme, **kwargs)

`mapmany(*args, **kwargs)`

Map multiple functions separately to each element in an iterable. Returns a list of nested lists containing the output of each function evaluation on each element in iterme

Source code in utilz/maps.py

@curry
def mapmany(*args, **kwargs):
    """Map multiple functions separately to each element in an iterable. Returns a list
     of nested lists containing the output of each function evaluation on each element in
    iterme"""

    def call(data):
        if not isinstance(data, (list, tuple)):
            raise TypeError(
                f"All map* funcs expect a list/tuple of input, but received a single {type(data)}."
            )
        if len(args) <= 1:
            raise ValueError(
                f"mapmany applies *multiple* function calls separately but only received {len(args)} function. Use mapcat() to apply a single function."
            )

        together = _many(*args)
        return map(together, data, **kwargs)

    return call

`mapwith(func, iterwith, iterme, **kwargs)`

Just like map but accepts a second arg that can also be an iterator. In a pipe iterme is always the last input to mapwith, but the first input to func. If copy=True is passed and iterwith is not an iterator, an iterator is built with guaranteed copies of iterwith.

Source code in utilz/maps.py

@curry
def mapwith(func, iterwith, iterme, **kwargs):
    """Just like map but accepts a second arg that can also be an iterator. In a pipe
    iterme is always the *last* input to mapwith, but the *first* input to func. If
    `copy=True` is passed and iterwith is not an iterator, an iterator is built with
    guaranteed copies of iterwith."""

    copy = kwargs.pop("copy", False)

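    if not isinstance(iterwith, (list, tuple)):
        if copy:
            hascopy = getattr(iterwith, "copy", None)
            # Build a fresh copy per element; multiplying a one-item list would
            # repeat a reference to the same single copy
            if callable(hascopy):
                iterwith = [iterwith.copy() for _ in range(len(iterme))]
            else:
                iterwith = [deepcopy(iterwith) for _ in range(len(iterme))]
        else:
            iterwith = [iterwith] * len(iterme)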

    if len(iterme) != len(iterwith):
        raise TypeError(
            f"mapwith received an iterable but its length ({len(iterwith)}) doesn't match the length of the input iterable ({len(iterme)})"
        )

    return map(lambda tup: func(*tup), zip(iterme, iterwith), **kwargs)
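
The documented behaviour (a fixed second argument, a paired second iterable, and per-element copies with `copy=True`) can be sketched in plain Python. `sketch_mapwith`, `add`, and `record` are hypothetical names used only for illustration; the sketch always uses `deepcopy` and a list comprehension for the copies, which is an assumption about what "guaranteed copies" means rather than a statement about the library's code.

```python
from copy import deepcopy


def sketch_mapwith(func, iterwith, iterme, copy=False):
    """Hypothetical stand-in: call func(elem, other) for each pair."""
    if not isinstance(iterwith, (list, tuple)):
        if copy:
            # One independent copy per element
            iterwith = [deepcopy(iterwith) for _ in iterme]
        else:
            # Broadcast the same object to every element
            iterwith = [iterwith] * len(iterme)
    if len(iterwith) != len(iterme):
        raise ValueError("iterwith and iterme must have the same length")
    return [func(a, b) for a, b in zip(iterme, iterwith)]


def add(x, y):
    return x + y


def record(x, log):
    log[x] = True  # mutates the second argument
    return len(log)


fixed = sketch_mapwith(add, 10, [1, 2, 3])  # [11, 12, 13]
paired = sketch_mapwith(add, [10, 20, 30], [1, 2, 3])  # [11, 22, 33]

# Without copies, every call mutates the same dict
shared = sketch_mapwith(record, {}, [1, 2, 3])  # [1, 2, 3]
# With copies, each call starts from its own empty dict
isolated = sketch_mapwith(record, {}, [1, 2, 3], copy=True)  # [1, 1, 1]
```

The `shared` versus `isolated` results show when `copy=True` matters: it is only relevant if `func` mutates its second argument.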