adaptivesplit.sklearn_interface package
Submodules
adaptivesplit.sklearn_interface.learning_curve module
- adaptivesplit.sklearn_interface.learning_curve.calculate_learning_curve(estimator, X, y, sample_sizes, stratify=None, cv=5, cv_stat=<function mean>, dummy_estimator=None, num_samples=1, power_estimator=None, scoring=None, verbose=True, n_jobs=None, random_state=None, *args, **kwargs)[source]
Calculates the learning curve on training and test data. Also generates a learning curve of baseline performance using a dummy estimator.
- Args:
- estimator (estimator object):
Estimator object. An object of that type is instantiated for each grid point. This is assumed to implement the scikit-learn estimator interface. Either the estimator needs to provide a score function, or scoring must be passed. If it is e.g. a GridSearchCV, nested cross-validation is performed (recommended).
- X (numpy.ndarray or pandas.DataFrame):
array-like of shape (n_samples, n_features). The data to fit as in scikit-learn. Can be a numpy array or pandas DataFrame.
- y (numpy.ndarray or pandas.Series):
array-like of shape (n_samples,) or (n_samples, n_outputs). The target variable to try to predict in the case of supervised learning, as in scikit-learn.
- sample_sizes (int or list of int):
Sample sizes at which to calculate the learning curve.
- stratify (int):
For classification tasks. If not None, stratified sampling is used to account for class label imbalance. Defaults to None.
- cv (int, cross-validation generator or an iterable):
Determines the cross-validation splitting strategy, as in scikit-learn. Possible inputs for cv are:
None, to use the default 5-fold cross validation,
int, to specify the number of folds in a (Stratified)KFold,
CV splitter,
An iterable yielding (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, K-Fold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls. Defaults to 5.
- cv_stat (callable):
Function for aggregating cross-validation-wise scores. Defaults to numpy.mean.
- dummy_estimator (estimator object):
A scikit-learn-like dummy estimator to evaluate baseline performance. If None, either DummyClassifier() or DummyRegressor() is used, based on the estimator's type.
- num_samples (int):
Number of iterations to shuffle data before determining subsamples. The first iteration (index 0) is ALWAYS unshuffled (num_samples=1, the default, implies no resampling at all).
- power_estimator (callable):
Must be a power estimator function; see the 'create_power_estimator*' factory functions. If None, the power curve is not calculated. Defaults to None.
- scoring (str, callable, list, tuple or dict):
Scikit-learn-like score to evaluate the performance of the cross-validated model on the test set. If scoring represents a single score, one can use:
a single string (see The scoring parameter: defining model evaluation rules);
a callable (see Defining your scoring strategy from metric functions) that returns a single value.
If scoring represents multiple scores, one can use:
a list or tuple of unique strings;
a callable returning a dictionary where the keys are the metric names and the values are the metric scores;
a dictionary with metric names as keys and callables as values.
If None, the estimator’s score method is used. Defaults to None.
- verbose (bool):
If True, prints progress. Defaults to True.
- n_jobs (int):
Number of jobs to run in parallel. Defaults to None. Training the estimator and computing the score are parallelized over the cross-validation splits. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- random_state (int):
Controls the randomness of the bootstrapping of the samples used when building sub-samples (if shuffle!=-1). Defaults to None.
- *args:
Extra parameters passed to sklearn.model_selection.cross_validate.
- **kwargs:
Extra keyword parameters passed to sklearn.model_selection.cross_validate.
- Returns:
- lc_train (adaptivesplit.base.learning_curve.LearningCurve object):
Learning curve calculated on training data.
- lc_test (adaptivesplit.base.learning_curve.LearningCurve object):
Learning curve calculated on test data.
- lc_dummy (adaptivesplit.base.learning_curve.LearningCurve object):
Learning curve calculated using the dummy estimator. It estimates baseline learning performance.
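A minimal usage sketch for calculate_learning_curve. The synthetic data, estimator choice and parameter values below are illustrative assumptions; only the function signature and return values are taken from this documentation:

    from sklearn.svm import SVR
    from adaptivesplit.sklearn_interface.learning_curve import calculate_learning_curve
    import numpy as np

    # Illustrative synthetic regression data (assumption, not part of the package).
    rng = np.random.default_rng(42)
    X = rng.normal(size=(120, 10))
    y = rng.normal(size=120)

    # Learning curves on the training set, the test set and for the dummy baseline.
    lc_train, lc_test, lc_dummy = calculate_learning_curve(
        estimator=SVR(),
        X=X,
        y=y,
        sample_sizes=list(range(20, 120, 10)),
        cv=5,
        scoring='neg_mean_squared_error',
        num_samples=10,       # 10 shuffled subsamples per sample size
        random_state=42,
    )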
adaptivesplit.sklearn_interface.power module
- adaptivesplit.sklearn_interface.power.PredictedScoreAndPower
Returned by the “predict_power_curve” function.
- adaptivesplit.sklearn_interface.power.predict_power_curve(estimator, X, y, power_estimator, total_sample_size, stratify=None, sample_sizes=None, step=None, cv=5, num_samples=100, scoring=None, verbose=True, n_jobs=None, random_state=None, **kwargs)[source]
- If total_sample_size > len(y), extrapolates the power curve trend to show what happens at larger sample sizes.
- Args:
- estimator (estimator object):
This is assumed to implement the scikit-learn estimator interface.
- X (numpy.ndarray or pandas.DataFrame):
array-like of shape (n_samples, n_features). The data to fit as in scikit-learn. Can be a numpy array or pandas DataFrame.
- y (numpy.ndarray or pandas.Series):
array-like of shape (n_samples,) or (n_samples, n_outputs). The target variable to try to predict in the case of supervised learning, as in scikit-learn.
- power_estimator (callable):
Must be a power estimator function; see the 'create_power_estimator*' factory functions.
- total_sample_size (int):
The total sample size to consider. If larger than len(y), the power curve trend is extrapolated (see above).
- stratify (int):
For classification tasks. If not None, stratified sampling is used to account for class label imbalance. Defaults to None.
- sample_sizes (int or list of int):
Sample sizes at which to calculate the power curve. Defaults to None.
- step (int):
Step size between sample sizes. A value of 1 is recommended. Defaults to None.
- cv (int, cross-validation generator or an iterable):
Determines the cross-validation splitting strategy, as in scikit-learn. Possible inputs for cv are:
None, to use the default 5-fold cross validation,
int, to specify the number of folds in a (Stratified)KFold,
CV splitter,
An iterable yielding (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, K-Fold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls. Defaults to 5.
- num_samples (int):
Number of iterations to shuffle data before determining subsamples. The first iteration (index 0) is ALWAYS unshuffled (num_samples=1 implies no resampling at all). Defaults to 100.
- scoring (str, callable, list, tuple or dict):
Scikit-learn-like score to evaluate the performance of the cross-validated model on the test set. If scoring represents a single score, one can use:
a single string (see The scoring parameter: defining model evaluation rules);
a callable (see Defining your scoring strategy from metric functions) that returns a single value.
If scoring represents multiple scores, one can use:
a list or tuple of unique strings;
a callable returning a dictionary where the keys are the metric names and the values are the metric scores;
a dictionary with metric names as keys and callables as values.
If None, the estimator’s score method is used. Defaults to None.
- verbose (bool):
Prints progress. Defaults to True.
- n_jobs (int):
Number of jobs to run in parallel. Defaults to None. Training the estimator and computing the score are parallelized over the cross-validation splits. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- random_state (int):
Controls the randomness of the bootstrapping of the samples used when building sub-samples (if shuffle!=-1). Defaults to None. Currently NOT implemented.
- Returns:
- PredictedScoreAndPower (tuple):
Contains the predicted score and power (in this order).
adaptivesplit.sklearn_interface.resampling module
- class adaptivesplit.sklearn_interface.resampling.PermTest(stat_fun, num_samples=1000, n_jobs=-1, compare=<built-in function ge>, verbose=True, message='Permutation test')[source]
Bases:
PermTest
Implements a permutation test.
- class adaptivesplit.sklearn_interface.resampling.Resample(stat_fun, sample_size, stratify=None, num_samples=1000, replacement=True, first_unshuffled=False, n_jobs=-1, verbose=True, message='Resampling')[source]
Bases:
Resample
Implements the resampling strategy.
- class adaptivesplit.sklearn_interface.resampling.SubSampleCV(estimator, sample_size, dummy_estimator=None, num_samples=100, cv=None, cv_stat=<function mean>, groups=None, scoring=None, power_estimator=None, n_jobs=-1, verbose=True, message='Calculating learning curve')[source]
Bases:
Resample
Calculates learning performance/power on subsamples of the whole data.
- Args:
- estimator (estimator object):
This is assumed to implement the scikit-learn estimator interface.
- sample_size (int):
Current sample size.
- dummy_estimator (estimator object):
A scikit-learn-like dummy estimator to evaluate baseline performance. Defaults to None. If None, either DummyClassifier() or DummyRegressor() is used, based on the estimator's type.
- num_samples (int):
Number of iterations to shuffle data before determining subsamples. Defaults to 100. The first iteration (index 0) is ALWAYS unshuffled (num_samples=1 implies no resampling at all).
- cv (int, cross-validation generator or an iterable):
Determines the cross-validation splitting strategy, as in scikit-learn. Possible inputs for cv are:
None, to use the default 5-fold cross validation,
int, to specify the number of folds in a (Stratified)KFold,
CV splitter,
An iterable yielding (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, K-Fold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls. Defaults to None.
- cv_stat (callable):
Function for aggregating cross-validation-wise scores. Defaults to numpy.mean.
- groups (array-like of shape (n_samples,)):
Group labels for the samples used while splitting the dataset into train/test set. This 'groups' parameter changes the cross-validation strategy from KFold to GroupKFold as implemented in scikit-learn. Defaults to None.
- scoring (str, callable, list, tuple or dict):
Scikit-learn-like score to evaluate the performance of the cross-validated model on the test set. If scoring represents a single score, one can use:
a single string (see The scoring parameter: defining model evaluation rules);
a callable (see Defining your scoring strategy from metric functions) that returns a single value.
If scoring represents multiple scores, one can use:
a list or tuple of unique strings;
a callable returning a dictionary where the keys are the metric names and the values are the metric scores;
a dictionary with metric names as keys and callables as values.
If None, the estimator’s score method is used. Defaults to None.
- power_estimator (callable):
Must be a power estimator function; see the 'create_power_estimator*' factory functions. If None, the power curve is not calculated. Defaults to None.
- n_jobs (int):
Number of jobs to run in parallel. Defaults to -1. Training the estimator and computing the score are parallelized over the cross-validation splits. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- verbose (bool):
Prints progress. Defaults to True.
- message (str):
Message shown during calculations. Defaults to “Calculating learning curve”.
- Returns:
- SubSampledStats (tuple):
Contains the power and scores obtained during training/test with subsampled data.
- fit_transform(x, y, stratify=None, sample_size=None, num_samples=None, replacement=None, compare=None, n_jobs=None, verbose=None, random_seed=None, **kwargs)[source]
Fit the model using the current sample size.
- adaptivesplit.sklearn_interface.resampling.SubSampledStats
Stores results given by “SubSampleCV”.
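A hedged sketch of how SubSampleCV might be used directly, based only on the constructor and fit_transform signatures documented above; the synthetic data and parameter values are illustrative assumptions:

    from sklearn.linear_model import LogisticRegression
    from adaptivesplit.sklearn_interface.resampling import SubSampleCV
    import numpy as np

    # Illustrative synthetic classification data (assumption).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = rng.integers(0, 2, size=200)

    subsampler = SubSampleCV(
        estimator=LogisticRegression(),
        sample_size=100,     # evaluate learning performance at this subsample size
        num_samples=20,      # number of shuffled subsamples
        cv=5,
        scoring='roc_auc',
    )
    # Per the Returns section, a SubSampledStats tuple with power and training/test scores.
    stats = subsampler.fit_transform(X, y)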
adaptivesplit.sklearn_interface.split module
- class adaptivesplit.sklearn_interface.split.AdaptiveSplit(total_sample_size=500, scoring='neg_mean_squared_error', cv=5, step=1, bootstrap_samples=100, power_bootstrap_samples=None, window_size=None, verbose=True, plotting=True, ci='95%', n_jobs=-1)[source]
Bases:
object
Runs the AdaptiveSplit model. This evaluates performance on multiple splits of the data by calculating the learning and power curves using bootstrapping. The model works for both regression and classification tasks, depending on the scikit-learn estimator and the type of score metric provided.
If the total sample size provided to this class is higher than len(Y), the algorithm will predict the learning and power curves for the additional samples. This is useful for checking whether a larger sample size would improve model predictions.
- Args:
- total_sample_size (int):
The total length of the data given as input. Defaults to “total_sample_size” as specified in the configuration file.
- scoring (str, callable, list, tuple or dict):
Scikit-learn-like score to evaluate the performance of the cross-validated model on the test set. If scoring represents a single score, one can use:
a single string (see The scoring parameter: defining model evaluation rules);
a callable (see Defining your scoring strategy from metric functions) that returns a single value.
If scoring represents multiple scores, one can use:
a list or tuple of unique strings;
a callable returning a dictionary where the keys are the metric names and the values are the metric scores;
a dictionary with metric names as keys and callables as values.
If None, the estimator’s score method is used. Defaults to “scoring” as specified in the configuration file.
- cv (int, cross-validation generator or an iterable):
Determines the cross-validation splitting strategy, as in scikit-learn. Possible inputs for cv are:
None, to use the default 5-fold cross validation,
int, to specify the number of folds in a (Stratified)KFold,
CV splitter,
An iterable yielding (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, K-Fold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls. Defaults to “cv” as specified in the configuration file.
- step (int):
Step size between sample sizes. A value of 1 is recommended. Defaults to “step” as specified in the configuration file.
- bootstrap_samples (int):
Number of samples generated during bootstrapping. Defaults to “bootstrap_samples” as specified in the configuration file.
- power_bootstrap_samples (int):
Number of iterations during which samples are bootstrapped to calculate power. Defaults to "power_bootstrap_samples" as specified in the configuration file.
- window_size (int):
Size of the rolling window used to calculate the slope of the power curve. If fast_mode is True in the fit method, it is also used to determine the reduced sample sizes used in fast mode. If None, defaults to "window_size" as specified in the configuration file.
- verbose (bool):
Prints progress. Defaults to True.
- plotting (bool):
Whether or not to plot the learning and the power curves after calculations. Defaults to True.
- ci (str):
Confidence interval used when plotting the learning and power curves. Defaults to '95%'.
- n_jobs (int):
Number of jobs to run in parallel. Defaults to “n_jobs” as specified in the configuration file. Training the estimator and computing the score are parallelized over the cross-validation splits. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- Returns:
- AdaptiveSplitResults (namedtuple):
Contains the results from the AdaptiveSplit algorithm, i.e. estimated stopping point, scores and power.
- Figure (matplotlib.figure.Figure):
Plot illustrating the learning and power curves.
- fit(X, Y, estimator, stratify=None, fast_mode=False, sample_size_multiplier=0.2, predict=True, random_state=None)[source]
Fit the AdaptiveSplit model.
- Args:
- X (numpy.ndarray or pandas.DataFrame):
array-like of shape (n_samples, n_features). The data to fit as in scikit-learn. Can be a numpy array or pandas DataFrame.
- Y (numpy.ndarray or pandas.Series):
array-like of shape (n_samples,) or (n_samples, n_outputs). The target variable to try to predict in the case of supervised learning, as in scikit-learn.
- estimator (estimator object):
Estimator object. An object of that type is instantiated for each grid point. This is assumed to implement the scikit-learn estimator interface. Either the estimator needs to provide a score function, or scoring must be passed. If it is e.g. a GridSearchCV, nested cross-validation is performed (recommended).
- stratify (optional):
For classification tasks. If not None, stratified sampling is used to account for class label imbalance. Defaults to "stratify" as specified in the configuration file.
- fast_mode (bool):
If True, the algorithm is evaluated on reduced sample sizes to shorten runtime. Defaults to "fast_mode" as specified in the configuration file.
- sample_size_multiplier (float):
Multiplier used to make sure the algorithm starts with an adequate sample size (the recommended value is 0.2). Defaults to "sample_size_multiplier" as specified in the configuration file.
- predict (bool, optional):
If True, tries to predict the learning and power curves for additional samples. If total_sample_size == len(Y), it is automatically set to False. Defaults to True.
- random_state (int, optional):
Controls the randomness of the bootstrapping of the samples used when building sub-samples (if shuffle!=-1). Defaults to None.
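A hedged end-to-end sketch based on the constructor and fit signatures above. The synthetic data, estimator and parameter values are illustrative assumptions, and the unpacking of the return value follows the Returns section of this class (an AdaptiveSplitResults namedtuple and a matplotlib figure):

    from sklearn.linear_model import Ridge
    from adaptivesplit.sklearn_interface.split import AdaptiveSplit
    import numpy as np

    # Illustrative synthetic regression data (assumption).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 10))
    y = rng.normal(size=300)

    adaptive_split = AdaptiveSplit(
        total_sample_size=len(y),   # no extrapolation beyond the available data
        scoring='neg_mean_squared_error',
        cv=5,
        step=1,
        bootstrap_samples=100,
    )
    # Assumed to return the AdaptiveSplitResults namedtuple and the figure listed above.
    results, fig = adaptive_split.fit(X, y, estimator=Ridge(), random_state=1)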
adaptivesplit.sklearn_interface.utils module
- adaptivesplit.sklearn_interface.utils.calculate_ci(X, ci='95%')[source]
Calculate confidence intervals.
- Args:
- X (list, np.ndarray, pd.Series):
1D array of shape (n_samples,).
- ci (str, optional):
Confidence level to return. Possible inputs are '90%', '95%', '98%' and '99%'. Defaults to '95%'.
- Returns:
- ci_lower:
Lower bound of the confidence interval.
- ci_upper:
Upper bound of the confidence interval.
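For illustration, a minimal sketch of calculate_ci; the score array below is an assumed example input:

    import numpy as np
    from adaptivesplit.sklearn_interface.utils import calculate_ci

    # Assumed example input: bootstrapped scores from some model evaluation.
    scores = np.random.default_rng(0).normal(loc=0.7, scale=0.05, size=1000)

    ci_lower, ci_upper = calculate_ci(scores, ci='95%')
    print(f"95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")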
- adaptivesplit.sklearn_interface.utils.get_sklearn_scorer(scoring)[source]
Provides a scikit-learn scoring function given an input string.
- Args:
- scoring (str, callable, list, tuple or dict):
Scikit-learn-like score to evaluate the performance of the cross-validated model on the test set. If scoring represents a single score, one can use:
a single string (see The scoring parameter: defining model evaluation rules);
a callable (see Defining your scoring strategy from metric functions) that returns a single value.
If scoring represents multiple scores, one can use:
a list or tuple of unique strings;
a callable returning a dictionary where the keys are the metric names and the values are the metric scores;
a dictionary with metric names as keys and callables as values.
If None, the estimator’s score method is used. Defaults to “scoring” as specified in the configuration file.
- Returns:
- score_func (callable):
Scikit-Learn scoring function.
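A minimal sketch of get_sklearn_scorer. The metric name and the way the returned scorer is called are assumptions based on scikit-learn's scorer convention:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from adaptivesplit.sklearn_interface.utils import get_sklearn_scorer

    # Resolve a scikit-learn scoring function from its string name.
    score_func = get_sklearn_scorer('r2')

    # Assumed usage following scikit-learn's scorer convention: scorer(estimator, X, y).
    X, y = make_regression(n_samples=100, n_features=5, random_state=0)
    model = Ridge().fit(X, y)
    print(score_func(model, X, y))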