A modular active learning framework for Python

Last update: Dec 31, 2022

Overview

Modular Active Learning framework for Python3

Introduction

modAL is an active learning framework for Python3, designed with modularity, flexibility and extensibility in mind. Built on top of scikit-learn, it allows you to rapidly create active learning workflows with nearly complete freedom. What is more, you can easily replace parts with your custom-built solutions, allowing you to design novel algorithms with ease.

Active learning from bird's-eye view

With the recent explosion of available data, you can have millions of unlabelled examples whose labels are costly to obtain. For instance, when trying to predict the sentiment of tweets, obtaining a training set can require immense manual labour. But worry not, active learning comes to the rescue! In general, AL is a framework allowing you to increase classification performance by intelligently asking you to label the most informative instances. To give an example, suppose that you have the following data and classifier, with shaded regions signifying the classification probability.

Suppose that you can query the label of an unlabelled instance, but it costs you a lot. Which one would you choose? By querying an instance in the uncertain region, surely you obtain more information than by querying at random. Active learning gives you a set of tools to handle problems like this. In general, an active learning workflow looks like the following.

The key components of any workflow are the model you choose, the uncertainty measure you use and the query strategy you apply to request labels. With modAL, instead of choosing from a small set of built-in components, you have the freedom to seamlessly integrate scikit-learn or Keras models into your algorithm and easily tailor your custom query strategies and uncertainty measures.
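To make the roles of the uncertainty measure and the query strategy concrete, here is a minimal NumPy sketch. The function names and the probability values are made up for illustration and are not modAL's API; the sketch assumes the model exposes class probabilities and that "uncertainty" means one minus the highest class probability.

```python
import numpy as np

def classification_uncertainty(proba):
    """Uncertainty of each prediction: 1 minus the highest class probability."""
    return 1.0 - np.max(proba, axis=1)

def select_most_uncertain(proba, n_instances=1):
    """Return the indices of the n_instances most uncertain rows, most uncertain first."""
    uncertainty = classification_uncertainty(proba)
    return np.argsort(-uncertainty, kind="stable")[:n_instances]

# Hypothetical class probabilities for four unlabelled instances (illustrative values only).
proba = np.array([
    [0.95, 0.05],
    [0.55, 0.45],
    [0.80, 0.20],
    [0.60, 0.40],
])
query_idx = select_most_uncertain(proba, n_instances=2)
```

The instances whose top probability is closest to a coin flip are the ones picked for labelling; swapping in another measure only means replacing the first function.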

modAL in action

Let's see what modAL can do for you!

From zero to one in a few lines of code

Active learning with a scikit-learn classifier, for instance RandomForestClassifier, can be as simple as the following.

from modAL.models import ActiveLearner
from sklearn.ensemble import RandomForestClassifier

# initializing the learner
learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    X_training=X_training, y_training=y_training
)

# query for labels
query_idx, query_inst = learner.query(X_pool)

# ...obtaining new labels from the Oracle...

# supply label for queried instance
learner.teach(X_pool[query_idx], y_new)
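In a pool-based loop, the queried instance is usually taken out of the pool once its label has been supplied, so that it is not selected again. A minimal sketch of that bookkeeping with plain NumPy arrays follows; the data, the fixed query index and the use of y_pool as a stand-in for the Oracle are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(6, 2))      # hypothetical unlabelled pool
y_pool = rng.integers(0, 2, size=6)   # hidden labels, standing in for the Oracle

query_idx = np.array([3])             # index as it might come back from a query

# The "Oracle" supplies the label of the queried instance.
X_new, y_new = X_pool[query_idx], y_pool[query_idx]

# Drop the queried row so that it cannot be selected again.
X_pool = np.delete(X_pool, query_idx, axis=0)
y_pool = np.delete(y_pool, query_idx, axis=0)
```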

Replacing parts quickly

If you would like to use uncertainty measures and query strategies other than the default uncertainty sampling, you can either choose from several built-in strategies or design your own by following a few very simple design principles. For instance, replacing the default uncertainty measure with classification entropy looks like the following.

from modAL.models import ActiveLearner
from modAL.uncertainty import entropy_sampling
from sklearn.ensemble import RandomForestClassifier

learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=entropy_sampling,
    X_training=X_training, y_training=y_training
)
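For intuition about what an entropy-based measure rewards, the following NumPy sketch computes the Shannon entropy of a few hand-written probability vectors. It is an illustration only, not modAL's implementation; the function name and the example probabilities are assumptions.

```python
import numpy as np

def classification_entropy(proba):
    """Shannon entropy (natural log) of each row of class probabilities."""
    proba = np.asarray(proba, dtype=float)
    # Treat 0 * log(0) as 0 by only taking the log of positive entries.
    log_p = np.zeros_like(proba)
    np.log(proba, out=log_p, where=proba > 0)
    return -np.sum(proba * log_p, axis=1)

proba = np.array([
    [1.0, 0.0, 0.0],   # fully confident prediction
    [0.5, 0.5, 0.0],   # torn between two classes
    [1/3, 1/3, 1/3],   # maximally uncertain
])
entropy = classification_entropy(proba)
```

Unlike the "one minus the top probability" measure, entropy takes the whole distribution into account, so an instance spread evenly over three classes scores higher than one split between two.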

Replacing parts with your own solutions

modAL was designed to make it easy for you to implement your own query strategy. For example, implementing and using a simple random sampling strategy is as easy as the following.

import numpy as np

def random_sampling(classifier, X_pool):
    n_samples = len(X_pool)
    query_idx = np.random.choice(range(n_samples))
    return query_idx, X_pool[query_idx]

learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=random_sampling,
    X_training=X_training, y_training=y_training
)

For more details on how to implement your custom strategies, visit the page Extending modAL!

An example with active regression

To see modAL in real action, let's consider an active regression problem with Gaussian Processes! In this example, we shall try to learn the noisy sine function:

import numpy as np

X = np.random.choice(np.linspace(0, 20, 10000), size=200, replace=False).reshape(-1, 1)
y = np.sin(X) + np.random.normal(scale=0.3, size=X.shape)

For active learning, we shall define a custom query strategy tailored to Gaussian processes. In a nutshell, a query strategy in modAL is a function taking (at least) two arguments (an estimator object and a pool of examples) and outputting the index of the queried instance. In our case, the arguments are regressor and X.

def GP_regression_std(regressor, X):
    _, std = regressor.predict(X, return_std=True)
    return np.argmax(std)

After setting up the query strategy and the data, the active learner can be initialized.

from modAL.models import ActiveLearner
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, RBF

n_initial = 5
initial_idx = np.random.choice(range(len(X)), size=n_initial, replace=False)
X_training, y_training = X[initial_idx], y[initial_idx]

kernel = RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e3)) \
         + WhiteKernel(noise_level=1, noise_level_bounds=(1e-10, 1e+1))

regressor = ActiveLearner(
    estimator=GaussianProcessRegressor(kernel=kernel),
    query_strategy=GP_regression_std,
    X_training=X_training.reshape(-1, 1), y_training=y_training.reshape(-1, 1)
)

The initial regressor is not very accurate.

The blue band enveloping the regressor represents the standard deviation of the Gaussian process at the given point. Now we are ready to do active learning!

# active learning
n_queries = 10
for idx in range(n_queries):
    query_idx, query_instance = regressor.query(X)
    regressor.teach(X[query_idx].reshape(1, -1), y[query_idx].reshape(1, -1))

After a few queries, we can see that the prediction is much improved.

Additional examples

Many more examples, including this one, are available in the documentation.

Installation

modAL requires

Python >= 3.5
NumPy >= 1.13
SciPy >= 0.18
scikit-learn >= 0.18

You can install modAL directly with pip:

pip install modAL

Alternatively, you can install modAL directly from source:

pip install git+https://github.com/modAL-python/modAL.git

Documentation

You can find the documentation of modAL at https://modAL-python.github.io, where several tutorials and working examples are available, along with a complete API reference. For running the examples, Matplotlib >= 2.0 is recommended.

Citing

If you use modAL in your projects, you can cite it as

@article{modAL2018,
    title={mod{AL}: {A} modular active learning framework for {P}ython},
    author={Tivadar Danka and Peter Horvath},
    url={https://github.com/modAL-python/modAL},
    note={available on arXiv at \url{https://arxiv.org/abs/1805.00979}}
}

About the developer

modAL is developed by me, Tivadar Danka (aka cosmic-cortex in GitHub). I have a PhD in pure mathematics, but I fell in love with biology and machine learning right after I finished my PhD. I have changed fields and now I work in the Bioimage Analysis and Machine Learning Group of Peter Horvath, where I am working to develop active learning strategies for intelligent sample analysis in biology. During my work I realized that in Python, creating and prototyping active learning workflows can be made really easy and fast with scikit-learn, so I ended up developing a general framework for this. The result is modAL :) If you have any questions, requests or suggestions, you can contact me at [email protected]! I hope you'll find modAL useful!

Comments

Pandas support & support for applying transformations configured in sklearn.pipeline
Most notable changes

query strategies now only return the indices of the selected instances, the query method then includes the instances themselves

old interface is still supported, but its usage results in a deprecation warning

added on_transformed parameter to learners; when True and the estimator uses sklearn.pipeline, the transformations configured in that pipeline are applied before calculating metrics on the data set

Committees also support this functionality, but as they have no X_training (it could be different for each of their learners), the training data cannot yet be transformed
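As an illustration of the first change listed above, here is a sketch of a query strategy that returns only indices and leaves it to the caller to look up the instances. The function name, the rng argument and the toy pool are assumptions made for this example; they are not taken from the PR.

```python
import numpy as np

def random_sampling_indices(classifier, X_pool, n_instances=1, rng=None):
    """Query strategy in the index-only style: return just the selected indices."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(len(X_pool), size=n_instances, replace=False)

X_pool = np.arange(20).reshape(10, 2)   # toy pool of ten instances
query_idx = random_sampling_indices(
    None, X_pool, n_instances=3, rng=np.random.default_rng(42)
)

# The caller can recover the instances from the indices on its own.
query_instances = X_pool[query_idx]
```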

Note

@cosmic-cortex , after playing around with your code, I must say you have created a great library! I am open to discussion to get this functionality merged, but please don't feel any pressure to do so if you are not satisfied with the implementation. I just needed to resolve #104 for my project and my fork is now sufficient for my needs.

Note2

Not sure where this functionality should be addressed in the docs.
opened by BoyanH 15
vote_entropy

I guess vote_entropy and KL_Divergence are not being returned correctly, as all values are zero. If I am doing it wrong, can you suggest a code snippet showing how to use KL_Divergence or vote_entropy instead of consensus entropy for querying the points when using query by committee?
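For reference, vote entropy can be computed by hand from the committee members' predicted labels. The NumPy sketch below is not modAL's implementation and the function name is hypothetical. It does show one situation in which every value comes out as zero, namely when all committee members agree on every instance. Whether that is what happens in the case reported here cannot be told from the description.

```python
import numpy as np

def vote_entropy_from_votes(votes, n_classes):
    """Vote entropy per instance.

    votes: integer array of shape (n_instances, n_members), each entry the
    class label predicted by one committee member.
    """
    votes = np.asarray(votes)
    n_members = votes.shape[1]
    # Fraction of committee members voting for each class, per instance.
    fractions = np.stack(
        [(votes == c).sum(axis=1) / n_members for c in range(n_classes)], axis=1
    )
    log_f = np.zeros_like(fractions)
    np.log(fractions, out=log_f, where=fractions > 0)
    return -np.sum(fractions * log_f, axis=1)

votes = np.array([
    [0, 0, 0],   # unanimous committee
    [0, 1, 1],   # two against one
    [0, 1, 2],   # complete disagreement
])
entropy = vote_entropy_from_votes(votes, n_classes=3)
```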

opened by srivastavapravesh14-zz 10
cold start handling in ranked batch sampling

Hi!

The behavior of cold start handling in ranked batch sampling seems different from Cardoso et al.'s "Ranked batch-mode active learning".

https://github.com/modAL-python/modAL/blob/452898fc181b6d4ae6399dfdcb311ceb952c8486/modAL/batch.py#L133-L139

In modAL's implementation, in the case of cold start, the instance selected by select_cold_start_instance is not added to the instance list instance_index_ranking. While in "Ranked batch-mode active learning", the instance selected by select_cold_start_instance seems to be the first item in instance_index_ranking.

https://github.com/modAL-python/modAL/blob/452898fc181b6d4ae6399dfdcb311ceb952c8486/modAL/batch.py#L46

If my understanding of the algorithm proposed in the paper and of modAL's implementation is correct, we can change the return of select_cold_start_instance to return best_coldstart_instance_index, X[best_coldstart_instance_index].reshape(1, -1), store best_coldstart_instance_index in instance_index_ranking, and revise ranked_batch correspondingly.

opened by zhangyu94 10
Support batch-mode queries?

Hi,

I've run into a bit of a use-case that I'm not sure is quite supported by modAL – nor the broader libraries for active learning – but would be relatively simple to implement. After reviewing modAL's internals a bit, I don't think it officially supports active learning with batch-mode queries.

The sampling strategies (for example, uncertainty sampling) do support the n_instances parameter, but from what I can tell, uncertainty sampling may return redundant/sub-optimal queries if we return more than one instance from the unlabeled set. This is a bit prohibitive in settings where we'd like to ask an active learner to return multiple (if not all) examples from the unlabeled set/pool, and the computational cost for re-training an active learning model goes without saying.

I found requests for batch-mode support in the popular libact library (issues #57 and #89) but, to the best of my knowledge, I'm not sure they were addressed in any of their PRs.

In that case, does it make sense to implement something like [Ranked batch-mode active learning] by Cardoso et al.? I took a crack at it this weekend for a better personal understanding, but if it's worth integrating and supporting in modAL I'm happy to polish it and talk it through in a PR.
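To illustrate the redundancy concern, here is a simplified greedy sketch in the spirit of ranked batch-mode selection. It is neither the exact algorithm from the paper nor any library's implementation. The weighting alpha, the similarity 1 / (1 + distance), and all names and data are assumptions chosen for the example.

```python
import numpy as np

def ranked_batch(uncertainty, X_pool, X_labeled, n_instances):
    """Greedy batch selection trading off uncertainty against similarity to
    points that are already labelled or already selected."""
    X_seen = np.asarray(X_labeled, dtype=float)
    candidates = list(range(len(X_pool)))
    selected = []
    for _ in range(n_instances):
        # alpha shifts weight from diversity to uncertainty as seen points accumulate.
        alpha = len(candidates) / (len(candidates) + len(X_seen))
        # Distance of each candidate to its nearest seen point.
        diffs = X_pool[candidates][:, None, :] - X_seen[None, :, :]
        nearest = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
        similarity = 1.0 / (1.0 + nearest)
        scores = alpha * (1.0 - similarity) + (1.0 - alpha) * uncertainty[candidates]
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
        X_seen = np.vstack([X_seen, X_pool[best]])
    return selected

X_labeled = np.array([[0.0, 0.0]])
X_pool = np.array([[0.1, 0.0], [0.1, 0.05], [5.0, 5.0], [5.0, 5.1]])
uncertainty = np.array([0.9, 0.9, 0.5, 0.5])

selected = ranked_batch(uncertainty, X_pool, X_labeled, n_instances=2)
top_by_uncertainty = np.argsort(-uncertainty, kind="stable")[:2].tolist()
```

In this toy setup, plain top-2 uncertainty picks the two near-duplicate points next to the labelled one, while the ranked variant spreads the batch over both clusters.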

Thanks!

opened by dataframing 10
Pytorch runnable example

this is a runnable example of modAL using pytorch models, wrapped with skorch. this example is very similar to the one we can find in modAL/examples/keras_integration.py

opened by damienlancry 9

use different query strategies

I am using keras/tensorflow models with this framework and the ActiveLearner class. As soon as I try to change the query strategy, different errors occur.

learner = ActiveLearner(
    estimator=classifier,
    query_strategy=expected_error_reduction,
    X_training=x_initial_training,
    y_training=y_initial_training,
)
prescore = learner.score(x_test, y_test)
n_queries = 50
postscore = np.zeros(shape=(n_queries, 1))
for idx in range(n_queries):
    print('Query no. %d' % (idx + 1))
    query_idx, query_instance = learner.query(x_pool)
    learner.teach(
        X=x_pool[query_idx],
        y=y_pool[query_idx],
        only_new=True,
        epochs=10,
        validation_data=(x_val, y_val),
    )
    # remove queried instances from pool
    x_pool = np.delete(x_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)
    postscore[idx, 0] = learner.score(x_test, y_test)

What do I have to change to use the different strategies? The training input has a 3D shape. So far I have tried all the uncertainty methods, of which only the default selection worked. Now I am trying the expected_error_reduction strategy, but errors occur there as well.

I am afraid the 3D shape of the training data is breaking all the other algorithms, but for an LSTM this kind of shape is required.

opened by alexv1247 9

docs: refactor documentation

Autoconversion of docstrings with pyment doesn't work well, because the initial format was not following a strict standard. So there are a lot of manual corrections. I have chosen Google style for docstring, however conversion from it to NumPy style with pyment could be easier.

The first half of modAL.models looks good, but there may be some improvements (further deduplication) in coming days. Review and comments on committed parts could help to finish the whole refactoring (I hope, by the weekend).

opened by nikolay-bushkov 9
DBAL with Image Data implementation using modAL

I created an example script trying to reproduce the results of Deep Bayesian Active Learning with Image Data using modAL. I used this keras code from one of the authors. I cannot think of anything I am doing differently and yet their code works and not mine. For the acquisition function instead of using their modified keras, i used yarin gal's implementation (first author). Can you spot any mistake in my code? EDIT: I actually found a mistake in my code, I was not really computing the entropy but rather the other half of BALD function. I fixed this mistake and am currently running the code. EDIT2: Still not working

opened by damienlancry 8
Entropy sampling query strategy unstable

I'm using the entropy sampling strategy to select samples for RandomForest classification of 7 classes. However, when I run my query with entropy sampling (I also tried uncertainty sampling), I get a different result every time I run the query. The selected samples are never the same (I have not changed my input data).

Thank you in advance for your help.

opened by YousraH 8
about learner.teach

It seems that each time we run learner.teach, the model is fitted on the initial data plus the new data from the beginning, just like an untrained new model. Can the model instead learn only the new data, starting from the weights that were trained on the initial data?

opened by luxu1220 7
Using RandomForestClassifier on vectors for predicting labels gives "Found input variables with inconsistent numbers of samples"
I am learning from the Active Regression tutorial page, but it does not cover applying learners to vectors with more than one dimension (I was not able to find a specific example in the docs for this, so please point me to one if you know of it).

In the function named my_stuff

My learner is

regressor = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=entropy_sampling,
    X_training=X_training, y_training=y_training.ravel()
)

My dataset X is (13084, 50) ( meaning 13084 vectors each having 50 length ) and y is (13084, 1) ( similar meaning ).

Here X_training is (5, 50) and y_training is (5, 1). In this section of the code (taken directly from the tutorial page mentioned above):

for idx in range(n_queries):
    query_idx, query_instance = regressor.query(X)
    print(query_idx, 'query_idx', X_training.shape, y_training.shape)
    regressor.teach(X[query_idx].reshape(-1, 1), y[query_idx].reshape(-1, 1))

The program ended abruptly, so upon using python debugger I found the error:

ValueError: Found input variables with inconsistent numbers of samples: [50, 1]
> /path/to/file/predict.py(286)my_stuff()
-> regressor.teach(X[query_idx].reshape(-1, 1), y[query_idx].reshape(-1, 1))

Here X[query_idx].reshape(-1, 1) has shape (50, 1) and y[query_idx].reshape(-1, 1) has shape (1, 1).

What would be the correct procedure for the teach procedure?
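The shapes reported above can be reproduced with NumPy alone. The sketch below assumes that the query returns a single index as a length-1 array and uses zero-filled arrays of the sizes mentioned in the post. It only demonstrates the reshaping and is not a confirmed fix for this particular program.

```python
import numpy as np

X = np.zeros((13084, 50))
y = np.zeros((13084, 1))
query_idx = np.array([42])              # assumed: one index returned as a length-1 array

wrong_X = X[query_idx].reshape(-1, 1)   # (50, 1): fifty "samples" with one feature each
right_X = X[query_idx].reshape(1, -1)   # (1, 50): one sample with fifty features
right_y = y[query_idx].ravel()          # (1,): one label, 1-D like y_training.ravel()
```

With reshape(-1, 1) the single 50-dimensional vector is turned into 50 one-feature samples, which is where the [50, 1] mismatch in the error message comes from. The tutorial uses that reshape because its data has only one feature.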
opened by berserker1 6
Which sampling method is best for very unbalanced data?

Hi!

I am wondering, which of the implemented sampling strategies handles unbalanced data best? I believe if I get the top 10000 uncertain data instances, but 99 % are in the same class, this would not help much for the next training process iteration, right?

Thank you in advance!

opened by vandreslime 0
Can I use modAL with estimators from other libraries than scikit-learn like xgboost?

Hi there,

I have already trained some good working estimators (xgboost, catboost & lightgbm). I would like to add an active learner, because we need to decide which data to label continuously.

The documentation says, that I need to use a scikit-learn estimator object. Does that mean I can't use the models from xgboost, catboost & lightgbm? I used the models from the libraries with the same names.

And another question (for my understanding). Do I give an estimator that is already trained, or does the active learner train a model from scratch?

I am new to the field of active learning, so thank you very much!

opened by vandreslime 0

Proof of concept for allowing non-sklearn estimators

Not sure if there is any desire for this feature, but in this PR I have sketched out a way to use virtually any estimator type with the ActiveLearner and BayesianOptimizer classes.

Motivation

Allow us to use other training and inference facilities, such as HuggingFace models that are trained using the Trainer class, use AWS SageMaker Estimators, etc. With this added flexibility, the training and inference does not need to even run on the same hardware as the modAL code. This brings the suite of sampling methods here to many new applications, particularly resource-intensive deep learning models that typically don't fit that great under the sklearn interface.

Implementation

Rather than call the classic sklearn estimator functions such as fit, predict, predict_proba, and score, this PR adds a layer of callables that can be overridden: fit_func, predict_func, predict_proba_func, and score_func.

    def __init__(self,
                 estimator: BaseEstimator,
                 query_strategy: Callable = uncertainty_sampling,
                 X_training: Optional[modALinput] = None,
                 y_training: Optional[modALinput] = None,
                 bootstrap_init: bool = False,
                 on_transformed: bool = False,
                 force_all_finite: bool = True,
                 fit_func: FitFunction = SKLearnFitFunction(),
                 predict_func: PredictFunction = SKLearnPredictFunction(),
                 predict_proba_func: PredictProbaFunction = SKLearnPredictProbaFunction(),
                 score_func: ScoreFunction = SKLearnScoreFunction(),
                 **fit_kwargs
                 ) -> None:

I added SKLearn implementations of each by default (included their corresponding Protocol classes as well). Here's how fit works:

class FitFunction(Protocol):
    def __call__(self, estimator: GenericEstimator, X, y, **kwargs) -> GenericEstimator:
        raise NotImplementedError
# ...
class SKLearnFitFunction(FitFunction):
    def __call__(self, estimator: BaseEstimator, X, y, **kwargs) -> BaseEstimator:
        return estimator.fit(X=X, y=y, **kwargs)

I'll also note that the changes in this PR don't break any of the existing tests.
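The idea of delegating training to an overridable callable can be tried out without modAL or scikit-learn. The following self-contained sketch uses only the standard library. MeanEstimator, fit_mean and TinyLearner are hypothetical names invented for this illustration and do not appear in the PR.

```python
from typing import Any, Protocol, Sequence


class FitFunction(Protocol):
    """Anything callable as fit_func(estimator, X, y) that returns the estimator."""
    def __call__(self, estimator: Any, X: Sequence[float], y: Sequence[float]) -> Any: ...


class MeanEstimator:
    """Hypothetical toy model that predicts the mean of the labels it has seen."""
    def __init__(self) -> None:
        self.mean_ = 0.0


def fit_mean(estimator: MeanEstimator, X: Sequence[float], y: Sequence[float]) -> MeanEstimator:
    # Custom training logic lives here rather than in an estimator.fit method.
    estimator.mean_ = sum(y) / len(y)
    return estimator


class TinyLearner:
    """Hypothetical stand-in for a learner that delegates training to fit_func."""
    def __init__(self, estimator: Any, fit_func: FitFunction) -> None:
        self.estimator = estimator
        self.fit_func = fit_func

    def teach(self, X: Sequence[float], y: Sequence[float]) -> None:
        self.estimator = self.fit_func(self.estimator, X, y)


learner = TinyLearner(MeanEstimator(), fit_func=fit_mean)
learner.teach([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Because FitFunction is a Protocol, a plain function with the right signature is accepted just as well as a class with __call__, so the estimator itself never needs a fit method.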

Usage

When using SageMaker, we might implement fit and predict_proba in this manner:

class CustomEstimator:
    hf_predictor: Union[HuggingFacePredictor, Predictor]
    hf_estimator: HuggingFace

    def __init__(self, hf_predictor: HuggingFacePredictor, hf_estimator: HuggingFace):
        self.hf_predictor = hf_predictor
        self.hf_estimator = hf_estimator

class CustomFitFunction(FitFunction):
    def __call__(self, estimator: CustomEstimator, X, y, **kwargs) -> CustomEstimator:
        # notice we don't use `y` -- the label is baked into the HuggingFace Dataset
        return estimator.hf_estimator.fit(X=X, **kwargs)

class CustomPredictProbaFunction(PredictProbaFunction):
    @staticmethod
    def hf_prediction_to_proba(predictions: Union[List[Dict], object],
                               positive_class_label: str = 'LABEL_1',
                               negative_class_label: str = 'LABEL_0') -> np.array:
        label_key: str = 'label'
        score_key: str = 'score'
        p = []
        for prediction in predictions:
            if positive_class_label == prediction[label_key]:
                score = prediction[score_key]
                p.append([score, 1.0 - score])
            if negative_class_label == prediction[label_key]:
                score = prediction[score_key]
                p.append([1.0 - score, score])
        return np.array(p)

    def __call__(self, estimator: CustomEstimator, X, **kwargs) -> np.array:
        return self.hf_prediction_to_proba(
            predictions=estimator.hf_predictor.predict(dict(inputs=X))
        )

estimator = CustomEstimator(hf_predictor=hf_predictor, hf_estimator=hf_estimator)

learner = ActiveLearner(
    estimator=estimator,
    fit_func=CustomFitFunction(),
    predict_proba_func=CustomPredictProbaFunction(),
    X_training=train_dataset # standard HuggingFace Dataset instead of your typical types for `X` in `sklearn`
)

If you've made it this far, I'd ask that you forgive the clunkiness. This was a rough sketch of an idea I wanted to get written down before I forgot it. Anyways, would love some feedback, and if you think this PR is worth finishing, let me know. I can say for me, this would unlock a lot of really useful applications.

opened by adelevie 2

TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid

trying to run this notebook https://www.kaggle.com/code/kmader/active-learning-optimization-improvement/notebook

getting an error in learner.teach step (and also in pd.concat(seq_iter.compute()) step):

# initializing the learner
from modAL.models import ActiveLearner
initial_df = all_papaya_samples_df.sample(20, random_state=2018)
learner = ActiveLearner(
    estimator=SVC(kernel = 'rbf', probability=True, random_state = 2018),
    X_training=initial_df[['firmness', 'redness']], 
    y_training=initial_df['tastiness']
)
# query for labels
X_pool = all_papaya_samples_df[['firmness', 'redness']].values
y_pool = all_papaya_samples_df['tastiness'].values
query_idx, query_inst = learner.query(X_pool)
query_idx, query_inst
fig, m_axs = plt.subplots(2, 3, figsize = (12, 12))
last_pts = initial_df.shape[0]
queried_pts = []

for c_ax, c_pts in zip(m_axs.flatten(), np.linspace(20, 350, 6).astype(int)):
    for _ in range(c_pts-last_pts):
        query_idx, _ = learner.query(X_pool)
        queried_pts += [query_idx]
        learner.teach(X_pool[query_idx], y_pool[query_idx])
    last_pts = c_pts
    fit_and_show_model(learner, 
                       None, 
                       title_str = 'Sampled: {}'.format(c_pts),
                       ax = c_ax,
                       fit_model = False
                      )

TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_28372/2173050794.py in <module>
      6         query_idx, _ = learner.query(X_pool)
      7         queried_pts += [query_idx]
----> 8         learner.teach(X_pool[query_idx], y_pool[query_idx])
      9     last_pts = c_pts
     10     fit_and_show_model(learner, 

/opt/conda/lib/python3.7/site-packages/modAL/models/learners.py in teach(self, X, y, bootstrap, only_new, **fit_kwargs)
     96             **fit_kwargs: Keyword arguments to be passed to the fit method of the predictor.
     97         """
---> 98         self._add_training_data(X, y)
     99         if not only_new:
    100             self._fit_to_known(bootstrap=bootstrap, **fit_kwargs)

/opt/conda/lib/python3.7/site-packages/modAL/models/base.py in _add_training_data(self, X, y)
     94         else:
     95             try:
---> 96                 self.X_training = data_vstack((self.X_training, X))
     97                 self.y_training = data_vstack((self.y_training, y))
     98             except ValueError:

/opt/conda/lib/python3.7/site-packages/modAL/utils/data.py in data_vstack(blocks)
     22         return sp.vstack(blocks)
     23     elif isinstance(blocks[0], pd.DataFrame):
---> 24         return blocks[0].append(blocks[1:])
     25     elif isinstance(blocks[0], np.ndarray):
     26         return np.concatenate(blocks)

/opt/conda/lib/python3.7/site-packages/pandas/core/frame.py in append(self, other, ignore_index, verify_integrity, sort)
   8967                 ignore_index=ignore_index,
   8968                 verify_integrity=verify_integrity,
-> 8969                 sort=sort,
   8970             )
   8971         ).__finalize__(self, method="append")

/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

/opt/conda/lib/python3.7/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    302         verify_integrity=verify_integrity,
    303         copy=copy,
--> 304         sort=sort,
    305     )
    306 

/opt/conda/lib/python3.7/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    382                     "only Series and DataFrame objs are valid"
    383                 )
--> 384                 raise TypeError(msg)
    385 
    386             ndims.add(obj.ndim)

TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid

opened by akamil-etsy 0

AttributeError: bootstrap_init

I am trying to apply the package for sklearn RandomForestClassifier like this:

learner= ActiveLearner(
estimator=RandomForestClassifier(),
query_strategy=modAL.uncertainty.uncertainty_sampling,
X_training=X_train0, y_training=y_train
)

learner

Then the following error appears:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
File ~/tensorflow-test/env/lib/python3.8/site-packages/IPython/core/formatters.py:973, in MimeBundleFormatter.__call__(self, obj, include, exclude)
    970     method = get_real_method(obj, self.print_method)
    972     if method is not None:
--> 973         return method(include=include, exclude=exclude)
    974     return None
    975 else:

File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/base.py:614, in BaseEstimator._repr_mimebundle_(self, **kwargs)
    612 def _repr_mimebundle_(self, **kwargs):
    613     """Mime bundle used by jupyter kernels to display estimator"""
--> 614     output = {"text/plain": repr(self)}
    615     if get_config()["display"] == "diagram":
    616         output["text/html"] = estimator_html_repr(self)

File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/base.py:279, in BaseEstimator.__repr__(self, N_CHAR_MAX)
    271 # use ellipsis for sequences with a lot of elements
    272 pp = _EstimatorPrettyPrinter(
    273     compact=True,
    274     indent=1,
    275     indent_at_name=True,
    276     n_max_elements_to_show=N_MAX_ELEMENTS_TO_SHOW,
    277 )
--> 279 repr_ = pp.pformat(self)
    281 # Use bruteforce ellipsis when there are a lot of non-blank characters
    282 n_nonblank = len("".join(repr_.split()))

File ~/tensorflow-test/env/lib/python3.8/pprint.py:153, in PrettyPrinter.pformat(self, object)
    151 def pformat(self, object):
    152     sio = _StringIO()
--> 153     self._format(object, sio, 0, 0, {}, 0)
    154     return sio.getvalue()

File ~/tensorflow-test/env/lib/python3.8/pprint.py:170, in PrettyPrinter._format(self, object, stream, indent, allowance, context, level)
    168     self._readable = False
    169     return
--> 170 rep = self._repr(object, context, level)
    171 max_width = self._width - indent - allowance
    172 if len(rep) > max_width:

File ~/tensorflow-test/env/lib/python3.8/pprint.py:404, in PrettyPrinter._repr(self, object, context, level)
    403 def _repr(self, object, context, level):
--> 404     repr, readable, recursive = self.format(object, context.copy(),
    405                                             self._depth, level)
    406     if not readable:
    407         self._readable = False

File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/utils/_pprint.py:189, in _EstimatorPrettyPrinter.format(self, object, context, maxlevels, level)
    188 def format(self, object, context, maxlevels, level):
--> 189     return _safe_repr(
    190         object, context, maxlevels, level, changed_only=self._changed_only
    191     )

File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/utils/_pprint.py:440, in _safe_repr(object, context, maxlevels, level, changed_only)
    438 recursive = False
    439 if changed_only:
--> 440     params = _changed_params(object)
    441 else:
    442     params = object.get_params(deep=False)

File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/utils/_pprint.py:93, in _changed_params(estimator)
     89 def _changed_params(estimator):
     90     """Return dict (param_name: value) of parameters that were given to
     91     estimator with non-default values."""
---> 93     params = estimator.get_params(deep=False)
     94     init_func = getattr(estimator.__init__, "deprecated_original", estimator.__init__)
     95     init_params = inspect.signature(init_func).parameters

File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/base.py:210, in BaseEstimator.get_params(self, deep)
    208 out = dict()
    209 for key in self._get_param_names():
--> 210     value = getattr(self, key)
    211     if deep and hasattr(value, "get_params"):
    212         deep_items = value.get_params().items()

AttributeError: 'ActiveLearner' object has no attribute 'bootstrap_init'

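The traceback above ends in get_params, which calls getattr once for every parameter name. The standard-library sketch below imitates that lookup to show how a constructor argument that is accepted but never stored produces this kind of error. The get_params helper and both classes are simplified, hypothetical stand-ins, and this is only one plausible reading of the traceback, not a confirmed diagnosis.

```python
import inspect


def get_params(obj):
    """Simplified imitation of the lookup: one getattr per constructor argument name."""
    names = [
        name
        for name, p in inspect.signature(type(obj).__init__).parameters.items()
        if name != "self" and p.kind is not inspect.Parameter.VAR_KEYWORD
    ]
    return {name: getattr(obj, name) for name in names}


class ForgetfulLearner:
    """Hypothetical class: accepts bootstrap_init but never stores it."""
    def __init__(self, estimator, bootstrap_init=False, **fit_kwargs):
        self.estimator = estimator


class TidyLearner:
    """Same constructor, but every argument is kept under its own name."""
    def __init__(self, estimator, bootstrap_init=False, **fit_kwargs):
        self.estimator = estimator
        self.bootstrap_init = bootstrap_init


try:
    get_params(ForgetfulLearner("model"))
    error = None
except AttributeError as exc:
    error = str(exc)

params = get_params(TidyLearner("model"))
```

Note that the error is only raised when the object is displayed, for example by leaving it as the last expression of a notebook cell; constructing it succeeds.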
AttributeError                            Traceback (most recent call last)
File ~/tensorflow-test/env/lib/python3.8/site-packages/IPython/core/formatters.py:707, in PlainTextFormatter.__call__(self, obj)
    700 stream = StringIO()
    701 printer = pretty.RepresentationPrinter(stream, self.verbose,
    702     self.max_width, self.newline,
    703     max_seq_length=self.max_seq_length,
    704     singleton_pprinters=self.singleton_printers,
    705     type_pprinters=self.type_printers,
    706     deferred_pprinters=self.deferred_printers)
--> 707 printer.pretty(obj)
    708 printer.flush()
    709 return stream.getvalue()

File ~/tensorflow-test/env/lib/python3.8/site-packages/IPython/lib/pretty.py:410, in RepresentationPrinter.pretty(self, obj)
    407                         return meth(obj, self, cycle)
    408                 if cls is not object \
    409                         and callable(cls.__dict__.get('__repr__')):
--> 410                     return _repr_pprint(obj, self, cycle)
    412     return _default_pprint(obj, self, cycle)
    413 finally:

File ~/tensorflow-test/env/lib/python3.8/site-packages/IPython/lib/pretty.py:778, in _repr_pprint(obj, p, cycle)
    776 """A pprint that just redirects to the normal repr function."""
    777 # Find newlines and replace them with p.break_()
--> 778 output = repr(obj)
    779 lines = output.splitlines()
    780 with p.group():

File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/base.py:279, in BaseEstimator.__repr__(self, N_CHAR_MAX)
    271 # use ellipsis for sequences with a lot of elements
    272 pp = _EstimatorPrettyPrinter(
    273     compact=True,
    274     indent=1,
    275     indent_at_name=True,
    276     n_max_elements_to_show=N_MAX_ELEMENTS_TO_SHOW,
    277 )
--> 279 repr_ = pp.pformat(self)
    281 # Use bruteforce ellipsis when there are a lot of non-blank characters
    282 n_nonblank = len("".join(repr_.split()))

File ~/tensorflow-test/env/lib/python3.8/pprint.py:153, in PrettyPrinter.pformat(self, object)
    151 def pformat(self, object):
    152     sio = _StringIO()
--> 153     self._format(object, sio, 0, 0, {}, 0)
    154     return sio.getvalue()

File ~/tensorflow-test/env/lib/python3.8/pprint.py:170, in PrettyPrinter._format(self, object, stream, indent, allowance, context, level)
    168     self._readable = False
    169     return
--> 170 rep = self._repr(object, context, level)
    171 max_width = self._width - indent - allowance
    172 if len(rep) > max_width:

File ~/tensorflow-test/env/lib/python3.8/pprint.py:404, in PrettyPrinter._repr(self, object, context, level)
    403 def _repr(self, object, context, level):
--> 404     repr, readable, recursive = self.format(object, context.copy(),
    405                                             self._depth, level)
    406     if not readable:
    407         self._readable = False

File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/utils/_pprint.py:189, in _EstimatorPrettyPrinter.format(self, object, context, maxlevels, level)
    188 def format(self, object, context, maxlevels, level):
--> 189     return _safe_repr(
    190         object, context, maxlevels, level, changed_only=self._changed_only
    191     )

File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/utils/_pprint.py:440, in _safe_repr(object, context, maxlevels, level, changed_only)
    438 recursive = False
    439 if changed_only:
--> 440     params = _changed_params(object)
    441 else:
    442     params = object.get_params(deep=False)

File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/utils/_pprint.py:93, in _changed_params(estimator)
     89 def _changed_params(estimator):
     90     """Return dict (param_name: value) of parameters that were given to
     91     estimator with non-default values."""
---> 93     params = estimator.get_params(deep=False)
     94     init_func = getattr(estimator.__init__, "deprecated_original", estimator.__init__)
     95     init_params = inspect.signature(init_func).parameters

File ~/tensorflow-test/env/lib/python3.8/site-packages/sklearn/base.py:210, in BaseEstimator.get_params(self, deep)
    208 out = dict()
    209 for key in self._get_param_names():
--> 210     value = getattr(self, key)
    211     if deep and hasattr(value, "get_params"):
    212         deep_items = value.get_params().items()

AttributeError: 'ActiveLearner' object has no attribute 'bootstrap_init'

I have to run it with Python 3.8 because I am using TensorFlow on a Mac with the M1 chip, which still has some dependency issues. Otherwise, nothing differs from the usual way I feed in the RF model (the data formats are correct). Any idea why it is calling this attribute?
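The last frames of the traceback show the pattern involved: get_params collects parameter names via _get_param_names() and then calls getattr(self, key) for each of them. The following standard-library sketch mimics that pattern to show how a constructor argument that is never stored as an attribute produces this kind of error. MiniEstimator, ToyLearner and FixedToyLearner are hypothetical stand-ins, not scikit-learn or modAL code, and whether ActiveLearner behaves exactly like ToyLearner in the installed version is an assumption.

```python
import inspect


class MiniEstimator:
    """Hypothetical stand-in that mimics the get_params pattern in the traceback."""

    @classmethod
    def _get_param_names(cls):
        # Parameter names come from the __init__ signature, skipping `self`.
        signature = inspect.signature(cls.__init__)
        return sorted(name for name in signature.parameters if name != "self")

    def get_params(self):
        # Every constructor parameter is expected to exist as an attribute.
        return {name: getattr(self, name) for name in self._get_param_names()}


class ToyLearner(MiniEstimator):
    def __init__(self, estimator=None, bootstrap_init=False):
        self.estimator = estimator
        # bootstrap_init is accepted but never stored on the instance.


class FixedToyLearner(MiniEstimator):
    def __init__(self, estimator=None, bootstrap_init=False):
        self.estimator = estimator
        self.bootstrap_init = bootstrap_init  # stored, so get_params can find it


try:
    ToyLearner().get_params()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

fixed_params = FixedToyLearner().get_params()
```

The traceback also shows that the error is raised while the notebook builds the repr of the object for display (PlainTextFormatter calling repr), not while the learner is fitted or queried. Assigning the learner to a variable instead of leaving it as the last expression of the cell may therefore avoid the message; treat this as something to try, not a confirmed fix.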

opened by luisignaciomenendez 1

decision_function instead of predict_proba

Several non-probabilistic estimators, SVMs in particular, can be used with uncertainty sampling. Scikit-learn estimators that support the decision_function method can be used with the closest-to-hyperplane selection algorithm [Bloodgood]. This is a very popular strategy in AL research and would be very easy to implement.
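A minimal sketch of what closest-to-hyperplane selection could look like for the binary case, assuming the estimator's decision_function returns one signed score per sample. The names closest_to_hyperplane and StubLinearModel are hypothetical and only serve the illustration; this is not part of modAL's API. The sketch returns indices only.

```python
class StubLinearModel:
    """Hypothetical stand-in exposing decision_function like a binary linear SVM."""

    def __init__(self, weights, bias=0.0):
        self.weights = weights
        self.bias = bias

    def decision_function(self, X):
        # Signed score of each row; zero means the point lies on the hyperplane.
        return [
            sum(w * x for w, x in zip(self.weights, row)) + self.bias
            for row in X
        ]


def closest_to_hyperplane(estimator, X_pool, n_instances=1):
    """Return the indices of the pool instances with the smallest |score|."""
    scores = estimator.decision_function(X_pool)
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]))
    return order[:n_instances]


model = StubLinearModel(weights=[1.0, -1.0])
X_pool = [[3.0, 0.0], [1.0, 0.75], [0.0, 2.0], [0.5, 0.0]]
query_idx = closest_to_hyperplane(model, X_pool, n_instances=2)
```

For multiclass estimators decision_function returns several scores per sample, so a different reduction (for example the gap between the two largest scores) would be needed; that case is left out here.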

opened by lkurlandski 5

Releases(0.4.1)

0.4.1 (Jan 7, 2021)
Release notes

This release includes a fix for a new feature added in 0.4.0.

Fixes

#108: if the data transformation is learned, the transformed data cannot be stored and needs to be re-calculated every time a query is done. This was fixed by @BoyanH in #113.

0.4.0 (Nov 1, 2020)
Release notes

modAL 0.4.0 is finally here! This new release is made possible by the contributions of @BoyanH, @damienlancry, and @OskarLiew, many thanks to them!

New features

pandas.DataFrame support, thanks to @BoyanH! This was a frequently requested feature which I was unable to properly implement, but @BoyanH has found a solution for this in #105.

Support for scikit-learn pipelines, also by @BoyanH. Now learners support querying on the transformed data by setting on_transformed=True upon initialization.

Changes

Query strategies should no longer return the selected instances, only the indices for the queried objects. (See #104 by @BoyanH.)

Fixes

Committee now sets classes when fitting, which solves the error that occurred when no training data was provided during initialization. This fix was contributed in #100 by @OskarLiew, thanks for that!

Some typos in the ranked batch mode sampling example, fixed by @damienlancry.

0.3.6 (Aug 21, 2020)
Fixes

Updating of known classes for Committee.teach() (#63)

0.3.5 (Nov 11, 2019)
Changes

ActiveLearner now supports np.nan and np.inf in the data by setting force_all_finite=False upon initialization. #58

Bayesian optimization fixed for multidimensional functions.

Calls to check_X_y no longer convert between datatypes. #49

Expected error reduction implementation error fixed. #45

modAL.utils.data_vstack now falls back to numpy.concatenate if possible.

Multidimensional data for ranked batch sampling and expected error reduction fixed. #41

Fixes by @zhangyu94:

modAL.selection.shuffled_argmax #32

Cold start instance in modAL.batch.ranked_batch fixed. #30

Best instance index in modAL.batch.select_instance fixed. #29

0.3.4 (Dec 5, 2018)
New features

To handle the case when the maximum utility score is not unique, a random tie break option was introduced. From this version, passing random_tie_break=True to the query strategies first shuffles the pool and then uses a stable sort to find the instances to query. When the maximum utility score is not unique, this is equivalent to randomly sampling from the top-scoring instances.
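The shuffle-then-stable-sort idea can be sketched with the standard library alone. The function name argmax_with_random_tie_break and its rng parameter are hypothetical and chosen for this illustration; modAL's own implementation works on NumPy arrays and may differ in detail.

```python
import random


def argmax_with_random_tie_break(utilities, n_instances=1, rng=None):
    """Return indices of the n highest utilities, breaking ties at random."""
    rng = rng or random.Random()
    indices = list(range(len(utilities)))
    # Step 1: shuffle, so that tied instances end up in a random relative order.
    rng.shuffle(indices)
    # Step 2: list.sort is stable (also with reverse=True), so instances with
    # equal utility keep the random order they received in step 1.
    indices.sort(key=lambda i: utilities[i], reverse=True)
    return indices[:n_instances]


utilities = [0.9, 0.2, 0.9, 0.9, 0.1]
picked = argmax_with_random_tie_break(utilities, n_instances=2, rng=random.Random(0))
```

With three instances tied at the top score, the two returned indices are a random pair drawn from positions 0, 2 and 3; without the shuffle, a plain stable sort would always return the first two.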

Changes

modAL.expected_error.expected_error_reduction runtime improved by omitting unnecessary cloning of the estimator for every instance in the pool.

0.3.3 (Nov 30, 2018)

New features

In this small release, the expected error and log loss reduction algorithms (Roy and McCallum, 2001) were added.
0.3.2 (Nov 26, 2018)
New features

In this release, the focus was on multilabel active learning strategies. The following algorithms were added:

SVM binary minimum (Brinker)

max loss, mean max loss (Li et al.)

MinConfidence, MeanConfidence, MinScore, MeanScore (Esuli and Sebastiani)

0.3.1 (Oct 2, 2018)
Release notes

The new release of modAL is here! This is a milestone in its evolution, because it has just received its first contributions from the open source community! :) Thanks to @dataframing and @nikolay-bushkov for their work! Hoping to see many more contributions from the community, because modAL still has a long way to go! :)

New features

Ranked batch mode queries by @dataframing. With this query strategy, several instances can be queried for labeling, which alleviates a lot of problems in uncertainty sampling. For details, see Ranked batch mode learning by Cardoso et al.

Sparse matrix support by @nikolay-bushkov. From now on, if the estimator can handle sparse matrices, you can use them to fit the active learning models!

Cold start support has been added to all the models. This means that now learner.query() can be used without training the model first.

Changes

The documentation has undergone a major refactoring thanks to @nikolay-bushkov! Type annotations have been added and the docstrings were refactored to follow Google style docstrings. The website has been changed accordingly. Instead of GitHub pages, ReadTheDocs is used and the old website is merged with the API reference. Regarding the examples, Jupyter notebooks were added by @dataframing. For details, check it out at https://modAL-python.github.io/!

.query() methods changed for BaseLearner and BaseCommittee to allow more general arguments for query strategies. Now they can accept any argument as long as the query_strategy function supports it.

.score() method was added for Committee. Fixes #6.

The modAL.density module was refactored using functions from sklearn.metrics.pairwise. This resulted in a major increase in performance as well as a more sustainable codebase for the module.

Bugfixes

1D array handling issues fixed, numpy.vstack calls replaced with numpy.concatenate. Fixes #15.

np.sum(generator) calls were replaced with np.sum(np.fromiter(generator, dtype)) because calling np.sum on a generator is deprecated.

0.3.0 (Apr 25, 2018)
Release notes

New features

Bayesian optimization. Bayesian optimization is a method for optimizing black box functions for which evaluation may be expensive and derivatives may not be available. It uses a query loop very similar to active learning, which makes it possible to implement it using an API identical to the ActiveLearner. Values are sampled by strategies that estimate the possible gain at each point. Three of these strategies are currently implemented: probability of improvement, expected improvement and upper confidence bounds.
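The three strategies can be written down compactly in their common textbook form for maximization, given the surrogate model's predicted mean and standard deviation at a candidate point. This is a sketch under that assumption: the function names and the tradeoff and beta parameters are chosen for this illustration, and modAL's own implementations, which operate on arrays, may differ.

```python
import math


def _norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mean, std, best, tradeoff=0.0):
    """P(f(x) > best + tradeoff) under a normal predictive distribution."""
    gain = mean - best - tradeoff
    if std == 0.0:
        return 1.0 if gain > 0.0 else 0.0
    return _norm_cdf(gain / std)


def expected_improvement(mean, std, best, tradeoff=0.0):
    """E[max(f(x) - best - tradeoff, 0)] under a normal predictive distribution."""
    gain = mean - best - tradeoff
    if std == 0.0:
        return max(gain, 0.0)
    z = gain / std
    return gain * _norm_cdf(z) + std * _norm_pdf(z)


def upper_confidence_bound(mean, std, beta=1.0):
    """Optimistic estimate: predicted mean plus beta standard deviations."""
    return mean + beta * std


# Candidate points as (predicted mean, predicted std); best observed value so far.
candidates = [(0.8, 0.05), (0.6, 0.5), (0.9, 0.0)]
best_so_far = 0.9
ei_scores = [expected_improvement(m, s, best_so_far) for m, s in candidates]
next_idx = max(range(len(candidates)), key=lambda i: ei_scores[i])
```

In this toy example the second candidate is selected: its mean is the lowest, but its large uncertainty gives it the highest expected improvement, which is the exploration behaviour these strategies are meant to provide.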

Changes

modAL.models.BaseLearner abstract base class implemented. ActiveLearner and BayesianOptimizer both inherit from it.

modAL.models.ActiveLearner.query() now passes the ActiveLearner object to the query function instead of just the estimator.

Fixes

modAL.utils.selection.multi_argmax() now works for arrays with shape (-1, ) as well as (-1, 1).

0.2.1 (Apr 18, 2018)
Release notes

New features

modAL.utils.combination.make_query_strategy function factory to make the implementation of custom query strategies easier.

ActiveLearner and Committee models can be fitted using new data only by passing only_new=True to their .teach() methods. This is useful when working with models where the fitting does not occur from scratch, for instance tensorflow or keras models.

Fixes

Checks added to modAL.utils.selection.weighted_random() to avoid division by zero.

ABC metaclassing now compatible with earlier Python versions (i.e. Python 2.7). Fixes #3.

sklearn.utils.check_array calls removed from modAL.models; performing checks is now up to the estimator. As a consequence, images don't need to be flattened. Fixes #5.

BaseCommittee now inherits from sklearn.base.BaseEstimator.

modAL.utils.combination.make_linear_combination rewritten using genexps, resulting in performance increase.

0.2.0 (Feb 10, 2018)
Release notes

New features

Information density measures. With the information_density function in modAL.density, density-based information metrics can be employed.

Functions for making new utility measures by linear combinations and products. With the function factories in modAL.utils.combination, functions can be transformed into their linear combination and product.
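The idea of a function factory that turns several utility measures into their linear combination or product can be sketched as follows. The names linear_combination and product, the weights and exponents parameters, and the two toy measures are hypothetical and work on plain Python lists; they illustrate the concept and are not modAL's actual signatures.

```python
def linear_combination(*functions, weights=None):
    """Return a function computing sum_i weights[i] * functions[i](...)."""
    if weights is None:
        weights = [1.0] * len(functions)
    if len(weights) != len(functions):
        raise ValueError("one weight per function is required")

    def combined(*args, **kwargs):
        return sum(w * f(*args, **kwargs) for w, f in zip(weights, functions))

    return combined


def product(*functions, exponents=None):
    """Return a function computing prod_i functions[i](...) ** exponents[i]."""
    if exponents is None:
        exponents = [1.0] * len(functions)
    if len(exponents) != len(functions):
        raise ValueError("one exponent per function is required")

    def combined(*args, **kwargs):
        result = 1.0
        for e, f in zip(exponents, functions):
            result *= f(*args, **kwargs) ** e
        return result

    return combined


# Two toy utility measures on a single vector of class probabilities.
def uncertainty(proba):
    return 1.0 - max(proba)


def margin_closeness(proba):
    top, second = sorted(proba, reverse=True)[:2]
    return 1.0 - (top - second)


proba = [0.5, 0.25, 0.25]
summed = linear_combination(uncertainty, margin_closeness, weights=[1.0, 2.0])(proba)
multiplied = product(uncertainty, margin_closeness, exponents=[1, 2])(proba)
```

Because the factories return ordinary callables with the same arguments as their inputs, the combined measure can be used wherever a single measure was expected.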

Changes

ActiveLearner constructor arguments renamed: predictor was renamed to estimator, and X_initial and y_initial were renamed to X_training and y_training.

ActiveLearner, Committee and CommitteeRegressor now also inherit from sklearn.base.BaseEstimator. Because of this, the get_params() and set_params() methods, for instance, can be used.

The private attributes of ActiveLearner, Committee and CommitteeRegressor are now exposed as public attributes.

As a result of the previous change, the classes can now be cloned with sklearn.base.clone.

0.1.0 (Jan 8, 2018)
modAL 0.1.0

Modular Active Learning framework for Python3

Release notes

modAL is finally released! For its capabilities and documentation, see the page https://cosmic-cortex.github.io/modAL/!

Installation

modAL requires

Python >= 3.5

NumPy >= 1.13

SciPy >= 0.18

scikit-learn >= 0.18

You can install modAL directly with pip:

pip install modAL

Alternatively, you can install modAL directly from source:

pip install git+https://github.com/cosmic-cortex/modAL.git


Owner

modAL

Optuna is an automatic hyperparameter optimization software framework, particularly designed for machine learning

ThunderGBM: Fast GBDTs and Random Forests on GPUs

ETNA – time series forecasting framework

This jupyter notebook project was completed by me and my friend using the dataset from Kaggle

BigDL: Distributed Deep Learning Framework for Apache Spark

Deepchecks is a Python package for comprehensively validating your machine learning models and data with minimal effort

Simple linear model implementations from scratch.

Simulate & classify transient absorption spectroscopy (TAS) spectral features for bulk semiconducting materials (Post-DFT)

Predict the output which should give a fair idea about the chances of admission for a student for a particular university

Python package for stacking (machine learning technique)

Probabilistic programming framework that facilitates objective model selection for time-varying parameter models.

Random Forest Classification for Neural Subtypes

Primitives for machine learning and data science.