quapy.classification package

Submodules

quapy.classification.calibration module

class quapy.classification.calibration.BCTSCalibration(classifier, val_split=5, n_jobs=None, verbose=False)[source]

Bases: RecalibratedProbabilisticClassifierBase

Applies the Bias-Corrected Temperature Scaling (BCTS) calibration method from abstention.calibration, as defined in the paper by Alexandari et al. (2020).

Parameters:
  • classifier – a scikit-learn probabilistic classifier

  • val_split – indicate an integer k for performing kFCV to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained in a stratified validation split containing a fraction p of the training instances (the rest is used for training). In the kFCV case, the classifier is retrained on the whole training set afterwards; in the validation-split case it is not. Default value is 5.

  • n_jobs – indicate the number of parallel workers (only when val_split is an integer)

  • verbose – whether or not to display information in the standard output
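
A minimal usage sketch (the synthetic dataset and the logistic regressor are illustrative assumptions; any scikit-learn probabilistic classifier can take their place):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from quapy.classification.calibration import BCTSCalibration

    # illustrative synthetic data; any (X, y) pair would do
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # recalibrate a logistic regressor using 5-fold cross-validation posteriors
    calibrated = BCTSCalibration(LogisticRegression(), val_split=5)
    calibrated.fit(X, y)
    posteriors = calibrated.predict_proba(X)  # shape (n_samples, n_classes)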

class quapy.classification.calibration.NBVSCalibration(classifier, val_split=5, n_jobs=None, verbose=False)[source]

Bases: RecalibratedProbabilisticClassifierBase

Applies the No-Bias Vector Scaling (NBVS) calibration method from abstention.calibration, as defined in the paper by Alexandari et al. (2020).

Parameters:
  • classifier – a scikit-learn probabilistic classifier

  • val_split – indicate an integer k for performing kFCV to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained in a stratified validation split containing a fraction p of the training instances (the rest is used for training). In the kFCV case, the classifier is retrained on the whole training set afterwards; in the validation-split case it is not. Default value is 5.

  • n_jobs – indicate the number of parallel workers (only when val_split is an integer)

  • verbose – whether or not to display information in the standard output

class quapy.classification.calibration.RecalibratedProbabilisticClassifier[source]

Bases: object

Abstract class for the (re)calibration methods from abstention.calibration, as defined in Alexandari, A., Kundaje, A., & Shrikumar, A. (2020, November). Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation. In International Conference on Machine Learning (pp. 222-232). PMLR.

class quapy.classification.calibration.RecalibratedProbabilisticClassifierBase(classifier, calibrator, val_split=5, n_jobs=None, verbose=False)[source]

Bases: BaseEstimator, RecalibratedProbabilisticClassifier

Applies a (re)calibration method from abstention.calibration, as defined in the paper by Alexandari et al. (2020).

Parameters:
  • classifier – a scikit-learn probabilistic classifier

  • calibrator – the calibration object (an instance of abstention.calibration.CalibratorFactory)

  • val_split – indicate an integer k for performing kFCV to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained in a stratified validation split containing a fraction p of the training instances (the rest is used for training). In the kFCV case, the classifier is retrained on the whole training set afterwards (see fit_cv); in the validation-split case it is not (see fit_tr_val). Default value is 5.

  • n_jobs – indicate the number of parallel workers (only when val_split is an integer); default=None

  • verbose – whether or not to display information in the standard output
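
The following sketch shows how the base class composes a classifier with a calibrator; it assumes the abstention package is installed and that TempScaling is one of its calibrator factories:

    from abstention.calibration import TempScaling
    from sklearn.linear_model import LogisticRegression
    from quapy.classification.calibration import RecalibratedProbabilisticClassifierBase

    recalibrated = RecalibratedProbabilisticClassifierBase(
        classifier=LogisticRegression(),
        calibrator=TempScaling(),  # any abstention calibrator factory works here
        val_split=5)               # posteriors from 5-fold cross-validation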

property classes_

Returns the classes on which the classifier has been trained

Returns:

array-like of shape (n_classes,)

fit(X, y)[source]

Fits the calibration for the probabilistic classifier.

Parameters:
  • X – array-like of shape (n_samples, n_features) with the data instances

  • y – array-like of shape (n_samples,) with the class labels

Returns:

self

fit_cv(X, y)[source]

Fits the calibration in a cross-validation manner, i.e., it generates posterior probabilities for all training instances via cross-validation, and then retrains the classifier on all training instances. The posterior probabilities thus generated are used for calibrating the outputs of the classifier.

Parameters:
  • X – array-like of shape (n_samples, n_features) with the data instances

  • y – array-like of shape (n_samples,) with the class labels

Returns:

self

fit_tr_val(X, y)[source]

Fits the calibration in a train/val-split manner, i.e., it partitions the training instances into a training and a validation set, and then uses the training samples to learn a classifier, which is then used to generate posterior probabilities for the held-out validation data. These posteriors are used to calibrate the classifier. The classifier is not retrained on the whole dataset.

Parameters:
  • X – array-like of shape (n_samples, n_features) with the data instances

  • y – array-like of shape (n_samples,) with the class labels

Returns:

self

predict(X)[source]

Predicts class labels for the data instances in X

Parameters:

X – array-like of shape (n_samples, n_features) with the data instances

Returns:

array-like of shape (n_samples,) with the class label predictions

predict_proba(X)[source]

Generates posterior probabilities for the data instances in X

Parameters:

X – array-like of shape (n_samples, n_features) with the data instances

Returns:

array-like of shape (n_samples, n_classes) with posterior probabilities

class quapy.classification.calibration.TSCalibration(classifier, val_split=5, n_jobs=None, verbose=False)[source]

Bases: RecalibratedProbabilisticClassifierBase

Applies the Temperature Scaling (TS) calibration method from abstention.calibration, as defined in the paper by Alexandari et al. (2020).

Parameters:
  • classifier – a scikit-learn probabilistic classifier

  • val_split – indicate an integer k for performing kFCV to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained in a stratified validation split containing a fraction p of the training instances (the rest is used for training). In the kFCV case, the classifier is retrained on the whole training set afterwards; in the validation-split case it is not. Default value is 5.

  • n_jobs – indicate the number of parallel workers (only when val_split is an integer)

  • verbose – whether or not to display information in the standard output

class quapy.classification.calibration.VSCalibration(classifier, val_split=5, n_jobs=None, verbose=False)[source]

Bases: RecalibratedProbabilisticClassifierBase

Applies the Vector Scaling (VS) calibration method from abstention.calibration, as defined in the paper by Alexandari et al. (2020).

Parameters:
  • classifier – a scikit-learn probabilistic classifier

  • val_split – indicate an integer k for performing kFCV to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained in a stratified validation split containing a fraction p of the training instances (the rest is used for training). In the kFCV case, the classifier is retrained on the whole training set afterwards; in the validation-split case it is not. Default value is 5.

  • n_jobs – indicate the number of parallel workers (only when val_split is an integer)

  • verbose – whether or not to display information in the standard output
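
The four wrappers (TSCalibration, BCTSCalibration, VSCalibration, NBVSCalibration) expose the same constructor and interface, so they can be swapped freely; a sketch, assuming X_train, y_train, and X_test are already defined:

    from sklearn.linear_model import LogisticRegression
    from quapy.classification.calibration import (
        BCTSCalibration, NBVSCalibration, TSCalibration, VSCalibration)

    for method in (TSCalibration, BCTSCalibration, VSCalibration, NBVSCalibration):
        model = method(LogisticRegression(), val_split=0.3)  # 30% validation split
        model.fit(X_train, y_train)
        print(method.__name__, model.predict_proba(X_test)[:3])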

quapy.classification.methods module

class quapy.classification.methods.LowRankLogisticRegression(n_components=100, **kwargs)[source]

Bases: BaseEstimator

An example of a classification method (i.e., an object that implements fit, predict, and predict_proba) that also generates embedded inputs (i.e., that implements transform), as required by quapy.method.neural.QuaNet. This is a mock method that allows for easily instantiating quapy.method.neural.QuaNet on array-like real-valued instances. The transformation consists of applying sklearn.decomposition.TruncatedSVD while classification is performed using sklearn.linear_model.LogisticRegression on the low-rank space.

Parameters:
  • n_components – the number of principal components to retain

  • kwargs – parameters for the Logistic Regression classifier
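
A brief sketch of the intended use (random data for illustration; the C keyword argument is forwarded to LogisticRegression):

    import numpy as np
    from quapy.classification.methods import LowRankLogisticRegression

    X = np.random.rand(500, 300)        # 500 instances in 300 dimensions
    y = np.random.randint(0, 2, size=500)

    clf = LowRankLogisticRegression(n_components=50, C=1.0)
    clf.fit(X, y)
    Z = clf.transform(X)                # low-rank embeddings, shape (500, 50)
    posteriors = clf.predict_proba(X)   # shape (500, 2)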

fit(X, y)[source]

Fit the model according to the given training data. The fit consists of fitting TruncatedSVD and then LogisticRegression on the low-rank representation.

Parameters:
  • X – array-like of shape (n_samples, n_features) with the instances

  • y – array-like of shape (n_samples,) with the class labels

Returns:

self

get_params()[source]

Get hyper-parameters for this estimator.

Returns:

a dictionary with parameter names mapped to their values

predict(X)[source]

Predicts labels for the instances X embedded into the low-rank space.

Parameters:

X – array-like of shape (n_samples, n_features) instances to classify

Returns:

a numpy array of length n containing the label predictions, where n is the number of instances in X

predict_proba(X)[source]

Predicts posterior probabilities for the instances X embedded into the low-rank space.

Parameters:

X – array-like of shape (n_samples, n_features) instances to classify

Returns:

array-like of shape (n_samples, n_classes) with the posterior probabilities

set_params(**params)[source]

Set the parameters of this estimator.

Parameters:

params – a **kwargs dictionary with the estimator parameters for Logistic Regression, and optionally also n_components for TruncatedSVD

transform(X)[source]

Returns the low-rank approximation of X with n_components dimensions, or X unaltered if n_components >= X.shape[1].

Parameters:

X – array-like of shape (n_samples, n_features) instances to embed

Returns:

array-like of shape (n_samples, n_components) with the embedded instances

quapy.classification.neural module

class quapy.classification.neural.CNNnet(vocabulary_size, n_classes, embedding_size=100, hidden_size=256, repr_size=100, kernel_heights=[3, 5, 7], stride=1, padding=0, drop_p=0.5)[source]

Bases: TextClassifierNet

An implementation of quapy.classification.neural.TextClassifierNet based on Convolutional Neural Networks.

Parameters:
  • vocabulary_size – the size of the vocabulary

  • n_classes – number of target classes

  • embedding_size – the dimensionality of the word embeddings space (default 100)

  • hidden_size – the dimensionality of the hidden space (default 256)

  • repr_size – the dimensionality of the document embeddings space (default 100)

  • kernel_heights – list of kernel lengths (default [3,5,7]), i.e., the number of consecutive tokens that each kernel covers

  • stride – convolutional stride (default 1)

  • padding – convolutional padding (default 0)

  • drop_p – drop probability for dropout (default 0.5)
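
A construction sketch (the vocabulary size and the input batch are arbitrary assumptions):

    import torch
    from quapy.classification.neural import CNNnet

    # a CNN text classifier over a 5000-token vocabulary, binary classification
    net = CNNnet(vocabulary_size=5000, n_classes=2, kernel_heights=[3, 5, 7])

    x = torch.randint(0, 5000, (8, 120))     # batch of 8 documents, 120 tokens each
    scores = net(x)                          # decision scores, shape (8, 2)
    doc_vectors = net.document_embedding(x)  # document embeddings, shape (8, 100)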

document_embedding(input)[source]

Embeds documents (i.e., performs the forward pass up to the next-to-last layer).

Parameters:

input – a batch of instances, typically generated by a torch DataLoader (see quapy.classification.neural.TorchDataset)

Returns:

a torch tensor of shape (n_samples, n_dimensions), where n_samples is the number of documents, and n_dimensions is the dimensionality of the embedding

get_params()[source]

Get hyper-parameters for this estimator

Returns:

a dictionary with parameter names mapped to their values

property vocabulary_size

Return the size of the vocabulary

Returns:

integer

class quapy.classification.neural.LSTMnet(vocabulary_size, n_classes, embedding_size=100, hidden_size=256, repr_size=100, lstm_class_nlayers=1, drop_p=0.5)[source]

Bases: TextClassifierNet

An implementation of quapy.classification.neural.TextClassifierNet based on Long Short Term Memory networks.

Parameters:
  • vocabulary_size – the size of the vocabulary

  • n_classes – number of target classes

  • embedding_size – the dimensionality of the word embeddings space (default 100)

  • hidden_size – the dimensionality of the hidden space (default 256)

  • repr_size – the dimensionality of the document embeddings space (default 100)

  • lstm_class_nlayers – number of LSTM layers (default 1)

  • drop_p – drop probability for dropout (default 0.5)

document_embedding(x)[source]

Embeds documents (i.e., performs the forward pass up to the next-to-last layer).

Parameters:

x – a batch of instances, typically generated by a torch DataLoader (see quapy.classification.neural.TorchDataset)

Returns:

a torch tensor of shape (n_samples, n_dimensions), where n_samples is the number of documents, and n_dimensions is the dimensionality of the embedding

get_params()[source]

Get hyper-parameters for this estimator

Returns:

a dictionary with parameter names mapped to their values

property vocabulary_size

Return the size of the vocabulary

Returns:

integer

class quapy.classification.neural.NeuralClassifierTrainer(net: TextClassifierNet, lr=0.001, weight_decay=0, patience=10, epochs=200, batch_size=64, batch_size_test=512, padding_length=300, device='cuda', checkpointpath='../checkpoint/classifier_net.dat')[source]

Bases: object

Trains a neural network for text classification.

Parameters:
  • net – an instance of TextClassifierNet implementing the forward pass

  • lr – learning rate (default 1e-3)

  • weight_decay – weight decay (default 0)

  • patience – number of consecutive epochs without improvement on the validation set before early stopping is applied (default 10)

  • epochs – maximum number of training epochs (default 200)

  • batch_size – batch size for training (default 64)

  • batch_size_test – batch size for test (default 512)

  • padding_length – maximum number of tokens to consider in a document (default 300)

  • device – specify ‘cpu’ or ‘cuda’ (default) for enabling GPU computation

  • checkpointpath – where to store the parameters of the best model found so far according to the evaluation in the held-out validation split (default ‘../checkpoint/classifier_net.dat’)
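
An end-to-end sketch using an LSTMnet; the randomly indexed documents below are a stand-in for a real tokenized corpus:

    import numpy as np
    from quapy.classification.neural import LSTMnet, NeuralClassifierTrainer

    # stand-in corpus: 200 documents given as lists of indexed tokens
    instances = [list(np.random.randint(0, 5000, size=np.random.randint(5, 50)))
                 for _ in range(200)]
    labels = np.random.randint(0, 2, size=200)

    net = LSTMnet(vocabulary_size=5000, n_classes=2)
    trainer = NeuralClassifierTrainer(net, lr=1e-3, device='cpu',
                                      checkpointpath='./classifier_net.dat')
    trainer.fit(instances, labels, val_split=0.3)
    predictions = trainer.predict(instances)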

property device

Gets the device on which the network is allocated

Returns:

device

fit(instances, labels, val_split=0.3)[source]

Fits the model according to the given training data.

Parameters:
  • instances – list of lists of indexed tokens

  • labels – array-like of shape (n_samples,) with the class labels

  • val_split – proportion of training documents to be taken as the validation set (default 0.3)

Returns:

self

get_params()[source]

Get hyper-parameters for this estimator

Returns:

a dictionary with parameter names mapped to their values

predict(instances)[source]

Predicts labels for the instances

Parameters:

instances – list of lists of indexed tokens

Returns:

a numpy array of length n containing the label predictions, where n is the number of instances

predict_proba(instances)[source]

Predicts posterior probabilities for the instances

Parameters:

instances – list of lists of indexed tokens

Returns:

array-like of shape (n_samples, n_classes) with the posterior probabilities

reset_net_params(vocab_size, n_classes)[source]

Reinitialize the network parameters

Parameters:
  • vocab_size – the size of the vocabulary

  • n_classes – the number of target classes

set_params(**params)[source]

Set the parameters of this trainer and the learner it is training. In the current version, the parameter names for the trainer and the learner should be disjoint.

Parameters:

params – a **kwargs dictionary with the parameters

transform(instances)[source]

Returns the embeddings of the instances

Parameters:

instances – list of lists of indexed tokens

Returns:

array-like of shape (n_samples, embed_size) with the embedded instances, where embed_size is defined by the classification network

class quapy.classification.neural.TextClassifierNet(*args, **kwargs)[source]

Bases: Module

Abstract Text classifier (torch.nn.Module)

dimensions()[source]

Gets the number of dimensions of the embedding space

Returns:

integer

abstract document_embedding(x)[source]

Embeds documents (i.e., performs the forward pass up to the next-to-last layer).

Parameters:

x – a batch of instances, typically generated by a torch DataLoader (see quapy.classification.neural.TorchDataset)

Returns:

a torch tensor of shape (n_samples, n_dimensions), where n_samples is the number of documents, and n_dimensions is the dimensionality of the embedding

forward(x)[source]

Performs the forward pass.

Parameters:

x – a batch of instances, typically generated by a torch DataLoader (see quapy.classification.neural.TorchDataset)

Returns:

a tensor of shape (n_instances, n_classes) with the decision scores for each of the instances and classes

abstract get_params()[source]

Get hyper-parameters for this estimator

Returns:

a dictionary with parameter names mapped to their values

predict_proba(x)[source]

Predicts posterior probabilities for the instances in x

Parameters:

x – a torch tensor of indexed tokens with shape (n_instances, pad_length), where n_instances is the number of instances in the batch and pad_length is the length of the padding in the batch

Returns:

array-like of shape (n_samples, n_classes) with the posterior probabilities

property vocabulary_size

Return the size of the vocabulary

Returns:

integer

xavier_uniform()[source]

Performs Xavier initialization of the network parameters
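
A skeletal subclass, meant only to illustrate the abstract members a concrete network must supply; the mean-pooling architecture is a deliberately minimal assumption, as is the premise that the base forward() applies a final layer named output on top of document_embedding():

    import torch.nn as nn
    from quapy.classification.neural import TextClassifierNet

    class MeanEmbeddingNet(TextClassifierNet):
        # averages the word embeddings as the document representation (illustrative)
        def __init__(self, vocabulary_size, n_classes, embedding_size=100):
            super().__init__()
            self.hyperparams = dict(vocabulary_size=vocabulary_size,
                                    n_classes=n_classes,
                                    embedding_size=embedding_size)
            self.embed = nn.Embedding(vocabulary_size, embedding_size)
            # assumption: the base forward() computes self.output(self.document_embedding(x))
            self.output = nn.Linear(embedding_size, n_classes)

        def document_embedding(self, x):
            # mean-pool token embeddings: (batch, pad_length) -> (batch, embedding_size)
            return self.embed(x).mean(dim=1)

        def get_params(self):
            return self.hyperparams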

class quapy.classification.neural.TorchDataset(instances, labels=None)[source]

Bases: Dataset

Transforms labelled instances into a torch.utils.data.Dataset object, which can in turn be converted into a DataLoader via asDataloader

Parameters:
  • instances – list of lists of indexed tokens

  • labels – array-like of shape (n_samples, n_classes) with the class labels

asDataloader(batch_size, shuffle, pad_length, device)[source]

Converts the labelled collection into a Torch DataLoader with dynamic padding for the batch

Parameters:
  • batch_size – batch size

  • shuffle – whether or not to shuffle instances

  • pad_length – the maximum length for the list of tokens (dynamic padding is applied, meaning that if the longest document in the batch is shorter than pad_length, then the batch is padded up to the length of that document, and not to pad_length)

  • device – the device (‘cpu’ or ‘cuda’) on which to allocate the tensors

Returns:

a torch.utils.data.DataLoader object
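
A usage sketch with toy indexed documents:

    from quapy.classification.neural import TorchDataset

    instances = [[4, 18, 6, 2], [9, 3], [11, 5, 7, 7, 2]]  # toy data
    labels = [0, 1, 0]

    loader = TorchDataset(instances, labels).asDataloader(
        batch_size=2, shuffle=True, pad_length=300, device='cpu')
    for batch in loader:
        pass  # each batch is padded only up to its longest document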

quapy.classification.svmperf module

class quapy.classification.svmperf.SVMperf(svmperf_base, C=0.01, verbose=False, loss='01', host_folder=None)[source]

Bases: BaseEstimator, ClassifierMixin

A wrapper for the SVM-perf package by Thorsten Joachims. When using losses for quantification, the source code has to be patched. See the installation documentation for further details.

Parameters:
  • svmperf_base – path to directory containing the binary files svm_perf_learn and svm_perf_classify

  • C – trade-off between training error and margin (default 0.01)

  • verbose – set to True to print the standard output of svm-perf

  • loss – the loss to optimize for. Available losses are “01”, “f1”, “kld”, “nkld”, “q”, “qacc”, “qf1”, “qgm”, “mae”, “mrae”.

  • host_folder – directory where to store the trained model; set to None (default) to use a tmp directory (temporary directories are automatically deleted)
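
A usage sketch, assuming the patched SVM-perf binaries have been compiled under ./svm_perf and that X_train, y_train (a binary label vector), and X_test are already defined:

    from quapy.classification.svmperf import SVMperf

    svm = SVMperf(svmperf_base='./svm_perf', C=0.01, loss='kld')
    svm.fit(X_train, y_train)               # y_train must be a binary vector
    predictions = svm.predict(X_test)
    scores = svm.decision_function(X_test)  # shape (n_samples,)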

decision_function(X, y=None)[source]

Evaluate the decision function for the samples in X.

Parameters:
  • X – array-like of shape (n_samples, n_features) containing the instances to classify

  • y – unused

Returns:

array-like of shape (n_samples,) containing the decision scores of the instances

fit(X, y)[source]

Trains the SVM for the multivariate performance loss

Parameters:
  • X – training instances

  • y – a binary vector of labels

Returns:

self

predict(X)[source]

Predicts labels for the instances X

Parameters:

X – array-like of shape (n_samples, n_features) instances to classify

Returns:

a numpy array of length n containing the label predictions, where n is the number of instances in X

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → SVMperf

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

valid_losses = {'01': 0, 'f1': 1, 'kld': 12, 'mae': 26, 'mrae': 27, 'nkld': 13, 'q': 22, 'qacc': 23, 'qf1': 24, 'qgm': 25}

Module contents