quapy.classification package
Submodules
quapy.classification.calibration module
- class quapy.classification.calibration.BCTSCalibration(classifier, val_split=5, n_jobs=None, verbose=False)[source]
Bases: RecalibratedProbabilisticClassifierBase
Applies the Bias-Corrected Temperature Scaling (BCTS) calibration method from abstention.calibration, as defined in the paper by Alexandari et al. (2020).
- Parameters:
classifier – a scikit-learn probabilistic classifier
val_split – an integer k for performing k-fold cross-validation (kFCV) to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained on a stratified validation split containing a proportion p of the training instances (the rest is used for training). In the kFCV case, the classifier is afterwards retrained on the whole training set; in the train/val-split case it is not. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer)
verbose – whether or not to display information in the standard output
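A minimal usage sketch (not part of the original documentation, and assuming the wrapper exposes predict_proba in the usual scikit-learn style; the abstention package must be installed). The same pattern applies to NBVSCalibration, TSCalibration, and VSCalibration below:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

from quapy.classification.calibration import BCTSCalibration

X, y = make_classification(n_samples=1000, random_state=0)

calibrated = BCTSCalibration(LogisticRegression(), val_split=5)  # 5-fold CV posteriors
calibrated.fit(X, y)
posteriors = calibrated.predict_proba(X)  # calibrated posterior probabilities
```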
- class quapy.classification.calibration.NBVSCalibration(classifier, val_split=5, n_jobs=None, verbose=False)[source]
Bases: RecalibratedProbabilisticClassifierBase
Applies the No-Bias Vector Scaling (NBVS) calibration method from abstention.calibration, as defined in the paper by Alexandari et al. (2020).
- Parameters:
classifier – a scikit-learn probabilistic classifier
val_split – an integer k for performing k-fold cross-validation (kFCV) to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained on a stratified validation split containing a proportion p of the training instances (the rest is used for training). In the kFCV case, the classifier is afterwards retrained on the whole training set; in the train/val-split case it is not. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer)
verbose – whether or not to display information in the standard output
- class quapy.classification.calibration.RecalibratedProbabilisticClassifier[source]
Bases: object
Abstract class for the (re)calibration methods from abstention.calibration, as defined in Alexandari, A., Kundaje, A., & Shrikumar, A. (2020, November). Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation. In International Conference on Machine Learning (pp. 222-232). PMLR.
- class quapy.classification.calibration.RecalibratedProbabilisticClassifierBase(classifier, calibrator, val_split=5, n_jobs=None, verbose=False)[source]
Bases: BaseEstimator, RecalibratedProbabilisticClassifier
Applies a (re)calibration method from abstention.calibration, as defined in the paper by Alexandari et al. (2020).
- Parameters:
classifier – a scikit-learn probabilistic classifier
calibrator – the calibration object (an instance of abstention.calibration.CalibratorFactory)
val_split – an integer k for performing k-fold cross-validation (kFCV) to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained on a stratified validation split containing a proportion p of the training instances (the rest is used for training). In the kFCV case, the classifier is afterwards retrained on the whole training set; in the train/val-split case it is not. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer); default=None
verbose – whether or not to display information in the standard output
- property classes_
Returns the classes on which the classifier has been trained.
- Returns:
array-like of shape (n_classes,)
- fit(X, y)[source]
Fits the calibration for the probabilistic classifier.
- Parameters:
X – array-like of shape (n_samples, n_features) with the data instances
y – array-like of shape (n_samples,) with the class labels
- Returns:
self
- fit_cv(X, y)[source]
Fits the calibration in a cross-validation manner, i.e., it generates posterior probabilities for all training instances via cross-validation, and then retrains the classifier on all training instances. The posterior probabilities thus generated are used for calibrating the outputs of the classifier.
- Parameters:
X – array-like of shape (n_samples, n_features) with the data instances
y – array-like of shape (n_samples,) with the class labels
- Returns:
self
- fit_tr_val(X, y)[source]
Fits the calibration in a train/val-split manner, i.e., it partitions the training instances into a training and a validation set, and then uses the training samples to learn a classifier which is then used to generate posterior probabilities for the held-out validation data. These posteriors are used to calibrate the classifier. The classifier is not retrained on the whole dataset afterwards.
- Parameters:
X – array-like of shape (n_samples, n_features) with the data instances
y – array-like of shape (n_samples,) with the class labels
- Returns:
self
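Putting the two strategies together, a hedged sketch of how val_split presumably routes the fitting (an integer k to fit_cv, a float p in (0,1) to fit_tr_val); TSCalibration is used here just as an example of the wrapper family:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

from quapy.classification.calibration import TSCalibration

X, y = make_classification(n_samples=500, random_state=0)

# integer val_split: posteriors via 10-fold CV, classifier retrained on all data
TSCalibration(LogisticRegression(), val_split=10).fit(X, y)

# float val_split: posteriors on a 30% held-out split, no retraining
TSCalibration(LogisticRegression(), val_split=0.3).fit(X, y)
```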
- class quapy.classification.calibration.TSCalibration(classifier, val_split=5, n_jobs=None, verbose=False)[source]
Bases: RecalibratedProbabilisticClassifierBase
Applies the Temperature Scaling (TS) calibration method from abstention.calibration, as defined in the paper by Alexandari et al. (2020).
- Parameters:
classifier – a scikit-learn probabilistic classifier
val_split – an integer k for performing k-fold cross-validation (kFCV) to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained on a stratified validation split containing a proportion p of the training instances (the rest is used for training). In the kFCV case, the classifier is afterwards retrained on the whole training set; in the train/val-split case it is not. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer)
verbose – whether or not to display information in the standard output
- class quapy.classification.calibration.VSCalibration(classifier, val_split=5, n_jobs=None, verbose=False)[source]
Bases: RecalibratedProbabilisticClassifierBase
Applies the Vector Scaling (VS) calibration method from abstention.calibration, as defined in the paper by Alexandari et al. (2020).
- Parameters:
classifier – a scikit-learn probabilistic classifier
val_split – an integer k for performing k-fold cross-validation (kFCV) to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained on a stratified validation split containing a proportion p of the training instances (the rest is used for training). In the kFCV case, the classifier is afterwards retrained on the whole training set; in the train/val-split case it is not. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer)
verbose – whether or not to display information in the standard output
quapy.classification.methods module
- class quapy.classification.methods.LowRankLogisticRegression(n_components=100, **kwargs)[source]
Bases: BaseEstimator
An example of a classification method (i.e., an object that implements fit, predict, and predict_proba) that also generates embedded inputs (i.e., that implements transform), as required by quapy.method.neural.QuaNet. This is a mock method that allows for easily instantiating quapy.method.neural.QuaNet on array-like real-valued instances. The transformation consists of applying sklearn.decomposition.TruncatedSVD, while classification is performed using sklearn.linear_model.LogisticRegression on the low-rank space.
- Parameters:
n_components – the number of principal components to retain
kwargs – parameters for the Logistic Regression classifier
- fit(X, y)[source]
Fit the model according to the given training data. The fit consists of fitting TruncatedSVD and then LogisticRegression on the low-rank representation.
- Parameters:
X – array-like of shape (n_samples, n_features) with the instances
y – array-like of shape (n_samples,) with the class labels
- Returns:
self
- get_params()[source]
Get hyper-parameters for this estimator.
- Returns:
a dictionary with parameter names mapped to their values
- predict(X)[source]
Predicts labels for the instances X embedded into the low-rank space.
- Parameters:
X – array-like of shape (n_samples, n_features) instances to classify
- Returns:
a numpy array of length n containing the label predictions, where n is the number of instances in X
- predict_proba(X)[source]
Predicts posterior probabilities for the instances X embedded into the low-rank space.
- Parameters:
X – array-like of shape (n_samples, n_features) instances to classify
- Returns:
array-like of shape (n_samples, n_classes) with the posterior probabilities
- set_params(**params)[source]
Set the parameters of this estimator.
- Parameters:
params – a **kwargs dictionary with the estimator parameters for Logistic Regression, and optionally also n_components for TruncatedSVD
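A short sketch (synthetic data, hypothetical values) of LowRankLogisticRegression used as a classifier that also exposes the low-rank transform required by QuaNet:

```python
import numpy as np

from quapy.classification.methods import LowRankLogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 500))        # 200 instances with 500 features
y = rng.integers(0, 2, size=200)  # binary class labels

cls = LowRankLogisticRegression(n_components=50, C=1.0)  # extra kwargs go to LogisticRegression
cls.fit(X, y)
Z = cls.transform(X)      # low-rank embeddings, shape (200, 50)
P = cls.predict_proba(X)  # posterior probabilities, shape (200, 2)
```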
quapy.classification.neural module
- class quapy.classification.neural.CNNnet(vocabulary_size, n_classes, embedding_size=100, hidden_size=256, repr_size=100, kernel_heights=[3, 5, 7], stride=1, padding=0, drop_p=0.5)[source]
Bases: TextClassifierNet
An implementation of quapy.classification.neural.TextClassifierNet based on Convolutional Neural Networks.
- Parameters:
vocabulary_size – the size of the vocabulary
n_classes – number of target classes
embedding_size – the dimensionality of the word embeddings space (default 100)
hidden_size – the dimensionality of the hidden space (default 256)
repr_size – the dimensionality of the document embeddings space (default 100)
kernel_heights – list of kernel lengths (default [3,5,7]), i.e., the number of consecutive tokens that each kernel covers
stride – convolutional stride (default 1)
padding – convolutional padding (default 0)
drop_p – drop probability for dropout (default 0.5)
- document_embedding(input)[source]
Embeds documents (i.e., performs the forward pass up to the next-to-last layer).
- Parameters:
input – a batch of instances, typically generated by a torch's DataLoader instance (see quapy.classification.neural.TorchDataset)
- Returns:
a torch tensor of shape (n_samples, n_dimensions), where n_samples is the number of documents, and n_dimensions is the dimensionality of the embedding
- get_params()[source]
Get hyper-parameters for this estimator
- Returns:
a dictionary with parameter names mapped to their values
- property vocabulary_size
Return the size of the vocabulary
- Returns:
integer
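By way of illustration, a small instantiation sketch with hypothetical values for the constructor arguments documented above:

```python
from quapy.classification.neural import CNNnet

net = CNNnet(
    vocabulary_size=5000,      # size of the (indexed) token vocabulary
    n_classes=2,
    embedding_size=100,
    hidden_size=256,
    repr_size=100,
    kernel_heights=[3, 5, 7],  # kernels spanning 3, 5, and 7 consecutive tokens
    drop_p=0.5,
)
```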
- class quapy.classification.neural.LSTMnet(vocabulary_size, n_classes, embedding_size=100, hidden_size=256, repr_size=100, lstm_class_nlayers=1, drop_p=0.5)[source]
Bases: TextClassifierNet
An implementation of quapy.classification.neural.TextClassifierNet based on Long Short-Term Memory networks.
- Parameters:
vocabulary_size – the size of the vocabulary
n_classes – number of target classes
embedding_size – the dimensionality of the word embeddings space (default 100)
hidden_size – the dimensionality of the hidden space (default 256)
repr_size – the dimensionality of the document embeddings space (default 100)
lstm_class_nlayers – number of LSTM layers (default 1)
drop_p – drop probability for dropout (default 0.5)
- document_embedding(x)[source]
Embeds documents (i.e., performs the forward pass up to the next-to-last layer).
- Parameters:
x – a batch of instances, typically generated by a torch's DataLoader instance (see quapy.classification.neural.TorchDataset)
- Returns:
a torch tensor of shape (n_samples, n_dimensions), where n_samples is the number of documents, and n_dimensions is the dimensionality of the embedding
- get_params()[source]
Get hyper-parameters for this estimator
- Returns:
a dictionary with parameter names mapped to their values
- property vocabulary_size
Return the size of the vocabulary
- Returns:
integer
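LSTMnet is a drop-in alternative to CNNnet behind the same TextClassifierNet interface; a minimal sketch with hypothetical values:

```python
from quapy.classification.neural import LSTMnet

net = LSTMnet(vocabulary_size=5000, n_classes=2, lstm_class_nlayers=2, drop_p=0.5)
```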
- class quapy.classification.neural.NeuralClassifierTrainer(net: TextClassifierNet, lr=0.001, weight_decay=0, patience=10, epochs=200, batch_size=64, batch_size_test=512, padding_length=300, device='cuda', checkpointpath='../checkpoint/classifier_net.dat')[source]
Bases: object
Trains a neural network for text classification.
- Parameters:
net – an instance of TextClassifierNet implementing the forward pass
lr – learning rate (default 1e-3)
weight_decay – weight decay (default 0)
patience – number of consecutive epochs without any improvement on the validation set to wait before applying early stopping (default 10)
epochs – maximum number of training epochs (default 200)
batch_size – batch size for training (default 64)
batch_size_test – batch size for test (default 512)
padding_length – maximum number of tokens to consider in a document (default 300)
device – specify 'cpu' or 'cuda' (default) as the device on which computations are performed
checkpointpath – where to store the parameters of the best model found so far according to the evaluation in the held-out validation split (default ‘../checkpoint/classifier_net.dat’)
- property device
Gets the device on which the network is allocated
- Returns:
device
- fit(instances, labels, val_split=0.3)[source]
Fits the model according to the given training data.
- Parameters:
instances – list of lists of indexed tokens
labels – array-like of shape (n_samples,) with the class labels
val_split – proportion of training documents to be taken as the validation set (default 0.3)
- Returns:
self
- get_params()[source]
Get hyper-parameters for this estimator
- Returns:
a dictionary with parameter names mapped to their values
- predict(instances)[source]
Predicts labels for the instances
- Parameters:
instances – list of lists of indexed tokens
- Returns:
a numpy array of length n containing the label predictions, where n is the number of instances
- predict_proba(instances)[source]
Predicts posterior probabilities for the instances
- Parameters:
instances – list of lists of indexed tokens
- Returns:
array-like of shape (n_samples, n_classes) with the posterior probabilities
- reset_net_params(vocab_size, n_classes)[source]
Reinitialize the network parameters
- Parameters:
vocab_size – the size of the vocabulary
n_classes – the number of target classes
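An end-to-end hedged sketch tying the pieces together: a CNNnet trained with NeuralClassifierTrainer on synthetic indexed-token documents (all values below are placeholders):

```python
import numpy as np

from quapy.classification.neural import CNNnet, NeuralClassifierTrainer

vocab_size, n_classes = 5000, 2
rng = np.random.default_rng(0)
instances = [rng.integers(0, vocab_size, size=50).tolist() for _ in range(100)]
labels = rng.integers(0, n_classes, size=100)

trainer = NeuralClassifierTrainer(
    CNNnet(vocab_size, n_classes),
    lr=1e-3, epochs=200, patience=10, device='cpu',  # 'cuda' for GPU training
)
trainer.fit(instances, labels, val_split=0.3)  # 30% of documents held out for early stopping
predictions = trainer.predict(instances)       # one label per document
```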
- class quapy.classification.neural.TextClassifierNet(*args, **kwargs)[source]
Bases: Module
Abstract text classifier (torch.nn.Module).
- abstract document_embedding(x)[source]
Embeds documents (i.e., performs the forward pass up to the next-to-last layer).
- Parameters:
x – a batch of instances, typically generated by a torch's DataLoader instance (see quapy.classification.neural.TorchDataset)
- Returns:
a torch tensor of shape (n_samples, n_dimensions), where n_samples is the number of documents, and n_dimensions is the dimensionality of the embedding
- forward(x)[source]
Performs the forward pass.
- Parameters:
x – a batch of instances, typically generated by a torch's DataLoader instance (see quapy.classification.neural.TorchDataset)
- Returns:
a tensor of shape (n_instances, n_classes) with the decision scores for each of the instances and classes
- abstract get_params()[source]
Get hyper-parameters for this estimator
- Returns:
a dictionary with parameter names mapped to their values
- predict_proba(x)[source]
Predicts posterior probabilities for the instances in x
- Parameters:
x – a torch tensor of indexed tokens with shape (n_instances, pad_length), where n_instances is the number of instances in the batch and pad_length is the length of the padding in the batch
- Returns:
array-like of shape (n_samples, n_classes) with the posterior probabilities
- property vocabulary_size
Return the size of the vocabulary
- Returns:
integer
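To implement a custom architecture, a subclass supplies at least the abstract document_embedding and get_params. The following is a hypothetical sketch (MeanEmbeddingNet is not part of QuaPy) that also overrides forward explicitly, rather than relying on base-class internals:

```python
import torch.nn as nn

from quapy.classification.neural import TextClassifierNet

class MeanEmbeddingNet(TextClassifierNet):
    """Hypothetical net: a document is the mean of its word embeddings."""

    def __init__(self, vocabulary_size, n_classes, embedding_size=100):
        super().__init__()
        self._vocab_size = vocabulary_size
        self.emb = nn.Embedding(vocabulary_size, embedding_size)
        self.out = nn.Linear(embedding_size, n_classes)

    def document_embedding(self, x):
        # forward pass up to the next-to-last layer
        return self.emb(x).mean(dim=1)

    def forward(self, x):
        # decision scores of shape (n_instances, n_classes)
        return self.out(self.document_embedding(x))

    def get_params(self):
        return {'vocabulary_size': self._vocab_size,
                'embedding_size': self.emb.embedding_dim}

    @property
    def vocabulary_size(self):
        return self._vocab_size
```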
- class quapy.classification.neural.TorchDataset(instances, labels=None)[source]
Bases: Dataset
Transforms labelled instances into a PyTorch torch.utils.data.DataLoader object.
- Parameters:
instances – list of lists of indexed tokens
labels – array-like of shape (n_samples,) with the class labels
- asDataloader(batch_size, shuffle, pad_length, device)[source]
Converts the labelled collection into a Torch DataLoader with dynamic padding for the batch
- Parameters:
batch_size – batch size
shuffle – whether or not to shuffle instances
pad_length – the maximum length for the list of tokens (dynamic padding is applied, meaning that if the longest document in the batch is shorter than pad_length, then the batch is padded up to that length, and not to pad_length)
device – the device ('cpu' or 'cuda') on which to allocate the tensors
- Returns:
a torch.utils.data.DataLoader object
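A hedged usage sketch (the token indices below are synthetic):

```python
import numpy as np

from quapy.classification.neural import TorchDataset

instances = [[4, 92, 17], [8, 13, 44, 6, 2]]  # lists of indexed tokens
labels = np.asarray([0, 1])

loader = TorchDataset(instances, labels).asDataloader(
    batch_size=2, shuffle=True, pad_length=300, device='cpu')

for batch in loader:
    # batches are padded dynamically: up to the longest document in the
    # batch, and never beyond pad_length
    pass
```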
quapy.classification.svmperf module
- class quapy.classification.svmperf.SVMperf(svmperf_base, C=0.01, verbose=False, loss='01', host_folder=None)[source]
Bases: BaseEstimator, ClassifierMixin
A wrapper for the SVM-perf package by Thorsten Joachims. When using losses for quantification, the source code has to be patched. See the installation documentation for further details.
- Parameters:
svmperf_base – path to directory containing the binary files svm_perf_learn and svm_perf_classify
C – trade-off between training error and margin (default 0.01)
verbose – set to True to print svm-perf std outputs
loss – the loss to optimize for. Available losses are “01”, “f1”, “kld”, “nkld”, “q”, “qacc”, “qf1”, “qgm”, “mae”, “mrae”.
host_folder – directory where to store the trained model; set to None (default) for using a tmp directory (temporary directories are automatically deleted)
- decision_function(X, y=None)[source]
Evaluate the decision function for the samples in X.
- Parameters:
X – array-like of shape (n_samples, n_features) containing the instances to classify
y – unused
- Returns:
array-like of shape (n_samples,) containing the decision scores of the instances
- fit(X, y)[source]
Trains the SVM for the multivariate performance loss
- Parameters:
X – training instances
y – a binary vector of labels
- Returns:
self
- predict(X)[source]
Predicts labels for the instances X
- Parameters:
X – array-like of shape (n_samples, n_features) instances to classify
- Returns:
a numpy array of length n containing the label predictions, where n is the number of instances in X
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → SVMperf
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g., used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – metadata routing for the sample_weight parameter in score.
- Returns:
self – the updated object.
- Return type:
object
- valid_losses = {'01': 0, 'f1': 1, 'kld': 12, 'mae': 26, 'mrae': 27, 'nkld': 13, 'q': 22, 'qacc': 23, 'qf1': 24, 'qgm': 25}
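Finally, a hedged usage sketch for SVMperf with a quantification-oriented loss; the svmperf_base path is a placeholder that must point to a local, patched build containing the svm_perf_learn and svm_perf_classify binaries:

```python
import numpy as np

from quapy.classification.svmperf import SVMperf

rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.integers(0, 2, size=100)  # SVMperf works with binary labels

svm = SVMperf(svmperf_base='./svm_perf_quantification', C=0.01, loss='kld')
svm.fit(X, y)
scores = svm.decision_function(X)  # decision scores, shape (100,)
predictions = svm.predict(X)
```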