Using cross_validate in sklearn, simply explained

Cross_validate is a common function to use during the testing and validation phase of your machine learning model development. In this post I will explain what it is, what you can use it for, and how to implement it in Python.

Stephen Allwright

What is cross_validate in sklearn?

Cross_validate is a function in the scikit-learn package which trains and tests a model over multiple folds of your dataset. This cross validation method gives you a better understanding of model performance over the whole dataset instead of just a single train/test split.

  1. The number of folds is defined; by default this is 5
  2. The dataset is split up according to these folds, where each fold has a unique set of test data
  3. A model is trained and tested for each fold
  4. Each fold returns one or more metrics for its test data
  5. The mean and standard deviation of these metrics can then be calculated to provide a single metric for the process
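The steps above can be sketched in a few lines. As a stand-in, this uses sklearn's built-in Diabetes dataset and a Linear Regression model (the same pair used later in this post):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Steps 1-3: define the folds (cv=5 is the default), split the
# dataset, and train/test a model for each fold
X, y = load_diabetes(return_X_y=True)
results = cross_validate(LinearRegression(), X, y, cv=5)

# Step 4: one test score per fold
print(results["test_score"])

# Step 5: summarise with mean and standard deviation
print(np.mean(results["test_score"]), np.std(results["test_score"]))
```

Note that cross_validate returns a dictionary, with the per-fold test scores under the `test_score` key.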

An illustration of how this works is shown below:

[Figure: the cross_validate process in sklearn]

What is cross_validate used for?

Cross_validate is used as a cross validation technique to prevent over-fitting and promote model generalisation.

The typical process of model development is to train a model on one fold of data and then test it on another. But how do we know that this single test dataset is representative? This is why we use cross_validate, and cross validation more generally: to train and test our model on multiple folds, so that we can be confident our model generalises well across the whole dataset and not just a single portion of it.

If the metrics for all folds in cross_validate are similar, then we can conclude that the model is able to generalise. However, if there are significant differences between them, this may indicate over-fitting to certain folds and would need to be investigated further.

How many folds should I use in cross_validate?

By default cross_validate uses a 5-fold strategy; however, this can be adjusted with the cv parameter.

But how many folds should you choose?

There are unfortunately no hard and fast rules when it comes to how many folds you should choose. A general rule of thumb, though, is that the number of folds should be as large as possible while still leaving each fold with enough observations to generalise from and be tested on.
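Changing the number of folds is just a matter of passing a different integer to cv. A quick sketch, again using the Diabetes dataset as a stand-in:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = load_diabetes(return_X_y=True)

# cv controls the number of folds: 10 folds instead of the default 5
# gives more score estimates, but each fold has a smaller test set
results = cross_validate(LinearRegression(), X, y, cv=10)
print(len(results["test_score"]))  # 10
```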

Can I train my model using cross_validate?

A common question developers have is whether cross_validate can also function as a way of training the final model. Unfortunately, this is not the case. Cross_validate is a way of assessing a model and its parameters, and cannot be used for final training. Final training should take place on all available data, with the model then tested on a set of data that has been held back from the start.


Can I use cross_validate for classification and regression?

Cross_validate is a function which can be used for both classification and regression models. The only major difference between the two is that by default cross_validate uses Stratified KFold for classification, and normal KFold for regression.
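This default can be made explicit (or overridden) by passing a splitter object to cv instead of an integer. A sketch using a classifier, assuming the breast cancer dataset and Logistic Regression purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_validate

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

# With a classifier and an integer cv, sklearn uses StratifiedKFold,
# which keeps the class balance similar in every fold; passing the
# splitter explicitly makes that choice visible
stratified = cross_validate(clf, X, y, cv=StratifiedKFold(n_splits=5))

# A plain KFold (the regression default) can also be forced
plain = cross_validate(clf, X, y, cv=KFold(n_splits=5))

print(stratified["test_score"].mean(), plain["test_score"].mean())
```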

Which metrics can I use in cross_validate?

By default cross_validate uses the chosen model’s default scoring metric, but this can be overridden with the scoring parameter. This parameter accepts either a single metric, or multiple metrics as a list or a dictionary. Commonly used options include:

  • ‘accuracy’
  • ‘balanced_accuracy’
  • ‘roc_auc’
  • ‘f1’
  • ‘neg_mean_absolute_error’
  • ‘neg_root_mean_squared_error’
  • ‘r2’
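When a list of metric names is passed, each metric appears in the results dictionary under its own `test_<name>` key. A minimal sketch with two of the regression metrics above:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = load_diabetes(return_X_y=True)

# A list of metric names; each appears in results as "test_<name>"
scoring = ["neg_mean_absolute_error", "r2"]
results = cross_validate(LinearRegression(), X, y, cv=5, scoring=scoring)

print(results["test_neg_mean_absolute_error"])
print(results["test_r2"])
```

Error-based metrics are prefixed with `neg_` because sklearn scorers follow a higher-is-better convention.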

How to implement cross_validate in Python

  1. Create a dataset
  2. Run hyper-parameter tuning
  3. Create model object with desired parameters
  4. Run cross_validate to test model performance
  5. Train final model on full dataset

Therefore, in order to use this function we need to first have an idea of the model we want to use and a prepared dataset to test it on. Let’s look at how this process would look in Python using a Linear Regression model and the Diabetes dataset from sklearn. We will also use a list of multiple metrics for the scoring parameter.
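A sketch of that full process, using the Diabetes dataset and Linear Regression as described above (Linear Regression has no hyper-parameters to tune, so step 2 is skipped here):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Step 1: create a dataset
X, y = load_diabetes(return_X_y=True)

# Step 3: create the model object with the desired parameters
model = LinearRegression()

# Step 4: run cross_validate with a list of multiple metrics
scoring = ["neg_mean_absolute_error", "neg_root_mean_squared_error", "r2"]
results = cross_validate(model, X, y, cv=5, scoring=scoring)

for metric in scoring:
    scores = results[f"test_{metric}"]
    print(f"{metric}: mean={scores.mean():.3f}, std={scores.std():.3f}")

# Step 5: train the final model on the full dataset
model.fit(X, y)
```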

Function parameters for cross_validate

  • estimator — The model object to fit to the data
  • X — The data to fit the model on
  • y — The target of the model
  • scoring — The metric or metrics to use for evaluation
  • cv — The number of folds (or a cross-validation splitter) to use

Summary of the cross_validate function

Cross_validate is a function which runs cross validation on a dataset to test whether the model can generalise over the whole dataset. It returns a score for each fold, and the average of these scores can be calculated to provide a single metric value for the dataset. This is a function and a technique which you should add to your workflow to make sure you are developing highly performant models.



I’m a Data Scientist currently working for Oda, an online grocery retailer, in Oslo, Norway. These posts are my way of sharing some of the tips and tricks I’ve picked up along the way.
