Classification Fundamental Concept
Google's Research Director Peter Norvig claimed that "We don’t have better algorithms. We just have more data."
This article sums up the lecture “Classification Fundamental Concept” from our course “Become a Citizen Data Scientist”.
In this article, we’re going to introduce perhaps the most fundamental concept to understand when building a classification model: the Bias-Variance Trade-off.
By the end, you will have a better understanding of how to avoid the pitfalls of over-fitting and under-fitting.
It may be obvious, but you can never have a prediction model without error.
Without going into the maths behind it, prediction error is mainly divided into two components: Bias and Variance.
Error due to Bias is the difference between the expected (average) prediction of our model and the correct value.
Error due to Variance is the variability of a model's prediction for a given data point when the model is rebuilt on different training data.
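To make the two definitions concrete, here is a minimal sketch that estimates both quantities by rebuilding a model many times on fresh training data. Everything in it is an assumption made for illustration: the ground-truth function `true_f`, the noise level, and the two toy models (one that always predicts the average label, one that copies the nearest training point). It is not taken from the lecture.

```python
import math
import random
import statistics

def true_f(x):
    # Assumed ground-truth relationship (illustrative only).
    return math.sin(2 * math.pi * x)

def sample_training_set(rng, n=30, noise_sd=0.3):
    # Draw a fresh noisy training set from the assumed process.
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def predict_mean(xs, ys, x0):
    # Very simple model: ignores x0 and always predicts the average label.
    return sum(ys) / len(ys)

def predict_1nn(xs, ys, x0):
    # Very flexible model: copies the label of the closest training point.
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]

def bias_and_variance(predict, x0, n_rounds=500, seed=0):
    # Rebuild the model on many training sets and collect its prediction at x0.
    rng = random.Random(seed)
    preds = []
    for _ in range(n_rounds):
        xs, ys = sample_training_set(rng)
        preds.append(predict(xs, ys, x0))
    bias = statistics.mean(preds) - true_f(x0)   # average prediction vs. correct value
    variance = statistics.pvariance(preds)       # spread of the predictions
    return bias, variance

x0 = 0.25
bias_mean, var_mean = bias_and_variance(predict_mean, x0)
bias_1nn, var_1nn = bias_and_variance(predict_1nn, x0)
print(f"mean model: bias={bias_mean:+.3f} variance={var_mean:.3f}")
print(f"1-NN model: bias={bias_1nn:+.3f} variance={var_1nn:.3f}")
```

Under these assumptions the simple model should show the larger bias and the flexible model the larger variance, which is the trade-off discussed in the rest of the article.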
As a picture is worth a thousand words, let’s look at this graphic, taken from one of the best articles I have found on this subject, “Understanding the Bias-Variance Trade-off”.
Source: http://scott.fortmann-roe.com/docs/BiasVariance.html
The bulls-eye diagram is a graphical visualization of Bias and Variance. Each point is the result of one iteration of model building, that is, one model built on a different sample of training data.
The center of the target is a model that perfectly predicts the correct values.
There are four main cases:
• Low Bias and Low Variance: That’s where we want to be! We have a good model here.
• High Bias and Low Variance: That’s what we call an under-fitted model. It means that our model lacks some information; it is too simple. We may have to add variables to our training data. Trying other modelling methods could also be a good option.
• Low Bias and High Variance: We have an over-fitted model. It means that the model is too complicated for the data we have. Put simply, the model does not generalize. The solution is to add more data to our training set and/or to reduce the number of features (the complexity). We can also use ensemble methods such as bagging and random forests, which average many models to reduce variance. Boosting is another ensemble method, although it primarily reduces bias.
• High Bias and High Variance: We still need to work on our model. My suggestion is to tackle the Bias error first, by trying other methods and adding variables if you can.
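The variance-reduction idea behind bagging, mentioned in the list above, can be sketched in a few lines. This is a toy illustration under stated assumptions, not a random forest: the data-generating function `true_f`, the noise level, and the nearest-neighbour base model are all invented for the example, and the helper names are hypothetical.

```python
import math
import random
import statistics

def true_f(x):
    # Assumed ground-truth relationship (illustrative only).
    return math.sin(2 * math.pi * x)

def sample_training_set(rng, n=30, noise_sd=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def predict_1nn(xs, ys, x0):
    # High-variance base model: copy the label of the closest training point.
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]

def predict_bagged_1nn(xs, ys, x0, rng, n_models=25):
    # Bagging: fit the same model on bootstrap resamples and average the predictions.
    n = len(xs)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        preds.append(predict_1nn([xs[i] for i in idx], [ys[i] for i in idx], x0))
    return sum(preds) / n_models

def prediction_variance(n_rounds=300, x0=0.25, seed=1):
    # Compare the spread of both predictors across many training sets.
    rng = random.Random(seed)
    single, bagged = [], []
    for _ in range(n_rounds):
        xs, ys = sample_training_set(rng)
        single.append(predict_1nn(xs, ys, x0))
        bagged.append(predict_bagged_1nn(xs, ys, x0, rng))
    return statistics.pvariance(single), statistics.pvariance(bagged)

var_single, var_bagged = prediction_variance()
print(f"single model variance: {var_single:.3f}")
print(f"bagged model variance: {var_bagged:.3f}")
```

In this setup the averaged predictor should vary less from one training set to the next than the single model does. Averaging does not remove bias, which is why bagging is suggested for the high-variance case rather than the high-bias one.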
Here is another way to sum up the bias-variance trade-off:
This graphic was taken from the book “Elements of Statistical Learning”.
Prediction error is plotted against model complexity twice: the green line is the result using the training data, and the red line is the result using the test data.
Source: Hastie, Tibshirani, Friedman “Elements of Statistical Learning” 2001
When the model is too simple (low complexity):
• The gap between the two curves is narrow. That’s an indication of low Variance.
• The prediction error is high for both training and test data. That’s an indication of high Bias.
Hence we have an under-fitted model.
The higher the complexity, the wider the gap between the two curves.
When the gap between the prediction error on the training data and on the test data becomes too wide, the model has reached the over-fitting regime: the training error keeps falling while the test error rises.
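The behaviour of the two curves can be reproduced on synthetic data. The sketch below is an illustration only, not the experiment behind the book's figure: it assumes a k-nearest-neighbour classifier on two made-up Gaussian classes, and it treats a smaller k as a more complex model. All function names are hypothetical.

```python
import random

def make_data(rng, n_per_class):
    # Two overlapping 2-D Gaussian classes (synthetic, illustrative only).
    data = []
    for label, center in ((0, 0.0), (1, 1.5)):
        for _ in range(n_per_class):
            point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
            data.append((point, label))
    return data

def knn_predict(train, point, k):
    # Majority vote among the k closest training points (ties go to class 0).
    by_distance = sorted(
        train,
        key=lambda item: (item[0][0] - point[0]) ** 2 + (item[0][1] - point[1]) ** 2,
    )
    votes_for_1 = sum(label for _, label in by_distance[:k])
    return 1 if 2 * votes_for_1 > k else 0

def error_rate(train, data, k):
    # Fraction of points in `data` that the classifier gets wrong.
    wrong = sum(knn_predict(train, point, k) != label for point, label in data)
    return wrong / len(data)

rng = random.Random(42)
train = make_data(rng, 100)  # 200 training points
test = make_data(rng, 200)   # 400 test points

results = {}
for k in (200, 25, 1):  # from the simplest model (k = all points) to the most complex (k = 1)
    results[k] = (error_rate(train, train, k), error_rate(train, test, k))
    print(f"k={k:3d}  train error={results[k][0]:.3f}  test error={results[k][1]:.3f}")
```

With k equal to the size of the training set, every point receives the same prediction, so both errors are high and the gap is zero: the under-fitted case. With k = 1, each training point is its own nearest neighbour, so the training error drops to zero while the test error does not: the over-fitted case, with a wide gap between the two.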
The one, Jesus del Toro Garcia, CC BY