Bias-Variance Tradeoff
 Summary

In statistics and machine learning, we collect data, build models from this data and make inferences. With too little data, the model is most likely not representative of the truth, since it's biased towards what it sees. With too much data, the model can become complex if it attempts to deal with all the variations it sees.
Ideally, we want models to have low bias and low variance. In practice, lower bias leads to higher variance, and vice versa. For this reason, we call this the Bias-Variance Tradeoff, also called the Bias-Variance Dilemma.
There are techniques to address this tradeoff. The idea is to strike a balance of bias and variance that's acceptable for the problem. A good model must be rich enough to express the underlying structure in the data and simple enough to avoid fitting spurious patterns.
Discussion
Could you explain the bias-variance tradeoff with examples? Suppose we collect income data from multiple cities across professions. While income correlates with profession, there will be variations across cities due to differing lifestyles, cost of living, tax rules, etc. For example, a doctor in London would have a higher income than a doctor in Leicester. This is called heterogeneity in data. In regression modelling, a model built on this data will have high variance and its predictions may not be accurate.
We can overcome this by making the data more homogeneous: split the data by city. We'll then end up with multiple models, one per city. Each model is biased to its city but has lower variance. We could also choose to split the data by state or region. The amount of variance that can be tolerated will dictate the bias.
Consider KNN classification as another example. At low \(k\), predictions are not consistent due to high variance. When we consider more neighbours, we get better predictions as variance is reduced. However, if \(k\) is too high, we start considering neighbours that are "too far" away, which increases bias.
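The effect of \(k\) can be seen in a toy nearest-neighbour regressor. This sketch is illustrative only: the helper `knn_predict`, the process \(f(x)=x^2\) and the noise level are assumptions, not from the original article.

```python
import random
import statistics

def knn_predict(train, x, k):
    """Predict at x by averaging the y-values of the k nearest training points."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return statistics.mean(y for _, y in neighbours)

random.seed(0)
# Noisy samples of a simple underlying process f(x) = x^2.
train = [(x / 10, (x / 10) ** 2 + random.gauss(0, 0.1)) for x in range(100)]

# k = 1 chases individual noisy points (high variance); k = 50 averages over
# neighbours that are "too far" from x = 0.1, pulling the estimate away from
# the true value f(0.1) = 0.01 (high bias).
print(knn_predict(train, 0.1, 1))
print(knn_predict(train, 0.1, 50))
```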
What's the intuition behind the bias-variance tradeoff? Assume we're building a prediction model and we build multiple models from different data samples. Bias is a measure of how far the predictions are from the true value. Variance is a measure of the variability across these models. This is often illustrated graphically with a bullseye diagram.
Intuitively, a biased model is too simple: it's unable to capture essential patterns in the training data. We say that such a model is underfitting the data.
On the other hand, a model with high variance is a complex model that's overly sensitive to the training data. It's overfitting the data. When it sees new data, it's unable to predict correctly because it's fitted too closely to the training data. We also say that such a model does not generalize well. A simpler model would have done better, but if it becomes too simple it also becomes biased.
In summary, a biased model is underfitted and of low complexity. A model of high variance is overfitted and of high complexity.
What's the math behind the bias-variance tradeoff? Given x-y data points, we can represent the relationship as \(Y = f(X) + \epsilon\), where \(\epsilon\) is an error term with Normal distribution \(N(0,\sigma_\epsilon)\). Let \(\hat{f}(X)\) be an estimate of \(f(X)\) obtained via any modelling technique, such as linear regression or KNN. Using mean squared error as the prediction error, the expected prediction error at a point \(x\) is,
$$\begin{align}\Bbb{E}[(y - \hat{f}(x))^2] = & \; \Bbb{E}[(f(x) - \Bbb{E}[\hat{f}(x)])^2] \\ & + \Bbb{E}[(\hat{f}(x) - \Bbb{E}[\hat{f}(x)])^2] \\ & + {\sigma_\epsilon}^2\end{align}$$
The first term is the squared bias of the estimator. The second term is the variance of the estimator. The third term is simply noise. A perfect model would eliminate both bias and variance, but not the noise, which contributes what we call the irreducible error. \(\Bbb{E}[\hat{f}(x)]\) is the average prediction across estimators, each trained on a different sampling of the dataset.
The idea of separating and analysing the bias and variance terms of the prediction error is called bias-variance decomposition.
Perfect models don't exist. In practice, we aim for a model that minimizes the error, neither underfitting nor overfitting.
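The decomposition can be checked numerically. The sketch below is an illustration, not a definitive recipe: the process \(f(x)=x^2\), the noise level and the sample sizes are arbitrary choices. It fits a straight line to many independently drawn datasets and measures the squared bias and the variance of the predictions at one point.

```python
import random
import statistics

def fit_line(data):
    """Ordinary least squares fit of y = a + b*x, in closed form."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    b = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x, _ in data)
    return my - b * mx, b  # intercept a, slope b

def f(x):
    return x * x           # the true (unknown) process

random.seed(1)
sigma = 0.1                # noise level, so irreducible error is sigma**2
x0 = 0.9                   # point at which we evaluate the decomposition

# Train one estimator per dataset and record its prediction at x0.
preds = []
for _ in range(2000):
    data = [(x / 20, f(x / 20) + random.gauss(0, sigma)) for x in range(21)]
    a, b = fit_line(data)
    preds.append(a + b * x0)

mean_pred = statistics.mean(preds)
bias_sq = (f(x0) - mean_pred) ** 2     # (f(x0) - E[f^(x0)])^2
variance = statistics.pvariance(preds)  # E[(f^(x0) - E[f^(x0)])^2]
print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
# Expected squared prediction error at x0 is then bias_sq + variance + sigma**2.
```

The linear model cannot express the quadratic process, so the squared bias dominates; the variance stays small because all the fitted lines look alike.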
Could you explain specific examples of high/low bias/variance? In this example, we use \(f(x)\) for the underlying process (purple) and \(\hat{f}(x)\) for our estimate of the process (orange). Individual fitted functions (orange) are averaged to give \(\Bbb{E}[\hat{f}(x)]\) (green).
Consider a nonlinear process \(f(x)\) (top figure). We don't know that it's nonlinear and attempt to fit a linear function \(\hat{f}(x)\) to the data, which may also contain noise. We take different samples of the data and find suitable fits. None of the lines are close to \(f(x)\): our fits are all biased, in fact biased towards linear functions. However, the lines are not too different from one another, which implies low variance.
Now consider a linear process \(f(x)\) (bottom figure). We attempt to fit a nonlinear function \(\hat{f}(x)\) to the data. The functions are complex enough to fit both the data and the noise. In fact, each nonlinear curve overfits its own data and the curves all look different. Thus, our model \(\hat{f}(x)\) exhibits high variance. When we average these fits, we get a line that's close to the original process. Thus, there's low bias.
How can I calculate the bias and variance of my model? Bias and variance can be calculated only when we have multiple estimators, each trained on a different dataset. In practice, we usually have a single dataset and train a single estimator on it. In such a case, we can use bootstrapping or cross validation.
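A minimal sketch of the bootstrap idea: refit the estimator on resamples, drawn with replacement, of the one dataset we have, and use the spread of those refits to estimate the estimator's variance. The data and the choice of the sample mean as the estimator are assumptions for illustration.

```python
import random
import statistics

def sample_mean(data):
    return statistics.mean(data)

random.seed(2)
data = [random.gauss(5, 2) for _ in range(100)]  # the single dataset we have

# Refit the estimator on bootstrap resamples (sampling with replacement)
# to approximate its variability without collecting new data.
boot_estimates = [
    sample_mean(random.choices(data, k=len(data))) for _ in range(1000)
]
boot_var = statistics.pvariance(boot_estimates)

print("estimate:", sample_mean(data))
print("bootstrap variance of the estimate:", boot_var)
# For the sample mean, theory predicts a variance near sigma^2 / n = 4 / 100.
```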
We don't usually calculate bias and variance explicitly. Instead, the dataset is divided into training and test sets. The model is trained on the training set and then evaluated for prediction error on the test set. This is equivalent to selecting the best from a candidate list of estimators using the bias and variance of each estimator. This similarity can also be observed in the error curves for \({Bias}^2 + Variance\) and the test set.
Overfitting can be observed when the training set error drops but the test set error increases. This is often an indication to consider a simpler model. Equivalently, in neural networks, it's an indication to stop the training process.
It's important that the test set is not used for training. Otherwise, it's difficult to assess the model's performance.
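The overfitting signature, low training error but higher test error, can be seen with a deliberately over-flexible model. In this sketch (the synthetic linear process and the 1-nearest-neighbour model are assumptions for illustration), the model memorizes its training set perfectly yet errs on fresh data.

```python
import random
import statistics

def nn1_predict(train, x):
    # 1-nearest-neighbour regression: return the y of the closest training point.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def make_data(n, sigma=0.3):
    # Noisy samples of the linear process f(x) = 2x.
    data = []
    for _ in range(n):
        x = random.random()
        data.append((x, 2 * x + random.gauss(0, sigma)))
    return data

random.seed(3)
train, test = make_data(50), make_data(50)

train_err = statistics.mean((y - nn1_predict(train, x)) ** 2 for x, y in train)
test_err = statistics.mean((y - nn1_predict(train, x)) ** 2 for x, y in test)
print("train error:", train_err)  # zero: every point is its own nearest neighbour
print("test error:", test_err)    # larger: the model has fitted the noise
```

Had the test points been used for training, the test error would also be zero, which is why reusing the test set makes performance impossible to assess.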
What are some possible methods to overcome the bias-variance tradeoff? A common myth is that we should minimize bias at the expense of variance; it's important to minimize both. Resampling techniques such as bagging and cross validation help to reduce variance without increasing bias. With such techniques we build multiple models and predict using an ensemble of these models. A specific example is random forests, used for classification: a random forest reduces the variance of a single decision tree. The penalty is memory and computation due to the multiple models.
Bagging reduces variance with little effect on bias. Boosting is a technique to reduce bias, though in practice boosting can hurt performance on noisy data. Moreover, boosting is known to increase variance at an exponentially decaying rate, which some call the exponential bias-variance tradeoff.
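The variance reduction from bagging can be sketched numerically. This is an illustration only: the 1-NN base model and the synthetic linear process are assumptions. Predictions at one point are compared across many training sets, for a single model versus a bagged ensemble of bootstrap refits.

```python
import random
import statistics

def nn1(train, x):
    # A high-variance base model: 1-nearest-neighbour regression.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def bagged_nn1(train, x, n_models=25):
    # Bagging: average models fit on bootstrap resamples of the training set.
    return statistics.mean(
        nn1(random.choices(train, k=len(train)), x) for _ in range(n_models)
    )

random.seed(4)
x0 = 0.5
single, bagged = [], []
for _ in range(300):
    train = []
    for _ in range(30):
        x = random.random()
        train.append((x, 2 * x + random.gauss(0, 0.5)))
    single.append(nn1(train, x0))       # prediction of one model
    bagged.append(bagged_nn1(train, x0))  # prediction of the ensemble

single_var = statistics.pvariance(single)
bagged_var = statistics.pvariance(bagged)
print("single-model variance:", single_var)
print("bagged-ensemble variance:", bagged_var)
```

The averaging across bootstrap refits smooths out the base model's sensitivity to individual noisy points, at the cost of fitting 25 models instead of one.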
Is the bias-variance tradeoff applicable to neural networks? Historically, it was believed that the tradeoff applies to neural networks as well. To address it, early stopping and dropout are techniques to avoid overfitting.
In the 2010s, neural networks challenged the classical bias-variance tradeoff. The classical U-shaped risk curve was replaced with a double-descent risk curve. While bias decreases monotonically, variance first increases and then decreases after a point called the interpolation threshold. Beyond this point, as more parameters are added, the network performs better. In fact, this behaviour has been observed not just with neural networks but also with ensemble methods such as boosting and random forests.
In particular, performance is influenced by both the width and the depth of the network. Bias decreases as width increases. Variance decreases as width increases beyond the threshold. As depth increases, bias decreases and variance increases by a lesser amount. Deeper networks generalize better, and this is mainly due to lower bias.
Where exactly is the bias-variance tradeoff relevant? The bias-variance tradeoff applies to supervised machine learning, both classification problems and regression problems. In general, it can be a useful conceptual framework when modelling any complex system.
The tradeoff has been useful in analyzing human cognition. Given limited training data, we rely on high-bias, low-variance heuristics. These heuristics are fairly simple but generalize well to a wide variety of situations. Tasks such as object recognition use some "hard wiring" that's later fine-tuned by experience. Do humans learn concepts based on prototypes (high bias, low variance) or exemplar models (low bias, high variance)? This sort of question can be investigated via the bias-variance tradeoff.
In program analysis, more precise abstractions may not lead to better results, and the bias-variance tradeoff has been used to explain this. In fact, a tool tuned using cross validation had better running time, found new defects and experienced fewer timeouts.
In reinforcement learning with partial observability, there's a similar tradeoff between asymptotic bias and overfitting. A smaller state representation might decrease the risk of overfitting, but at the cost of increasing asymptotic bias.
Milestones
1952
In a paper titled "On empirical spectral analysis of stochastic processes", Grenander introduces what he calls the uncertainty principle. He states, "if we want high resolvability we have to sacrifice some precision of the estimate and vice versa." The term resolvability relates to bias, whereas precision relates to variance.
1975
Given discrete, noisy observations, Wahba and Wold show how a smooth curve can be fitted to the data via cross validation. Smoothing can be done to control variance or bias. Cross validation helps in controlling both and obtaining a better fit.
1986
Hastie and Tibshirani discuss the bias-variance tradeoff in the context of regression modelling. This is just one example showing that the tradeoff is well known by the 1980s.
1992
Geman et al. note that a feedforward neural network trained by error backpropagation is essentially nonparametric regression. It's a model-free approach, but it requires lots of training data and is slow to converge. A model-based approach learns faster but is also biased: it can't address complex inference problems. They therefore state the tradeoff clearly: "whereas incorrect models lead to high bias, truly model-free inference suffers from high variance."
1995
Historically, the bias-variance tradeoff was studied in regression with squared loss as the loss function; for classification problems, zero-one loss is used. For classification, Kong and Dietterich show that ensembles can reduce bias. In 1996, Breiman shows that ensembles can reduce variance.
1998
Schapire et al. show that ensembles enlarge the margins and thereby enable models to generalize better.
2000
Domingos proposes a unified bias-variance decomposition that can be applied to any loss function (squared loss, zero-one loss, etc.). The decomposition is not always additive. He notes that bias-variance tradeoff behaviour depends on the loss function. Domingos also shows that Schapire's margin-based approach is equivalent to the bias-variance-based approach: an ensemble's generalization error can be expressed either via the distribution of the margins or via a bias-variance decomposition of the error.
2004
Valentini and Dietterich perform bias-variance analysis of Support Vector Machines (SVMs) to gain insight into how SVMs learn. They observe the expected bias-variance tradeoff, but they also see complex relationships, especially with Gaussian and polynomial kernels. They propose how bias-variance decomposition can be used to develop ensemble methods with SVMs as base learners.
2018
Neal et al. observe that since the mid-2010s, empirical results show that wider networks generalize better. The classical U-shaped test error curve due to the bias-variance tradeoff is being defied by neural networks. In their experiments, they show that both bias and variance decrease as more parameters are added to the network.
Sample Code
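A minimal sketch of bias-variance estimation for k-NN regression, in the spirit of the article's examples. The process \(f(x)=x^2\), the noise level and the sample sizes are illustrative assumptions; the run shows variance falling and bias rising as \(k\) grows.

```python
import random
import statistics

def knn(train, x, k):
    """k-NN regression: average the y-values of the k nearest training points."""
    return statistics.mean(
        y for _, y in sorted(train, key=lambda p: abs(p[0] - x))[:k]
    )

def bias_variance(k, sims=500, n=40, sigma=0.3, x0=0.0):
    """Monte Carlo estimate of squared bias and variance of k-NN at x0."""
    f = lambda x: x * x          # the true process; note f(x0) = 0
    preds = []
    for _ in range(sims):
        # Each simulated dataset plays the role of one training sample.
        train = []
        for _ in range(n):
            x = random.uniform(-1, 1)
            train.append((x, f(x) + random.gauss(0, sigma)))
        preds.append(knn(train, x0, k))
    mean_pred = statistics.mean(preds)
    return (f(x0) - mean_pred) ** 2, statistics.pvariance(preds)

random.seed(5)
for k in (1, 5, 25):
    bias_sq, var = bias_variance(k)
    print(f"k={k:2d}  bias^2={bias_sq:.4f}  variance={var:.4f}")
```

Large \(k\) averages over neighbours far from \(x_0\), where \(f(x)=x^2\) is larger than \(f(x_0)\), so the prediction is biased upward even as averaging suppresses the noise-driven variance.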
References
 Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 2019. "Reconciling modern machine-learning practice and the classical bias–variance trade-off." PNAS, vol. 116, no. 32, pp. 15849-15854. Accessed 2020-09-17.
 Briscoe, Erica, and Jacob Feldman. 2011. "Conceptual complexity and the bias/variance tradeoff." Cognition, vol. 118, pp. 2-16, Elsevier B.V. Accessed 2020-09-17.
 Cornell University. 2005. "Bias/Variance Tradeoff." CS578, Cornell University. Accessed 2020-09-17.
 Domingos, Pedro. 2000. "A Unified Bias-Variance Decomposition and its Applications." In Proc. 17th International Conf. on Machine Learning, pp. 231-238, Morgan Kaufmann. Accessed 2020-09-17.
 Fortmann-Roe, Scott. 2012. "Understanding the Bias-Variance Tradeoff." June. Accessed 2020-09-17.
 Francois-Lavet, Vincent, Guillaume Rabusseau, Joelle Pineau, Damien Ernst, and Raphael Fonteneau. 2020. "On Overfitting and Asymptotic Bias in Batch Reinforcement Learning with Partial Observability." Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp. 5055-5059, July. Accessed 2020-09-17.
 Geman, Stuart, Elie Bienenstock, and René Doursat. 1992. "Neural Networks and the Bias/Variance Dilemma." Neural Computation, vol. 4, no. 1, pp. 1-58, January. Accessed 2020-09-17.
 Grenander, Ulf. 1952. "On empirical spectral analysis of stochastic processes." Arkiv för Matematik, vol. 1, no. 6, pp. 503-531. Accessed 2020-09-17.
 Hastie, Trevor and Robert Tibshirani. 1986. "Generalized Additive Models." Statistical Science, vol. 1, no. 3, pp. 297-310. Accessed 2020-09-17.
 Neal, Brady, Sarthak Mittal, Aristide Baratin, Vinayak Tantia, Matthew Scicluna, Simon Lacoste-Julien, and Ioannis Mitliagkas. 2019. "A Modern Take on the Bias-Variance Tradeoff in Neural Networks." arXiv, v4, December 18. Accessed 2020-09-17.
 Rojas, Raúl. 2015. "The Bias-Variance Dilemma." February 10. Accessed 2020-09-17.
 Sharma, Rahul, Aditya V. Nori, and Alex Aiken. 2014. "Bias-Variance Tradeoffs in Program Analysis." POPL ’14, ACM, January 22-24. Accessed 2020-09-17.
 Stansbury, Dustin. 2020. "Model Selection: Underfitting, Overfitting, and the Bias-Variance Tradeoff." The Clever Machine, July 20. Accessed 2020-09-17.
 Valentini, Giorgio, and Thomas G. Dietterich. 2004. "Bias-Variance Analysis of Support Vector Machines for the Development of SVM-Based Ensemble Methods." Journal of Machine Learning Research, vol. 5, pp. 725-775. Accessed 2020-09-17.
 Wahba, G. and S. Wold. 1975. "A completely automatic french curve: fitting spline functions by cross validation." Communications in Statistics, vol. 4, no. 1. doi: 10.1080/03610927508827223. Accessed 2020-09-19.
 Wikipedia. 2020. "Bias–variance tradeoff." Wikipedia, September 10. Accessed 2020-09-17.
 Wågberg, Johan. 2020. "Lecture 5 – Cross-validation and the bias-variance tradeoff." In: Statistical Machine Learning, Uppsala University. Accessed 2020-09-17.
 Yang, Zitong, Yaodong Yu, Chong You, Jacob Steinhardt, and Yi Ma. 2020. "Rethinking Bias-Variance Trade-off for Generalization of Neural Networks." arXiv, v2, March 21. Accessed 2020-09-17.
 Yu, Lean, Kin Keung Lai, Shouyang Wang, and Wei Huang. 2006. "A Bias-Variance-Complexity Trade-Off Framework for Complex System Modeling." In: M. Gavrilova et al. (eds.), ICCSA 2006, LNCS 3980, pp. 518-527, Springer-Verlag Berlin Heidelberg. Accessed 2020-09-17.
Further Reading
 Geman, Stuart, Elie Bienenstock, and René Doursat. 1992. "Neural Networks and the Bias/Variance Dilemma." Neural Computation, vol. 4, no. 1, pp. 1-58, January. Accessed 2020-09-17.
 Rojas, Raúl. 2015. "The Bias-Variance Dilemma." February 10. Accessed 2020-09-17.
 Neal, Brady, Sarthak Mittal, Aristide Baratin, Vinayak Tantia, Matthew Scicluna, Simon Lacoste-Julien, and Ioannis Mitliagkas. 2019. "A Modern Take on the Bias-Variance Tradeoff in Neural Networks." arXiv, v4, December 18. Accessed 2020-09-17.
 Neal, Brady. 2019. "On the Bias-Variance Tradeoff: Textbooks Need an Update." M.Sc. Thesis, Université de Montréal, December 10. Accessed 2020-09-17.
 Brownlee, Jason. 2016. "Gentle Introduction to the Bias-Variance Trade-Off in Machine Learning." Machine Learning Mastery, March 18. Updated 2019-10-25. Accessed 2020-09-17.
 Brownlee, Jason. 2020. "How to Calculate the Bias-Variance Trade-off with Python." Machine Learning Mastery, August 19. Updated 2020-08-26. Accessed 2020-09-17.
See Also
 Overfitting and Underfitting
 Ensemble Learning
 Boosting (Machine Learning)
 Regression Modelling
 Analysis of Variance
 Machine Learning