Introduction
This blog post covers how neural networks sidestep the bias-variance tradeoff via the double dip (also called double descent). The topic is of interest because it bridges classical data science techniques and neural networks, and it gets us past the silly framing that classical data science is old and useless while neural networks are new and useful.
The blog post title is a one-line summary of the YouTube video "What the Books Get Wrong about AI [Double Descent]" by Welch Labs. In other words, this blog post is a summary of the material in that video.
The goal of the blog post is to introduce and demystify this complex topic for a broad audience.
Classic Data Science Talks About the Bias-Variance Tradeoff
A classic text in classical data science, such as The Elements of Statistical Learning (ESL) by Trevor Hastie and co-authors, contains a graph like the following.
Estimate a Parabola with a Line
Let's start with the simplest possible geometry: estimating parabolic data with a line. This is a clear example of underfitting.
Compute the associated mean squared error. The mean squared error is the classic criterion used to evaluate the performance of an estimate. It has downsides, but those are beyond the scope of this blog post.
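To make this concrete, below is a minimal sketch (my own, not the video's code) that fits a straight line to noisy parabolic data with NumPy and prints the resulting mean squared error. The sample size, noise level, and random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = x**2 + rng.normal(scale=0.05, size=x.shape)  # noisy parabolic data

# Degree-1 (straight line) least-squares fit: a deliberately underfit model.
coeffs = np.polyfit(x, y, deg=1)
y_hat = np.polyval(coeffs, x)

mse = np.mean((y - y_hat) ** 2)
print(f"line fit MSE: {mse:.4f}")
```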
Estimate a Parabola with Higher Order Polynomials
Next, let's estimate parabolic data with higher order polynomials to demonstrate overfitting.
Notice that the right-hand curve up above corresponds to the bias-variance curve of the first figure shown in this blog post (Figure 2.11 of ESL).
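To see that curve emerge numerically, here is another minimal sketch (again my own, using the same synthetic-parabola setup as above) that sweeps the polynomial degree and prints training and test mean squared error. Training error keeps shrinking as the degree grows, while test error typically traces the classic U shape: high at degree 1 (underfitting), low near degree 2, and rising again at high degrees (overfitting).

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 20)
x_test = np.linspace(-1, 1, 200)
y_train = x_train**2 + rng.normal(scale=0.1, size=x_train.shape)
y_test = x_test**2  # noise-free targets for evaluation

for degree in [1, 2, 5, 9, 15]:
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```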
Use Regularization to Prevent Overfitting
The YouTube video uses dropout, weight decay, and other regularization techniques.
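As one illustration of the idea (not the video's code), the sketch below applies weight decay, i.e., an L2 penalty on the coefficients, to an overfit degree-15 polynomial using scikit-learn's Ridge. The degree and penalty strength are my own choices; the penalized fit typically generalizes better than the unpenalized one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 20).reshape(-1, 1)
y_train = x_train.ravel()**2 + rng.normal(scale=0.1, size=20)
x_test = np.linspace(-1, 1, 200).reshape(-1, 1)
y_test = x_test.ravel()**2

for name, readout in [("no penalty  ", LinearRegression()),
                      ("weight decay", Ridge(alpha=1e-2))]:
    # Same high-degree polynomial features; only the penalty differs.
    model = make_pipeline(PolynomialFeatures(degree=15, include_bias=False), readout)
    model.fit(x_train, y_train)
    test_mse = np.mean((model.predict(x_test) - y_test) ** 2)
    print(f"{name}: test MSE {test_mse:.4f}")
```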
It then refers to the paper "Understanding Deep Learning Requires Rethinking Generalization" by C. Zhang et al. Below are some excerpts from the paper:
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training.
Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice.
Neural Networks Cause a Double Dip in the Bias-Variance Tradeoff
The paper "Reconciling Modern Machine Learning and the Bias-Variance Trade-Off" by M. Belkin, .. provides the details on how neural networks increase complexity causing a double dip in the bias variance tradeoff. Below is a figure from the paper.
Notice that the paper states that "We provide evidence for the existence and ubiquity of double descent for a wide spectrum of models and datasets, and we posit a mechanism for its emergence". In other words, this is empirical evidence, not a mathematical proof, so in some cases the double dip phenomenon may not appear.
The double dip phenomenon appears to contradict classical theory. It does not. It reveals that the classical U-shaped curve of test error is incomplete and that a more complex relationship exists, especially for neural networks.
To better understand the double dip, let's use an analogy. Imagine trying to connect a few dots on a page. With a stiff ruler (a simple model), you get a poor fit. With a flexible thread (a moderately complex model), you get a good, smooth fit. With a thread that is just the right length to pass through each dot with sharp angles (the interpolation threshold), the path between dots is wild and inaccurate. But if you have a much longer, very flexible thread (a highly over-parameterized model), it can pass through all the dots while lying in smooth, gentle curves between them, leading to a better overall path.
Below is a figure from the YouTube video that recreates the double dip using digit identification.
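The sketch below is not the video's code but a commonly used stand-in for reproducing the effect: a random-feature model on scikit-learn's small digits dataset with a minimum-norm least-squares readout. The widths, seed, and 300-sample training set are arbitrary choices. As the number of random features passes the number of training samples (the interpolation threshold), the test error typically spikes and then descends again, giving the double dip.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)
X = X / 16.0                   # scale pixel values to [0, 1]
Y = np.eye(10)[y]              # one-hot targets for least squares
X_tr, X_te, Y_tr, Y_te, y_tr, y_te = train_test_split(
    X, Y, y, train_size=300, random_state=0)

def relu_features(X, W):
    """Fixed random projection followed by a ReLU nonlinearity."""
    return np.maximum(X @ W, 0.0)

for width in [30, 100, 300, 1000, 3000]:  # 300 features = interpolation threshold
    W = rng.normal(size=(X.shape[1], width)) / np.sqrt(X.shape[1])
    F_tr, F_te = relu_features(X_tr, W), relu_features(X_te, W)
    # np.linalg.lstsq returns the minimum-norm solution once width > 300
    B, *_ = np.linalg.lstsq(F_tr, Y_tr, rcond=None)
    test_err = np.mean(np.argmax(F_te @ B, axis=1) != y_te)
    print(f"width {width:5d}: test error {test_err:.3f}")
```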
Code
To go to the code associated with the YouTube video, click here.
Summary
Classical data science talks about the bias-variance tradeoff and provides regularization techniques to manage it. In the classical view, the tradeoff isn't a problem to be eliminated but a fundamental dilemma to be navigated.
However, stating that the double dip of neural networks overcomes the bias-variance tradeoff would be an overstatement. The double dip doesn't invalidate the tradeoff. Instead, it reveals that the classical U-shaped curve of test error is incomplete and that a more complex relationship exists, especially for neural networks.
Please recall that there is no free lunch. Neural networks have their own disadvantages. They have massive computational costs as well as massive data requirements.






