
Bias-Variance Trade-off

  • In statistical inference, models with too many parameters tend to overfit the data, memorizing it instead of generalizing.
  • Deep neural networks, despite having massive numbers of parameters, somehow still generalize well.

Double Descent Phenomenon

  • As model complexity (number of parameters) increases, test error initially decreases.
  • As complexity continues to increase, test error rises to a peak (typically near the point where the model can just fit the training data exactly) and then surprisingly decreases again; see the sketch after this list.
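
One standard setting where this curve can be reproduced numerically (not part of the original notes) is ridgeless regression on random ReLU features: a one-hidden-layer network whose hidden weights are frozen at random values, with only the linear output layer fitted. Everything below (target function, noise level, widths) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(3 * x)  # assumed ground-truth function

# Small 1-D regression task with noisy labels.
n_train = 30
x_train = rng.uniform(-1, 1, n_train)
y_train = target(x_train) + 0.1 * rng.standard_normal(n_train)
x_test = np.linspace(-1, 1, 500)
y_test = target(x_test)

def relu_features(x, W, b):
    # One hidden layer with frozen random weights; "width" plays the
    # role of model complexity because only the output layer is fitted.
    return np.maximum(0.0, np.outer(x, W) + b)

print(f"{'width':>6} {'test MSE':>10}")
for width in (5, 10, 20, 25, 30, 35, 40, 60, 100, 300):
    W = rng.standard_normal(width)
    b = rng.uniform(-1, 1, width)
    Phi_train = relu_features(x_train, W, b)
    Phi_test = relu_features(x_test, W, b)
    # Ridgeless (minimum-norm) least-squares fit of the output weights.
    coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    mse = np.mean((Phi_test @ coef - y_test) ** 2)
    print(f"{width:>6} {mse:>10.4f}")
```

The exact numbers depend on the random seed, but the test error typically falls, spikes near width 30 (the interpolation threshold, where the width equals the number of training points), and then falls again for much larger widths.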

Possible Explanations

  • Implicit regularization in neural networks: Neural networks might have inherent mechanisms that prevent overfitting even with numerous parameters.
  • Dynamics of gradient descent: The optimization process of gradient descent could lead to solutions that generalize better (see the sketch after this list).
  • Stochastic gradient descent: The noise introduced by stochastic gradient descent could act as a form of regularization.
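
One concrete instance of the gradient-descent explanations above can be checked directly: on an overparameterized linear least-squares problem, plain gradient descent initialized at zero converges to the minimum-L2-norm weight vector among all those that fit the training data exactly, so the optimizer itself supplies a form of regularization. The problem sizes and learning rate in this sketch are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparameterized linear regression: more parameters than data points,
# so infinitely many weight vectors fit the training data exactly.
n_samples, n_features = 20, 100
X = rng.standard_normal((n_samples, n_features))
y = rng.standard_normal(n_samples)

# Plain gradient descent on the squared loss, starting from w = 0.
w = np.zeros(n_features)
lr = 0.01
for _ in range(50_000):
    grad = X.T @ (X @ w - y) / n_samples
    w -= lr * grad

# The minimum-L2-norm interpolating solution, via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print("training residual        :", np.linalg.norm(X @ w - y))
print("distance to min-norm sol.:", np.linalg.norm(w - w_min_norm))
```

Both printed numbers should be close to zero: gradient descent never leaves the row space of X when started at zero, so among all interpolating solutions it ends up at the smallest one.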

Polynomial Example

  • Overfitting at the lower degree: When fitting a degree-10 polynomial to a small dataset of 11 points, the fit has exactly as many coefficients as data points and passes through every one of them, resulting in a highly irregular curve that does not generalize well to new data.
  • Regularization and higher-degree polynomials: However, if we increase the degree of the polynomial (e.g., to 20 or 30) and apply regularization to its coefficients, the test error can actually decrease. The polynomial still passes through all the data points, but it becomes smoother and less erratic, leading to better generalization (see the sketch after this list).
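
A minimal numpy sketch of this comparison, assuming a sine target, 11 equally spaced noisy points, and a Legendre polynomial basis (chosen for numerical conditioning; the original example does not specify one). "Regularization" is read here as picking the minimum-L2-norm coefficients among all interpolating fits, which is what np.linalg.lstsq returns once the degree exceeds 10 and the system becomes underdetermined.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)  # assumed ground-truth function

# 11 noisy training points, as in the example above, plus a dense test grid.
x_train = np.linspace(-1, 1, 11)
y_train = target(x_train) + 0.1 * rng.standard_normal(x_train.size)
x_test = np.linspace(-1, 1, 400)
y_test = target(x_test)

def test_mse(degree):
    # Legendre features keep the design matrix well conditioned.
    Phi_train = np.polynomial.legendre.legvander(x_train, degree)
    Phi_test = np.polynomial.legendre.legvander(x_test, degree)
    # Degree 10 gives an exact interpolation (11 coefficients, 11 points).
    # Degrees 20 and 30 are underdetermined, and lstsq returns the
    # minimum-L2-norm interpolant, i.e. the "regularized" coefficients.
    coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    return np.mean((Phi_test @ coef - y_test) ** 2)

for degree in (10, 20, 30):
    print(f"degree {degree:2d}: test MSE = {test_mse(degree):.4f}")
```

All three fits pass through every training point; the higher-degree fits simply do so with smaller coefficient vectors, which is the sense in which they tend to be smoother and to generalize better in this setup.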

Thoughts

There is some discussion at https://x.com/paraschopra/status/1788542691455664235. The polynomial example does not seem like a good one: regularization works well for both degree 10 and degree 20, so it emphasizes the importance of regularization but does not really illustrate the double descent phenomenon.