Naveen Mathew Nathan S.
1 min read · Jan 26, 2024


I think we digressed. Let's take a simple training set for a regression model: (1, 1), (2, 2), (3, 3), ..., (100, 100). It is possible to fit an overparameterized ReLU network (not just a single linear neuron) that interpolates this training set perfectly; the fit should be y = f(x) = x. Predicting at x = infinity then gives y = infinity, and predicting at x = 2.5 gives y = 2.5. However, with non-linear activations in the initial layers it is also possible to fit y = g(x) such that g(1) = 1, g(2) = 2, ..., g(100) = 100, yet g(2.5) can be close to 2.5 or very different from it, depending on the complexity of those initial non-linear transformations. Moreover, g(x) may not diverge as x -> infinity (or at least not as fast as f(x) = x). This is helpful in situations where we know the output of the regression model is bounded.
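
The sketch below (my own illustration, not code from the original discussion) makes this concrete: two small networks are trained on (1, 1), ..., (100, 100), one with ReLU activations throughout and one with a tanh first layer. The architecture, widths, optimizer, and step counts are arbitrary choices for the example; exact predictions will vary with initialization and training, but the ReLU net tends to keep growing for large x while the tanh-first-layer net tends to flatten out.

```python
# Minimal sketch: compare extrapolation of two overparameterized regressors
# trained on the points (1, 1), (2, 2), ..., (100, 100).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Training data: y = x for x = 1, ..., 100.
x = torch.arange(1.0, 101.0).unsqueeze(1)
y = x.clone()


def make_net(first_activation: nn.Module) -> nn.Sequential:
    # Two hidden layers of width 64 -- far more parameters than 100 points need.
    return nn.Sequential(
        nn.Linear(1, 64), first_activation,
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 1),
    )


def train(net: nn.Sequential, steps: int = 5000) -> nn.Sequential:
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(net(x), y)
        loss.backward()
        opt.step()
    return net


relu_net = train(make_net(nn.ReLU()))  # piecewise-linear, can extrapolate like y = x
tanh_net = train(make_net(nn.Tanh()))  # saturating first layer, bounded output overall

test = torch.tensor([[2.5], [1000.0]])
with torch.no_grad():
    print("ReLU-only net:     ", relu_net(test).squeeze().tolist())
    print("tanh-first-layer net:", tanh_net(test).squeeze().tolist())
```

In runs of this sketch, both networks fit the training range closely and predict something near 2.5 in between the training points, but at x = 1000 the ReLU network typically keeps increasing while the tanh-first-layer network saturates, which is the bounded-output behavior described above.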
