Manufacturing polynomials using a sigmoid neural network — practicum

Naveen Mathew Nathan S.
Aug 15, 2022

In the previous article I discussed a possible way to create polynomial bases (plural of basis in linear algebra) to approximate a polynomial function using a sigmoid neural network. This article is a practical test of the idea — does the bias constraint work as predicted, and how does it compare with a sigmoid neural network without the bias constraint? We will assume a quadratic true model of the form y = ax²+bx+c+ϵ; for convenience, ϵ=0, c=1, b=2, a=3.

The code

Repository: https://github.com/SNaveenMathew/ml_book/tree/master/polynomial_regression

Files (in the required order):

  1. quadratic_model.py
  2. test_quadratic.py
  3. test_quadratic_unconstrained.py

Bias constrained sigmoid neural network

Model definition

n = 2 hidden units; hidden layer activation: sigmoid; output layer activation: linear

yᵢ(x)=σ(wᵢx+bᵢ); Output=B₀+W₁y₁+W₂y₂

Bias constraints

b₁ = 0, b₂ = log(2 − √3) ~ -1.317. These biases were held fixed (not updated) during training, whereas (w₁, w₂, W₁, W₂, B₀) are unconstrained and learned.
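A minimal sketch of this model definition and bias constraint, written in PyTorch for illustration (the repository's quadratic_model.py may be implemented with a different framework or layout):

```python
import math
import torch

# Sketch of the bias-constrained network: two sigmoid hidden units with fixed
# biases, followed by a linear output layer. The biases are stored as a
# non-trainable buffer so only (w1, w2, W1, W2, B0) are updated during training.
class ConstrainedSigmoidNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Trainable input weights w1, w2 (one per hidden unit).
        self.w = torch.nn.Parameter(torch.randn(2))
        # Fixed biases: b1 = 0, b2 = log(2 - sqrt(3)) ~ -1.317 (never updated).
        self.register_buffer("b", torch.tensor([0.0, math.log(2.0 - math.sqrt(3.0))]))
        # Trainable output layer: Output = B0 + W1*y1 + W2*y2.
        self.out = torch.nn.Linear(2, 1)

    def forward(self, x):
        # x has shape (n, 1); hidden activations y_i = sigmoid(w_i * x + b_i).
        y = torch.sigmoid(x * self.w + self.b)
        return self.out(y)
```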

Training

Use test_quadratic.py

A data set of 10000 observations was generated with x ~ N(0, 1) and y = 3x² + 2x + 1 (the training set generation and the true model are defined in quadratic_model.py). Since the objective is not model selection or hyperparameter tuning, all 10000 observations were used for training.
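A sketch of the data generation and training loop under these settings; the optimizer, learning rate, and epoch count below are illustrative assumptions rather than the exact configuration of test_quadratic.py:

```python
# Generate the training data and fit the bias-constrained network (full batch).
torch.manual_seed(0)
x = torch.randn(10000, 1)            # x ~ N(0, 1)
y = 3 * x ** 2 + 2 * x + 1           # true model with epsilon = 0

model = ConstrainedSigmoidNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)  # assumed optimizer/lr
loss_fn = torch.nn.MSELoss()

for epoch in range(50000):           # train for a very large number of epochs
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```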

Weights

The model was trained for a very large number of epochs. The final model parameters were:

(w₁, w₂, W₁, W₂, B₀) = (0.192, 0.396, -549.653, 427.676, 185.346); loss~0.0458

Visualizing y₁, y₂ vs x

Figure: y₁ = σ(w₁x + b₁) vs x
Figure: y₂ = σ(w₂x + b₂) vs x
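A quick way to reproduce these plots; x, y1, y2 are assumed to be 1-D numpy arrays holding the inputs and the trained hidden activations (e.g., obtained from the tensors above via .detach().numpy().ravel()):

```python
import matplotlib.pyplot as plt

# Scatter the two hidden activations against x.
plt.scatter(x, y1, s=2, label="y1 = sigmoid(w1*x + b1)")
plt.scatter(x, y2, s=2, label="y2 = sigmoid(w2*x + b2)")
plt.xlabel("x")
plt.legend()
plt.show()
```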

From the plots we can tentatively infer that the model is learning what we expect it to learn: y₁ captures the (approximately) linear part and y₂ the quadratic part. But is this inference accurate?

Deep dive

y₁

Figure: loss vs order of polynomial fit for y₁

Fitting y₁ = a₁x + b₁ + ϵ₁ (loss ~ 10⁻⁷, which is very small) gives a₁ ~ 0.0476, b₁ ~ 0.5

Fitting a cubic (y₁ = a₁x³ + b₁x² + c₁x + d₁ + ϵ₁) to verify, we get: a₁ ~ -0.0001, b₁ ~ 0.000000504, c₁ ~ 0.048, d₁ ~ 0.5; x ∈ (-3.66, 4.03). The quadratic and cubic terms are much smaller in magnitude than the constant and linear terms.
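A sketch of this polynomial check (the same check is applied to y₂ below); np.polyfit is used here as a stand-in for whatever least-squares routine the repository actually uses:

```python
import numpy as np

# Regress a hidden activation on powers of x and inspect the fitted coefficients.
# np.polyfit returns coefficients from the highest degree down to the constant.
def poly_check(x, y_hidden, degree):
    coeffs = np.polyfit(x, y_hidden, deg=degree)
    mse = np.mean((np.polyval(coeffs, x) - y_hidden) ** 2)
    return coeffs, mse

# linear_coeffs, linear_mse = poly_check(x, y1, 1)   # ~[0.0476, 0.5]
# cubic_coeffs, cubic_mse = poly_check(x, y1, 3)
```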

y₂

Figure: loss vs order of polynomial fit for y₂

Fitting y₂ = a₂x² + b₂x + c₂ + ϵ₂ (loss ~ 10⁻⁷, which is very small) gives a₂ ~ 0.007, b₂ ~ 0.0658, c₂ ~ 0.2116

Fitting a cubic (y₂=a₂x³+b₂x²+c₂x+d₂+ϵ₂) to verify, we get: a₂ ~ -0.0001, b₂ ~ 0.007, c₂ ~ 0.0662, d₂ ~ 0.2116; x ∈ (-3.66, 4.03). The cubic term is much smaller in magnitude compared to the constant, linear and quadratic terms.

Diagnostics

y₁

Figure: forward difference dy₁/dx vs x

dy₁/dx was approximated by sorting the training values of x and taking the forward difference Δy₁/Δx between consecutive points. We observe that y₁ is almost linear in x, with faint signatures of higher-order polynomial terms. The sample median of dy₁/dx is 0.04784, which gives 0.04784 as the approximation for the coefficient of the linear term.
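A sketch of the forward-difference diagnostic; x and y1 are assumed to be 1-D numpy arrays of the inputs and the first hidden activation:

```python
# Sort the training points by x and approximate dy1/dx by the slope between
# consecutive points.
order = np.argsort(x)
xs, y1s = x[order], y1[order]
dy1_dx = (y1s[1:] - y1s[:-1]) / (xs[1:] - xs[:-1])
print(np.median(dy1_dx))   # sample median ~0.04784 in the constrained fit
```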

y₂

Figure: forward difference dy₂/dx vs x
Figure: forward difference d²y₂/dx² vs x
Figure: histogram of the forward difference d²y₂/dx² in the range (-0.5, 0.5)

We observe that d²y₂/dx² is almost constant. The sample median of d²y₂/dx² is 0.0973, which implies an approximate coefficient of 0.0486 for the quadratic term. This estimate differs from the theoretical value of 0.0455.
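The same trick applied twice gives a discrete second derivative; a sketch continuing from the sorted arrays above (y2 is the second hidden activation as a 1-D numpy array):

```python
# First differences of y2, then differences of those slopes, on the sorted grid xs.
y2s = y2[order]
dy2_dx = (y2s[1:] - y2s[:-1]) / (xs[1:] - xs[:-1])
d2y2_dx2 = (dy2_dx[1:] - dy2_dx[:-1]) / (xs[1:-1] - xs[:-2])
print(np.median(d2y2_dx2))   # ~0.0973; implied quadratic coefficient ~0.0973 / 2
```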

Final polynomial function approximation

Fitting a linear model to the outcome, we obtain the MLE for the equation y = w₁y₁ + w₂y₂ + w₃ + ϵ₃ as w₁ ~ -551.0261, w₂ ~ 428.6562, w₃ ~ 185.8187. From the neural network the corresponding estimates are w₁ ~ -549.6529, w₂ ~ 427.67603, w₃ ~ 185.34625. Multiplying the MLE weights by the coefficients of the linear and quadratic terms of x obtained from y₁ and y₂ respectively, we get the final estimate as a function of x: y = a₃x² + b₃x + c₃ + ϵ₄, where a₃ ~ 2.979, b₃ ~ 2.001, c₃ ~ 1.021.
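A sketch of this final-layer check; y1, y2, y are assumed to be 1-D numpy arrays, and ordinary least squares stands in for the MLE fit described above:

```python
# OLS fit of y on the two hidden activations (design matrix [y1, y2, 1]).
X = np.column_stack([y1, y2, np.ones_like(y1)])
w1_hat, w2_hat, w3_hat = np.linalg.lstsq(X, y, rcond=None)[0]
# w1_hat ~ -551.03, w2_hat ~ 428.66, w3_hat ~ 185.82 for the fit reported above;
# combining these with the linear fit to y1 and the quadratic fit to y2 gives
# y ~ 2.979 x^2 + 2.001 x + 1.021.
```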

The true model used in the analysis was y = 3x²+2x+1+ϵ

The unconstrained neural network fit

In the unconstrained model, no weights or biases are fixed; all parameters are learned during training.
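A sketch of the unconstrained counterpart, again in PyTorch for illustration (test_quadratic_unconstrained.py may be structured differently):

```python
# Standard two-unit sigmoid hidden layer with trainable biases, linear output.
unconstrained = torch.nn.Sequential(
    torch.nn.Linear(1, 2),    # w1, w2 and free biases b1, b2
    torch.nn.Sigmoid(),
    torch.nn.Linear(2, 1),    # W1, W2 and B0
)
```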

Fitting y₁ = a₁x² + b₁x + c₁ + ϵ₁ gives a₁ ~ 0.0246, b₁ ~ -0.0625, c₁ ~ 0.0425

Figure: y₁ vs x in the unconstrained model

Fitting y₂ = a₂x² + b₂x + c₂ + ϵ₂ gives a₂ ~ 0.0190, b₂ ~ 0.0573, c₂ ~ 0.0580

Figure: y₂ vs x in the unconstrained model

Final layer model: y = w₁y₁+w₂y₂+w₃+ϵ₃

w₁~51.5308, w₂~91.1138, w₃~-6.4749

Putting the pieces together, we get the estimate for y = a₃x²+b₃x+c₃+ϵ₄:

a₃~2.9988, b₃~2.0001, c₃~0.9998
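A worked check of how the pieces combine: the final-layer weights scale and sum the quadratic approximations of y₁ and y₂ reported above.

```python
# Combine the final-layer weights with the quadratic fits to y1 and y2.
w1, w2, w3 = 51.5308, 91.1138, -6.4749
a1, b1, c1 = 0.0246, -0.0625, 0.0425   # quadratic fit to y1
a2, b2, c2 = 0.0190, 0.0573, 0.0580    # quadratic fit to y2

a3 = w1 * a1 + w2 * a2            # ~2.9988
b3 = w1 * b1 + w2 * b2            # ~2.0001
c3 = w1 * c1 + w2 * c2 + w3       # ~0.9998
```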

Drawbacks of the bias constrained fit

  • There is a high correlation (0.9889) between the two hidden outputs y₁ and y₂. This multicollinearity may delay convergence and make the parameter estimates unstable. In contrast, the correlation between the corresponding hidden outputs of the unconstrained model is much lower in magnitude: -0.5862 (see the sketch after this list)
  • The unconstrained fit is closer to the ground truth than the constrained fit, in terms of both MSE and coefficients
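The correlations quoted above can be checked directly; the variable names below (hidden activations of the constrained and unconstrained fits, as 1-D numpy arrays) are hypothetical:

```python
# Pearson correlation between the two hidden activations of each fit.
print(np.corrcoef(y1_constrained, y2_constrained)[0, 1])       # ~0.9889
print(np.corrcoef(y1_unconstrained, y2_unconstrained)[0, 1])   # ~-0.5862
```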
