Manufacturing polynomials using a sigmoid neural network — practicum
In the previous article I discussed a possible way to manufacture a polynomial basis (in the linear-algebra sense) for approximating a polynomial function with a sigmoid neural network. This article is a practical test of that idea: does the bias constraint work as predicted, and how does it compare with a sigmoid neural network without the bias constraint? We assume a quadratic true model of the form y = ax²+bx+c+ϵ; for convenience we set a=3, b=2, c=1, ϵ=0.
The code
Repository: https://github.com/SNaveenMathew/ml_book/tree/master/polynomial_regression
Files (in the required order):
Bias constrained sigmoid neural network
Model definition
yᵢ(x)=σ(wᵢx+bᵢ); Output=B₀+W₁y₁+W₂y₂
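For concreteness, the model above can be sketched as a plain NumPy forward pass. The parameter values plugged in below are the trained values reported later in this article; since the trained loss is ~0.0458, the network matches the true curve only approximately.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w1, w2, W1, W2, B0,
            b1=0.0, b2=np.log(2 - np.sqrt(3))):
    """Two-hidden-unit network: Output = B0 + W1*y1 + W2*y2,
    where y_i = sigmoid(w_i * x + b_i)."""
    y1 = sigmoid(w1 * x + b1)
    y2 = sigmoid(w2 * x + b2)
    return B0 + W1 * y1 + W2 * y2

# Trained parameters reported later in this article
x = np.array([-1.0, 0.0, 1.0])
y_hat = forward(x, 0.192, 0.396, -549.653, 427.676, 185.346)
# True curve y = 3x^2 + 2x + 1 gives [2, 1, 6]; y_hat tracks it approximately
```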
Bias constraints
b₁=0, b₂=log(2 − √3) ≈ -1.317. These two biases were held fixed during training, while (w₁, w₂, W₁, W₂, B₀) were left unconstrained and updated as usual.
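My reading of the previous article's rationale for these particular values: b₁=0 makes σ″(b₁)=0 (suppressing the quadratic term in y₁'s Taylor expansion, leaving it locally linear), while b₂=log(2 − √3) makes σ‴(b₂)=0 (suppressing the cubic term in y₂, leaving the quadratic term dominant). A quick numerical check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivs(z):
    """First three derivatives of the sigmoid in terms of s = sigmoid(z):
    s' = s(1-s),  s'' = s'(1-2s),  s''' = s''(1-2s) - 2(s')^2."""
    s = sigmoid(z)
    d1 = s * (1 - s)
    d2 = d1 * (1 - 2 * s)
    d3 = d2 * (1 - 2 * s) - 2 * d1**2
    return d1, d2, d3

b1 = 0.0
b2 = np.log(2 - np.sqrt(3))           # ~ -1.317

_, d2_at_b1, _ = sigmoid_derivs(b1)   # should vanish: no quadratic term in y1
_, _, d3_at_b2 = sigmoid_derivs(b2)   # should vanish: no cubic term in y2
```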
Training
A training set of 10000 observations was generated with x~N(0, 1) and y = 3x²+2x+1 (both the data generation and the true model are defined in quadratic_model.py). Since the objective was not model selection or hyperparameter tuning, all 10000 observations were used for training.
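A minimal sketch of the data generation; the seed below is my choice, and the actual code lives in quadratic_model.py in the repository.

```python
import numpy as np

rng = np.random.default_rng(0)        # seed chosen for reproducibility here
x = rng.normal(0.0, 1.0, size=10000)  # x ~ N(0, 1)
y = 3 * x**2 + 2 * x + 1              # true model with eps = 0
```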
Weights
The model was trained for a very large number of epochs. The final model parameters were:
(w₁, w₂, W₁, W₂, B₀) = (0.192, 0.396, -549.653, 427.676, 185.346); loss~0.0458
Visualizing y₁, y₂ vs x
The visual representation suggests that the model is learning what we expect it to learn: y₁ captures the linear part and y₂ the quadratic part. But is this inference accurate?
Deep dive
y₁
Fitting y₁=a₁x+b₁+ϵ₁ (loss ~ 10⁻⁷, very small) gives a₁ ~ 0.0476, b₁ ~ 0.5
Fitting a cubic (y₁=a₁x³+b₁x²+c₁x+d₁+ϵ₁) to verify, we get: a₁ ~ -0.0001, b₁ ~ 0.000000504, c₁ ~ 0.048, d₁ ~ 0.5 over x ∈ (-3.66, 4.03). The quadratic and cubic terms are much smaller in magnitude than the constant and linear terms.
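These polynomial fits can be reproduced with np.polyfit; here the hidden-unit output is reconstructed from the reported w₁ (with b₁ = 0 by constraint) rather than loaded from the trained model, so the coefficients will only approximately match the values above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Reconstruct the hidden-unit output from the reported weight w1 = 0.192
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10000)
y1 = sigmoid(0.192 * x)

c_lin = np.polyfit(x, y1, 1)  # [slope, intercept]
c_cub = np.polyfit(x, y1, 3)  # [cubic, quadratic, linear, constant]
```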
y₂
Fitting y₂=a₂x²+b₂x+c₂+ϵ₂ (loss ~ 10⁻⁷, very small) gives a₂ ~ 0.007, b₂ ~ 0.0658, c₂ ~ 0.2116
Fitting a cubic (y₂=a₂x³+b₂x²+c₂x+d₂+ϵ₂) to verify, we get: a₂ ~ -0.0001, b₂ ~ 0.007, c₂ ~ 0.0662, d₂ ~ 0.2116 over x ∈ (-3.66, 4.03). The cubic term is much smaller in magnitude than the constant, linear and quadratic terms.
Diagnostics
y₁
dy₁/dx was estimated by sorting x and taking the forward-difference approximation Δy₁/Δx at the discrete points of x in the training set. We observe that y₁ is almost linear in x, with faint signatures of higher-order polynomial terms. The sample median of Δy₁/Δx is 0.04784, which gives 0.04784 as the approximation for the coefficient of the linear term.
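A sketch of this forward-difference diagnostic, again reconstructing y₁ from the reported weights rather than the trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Reconstruct y1 from reported weights (w1 = 0.192, b1 = 0) on a sorted sample
rng = np.random.default_rng(0)
x = np.sort(rng.normal(0.0, 1.0, size=10000))
y1 = sigmoid(0.192 * x)

# Forward-difference approximation of dy1/dx at the discrete training points
dy1_dx = np.diff(y1) / np.diff(x)
slope_est = np.median(dy1_dx)  # robust summary, as in the text
```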
y₂
We observe that d²y₂/dx² is almost constant. Its sample-median estimate is 0.0973, which implies a coefficient of 0.0486 for the quadratic term (half the second derivative). This differs from the theoretical value of 0.0455.
Final polynomial function approximation
Fitting a linear model to the outcome, we obtain the MLE for the equation y = w₁y₁+w₂y₂+w₃+ϵ₃ as w₁~-551.0261, w₂~428.6562, w₃~185.8187; the neural network itself gives w₁~-549.6529, w₂~427.67603, w₃~185.34625. Multiplying the MLE weights by the coefficients of the linear and quadratic terms fitted to y₁ and y₂ respectively, we obtain the final estimate as a function of x: y = a₃x²+b₃x+c₃+ϵ₄, where a₃~2.979, b₃~2.001, c₃~1.021.
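The composition can be checked by hand from the rounded coefficients reported above; small discrepancies against (2.979, 2.001, 1.021) come from rounding in the intermediate values.

```python
# Rounded estimates reported above
W1, W2, B0 = -551.0261, 428.6562, 185.8187  # final-layer MLE weights
a1, c1 = 0.0476, 0.5                        # linear fit to y1: y1 ~ a1*x + c1
a2, b2, c2 = 0.007, 0.0658, 0.2116          # quadratic fit to y2

a3 = W2 * a2                 # coefficient of x^2
b3 = W1 * a1 + W2 * b2       # coefficient of x
c3 = B0 + W1 * c1 + W2 * c2  # constant term
```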
The true model used in the analysis was y = 3x²+2x+1+ϵ
The unconstrained neural network fit
In the unconstrained model, none of the weights or biases is held fixed during training.
y₁=a₁x²+b₁x+c₁+ϵ₁
a₁~0.0246, b₁~-0.0625, c₁~0.0425
y₂=a₂x²+b₂x+c₂+ϵ₂
a₂~0.0190, b₂~0.0573, c₂~0.0580
Final layer model: y = w₁y₁+w₂y₂+w₃+ϵ₃
w₁~51.5308, w₂~91.1138, w₃~-6.4749
Putting the pieces together, we get the estimate for y = a₃x²+b₃x+c₃+ϵ₄:
a₃~2.9988, b₃~2.0001, c₃~0.9998
Drawbacks of the bias constrained fit
- There is a high correlation (0.9889) between the two hidden-unit outputs y₁ and y₂. This multicollinearity may slow convergence and make the parameter estimates unstable. The corresponding correlation in the final layer of the unconstrained model is much lower in magnitude: -0.5862
- The unconstrained fit is closer to the ground truth than the constrained fit, in terms of both MSE and coefficient estimates
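As a sanity check on the multicollinearity claim, the 0.9889 correlation can be reproduced from the reported weights of the constrained model; x is resampled from N(0, 1) here, so the value matches only approximately.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hidden-unit outputs of the constrained model, reconstructed from reported weights
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10000)
y1 = sigmoid(0.192 * x)
y2 = sigmoid(0.396 * x + np.log(2 - np.sqrt(3)))

corr = np.corrcoef(y1, y2)[0, 1]  # should be close to the reported 0.9889
```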