Manufacturing polynomials using a sigmoid neural network

Naveen Mathew Nathan S.
6 min read · Dec 29, 2021

The objective of this article is to understand how (deep) neural networks with sigmoid activation can manufacture polynomial-like functions. More specifically, the goal is only to “engineer” possible bases (plural of basis in linear algebra) for higher order polynomials in the hidden layers, so that a sigmoid neural network with appropriate depth and width can approximate any polynomial function. It is important to highlight that the objective is not to engineer the most optimal bases in the hidden layers.

For simplicity, we will assume a regression problem with OLS loss (without regularization) that relates the output variable y to a single input variable x. Also, we will assume that x is standardized to have 0 mean and unit variance, the output layer uses a linear activation, and all the hidden layers use a sigmoid activation.

As usual, my objective is to answer a few fundamental “why” questions in deep learning. Unlike my usual articles, this one will rely more on visuals and intuition than on rigorous mathematical ideas and proofs. However, it will be child’s play for seasoned deep learning experts who know how things work.

Part 0: the constant term

We all know the sigmoid function well. It is asymptotic at both extremes, which means its value is almost constant as |x| tends to large values. More precisely, we can define a region S₀ such that, on each branch of S₀, |σ(x)-c₀|<ϵ₀ ∀ x ∈ S₀ for some constant c₀ (c₀ ≈ 0 on the left branch, c₀ ≈ 1 on the right). Roughly, we can assume S₀=(-∞, -6) ⋃ (6, ∞).
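As a quick numeric check (a minimal numpy sketch of my own, not code from the article), we can confirm that σ stays within a small ϵ₀ of a constant on each branch of S₀:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# On each branch of S0 the sigmoid stays within ~0.0025 of a constant:
# c0 ~ 0 on (-inf, -6) and c0 ~ 1 on (6, inf).
x_left = np.linspace(-10.0, -6.0, 100)
x_right = np.linspace(6.0, 10.0, 100)
print(np.abs(sigmoid(x_left) - 0.0).max())   # ~0.0025
print(np.abs(sigmoid(x_right) - 1.0).max())  # ~0.0025
```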

Part 1: the linear term

σ(x) = Sigmoid(x) vs x for x ∈ (-10, 10)

I’m not going to explain the omnipresent sigmoid in this article. From the plot we infer that σ(x) looks linear (loosely speaking) with respect to x for x ∈ (-1, 1). But is it really?

For a more rigorous proof: assume y₁(w₁x)=a₁w₁x+b₁ is the linear representation of σ(x) around L₁ (for L₁=0 it should be immediately clear that b₁=σ(0)=0.5). Then y₁(x) should be such that there exist (a₁, ϵ₁>0, δ₁>0) that satisfy |y₁(x)-σ(x)| < ϵ₁ ∀ x ∈ (L₁-δ₁, L₁+δ₁)

Alternatively, let us examine the first derivative — if the first derivative is ‘almost constant’ in a region S₁ disjoint from S₀, then the original function is ‘almost linear’ in that region.

σ’(x) vs x for x ∈ (-10, 10)

Using σ’(x)≥0.19 (margin of error ~ 24%) we get something close to our previous guess — L₁=0, a₁/w₁∈(0.19, 0.25), δ₁~1 ⇒ S₁ = (-1, 1) works reasonably well to manufacture the linear term.
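To back this up numerically, here is a small sketch (my own, not from the article) that checks the derivative bounds on S₁ and fits a least-squares line to σ on (-1, 1):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# On S1 = (-1, 1) the first derivative stays between ~0.197 and 0.25,
# so sigma is "almost linear" there with slope roughly in (0.19, 0.25).
x = np.linspace(-1.0, 1.0, 201)
print(sigmoid_prime(x).min(), sigmoid_prime(x).max())

# Least-squares line on S1: slope ~0.24, intercept ~0.5, max residual < 0.01.
slope, intercept = np.polyfit(x, sigmoid(x), deg=1)
residual = np.abs(np.polyval([slope, intercept], x) - sigmoid(x)).max()
print(slope, intercept, residual)
```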

Part 2: the quadratic term

For a more rigorous proof: assume y₂(w₂x)=a₂(w₂x)²+b₂(w₂x)+c₂ is the quadratic representation of σ(x) around L₂. Then y₂(x) should be such that there exist (a₂, ϵ₂>0, δ₂>0) that satisfy |y₂(x)-σ(x)| < ϵ₂ ∀ x ∈ (L₂-δ₂, L₂+δ₂)

Instead, let us examine the second derivative — if the second derivative is ‘almost constant’ in a region S₂ disjoint from S₁ and S₀, then the original function is ‘almost quadratic’ in that region.

σ’’(x) vs x for x ∈ (-10, 10)

From the graph let us choose |σ’’(x)|≥0.091 (margin of error ~ 9%; this is not a fluke or a guess, it involved some trial and error). Solving using Wolfram Alpha (link to the solution I used), |a₂/w₂²|~0.0455, and S₂ = (-1.67, -1) ⋃ (1, 1.67) works reasonably well to manufacture the quadratic term.
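As before, we can sanity-check this with a short numpy sketch (my own illustration): fit a least-squares quadratic to σ on the branch (1, 1.67) and look at the leading coefficient and the residual.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Least-squares quadratic fit of sigma on the branch (1, 1.67) of S2.
x = np.linspace(1.0, 1.67, 201)
a2, b2, c2 = np.polyfit(x, sigmoid(x), deg=2)

# |a2| comes out around 0.045-0.048, the same order as the |a2/w2^2| ~ 0.0455
# quoted above (negative on this branch, since sigma is concave for x > 0),
# and the fit error is tiny, so sigma is "almost quadratic" here.
residual = np.abs(np.polyval([a2, b2, c2], x) - sigmoid(x)).max()
print(a2, residual)   # residual well below 1e-3
```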

Part 3: the cubic term and other higher order terms

σ’’’(x) vs x for x ∈ (-10, 10)

We need to find y₃(w₃x)=a₃(w₃x)³+b₃(w₃x)²+c₃(w₃x)+d₃ that approximates σ(x) in some region S₃. We already used the region S₁=(-1, 1) to manufacture the linear term. From the graph let us choose σ’’’(x)≥0.035 (margin of error ~ 16%). Solving using Wolfram Alpha (link to the solution I used), a₃/w₃³~0.0058, and S₃ = (-2.89, -1.86) ⋃ (1.86, 2.89) works reasonably well to manufacture the cubic term.
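The same kind of check works for the cubic region (again a sketch of my own): fit a least-squares cubic to σ on the branch (1.86, 2.89).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Least-squares cubic fit of sigma on the branch (1.86, 2.89) of S3.
x = np.linspace(1.86, 2.89, 201)
coeffs = np.polyfit(x, sigmoid(x), deg=3)

# The leading coefficient is roughly 0.006-0.007, the same order as the
# a3/w3^3 ~ 0.0058 quoted above, and the fit error is very small.
residual = np.abs(np.polyval(coeffs, x) - sigmoid(x)).max()
print(coeffs[0], residual)
```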

Similarly, higher order polynomial terms can be manufactured from appropriate regions of the sigmoid curve.

σᶦᵛ(x) vs x for x ∈ (-10, 10)
σᵛ(x) vs x for x ∈ (-10, 10)

Putting things together: a wide single hidden layer neural network

Neural network to manufacture an n-th order polynomial

We take a leap of faith — we assume that for any arbitrary choice of n we will be able to find a non-null Sₙ that is disjoint from S₀, S₁, S₂, … This should be mathematically feasible because the Taylor series expansion of σ(x) exists around any value of x, and at least some of the higher order terms will be non-zero.

Remember, x was centered and scaled, so for a given weight w the pre-activation w*x stays within a narrow range for the bulk of the data. For appropriate choices of the bias b we can therefore map w*x+b into the region of the sigmoid that manufactures the desired term. The output is a weighted sum of (approximate) polynomials of order less than or equal to n; therefore, the output is approximately a polynomial of order n.
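To make the construction concrete, here is a minimal numpy sketch (my own illustration, with x restricted to (-1, 1) and the particular weights/biases chosen by me, not taken from the article): the first-layer weights and biases are frozen so that each hidden unit operates in one of the regions S₁, S₂, S₃ identified above, and the output layer is an OLS fit on those frozen sigmoid features.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)        # inputs kept inside (-1, 1) for this illustration
y = 2 + 3 * x - x**2 + 0.5 * x**3        # target polynomial

# Frozen first-layer weights/biases chosen so that w*x + b lands in the
# "linear", "quadratic" and "cubic" regions of the sigmoid identified above.
regions = [
    (1.0,   0.0),     # S1 = (-1, 1): approximately linear
    (0.335, 1.335),   # S2 branch (1, 1.67): approximately quadratic
    (0.515, 2.375),   # S3 branch (1.86, 2.89): approximately cubic
]
H = np.column_stack([np.ones_like(x)] + [sigmoid(w * x + b) for w, b in regions])

# Output layer = ordinary least squares on the frozen sigmoid features.
W, *_ = np.linalg.lstsq(H, y, rcond=None)
rmse = np.sqrt(np.mean((H @ W - y) ** 2))
print(rmse, y.std())   # rmse should be a small fraction of y's spread
```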

Closing notes

  1. For a chosen order of polynomial, it should be noted that the weights are constrained by the range of S. The leading coefficient of the n-th order term (given roughly by max(σ⁽ⁿ⁾(x))/n!) decreases as n increases.
  2. Any polynomial can be approximated by adjusting the weights. The output-layer weights W should be of greater interest for the purpose of updating the parameters, while the hidden-layer weights w should be treated as restricted parameters used for engineering features.
  3. (a₁/w₁)*(a₂/w₂²) ~ 0.19*0.0455 = 0.0086 > a₃/w₃³ ~ 0.0058 [we still haven’t figured out how to engineer interaction terms as features, so let’s ignore this for now]; (a₂/w₂²)² ~ 0.002 > a₄/w₄⁴ ~ 0.0008 (a quick numeric check appears after the figure below). Therefore, a deeper network is better for engineering higher order polynomials more readily. This can be attributed to the increase in the number of parameters of the network.
‘Deep’ neural network for higher order polynomials (notice that some biases and weights are fixed/frozen/not updated for ease of interpretation)
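A quick numeric check of the coefficient comparisons in point 3, using the numbers quoted earlier in the article:

```python
# a_n / w_n^n magnitudes derived above for the linear, quadratic, cubic
# and quartic regions of the sigmoid.
a1, a2, a3, a4 = 0.19, 0.0455, 0.0058, 0.0008

# Composing a "linear" and a "quadratic" unit across two layers yields a larger
# effective third-order coefficient than a single "cubic" unit, and similarly
# for the fourth order; this is the sense in which depth helps here.
print(a1 * a2, ">", a3)   # 0.008645 > 0.0058
print(a2 ** 2, ">", a4)   # ~0.00207 > 0.0008
```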

Also, it is known that building up non-linear features gradually in the deeper layers is often a better strategy than forcefully engineering them in the first hidden layer of a very wide network; it also tends to be more robust. However, for a given n we have not established whether indefinitely increasing the depth of the neural network is the best approach, or whether there is an optimal number of hidden layers. Nor have we proved that this construction is the most optimal way to generate an n-th order polynomial using a sigmoid neural network; such proofs would also require us to optimally choose the number of neurons in each layer. Still, this should remind us of a familiar heuristic: the number of neurons roughly squaring after each layer, or expanding out in powers of 2 in the deeper layers.
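For concreteness, one possible reading of that heuristic (a hypothetical PyTorch sketch of my own, not an architecture prescribed in this article) is a stack of sigmoid hidden layers whose widths expand in powers of 2, followed by a linear output for the regression setting assumed throughout:

```python
import torch.nn as nn

# Sigmoid hidden layers with widths expanding in powers of 2,
# linear output activation (regression with OLS loss).
model = nn.Sequential(
    nn.Linear(1, 4),   nn.Sigmoid(),
    nn.Linear(4, 8),   nn.Sigmoid(),
    nn.Linear(8, 16),  nn.Sigmoid(),
    nn.Linear(16, 1),  # linear output
)
```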

Naveen Mathew Nathan S.

Data Scientist, interested in theory and practice of machine learning.