A deep neural network can be understood as a geometric system in which each layer reshapes the input space to form increasingly complex decision boundaries. For this to work effectively, layers must preserve meaningful spatial information, in particular how far a data point lies from those boundaries, since this distance is what allows deeper layers to build rich, non-linear representations.
Sigmoid disrupts this process by compressing all inputs into a narrow range between 0 and 1. As values move away from the decision boundary, they become indistinguishable, causing a loss of geometric context across layers. This leads to weaker representations and limits the usefulness of depth.
ReLU, by contrast, preserves magnitude for positive inputs, allowing distance information to flow through the network. This lets deeper models remain expressive without requiring excessive width or compute.
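Before the experiment, the claim can be illustrated with a tiny standalone sketch (the numbers here are arbitrary examples of ours, not taken from the network below): two inputs at different distances from a boundary remain far apart under ReLU, but become nearly indistinguishable after Sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Two pre-activations: one near a decision boundary, one far from it
near, far = 2.0, 8.0

# Sigmoid squashes both toward 1, so the distance signal almost vanishes
gap_sigmoid = sigmoid(far) - sigmoid(near)   # ≈ 0.119

# ReLU passes positive magnitudes through unchanged
gap_relu = max(far, 0.0) - max(near, 0.0)    # 6.0

print(f"sigmoid gap: {gap_sigmoid:.4f} | relu gap: {gap_relu:.1f}")
```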
In this article, we focus on this forward-pass behavior, examining how Sigmoid and ReLU differ in signal propagation and representation geometry using a two-moons experiment, and what that means for inference efficiency and scalability.
Setting up the dependencies
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
plt.rcParams.update({
    "font.family": "monospace",
    "axes.spines.top": False,
    "axes.spines.right": False,
    "figure.facecolor": "white",
    "axes.facecolor": "#f7f7f7",
    "axes.grid": True,
    "grid.color": "#e0e0e0",
    "grid.linewidth": 0.6,
})
T = {
    "bg": "white",
    "panel": "#f7f7f7",
    "sig": "#e05c5c",
    "relu": "#3a7bd5",
    "c0": "#f4a261",
    "c1": "#2a9d8f",
    "text": "#1a1a1a",
    "muted": "#666666",
}
Creating the dataset
To study the effect of activation functions in a controlled setting, we first generate a synthetic dataset using scikit-learn's make_moons. This creates a non-linear, two-class problem where simple linear boundaries fail, making it ideal for testing how well neural networks learn complex decision surfaces.
We add a small amount of noise to make the task more realistic, then standardize the features using StandardScaler so both dimensions are on the same scale, ensuring stable training. The dataset is then split into training and test sets to evaluate generalization.
Finally, we visualize the data distribution. This plot serves as the baseline geometry that both the Sigmoid and ReLU networks will attempt to model, letting us later compare how each activation function transforms this space across layers.
X, y = make_moons(n_samples=400, noise=0.18, random_state=42)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
fig, ax = plt.subplots(figsize=(7, 5))
fig.patch.set_facecolor(T["bg"])
ax.set_facecolor(T["panel"])
ax.scatter(X[y == 0, 0], X[y == 0, 1], c=T["c0"], s=40,
           edgecolors="white", linewidths=0.5, label="Class 0", alpha=0.9)
ax.scatter(X[y == 1, 0], X[y == 1, 1], c=T["c1"], s=40,
           edgecolors="white", linewidths=0.5, label="Class 1", alpha=0.9)
ax.set_title("make_moons — our dataset", color=T["text"], fontsize=13)
ax.set_xlabel("x₁", color=T["muted"]); ax.set_ylabel("x₂", color=T["muted"])
ax.tick_params(colors=T["muted"]); ax.legend(fontsize=10)
plt.tight_layout()
plt.savefig("moons_dataset.png", dpi=140, bbox_inches="tight")
plt.show()
Creating the Network
Next, we implement a small, controlled neural network to isolate the effect of activation functions. The goal here is not to build a highly optimized model, but to create a clean experimental setup where Sigmoid and ReLU can be compared under identical conditions.
We define both activation functions (Sigmoid and ReLU) along with their derivatives, and use binary cross-entropy as the loss since this is a binary classification task. The TwoLayerNet class represents a simple 3-layer feedforward network (2 hidden layers + output), where the only configurable component is the activation function.
A key detail is the initialization strategy: we use He initialization for ReLU and Xavier initialization for Sigmoid, ensuring that each network starts in a fair and stable regime suited to its activation dynamics.
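As a quick side calculation (a sketch of ours, separate from the experiment code), the two schemes prescribe different weight standard deviations for the same fan-in, which is all the `scale` lambda in the class below encodes:

```python
import numpy as np

fan_in = 8                         # hidden width used in this article
he_std = np.sqrt(2 / fan_in)       # He: compensates for ReLU zeroing out half its inputs
xavier_std = np.sqrt(1 / fan_in)   # Xavier: keeps variance stable through Sigmoid

print(f"He std: {he_std:.4f} | Xavier std: {xavier_std:.4f}")
```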
The forward pass computes activations layer by layer, while the backward pass performs standard gradient descent updates. Importantly, we also include diagnostic methods, get_hidden and get_z_trace, which let us inspect how signals evolve across layers; this is crucial for analyzing how much geometric information is preserved or lost.
By holding the architecture, data, and training setup fixed, this implementation ensures that any difference in performance or internal representations can be attributed directly to the activation function itself, setting the stage for a clear comparison of their impact on signal propagation and expressiveness.
def sigmoid(z): return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
def sigmoid_d(a): return a * (1 - a)
def relu(z): return np.maximum(0, z)
def relu_d(z): return (z > 0).astype(float)
def bce(y, yhat): return -np.mean(y * np.log(yhat + 1e-9) + (1 - y) * np.log(1 - yhat + 1e-9))
class TwoLayerNet:
    def __init__(self, activation="relu", seed=0):
        np.random.seed(seed)
        self.act_name = activation
        self.act = relu if activation == "relu" else sigmoid
        self.dact = relu_d if activation == "relu" else sigmoid_d
        # He init for ReLU, Xavier for Sigmoid
        scale = lambda fan_in: np.sqrt(2 / fan_in) if activation == "relu" else np.sqrt(1 / fan_in)
        self.W1 = np.random.randn(2, 8) * scale(2)
        self.b1 = np.zeros((1, 8))
        self.W2 = np.random.randn(8, 8) * scale(8)
        self.b2 = np.zeros((1, 8))
        self.W3 = np.random.randn(8, 1) * scale(8)
        self.b3 = np.zeros((1, 1))
        self.loss_history = []

    def forward(self, X, store=False):
        z1 = X @ self.W1 + self.b1; a1 = self.act(z1)
        z2 = a1 @ self.W2 + self.b2; a2 = self.act(z2)
        z3 = a2 @ self.W3 + self.b3; out = sigmoid(z3)
        if store:
            self._cache = (X, z1, a1, z2, a2, z3, out)
        return out

    def backward(self, lr=0.05):
        X, z1, a1, z2, a2, z3, out = self._cache
        n = X.shape[0]
        dout = (out - self.y_cache) / n
        dW3 = a2.T @ dout; db3 = dout.sum(axis=0, keepdims=True)
        da2 = dout @ self.W3.T
        # ReLU's derivative takes the pre-activation z; Sigmoid's takes the activation a
        dz2 = da2 * (self.dact(z2) if self.act_name == "relu" else self.dact(a2))
        dW2 = a1.T @ dz2; db2 = dz2.sum(axis=0, keepdims=True)
        da1 = dz2 @ self.W2.T
        dz1 = da1 * (self.dact(z1) if self.act_name == "relu" else self.dact(a1))
        dW1 = X.T @ dz1; db1 = dz1.sum(axis=0, keepdims=True)
        for p, g in [(self.W3, dW3), (self.b3, db3), (self.W2, dW2),
                     (self.b2, db2), (self.W1, dW1), (self.b1, db1)]:
            p -= lr * g

    def train_step(self, X, y, lr=0.05):
        self.y_cache = y.reshape(-1, 1)
        out = self.forward(X, store=True)
        loss = bce(self.y_cache, out)
        self.backward(lr)
        return loss

    def get_hidden(self, X, layer=1):
        """Return post-activation values for layer 1 or 2."""
        z1 = X @ self.W1 + self.b1; a1 = self.act(z1)
        if layer == 1: return a1
        z2 = a1 @ self.W2 + self.b2; return self.act(z2)

    def get_z_trace(self, x_single):
        """Return pre-activation magnitudes per layer for ONE sample."""
        z1 = x_single @ self.W1 + self.b1
        a1 = self.act(z1)
        z2 = a1 @ self.W2 + self.b2
        a2 = self.act(z2)
        z3 = a2 @ self.W3 + self.b3
        return [np.abs(z1).mean(), np.abs(a1).mean(),
                np.abs(z2).mean(), np.abs(a2).mean(),
                np.abs(z3).mean()]
Training the Networks
Now we train both networks under identical conditions to ensure a fair comparison. We initialize two models, one using Sigmoid and the other ReLU, with the same random seed so they start from equal weight configurations.
The training loop runs for 800 epochs using mini-batch gradient descent. In each epoch, we shuffle the training data, split it into batches, and update both networks in parallel. This setup ensures that the only variable changing between the two runs is the activation function.
We also track the loss after every epoch and log it at regular intervals. This lets us observe how each network evolves over time, not just in terms of convergence speed, but whether it keeps improving or plateaus.
This step is important because it establishes the first sign of divergence: if both models start identically but behave differently during training, that difference must come from how each activation function propagates and preserves information through the network.
EPOCHS = 800
LR = 0.05
BATCH = 64

net_sig = TwoLayerNet("sigmoid", seed=42)
net_relu = TwoLayerNet("relu", seed=42)

for epoch in range(EPOCHS):
    idx = np.random.permutation(len(X_train))
    for net in [net_sig, net_relu]:
        epoch_loss = []
        for i in range(0, len(idx), BATCH):
            b = idx[i:i+BATCH]
            loss = net.train_step(X_train[b], y_train[b], LR)
            epoch_loss.append(loss)
        net.loss_history.append(np.mean(epoch_loss))
    if (epoch + 1) % 200 == 0:
        ls = net_sig.loss_history[-1]
        lr = net_relu.loss_history[-1]
        print(f"  Epoch {epoch+1:4d} | Sigmoid loss: {ls:.4f} | ReLU loss: {lr:.4f}")

print("\n✅ Training complete.")
Training Loss Curve
The loss curves make the divergence between Sigmoid and ReLU very clear. Both networks start from the same initialization and are trained under identical conditions, yet their learning trajectories quickly separate. Sigmoid improves initially but plateaus around ~0.28 by epoch 400, showing almost no progress afterward, a sign that the network has exhausted the useful signal it can extract.
ReLU, in contrast, continues to steadily reduce the loss throughout training, dropping from ~0.15 to ~0.03 by epoch 800. This is not just faster convergence; it reflects a deeper issue: Sigmoid's compression is limiting the flow of meaningful information, causing the model to stall, while ReLU preserves that signal, allowing the network to keep refining its decision boundary.
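A side note on why the stall is unsurprising (a small check of ours, separate from the training run): Sigmoid's derivative never exceeds 0.25 and decays quickly away from zero, so backpropagated gradients shrink layer by layer, while ReLU's derivative is exactly 1 for any positive input.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([0.0, 2.0, 5.0])

# Sigmoid's gradient peaks at 0.25 (at z = 0) and collapses for larger |z|
sig_grad = sigmoid(z) * (1 - sigmoid(z))

# ReLU's gradient is a clean 0/1 gate: no shrinkage for active units
relu_grad = (z > 0).astype(float)

print("sigmoid'(z):", np.round(sig_grad, 4))
print("relu'(z):   ", relu_grad)
```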
fig, ax = plt.subplots(figsize=(10, 5))
fig.patch.set_facecolor(T["bg"])
ax.set_facecolor(T["panel"])
ax.plot(net_sig.loss_history, color=T["sig"], lw=2.5, label="Sigmoid")
ax.plot(net_relu.loss_history, color=T["relu"], lw=2.5, label="ReLU")
ax.set_xlabel("Epoch", color=T["muted"])
ax.set_ylabel("Binary Cross-Entropy Loss", color=T["muted"])
ax.set_title("Training Loss — same architecture, same init, same LR\nonly the activation differs",
             color=T["text"], fontsize=12)
ax.legend(fontsize=11)
ax.tick_params(colors=T["muted"])

# Annotate final losses
for net, color, va in [(net_sig, T["sig"], "bottom"), (net_relu, T["relu"], "top")]:
    final = net.loss_history[-1]
    ax.annotate(f"  final: {final:.4f}", xy=(EPOCHS-1, final),
                color=color, fontsize=9, va=va)
plt.tight_layout()
plt.savefig("loss_curves.png", dpi=140, bbox_inches="tight")
plt.show()
Decision Boundary Plots
The decision boundary visualization makes the difference even more tangible. The Sigmoid network learns a nearly linear boundary, failing to capture the curved structure of the two-moons dataset, which results in lower accuracy (~79%). This is a direct consequence of its compressed internal representations: the network simply does not have enough geometric signal to construct a complex boundary.
In contrast, the ReLU network learns a highly non-linear, well-adapted boundary that closely follows the data distribution, achieving much higher accuracy (~96%). Because ReLU preserves magnitude across layers, it allows the network to progressively bend and refine the decision surface, turning depth into real expressive power rather than wasted capacity.
def plot_boundary(ax, net, X, y, title, color):
    h = 0.025
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    grid = np.c_[xx.ravel(), yy.ravel()]
    Z = net.forward(grid).reshape(xx.shape)
    # Soft shading
    cmap_bg = ListedColormap(["#fde8c8", "#c8ece9"])
    ax.contourf(xx, yy, Z, levels=50, cmap=cmap_bg, alpha=0.85)
    ax.contour(xx, yy, Z, levels=[0.5], colors=[color], linewidths=2)
    ax.scatter(X[y == 0, 0], X[y == 0, 1], c=T["c0"], s=35,
               edgecolors="white", linewidths=0.4, alpha=0.9)
    ax.scatter(X[y == 1, 0], X[y == 1, 1], c=T["c1"], s=35,
               edgecolors="white", linewidths=0.4, alpha=0.9)
    acc = ((net.forward(X) >= 0.5).ravel() == y).mean()
    ax.set_title(f"{title}\nTest acc: {acc:.1%}", color=color, fontsize=12)
    ax.set_xlabel("x₁", color=T["muted"]); ax.set_ylabel("x₂", color=T["muted"])
    ax.tick_params(colors=T["muted"])

fig, axes = plt.subplots(1, 2, figsize=(13, 5.5))
fig.patch.set_facecolor(T["bg"])
fig.suptitle("Decision Boundaries learned on make_moons",
             fontsize=13, color=T["text"])
plot_boundary(axes[0], net_sig, X_test, y_test, "Sigmoid", T["sig"])
plot_boundary(axes[1], net_relu, X_test, y_test, "ReLU", T["relu"])
plt.tight_layout()
plt.savefig("decision_boundaries.png", dpi=140, bbox_inches="tight")
plt.show()
Layer-by-Layer Signal Trace
This chart tracks how the signal evolves across layers for a point far from the decision boundary, and it clearly shows where Sigmoid fails. Both networks start with a similar pre-activation magnitude at the first layer (~2.0), but Sigmoid immediately compresses it to ~0.3, while ReLU retains a higher value. As we move deeper, Sigmoid keeps squashing the signal into a narrow band (0.5–0.6), effectively erasing meaningful differences. ReLU, on the other hand, preserves and amplifies magnitude, with the final layer reaching values as high as 9–20.
This means the output neuron in the ReLU network makes decisions based on a strong, well-separated signal, while the Sigmoid network is forced to classify using a weak, compressed one. The key takeaway is that ReLU preserves distance from the decision boundary across layers, allowing that information to compound, while Sigmoid progressively destroys it.
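The same collapse can be reproduced in isolation (a toy loop of ours, not the trained network): pushing two well-separated inputs through repeated Sigmoids erases their separation within a few layers, while repeated ReLUs leave it untouched.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Two points at different distances from a hypothetical boundary
sa, sb = 3.0, 9.0   # fed through stacked Sigmoids
ra, rb = 3.0, 9.0   # fed through stacked ReLUs

for _ in range(3):  # three "layers" of pure activation, no weights
    sa, sb = sigmoid(sa), sigmoid(sb)
    ra, rb = max(ra, 0.0), max(rb, 0.0)

print(f"sigmoid separation after 3 layers: {abs(sb - sa):.4f}")  # ≈ 0.002
print(f"relu separation after 3 layers:    {abs(rb - ra):.1f}")  # 6.0
```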
far_class0 = X_train[y_train == 0][np.argmax(
    np.linalg.norm(X_train[y_train == 0] - [-1.2, -0.3], axis=1)
)]
far_class1 = X_train[y_train == 1][np.argmax(
    np.linalg.norm(X_train[y_train == 1] - [1.2, 0.3], axis=1)
)]

stage_labels = ["z₁ (pre)", "a₁ (post)", "z₂ (pre)", "a₂ (post)", "z₃ (out)"]
x_pos = np.arange(len(stage_labels))

fig, axes = plt.subplots(1, 2, figsize=(13, 5.5))
fig.patch.set_facecolor(T["bg"])
fig.suptitle("Layer-by-layer signal magnitude — a point far from the boundary",
             fontsize=12, color=T["text"])

for ax, sample, title in zip(
    axes,
    [far_class0, far_class1],
    ["Class 0 sample (deep in its moon)", "Class 1 sample (deep in its moon)"]
):
    ax.set_facecolor(T["panel"])
    sig_trace = net_sig.get_z_trace(sample.reshape(1, -1))
    relu_trace = net_relu.get_z_trace(sample.reshape(1, -1))
    ax.plot(x_pos, sig_trace, "o-", color=T["sig"], lw=2.5, markersize=8, label="Sigmoid")
    ax.plot(x_pos, relu_trace, "s-", color=T["relu"], lw=2.5, markersize=8, label="ReLU")
    for i, (s, r) in enumerate(zip(sig_trace, relu_trace)):
        ax.text(i, s - 0.06, f"{s:.3f}", ha="center", fontsize=8, color=T["sig"])
        ax.text(i, r + 0.04, f"{r:.3f}", ha="center", fontsize=8, color=T["relu"])
    ax.set_xticks(x_pos); ax.set_xticklabels(stage_labels, color=T["muted"], fontsize=9)
    ax.set_ylabel("Mean |activation|", color=T["muted"])
    ax.set_title(title, color=T["text"], fontsize=11)
    ax.tick_params(colors=T["muted"]); ax.legend(fontsize=10)

plt.tight_layout()
plt.savefig("signal_trace.png", dpi=140, bbox_inches="tight")
plt.show()
Hidden Space Scatter
This is the most important visualization because it directly exposes how each network uses (or fails to use) depth. In the Sigmoid network (left), both classes collapse into a tight, overlapping region, a diagonal smear where points are heavily entangled. The standard deviation actually decreases from layer 1 (0.26) to layer 2 (0.19), meaning the representation becomes less expressive with depth. Each layer compresses the signal further, stripping away the spatial structure needed to separate the classes.
ReLU shows the opposite behavior. In layer 1, while some neurons are inactive (the "dead zone"), the active ones already spread across a wider range (1.15 std), indicating preserved variation. By layer 2, this expands even further (1.67 std), and the classes become clearly separable: one is pushed to high activation levels while the other stays near zero. At this point, the output layer's job is trivial.
fig, axes = plt.subplots(2, 2, figsize=(13, 10))
fig.patch.set_facecolor(T["bg"])
fig.suptitle("Hidden-space representations on make_moons test set",
             fontsize=13, color=T["text"])

for col, (net, color, title) in enumerate([
    (net_sig, T["sig"], "Sigmoid"),
    (net_relu, T["relu"], "ReLU"),
]):
    for row, layer in enumerate([1, 2]):
        ax = axes[row][col]
        ax.set_facecolor(T["panel"])
        H = net.get_hidden(X_test, layer=layer)
        ax.scatter(H[y_test == 0, 0], H[y_test == 0, 1], c=T["c0"], s=40,
                   edgecolors="white", linewidths=0.4, alpha=0.85, label="Class 0")
        ax.scatter(H[y_test == 1, 0], H[y_test == 1, 1], c=T["c1"], s=40,
                   edgecolors="white", linewidths=0.4, alpha=0.85, label="Class 1")
        spread = H.std()
        ax.text(0.04, 0.96, f"std: {spread:.4f}",
                transform=ax.transAxes, fontsize=9, va="top",
                color=T["text"],
                bbox=dict(boxstyle="round,pad=0.3", fc="white", ec=color, alpha=0.85))
        ax.set_title(f"{title} — Layer {layer} hidden space",
                     color=color, fontsize=11)
        ax.set_xlabel("Unit 1", color=T["muted"])
        ax.set_ylabel("Unit 2", color=T["muted"])
        ax.tick_params(colors=T["muted"])
        if row == 0 and col == 0: ax.legend(fontsize=9)

plt.tight_layout()
plt.savefig("hidden_space.png", dpi=140, bbox_inches="tight")
plt.show()
I am a Civil Engineering graduate (2022) from Jamia Millia Islamia, New Delhi, with a keen interest in Data Science, especially neural networks and their applications in various areas.

