The Mathematics Behind Neural Networks

Neural networks might seem like magic, but they're built on elegant mathematical foundations. In this article, we'll break down the core math concepts that make deep learning work.
The Neuron: A Mathematical Model#
At its core, a single neuron performs a simple computation:

$$y = f(\mathbf{w} \cdot \mathbf{x} + b)$$

Where:
- $\mathbf{x}$ is the input vector
- $\mathbf{w}$ is the weight vector
- $b$ is the bias term
- $f$ is the activation function
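As a quick illustration, here is a minimal sketch of that computation in TypeScript. The `neuron` helper and its signature are illustrative only; they aren't reused in the implementation later in this article.

```typescript
// One neuron: weighted sum of the inputs plus bias, passed through an activation.
// Illustrative helper, not part of the later Matrix/DenseLayer implementation.
function neuron(
  x: number[],              // input vector
  w: number[],              // weight vector (same length as x)
  b: number,                // bias term
  f: (z: number) => number  // activation function
): number {
  const z = w.reduce((sum, wi, i) => sum + wi * x[i], b); // w · x + b
  return f(z);                                            // y = f(z)
}
```

Everything that follows, from activation functions to backpropagation, builds on this one operation applied to whole matrices at a time.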
Activation Functions#
Activation functions introduce non-linearity, allowing neural networks to learn complex patterns.
Sigmoid Function#
The sigmoid squashes any input to a value between 0 and 1:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$
```typescript
function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

function sigmoidDerivative(x: number): number {
  const s = sigmoid(x);
  return s * (1 - s);
}
```
ReLU (Rectified Linear Unit)#
ReLU is computationally efficient and helps mitigate the vanishing gradient problem:

$$\text{ReLU}(x) = \max(0, x)$$
```typescript
function relu(x: number): number {
  return Math.max(0, x);
}

function reluDerivative(x: number): number {
  // The derivative is undefined at x = 0; returning 0 there is the usual convention.
  return x > 0 ? 1 : 0;
}
```
Softmax Function#
For multi-class classification, softmax converts raw scores (logits) into probabilities:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

The outputs sum to 1, making them interpretable as a probability distribution over the classes:
```typescript
function softmax(x: number[]): number[] {
  const maxVal = Math.max(...x);
  const expValues = x.map(val => Math.exp(val - maxVal)); // Subtract max for numerical stability
  const sumExp = expValues.reduce((a, b) => a + b, 0);
  return expValues.map(val => val / sumExp);
}
```
Matrix Operations in Neural Networks#
Neural networks process data in batches using matrix operations. For a layer with $n$ inputs and $m$ outputs:

$$Y = XW + b$$

Where:
- $X$ is the input matrix of shape $(\text{batch\_size}, n)$
- $W$ is the weight matrix of shape $(n, m)$
- $b$ is the bias vector of shape $(m,)$
- $Y$ is the output matrix of shape $(\text{batch\_size}, m)$
Implementation#
```typescript
class Matrix {
  data: number[][];
  rows: number;
  cols: number;

  constructor(data: number[][]) {
    this.data = data;
    this.rows = data.length;
    this.cols = data[0].length;
  }

  // Matrix multiplication
  static multiply(a: Matrix, b: Matrix): Matrix {
    if (a.cols !== b.rows) {
      throw new Error('Incompatible dimensions');
    }
    const result: number[][] = [];
    for (let i = 0; i < a.rows; i++) {
      result[i] = [];
      for (let j = 0; j < b.cols; j++) {
        let sum = 0;
        for (let k = 0; k < a.cols; k++) {
          sum += a.data[i][k] * b.data[k][j];
        }
        result[i][j] = sum;
      }
    }
    return new Matrix(result);
  }

  // Element-wise addition (a number[] is broadcast across rows as a bias vector)
  static add(a: Matrix, b: Matrix | number[]): Matrix {
    const result: number[][] = [];
    for (let i = 0; i < a.rows; i++) {
      result[i] = [];
      for (let j = 0; j < a.cols; j++) {
        const bVal = Array.isArray(b) ? b[j] : b.data[i][j];
        result[i][j] = a.data[i][j] + bVal;
      }
    }
    return new Matrix(result);
  }

  // Apply a function element-wise
  map(fn: (x: number) => number): Matrix {
    const result = this.data.map(row => row.map(fn));
    return new Matrix(result);
  }
}
```
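With the `Matrix` class in place, a full layer activation $Y = f(XW + b)$ is just a multiply, a broadcast add, and an element-wise map. Here is a small sketch; the example values and variable names are illustrative:

```typescript
// Illustrative only: a batch of 2 samples with 3 features, feeding a layer
// with 3 inputs and 2 outputs, i.e. X is (2, 3), W is (3, 2), b has length 2.
const X = new Matrix([
  [1.0, 2.0, 3.0],
  [4.0, 5.0, 6.0],
]);
const W = new Matrix([
  [ 0.1, -0.2],
  [ 0.3,  0.4],
  [-0.5,  0.6],
]);
const b = [0.01, 0.02];

// Y = f(XW + b), using the sigmoid from earlier as the activation
const Y = Matrix.add(Matrix.multiply(X, W), b).map(sigmoid);
console.log(Y.data); // 2 x 2 matrix of activations
```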
Loss Functions#
Loss functions measure how wrong our predictions are. The goal of training is to minimize this loss.
Mean Squared Error (MSE)#
For regression tasks:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
```typescript
function mse(predicted: number[], actual: number[]): number {
  const n = predicted.length;
  let sum = 0;
  for (let i = 0; i < n; i++) {
    sum += Math.pow(actual[i] - predicted[i], 2);
  }
  return sum / n;
}

function mseGradient(predicted: number[], actual: number[]): number[] {
  const n = predicted.length;
  return predicted.map((p, i) => (2 / n) * (p - actual[i]));
}
```
Cross-Entropy Loss#
For classification tasks:

$$L = -\sum_{i} y_i \log(\hat{y}_i)$$

Where $y_i$ is 1 for the correct class and 0 otherwise (one-hot encoding), and $\hat{y}_i$ is the predicted probability for class $i$.
```typescript
function crossEntropy(predicted: number[], actual: number[]): number {
  let loss = 0;
  for (let i = 0; i < predicted.length; i++) {
    if (actual[i] === 1) {
      loss -= Math.log(predicted[i] + 1e-15); // Add epsilon for numerical stability
    }
  }
  return loss;
}
```
Backpropagation#
Backpropagation computes gradients using the chain rule, allowing us to update weights efficiently.
The Chain Rule#
For a composite function $f(g(x))$:

$$\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)$$
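As a worked example, differentiating a single sigmoid neuron $\sigma(wx + b)$ with respect to its weight chains the outer and inner derivatives together:

$$\frac{\partial}{\partial w} \sigma(wx + b) = \sigma'(wx + b) \cdot \frac{\partial}{\partial w}(wx + b) = \sigma(wx + b)\,\bigl(1 - \sigma(wx + b)\bigr)\, x$$

This is exactly the pattern backpropagation repeats at every layer: the derivative of the outer function evaluated at the pre-activation, times the derivative of whatever produced that pre-activation.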
Computing Gradients#
Consider a simple two-layer network:

$$\hat{Y} = f_2\bigl(f_1(X W_1 + b_1)\, W_2 + b_2\bigr)$$

For each layer, we compute:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial Y} \cdot \frac{\partial Y}{\partial Z} \cdot \frac{\partial Z}{\partial W}$$

Where $Z = XW + b$ (pre-activation) and $Y = f(Z)$ (post-activation).
Implementation#
```typescript
interface LayerCache {
  input: Matrix;
  preActivation: Matrix;
  output: Matrix;
}

class DenseLayer {
  weights: Matrix;
  bias: number[];
  activation: (x: number) => number;
  activationDerivative: (x: number) => number;

  // Gradient storage
  weightsGrad: Matrix | null = null;
  biasGrad: number[] | null = null;

  // Initialize with pre-built weights and bias (random values in practice)
  constructor(
    weights: Matrix,
    bias: number[],
    activation: (x: number) => number,
    activationDerivative: (x: number) => number
  ) {
    this.weights = weights;
    this.bias = bias;
    this.activation = activation;
    this.activationDerivative = activationDerivative;
  }

  forward(input: Matrix, cache: LayerCache): Matrix {
    cache.input = input;
    // Z = XW + b
    const z = Matrix.add(Matrix.multiply(input, this.weights), this.bias);
    cache.preActivation = z;
    // Y = f(Z)
    const output = z.map(this.activation);
    cache.output = output;
    return output;
  }

  backward(gradOutput: Matrix, cache: LayerCache): Matrix {
    // Gradient through activation: dZ = dY * f'(Z)
    const activationGrad = cache.preActivation.map(this.activationDerivative);
    const dZ = new Matrix(
      gradOutput.data.map((row, i) =>
        row.map((val, j) => val * activationGrad.data[i][j])
      )
    );

    // Gradient for weights: dW = X^T @ dZ
    const inputT = this.transpose(cache.input);
    this.weightsGrad = Matrix.multiply(inputT, dZ);

    // Gradient for bias: db = sum(dZ, axis=0)
    this.biasGrad = this.sumColumns(dZ);

    // Gradient for input: dX = dZ @ W^T
    const weightsT = this.transpose(this.weights);
    return Matrix.multiply(dZ, weightsT);
  }

  private transpose(m: Matrix): Matrix {
    const result: number[][] = [];
    for (let j = 0; j < m.cols; j++) {
      result[j] = [];
      for (let i = 0; i < m.rows; i++) {
        result[j][i] = m.data[i][j];
      }
    }
    return new Matrix(result);
  }

  private sumColumns(m: Matrix): number[] {
    const result = new Array(m.cols).fill(0);
    for (let i = 0; i < m.rows; i++) {
      for (let j = 0; j < m.cols; j++) {
        result[j] += m.data[i][j];
      }
    }
    return result;
  }
}
```
Gradient Descent Optimization#
Once we have gradients, we update weights to minimize loss.
Vanilla Gradient Descent#
$$\theta \leftarrow \theta - \eta \,\nabla_\theta L$$

Where $\eta$ is the learning rate.
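For comparison with the Adam optimizer below, a plain gradient descent step for the `Matrix`-based weights above is only a few lines. This is a minimal sketch; the `sgdStep` function name is illustrative and not part of the classes defined earlier.

```typescript
// Vanilla gradient descent: move each weight a small step against its gradient.
function sgdStep(weights: Matrix, gradients: Matrix, learningRate: number): void {
  for (let i = 0; i < weights.rows; i++) {
    for (let j = 0; j < weights.cols; j++) {
      weights.data[i][j] -= learningRate * gradients.data[i][j];
    }
  }
}
```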
Adam Optimizer#
Adam combines momentum (a running mean of gradients) and adaptive learning rates (a running mean of squared gradients):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
```typescript
class AdamOptimizer {
  learningRate: number;
  beta1: number;
  beta2: number;
  epsilon: number;
  t: number = 0;

  // Moment estimates for each parameter
  m: Map<string, number[][]> = new Map();
  v: Map<string, number[][]> = new Map();

  constructor(
    learningRate = 0.001,
    beta1 = 0.9,
    beta2 = 0.999,
    epsilon = 1e-8
  ) {
    this.learningRate = learningRate;
    this.beta1 = beta1;
    this.beta2 = beta2;
    this.epsilon = epsilon;
  }

  step(paramName: string, weights: Matrix, gradients: Matrix): void {
    // Note: t ticks once per step() call here; a full implementation would
    // increment it once per optimization step across all parameters.
    this.t += 1;

    // Initialize moments on the first use of this parameter
    if (!this.m.has(paramName)) {
      this.m.set(paramName,
        Array(weights.rows).fill(null).map(() =>
          Array(weights.cols).fill(0)
        )
      );
      this.v.set(paramName,
        Array(weights.rows).fill(null).map(() =>
          Array(weights.cols).fill(0)
        )
      );
    }

    const m = this.m.get(paramName)!;
    const v = this.v.get(paramName)!;

    for (let i = 0; i < weights.rows; i++) {
      for (let j = 0; j < weights.cols; j++) {
        const g = gradients.data[i][j];
        // Update biased first moment estimate
        m[i][j] = this.beta1 * m[i][j] + (1 - this.beta1) * g;
        // Update biased second raw moment estimate
        v[i][j] = this.beta2 * v[i][j] + (1 - this.beta2) * g * g;
        // Compute bias-corrected estimates
        const mHat = m[i][j] / (1 - Math.pow(this.beta1, this.t));
        const vHat = v[i][j] / (1 - Math.pow(this.beta2, this.t));
        // Update parameters
        weights.data[i][j] -= this.learningRate * mHat / (Math.sqrt(vHat) + this.epsilon);
      }
    }
  }
}
```
Putting It All Together#
Here's a complete forward and backward pass. The sketch below wires together the pieces defined above; the toy data, the single-layer setup, and the `'layer1.weights'` label are illustrative choices, not a prescribed architecture.
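```typescript
// A minimal sketch of one training step using Matrix, DenseLayer, sigmoid,
// mse, and AdamOptimizer from above. Data, shapes, and labels are illustrative.

// Toy batch: 2 samples, 3 features each, with one target value per sample
const inputs = new Matrix([
  [0.5, -1.2,  0.3],
  [1.0,  0.7, -0.4],
]);
const targets = [0.0, 1.0];

// A single dense layer: 3 inputs -> 1 output, sigmoid activation
const layer = new DenseLayer(
  new Matrix([[0.1], [-0.2], [0.05]]), // weights, shape (3, 1)
  [0.0],                               // bias, shape (1,)
  sigmoid,
  sigmoidDerivative
);
const optimizer = new AdamOptimizer();

// Forward pass
const cache = {} as LayerCache; // filled in by forward()
const output = layer.forward(inputs, cache);
const predicted = output.data.map(row => row[0]);
console.log('loss:', mse(predicted, targets));

// Backward pass: start from dL/dY and push it back through the layer
const gradOutput = new Matrix(mseGradient(predicted, targets).map(g => [g]));
layer.backward(gradOutput, cache);

// Parameter update (the bias would be updated the same way under its own label)
optimizer.step('layer1.weights', layer.weights, layer.weightsGrad!);
```

In a real training loop you would repeat this over many batches, stack more layers by feeding each layer's returned input gradient into the previous layer's `backward()`, and update the biases alongside the weights.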
Further Reading#
- Deep Learning Book by Goodfellow, Bengio, and Courville
- CS231n: Convolutional Neural Networks - Stanford course
- 3Blue1Brown Neural Networks - Excellent visualizations
Understanding the math empowers you to debug models, design new architectures, and push the boundaries of what's possible with neural networks.



