The Mathematics Behind Neural Networks

Abstract visualization of neural network connections

Neural networks might seem like magic, but they're built on elegant mathematical foundations. In this article, we'll break down the core math concepts that make deep learning work.

The Neuron: A Mathematical Model#

At its core, a single neuron performs a simple computation:

$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right) = f(\mathbf{w}^T \mathbf{x} + b)$$

Where:

  • $\mathbf{x}$ is the input vector
  • $\mathbf{w}$ is the weight vector
  • $b$ is the bias term
  • $f$ is the activation function
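
As a concrete sketch of that formula (not part of the article's later code), here is the computation for a single neuron in TypeScript, with the activation function passed in as a parameter:

TypeScript
// y = f(w·x + b) for one neuron
function neuronOutput(
    x: number[],
    w: number[],
    b: number,
    f: (z: number) => number
): number {
    // Weighted sum of inputs: w1*x1 + w2*x2 + ... + wn*xn
    const weightedSum = w.reduce((acc, wi, i) => acc + wi * x[i], 0);
    return f(weightedSum + b);
}

// Example with two inputs and the identity activation
const y = neuronOutput([0.5, -1.0], [0.8, 0.2], 0.1, z => z); // ≈ 0.3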

Activation Functions#

Activation functions introduce non-linearity, allowing neural networks to learn complex patterns.

Sigmoid Function#

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The sigmoid squashes any input to a value between 0 and 1:

TypeScript
function sigmoid(x: number): number {
    return 1 / (1 + Math.exp(-x));
}

function sigmoidDerivative(x: number): number {
    const s = sigmoid(x);
    return s * (1 - s);
}

ReLU (Rectified Linear Unit)#

$$\text{ReLU}(x) = \max(0, x)$$

ReLU is computationally efficient and helps mitigate the vanishing gradient problem:

TypeScript
function relu(x: number): number {
    return Math.max(0, x);
}

function reluDerivative(x: number): number {
    return x > 0 ? 1 : 0;
}

Softmax Function#

For multi-class classification, softmax converts raw scores (logits) into probabilities:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$$

The outputs sum to 1, making them interpretable as probabilities:

TypeScript
function softmax(x: number[]): number[] {
    const maxVal = Math.max(...x);
    const expValues = x.map(val => Math.exp(val - maxVal)); // Numerical stability
    const sumExp = expValues.reduce((a, b) => a + b, 0);
    return expValues.map(val => val / sumExp);
}
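
As a quick sanity check that the outputs form a valid probability distribution, here is the function above applied to a small example (values rounded):

TypeScript
const probs = softmax([2.0, 1.0, 0.1]);
// probs ≈ [0.66, 0.24, 0.10]
console.log(probs.reduce((a, b) => a + b, 0)); // 1 (up to floating-point error)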

Matrix Operations in Neural Networks#

Neural networks process data in batches using matrix operations. For a layer with $n$ inputs and $m$ outputs:

$$\mathbf{Y} = f(\mathbf{X} \mathbf{W} + \mathbf{b})$$

Where:

  • $\mathbf{X}$ is the input matrix of shape $(\text{batch}, n)$
  • $\mathbf{W}$ is the weight matrix of shape $(n, m)$
  • $\mathbf{b}$ is the bias vector of shape $(m,)$
  • $\mathbf{Y}$ is the output matrix of shape $(\text{batch}, m)$

Implementation#

TypeScript
class Matrix {
    data: number[][];
    rows: number;
    cols: number;

    constructor(data: number[][]) {
        this.data = data;
        this.rows = data.length;
        this.cols = data[0].length;
    }

    // Matrix multiplication
    static multiply(a: Matrix, b: Matrix): Matrix {
        if (a.cols !== b.rows) {
            throw new Error('Incompatible dimensions');
        }

        const result: number[][] = [];
        for (let i = 0; i < a.rows; i++) {
            result[i] = [];
            for (let j = 0; j < b.cols; j++) {
                let sum = 0;
                for (let k = 0; k < a.cols; k++) {
                    sum += a.data[i][k] * b.data[k][j];
                }
                result[i][j] = sum;
            }
        }
        return new Matrix(result);
    }

    // Element-wise addition
    static add(a: Matrix, b: Matrix | number[]): Matrix {
        const result: number[][] = [];
        for (let i = 0; i < a.rows; i++) {
            result[i] = [];
            for (let j = 0; j < a.cols; j++) {
                const bVal = Array.isArray(b) ? b[j] : b.data[i][j];
                result[i][j] = a.data[i][j] + bVal;
            }
        }
        return new Matrix(result);
    }

    // Apply function element-wise
    map(fn: (x: number) => number): Matrix {
        const result = this.data.map(row => row.map(fn));
        return new Matrix(result);
    }
}
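
With this class, the layer equation $\mathbf{Y} = f(\mathbf{X}\mathbf{W} + \mathbf{b})$ becomes a one-liner. The sketch below assumes the sigmoid function defined earlier and uses arbitrary example values:

TypeScript
// Batch of 2 samples, 3 inputs, 2 outputs
const X = new Matrix([[1, 2, 3], [4, 5, 6]]);               // shape (2, 3)
const W = new Matrix([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]); // shape (3, 2)
const b = [0.01, 0.02];                                     // shape (2,)

// Y = f(XW + b), applying sigmoid element-wise
const Y = Matrix.add(Matrix.multiply(X, W), b).map(sigmoid); // shape (2, 2)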

Loss Functions#

Loss functions measure how wrong our predictions are. The goal of training is to minimize this loss.

Mean Squared Error (MSE)#

For regression tasks:

$$\mathcal{L}_{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

TypeScript
function mse(predicted: number[], actual: number[]): number {
    const n = predicted.length;
    let sum = 0;
    for (let i = 0; i < n; i++) {
        sum += Math.pow(actual[i] - predicted[i], 2);
    }
    return sum / n;
}

function mseGradient(predicted: number[], actual: number[]): number[] {
    const n = predicted.length;
    return predicted.map((p, i) => (2 / n) * (p - actual[i]));
}

Cross-Entropy Loss#

For classification tasks:

$$\mathcal{L}_{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$

Where $y_i$ is 1 for the correct class and 0 otherwise (one-hot encoding).

TypeScript
function crossEntropy(predicted: number[], actual: number[]): number {
    let loss = 0;
    for (let i = 0; i < predicted.length; i++) {
        if (actual[i] === 1) {
            loss -= Math.log(predicted[i] + 1e-15); // Add epsilon for stability
        }
    }
    return loss;
}
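
In practice, cross-entropy is usually applied to softmax outputs. A small sketch combining the two functions defined above, with a hypothetical one-hot target for class 0:

TypeScript
const logits = [2.0, 1.0, 0.1];
const predicted = softmax(logits);   // ≈ [0.66, 0.24, 0.10]
const actual = [1, 0, 0];            // one-hot: class 0 is the correct class

const loss = crossEntropy(predicted, actual); // -log(0.66) ≈ 0.42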

Backpropagation#

Backpropagation computes gradients using the chain rule, allowing us to update weights efficiently.

The Chain Rule#

For a composite function $f(g(x))$:

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$
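
As a quick worked example (a sketch, separate from the network code), take $h(x) = \sin(x)^2$: the chain rule gives $h'(x) = 2\sin(x)\cos(x)$, which we can verify against a finite-difference approximation:

TypeScript
// h(x) = f(g(x)) with f(u) = u² and g(x) = sin(x)
const g = (x: number) => Math.sin(x);
const h = (x: number) => Math.pow(g(x), 2);

// Chain rule: dh/dx = df/dg · dg/dx = 2·sin(x) · cos(x)
const analytic = (x: number) => 2 * g(x) * Math.cos(x);

// Central finite-difference approximation
const numeric = (x: number, eps = 1e-6) => (h(x + eps) - h(x - eps)) / (2 * eps);

console.log(analytic(1.2), numeric(1.2)); // both ≈ 0.675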

Computing Gradients#

Consider a simple two-layer network:


For each layer, we compute:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \frac{\partial \mathcal{L}}{\partial \mathbf{Y}} \cdot \frac{\partial \mathbf{Y}}{\partial \mathbf{Z}} \cdot \frac{\partial \mathbf{Z}}{\partial \mathbf{W}}$$

Where $\mathbf{Z} = \mathbf{X}\mathbf{W} + \mathbf{b}$ (pre-activation) and $\mathbf{Y} = f(\mathbf{Z})$ (post-activation).

Implementation#

TypeScript
interface LayerCache {
    input: Matrix;
    preActivation: Matrix;
    output: Matrix;
}

class DenseLayer {
    weights: Matrix;
    bias: number[];
    activation: (x: number) => number;
    activationDerivative: (x: number) => number;

    // Gradient storage
    weightsGrad: Matrix | null = null;
    biasGrad: number[] | null = null;

    forward(input: Matrix, cache: LayerCache): Matrix {
        cache.input = input;

        // Z = XW + b
        const z = Matrix.add(Matrix.multiply(input, this.weights), this.bias);
        cache.preActivation = z;

        // Y = f(Z)
        const output = z.map(this.activation);
        cache.output = output;

        return output;
    }

    backward(gradOutput: Matrix, cache: LayerCache): Matrix {
        // Gradient through activation: dZ = dY * f'(Z)
        const activationGrad = cache.preActivation.map(this.activationDerivative);
        const dZ = new Matrix(
            gradOutput.data.map((row, i) =>
                row.map((val, j) => val * activationGrad.data[i][j])
            )
        );

        // Gradient for weights: dW = X^T @ dZ
        const inputT = this.transpose(cache.input);
        this.weightsGrad = Matrix.multiply(inputT, dZ);

        // Gradient for bias: db = sum(dZ, axis=0)
        this.biasGrad = this.sumColumns(dZ);

        // Gradient for input: dX = dZ @ W^T
        const weightsT = this.transpose(this.weights);
        return Matrix.multiply(dZ, weightsT);
    }

    private transpose(m: Matrix): Matrix {
        const result: number[][] = [];
        for (let j = 0; j < m.cols; j++) {
            result[j] = [];
            for (let i = 0; i < m.rows; i++) {
                result[j][i] = m.data[i][j];
            }
        }
        return new Matrix(result);
    }

    private sumColumns(m: Matrix): number[] {
        const result = new Array(m.cols).fill(0);
        for (let i = 0; i < m.rows; i++) {
            for (let j = 0; j < m.cols; j++) {
                result[j] += m.data[i][j];
            }
        }
        return result;
    }
}

Gradient Descent Optimization#

Once we have gradients, we update weights to minimize loss.

Vanilla Gradient Descent#

$$\mathbf{W}_{t+1} = \mathbf{W}_t - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}}$$

Where $\eta$ is the learning rate.
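
A minimal sketch of this update applied in place to a weight matrix, assuming the Matrix class from earlier:

TypeScript
function gradientDescentStep(weights: Matrix, grad: Matrix, learningRate: number): void {
    for (let i = 0; i < weights.rows; i++) {
        for (let j = 0; j < weights.cols; j++) {
            // W ← W − η · ∂L/∂W, element-wise
            weights.data[i][j] -= learningRate * grad.data[i][j];
        }
    }
}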

Adam Optimizer#

Adam combines momentum and adaptive learning rates:

$$
\begin{align}
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1-\beta_2^t} \\
W_{t+1} &= W_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
\end{align}
$$

Where $g_t$ is the gradient of the loss with respect to the parameter at step $t$.

TypeScript
class AdamOptimizer {
    learningRate: number;
    beta1: number;
    beta2: number;
    epsilon: number;
    // Timestep per parameter, used for bias correction
    t: Map<string, number> = new Map();

    // Moment estimates for each parameter
    m: Map<string, number[][]> = new Map();
    v: Map<string, number[][]> = new Map();

    constructor(
        learningRate = 0.001,
        beta1 = 0.9,
        beta2 = 0.999,
        epsilon = 1e-8
    ) {
        this.learningRate = learningRate;
        this.beta1 = beta1;
        this.beta2 = beta2;
        this.epsilon = epsilon;
    }

    step(paramName: string, weights: Matrix, gradients: Matrix): void {
        // Each parameter keeps its own update count so bias correction stays correct
        const t = (this.t.get(paramName) ?? 0) + 1;
        this.t.set(paramName, t);

        // Initialize moments if first time
        if (!this.m.has(paramName)) {
            this.m.set(paramName,
                Array(weights.rows).fill(null).map(() =>
                    Array(weights.cols).fill(0)
                )
            );
            this.v.set(paramName,
                Array(weights.rows).fill(null).map(() =>
                    Array(weights.cols).fill(0)
                )
            );
        }

        const m = this.m.get(paramName)!;
        const v = this.v.get(paramName)!;

        for (let i = 0; i < weights.rows; i++) {
            for (let j = 0; j < weights.cols; j++) {
                const g = gradients.data[i][j];

                // Update biased first moment estimate
                m[i][j] = this.beta1 * m[i][j] + (1 - this.beta1) * g;

                // Update biased second raw moment estimate
                v[i][j] = this.beta2 * v[i][j] + (1 - this.beta2) * g * g;

                // Compute bias-corrected estimates
                const mHat = m[i][j] / (1 - Math.pow(this.beta1, t));
                const vHat = v[i][j] / (1 - Math.pow(this.beta2, t));

                // Update parameters
                weights.data[i][j] -= this.learningRate * mHat / (Math.sqrt(vHat) + this.epsilon);
            }
        }
    }
}

Putting It All Together#

Here's a complete forward and backward pass:

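The sketch below wires the pieces together using the DenseLayer, Matrix, AdamOptimizer, mse, relu, and reluDerivative definitions from earlier sections. The weight values and the identity output activation are arbitrary illustrative choices, and bias updates are omitted for brevity:

TypeScript
// One training step: forward → loss → backward → update
const layer1 = new DenseLayer();
layer1.weights = new Matrix([[0.1, 0.2], [0.3, 0.4]]);  // 2 inputs → 2 hidden units
layer1.bias = [0, 0];
layer1.activation = relu;
layer1.activationDerivative = reluDerivative;

const layer2 = new DenseLayer();
layer2.weights = new Matrix([[0.5], [0.6]]);            // 2 hidden units → 1 output
layer2.bias = [0];
layer2.activation = (z) => z;                           // identity for regression
layer2.activationDerivative = () => 1;

const optimizer = new AdamOptimizer();

const input = new Matrix([[1, 2]]);  // single training example
const target = [3];

// Forward pass (caches keep the values backprop needs)
const cache1 = {} as LayerCache;
const cache2 = {} as LayerCache;
const hidden = layer1.forward(input, cache1);
const output = layer2.forward(hidden, cache2);

// Loss and its gradient (MSE)
const loss = mse(output.data[0], target);
console.log('loss:', loss);
const gradLoss = new Matrix([mseGradient(output.data[0], target)]);

// Backward pass (gradients flow from the output layer back to the input)
const gradHidden = layer2.backward(gradLoss, cache2);
layer1.backward(gradHidden, cache1);

// Parameter updates with Adam
optimizer.step('layer2.weights', layer2.weights, layer2.weightsGrad!);
optimizer.step('layer1.weights', layer1.weights, layer1.weightsGrad!);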

Further Reading#

Understanding the math empowers you to debug models, design new architectures, and push the boundaries of what's possible with neural networks.
