The Mathematics Behind Neural Networks

Abstract visualization of neural network connections

Neural networks might seem like magic, but they're built on elegant mathematical foundations. In this article, we'll break down the core math concepts that make deep learning work.

The Neuron: A Mathematical Model#

At its core, a single neuron performs a simple computation:

$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right) = f(\mathbf{w}^T \mathbf{x} + b)$$

Where:

  • $\mathbf{x}$ is the input vector
  • $\mathbf{w}$ is the weight vector
  • $b$ is the bias term
  • $f$ is the activation function
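
As a concrete sketch of that formula (not part of the article's later code), here is the computation for a single neuron in TypeScript, with the activation function passed in as a parameter:

TypeScript
// y = f(w·x + b) for one neuron
function neuronOutput(
    x: number[],
    w: number[],
    b: number,
    f: (z: number) => number
): number {
    // Weighted sum of inputs: w1*x1 + w2*x2 + ... + wn*xn
    const weightedSum = w.reduce((acc, wi, i) => acc + wi * x[i], 0);
    return f(weightedSum + b);
}

// Example with two inputs and the identity activation
const y = neuronOutput([0.5, -1.0], [0.8, 0.2], 0.1, z => z); // ≈ 0.3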

Activation Functions#

Activation functions introduce non-linearity, allowing neural networks to learn complex patterns.

Sigmoid Function#

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The sigmoid squashes any input to a value between 0 and 1:

TypeScript
function sigmoid(x: number): number {
    return 1 / (1 + Math.exp(-x));
}

function sigmoidDerivative(x: number): number {
    const s = sigmoid(x);
    return s * (1 - s);
}

ReLU (Rectified Linear Unit)#

$$\text{ReLU}(x) = \max(0, x)$$

ReLU is computationally efficient and helps mitigate the vanishing gradient problem:

TypeScript
function relu(x: number): number {
    return Math.max(0, x);
}

function reluDerivative(x: number): number {
    return x > 0 ? 1 : 0;
}

Softmax Function#

For multi-class classification, softmax converts raw scores (logits) into probabilities:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$$

The outputs sum to 1, making them interpretable as probabilities:

TypeScript
function softmax(x: number[]): number[] {
    const maxVal = Math.max(...x);
    const expValues = x.map(val => Math.exp(val - maxVal)); // Numerical stability
    const sumExp = expValues.reduce((a, b) => a + b, 0);
    return expValues.map(val => val / sumExp);
}
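
As a quick sanity check that the outputs form a valid probability distribution, here is the function above applied to a small example (values rounded):

TypeScript
const probs = softmax([2.0, 1.0, 0.1]);
// probs ≈ [0.66, 0.24, 0.10]
console.log(probs.reduce((a, b) => a + b, 0)); // 1 (up to floating-point error)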

Matrix Operations in Neural Networks#

Neural networks process data in batches using matrix operations. For a layer with $n$ inputs and $m$ outputs:

$$\mathbf{Y} = f(\mathbf{X} \mathbf{W} + \mathbf{b})$$

Where:

  • $\mathbf{X}$ is the input matrix of shape $(\text{batch}, n)$
  • $\mathbf{W}$ is the weight matrix of shape $(n, m)$
  • $\mathbf{b}$ is the bias vector of shape $(m,)$
  • $\mathbf{Y}$ is the output matrix of shape $(\text{batch}, m)$

Implementation#

TypeScript
class Matrix {
    data: number[][];
    rows: number;
    cols: number;

    constructor(data: number[][]) {
        this.data = data;
        this.rows = data.length;
        this.cols = data[0].length;
    }

    // Matrix multiplication
    static multiply(a: Matrix, b: Matrix): Matrix {
        if (a.cols !== b.rows) {
            throw new Error('Incompatible dimensions');
        }

        const result: number[][] = [];
        for (let i = 0; i < a.rows; i++) {
            result[i] = [];
            for (let j = 0; j < b.cols; j++) {
                let sum = 0;
                for (let k = 0; k < a.cols; k++) {
                    sum += a.data[i][k] * b.data[k][j];
                }
                result[i][j] = sum;
            }
        }
        return new Matrix(result);
    }

    // Element-wise addition
    static add(a: Matrix, b: Matrix | number[]): Matrix {
        const result: number[][] = [];
        for (let i = 0; i < a.rows; i++) {
            result[i] = [];
            for (let j = 0; j < a.cols; j++) {
                const bVal = Array.isArray(b) ? b[j] : b.data[i][j];
                result[i][j] = a.data[i][j] + bVal;
            }
        }
        return new Matrix(result);
    }

    // Apply function element-wise
    map(fn: (x: number) => number): Matrix {
        const result = this.data.map(row => row.map(fn));
        return new Matrix(result);
    }
}
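
With this class, the layer equation $\mathbf{Y} = f(\mathbf{X}\mathbf{W} + \mathbf{b})$ becomes a one-liner. The sketch below assumes the sigmoid function defined earlier and uses arbitrary example values:

TypeScript
// Batch of 2 samples, 3 inputs, 2 outputs
const X = new Matrix([[1, 2, 3], [4, 5, 6]]);               // shape (2, 3)
const W = new Matrix([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]); // shape (3, 2)
const b = [0.01, 0.02];                                     // shape (2,)

// Y = f(XW + b), applying sigmoid element-wise
const Y = Matrix.add(Matrix.multiply(X, W), b).map(sigmoid); // shape (2, 2)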

Loss Functions#

Loss functions measure how wrong our predictions are. The goal of training is to minimize this loss.

Mean Squared Error (MSE)#

For regression tasks:

$$\mathcal{L}_{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

TypeScript
function mse(predicted: number[], actual: number[]): number {
    const n = predicted.length;
    let sum = 0;
    for (let i = 0; i < n; i++) {
        sum += Math.pow(actual[i] - predicted[i], 2);
    }
    return sum / n;
}

function mseGradient(predicted: number[], actual: number[]): number[] {
    const n = predicted.length;
    return predicted.map((p, i) => (2 / n) * (p - actual[i]));
}

Cross-Entropy Loss#

For classification tasks:

$$\mathcal{L}_{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$

Where $y_i$ is 1 for the correct class and 0 otherwise (one-hot encoding).

TypeScript
function crossEntropy(predicted: number[], actual: number[]): number {
    let loss = 0;
    for (let i = 0; i < predicted.length; i++) {
        if (actual[i] === 1) {
            loss -= Math.log(predicted[i] + 1e-15); // Add epsilon for stability
        }
    }
    return loss;
}
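
In practice, cross-entropy is usually applied to softmax outputs. A small sketch combining the two functions defined above, with a hypothetical one-hot target for class 0:

TypeScript
const logits = [2.0, 1.0, 0.1];
const predicted = softmax(logits);   // ≈ [0.66, 0.24, 0.10]
const actual = [1, 0, 0];            // one-hot: class 0 is the correct class

const loss = crossEntropy(predicted, actual); // -log(0.66) ≈ 0.42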

Backpropagation#

Backpropagation computes gradients using the chain rule, allowing us to update weights efficiently.

The Chain Rule#

For a composite function $f(g(x))$:

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$
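
As a quick worked example (a sketch, separate from the network code), take $h(x) = \sin(x)^2$: the chain rule gives $h'(x) = 2\sin(x)\cos(x)$, which we can verify against a finite-difference approximation:

TypeScript
// h(x) = f(g(x)) with f(u) = u² and g(x) = sin(x)
const g = (x: number) => Math.sin(x);
const h = (x: number) => Math.pow(g(x), 2);

// Chain rule: dh/dx = df/dg · dg/dx = 2·sin(x) · cos(x)
const analytic = (x: number) => 2 * g(x) * Math.cos(x);

// Central finite-difference approximation
const numeric = (x: number, eps = 1e-6) => (h(x + eps) - h(x - eps)) / (2 * eps);

console.log(analytic(1.2), numeric(1.2)); // both ≈ 0.675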

Computing Gradients#

Consider a simple two-layer network:


For each layer, we compute:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \frac{\partial \mathcal{L}}{\partial \mathbf{Y}} \cdot \frac{\partial \mathbf{Y}}{\partial \mathbf{Z}} \cdot \frac{\partial \mathbf{Z}}{\partial \mathbf{W}}$$

Where $\mathbf{Z} = \mathbf{X}\mathbf{W} + \mathbf{b}$ (pre-activation) and $\mathbf{Y} = f(\mathbf{Z})$ (post-activation).

Implementation#

TypeScript
interface LayerCache {
    input: Matrix;
    preActivation: Matrix;
    output: Matrix;
}

class DenseLayer {
    weights: Matrix;
    bias: number[];
    activation: (x: number) => number;
    activationDerivative: (x: number) => number;

    // Gradient storage
    weightsGrad: Matrix | null = null;
    biasGrad: number[] | null = null;

    forward(input: Matrix, cache: LayerCache): Matrix {
        cache.input = input;

        // Z = XW + b
        const z = Matrix.add(Matrix.multiply(input, this.weights), this.bias);
        cache.preActivation = z;

        // Y = f(Z)
        const output = z.map(this.activation);
        cache.output = output;

        return output;
    }

    backward(gradOutput: Matrix, cache: LayerCache): Matrix {
        // Gradient through activation: dZ = dY * f'(Z)
        const activationGrad = cache.preActivation.map(this.activationDerivative);
        const dZ = new Matrix(
            gradOutput.data.map((row, i) =>
                row.map((val, j) => val * activationGrad.data[i][j])
            )
        );

        // Gradient for weights: dW = X^T @ dZ
        const inputT = this.transpose(cache.input);
        this.weightsGrad = Matrix.multiply(inputT, dZ);

        // Gradient for bias: db = sum(dZ, axis=0)
        this.biasGrad = this.sumColumns(dZ);

        // Gradient for input: dX = dZ @ W^T
        const weightsT = this.transpose(this.weights);
        return Matrix.multiply(dZ, weightsT);
    }

    private transpose(m: Matrix): Matrix {
        const result: number[][] = [];
        for (let j = 0; j < m.cols; j++) {
            result[j] = [];
            for (let i = 0; i < m.rows; i++) {
                result[j][i] = m.data[i][j];
            }
        }
        return new Matrix(result);
    }

    private sumColumns(m: Matrix): number[] {
        const result = new Array(m.cols).fill(0);
        for (let i = 0; i < m.rows; i++) {
            for (let j = 0; j < m.cols; j++) {
                result[j] += m.data[i][j];
            }
        }
        return result;
    }
}

Gradient Descent Optimization#

Once we have gradients, we update weights to minimize loss.

Vanilla Gradient Descent#

$$\mathbf{W}_{t+1} = \mathbf{W}_t - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}}$$

Where $\eta$ is the learning rate.
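
A minimal sketch of this update applied in place to a weight matrix, assuming the Matrix class from earlier:

TypeScript
function gradientDescentStep(weights: Matrix, grad: Matrix, learningRate: number): void {
    for (let i = 0; i < weights.rows; i++) {
        for (let j = 0; j < weights.cols; j++) {
            // W ← W − η · ∂L/∂W, element-wise
            weights.data[i][j] -= learningRate * grad.data[i][j];
        }
    }
}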

Adam Optimizer#

Adam combines momentum and adaptive learning rates:

$$
\begin{align}
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1-\beta_2^t} \\
W_{t+1} &= W_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
\end{align}
$$

Where $g_t$ is the gradient of the loss with respect to the parameter at step $t$.

TypeScript
class AdamOptimizer {
    learningRate: number;
    beta1: number;
    beta2: number;
    epsilon: number;
    // Timestep per parameter, used for bias correction
    t: Map<string, number> = new Map();

    // Moment estimates for each parameter
    m: Map<string, number[][]> = new Map();
    v: Map<string, number[][]> = new Map();

    constructor(
        learningRate = 0.001,
        beta1 = 0.9,
        beta2 = 0.999,
        epsilon = 1e-8
    ) {
        this.learningRate = learningRate;
        this.beta1 = beta1;
        this.beta2 = beta2;
        this.epsilon = epsilon;
    }

    step(paramName: string, weights: Matrix, gradients: Matrix): void {
        // Each parameter keeps its own update count so bias correction stays correct
        const t = (this.t.get(paramName) ?? 0) + 1;
        this.t.set(paramName, t);

        // Initialize moments if first time
        if (!this.m.has(paramName)) {
            this.m.set(paramName,
                Array(weights.rows).fill(null).map(() =>
                    Array(weights.cols).fill(0)
                )
            );
            this.v.set(paramName,
                Array(weights.rows).fill(null).map(() =>
                    Array(weights.cols).fill(0)
                )
            );
        }

        const m = this.m.get(paramName)!;
        const v = this.v.get(paramName)!;

        for (let i = 0; i < weights.rows; i++) {
            for (let j = 0; j < weights.cols; j++) {
                const g = gradients.data[i][j];

                // Update biased first moment estimate
                m[i][j] = this.beta1 * m[i][j] + (1 - this.beta1) * g;

                // Update biased second raw moment estimate
                v[i][j] = this.beta2 * v[i][j] + (1 - this.beta2) * g * g;

                // Compute bias-corrected estimates
                const mHat = m[i][j] / (1 - Math.pow(this.beta1, t));
                const vHat = v[i][j] / (1 - Math.pow(this.beta2, t));

                // Update parameters
                weights.data[i][j] -= this.learningRate * mHat / (Math.sqrt(vHat) + this.epsilon);
            }
        }
    }
}

Putting It All Together#

Here's a complete forward and backward pass:

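The sketch below wires the pieces together using the DenseLayer, Matrix, AdamOptimizer, mse, relu, and reluDerivative definitions from earlier sections. The weight values and the identity output activation are arbitrary illustrative choices, and bias updates are omitted for brevity:

TypeScript
// One training step: forward → loss → backward → update
const layer1 = new DenseLayer();
layer1.weights = new Matrix([[0.1, 0.2], [0.3, 0.4]]);  // 2 inputs → 2 hidden units
layer1.bias = [0, 0];
layer1.activation = relu;
layer1.activationDerivative = reluDerivative;

const layer2 = new DenseLayer();
layer2.weights = new Matrix([[0.5], [0.6]]);            // 2 hidden units → 1 output
layer2.bias = [0];
layer2.activation = (z) => z;                           // identity for regression
layer2.activationDerivative = () => 1;

const optimizer = new AdamOptimizer();

const input = new Matrix([[1, 2]]);  // single training example
const target = [3];

// Forward pass (caches keep the values backprop needs)
const cache1 = {} as LayerCache;
const cache2 = {} as LayerCache;
const hidden = layer1.forward(input, cache1);
const output = layer2.forward(hidden, cache2);

// Loss and its gradient (MSE)
const loss = mse(output.data[0], target);
console.log('loss:', loss);
const gradLoss = new Matrix([mseGradient(output.data[0], target)]);

// Backward pass (gradients flow from the output layer back to the input)
const gradHidden = layer2.backward(gradLoss, cache2);
layer1.backward(gradHidden, cache1);

// Parameter updates with Adam
optimizer.step('layer2.weights', layer2.weights, layer2.weightsGrad!);
optimizer.step('layer1.weights', layer1.weights, layer1.weightsGrad!);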

Further Reading#

Understanding the math empowers you to debug models, design new architectures, and push the boundaries of what's possible with neural networks.
