Last year, I built a deep learning library from scratch to better understand the inner workings of machine learning. Over time, I realized the value of the project was not just the library itself, but the build process and the lessons that came from it. In an earlier post, I wrote about what I learned from building the library. This post looks at a different angle: the mechanics inside a deep learning library that make learning from data possible. I’ll use my library as a concrete bridge to help build that intuition, especially for readers who haven’t built one before.

In standard machine learning training, a single training step usually looks something like this:

optimizer.zero_grad()
prediction = model(x)
loss = loss_function(prediction, true_label)
loss.backward()
optimizer.step()

The core question is: what data do we need to store and compute so that loss.backward() can produce gradients, and an optimizer can use those gradients to update parameters in a way that reduces the loss over time (a.k.a. learning)?

Let’s go step by step.

1. The Familiar Training Step

Our from-scratch library keeps the user-facing API close to PyTorch on purpose so the surface stays familiar and we can focus on the mechanics underneath. If you have used PyTorch or TensorFlow before, none of this should look strange. The point here is not the API surface. It’s what the library has to do under it. After you understand the core steps that enable the model to learn from data, the high-level APIs will naturally make more sense. Different frameworks wrap these steps differently, but the underlying loop is the same.

The library’s basic value type is Tensor. For now, you can treat a tensor as “an array-like value that may also carry gradient-tracking metadata.” Section 4 will unpack that in detail.

2. Why Call loss.backward() At All

A loss is just a single number that says, in some way, how wrong the model currently is. In the tiny example below, error ** 2 turns “prediction versus target” into one scalar error.

from autograd.tensor import Tensor
from autograd.playground_helpers import grad_value

x = Tensor(2.0, requires_grad=False)
w = Tensor(3.0, requires_grad=True)
b = Tensor(1.0, requires_grad=True)
y_true = Tensor(8.0, requires_grad=False)

# Try changing y_true or one of the requires_grad flags
prod = x * w
pred = prod + b
error = pred - y_true
loss = error ** 2

print("Forward:")
print(f" prod = {prod.data}")
print(f" pred = {pred.data}")
print(f" error = {error.data}")
print(f" loss = {loss.data}")

print("\nBackward:")
try:
    loss.backward()
except Exception as error:
    print(f" backward failed: {error}")
else:
    print(f" pred.grad = {grad_value(pred)}")
    print(f" w.grad = {grad_value(w)}")
    print(f" b.grad = {grad_value(b)}")

We usually call backward on a scalar loss because that gives backpropagation one clear starting point: the gradient of the loss with respect to itself is 1. That scalar tells you how the model is doing, but it does not yet tell you how to improve w and b, the (trainable) weights of the model.

To update parameters, you need a more specific question answered:

If I nudge each trainable parameter a little, how does the loss change?

That quantity is the gradient.

It helps to separate the forward and backward passes:

  • in the forward pass (e.g. model() or model.forward() or even just a + b), you run the model on some inputs to produce pred, then keep going until that prediction is turned into a scalar loss. And yes, a simple math operation like a + b is considered a “forward pass” by the deep learning library.
  • in the backward pass, you start at loss and walk backward through the same chain, asking how a small change in each earlier trainable parameter would change that loss

More concretely, in the running example above, the forward values are prod = 6, pred = 7, error = -1, and loss = 1. In that forward pass (i.e. x * w + b), the model predicts 7 while the target (i.e. y_true) is 8, so the squared error is 1.

What does requires_grad mean?

Not every tensor/parameter is trainable. w and b are trainable because they are the quantities we want to update. x and y_true are non-trainable in this example because they are givens of the current training example, not knobs the optimizer should tune, even though they still affect the loss.

For trainable leaf tensors, part of that bookkeeping is a .grad field. After loss.backward(), .grad stores the gradient of the loss with respect to that tensor. The optimizer later reads those stored gradients on trainable leaves like w and b. As a teaching choice, this library also keeps .grad on intermediate tensors with requires_grad=True, so pred.grad is available here even though PyTorch would usually require retain_grad() for a non-leaf tensor. In the running example above, loss.backward() gives pred.grad = -2, w.grad = -4, and b.grad = -2.
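Those gradient values can be sanity-checked with a finite-difference approximation, using plain Python floats (a sketch independent of the library; loss_fn simply mirrors the running example's math):

```python
# Nudge each parameter a little and watch the loss change.
# loss = (x * w + b - y_true) ** 2 with x = 2, y_true = 8.
def loss_fn(w, b, x=2.0, y_true=8.0):
    return (x * w + b - y_true) ** 2

eps = 1e-6
base = loss_fn(3.0, 1.0)

# (f(w + eps) - f(w)) / eps approximates the partial derivative
w_grad_approx = (loss_fn(3.0 + eps, 1.0) - base) / eps
b_grad_approx = (loss_fn(3.0, 1.0 + eps) - base) / eps

print(round(w_grad_approx, 3), round(b_grad_approx, 3))  # -4.0 -2.0
```

The approximations land on the same -4 and -2 that backward computes exactly, which is a handy way to test an autograd implementation.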

So trainable (requires_grad=True) versus non-trainable (requires_grad=False) is not just a labeling detail. requires_grad controls gradient bookkeeping, while optimizers usually update the parameter set you hand them. In many frameworks, frozen parameters are practically protected because autograd leaves .grad=None, and optimizers skip parameters whose gradient is None.

So loss.backward() gives us stored gradients on trainable tensors like w and b. Before we look at how the library computes those gradients, it helps to see why they are useful at all.

3. Why An Update Actually Helps

So far, loss.backward() may still feel like bookkeeping. The important point is that a gradient is a direction signal.

In the tiny example above, the model predicted 7 while the target was 8, so the loss was 1. After loss.backward(), the running example gives w.grad = -4 and b.grad = -2.

Re-run the scalar example to see how one gradient step changes prediction and loss:

from autograd.playground_helpers import print_running_example_update

learning_rate = 0.1
y_true_value = 8.0

# Try changing learning_rate or y_true_value to see how the update responds
print_running_example_update(learning_rate=learning_rate, y_true_value=y_true_value)

What do those numbers mean?

  • w.grad = -4 means: if w increases a little, the loss goes down
  • b.grad = -2 means: if b increases a little, the loss also goes down

That is why gradient descent subtracts the gradient:

w = w - learning_rate * w.grad
b = b - learning_rate * b.grad

If we use learning_rate = 0.1, then:

  • w_new = 3.0 - 0.1 * (-4) = 3.4
  • b_new = 1.0 - 0.1 * (-2) = 1.2

Now the prediction becomes:

pred_new = 2.0 * 3.4 + 1.2  # 8.0

So the loss drops from 1.0 to 0.0.
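The same arithmetic can be verified with plain Python floats, outside the library (variable names mirror the running example):

```python
# Forward pass with the original parameters: pred = x * w + b
x, w, b, y_true = 2.0, 3.0, 1.0, 8.0
learning_rate = 0.1

loss_before = (x * w + b - y_true) ** 2  # (7 - 8)^2 = 1.0

# Gradients from the running example
w_grad, b_grad = -4.0, -2.0

# One gradient descent step: subtract the gradient
w_new = w - learning_rate * w_grad  # 3.0 - 0.1 * (-4) = 3.4
b_new = b - learning_rate * b_grad  # 1.0 - 0.1 * (-2) = 1.2

loss_after = (x * w_new + b_new - y_true) ** 2  # (8 - 8)^2 = 0.0
print(loss_before, loss_after)  # 1.0 0.0
```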

This tiny example is the mechanical core of gradient-based learning:

  • the loss tells you how wrong the current model is
  • the gradient tells you how each trainable parameter should move to reduce that error
  • subtracting the gradient moves each parameter in the local direction that reduces the loss

That is why gradients are useful. The next question is how the library computes them in the first place.

4. Values Remember Where They Came From

Let’s revisit the central object Tensor, because every forward pass in the library is built out of tensors.

Here is a leaf tensor that was created directly rather than produced by another operation. That is enough to turn ordinary math into tracked math.

from autograd.tensor import Tensor

w = Tensor(3.0, requires_grad=True)

print(w.data)           # 3.0
print(w.data.shape)     # ()
print(w.requires_grad)  # True
print(w.creator)        # None: leaf tensor
print(w.grad)           # None until backward runs

So how are Tensors created?

First, recall that a forward pass can be as simple as an arithmetic operation like a + b. Arithmetic on tensors does not mutate one tensor into another. Each operation produces a new tensor:

x = Tensor(2.0, requires_grad=False)
w = Tensor(3.0, requires_grad=True)
b = Tensor(1.0, requires_grad=True)
y_true = Tensor(8.0, requires_grad=False)

prod = x * w
pred = prod + b
error = pred - y_true
loss = error ** 2

print(prod is w)   # False
print(pred is prod)  # False
print(error is pred) # False
print(loss is pred)  # False

So loss was not created directly with Tensor(...). It was created indirectly by calling tensor operations (e.g. the ** operator is equivalent to __pow__() in Python), which each returned a fresh tensor. In the same way, error was produced from pred and y_true, pred was produced from prod and b, and prod was produced from x and w.

In the running training example from Section 1, that gives us a compressed dependency chain:

%%{init: {'flowchart': {'rankSpacing': 16, 'nodeSpacing': 10, 'diagramPadding': 2}}}%%
flowchart LR
    x["x"] --> prod["prod =<br/>x * w"]
    w["w"] --> prod
    prod --> pred["pred =<br/>prod + b"]
    b["b"] --> pred
    pred --> error["error =<br/>pred - y_true"]
    y_true["y_true"] --> error
    error --> loss["loss =<br/>error ** 2"]

This dependency chain is the computational graph. The graph is built dynamically from the tensor operations you run in each forward pass (e.g. x * w). That means a tensor remembers where it came from.

From an implementation perspective, we intentionally added a creator attribute in the Tensor class. That way, loss does not need to know every ancestor directly. It only knows its own creator, and that creator stores the input tensors for that one operation.
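A stripped-down sketch of that pattern (a hypothetical Value/Mul pair, not the library's real Tensor and operation classes) might look like:

```python
# Minimal sketch of creator tracking: each operation returns a fresh value
# that remembers the operation (and the inputs) that produced it.
class Value:
    def __init__(self, data, creator=None):
        self.data = data
        self.creator = creator  # None for leaf values

    def __mul__(self, other):
        return Value(self.data * other.data, creator=Mul(self, other))

class Mul:
    def __init__(self, a, b):
        self.inputs = (a, b)  # the operation stores its own inputs

x = Value(2.0)
w = Value(3.0)
prod = x * w

print(prod.data)                        # 6.0
print(prod.creator.__class__.__name__)  # Mul
print(prod.creator.inputs[0] is x)      # True
```

The result only knows its own creator, and the creator only knows its own inputs; chaining those local links is what forms the graph.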

Inspect creator links and the dependency tree for the same scalar example:

from autograd.playground_helpers import build_running_example, print_graph_structure

example = build_running_example()
print_graph_structure(example["loss"], example["names"])

Under the hood, the raw creator graph has a few extra helper nodes because subtraction is defined as “add-plus-negate” in __sub__(), and error ** 2 is defined as error * error. The printed tree above collapses unnamed helper nodes, but that literal expansion is still the first gradient accumulation case in this example: backward sends two contributions into error, and those contributions have to be added together.

Now the role of creator becomes more concrete:

  • leaf tensors are usually inputs, parameters, or targets
  • intermediate tensors are graph nodes produced by operations
  • a tensor with no creator is a boundary where the graph stops

Therefore, a tensor is not only a number container here. It is also a node in a computational graph.

5. What One Operation Does During Backward

Section 4 explained the graph structure. Now ignore the whole graph for a moment and focus on one node.

At one node, the question is simple:

If I already know how the loss changes with this operation’s output, what gradients should this operation send to its inputs?

backward() for one operation only has one job: take an incoming gradient and turn it into gradients for the inputs.

During backward, every operation follows the same pattern:

  1. receive a gradient at its output
  2. apply its own local derivative
  3. return one gradient per input

In shorthand:

gradient to an input = incoming gradient * local derivative

Take the add node pred = prod + b from the running example.

Suppose backward() has already reached this node and told us:

\[\frac{\partial \text{loss}}{\partial \text{pred}} = -2\]

And now we need the local derivatives of pred with respect to prod and b. For the addition operator, each input affects the output one-for-one, so:

\[\frac{\partial \text{pred}}{\partial \text{prod}} = 1, \qquad \frac{\partial \text{pred}}{\partial b} = 1\]

So this node sends back:

\[\frac{\partial \text{loss}}{\partial \text{prod}} = \frac{\partial \text{loss}}{\partial \text{pred}} \cdot \frac{\partial \text{pred}}{\partial \text{prod}} = -2 \cdot 1 = -2\] \[\frac{\partial \text{loss}}{\partial b} = \frac{\partial \text{loss}}{\partial \text{pred}} \cdot \frac{\partial \text{pred}}{\partial b} = -2 \cdot 1 = -2\]

Now look at the multiply node prod = x * w.

If the incoming gradient is:

\[\frac{\partial \text{loss}}{\partial \text{prod}} = -2\]

then the local derivatives are:

\[\frac{\partial \text{prod}}{\partial w} = x, \qquad \frac{\partial \text{prod}}{\partial x} = w\]

With x = 2 and w = 3, the local rule gives:

\[\frac{\partial \text{loss}}{\partial w} = \frac{\partial \text{loss}}{\partial \text{prod}} \cdot \frac{\partial \text{prod}}{\partial w} = -2 \cdot 2 = -4\] \[\frac{\partial \text{loss}}{\partial x} = \frac{\partial \text{loss}}{\partial \text{prod}} \cdot \frac{\partial \text{prod}}{\partial x} = -2 \cdot 3 = -6\]

Conceptually, that multiply node sends one contribution toward w and one toward x. In this example, only w’s gradient is stored because x.requires_grad=False.

Let’s think through what each backward operation actually needs. Add.backward(...) only needs the incoming gradient and the two inputs it added. Mul.backward(...) only needs the incoming gradient and the two inputs it multiplied. No single operation needs to understand the whole graph.
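That locality can be sketched as two free functions (hypothetical names; the library's actual Add and Mul classes wrap the same rules):

```python
# Sketch of per-operation backward rules: each takes the incoming gradient
# and returns one gradient per input.
def add_backward(incoming_grad, a, b):
    # d(a + b)/da = 1 and d(a + b)/db = 1
    # (a and b are unused for add, but kept for a uniform signature)
    return incoming_grad * 1.0, incoming_grad * 1.0

def mul_backward(incoming_grad, a, b):
    # d(a * b)/da = b and d(a * b)/db = a
    return incoming_grad * b, incoming_grad * a

# The add node pred = prod + b, with incoming gradient -2
print(add_backward(-2.0, 6.0, 1.0))  # (-2.0, -2.0)

# The multiply node prod = x * w, with x = 2 and w = 3:
# the first value is the gradient toward x, the second toward w
print(mul_backward(-2.0, 2.0, 3.0))  # (-6.0, -4.0)
```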

Once each node can do that locally, the only remaining question is how the library walks the whole graph in the right order.

6. How loss.backward() Walks the Whole Graph

Once each node knows its local rule, loss.backward() is mostly graph traversal.

In the running example, once loss has been computed, the library only has to answer three practical questions:

  1. where to start
  2. which earlier tensors to visit next, since we’re going backwards
  3. how to combine gradients when more than one path reaches the same tensor

From an implementation perspective, that makes loss.backward() roughly:

  1. seed the loss with gradient 1
  2. collect the reachable graph in dependency order (backward)
  3. at each tensor, call its creator’s backward rule
  4. accumulate the returned gradients into the input tensors

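The four steps above can be sketched as a tiny scalar autograd (a hypothetical Scalar class, not the library's Tensor):

```python
# Sketch of the global walk: collect reachable nodes in dependency order,
# seed the output with gradient 1, then apply each node's local rule and
# accumulate the results into its inputs.
class Scalar:
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.parents = parents          # input values of the creating op
        self.local_grads = local_grads  # d(output)/d(input) per parent
        self.grad = 0.0

    def __add__(self, other):
        return Scalar(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Scalar(self.data * other.data, (self, other),
                      (other.data, self.data))

    def backward(self):
        # 1. + 2. collect the reachable graph in dependency order
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v.parents:
                    visit(p)
                order.append(v)
        visit(self)
        # seed the output, then walk backward
        self.grad = 1.0
        for v in reversed(order):
            # 3. + 4. apply the local rule and accumulate into the inputs
            for parent, local in zip(v.parents, v.local_grads):
                parent.grad += v.grad * local

x, w, b = Scalar(2.0), Scalar(3.0), Scalar(1.0)
pred = x * w + b
error = pred + Scalar(-8.0)  # pred - y_true, written as add-plus-negate
loss = error * error         # error ** 2, written as error * error
loss.backward()
print(w.grad, b.grad)  # -4.0 -2.0
```

Because loss's two parents are the same error node, two contributions flow into error and are added together, which is exactly the accumulation case described in the text.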
For the running example, the traversal below is a compressed view of the tensors that require gradients. In the literal graph, loss = error ** 2 expands to error * error, so error is the first place where two gradient contributions meet and must be accumulated.

%%{init: {'flowchart': {'rankSpacing': 18, 'nodeSpacing': 18, 'diagramPadding': 2}}}%%
flowchart TD
    loss --> error --> pred
    pred --> prod --> w
    pred --> b

Because x and y_true were created with requires_grad=False, the walk stops before them.

So the global backward pass is just many local backward steps stitched together.

Trace backward visit order and stored gradients for the same scalar example:

from autograd.playground_helpers import build_running_example, print_backward_walk

example = build_running_example()
print_backward_walk(example["loss"], example["names"], show_grads=("x", "w", "b", "pred", "loss"))

You can inspect the graph links directly:

prod = x * w
pred = prod + b

print(prod.creator.__class__.__name__)  # Mul
print(pred.creator.__class__.__name__)  # Add

So a tensor here is not just data. It is also a pointer to the operation that produced it, and that is enough for the library to walk backward from the final loss to the trainable leaves.

7. So We Have All the Gradients in the Graph. What’s Next?

Now that we have seen how the graph is built and walked backward, we can return to the training loop and separate three jobs that are easy to blur together in practice:

  • loss.backward() computes gradients
  • a concrete optimizer.step() like SGD.step(...) applies an update rule using those gradients
  • optimizer.zero_grad() clears old gradients because gradients accumulate by default

loss.backward() does not change parameters. It walks the graph and writes gradients into the .grad fields of upstream tensors that have requires_grad=True.

The actual parameter update happens later in autograd/optim.py. An optimizer reads each parameter’s stored gradient and mutates the parameter data in place. Fancy optimizers like Adam.step(...) are just different ways to update the parameters based on the gradients and other metadata. For plain SGD.step(...), the idea is:

param.data -= learning_rate * param.grad.data
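A minimal optimizer built around that one line can be sketched as follows (hypothetical SGDSketch and Param classes, not the library's optim.SGD):

```python
# Sketch of plain SGD: read each stored gradient, update the data in place.
class SGDSketch:
    def __init__(self, params, lr):
        self.params = params  # dict of name -> parameter-like objects
        self.lr = lr

    def step(self):
        for param in self.params.values():
            if param.grad is None:  # frozen / untouched parameters are skipped
                continue
            param.data -= self.lr * param.grad

    def zero_grad(self):
        for param in self.params.values():
            param.grad = None

# Tiny parameter stand-in to exercise the sketch
class Param:
    def __init__(self, data, grad=None):
        self.data, self.grad = data, grad

w = Param(3.0, grad=-4.0)
b = Param(1.0, grad=-2.0)
opt = SGDSketch({"w": w, "b": b}, lr=0.1)

opt.step()
print(w.data, b.data)  # 3.4 1.2

opt.zero_grad()
print(w.grad, b.grad)  # None None
```

Note how the step skips parameters whose gradient is None, which is the mechanism that quietly protects frozen parameters in many frameworks.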

What matters here is the lifecycle of those stored gradients. They live on the parameters until something explicitly uses or clears them, so a new forward/backward pass will keep adding into the same .grad fields unless you reset them:

Gradient accumulation until zero_grad():

from autograd.tensor import Tensor
from autograd.optim import SGD

x = Tensor(2.0, requires_grad=False)
w = Tensor(3.0, requires_grad=True)
b = Tensor(1.0, requires_grad=True)
y_true = Tensor(8.0, requires_grad=False)

optimizer = SGD({"w": w, "b": b}, lr=0.1)

loss = ((x * w + b) - y_true) ** 2
loss.backward()
print(w.grad.data, b.grad.data)  # first set of gradients

loss = ((x * w + b) - y_true) ** 2
loss.backward()
print(w.grad.data, b.grad.data)  # gradients have accumulated

optimizer.zero_grad()
print(w.grad, b.grad)  # cleared

Only after gradients have been written into .grad does the optimizer consume them:

optimizer.step()  # reads the stored gradients and updates parameter data

So loss.backward() computes gradients, optimizer.step() mutates parameters using the stored gradients, and optimizer.zero_grad() resets the gradient state so the next backward pass starts fresh.

8. When Tensors Become Models

So far we have been discussing tensors one operation at a time. Real models use the same mechanics, just across many more parameters. How do you package those tensors into reusable layers and models to construct an actual deep learning model? One way to think about this is that “layers” and “models” organize parameters and computations into units that are easier to understand and reuse.

That answer lives in autograd/nn.py.

We have a base Module class that:

  • collects trainable Tensor objects as parameters
  • collects child Module objects as submodules
  • keeps non-trainable arrays as states
  • exposes a flattened parameters view for optimizers
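That collect-and-flatten behavior can be sketched with a toy base class (a hypothetical ModuleSketch, not the library's actual Module):

```python
# Sketch of recursive parameter collection: a module owns parameters and
# child modules, and flattens them into one named view for the optimizer.
class ModuleSketch:
    def __init__(self):
        self._parameters = {}  # name -> tensor-like parameter
        self._modules = {}     # name -> child ModuleSketch

    def __setattr__(self, name, value):
        # assigning a child module registers it as a submodule automatically
        if isinstance(value, ModuleSketch):
            self.__dict__.setdefault("_modules", {})[name] = value
        object.__setattr__(self, name, value)

    @property
    def parameters(self):
        # flatten own parameters plus all child parameters, with dotted names
        flat = dict(self._parameters)
        for child_name, child in self._modules.items():
            for param_name, param in child.parameters.items():
                flat[f"{child_name}.{param_name}"] = param
        return flat

class LinearSketch(ModuleSketch):
    def __init__(self):
        super().__init__()
        # placeholder strings standing in for real weight/bias tensors
        self._parameters = {"weight": "W", "bias": "B"}

class TinyModel(ModuleSketch):
    def __init__(self):
        super().__init__()
        self.layer1 = LinearSketch()  # registered via __setattr__

model = TinyModel()
print(sorted(model.parameters))  # ['layer1.bias', 'layer1.weight']
```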

A freshly constructed nn.Linear(3, 1) registers two named parameters, bias and weight.

A small but important piece of framework magic lives in Module.__setattr__. Assigning a child Module registers it as a submodule automatically. Built-in layers like Linear currently register weight and bias by writing directly into self._parameters[...] inside Linear.__init__(...), while Module.__getattr__(...) only resolves submodules and states.

Later on, the optimizer can directly access the flattened parameters, and perform parameter updates (a.k.a. learning).

You can represent a neural network with tensors alone. In the limit, Module is not mathematically necessary. You could manually keep a dictionary of weights, biases, and states, then write the forward pass directly with tensor operations.

For example, this is already enough to describe a one-layer model:

from autograd.tensor import Tensor

x = Tensor([[-1.0, 0.0, 2.0]], requires_grad=False)  # data tensors usually don't need gradients
y_true = Tensor([[1.0]], requires_grad=False)
w = Tensor([[0.2], [0.1], [-0.3]], requires_grad=True)
b = Tensor([0.0], requires_grad=True)

pred = x @ w + b
loss = ((pred - y_true) ** 2).mean()
loss.backward()

print(f"pred = {pred.data.tolist()}")  # [[-0.8]]
print(f"loss = {loss.data}")  # 3.24
print(f"w.grad = {w.grad.data.tolist()}")
print(f"b.grad = {b.grad.data.tolist()}")

Nothing is missing mathematically. The gradients still work. What is missing is mostly human convenience: naming, grouping, recursive composition, a clean parameter list, and a shared way to talk about larger pieces of a model. That gap becomes apparent when we scale up. More concretely, here is a two-layer model written directly with tensors:

from autograd.backend import xp
from autograd.tensor import Tensor

# same input and target setup
x = Tensor([[-1.0, 0.0, 2.0]], requires_grad=False)  # set True only if you want input gradients
y_true = Tensor([[1.0]], requires_grad=False)

# layer 1 parameters
w1 = Tensor(xp.random.normal(size=(3, 4)), requires_grad=True)
b1 = Tensor(xp.zeros(4), requires_grad=True)
# layer 2 parameters
w2 = Tensor(xp.random.normal(size=(4, 1)), requires_grad=True)
b2 = Tensor(xp.zeros(1), requires_grad=True)

# first affine transform, then the same nonlinearity as functional.relu(...): max(0, z)
hidden = (x @ w1 + b1).maximum(0)
# second affine transform
pred = hidden @ w2 + b2
# the same mean-squared loss as before
loss = ((pred - y_true) ** 2).mean()
loss.backward()

This works fine, but it is starting to get hard to track mentally.

Now here is the same computation packaged with Module:

from autograd import functional, nn
from autograd.tensor import Tensor

# same input and target setup
x = Tensor([[-1.0, 0.0, 2.0]], requires_grad=False)  # set True only if you want input gradients
y_true = Tensor([[1.0]], requires_grad=False)

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # equivalent to w1 and b1
        self.layer1 = nn.Linear(3, 4)
        # equivalent to w2 and b2
        self.layer2 = nn.Linear(4, 1)

    def forward(self, x):
        # same as: hidden = (x @ w1 + b1).maximum(0)
        hidden = functional.relu(self.layer1(x))
        # same as: pred = hidden @ w2 + b2
        return self.layer2(hidden)


model = TinyMLP()
pred = model(x)
# same mean-squared loss as the tensor-only version above
loss = ((pred - y_true) ** 2).mean()
loss.backward()

That is the real motivation for Module: it does not introduce new math. It gives you a cleaner way to organize the same tensor math once the tensor-only version starts getting noisy. In practice, it mostly solves an organization problem: small modules own tensors, larger modules own those modules, and the gradient logic still lives below that layer in tensor operations.

A Linear layer, for example, is just stored parameters plus a forward(…) written in tensor math. Since those tensor operations already know how to compute gradients, a module usually does not need its own backward(). Modules are built out of primitive tensor operations like addition, multiplication, matrix multiplication, reductions like sum or mean, and a few element-wise functions.

That is the core trick. A deep learning library does not need one giant “learning” primitive. It needs local derivative rules at each operation, a way to traverse the graph in reverse, and gradient accumulation when multiple paths meet. loss.backward() handles that global reverse walk; the optimizer turns the accumulated gradients into parameter updates. Once that mechanism is clear, most of the rest of the framework is structure and convenience.

Appendix: Where To Go Next

The main argument ends above. What follows is a short resource section if you want to keep tracing these ideas through shapes and broadcasting, or explore the library in more detail.

Shapes And Broadcasting

Shapes are the next thing to learn because machine learning almost always involves batch dimensions and higher-dimensional tensors/features.

A shape tells you how values are arranged. (2, 3) means two rows and three columns. (3,) means a length-3 vector.

If we represent it in code:

from autograd.tensor import Tensor

x = Tensor(
    [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]],
    requires_grad=False,
)
bias = Tensor(
    [0.1, 0.2, 0.3],
    requires_grad=True,
)

# x.shape == (2, 3)
# bias.shape == (3,)
#
# In x + bias, bias is broadcast across the two rows:
# [[0.1, 0.2, 0.3],
#  [0.1, 0.2, 0.3]]
#
# So out becomes:
# [[1.1, 2.2, 3.3],
#  [4.1, 5.2, 6.3]]
out = x + bias
loss = out.sum()
loss.backward()

print(f"x.shape = {x.data.shape}")
print(f"bias.shape = {bias.data.shape}")
print(f"out.shape = {out.data.shape}")
print(out.data)
print(bias.grad.data)  # [2.0, 2.0, 2.0]

Here, bias starts as a vector of shape (3,), while x is a matrix of shape (2, 3). In the forward pass, broadcasting lets that one bias vector act as if it were reused across both rows.

The important part is what happens in backward. The parameter that actually exists is still bias, and its shape is still (3,). So the library must return a gradient for that original vector’s shape (3,), not for the larger (2, 3) broadcasted view used during forward. In this library, that “reduce back to the original shape” step is handled by Function.unbroadcast(...).

Because loss = out.sum(), each element of out contributes a gradient of 1.0. Each bias value was used once in the first row and once in the second row, so backward adds those contributions together:

[1.0 + 1.0, 1.0 + 1.0, 1.0 + 1.0] = [2.0, 2.0, 2.0]

A useful rule to keep in mind is that if forward expanded a tensor, backward has to reduce the gradient. That is why shapes matter so much in real model code. They are part of what makes a backward pass correct, not just bookkeeping.
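The expand-then-reduce rule can be sketched in NumPy (a simplified stand-in for what Function.unbroadcast(...) does; the helper below is hypothetical):

```python
import numpy as np

# Sketch of reducing a broadcast gradient back to the original shape:
# sum over the axes that broadcasting expanded.
def unbroadcast(grad, original_shape):
    # sum away leading axes that broadcasting added
    while grad.ndim > len(original_shape):
        grad = grad.sum(axis=0)
    # sum over axes where the original size was 1 but grad's is larger
    for axis, size in enumerate(original_shape):
        if size == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

# Gradient arriving at the broadcast bias in x + bias, with loss = out.sum()
grad_out = np.ones((2, 3))  # dloss/dout is all ones
bias_grad = unbroadcast(grad_out, (3,))
print(bias_grad)  # [2. 2. 2.]
```

Each bias element was reused across the two rows in forward, so backward sums the two row contributions back into a shape-(3,) gradient.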

Diving Further into the Code

You don’t need to read the whole library from top to bottom. A short reading path is:

tensor -> nn / functional -> optim -> trainer / examples

or

trainer / examples -> particular tensor operations / optim / nn / functional

If you want to trace these ideas back into the code, these are good places to start: