- 1. The Familiar Training Step
- 2. Why Call loss.backward() At All
- 3. Why An Update Actually Helps
- 4. Values Remember Where They Came From
- 5. What One Operation Does During Backward
- 6. How loss.backward() Walks the Whole Graph
- 7. So We Have All the Gradients in the Graph. What’s Next?
- 8. When Tensors Become Models
- Appendix: Where To Go Next
Last year, I built a deep learning library from scratch to better understand the inner workings of machine learning. Over time, I realized the value of the project was not just the library itself, but the build process and the lessons that came from it. In an earlier post, I wrote about what I learned from building the library. This post looks at a different angle: the mechanics inside a deep learning library that make learning from data possible. I’ll use my library as a concrete bridge to help build that intuition, especially for readers who haven’t built one before.
In standard machine learning training, a training step usually looks something like this.
optimizer.zero_grad()
prediction = model(x)
loss = loss_function(prediction, true_label)
loss.backward()
optimizer.step()
The core question is: what data do we need to store and compute so that loss.backward() can produce gradients, and an optimizer can use those gradients to update parameters in a way that reduces the loss over time (a.k.a. learning)?
Let’s go step by step.
1. The Familiar Training Step
Our from-scratch library keeps the user-facing API close to PyTorch on purpose so the surface stays familiar and we can focus on the mechanics underneath. If you have used PyTorch or TensorFlow before, none of this should look strange. The point here is not the API surface. It’s what the library has to do under it. After you understand the core steps that enable the model to learn from data, the high-level APIs will naturally make more sense. Different frameworks wrap these steps differently, but the underlying loop is the same:
- a model() call via Module.__call__(...)
- a scalar loss loss.backward() call
- an optimizer.step() such as SGD.step(...)
The library’s basic value type is Tensor. For now, you can treat a tensor as “an array-like value that may also carry gradient-tracking metadata.” Section 4 will unpack that in detail.
2. Why Call loss.backward() At All
A loss is just a single number that says, in some way, how wrong the model currently is. In the tiny example below, error ** 2 turns “prediction versus target” into one scalar error.
We usually call backward on a scalar loss because that gives backpropagation one clear starting point: the gradient of the loss with respect to itself is 1. That scalar tells you how the model is doing, but it does not yet tell you how to improve w and b, the (trainable) weights of the model.
To update parameters, you need a more specific question answered:
If I nudge each trainable parameter a little, how does the loss change?
That quantity is the gradient.
It helps to separate the forward and backward passes:
- in the forward pass (e.g. model() or model.forward() or even just a + b), you run the model on some inputs to produce pred, then keep going until that prediction is turned into a scalar loss. And yes, a simple math operation like a + b is considered a “forward pass” by the deep learning library.
- in the backward pass, you start at loss and walk backward through the same chain, asking how a small change in each earlier trainable parameter would change that loss
More concretely, in the running example above, the forward values are prod = 6, pred = 7, error = -1, and loss = 1. In that forward pass (i.e. x * w + b), the model predicts 7 while the target (i.e. y_true) is 8, so the squared error is 1.
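Those forward values are easy to verify in plain Python, ignoring gradient tracking for a moment:

```python
# The running example's forward pass with plain floats (no autograd):
x, w, b, y_true = 2.0, 3.0, 1.0, 8.0

prod = x * w           # 6.0
pred = prod + b        # 7.0
error = pred - y_true  # -1.0
loss = error ** 2      # 1.0
```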
What does requires_grad mean?
Not every tensor/parameter is trainable. w and b are trainable because they are the quantities we want to update. x and y_true are non-trainable in this example because they are givens of the current training example, not knobs the optimizer should tune, even though they still affect the loss.
For trainable leaf tensors, part of that bookkeeping is a .grad field. After loss.backward(), .grad stores the gradient of the loss with respect to that tensor. The optimizer later reads those stored gradients on trainable leaves like w and b. As a teaching choice, this library also keeps .grad on intermediate tensors with requires_grad=True, so pred.grad is available here even though PyTorch would usually require retain_grad() for a non-leaf tensor. In the running example above, loss.backward() gives pred.grad = -2, w.grad = -4, and b.grad = -2.
So trainable (requires_grad=True) versus non-trainable (requires_grad=False) is not just a labeling detail. requires_grad controls gradient bookkeeping, while optimizers usually update the parameter set you hand them. In many frameworks, frozen parameters are practically protected because autograd leaves .grad=None, and optimizers skip parameters whose gradient is None.
So loss.backward() gives us stored gradients on trainable tensors like w and b. Before we look at how the library computes those gradients, it helps to see why they are useful at all.
3. Why An Update Actually Helps
So far, loss.backward() may still feel like bookkeeping. The key point is that a gradient is a direction signal.
In the tiny example above, the model predicted 7 while the target was 8, so the loss was 1. After loss.backward(), the running example gives w.grad = -4 and b.grad = -2.
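Before trusting those numbers, it helps to check them numerically. The sketch below is plain Python (using the example’s values x = 2, w = 3, b = 1, y_true = 8): nudge each parameter a little and watch how the loss moves.

```python
# Finite-difference check of the stored gradients for the scalar example.
def loss_at(w, b, x=2.0, y_true=8.0):
    # loss = (x * w + b - y_true) ** 2, same as the running example
    return (x * w + b - y_true) ** 2

eps = 1e-6
# central differences: nudge one parameter, hold the other fixed
w_grad = (loss_at(3.0 + eps, 1.0) - loss_at(3.0 - eps, 1.0)) / (2 * eps)
b_grad = (loss_at(3.0, 1.0 + eps) - loss_at(3.0, 1.0 - eps)) / (2 * eps)

print(round(w_grad, 4), round(b_grad, 4))  # matches w.grad = -4, b.grad = -2
```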
Re-run the scalar example to see how one gradient step changes prediction and loss
What do those numbers mean?
- w.grad = -4 means: if w increases a little, the loss goes down
- b.grad = -2 means: if b increases a little, the loss also goes down
That is why gradient descent subtracts the gradient:
w = w - learning_rate * w.grad
b = b - learning_rate * b.grad
If we use learning_rate = 0.1, then:
- w_new = 3.0 - 0.1 * (-4) = 3.4
- b_new = 1.0 - 0.1 * (-2) = 1.2
Now the prediction becomes:
pred_new = 2.0 * 3.4 + 1.2 # 8.0
So the loss drops from 1.0 to 0.0.
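The arithmetic above can be replayed in a few lines of plain Python to confirm that one gradient step really does drive this loss to zero:

```python
# One gradient-descent step on the scalar example, with plain floats.
learning_rate = 0.1
w_grad, b_grad = -4.0, -2.0        # the stored gradients from backward

w_new = 3.0 - learning_rate * w_grad   # 3.4
b_new = 1.0 - learning_rate * b_grad   # 1.2

pred_new = 2.0 * w_new + b_new         # back through pred = x * w + b
loss_new = (pred_new - 8.0) ** 2

print(pred_new, loss_new)  # prediction moves to ~8.0, loss drops to ~0.0
```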
This tiny example is the mechanical core of gradient-based learning:
- the loss tells you how wrong the current model is
- the gradient tells you how each trainable parameter should move to reduce that error
- subtracting the gradient moves each parameter in the local direction that reduces the loss
That is why gradients are useful. The next question is how the library computes them in the first place.
4. Values Remember Where They Came From
Let’s revisit the central object Tensor, because every forward pass in the library is built out of tensors.
Here is a leaf tensor that was created directly rather than produced by another operation. That is enough to turn ordinary math into tracked math.
So how are Tensors created?
First, recall that a forward pass can be as simple as an arithmetic operation like a + b. Arithmetic on tensors does not mutate one tensor into another. Each operation produces a new tensor:
x = Tensor(2.0, requires_grad=False)
w = Tensor(3.0, requires_grad=True)
b = Tensor(1.0, requires_grad=True)
y_true = Tensor(8.0, requires_grad=False)
prod = x * w
pred = prod + b
error = pred - y_true
loss = error ** 2
print(prod is w) # False
print(pred is prod) # False
print(error is pred) # False
print(loss is pred) # False
So loss was not created directly with Tensor(...). It was created indirectly by calling tensor operations (e.g. the ** operator is equivalent to __pow__() in python), which each returned a fresh tensor. In the same way, error was produced from pred and y_true, pred was produced from prod and b, and prod was produced from x and w.
In the running training example from Section 1, that gives us a compressed dependency chain:
%%{init: {'flowchart': {'rankSpacing': 16, 'nodeSpacing': 10, 'diagramPadding': 2}}}%%
flowchart LR
x["x"] --> prod["prod =<br/>x * w"]
w["w"] --> prod
prod --> pred["pred =<br/>prod + b"]
b["b"] --> pred
pred --> error["error =<br/>pred - y_true"]
y_true["y_true"] --> error
error --> loss["loss =<br/>error ** 2"]
This dependency chain is the computational graph. The graph is built dynamically from the tensor operations you run in each forward pass (e.g. x * w). That means a tensor remembers where it came from.
From an implementation perspective, we intentionally added a creator attribute in the Tensor class. That way, loss does not need to know every ancestor directly. It only knows its own creator, and that creator stores the input tensors for that one operation.
Inspect creator links and the dependency tree for the same scalar example
Under the hood, the raw creator graph has a few extra helper nodes because subtraction is defined as “add-plus-negate” in __sub__(), and error ** 2 is defined as error * error. The printed tree above collapses unnamed helper nodes, but that literal expansion is still the first gradient accumulation case in this example: backward sends two contributions into error, and those contributions have to be added together.
Now the role of creator becomes more concrete:
- leaf tensors are usually inputs, parameters, or targets
- intermediate tensors are graph nodes produced by operations
- a tensor with no creator is a boundary where the graph stops
Therefore, a tensor is not only a number container here. It is also a node in a computational graph.
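As a toy sketch of that design (a minimal Value class, not the library’s real Tensor), it is enough for each result to store the name of the operation that created it plus that operation’s input values:

```python
# Minimal "values remember where they came from" sketch.
class Value:
    def __init__(self, data, creator=None, inputs=()):
        self.data = data
        self.creator = creator   # name of the op that produced this value
        self.inputs = inputs     # the operand Values of that op

    def __mul__(self, other):
        return Value(self.data * other.data, creator="mul", inputs=(self, other))

    def __add__(self, other):
        return Value(self.data + other.data, creator="add", inputs=(self, other))

x, w, b = Value(2.0), Value(3.0), Value(1.0)
pred = x * w + b                 # builds prod = x * w, then pred = prod + b

print(pred.creator)              # add
print(pred.inputs[0].creator)    # mul: pred only knows its own creator,
                                 # which in turn stores its own inputs
print(x.creator)                 # None: a leaf, the graph stops here
```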
5. What One Operation Does During Backward
Section 4 explained the graph structure. Now ignore the whole graph for a moment and focus on one node.
At one node, the question is simple:
If I already know how the loss changes with this operation’s output, what gradients should this operation send to its inputs?
backward() for one operation only has one job: take an incoming gradient and turn it into gradients for the inputs.
During backward, every operation follows the same pattern:
- receive a gradient at its output
- apply its own local derivative
- return one gradient per input
In shorthand:
gradient to an input = incoming gradient * local derivative
Take the add node pred = prod + b from the running example.
Suppose backward() has already reached this node and told us:

\[\frac{\partial \text{loss}}{\partial \text{pred}} = -2\]

Now we need to figure out the gradients to send to prod and b. For the addition operator, each input affects the output one-for-one, so the local derivatives are:

\[\frac{\partial \text{pred}}{\partial \text{prod}} = 1, \qquad \frac{\partial \text{pred}}{\partial b} = 1\]

So this node sends back:
\[\frac{\partial \text{loss}}{\partial \text{prod}} = \frac{\partial \text{loss}}{\partial \text{pred}} \cdot \frac{\partial \text{pred}}{\partial \text{prod}} = -2 \cdot 1 = -2\]

\[\frac{\partial \text{loss}}{\partial b} = \frac{\partial \text{loss}}{\partial \text{pred}} \cdot \frac{\partial \text{pred}}{\partial b} = -2 \cdot 1 = -2\]

Now look at the multiply node prod = x * w.
If the incoming gradient is:
\[\frac{\partial \text{loss}}{\partial \text{prod}} = -2\]

then the local derivatives are:
\[\frac{\partial \text{prod}}{\partial w} = x, \qquad \frac{\partial \text{prod}}{\partial x} = w\]

With x = 2 and w = 3, the local rule gives:

\[\frac{\partial \text{loss}}{\partial w} = -2 \cdot 2 = -4, \qquad \frac{\partial \text{loss}}{\partial x} = -2 \cdot 3 = -6\]
Conceptually, that multiply node sends one contribution toward w and one toward x. In this example, only w’s gradient is stored because x.requires_grad=False.
Let’s think through what each backward operation actually needs. Add.backward(...) only needs the incoming gradient and the two inputs it added. Mul.backward(...) only needs the incoming gradient and the two inputs it multiplied. No single operation needs to understand the whole graph.
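As a sketch of that idea (standalone functions with hypothetical names, not the library’s actual Add/Mul classes), the two local rules used in the running example fit in a few lines:

```python
# Each backward rule: incoming gradient in, one gradient per input out.
def add_backward(grad_out, a, b):
    # d(a + b)/da = 1 and d(a + b)/db = 1: pass the gradient through
    return grad_out * 1.0, grad_out * 1.0

def mul_backward(grad_out, a, b):
    # d(a * b)/da = b and d(a * b)/db = a: scale by the *other* input
    return grad_out * b, grad_out * a

# pred = prod + b, with incoming gradient d loss / d pred = -2
print(add_backward(-2.0, 6.0, 1.0))   # (-2.0, -2.0)

# prod = x * w, with x = 2, w = 3 and incoming gradient -2
print(mul_backward(-2.0, 2.0, 3.0))   # (-6.0, -4.0): -6 toward x, -4 toward w
```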
Once each node can do that locally, the only remaining question is how the library walks the whole graph in the right order.
6. How loss.backward() Walks the Whole Graph
Once each node knows its local rule, loss.backward() is mostly graph traversal.
In the running example, once loss has been computed, the library only has to answer three practical questions:
- where to start
- which earlier tensors to visit next, since we’re going backwards
- how to combine gradients when more than one path reaches the same tensor
From an implementation perspective, that makes loss.backward() roughly:
- seed the loss with gradient 1
- collect the reachable graph in dependency order (backward)
- at each tensor, call its creator’s backward rule
- accumulate the returned gradients into the input tensors
For the running example, the traversal below is a compressed view of the tensors that require gradients. In the literal graph, loss = error ** 2 expands to error * error, so error is the first place where two gradient contributions meet and must be accumulated.
%%{init: {'flowchart': {'rankSpacing': 18, 'nodeSpacing': 18, 'diagramPadding': 2}}}%%
flowchart TD
loss --> error --> pred
pred --> prod --> w
pred --> b
Because x and y_true were created with requires_grad=False, the walk stops before them.
So the global backward pass is just many local backward steps stitched together.
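The whole walk can be sketched with a toy Node class (not the library’s real code). To mirror the literal graph described above, subtraction is folded in as adding a pre-negated constant, and the square is written as error * error, so error really does receive two accumulated contributions:

```python
# Toy reverse-mode walk: seed the loss, visit nodes in reverse dependency
# order, apply each op's local rule, accumulate gradients on the inputs.
class Node:
    def __init__(self, data, op=None, inputs=()):
        self.data, self.op, self.inputs = data, op, inputs
        self.grad = 0.0

def topo_order(node, seen=None, order=None):
    # depth-first walk that lists every node after all of its inputs
    seen = set() if seen is None else seen
    order = [] if order is None else order
    if id(node) not in seen:
        seen.add(id(node))
        for inp in node.inputs:
            topo_order(inp, seen, order)
        order.append(node)
    return order

def backward(loss):
    loss.grad = 1.0                       # d loss / d loss = 1
    for node in reversed(topo_order(loss)):
        if node.op == "add":              # local rule: pass gradient through
            for inp in node.inputs:
                inp.grad += node.grad
        elif node.op == "mul":            # local rule: scale by the other input
            a, b = node.inputs
            a.grad += node.grad * b.data
            b.grad += node.grad * a.data

# loss = (x * w + b - y_true) ** 2, built node by node
x, w, b, neg_y = Node(2.0), Node(3.0), Node(1.0), Node(-8.0)
prod = Node(x.data * w.data, "mul", (x, w))
pred = Node(prod.data + b.data, "add", (prod, b))
error = Node(pred.data + neg_y.data, "add", (pred, neg_y))
loss = Node(error.data * error.data, "mul", (error, error))

backward(loss)
print(w.grad, b.grad)  # -4.0 -2.0, matching the running example
```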
Trace backward visit order and stored gradients for the same scalar example
You can inspect the graph links directly:
prod = x * w
pred = prod + b
print(prod.creator.__class__.__name__) # Mul
print(pred.creator.__class__.__name__) # Add
So a tensor here is not just data. It is also a pointer to the operation that produced it, and that is enough for the library to walk backward from the final loss to the trainable leaves.
7. So We Have All the Gradients in the Graph. What’s Next?
Now that we have seen how the graph is built and walked backward, we can return to the training loop and separate three jobs that are easy to blur together in practice:
- loss.backward() computes gradients
- a concrete optimizer.step() like SGD.step(...) applies an update rule using those gradients
- optimizer.zero_grad() clears old gradients because gradients accumulate by default
loss.backward() does not change parameters. It walks the graph and writes gradients into the .grad fields of upstream tensors that have requires_grad=True.
The actual parameter update happens later in autograd/optim.py. An optimizer reads each parameter’s stored gradient and mutates the parameter data in place. Fancy optimizers like Adam.step(...) are just different ways to update the parameters based on the gradients and other metadata. For plain SGD.step(...), the idea is:
param.data -= learning_rate * param.grad.data
What matters here is the lifecycle of those stored gradients. They live on the parameters until something explicitly uses or clears them, so a new forward/backward pass will keep adding into the same .grad fields unless you reset them:
Gradient accumulation until zero_grad()
Only after gradients have been written into .grad does the optimizer consume them:
optimizer.step() # reads the stored gradients and updates parameter data
So loss.backward() computes gradients, optimizer.step() mutates parameters using the stored gradients, and optimizer.zero_grad() resets the gradient state so the next backward pass starts fresh.
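The optimizer side of that division of labor can be sketched with a hypothetical Param class standing in for a trainable tensor (not the library’s real optim.py):

```python
# Minimal SGD sketch: step() consumes stored gradients, zero_grad() resets.
class Param:
    def __init__(self, data):
        self.data = data
        self.grad = None        # filled in by a backward pass

class SGD:
    def __init__(self, params, learning_rate):
        self.params = list(params)
        self.learning_rate = learning_rate

    def step(self):
        for p in self.params:
            if p.grad is None:  # frozen / untouched parameters are skipped
                continue
            p.data -= self.learning_rate * p.grad

    def zero_grad(self):
        for p in self.params:
            p.grad = None

w, b = Param(3.0), Param(1.0)
w.grad, b.grad = -4.0, -2.0     # the gradients from the running example
opt = SGD([w, b], learning_rate=0.1)

opt.step()
print(w.data, b.data)           # w -> 3.4, b -> 1.2 (up to float rounding)
opt.zero_grad()                 # next backward pass starts fresh
```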
8. When Tensors Become Models
So far we have been discussing tensors one operation at a time. Real models use the same mechanics, just across many more parameters. How do you package those tensors into reusable layers and models to construct an actual deep learning model? One way to think about this is that “layers” and “models” organize parameters and computations into units that are easier to understand and reuse.
That answer lives in autograd/nn.py.
We have a base Module class that:
- collects trainable Tensor objects as parameters
- collects child Module objects as submodules
- keeps non-trainable arrays as states
- exposes a flattened parameters view for optimizers
A freshly constructed nn.Linear(3, 1) registers two named parameters, bias and weight.
A small but important piece of framework magic lives in Module.__setattr__. Assigning a child Module registers it as a submodule automatically. Built-in layers like Linear currently register weight and bias by writing directly into self._parameters[...] inside Linear.__init__(...), while Module.__getattr__(...) only resolves submodules and states.
Later on, the optimizer can directly access the flattened parameters, and perform parameter updates (a.k.a. learning).
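A minimal sketch of that registration trick (hypothetical toy classes, not the library’s real nn.py) shows how assignment alone can build the module tree and a flattened parameter list:

```python
# Toy Module: __setattr__ auto-registers child modules; parameters()
# flattens own parameters plus every child's, recursively.
class Module:
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        if isinstance(value, Module):
            self._modules[name] = value      # auto-register submodules
        else:
            object.__setattr__(self, name, value)

    def parameters(self):
        params = list(self._parameters.values())
        for child in self._modules.values():
            params.extend(child.parameters())
        return params

class Linear(Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        # placeholder arrays standing in for real weight/bias tensors
        self._parameters["weight"] = [[0.0] * n_out for _ in range(n_in)]
        self._parameters["bias"] = [0.0] * n_out

class TinyMLP(Module):
    def __init__(self):
        super().__init__()
        self.layer1 = Linear(3, 4)   # registered via __setattr__
        self.layer2 = Linear(4, 1)

model = TinyMLP()
print(len(model.parameters()))  # 4: two weights and two biases
```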
You can represent a neural network with tensors alone. In the limit, Module is not mathematically necessary. You could manually keep a dictionary of weights, biases, and states, then write the forward pass directly with tensor operations.
For example, this is already enough to describe a one-layer model:
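As a sketch in the same style as the two-layer listing later in this section (same Tensor and xp imports; treat it as written against this library’s API rather than standalone runnable code), a one-layer model needs only one weight matrix and one bias:

```python
from autograd.backend import xp
from autograd.tensor import Tensor

# same input and target setup as the two-layer example
x = Tensor([[-1.0, 0.0, 2.0]], requires_grad=False)
y_true = Tensor([[1.0]], requires_grad=False)

# a single affine layer
w = Tensor(xp.random.normal(size=(3, 1)), requires_grad=True)
b = Tensor(xp.zeros(1), requires_grad=True)

pred = x @ w + b
loss = ((pred - y_true) ** 2).mean()
loss.backward()   # fills w.grad and b.grad
```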
Nothing is missing mathematically. The gradients still work. What is missing is mostly human convenience: naming, grouping, recursive composition, a clean parameter list, and a shared way to talk about larger pieces of a model. That gap becomes apparent when we scale up. More concretely, here is a two-layer model written directly with tensors:
from autograd.backend import xp
from autograd.tensor import Tensor
# same input and target setup
x = Tensor([[-1.0, 0.0, 2.0]], requires_grad=False) # set True only if you want input gradients
y_true = Tensor([[1.0]], requires_grad=False)
# layer 1 parameters
w1 = Tensor(xp.random.normal(size=(3, 4)), requires_grad=True)
b1 = Tensor(xp.zeros(4), requires_grad=True)
# layer 2 parameters
w2 = Tensor(xp.random.normal(size=(4, 1)), requires_grad=True)
b2 = Tensor(xp.zeros(1), requires_grad=True)
# first affine transform, then the same nonlinearity as functional.relu(...): max(0, z)
hidden = (x @ w1 + b1).maximum(0)
# second affine transform
pred = hidden @ w2 + b2
# the same mean-squared loss as before
loss = ((pred - y_true) ** 2).mean()
loss.backward()
This works fine, but it is already getting harder to keep track of by hand.
Now here is the same computation packaged with Module:
from autograd import functional, nn
from autograd.tensor import Tensor
# same input and target setup
x = Tensor([[-1.0, 0.0, 2.0]], requires_grad=False) # set True only if you want input gradients
y_true = Tensor([[1.0]], requires_grad=False)
class TinyMLP(nn.Module):
def __init__(self):
super().__init__()
# equivalent to w1 and b1
self.layer1 = nn.Linear(3, 4)
# equivalent to w2 and b2
self.layer2 = nn.Linear(4, 1)
def forward(self, x):
# same as: hidden = (x @ w1 + b1).maximum(0)
hidden = functional.relu(self.layer1(x))
# same as: pred = hidden @ w2 + b2
return self.layer2(hidden)
model = TinyMLP()
pred = model(x)
# same mean-squared loss as the tensor-only version above
loss = ((pred - y_true) ** 2).mean()
loss.backward()
That is the real motivation for Module: it does not introduce new math. It gives you a cleaner way to organize the same tensor math once the tensor-only version starts getting noisy. In practice, it mostly solves an organization problem: small modules own tensors, larger modules own those modules, and the gradient logic still lives below that layer in tensor operations.
A Linear layer, for example, is just stored parameters plus a forward(…) written in tensor math. Since those tensor operations already know how to compute gradients, a module usually does not need its own backward(). Modules are built out of primitive tensor operations like addition, multiplication, matrix multiplication, reductions like sum or mean, and a few element-wise functions.
That is the core trick. A deep learning library does not need one giant “learning” primitive. It needs local derivative rules at each operation, a way to traverse the graph in reverse, and gradient accumulation when multiple paths meet. loss.backward() handles that global reverse walk; the optimizer turns the accumulated gradients into parameter updates. Once that mechanism is clear, most of the rest of the framework is structure and convenience.
Appendix: Where To Go Next
The main argument ends above. What follows is a short resource section if you want to keep tracing these ideas through shapes, broadcasting, or want to explore the library in detail.
Shapes And Broadcasting
Shapes are the next thing to learn because machine learning almost always involves batch dimensions and higher-dimensional tensors/features.
A shape tells you how values are arranged. (2, 3) means two rows and three columns. (3,) means a length-3 vector.
If we represent it in code:
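Here is that setup in miniature, with plain Python lists standing in for tensors (the real library does this with array operations, but the shapes and the backward reduction are the same):

```python
# Broadcasting a (3,) bias across a (2, 3) input, then reducing the
# gradient back to the bias's original shape.
x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]            # shape (2, 3)
bias = [0.1, 0.2, 0.3]           # shape (3,)

# forward: the one bias vector is reused for every row
out = [[xi + bi for xi, bi in zip(row, bias)] for row in x]

# loss = sum of all elements of out, so each element's gradient is 1.0
grad_out = [[1.0] * 3 for _ in range(2)]

# backward for bias: sum the broadcasted gradient over the row axis,
# so the result has shape (3,) again
grad_bias = [sum(col) for col in zip(*grad_out)]
print(grad_bias)  # [2.0, 2.0, 2.0]
```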
Here, bias starts as a vector of shape (3,), while x is a matrix of shape (2, 3). In the forward pass, broadcasting lets that one bias vector act as if it were reused across both rows.
The important part is what happens in backward. The parameter that actually exists is still bias, and its shape is still (3,). So the library must return a gradient for that original vector’s shape (3,), not for the larger (2, 3) broadcasted view used during forward. In this library, that “reduce back to the original shape” step is handled by Function.unbroadcast(...).
Because loss = out.sum(), each element of out contributes a gradient of 1.0. Each bias value was used once in the first row and once in the second row, so backward adds those contributions together:
[1.0 + 1.0, 1.0 + 1.0, 1.0 + 1.0] = [2.0, 2.0, 2.0]
A useful rule to keep in mind is that if forward expanded a tensor, backward has to reduce the gradient. That is why shapes matter so much in real model code. They are part of what makes a backward pass correct, not just bookkeeping.
Diving Further into the Code
You don’t need to read the whole library from top to bottom. A short reading path is:
tensor -> nn / functional -> optim -> trainer / examples
or
trainer / examples -> particular tensor operations / optim / nn / functional
If you want to trace these ideas back into the code, these are good places to start:
- For #4: Values Remember Where They Came From, #5: What One Operation Does During Backward, and #6: How loss.backward() Walks the Whole Graph, start in autograd/tensor.py with Function.apply(...) and Tensor.backward(...). That is where each operation becomes a node in the graph.
- For #8: When Tensors Become Models, if Module started to feel more concrete, go to autograd/nn.py and read Linear.forward(...). It is one of the clearest examples of a layer being just tensor math plus stored parameters.
- For #5: What One Operation Does During Backward, start in autograd/functional.py with relu(...), then move to cross_entropy(...). That gives you a path from a simple pointwise operation to a more realistic loss function.
- For #7: So We Have All the Gradients in the Graph. What’s Next?, start in autograd/optim.py with Optimizer.zero_grad(...) and SGD.step(...). That is the smallest example of taking stored gradients and using them to mutate parameters.
- For #1: The Familiar Training Step, start in autograd/tools/trainer.py with SimpleTrainer, then compare it with examples/mnist.py. A good first model is MnistMultiClassClassifier, because it shows the familiar Linear -> functional.relu(...) -> Linear -> functional.relu(...) -> Linear stack without too much extra framework noise.