What Is Deep Learning?
Part of Series: intro-to-deep-learning
Context
I've been working through Andrej Karpathy's micrograd – a tiny autograd engine in ~100 lines of Python. It's one of the cleanest explanations of backpropagation I've found: no framework abstractions, just the raw math wired up in code.
To make sure I actually understand it (and not just pattern-matching off Karpathy's code), I'm reimplementing everything in Midori. If I can port it without looking at the Python, I probably get it. These posts are my notes from that process.
The Core Insight
Deep learning, at its core, decomposes into two ideas:
- Function composition – chain simple ops (add, multiply, tanh) into a computation graph
- The chain rule – use calculus to figure out how to tweak inputs to improve the output
The engine that automates step 2 is called autograd (automatic differentiation). That's what micrograd implements, and what I'm rebuilding here.
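The two ideas can be shown in a minimal numeric sketch. This is not how micrograd works internally (it propagates exact derivatives; here I approximate one with a finite difference), and the names `forward` and `numeric_grad` are mine, but it captures the loop: compose simple ops, estimate how the output responds to an input, nudge the input accordingly.

```python
import math

# Idea 1: function composition - build the output from simple ops.
def forward(a, b, c):
    return math.tanh(a * b + c)

# Idea 2 (approximated numerically): how does the output change
# if we nudge one input a tiny amount?
def numeric_grad(f, args, i, eps=1e-6):
    bumped = list(args)
    bumped[i] += eps
    return (f(*bumped) - f(*args)) / eps

a, b, c = 1.0, 2.0, 0.5
g = numeric_grad(forward, (a, b, c), 0)  # d(output)/da, approximately
a_better = a + 0.01 * g                  # nudge a in the uphill direction
print(forward(a_better, b, c) > forward(a, b, c))  # True: output improved
```

Autograd replaces the finite-difference approximation with exact derivatives computed from the graph structure itself.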
Computation Graph
Any expression can be drawn as a DAG (directed acyclic graph). For example, `e = tanh(a*b + c)`:

```
a ---\
      (*) ---\
b ---/        (+) ---> tanh ---> e
             /
c ----------/
```
Each node stores three things:
- data – the computed value (forward pass)
- grad – the derivative of the final output with respect to this node (filled during backward pass)
- op – how this node was produced, so we can derive the local gradient
This is the standard tape-based autodiff representation. micrograd uses a Value class with _backward closures; I'll use a Graph array with Op tags instead (no closures needed).
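To make the Graph-array-with-Op-tags idea concrete, here's a forward-only sketch in Python (the Midori version will differ; `Op`, `Node`, `leaf`, `add`, and `mul` are illustrative names, and the backward pass is deliberately omitted):

```python
from dataclasses import dataclass
from enum import Enum, auto

class Op(Enum):  # how a node was produced
    LEAF = auto()
    ADD = auto()
    MUL = auto()

@dataclass
class Node:
    op: Op
    inputs: tuple        # indices of parent nodes in the graph array
    data: float = 0.0    # forward value
    grad: float = 0.0    # filled during the backward pass

# The graph is an append-only array; each op returns the index of the
# node it created, so the tape records evaluation order for free.
graph: list[Node] = []

def leaf(x):
    graph.append(Node(Op.LEAF, (), data=x))
    return len(graph) - 1

def add(i, j):
    graph.append(Node(Op.ADD, (i, j), data=graph[i].data + graph[j].data))
    return len(graph) - 1

def mul(i, j):
    graph.append(Node(Op.MUL, (i, j), data=graph[i].data * graph[j].data))
    return len(graph) - 1

a, b, c = leaf(1.0), leaf(2.0), leaf(0.5)
e = add(mul(a, b), c)
print(graph[e].data)  # 2.5
```

Because the array is built in evaluation order, a later backward pass can simply walk it in reverse – no closures and no explicit topological sort over pointers.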
Chain Rule Refresher
Given composed functions $f(u)$ and $u = g(x)$, so the whole thing is $f(g(x))$:

$$\frac{df}{dx} = \frac{df}{du} \cdot \frac{du}{dx}$$

Concrete example: $f(x) = (2x + 3)^2$. Decompose as $u = 2x + 3$, $f = u^2$:

$$\frac{df}{dx} = \frac{df}{du} \cdot \frac{du}{dx} = 2u \cdot 2 = 4u$$

At $x = 2$: $u = 7$, so $df/dx = 4 \cdot 7 = 28$.
Nothing new if you've taken multivariable calculus, but the key realization is: autograd does exactly this, automatically, for arbitrarily complex graphs. It records $u$ and $f$ as nodes, stores how they were produced, and walks the graph backward applying the chain rule at each step.
Forward Mode vs Reverse Mode
There are two ways to propagate derivatives through a computation graph:
Forward mode starts at an input and pushes derivatives forward through each operation. Given $f(g(h(x)))$, it computes $\frac{dh}{dx}$, then $\frac{dg}{dx}$, then $\frac{df}{dx}$. One forward pass gives you the gradient w.r.t. one input. If you have $n$ parameters, you need $n$ passes.

Reverse mode starts at the output and pulls derivatives backward. It computes $\frac{df}{dg}$, then $\frac{df}{dh}$, then $\frac{df}{dx}$. One backward pass gives you the gradient w.r.t. every input.

In training, we want $\frac{\partial L}{\partial \theta_i}$ for every parameter $\theta_i$ – one scalar loss, many parameters. Reverse mode gets all gradients in a single pass. Forward mode would need $n$ passes. This is why backpropagation is reverse-mode autodiff.
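Forward mode is worth seeing once in code. The standard trick is dual numbers: each value carries its derivative w.r.t. one seeded input through every op. This is not part of micrograd (which is reverse-mode); `Dual` and `f` are illustrative names, with `f` being the same $(2x+3)^2$ example from the chain rule refresher:

```python
from dataclasses import dataclass

# A dual number: a value paired with its derivative w.r.t. ONE chosen
# input, pushed forward through each operation.
@dataclass
class Dual:
    val: float
    dot: float  # derivative of this value w.r.t. the seeded input

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # product rule
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def f(x):
    # f(x) = (2x + 3)^2, built from adds and muls
    two, three = Dual(2.0, 0.0), Dual(3.0, 0.0)  # constants: dot = 0
    u = two * x + three
    return u * u

# Seed dot = 1.0 for the input we're differentiating w.r.t.
out = f(Dual(2.0, 1.0))
print(out.val, out.dot)  # 49.0 28.0
```

Note the cost model: to also get the derivative w.r.t. a second input, you'd have to re-run `f` with a different seed – one pass per input, exactly the $n$-pass problem reverse mode avoids.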
Sanity Check
Take $f(x) = (2x + 3)^2$ and its derivative, computed manually: $u = 2x + 3$, $f = u^2$, $df/dx = 4u$.

Set $x = 2.0$ – expect $u = 7$, $f = 49$, $df/dx = 28$.
This explicit calculation is trivial, but the goal of our autograd engine is to let us write only the forward pass (`f = (2.0*x + 3.0)^2`) and have it automatically derive and evaluate `df/dx = 28` for us, without ever hardcoding the chain rule ourselves.
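The sanity check itself is easy to script. Here's the hand-derived version in Python, with a finite-difference cross-check (`f` and `df_dx` are just throwaway names for this check):

```python
# Hand-coded forward pass for f(x) = (2x + 3)^2.
def f(x):
    u = 2.0 * x + 3.0
    return u * u

# Hand-derived gradient: df/du = 2u, du/dx = 2, so df/dx = 4u.
def df_dx(x):
    u = 2.0 * x + 3.0
    return 4.0 * u

x = 2.0
print(f(x), df_dx(x))  # 49.0 28.0

# Finite-difference cross-check of the hand-derived gradient.
eps = 1e-6
approx = (f(x + eps) - f(x)) / eps
print(abs(approx - df_dx(x)) < 1e-3)  # True
```

Once the autograd engine exists, `df_dx` disappears: the engine derives it from the recorded ops, and this script becomes a regression test.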
What's Coming
These are the building blocks. In the next parts, I'll define the Value/Op/Graph data structures, implement topological sort + backward pass, and finally train a neuron with SGD (stochastic gradient descent) – all without importing any library beyond basic I/O and math.
References
- Andrej Karpathy, micrograd – the reference implementation this series is based on
- Andrej Karpathy, Neural Networks: Zero to Hero – lecture series