
  ZHU HAN WEN
  -----------

What Is Deep Learning?

📅 2026-02-10 | ⏱️ 4 min read

📖 Part of Series: intro-to-deep-learning

1. What Is Deep Learning? (Current)
#computer-science

Context

I've been working through Andrej Karpathy's micrograd, a tiny autograd engine in ~100 lines of Python. It's one of the cleanest explanations of backpropagation I've found: no framework abstractions, just the raw math wired up in code.

To make sure I actually understand it (and not just pattern-matching off Karpathy's code), I'm reimplementing everything in Midori. If I can port it without looking at the Python, I probably get it. These posts are my notes from that process.


The Core Insight

Deep learning, at its core, decomposes into two ideas:

  1. Function composition – chain simple ops (add, multiply, tanh) into a computation graph
  2. The chain rule – use calculus to work out how a small change in each input moves the output, so we know which way to tweak the parameters

The engine that automates step 2 is called autograd (automatic differentiation). That's what micrograd implements, and what I'm rebuilding here.

Computation Graph

Any expression can be drawn as a DAG (directed acyclic graph). For example, $e = \tanh(a \cdot b + c)$:

a ----\
       (*) ----\
b ----/        (+) ---> tanh ---> e
              /
c -----------/

Each node stores three things:

  • data β€” the computed value (forward pass)
  • grad β€” βˆ‚eβˆ‚thisΒ node\frac{\partial e}{\partial \text{this node}} (filled during backward pass)
  • op β€” how this node was produced, so we can derive the local gradient

This is the standard tape-based autodiff representation. micrograd uses a Value class with _backward closures; I'll use a Graph array with Op tags instead (no closures needed).
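To make the array-based representation concrete before the Midori version exists, here's a rough Python sketch (names like Op and Node are my own placeholders, not micrograd's API): each node carries data, grad, an op tag, and the indices of its inputs in a flat graph array.

```python
from dataclasses import dataclass
from enum import Enum
import math

class Op(Enum):
    LEAF = 0
    ADD = 1
    MUL = 2
    TANH = 3

@dataclass
class Node:
    data: float          # forward value
    op: Op = Op.LEAF     # how this node was produced
    inputs: tuple = ()   # indices of parent nodes in the graph array
    grad: float = 0.0    # d(output)/d(this node), filled by the backward pass

# Build the graph for e = tanh(a*b + c) as a flat array.
graph = []
def push(node):
    graph.append(node)
    return len(graph) - 1

a  = push(Node(2.0))
b  = push(Node(-1.0))
c  = push(Node(0.5))
ab = push(Node(graph[a].data * graph[b].data, Op.MUL, (a, b)))
s  = push(Node(graph[ab].data + graph[c].data, Op.ADD, (ab, c)))
e  = push(Node(math.tanh(graph[s].data), Op.TANH, (s,)))

print(graph[e].data)  # tanh(-1.5) ≈ -0.905
```

Because children always come after their parents in the array, a plain left-to-right walk is already a valid forward order, and a right-to-left walk will serve as the backward order later.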

Chain Rule Refresher

Given composed functions $y = f(x)$, $z = g(y)$, $L = h(z)$:

$\frac{dL}{dx} = \frac{dL}{dz} \cdot \frac{dz}{dy} \cdot \frac{dy}{dx}$

Concrete example: $f(x) = (2x + 3)^2$. Decompose as $u = 2x + 3$, $f = u^2$:

$\frac{df}{dx} = \frac{df}{du} \cdot \frac{du}{dx} = 2u \cdot 2 = 4u$

At $x = 1$: $u = 5$, so $\frac{df}{dx} = 20$.
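A quick numerical check of that result, using a central finite difference (a standalone sanity-check snippet, not part of the engine):

```python
# Verify the chain-rule result df/dx = 4u at x = 1 numerically.
def f(x):
    return (2*x + 3)**2

x = 1.0
h = 1e-6
numeric  = (f(x + h) - f(x - h)) / (2*h)  # central difference
analytic = 4 * (2*x + 3)                  # 4u from the derivation above
print(numeric, analytic)  # both ≈ 20
```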

Nothing new if you've taken multivariable calculus, but the key realization is that autograd does exactly this, automatically, for arbitrarily complex graphs. It records $u$ and $f$ as nodes, stores how they were produced, and walks the graph backward, applying the chain rule at each step.

Forward Mode vs Reverse Mode

There are two ways to propagate derivatives through a computation graph:

Forward mode starts at an input and pushes derivatives forward through each operation. Given $a \to b \to c \to L$, it computes $\frac{\partial b}{\partial a}$, then $\frac{\partial c}{\partial a}$, then $\frac{\partial L}{\partial a}$. One forward pass gives you the gradient w.r.t. one input. If you have $n$ parameters, you need $n$ passes.

Reverse mode starts at the output and pulls derivatives backward. It computes $\frac{\partial L}{\partial c}$, then $\frac{\partial L}{\partial b}$, then $\frac{\partial L}{\partial a}$. One backward pass gives you the gradient w.r.t. every input.

In training, we want $\nabla_\theta L$ where $\theta \in \mathbb{R}^n$ – one scalar loss, many parameters. Reverse mode gets all $n$ gradients in a single pass; forward mode would need $n$ passes. This is why backpropagation is reverse-mode autodiff.
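To see the single-pass property concretely, here's a hand-written reverse-mode sweep (a sketch with made-up input values, not the engine) over the earlier example $e = \tanh(a \cdot b + c)$. One backward walk from the output produces all three input gradients:

```python
import math

# Forward pass.
a, b, c = 2.0, -1.0, 0.5
s = a*b + c
e = math.tanh(s)

# Reverse pass: start with de/de = 1 and pull derivatives backward.
de_de = 1.0
de_ds = de_de * (1 - e**2)  # d tanh(s)/ds = 1 - tanh(s)^2
de_da = de_ds * b           # d(a*b)/da = b
de_db = de_ds * a           # d(a*b)/db = a
de_dc = de_ds * 1.0         # d(a*b + c)/dc = 1

print(de_da, de_db, de_dc)  # ≈ -0.1807  0.3614  0.1807
```

Forward mode would have had to rerun this whole computation once per input to get the same three numbers.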

Sanity Check

$f = (2x + 3)^2$ and its derivative, computed manually:


Set x = 2.0 → expect u = 7, f = 49, df/dx = 28.

This explicit calculation is trivial, but the goal of our autograd engine is to let us write only the forward pass (f = (2.0*x + 3.0)^2) and have it derive and evaluate df/dx = 28 automatically, without our ever hardcoding the chain rule.
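For reference, here's the manual version the playground mirrors – forward pass plus a hand-applied chain rule at x = 2.0. These are exactly the lines the engine should make unnecessary:

```python
x = 2.0
u = 2.0*x + 3.0      # u = 7
f = u**2             # f = 49
df_du = 2*u          # outer derivative: d(u^2)/du
df_dx = df_du * 2.0  # times du/dx = 2
print(u, f, df_dx)   # 7.0 49.0 28.0
```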

What's Coming

These are the building blocks. In the next parts, I'll define the Value/Op/Graph data structures, implement topological sort + backward pass, and finally train a neuron with SGD (stochastic gradient descent) – all without importing any library beyond basic I/O and math.

