Learning Machine Learning by Building It From Scratch in Rust
For a while, I had been using machine learning without really understanding it. Libraries made it easy to train models and get results, but the mechanics were hidden behind layers of abstraction.
So instead of starting with a framework, I went all the way back to basics — paper and math — and rebuilt the foundations myself.
Stripping ML Down to Its Core
Strip away the frameworks, and every supervised learning setup has the same shape. You define:
- a model (some function with parameters),
- a loss function (how wrong the model is),
- and a rule for updating parameters to reduce that loss.
Everything else builds on top of this.
To make that concrete, I started with the simplest possible model.
Linear Regression, Actually Understood
Imagine you have some data points on a graph — maybe house sizes and their prices. You want to find a straight line that best predicts the price given the size.
The Model: A Straight Line
Any straight line can be written as:

ŷ = w·x + b

In plain English:
- x is your input (e.g., house size)
- ŷ (read “y-hat”) is your prediction (e.g., predicted price)
- w is the weight — it controls how steep the line is (how much ŷ changes when x changes)
- b is the bias — it shifts the line up or down (the value of ŷ when x is zero)
The whole game is figuring out what values of w and b make the line fit the data best.
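In code, the model really is just one multiply and one add. A minimal Rust sketch (the function name `predict` and the example values are mine, not from the repo):

```rust
/// A straight-line model: ŷ = w·x + b.
fn predict(w: f64, b: f64, x: f64) -> f64 {
    w * x + b
}

fn main() {
    // With w = 2.0 and b = 1.0, an input of 3.0 predicts 7.0.
    let y_hat = predict(2.0, 1.0, 3.0);
    println!("{y_hat}"); // 7
}
```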
Measuring How Wrong We Are
Given a real data point (x, y), where y is the actual correct answer, the error is just:

error = y − ŷ
This tells us how far off our prediction was. Positive means we guessed too low, negative means too high.
But we don’t just want to know the error — we want a single number that tells us how bad the model is overall. That’s the loss function. We use the squared error:

L = (y − ŷ)²
Why squared? Two reasons:
- It makes all errors positive (a prediction that’s 5 too high is just as bad as 5 too low)
- It punishes big errors more than small ones (being off by 10 is worse than being off by 2, and squaring makes that difference dramatic)
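Both properties are easy to see numerically. A sketch (the function name and example values are mine):

```rust
/// Squared error: always non-negative, and it grows quadratically
/// with the size of the miss.
fn squared_error(y_hat: f64, y: f64) -> f64 {
    let error = y_hat - y;
    error * error
}

fn main() {
    // Off by 2 costs 4; off by 10 costs 100: the big miss dominates.
    println!("{}", squared_error(12.0, 10.0)); // 4
    println!("{}", squared_error(20.0, 10.0)); // 100
    // Symmetric: 5 too high and 5 too low cost the same.
    println!("{}", squared_error(15.0, 10.0)); // 25
    println!("{}", squared_error(5.0, 10.0)); // 25
}
```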
Finding the Right Direction to Improve
Here’s the key insight: if we could figure out which direction to nudge w and b to make the loss smaller, we could just keep nudging until the line fits perfectly.
That’s what gradients tell us. A gradient is just the answer to: “If I increase this parameter a tiny bit, does the loss go up or down, and by how much?”
The math gives us:

∂L/∂w = 2(ŷ − y)·x
∂L/∂b = 2(ŷ − y)
Here’s what the symbols mean:
- ∂L/∂w is “how much does the loss change when we change w”
- If this number is positive, increasing w makes the loss worse → we should decrease w
- If this number is negative, increasing w makes the loss better → we should increase w
Same logic applies to b.
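The two formulas translate directly into code. A sketch (names and example values are mine):

```rust
/// Gradients of L = (ŷ − y)² for the model ŷ = w·x + b:
/// ∂L/∂w = 2(ŷ − y)·x and ∂L/∂b = 2(ŷ − y).
fn gradients(w: f64, b: f64, x: f64, y: f64) -> (f64, f64) {
    let y_hat = w * x + b;
    let d = 2.0 * (y_hat - y); // shared factor in both gradients
    (d * x, d)
}

fn main() {
    // Prediction too low (ŷ = 0 while y = 2): both gradients come out
    // negative, so the update rule will push w and b upward.
    let (dw, db) = gradients(0.0, 0.0, 1.0, 2.0);
    println!("dw = {dw}, db = {db}"); // dw = -4, db = -4
}
```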
The Update Rule: Learning Step by Step
Now we can actually improve the model. Each step, we nudge the parameters in the direction that reduces the loss:

w ← w − α·∂L/∂w
b ← b − α·∂L/∂b
The ← just means “update the value.” And α (alpha) is the learning rate — a small number (like 0.01) that controls how big each step is.
Why subtract? Because if the gradient is positive (loss goes up when w increases), we want to go the opposite direction and decrease w. The subtraction handles that automatically.
That’s it. Repeat this process thousands of times, and the line gradually converges to the best fit.
I worked through this by hand for a single data point. After one update step, the model landed exactly on the target.
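The post doesn't give the numbers, but here is one assumed setup where that happens: x = 1, y = 2, starting from w = b = 0 with α = 0.25. The step changes the prediction by −α·d·(x² + 1) = ŷ − y, so one update is exactly enough:

```rust
/// One gradient step on a single assumed data point (x = 1, y = 2),
/// starting from w = 0, b = 0 with learning rate α = 0.25.
/// Illustrative numbers only; the original walkthrough's values aren't given.
fn one_step() -> f64 {
    let (x, y) = (1.0_f64, 2.0_f64);
    let (mut w, mut b) = (0.0_f64, 0.0_f64);
    let alpha = 0.25;

    let y_hat = w * x + b;     // prediction before the step: 0.0
    let d = 2.0 * (y_hat - y); // gradient factor: -4.0
    w -= alpha * d * x;        // 0 - 0.25·(-4)·1 = 1.0
    b -= alpha * d;            // 1.0

    w * x + b                  // new prediction: 1·1 + 1 = 2.0, exactly y
}

fn main() {
    println!("prediction after one step: {}", one_step()); // 2
}
```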
Turning the Math Into Rust
Once the math was clear, writing the code was almost boring — in a good way.
The Rust version was just:
- compute the prediction,
- compute the gradients,
- update w and b,
- repeat.
No libraries. No autodiff. No magic.
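Put together, the loop has roughly this shape. A sketch, not the repository's actual code; the data, names, and hyperparameters here are assumptions:

```rust
/// Plain gradient descent on ŷ = w·x + b with squared-error loss.
/// No libraries, no autodiff: the gradients are written out by hand.
fn train(data: &[(f64, f64)], alpha: f64, epochs: usize) -> (f64, f64) {
    let (mut w, mut b) = (0.0_f64, 0.0_f64);
    for _ in 0..epochs {
        for &(x, y) in data {
            let y_hat = w * x + b;     // compute the prediction
            let d = 2.0 * (y_hat - y); // shared gradient factor
            w -= alpha * d * x;        // update w
            b -= alpha * d;            // update b
        }
    }
    (w, b)
}

fn main() {
    // Three points lying exactly on y = 2x + 1.
    let data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)];
    let (w, b) = train(&data, 0.05, 5000);
    println!("w = {w:.3}, b = {b:.3}"); // converges to w ≈ 2, b ≈ 1
}
```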
What surprised me was how fragile learning can be. A learning rate that worked fine for small values caused the model to completely explode when inputs were larger. Seeing numbers shoot to infinity made it obvious why normalization and careful step sizes matter.
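The failure mode is easy to reproduce. In this contrived sketch (my numbers, not the post's), the same loop and the same learning rate that converge for inputs around 1 to 3 blow up once the inputs are in the hundreds:

```rust
/// Same gradient-descent loop, but with large inputs. The effective
/// step 2·α·x² is now far greater than 1, so every update overshoots
/// and the error grows instead of shrinking.
fn diverging_w() -> f64 {
    let data = [(100.0, 201.0), (200.0, 401.0)]; // still points on y = 2x + 1
    let (mut w, mut b) = (0.0_f64, 0.0_f64);
    let alpha = 0.05; // fine for x in 1..3, catastrophic for x in 100..200

    for _ in 0..100 {
        for &(x, y) in &data {
            let d = 2.0 * ((w * x + b) - y);
            w -= alpha * d * x;
            b -= alpha * d;
        }
    }
    w
}

fn main() {
    // Instead of landing near w = 2, the parameter shoots off to
    // infinity (and then NaN) within a few dozen updates.
    println!("w after 100 epochs: {}", diverging_w());
}
```

Rescaling the inputs (or shrinking α accordingly) restores convergence, which is exactly the normalization lesson above.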
Simple Language Models (Without Neural Networks)
With gradient descent under control, I wanted to try something different: language.
Instead of jumping to neural networks, I built the simplest possible language models based purely on counting.
Character-Level Bigrams
The first version learned:

P(next character | current character): for each character, how likely every possible next character is.
Training was just counting transitions between characters.
The output looked like word-shaped noise — not broken, just extremely limited.
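The counting step might look like this in Rust (a sketch; the function name and example text are mine):

```rust
use std::collections::HashMap;

/// Count character bigrams: how often each character follows another.
/// "Training" is literally just tallying adjacent pairs.
fn bigram_counts(text: &str) -> HashMap<(char, char), u32> {
    let chars: Vec<char> = text.chars().collect();
    let mut counts = HashMap::new();
    for pair in chars.windows(2) {
        *counts.entry((pair[0], pair[1])).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = bigram_counts("hello hello");
    // 'l' follows 'l' twice: once per "hello".
    println!("{:?}", counts.get(&('l', 'l'))); // Some(2)
}
```

Normalizing each row of counts into probabilities, then sampling, gives the word-shaped noise described above.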
Word-Level Bigrams
Switching from characters to words made a huge difference:

P(next word | current word)
With enough training text, the model started producing sentences that felt oddly familiar — grammatically plausible, topical, and clearly inspired by the source material.
Still wrong. Still shallow. But undeniably language-like.
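A word-level sketch, using greedy generation (always the most frequent next word) so the output is deterministic; the real model samples from the counts instead, and the corpus and names here are mine:

```rust
use std::collections::HashMap;

/// Word-level bigram table: for each word, count which words follow it.
fn word_bigrams(text: &str) -> HashMap<&str, HashMap<&str, u32>> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut table: HashMap<&str, HashMap<&str, u32>> = HashMap::new();
    for pair in words.windows(2) {
        *table.entry(pair[0]).or_default().entry(pair[1]).or_insert(0) += 1;
    }
    table
}

/// Greedy generation: repeatedly pick the most frequent successor.
fn generate<'a>(
    table: &HashMap<&'a str, HashMap<&'a str, u32>>,
    start: &'a str,
    len: usize,
) -> Vec<&'a str> {
    let mut out = vec![start];
    let mut cur = start;
    for _ in 0..len {
        match table.get(cur).and_then(|next| next.iter().max_by_key(|e| *e.1)) {
            Some((&word, _)) => {
                out.push(word);
                cur = word;
            }
            None => break, // no known successor: stop
        }
    }
    out
}

fn main() {
    let table = word_bigrams("the cat sat on the mat and the cat sat on the rug");
    // Follows the most common transitions learned from the corpus.
    println!("{}", generate(&table, "the", 4).join(" ")); // the cat sat on the
}
```

Even this toy version shows the effect: the output echoes the phrasing of its source text, because each word remembers which words tended to follow it.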
That made one thing very clear: model quality depends heavily on what information the model is allowed to remember.
Source Code
All of this lives here:
👉 neural-nets https://github.com/Lucas8448/neural-nets
The repository contains:
- linear regression implemented from scratch,
- gradient descent experiments,
- character- and word-level language models.