● Breakthrough

I improved my app by 44% without writing a single new feature

This is a green bar post. Something worked, it worked well, and I want to share it because I think you can use it too.

I built Transcripted — a macOS app that records meetings and figures out who was talking. The speaker identification part was getting it wrong about a third of the time. I knew the technology was capable of better. I just didn’t know which settings to change.

So I tried something: I adapted Andrej Karpathy’s AutoResearch loop — originally built for tuning ML training — and pointed it at my diarization error rate instead. The concept is dead simple:

1. You define your source of truth. What does “good” look like? For me: meetings where I already knew who said what. That’s your benchmark.

2. You define the knobs. What settings could you turn up or down? Clustering sensitivity, segment length, gap duration — each one a dial. You don’t need to fully understand them. You just need to know they exist.

3. You let the loop run. It proposes a change, tests it, keeps it if the score improves, discards it if it doesn't. Repeat ~200 times. There's a sketch of this in code right after the list.
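
Here's the shape of the loop in code, as a minimal sketch. To be clear: this is not Transcripted's actual code. Apart from clusteringThreshold, the knob names, the ranges, and the errorRate() stub are invented for illustration, and the real search may propose changes more intelligently than random nudges. The keep-the-winners logic is the part that matters:

```swift
// Hypothetical knob set. Only clusteringThreshold is named in this post;
// the rest is made up for illustration.
struct Knobs {
    var clusteringThreshold = 0.80  // how eagerly segments merge into one speaker
    var minSegmentLength = 1.0      // seconds
    var maxGapDuration = 0.5        // seconds

    // Nudge one randomly chosen knob up or down a little.
    func perturbed() -> Knobs {
        var next = self
        switch Int.random(in: 0..<3) {
        case 0:  next.clusteringThreshold = min(1.0, max(0.0, clusteringThreshold + .random(in: -0.05...0.05)))
        case 1:  next.minSegmentLength = max(0.1, minSegmentLength + .random(in: -0.2...0.2))
        default: next.maxGapDuration = max(0.05, maxGapDuration + .random(in: -0.1...0.1))
        }
        return next
    }
}

// Stand-in for "run diarization on the benchmark meetings and
// score it against ground truth". Lower is better.
func errorRate(_ knobs: Knobs) -> Double {
    // ... run the real pipeline here ...
    return .random(in: 0...1)  // placeholder so the sketch runs
}

var best = Knobs()
var bestScore = errorRate(best)
for trial in 1...200 {
    let candidate = best.perturbed()
    let score = errorRate(candidate)
    if score < bestScore {  // keep the winner, discard everything else
        (best, bestScore) = (candidate, score)
        print("trial \(trial): error down to \(bestScore)")
    }
}
```

That acceptance rule (only keep changes that improve the score) is plain hill climbing. It can get stuck on local optima, but for a handful of knobs and a fixed benchmark, 200 trials over a weekend is plenty of search.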

I went from a 34.1% error rate to 19.2%. That's a 44% relative drop in errors. No new model. No new features. No retraining anything.

The thing that genuinely surprised me: one variable did almost all the work. clusteringThreshold=0.600. Every other dial I thought mattered barely moved the needle. I would have spent weeks tweaking the wrong things. The loop found the right one in a weekend.
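
If you want to see that for yourself, one cheap check (reusing the sketch above, so same caveats apply) is to hold everything at the winning config and sweep one knob at a time:

```swift
// Sweep clusteringThreshold around the winner while the other
// knobs stay fixed. A knob that matters will move the score.
for t in stride(from: 0.40, through: 0.80, by: 0.05) {
    var k = best
    k.clusteringThreshold = t
    print("threshold \(t) -> error \(errorRate(k))")
}
```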

That’s the real insight here. You don’t just get a better number — you learn what actually matters. That’s worth more than the improvement itself.

If you have a system with measurable output and tunable settings, this pattern works. Ground truth → knobs → loop → keep the winners. That’s it.

You don’t need to be a researcher. You just need to be able to measure whether something got better or worse.
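
And "measure" can be crude. The names below are made up, but this is roughly the whole idea: sample the timeline every 100 ms and count how often the predicted speaker disagrees with the known one. One baked-in assumption: the predicted labels are already mapped onto the true speaker names. Real diarization-error tools do that label matching (and handle overlapping speech) for you.

```swift
// A turn on the timeline: who spoke, from when to when (seconds).
struct Turn { let start, end: Double; let speaker: String }

// Fraction of sampled instants where prediction and ground truth disagree.
func disagreementRate(truth: [Turn], predicted: [Turn], duration: Double) -> Double {
    func speaker(at t: Double, in turns: [Turn]) -> String? {
        turns.first { t >= $0.start && t < $0.end }?.speaker
    }
    var wrong = 0, total = 0
    for t in stride(from: 0.0, to: duration, by: 0.1) {
        total += 1
        if speaker(at: t, in: truth) != speaker(at: t, in: predicted) { wrong += 1 }
    }
    return total == 0 ? 0 : Double(wrong) / Double(total)
}
```

Run that over your benchmark meetings, average it, and you have the single number the loop optimizes.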

Code is in the Transcripted repo if you want to dig in.