what if my app fixed itself overnight
Transcripted works great with two people. One mic, one system audio on a Zoom call. Clean separation. Easy problem.
Add four more people and it falls apart.
The transcription is fine. Whisper handles the audio no problem. The part that breaks is diarization. Figuring out who said what. With 6-12 speakers the pipeline starts guessing. Merging voices. Splitting one person into two. The kind of errors that make the whole transcript useless.
I’ve been poking at PyAnnote parameters and VAD thresholds for weeks. Manual tuning. Change a number, run a test, check the output, repeat. It’s slow. And I’m never sure if I’m finding real improvements or just overfitting to one recording.
Then I watched Karpathy’s AutoResearch talk.
The setup is simple. Three files. A data prep script. A training script the agent can edit. A markdown prompt that tells the agent what to optimize. The agent modifies the code, runs a short experiment, measures the result, keeps what works, throws away what doesn’t, and loops. He ran 700 experiments in two days. Found 20 real optimizations. No human in the loop.
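To make that concrete for myself, here's roughly the skeleton I'm picturing, stripped down to just the keep-or-discard part. Everything in it is a placeholder: `score_config` stands in for running the pipeline over the labeled corpus, and a random parameter nudge stands in for whatever edits the agent would actually make.

```python
import copy
import random

def score_config(config: dict) -> float:
    # Placeholder. In the real harness this would run the diarization
    # pipeline with `config` over the labeled recordings and return the
    # average diarization error rate (DER). Lower is better.
    return random.uniform(0.1, 0.5)

def run_loop(baseline: dict, n_experiments: int = 100) -> dict:
    best_config = baseline
    best_der = score_config(baseline)
    for _ in range(n_experiments):
        candidate = copy.deepcopy(best_config)
        # Stand-in for the agent's edit: nudge one parameter at random.
        key = random.choice(list(candidate.keys()))
        candidate[key] *= random.uniform(0.8, 1.2)

        der = score_config(candidate)
        if der < best_der:          # keep what works
            best_config, best_der = candidate, der
        # otherwise discard and loop
    return best_config

if __name__ == "__main__":
    # Parameter names are illustrative, not the actual pipeline config.
    baseline = {"vad_onset": 0.5, "vad_offset": 0.36, "clustering_threshold": 0.7}
    print(run_loop(baseline))
```

The real version would swap the random nudge for an agent proposing edits to the actual pipeline code, which is the whole point. But the keep-or-discard scaffolding stays the same.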
700 experiments. I’ve run maybe 30 in the past month.
So now I’m thinking about pointing that same loop at my diarization problem. Build a test corpus of multi-speaker recordings with ground truth labels. I’ve already got about 40 meeting recordings. I’d need to hand-label maybe 15-20 of them. Tedious but doable.
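For the label format, pyannote's own Annotation objects look like the natural target, since its metrics consume them directly. Something like this is what I'm imagining for loading hand labels, one CSV per recording (file names, column order, speaker names all made up):

```python
import csv
from pyannote.core import Annotation, Segment

def load_labels(csv_path: str, uri: str) -> Annotation:
    """Read hand labels into a pyannote Annotation.

    Assumes a headerless CSV with rows of: start_sec, end_sec, speaker.
    """
    reference = Annotation(uri=uri)
    with open(csv_path, newline="") as f:
        for start, end, speaker in csv.reader(f):
            reference[Segment(float(start), float(end))] = speaker
    return reference

# e.g. labels/standup_03.csv with lines like:  0.0,4.2,alice
# reference = load_labels("labels/standup_03.csv", uri="standup_03")
```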
Then let the agent modify everything. PyAnnote parameters. VAD thresholds. Embedding models. Clustering algorithms. Overlap handling. Each experiment runs against the labeled data, scores the result, decides whether to keep or discard.
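The scoring half is the part I'm least worried about, because pyannote already ships a diarization error rate metric that takes those same Annotation objects. This is roughly what the placeholder score function above would do for real (the 0.25 s collar, which forgives small boundary disagreements, is a guess I'd want to check too):

```python
from pyannote.metrics.diarization import DiarizationErrorRate

def score_corpus(pairs, collar: float = 0.25) -> float:
    """Corpus-level DER over (reference, hypothesis) Annotation pairs."""
    metric = DiarizationErrorRate(collar=collar)
    for reference, hypothesis in pairs:
        metric(reference, hypothesis)   # accumulates per-file error components
    return abs(metric)                  # aggregate DER across the corpus
```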
The M5 Max with 128GB unified memory can handle this. Local compute. No API costs. Run it overnight. Wake up and check what it found.
That’s the idea anyway.
So. The red bar.
I haven’t built any of this. I don’t have labeled ground truth data yet. I don’t know if 5-minute experiment cycles are realistic for diarization the way they are for training runs. The search space might be too big or the signal too noisy. AutoResearch was built for a specific kind of ML experiment loop. Diarization pipeline tuning might not fit the pattern.
And there’s the meta problem. I could spend a week building the AutoResearch harness instead of just fixing the diarization by hand. That’s a trap I keep falling into. Build the system that builds the thing instead of building the thing.
But 700 experiments versus my 30. That ratio is hard to ignore. Even if half the experiments are garbage, the coverage is something I can’t match manually. And the M5 Max is just sitting there at night doing nothing.
I’m going to start with the ground truth labels this weekend. If the corpus comes together, the rest is plumbing.
Red bar: haven't started yet. But 700 experiments beats 30, and I have a machine that can run them while I sleep.