
The autoresearch loop is running. Baseline is 34.1% DER.

Last week I wrote about the idea. Karpathy’s AutoResearch. 700 experiments in two days. I had 30 manual experiments and a machine sitting idle overnight. I said I’d start building the ground truth corpus this weekend.

I didn’t just build the corpus. I built the whole thing.

It’s running on my Mac right now while I type this. Loop iteration 4 of 100. Qwen 35B doing the parameter suggestions. FluidAudio running the actual diarization. No cloud APIs, no human in the loop. Just the machine doing its thing.

Here’s the full story, including everything that broke.


The idea

Transcripted’s diarization pipeline has about 18 tunable parameters. Things like: how similar do two voice samples need to sound before you call them the same person? How long does someone need to be silent before you mark them as “stopped talking”? What’s the minimum speech duration worth keeping?
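To make that concrete, the config surface looks roughly like this. clusteringThreshold is a real knob that comes up later in this post; the other names here are made up to show the shape of the thing, not Transcripted's actual keys.

    # Illustrative only: clusteringThreshold appears later in the post,
    # the other keys are hypothetical stand-ins for the real parameter names.
    diarization_params = {
        "clusteringThreshold": 0.72,   # how close two voice embeddings must be to count as one speaker
        "minSilenceForTurnEnd": 0.60,  # seconds of silence before a speaker is marked as stopped
        "minSpeechDuration": 0.25,     # drop detected speech shorter than this
        # ...roughly 15 more knobs like these
    }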

I’ve been tuning these by hand. Try a number, run it against one or two meetings, squint at the output, move the number. That’s maybe one real data point every 30 minutes if I’m focused. And I can never tell if I’m actually finding signal or just overfitting to the recordings I happened to test on.

The AutoResearch approach is: build a proper corpus with ground truth labels, score everything automatically, and let an LLM suggest which knob to turn next. Keep what improves the score, discard what doesn’t. Run it 100+ times overnight.
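In code, the whole idea is a small hill-climbing loop. This is a sketch of the control flow only; load_default_config, score_corpus, and suggest_config are stand-ins for the real pieces described below.

    # Hill-climbing sketch: keep a candidate config only if it lowers mean DER.
    best_config = load_default_config()        # assumed helper
    best_der = score_corpus(best_config)       # assumed: diarize + score every meeting in the corpus
    for _ in range(100):
        candidate = suggest_config(best_config, best_der)   # assumed: LLM proposes one change
        der = score_corpus(candidate)
        if der < best_der:                                  # keep what improves, discard the rest
            best_config, best_der = candidate, der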


The corpus

I already had 40+ recordings from real meetings. The piece I was missing was ground truth labels — verified answers for who was speaking at each moment. That’s the thing you need to actually score diarization accuracy.

Turns out I had it the whole time. Meeting platforms that record with per-participant tracks produce transcript files with speaker attribution already baked in: timestamps and speaker names, ready to go. I just needed to convert that to RTTM format (the standard diarization ground truth format) and I’d have a complete labeled corpus.
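The format is about as simple as it gets: one space-delimited line per speech segment. A sketch of the writer half, assuming the transcript has already been parsed into (start, end, speaker) tuples:

    # Sketch: emit RTTM lines from (start_sec, end_sec, speaker) tuples.
    # Each line: SPEAKER <file-id> <channel> <start> <duration> <NA> <NA> <speaker> <NA> <NA>
    def write_rttm(segments, file_id, path):
        with open(path, "w") as f:
            for start, end, speaker in segments:
                f.write(f"SPEAKER {file_id} 1 {start:.3f} {end - start:.3f} "
                        f"<NA> <NA> {speaker} <NA> <NA>\n")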

So I wrote the converter and built the corpus. 16 meetings. 3 to 8 speakers each. ~11 hours of total audio. 20 unique speakers across all the recordings. One consistent speaker in every meeting, which gives the model a stable voice baseline to work against.

This is the honest version of the test. Real meetings, not a benchmark dataset designed to make models look good. Mixed audio from a single track, not separate per-speaker streams. The diarizer is working with the same degraded signal that real users get.


The build

Two pieces.

TranscriptedCLI — a headless Swift binary that runs Transcripted’s diarization engine on an audio file and spits out an RTTM file. No GUI. No Xcode open. Just: transcripted-cli diarize audio.m4a --output result.rttm and it runs. Built it as a standalone Swift Package inside the Transcripted repo on a feat/autoresearch-cli branch. Zero existing files modified. The GUI app is completely untouched.
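From the Python side, driving it is one subprocess call per file, mirroring the command above (the script later moved to the batch command instead; more on that in the bugs section):

    # Sketch: shell out to the CLI for a single file, using the command shown above.
    import subprocess

    subprocess.run(
        ["transcripted-cli", "diarize", "audio.m4a", "--output", "result.rttm"],
        check=True,   # raise if the CLI exits non-zero
    )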

Building the CLI was mostly a linking problem. FluidAudio isn’t a standard Swift package — it’s a pre-built static library (28MB of .a file) plus extracted .swiftmodule files with custom C module maps. The package’s dependency chain was missing HuggingFace, EventSource, and a bunch of transitive dependencies that weren’t compiled into the main static library. I ended up building a single supplementary static lib from all the missing .o files and adding that to the linker flags.

First run: transcripted-cli --help. That was a good moment.

autoresearch-diarization.py — the Python loop. It runs the CLI against the whole corpus, converts transcript ground truth to RTTM annotations, scores each meeting using pyannote.metrics (handles the tricky parts of DER: optimal speaker label matching via Hungarian algorithm, collar tolerance, overlap handling), asks an LLM for a parameter suggestion, applies it, keeps or discards based on whether DER went down.
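The scoring half is the part pyannote.metrics makes easy. DER is missed speech plus false alarms plus speaker confusion, divided by total reference speech. A sketch, assuming the RTTM files are already parsed into (start, end, speaker) tuples; the 250 ms collar here is a common convention, not necessarily what my script uses:

    # Sketch: score one meeting's hypothesis against its ground truth.
    from pyannote.core import Annotation, Segment
    from pyannote.metrics.diarization import DiarizationErrorRate

    def to_annotation(segments):
        ann = Annotation()
        for start, end, speaker in segments:   # segments: (start_sec, end_sec, speaker)
            ann[Segment(start, end)] = speaker
        return ann

    def score_meeting(reference_segments, hypothesis_segments):
        # Optimal speaker-label mapping (Hungarian algorithm) happens inside the metric.
        metric = DiarizationErrorRate(collar=0.25)
        return metric(to_annotation(reference_segments), to_annotation(hypothesis_segments))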


The bugs

Three notable ones.

Bug 1: CoreML crashes on the 4th meeting

Every file worked fine individually. But when the Python script called the CLI once per file in a loop, the 4th meeting would crash with a CoreML shape error. Or the 3rd would succeed and the 5th would fail. Totally non-deterministic.

The problem: CoreML model state leaking between separate process invocations. Each CLI call was loading the neural network models fresh, and something in the initialization was going sideways when you did it too many times in quick succession.

Fix: use batch mode instead. The CLI’s batch command loads models once and processes all files in a single process. One model load, 16 meetings. CoreML stays happy. Processing time actually went down too.

Bug 2: The RTTM parser reading garbage data

One meeting — the 72-minute 5-speaker call — was scoring at 32,813% DER. Not 32%. Thirty-two thousand percent.

Spent a while on that one. The hypothesis RTTM file looked fine. The ground truth looked fine. But when I checked the raw numbers, the hypothesis was claiming 1.37 million seconds of speaker audio against a 4,357-second reference.

The issue was a filename with spaces in it. RTTM lines are space-delimited, and the space was getting written straight into the file ID field, so the parser read the filename as two fields and every column after it shifted. The start timestamp was being read from the wrong position. Everything was off.

Fix: sanitize the file ID in the RTTM writer (replace spaces with underscores), and make the parser more robust by reading from the end of the line where the field positions are predictable.
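Both halves of that fix are a few lines each. A sketch, with hypothetical helper names:

    # Writer side: never let a space reach the space-delimited file ID field.
    def safe_file_id(filename):
        return filename.replace(" ", "_")

    # Parser side: index fields from the end of the line, where positions are
    # stable even if the file ID at the front got split by stray spaces.
    # RTTM line: SPEAKER <file-id> <chan> <start> <dur> <NA> <NA> <speaker> <NA> <NA>
    def parse_rttm_line(line):
        fields = line.split()
        start, duration, speaker = float(fields[-7]), float(fields[-6]), fields[-3]
        return start, duration, speaker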

Bug 3: Qwen’s thinking preamble breaking the config parser

Qwen 35B is a “thinking” model — before it gives you an answer, it outputs a “Thinking Process:” block walking through its reasoning. That’s great for accuracy. Terrible for automated parsing.

My script was looking for REASONING: ... and CONFIG: { ... } in the LLM’s response. But the thinking preamble comes first and can contain its own JSON examples, which were confusing the parser. The script kept falling back to random perturbation instead of using the LLM’s suggestions.

Fix: find the index of REASONING: in the response and slice everything before it. Also bumped max_tokens from 1000 to 4000 — the thinking process eats tokens before it gets to the actual answer.
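The slicing fix itself is tiny. A sketch, assuming the prompt asks for a REASONING: line followed by a CONFIG: JSON block:

    import json
    import re

    def parse_suggestion(response_text):
        # Slice off the thinking preamble (and any JSON examples inside it).
        idx = response_text.find("REASONING:")
        if idx == -1:
            return None                      # caller falls back to random perturbation
        answer = response_text[idx:]
        # Grab the CONFIG block; assumes the JSON is the last thing in the answer.
        match = re.search(r"CONFIG:\s*(\{.*\})", answer, re.DOTALL)
        return json.loads(match.group(1)) if match else None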

After that fix, Qwen’s responses started working. Its first real suggestion: “high confusion errors indicate over-segmentation — lower clusteringThreshold to merge embeddings more aggressively.” Applied. Testing now.


Baseline results

First pass, default settings, all 16 meetings:

Meeting            Speakers   DER
20250725-144458    3          8.5%
20260107-223834    3          10.6%
20250806-163617    7          11.6%
20250501-170626    5          24.9%
20250820-145539    3          25.6%
20250725-160304    6          28.4%
20260129-211935    7          29.1%
20250922-170559    4          29.5%
20250926-172153    4          30.3%
20260112-181353    4          32.5%
20260213-154428    3          40.6%
20250918-180240    4          43.7%
20251016-200631    4          53.1%
20250919-161032    3          55.6%
20250820-160635    7          56.6%
20250528-160623    8          70.1%

Mean DER: 34.1%

State of the art on conversational audio from a single mixed-down recording is 5–15%. I’m at 34.1%. That’s the honest number and I’m not going to spin it.

The interesting thing is that speaker count isn’t the main predictor of difficulty. I have three-speaker meetings ranging from 8.5% to 55.6%. Something else is driving that spread — recording conditions, crosstalk density, overlapping speech. The loop will find it.

The 8-speaker meeting at 70.1% is the obvious worst case. But I’m almost more interested in why three of my simpler three- and four-speaker meetings are in the 40–55% range. Those should be easy.


The meta thing

Last week I wrote: “I could spend a week building the AutoResearch harness instead of just fixing the diarization by hand. That’s a trap I keep falling into.”

I spent one day on the harness. Half of it was those three bugs. The other half was plumbing: getting pyannote.metrics installed and working on Python 3.9, figuring out the Swift package linking, writing the transcript-to-RTTM converter.

But the loop is running. When I wake up tomorrow I’ll have 100 experiments instead of 30. And unlike my manual experiments, they’re all scored against the same 16-meeting corpus, so the results are actually comparable to each other.

That’s the real unlock. Not the code. The comparability.

Red bar: 34.1% mean DER. I’m going to keep posting these numbers until they’re single digits.