Diarize might change the pipeline

Speaker diarization — figuring out who said what in a recording — is one of the hardest parts of building a meeting transcription app. It’s also one of the slowest.

Transcripted currently leans on pyannote for this. It works. But “works” comes with asterisks. It’s GPU-hungry on most platforms. On CPU it’s slow enough that users notice. And the accuracy on speaker count estimation gets shaky once you’re past 3-4 people in a room.

This morning my community scanner pulled up a new open-source library called Diarize. The claims: 7x faster than pyannote on CPU, with lower error rates, and 87-97% accuracy on speaker count detection. It’s optimized for meetings with 1-5 speakers, which is exactly my use case.

I haven’t tested it yet. That matters. Claims on a Show HN post are not benchmarks on my hardware with my audio files. “7x faster” could mean a lot of things depending on what they measured, how they measured it, and whether the accuracy tradeoffs are acceptable for real meeting audio with crosstalk and background noise.
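One way to pin down what "faster" means is real-time factor: seconds of processing per second of audio, measured on the same file and the same machine. Here is a minimal sketch using only the Python standard library. The `diarize_fn` argument is a hypothetical wrapper around whichever library is being tested; nothing here assumes the actual API of pyannote or Diarize.

```python
import time
import wave


def audio_seconds(wav_path):
    """Duration of a PCM WAV file in seconds."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(diarize_fn, wav_path, runs=3):
    """Best-of-N processing time divided by audio duration.

    diarize_fn is a placeholder: any callable that takes a WAV path and
    runs diarization on it. A result below 1.0 means faster than real time.
    """
    duration = audio_seconds(wav_path)
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        diarize_fn(wav_path)
        best = min(best, time.perf_counter() - start)
    return best / duration
```

Dividing one library's real-time factor by the other's on the same recording gives a per-file speedup, which is easier to compare against a headline number than a single aggregate.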

But if even half of that holds up, it changes the pipeline.

Right now diarization is the bottleneck in Transcripted’s post-recording processing. Whisper runs fast on Apple Silicon — the M-series chips are genuinely good at this. But then you hit the diarization step and everything slows down. On my M5 Max it’s tolerable. On an M1 MacBook Air it’s… noticeable. The kind of noticeable where users wonder if the app froze.

A 3-4x improvement (being conservative about the 7x claim) would mean diarization drops from “the slow step” to “just another step.” That’s the difference between users feeling like the app is fast and users watching a progress bar.
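How far the whole pipeline moves depends on how much of the total time diarization accounts for. Here is a rough sketch of that arithmetic (Amdahl's law). `diarization_share` is a placeholder to be filled in from real timings, not a measured number.

```python
def pipeline_speedup(diarization_share, diarization_speedup):
    """Overall speedup when only the diarization step gets faster.

    diarization_share: fraction of total processing time spent in
        diarization today (0.0 to 1.0). Placeholder: measure it first.
    diarization_speedup: how many times faster the new step is.
    """
    if not 0.0 <= diarization_share <= 1.0:
        raise ValueError("diarization_share must be between 0 and 1")
    if diarization_speedup <= 0:
        raise ValueError("diarization_speedup must be positive")
    # Time left over: the untouched part plus the shrunken diarization part.
    remaining = (1.0 - diarization_share) + diarization_share / diarization_speedup
    return 1.0 / remaining
```

For illustration only: if diarization were 80% of processing time, a 4x faster diarizer would make the whole pipeline about 2.5x faster, and a 7x one about 3.2x.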

The other thing that’s interesting is the CPU-only focus. Pyannote really wants a GPU. On macOS that means Core ML or Metal, which means conversion work and platform-specific optimization. A library that’s just… fast on CPU? That simplifies a lot.

What I need to do:

- Clone the repo and run it against my test recordings. I’ve got about 40 meeting recordings of varying quality and speaker count.
- Compare accuracy against pyannote on the same files. Speed doesn’t matter if it can’t tell my voice from my coworker’s.
- Check the dependency tree. If it drags in PyTorch or something heavy, the install story gets complicated.
- Figure out if it can be called from Swift or if I need a Python bridge.
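The first two steps could be scored with something like the sketch below. It assumes each run has already been reduced to one record per recording; the `Result` fields and function names are made up for illustration and are not part of either library. It only checks speaker counts and total time. Who-spoke-when accuracy needs a proper diarization error rate against labeled segments, which this does not compute.

```python
from dataclasses import dataclass


@dataclass
class Result:
    file: str
    true_speakers: int       # hand-labeled speaker count
    predicted_speakers: int  # count reported by the diarizer
    seconds: float           # wall-clock processing time


def speaker_count_accuracy(results):
    """Fraction of recordings where the speaker count was exactly right."""
    if not results:
        return 0.0
    hits = sum(r.true_speakers == r.predicted_speakers for r in results)
    return hits / len(results)


def compare(baseline, candidate):
    """Summarize two runs over the same recordings, matched by file name."""
    base_by_file = {r.file: r for r in baseline}
    # Only score files that both runs processed.
    pairs = [(base_by_file[c.file], c) for c in candidate if c.file in base_by_file]
    base_time = sum(b.seconds for b, _ in pairs)
    cand_time = sum(c.seconds for _, c in pairs)
    return {
        "files": len(pairs),
        "baseline_accuracy": speaker_count_accuracy([b for b, _ in pairs]),
        "candidate_accuracy": speaker_count_accuracy([c for _, c in pairs]),
        "speedup": base_time / cand_time if cand_time else float("inf"),
    }
```

Matching by file name keeps the comparison honest if one library fails on a recording: that file drops out of both sides instead of skewing one of them.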

That’s this week’s experiment. Red bar because I don’t know if it works yet and I’ve been burned by benchmark claims before. But the potential is real enough that I have to find out.