Paperclip was the red bar. Claude Code is the pivot.
I wanted a local AI agent system that just worked. No cloud API costs, no sending my code to a server, just models running on my M5 doing useful work while I sleep.
Paperclip was the bet. You wire up local LLMs to a task queue, agents pick up tickets, they do research and write code and update docs. The idea is solid. The execution has been janky.
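The shape of it, as a hypothetical Swift sketch. These aren't Paperclip's real types, just the lifecycle as I think about it:

```swift
import Foundation

// Hypothetical reconstruction of the ticket lifecycle. Names are mine,
// not Paperclip's.
enum TicketStatus: String {
    case queued
    case inProgress = "in_progress"
    case inReview = "in_review"
    case done
}

struct Ticket {
    let id: UUID
    var title: String          // e.g. "Mine GitHub issues for competing apps"
    var status: TicketStatus
}

// One agent turn in miniature: claim the ticket, do the work, hand it
// off for review. Something still has to move in_review to done, and
// as it turns out, that detail matters.
func runAgentTurn(on ticket: inout Ticket, agent: (Ticket) -> Bool) {
    ticket.status = .inProgress
    ticket.status = agent(ticket) ? .inReview : .queued
}
```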
The problems aren’t catastrophic. They’re worse than that — they’re intermittent. The heartbeat fires but the agent runs for 4 minutes and exits without finishing. The workspace warning prints to stderr and Paperclip lights it up red even when nothing actually failed. The model that’s loaded when the heartbeat fires isn’t always the one the agent expects. First run fails, second run succeeds, ticket sits in in_review forever because nobody closed it.
It works. Just not reliably enough to trust.
So I’m pivoting. Claude Code Desktop added scheduled tasks and I’ve been playing with it for Transcripted. The setup I built is five nightly jobs staggered across the early morning:
1:47 AM — build and test watchdog. Runs xcodebuild, catches failures, attempts a surgical fix, opens a PR if it worked or files an issue if it didn't. Silent on healthy nights. (A sketch of this job's shape appears after the list.)
2:17 AM — simplification sweep. Looks for duplicate code and efficiency opportunities across the ~140 Swift files. No PR if there’s nothing real to fix.
3:23 AM — code review. Focuses on the last 7 days of commits. Threading violations, missing error handling, logic bugs in the transcription pipeline.
4:37 AM — security audit. Path traversal in file handling, SQL injection in the speaker database, force unwraps in critical paths, HuggingFace download integrity. (Illustrative examples of what this job and the code review hunt for also follow the list.)
Separately — CLAUDE.md sync. Compares all 15 documentation files against the actual codebase and updates anything that’s drifted.
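To make the first job concrete, here's a rough sketch of the watchdog's logic as a standalone script. The scheme, destination, and prompt text are placeholders, and it shells out to Claude Code's headless mode (claude -p) where the real thing runs inside the Desktop app's native scheduler:

```swift
#!/usr/bin/env swift
// Sketch of the 1:47 AM watchdog's shape, not the actual scheduled task.
import Foundation

func run(_ args: [String]) -> Int32 {
    let p = Process()
    p.executableURL = URL(fileURLWithPath: "/usr/bin/env")
    p.arguments = args
    do { try p.run() } catch { return -1 }
    p.waitUntilExit()
    return p.terminationStatus
}

// Scheme and destination are placeholders, not Transcripted's real config.
let buildStatus = run(["xcodebuild", "test",
                       "-scheme", "Transcripted",
                       "-destination", "platform=macOS"])

if buildStatus != 0 {
    // Only wake Claude when something actually broke.
    _ = run(["claude", "-p",
             "The nightly build failed. Diagnose it, attempt a surgical fix, open a PR if it works, otherwise file an issue."])
}
// On healthy nights: no output, no PR, no issue.
```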
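And for a feel of what the review and audit jobs are hunting, two of the patterns in illustrative Swift. None of this is Transcripted's actual code; the type names and the model filename are invented:

```swift
import SwiftUI

// Illustrative only; not Transcripted's real code.
final class TranscriptModel: ObservableObject {
    @Published var lines: [String] = []

    // The kind of threading violation the 3:23 AM review flags:
    // mutating @Published UI state from a background queue.
    func appendUnsafe(_ line: String) {
        DispatchQueue.global().async {
            self.lines.append(line)   // off the main thread
        }
    }

    // The shape of the fix: hop to the main queue first.
    func append(_ line: String) {
        DispatchQueue.main.async {
            self.lines.append(line)
        }
    }
}

// A force unwrap in a critical path, the 4:37 AM audit's bread and butter.
enum ModelError: Error { case missing }

// Crashes at launch if the model file ever ships missing:
func loadModelUnsafely() -> URL {
    Bundle.main.url(forResource: "whisper-base", withExtension: "bin")!
}

// Fails loudly but recoverably instead:
func loadModel() throws -> URL {
    guard let url = Bundle.main.url(forResource: "whisper-base", withExtension: "bin") else {
        throw ModelError.missing
    }
    return url
}
```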
The key design principle: no noise on clean nights. PRs only get created when something real was found and fixed. Issues only get filed when something is genuinely broken. If everything’s fine, nothing happens.
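Mechanically, the principle reduces to a guard at the end of each job. A minimal sketch, assuming the job leaves its fix in the working tree; the branch name and gh invocation are placeholders, not what the tasks literally run:

```swift
import Foundation

// "No noise" guard: only open a PR when the job actually changed something.
func run(_ args: [String]) -> Int32 {
    let p = Process()
    p.executableURL = URL(fileURLWithPath: "/usr/bin/env")
    p.arguments = args
    do { try p.run() } catch { return -1 }
    p.waitUntilExit()
    return p.terminationStatus
}

// `git diff --quiet` exits 0 when the tree is clean, nonzero when it isn't.
if run(["git", "diff", "--quiet"]) == 0 {
    exit(0)  // clean night: no branch, no commit, no PR, no notification
}
_ = run(["git", "checkout", "-b", "nightly/automated-fix"])
_ = run(["git", "commit", "-am", "Nightly automated fix"])
_ = run(["gh", "pr", "create", "--fill"])
```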
I haven’t run these long enough to know if they work yet. That’s the experiment. But the mental model is cleaner — Claude Code is deeply integrated into the codebase, it already knows the files and the patterns, and the scheduling is native rather than bolted on through a bridge layer and a heartbeat script and an MLX server that may or may not have the right model loaded.
Paperclip as a research tool is still useful. The GitHub issues mining it ran last night — scanning Meetily, Screenpipe, Whisper.cpp for open issues and user complaints — that was legitimately good. Local Nemotron 49B, no API costs, solid output. That use case plays to its strengths.
For code work on a production app, I need more reliability than I’ve been getting. Red bar on the original plan. New experiment live.