SpeakClean: a Typeless clone with a local model

Overview

SpeakClean is a macOS menu bar app inspired by Typeless, scoped down to the one workflow I actually use: hold a hotkey to dictate, release to paste cleaned text into the active app. Mic audio is transcribed on-device with Apple’s SpeechAnalyzer, and the transcript is rewritten by a locally served Gemma 4 E2B (via Ollama) to strip filler words. That’s the whole app: no feature parity with Typeless intended, no network calls beyond localhost, no API keys, no subscription.

Architecture

Two layers. A thin orchestration layer (menu bar UI, global hotkey, recording coordinator) on top of a reusable core with two pieces: a streaming transcriber and a stateless text cleaner. The split keeps the AppKit plumbing separate from the parts that could be lifted into another tool.
One short-lived pipeline per press. Each press runs audio → text → cleaned text → paste, then stops. No conversation history, no background workers, no queues.
Streaming, not batch. Audio feeds into the speech engine as it arrives rather than being recorded to a file and transcribed afterwards — keeps latency low and avoids temp-file handling.
Push-to-talk, not voice-activity detection. Recording starts on hotkey press and ends on release. No silence timeout, no endpointing heuristic. Simpler, more predictable, avoids a tuning problem that never fully goes away.
LLM as a local service, not an embedded runtime. Cleanup runs in a separate Ollama process that the app talks to over HTTP. The app doesn’t bundle model weights or own the inference runtime — Ollama handles model loading, memory management, and updates.
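
A minimal Swift sketch of how the split and the per-press pipeline described above could look. Every name here (StreamingTranscriber, TextCleaner, RecordingCoordinator, hotkeyDown, hotkeyUp) is hypothetical and chosen for illustration, not taken from the actual code. Using an AsyncStream of raw Data chunks to carry audio is also an assumption.

```swift
import Foundation

// Core layer (hypothetical protocols): no AppKit types appear here.
protocol StreamingTranscriber: Sendable {
    /// Consumes audio chunks as they arrive and returns the transcript
    /// once the stream has finished.
    func transcribe(_ audio: AsyncStream<Data>) async throws -> String
}

protocol TextCleaner: Sendable {
    /// Stateless: the result depends only on the transcript passed in.
    func clean(_ transcript: String) async throws -> String
}

/// Orchestration layer: one short-lived pipeline per hotkey press.
final class RecordingCoordinator {
    private let transcriber: any StreamingTranscriber
    private let cleaner: any TextCleaner
    private let paste: @Sendable (String) -> Void
    private var audioInput: AsyncStream<Data>.Continuation?
    private(set) var pipeline: Task<Void, Never>?

    init(transcriber: any StreamingTranscriber,
         cleaner: any TextCleaner,
         paste: @escaping @Sendable (String) -> Void) {
        self.transcriber = transcriber
        self.cleaner = cleaner
        self.paste = paste
    }

    /// Hotkey pressed: open a fresh audio stream and start the pipeline.
    func hotkeyDown() {
        let (stream, continuation) = AsyncStream<Data>.makeStream()
        audioInput = continuation
        let transcriber = self.transcriber
        let cleaner = self.cleaner
        let paste = self.paste
        pipeline = Task {
            do {
                let raw = try await transcriber.transcribe(stream)
                let cleaned = try await cleaner.clean(raw)
                paste(cleaned)
            } catch {
                // A failed press pastes nothing; there is no retry queue.
            }
        }
    }

    /// Called for each captured audio chunk while the key is held.
    func append(_ chunk: Data) {
        audioInput?.yield(chunk)
    }

    /// Hotkey released: finishing the stream is the only end-of-speech signal.
    func hotkeyUp() {
        audioInput?.finish()
        audioInput = nil
    }
}
```

In this sketch, nothing outlives a press: the stream, the task, and the transcript are all created in hotkeyDown and are done once the paste closure has run. The core protocols can be implemented and tested without any menu bar or hotkey code.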

Things I wanted but didn’t ship

Auto-detect language. SpeechAnalyzer doesn’t support it, and running an additional local LLM just for language ID would cost too much memory. Dropped.
A smaller cleanup model. Gemma 4 E2B at ~7 GB is the smallest model I found that does the cleanup job reliably. Waiting for smaller models to get good enough, rather than shipping something worse now.

Learnings

1. Prompt caching still helps when the model is resident

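Ollama keeps the Gemma weights in memory across requests, but that alone didn’t give me the latency I wanted. Resident weights only remove the model-load cost; the prompt still has to be processed on every request. Keeping the prefix (system prompt + dictionary) identical on each request and putting the transcript last, so the cached prefix can be reused, made a noticeable difference on top of that. “Model loaded” and “prompt cached” aren’t the same thing.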
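
A rough Swift sketch of the stable-prefix idea, assuming the cleanup call goes through Ollama’s /api/chat endpoint on its default address (http://localhost:11434). The type names, the prompt wording, the dictionary being a list of preferred spellings, and the 30m keep_alive value are all assumptions made for illustration. The model tag is left as a parameter because the exact tag is not given here.

```swift
import Foundation

/// One chat message in the shape Ollama's /api/chat endpoint accepts.
struct ChatMessage: Codable, Equatable {
    let role: String
    let content: String
}

struct ChatRequest: Codable {
    let model: String
    let messages: [ChatMessage]
    let stream: Bool
    let keepAlive: String

    enum CodingKeys: String, CodingKey {
        case model, messages, stream
        case keepAlive = "keep_alive"
    }
}

/// Hypothetical prompt builder: everything that stays constant goes first.
struct CleanupPrompt {
    let instructions: String
    let dictionary: [String]

    /// Stable prefix. Sorting the dictionary keeps the text identical across
    /// requests even if the terms are stored in a different order.
    var systemMessage: ChatMessage {
        let terms = dictionary.sorted().joined(separator: ", ")
        return ChatMessage(role: "system",
                           content: "\(instructions)\nPreferred spellings: \(terms)")
    }

    /// Only the final user message changes from one press to the next.
    func request(for transcript: String, model: String) -> ChatRequest {
        ChatRequest(model: model,
                    messages: [systemMessage,
                               ChatMessage(role: "user", content: transcript)],
                    stream: false,
                    keepAlive: "30m") // assumed value; keeps the model resident
    }
}

/// JSON body to POST to <base URL>/api/chat.
func encodeBody(_ request: ChatRequest) throws -> Data {
    try JSONEncoder().encode(request)
}
```

The point of the layout is that two consecutive requests differ only after the system message, so whatever prefix reuse the server does has the longest possible match. Anything that varies per request (timestamps, the foreground app name, the transcript itself) would have to stay out of the system message for that to hold.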

2. Claude Code is weak on macOS / Swift API domain knowledge

It often confidently asserts the wrong thing for AppKit-level specifics and then writes code against its (wrong) mental model.

The clearest example in this project: listening for a global shortcut on macOS needs Accessibility permission, not Input Monitoring. The agent kept reaching for Input Monitoring and burned a long debugging loop around a permission that wasn’t the actual problem. Web search didn’t surface the fix either. I had to work it out by hand.
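
For reference, a small sketch of checking the Accessibility permission before installing a global key monitor. It assumes the hotkey is observed with NSEvent.addGlobalMonitorForEvents, which only delivers key events to a process trusted for Accessibility; other event APIs may be gated by different permissions. The function names and the keyCode parameter are hypothetical.

```swift
import AppKit
import ApplicationServices

/// Returns true if this process is trusted for Accessibility.
/// With prompt == true, macOS may show its dialog pointing the user to
/// System Settings when the process is not yet trusted.
func hasAccessibilityAccess(prompt: Bool) -> Bool {
    let key = kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String
    return AXIsProcessTrustedWithOptions([key: prompt] as CFDictionary)
}

/// Installs a push-to-talk monitor for a single key code (hypothetical helper).
/// Returns the monitor token, or nil if Accessibility access is missing.
func installHotkeyMonitor(keyCode: UInt16,
                          onDown: @escaping () -> Void,
                          onUp: @escaping () -> Void) -> Any? {
    guard hasAccessibilityAccess(prompt: true) else { return nil }
    return NSEvent.addGlobalMonitorForEvents(matching: [.keyDown, .keyUp]) { event in
        guard event.keyCode == keyCode else { return }
        if event.type == .keyDown {
            // Holding a key produces repeated keyDown events; react to the first only.
            if !event.isARepeat { onDown() }
        } else {
            onUp()
        }
    }
}
```

Checking the trust state up front turns a silent "the monitor never fires" into an explicit nil, which is easier to spot than a handler that is simply never called.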

3. Agent coding can’t automate OS-interaction testing yet

A lot of the bugs only surfaced through real interaction — pressing the shortcut from different foreground apps, recording for different durations, checking that the pasted text actually lands in the right field. The agent can run swift test, but it can’t hold down a key, speak into a mic, or verify what ended up in Notes. Manual testing was the tight feedback loop for this project, and that shapes how the work divides between me and the agent.