I wanted to see how far I could get with a speak-to-LLM workflow in one night.
Not a polished app. Not a product launch. Just a real proof of concept: hotkey, microphone input, Whisper transcription, command detection, and text injection into whatever app is focused.
I got there in about two hours.
That matters to me because it means the idea is real enough to test before I turn it into a much bigger project.
The basic loop
The POC is built around a simple flow:
- Press Cmd+Shift+D to toggle listening
- Capture audio in short chunks
- Transcribe locally with Whisper
- Parse the transcript for commands at the end of the phrase
- Inject the remaining text into the active app
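The steps above can be sketched as a small controller. This is a minimal illustration, not the actual POC code: the class name `VoiceLoop` and the four callables (`transcribe`, `parse`, `inject`, `run_command`) are hypothetical placeholders for the hotkey, Whisper, parser, and injection pieces.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class VoiceLoop:
    # All four callables are assumed stand-ins for the real components.
    transcribe: Callable[[bytes], str]                  # audio chunk -> transcript
    parse: Callable[[str], Tuple[str, Optional[str]]]   # transcript -> (text, command)
    inject: Callable[[str], None]                       # text -> active app
    run_command: Callable[[str], None]                  # command -> key press, etc.
    listening: bool = False

    def toggle(self) -> bool:
        """Called by the hotkey handler; flips listening on or off."""
        self.listening = not self.listening
        return self.listening

    def handle_chunk(self, audio: bytes) -> None:
        """Run one audio chunk through transcribe -> parse -> inject."""
        if not self.listening:
            return  # ignore audio while the toggle is off
        transcript = self.transcribe(audio).strip()
        if not transcript:
            return  # nothing was heard
        text, command = self.parse(transcript)
        if text:
            self.inject(text)
        if command:
            self.run_command(command)
```

Passing the components in as callables keeps the loop testable without a microphone or a model loaded.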
The important part is that this is not “speech recognition and hope for the best.” It’s speech recognition plus a command layer.
That distinction is where the hard part lives.
What the command parser has to do
Natural speech is messy. Commands are messy too.
If I say:
“Can you draft a short reply about the meeting tomorrow enter”
the system needs to understand that enter is a command, not part of the sentence. If I say:
“new line thanks for the update”
it should insert a line break and keep going. If I say:
“scratch that”
it should clear the current field instead of treating those words as content.
That sounds straightforward until you hit real transcripts. Whisper can be excellent, but it still gives you occasional weird spacing, partial words, or unexpected punctuation. A command parser has to be conservative. It should only fire when the intent is clear.
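One conservative way to do this is to match commands only at the very start or very end of an utterance, after stripping punctuation and case. The sketch below is an assumption about how such a parser could look, not the parser from the POC; `COMMANDS`, `Parsed`, and `parse` are illustrative names.

```python
import re
from typing import NamedTuple, Optional

# Command phrases taken from the examples above.
COMMANDS = ("scratch that", "new line", "enter")


class Parsed(NamedTuple):
    text: str                 # content to inject, with the command removed
    command: Optional[str]    # matched command phrase, or None
    position: Optional[str]   # "start", "end", "whole", or None


def _key(word: str) -> str:
    """Lowercase a word and drop punctuation so 'Enter.' matches 'enter'."""
    return re.sub(r"[^\w]", "", word).lower()


def parse(transcript: str) -> Parsed:
    words = transcript.split()
    keys = [_key(w) for w in words]
    for phrase in COMMANDS:
        cmd = phrase.split()
        n = len(cmd)
        if keys == cmd:
            return Parsed("", phrase, "whole")
        if len(keys) > n and keys[-n:] == cmd:
            return Parsed(" ".join(words[:-n]), phrase, "end")
        if len(keys) > n and keys[:n] == cmd:
            return Parsed(" ".join(words[n:]), phrase, "start")
    # A command word in the middle of a sentence is treated as content.
    return Parsed(transcript.strip(), None, None)


print(parse("Press enter to continue"))
# Parsed(text='Press enter to continue', command=None, position=None)
```

This still misfires on a sentence that happens to end in a command word, such as "I forgot to press enter", which is one reason to pair it with a confidence check.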
A real example
Here’s the kind of flow I’m aiming for:
Transcript:
“Hey Claude can you summarize this pull request for me enter”
Parser result:
- spoken text: Hey Claude can you summarize this pull request for me
- command: enter
- action: inject text, then press Return
Another one:
Transcript:
“Let me add one more note new line this change needs tests”
Parser result:
- spoken text: Let me add one more note
- command: new line
- action: inject text, then Shift+Return, then continue
That separation is the whole product.
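Turning a parser result into actions can be a plain lookup table. The sketch below mirrors the two examples above; the step names ("type", "key", "clear") and the `plan` function are hypothetical, and how a "clear" step would actually empty a field is left open.

```python
from typing import Dict, List, Optional, Tuple

Step = Tuple[str, str]  # (kind, payload), e.g. ("key", "return")

# Key presses per command, following the examples above.
COMMAND_STEPS: Dict[str, List[Step]] = {
    "enter": [("key", "return")],
    "new line": [("key", "shift+return")],
}


def plan(text: str, command: Optional[str] = None,
         command_first: bool = False) -> List[Step]:
    """Build the ordered list of steps for one parsed utterance."""
    if command == "scratch that":
        # Clearing the field discards any text from the same utterance.
        return [("clear", "")]
    typed: List[Step] = [("type", text)] if text else []
    # Unknown commands produce no key presses (the conservative choice).
    keys: List[Step] = list(COMMAND_STEPS.get(command, [])) if command else []
    return keys + typed if command_first else typed + keys


print(plan("Let me add one more note", "new line"))
# [('type', 'Let me add one more note'), ('key', 'shift+return')]
```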
Why I used confidence filtering
I added confidence filtering because voice input gets dangerous when the system is too eager.
If the transcription is low confidence, I would rather keep the text in a pending buffer than immediately commit it to the target app. The same goes for command detection. A missed command is annoying. A false command can wreck a form, a message, or a prompt.
So the rule is simple: don’t over-trigger.
That makes the tool feel calmer. It gives me a chance to review what was heard before it gets injected, which is exactly what I want in a workflow that could be used with Claude, ChatGPT, or any other text field.
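A pending buffer with two thresholds is one way to express that rule. This is a sketch under assumptions: the confidence value is taken to be a single score between 0 and 1 per utterance, the threshold values are made up for illustration, and `ConfidenceGate` is a hypothetical name. How the score is derived from the transcription output is not covered here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Held = Tuple[str, Optional[str]]  # (text, command) waiting for review


@dataclass
class ConfidenceGate:
    text_threshold: float = 0.6      # illustrative value, not tuned
    command_threshold: float = 0.8   # commands need a higher bar than text
    pending: List[Held] = field(default_factory=list)

    def route(self, text: str, command: Optional[str],
              confidence: float) -> Tuple[str, Optional[str]]:
        """Return (text_to_commit, command_to_run) for one utterance."""
        if confidence < self.text_threshold:
            # Too uncertain: hold everything for review, commit nothing.
            self.pending.append((text, command))
            return "", None
        if command is not None and confidence < self.command_threshold:
            # Text is trusted but the command is not: skip the command.
            return text, None
        return text, command

    def release(self) -> List[Held]:
        """Hand back held utterances after review and empty the buffer."""
        held = list(self.pending)
        self.pending.clear()
        return held
```

Setting the command threshold above the text threshold encodes the asymmetry: a skipped command costs one repeated phrase, while a false one can submit a form.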
Why clipboard injection won
For text insertion, I used clipboard paste instead of simulating every keystroke.
That’s not glamorous, but it’s the right tradeoff for a POC. Paste is faster, more reliable with Unicode, and less fragile across apps. Key simulation sounds cleaner until you hit an emoji, a non-English character, or an app that handles input weirdly.
I’m happy to take the boring path when it removes failure modes.
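In outline, clipboard injection is: save the clipboard, write the text, send the paste shortcut, restore the clipboard. The sketch below keeps the platform calls abstract; `read_clipboard`, `write_clipboard`, and `send_paste` are hypothetical callables you would back with whatever clipboard and key-event mechanism your platform provides. Restoring the previous clipboard is my assumption, not something the POC is known to do.

```python
from typing import Callable


def paste_text(text: str,
               read_clipboard: Callable[[], str],
               write_clipboard: Callable[[str], None],
               send_paste: Callable[[], None]) -> None:
    """Inject text into the focused app through the clipboard."""
    if not text:
        return  # nothing to paste
    previous = read_clipboard()   # remember what the user had copied
    write_clipboard(text)         # whole string at once, Unicode included
    try:
        send_paste()              # e.g. trigger Cmd+V in the focused app
    finally:
        write_clipboard(previous) # put the user's clipboard back
```

In a real app the restore step may need a short delay, since the target app can read the clipboard slightly after the paste shortcut is sent.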
Phase 2 is about tightening the edges
The POC is working. Phase 2 is where it gets serious.
That next step is about refining the command parser, improving confidence handling, and making the workflow feel polished enough that I’d actually use it regularly. I already have 70 tests lined up around the pipeline, which is the part I care about most: once the behavior is encoded, I can improve it without guessing.
That’s the whole reason I like a fast prototype.
You don’t just learn whether an idea is possible. You learn where the sharp edges are, and you learn them quickly.
For Gaze, the answer is yes: the core loop works. Voice goes in. Commands are detected. Text comes out. Now it’s a matter of making it trustworthy enough to disappear into the background.