Build a Local, Private Mac AI Voice Assistant

LocalClicky is an open-source voice assistant that runs entirely on your Mac.

It combines local speech recognition, local AI reasoning, and a local vision model so you can control your desktop, manage files, edit video, and create reminders.

No voice, screenshot, or command ever leaves your hardware.

You can start a session by saying “Hey Jarvis.” The app records your speech, transcribes it with Whisper.cpp, and sends the text to Ollama running a command model (qwen3:8b) that chooses the right tool.

From there, the assistant can open an app, click a UI element that a vision model (gemma4:e4b) locates in a screenshot, run a shell command, trim video with ffmpeg, or add a reminder to macOS Reminders.

The session stays active, and you can chain commands without repeating the wake word.
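The routing step described above can be pictured as a lookup from a tool name to a function that builds a command. The sketch below is only an illustration: the tool names (open_app, run_shell) and the argument keys are hypothetical and are not taken from the LocalClicky source.

```python
from typing import Callable

# Hypothetical tool names and argument keys; the project's real schema may differ.
def _open_app(args: dict) -> list[str]:
    # `open -a <name>` launches an application by name on macOS.
    return ["open", "-a", args["app"]]

def _run_shell(args: dict) -> list[str]:
    # Run the command string through a non-interactive shell.
    return ["/bin/sh", "-c", args["command"]]

TOOLS: dict[str, Callable[[dict], list[str]]] = {
    "open_app": _open_app,
    "run_shell": _run_shell,
}

def build_command(tool_call: dict) -> list[str]:
    """Translate one tool call chosen by the model into an argv list."""
    name = tool_call.get("name")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    return TOOLS[name](tool_call.get("arguments", {}))
```

The returned list can be passed to subprocess.run(..., check=True). Rejecting unknown tool names keeps a confused model reply from turning into an arbitrary command.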


Features

Lives in the menubar and shows the current state through clear icons.
Wake word “Hey Jarvis” starts a session and keeps it active after each response.
Voice Activity Detection stops recording when you stop talking.
Takes a screenshot on demand, lets the vision model find the target element, and clicks its center.
Controls Spotify playback, system volume, app launching, Chrome tabs, and file operations through AppleScript and shell commands.
Edits video via local ffmpeg: trim, mute, merge, speed up, resize, and add text.
Creates reminders with natural‑language dates and writes them directly to the macOS Reminders app.
Multi‑round tool calling runs a command, checks the result, and retries or confirms for up to five rounds.
Conversation memory keeps the last 10 exchanges for follow‑up clarification.
Session auto‑expires after 25 seconds of silence and returns to wake‑word mode.
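Two of the features above, the five-round tool loop and the 10-exchange memory, fit in a few lines. This is a minimal sketch under stated assumptions: ask_model and run_tool are hypothetical callables you supply, and the shape of the model's decision (a dict with either "tool" or "reply") is invented for illustration.

```python
from collections import deque

MAX_ROUNDS = 5          # round limit from the feature list
MEMORY_EXCHANGES = 10   # memory size from the feature list

# Each entry is one (user_text, assistant_text) exchange; the oldest drops off first.
memory: deque[tuple[str, str]] = deque(maxlen=MEMORY_EXCHANGES)

def run_turn(user_text, ask_model, run_tool):
    """Run one spoken request through up to MAX_ROUNDS model/tool rounds.

    ask_model(user_text, history, tool_results) is assumed to return either
    {"tool": name, "args": {...}} or {"reply": text}.
    """
    tool_results: list[str] = []
    reply = "I could not finish that request."
    for _ in range(MAX_ROUNDS):
        decision = ask_model(user_text, list(memory), tool_results)
        if "reply" in decision:
            reply = decision["reply"]
            break
        # Feed the tool output back so the model can confirm or retry.
        tool_results.append(run_tool(decision["tool"], decision.get("args", {})))
    memory.append((user_text, reply))
    return reply
```

A bounded deque means no manual trimming: once the eleventh exchange arrives, the first one is discarded automatically.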

Use Cases

Control media playback and system volume hands‑free while working in another app.
Automate desktop routines like opening apps, moving files, and clicking UI elements by name.
Edit video files locally with spoken commands, keeping raw footage on your own drive.
Create reminders quickly with natural time expressions without opening the Reminders app.
Run ad‑hoc shell commands or query system information by voice when the terminal isn’t reachable.
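For the reminder use case, one plausible route is a short AppleScript passed to osascript. The sketch below assumes the spoken time has already been resolved to a number of minutes from now; the function names are hypothetical, and LocalClicky may write reminders differently.

```python
def _as_quote(text: str) -> str:
    """Quote a Python string as an AppleScript string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

def reminder_script(title: str, minutes_from_now: int) -> str:
    """Build AppleScript that adds a reminder due some minutes from now."""
    return (
        # `minutes` is an AppleScript constant equal to 60 seconds.
        f"set dueDate to (current date) + ({int(minutes_from_now)} * minutes)\n"
        'tell application "Reminders"\n'
        f"    make new reminder with properties {{name:{_as_quote(title)}, due date:dueDate}}\n"
        "end tell"
    )
```

You could run the result with subprocess.run(["osascript", "-e", reminder_script("Call Sam", 30)], check=True). Computing the due date as an offset from the current date avoids locale-dependent date strings.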

Setup Requirements and Permissions

LocalClicky needs the following before it can run:

macOS 12 or later
Python 3.11 or later
Homebrew
Ollama running locally
The default qwen3:8b command model and gemma4:e4b vision model
About 8 GB of free RAM
ffmpeg for video-editing commands

macOS also requires permissions for the Python executable that launches LocalClicky:

Microphone for voice input
Screen Recording for screenshot-based requests
Accessibility for cursor movement and clicks

The project can run without ffmpeg when you do not need video editing. It can also run without webrtcvad-wheels, but recordings then fall back to a 30-second limit instead of stopping when you finish speaking.
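The difference between the two recording modes can be sketched as a single stop rule. This is an illustration, not the project's code: the 900 ms trailing-silence window is a hypothetical value, and the per-frame speech flags stand in for whatever the voice activity detector reports.

```python
FRAME_MS = 30              # webrtcvad accepts 10, 20, or 30 ms frames
MAX_RECORD_SECONDS = 30    # fixed cap described above
TRAILING_SILENCE_MS = 900  # hypothetical silence window before stopping

def frames_to_keep(speech_flags, vad_available: bool) -> int:
    """Return how many frames are recorded before the recorder stops.

    speech_flags yields one boolean per FRAME_MS frame (True = speech).
    """
    max_frames = MAX_RECORD_SECONDS * 1000 // FRAME_MS
    silence_needed = TRAILING_SILENCE_MS // FRAME_MS
    heard_speech = False
    silent_run = 0
    count = 0
    for is_speech in speech_flags:
        count += 1
        if count >= max_frames:
            break  # hard cap applies in both modes
        if not vad_available:
            continue  # no detector: only the cap can stop the recording
        if is_speech:
            heard_speech, silent_run = True, 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= silence_needed:
                break  # enough silence after speech
    return count
```

Waiting for speech before counting silence keeps the recorder from stopping during the pause between the wake word and the request.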

How to Use LocalClicky

1. Download the project and open its folder in Terminal.

2. Install Whisper.cpp and Ollama, then pull the default local models.

brew install whisper-cpp ollama
ollama pull qwen3:8b
ollama pull gemma4:e4b

3. Install ffmpeg when you plan to edit video files by voice.

brew install ffmpeg

4. Create a Python virtual environment and install the project dependencies.

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python -c "import openwakeword; openwakeword.utils.download_models()"

5. Install the optional voice activity detection package for automatic recording stop detection.

pip install webrtcvad-wheels

6. Start Ollama, then launch LocalClicky.

ollama serve &
python main.py

7. Grant Microphone, Screen Recording, and Accessibility permissions when macOS prompts you. Grant the permissions to Terminal if the virtual-environment Python executable does not appear in the system picker.

8. Say “Hey Jarvis,” then speak a request. Say “bye,” “goodbye,” “stop listening,” or “go to sleep” when you want to end the session.
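If the app fails at launch, a quick check is whether Ollama is running and both default models were pulled. The sketch below queries Ollama's /api/tags endpoint, which lists local models. The helper names are hypothetical and the script is not part of LocalClicky.

```python
import json
import urllib.request

REQUIRED_MODELS = ("qwen3:8b", "gemma4:e4b")  # defaults named in this guide

def missing_models(tags_response: dict, required=REQUIRED_MODELS) -> list[str]:
    """Return the required models that are absent from an /api/tags response."""
    installed = {m.get("name") for m in tags_response.get("models", [])}
    return [name for name in required if name not in installed]

def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Ask the local Ollama server which models it has pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

Calling missing_models(fetch_tags()) returns an empty list when both models are present. A connection error from fetch_tags usually means ollama serve is not running.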

Configuration Options

Command model and vision model: edit ollama_client.py.

  COMMAND_MODEL = "qwen3:8b"
  VISION_MODEL = "gemma4:e4b"

Wake word: edit wake_word.py.

  WAKE_MODEL = "hey_jarvis"           # pretrained model name
  WAKE_MODEL_PATH = "/path/to/custom.onnx"   # overrides WAKE_MODEL when set
  DETECTION_THRESHOLD = 0.5           # lower = more sensitive

Session idle timeout: edit companion.py.

  SESSION_IDLE_TIMEOUT = 25.0   # seconds
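To see how an idle timeout like this typically behaves, here is a minimal sketch. The Session class and its method names are hypothetical; only the 25-second value comes from the setting above.

```python
import time

SESSION_IDLE_TIMEOUT = 25.0  # seconds, matching the setting above

class Session:
    """Tracks whether a voice session is still active."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_activity = None  # None means wake-word mode

    def touch(self):
        """Call on the wake word and after each response to keep the session open."""
        self._last_activity = self._clock()

    def is_active(self) -> bool:
        if self._last_activity is None:
            return False
        if self._clock() - self._last_activity >= SESSION_IDLE_TIMEOUT:
            self._last_activity = None  # expired: back to wake-word mode
            return False
        return True
```

A monotonic clock is used so that system clock changes cannot extend or cut short a session.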

Screenshot resolution: edit screen_capture.py.

  MAX_WIDTH = 1280
  JPEG_QUALITY = 75
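Because the vision model sees a downscaled image, any location it returns has to be scaled back up before clicking. The sketch below shows the arithmetic. The (left, top, right, bottom) box format and both function names are assumptions, not the project's actual interface.

```python
MAX_WIDTH = 1280  # matches the setting above

def resized_size(width: int, height: int) -> tuple[int, int]:
    """Scale a screenshot down to MAX_WIDTH, keeping the aspect ratio."""
    if width <= MAX_WIDTH:
        return width, height
    scale = MAX_WIDTH / width
    return MAX_WIDTH, round(height * scale)

def click_point(box, screen_width: int) -> tuple[int, int]:
    """Map the center of a box in the resized image back to screen coordinates.

    box is (left, top, right, bottom) in resized-image pixels (assumed format).
    """
    left, top, right, bottom = box
    # Screens narrower than MAX_WIDTH were never downscaled, so scale stays 1.
    scale = max(screen_width / MAX_WIDTH, 1.0)
    return round((left + right) / 2 * scale), round((top + bottom) / 2 * scale)
```

For a 2560-pixel-wide capture, a box of (100, 50, 200, 100) maps to the point (300, 150). On Retina displays, screen_width should be expressed in the same units your click API uses, since screenshot pixels and screen points can differ.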

Ollama endpoint: edit ollama_client.py.

  OLLAMA_URL = "http://localhost:11434/api/chat"

Supported Models

Vision Model	Command Model	Notes
`gemma4:e4b`	`qwen3:8b`	Default; good balance
`gemma4:e4b`	`qwen3:14b`	Better reasoning, needs ~16 GB RAM
`gemma4:27b`	`qwen3:8b`	Better vision accuracy, ~32 GB RAM
`qwen2.5vl:7b`	`qwen3:8b`	Alternative vision model

Alternatives & Related Resources

NeuralAgent: Open-Source Local AI Agent for Desktop Automation
OpenHuman: Free Private Desktop AI Agent with Local Memory
Free On-Device Real-Time Voice and Vision AI – Parlor
Automate Anything: 10 Best & Open-source AI Agents
Whisper.cpp: the local speech‑recognition engine that transcribes your voice.
Ollama: the local model runner that serves qwen3 and gemma4.
ffmpeg: the multimedia framework that handles all video editing commands.
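As an example of what a spoken "trim" request might turn into, the sketch below builds an ffmpeg command line from Python. The function name and parameters are hypothetical; the flags (-ss, -i, -t, -c copy) are standard ffmpeg options, but LocalClicky's own invocation may differ.

```python
def trim_command(src: str, dst: str, start: float, duration: float) -> list[str]:
    """Build an ffmpeg argv that copies `duration` seconds starting at `start`."""
    if start < 0 or duration <= 0:
        raise ValueError("start must be >= 0 and duration must be > 0")
    return [
        "ffmpeg",
        "-ss", f"{start:g}",     # seek before opening the input (fast)
        "-i", src,
        "-t", f"{duration:g}",   # length of the output clip in seconds
        "-c", "copy",            # copy streams without re-encoding
        dst,
    ]
```

Stream copy is fast and lossless, but cuts land on keyframes, so the clip may start slightly off the requested time. Re-encoding instead of using -c copy gives frame-accurate cuts at the cost of speed.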

Pros

Runs entirely offline after the initial setup.
No API keys or subscriptions.
Open‑source under MIT license.
Menubar‑only, no Dock clutter.
Vision model clicks UI elements.
Local ffmpeg video editing.
Customizable wake word and models.

Cons

macOS only (12+).
Multi-step, command-line installation.
Needs ~8 GB free RAM.
Manual macOS permissions required.

FAQs

Q: Does LocalClicky work offline?
A: The first installation requires internet access to install dependencies and download local models. After setup, the workflow uses local Whisper transcription, local Ollama models, macOS text-to-speech, and local cursor control instead of cloud APIs.

Q: Is LocalClicky only a dictation app?
A: No. LocalClicky can interpret a spoken instruction and take supported actions on the Mac.

Q: Can LocalClicky click buttons on the screen?
A: Yes. LocalClicky can capture a screenshot, use its local vision model to identify a requested target, and click the center of the returned target area. Both the Screen Recording and Accessibility permissions are required.

Q: Which Mac models can run it comfortably?
A: Any Mac running macOS 12 or later with at least 8 GB of free RAM for the two default models can run it. Apple Silicon Macs handle the models faster, but Intel Macs can also run them.

Q: Why does the recording sometimes run for 30 seconds even when I’ve stopped speaking?
A: That happens when the optional webrtcvad‑wheels package isn’t installed. Without it, the recorder uses a fixed 30‑second cap instead of real silence detection. Run pip install webrtcvad-wheels to enable automatic stop.

Q: What should I do if the wake word never triggers?
A: Check that you’re saying “Hey Jarvis” (not “Computer”). If it still fails, lower the DETECTION_THRESHOLD in wake_word.py and watch the console for “WAKE triggered” lines. Speak clearly and at a normal pace.
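To make the threshold advice concrete: openWakeWord reports a score between 0 and 1 for each loaded model, and the app compares that score with DETECTION_THRESHOLD. The sketch below assumes a simple "greater than or equal" comparison and a hypothetical function name; the exact check in wake_word.py may differ.

```python
DETECTION_THRESHOLD = 0.5  # same default as in wake_word.py

def wake_triggered(scores: dict[str, float], threshold: float = DETECTION_THRESHOLD) -> bool:
    """Return True when any wake-word model's score reaches the threshold."""
    return any(score >= threshold for score in scores.values())
```

With the default of 0.5, a score of 0.31 for hey_jarvis is ignored, while lowering the threshold to 0.3 lets it through. Lower values also raise the chance of false triggers from background speech.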