Build a Local, Private Mac AI Voice Assistant – LocalClicky

Get a self‑hosted macOS AI voice assistant that transcribes, reasons, and clicks without sending data anywhere.

LocalClicky is an open-source voice assistant that runs entirely on your Mac.

It combines local speech recognition, local AI reasoning, and a local vision model so you can control your desktop, manage files, edit video, and create reminders.

No voice, screenshot, or command ever leaves your hardware.

You can start a session by saying “Hey Jarvis.” The app records your speech, transcribes it with Whisper.cpp, and sends the text to Ollama running a command model (qwen3:8b) that chooses the right tool.

The assistant also allows you to open an app, click a UI element after a vision model (gemma4:e4b) takes a screenshot, run a shell command, trim video with ffmpeg, or drop a reminder into macOS Reminders.

The session stays active, and you can chain commands without repeating the wake word.

Features

  • Lives in the menubar and shows the current state through clear icons.
  • Wake word “Hey Jarvis” starts a session and keeps it active after each response.
  • Voice Activity Detection stops recording when you stop talking.
  • The vision model takes a screenshot on demand, finds the target element, and clicks its center.
  • Controls Spotify playback, system volume, app launching, Chrome tabs, and file operations through AppleScript and shell commands.
  • Edits video via local ffmpeg: trim, mute, merge, speed up, resize, and add text.
  • Creates reminders with natural‑language dates and writes them directly to the macOS Reminders app.
  • Multi‑round tool calling runs a command, checks the result, and retries or confirms up to five rounds.
  • Conversation memory keeps the last 10 exchanges for follow‑up clarification.
  • Session auto‑expires after 25 seconds of silence and returns to wake‑word mode.

Use Cases

  • Control media playback and system volume hands‑free while working in another app.
  • Automate desktop routines like opening apps, moving files, and clicking UI elements by name.
  • Edit video files locally with spoken commands, keeping raw footage on your own drive.
  • Create reminders quickly with natural time expressions without opening the Reminders app.
  • Run ad‑hoc shell commands or query system information by voice when the terminal isn’t reachable.

Setup Requirements and Permissions

LocalClicky needs the following before it can run:

  • macOS 12 or later
  • Python 3.11 or later
  • Homebrew
  • Ollama running locally
  • The default qwen3:8b command model and gemma4:e4b vision model
  • About 8 GB of free RAM
  • ffmpeg for video-editing commands

macOS also requires permissions for the Python executable that launches LocalClicky:

  • Microphone for voice input
  • Screen Recording for screenshot-based requests
  • Accessibility for cursor movement and clicks

The project can run without ffmpeg when you do not need video editing. It can also run without webrtcvad-wheels, but recordings then fall back to a 30-second limit instead of stopping when you finish speaking.

How to Use LocalClicky

1. Download the project and open its folder in Terminal.

2. Install Whisper.cpp and Ollama, then pull the default local models.

    brew install whisper-cpp ollama
    ollama pull qwen3:8b
    ollama pull gemma4:e4b

    3. Install ffmpeg when you plan to edit video files by voice.

      brew install ffmpeg

      4. Create a Python virtual environment and install the project dependencies.

        python3 -m venv venv
        source venv/bin/activate
        pip install -r requirements.txt
        python -c "import openwakeword; openwakeword.utils.download_models()"

        5. Install the optional voice activity detection package for automatic recording stop detection.

          pip install webrtcvad-wheels

          6. Start Ollama, then launch LocalClicky.

            ollama serve &
            python main.py

            7. Grant Microphone, Screen Recording, and Accessibility permission when macOS prompts you. Grant the permissions to Terminal if the virtual-environment Python executable does not appear in the system picker.

            8. Say “Hey Jarvis,” then speak a request. Say “bye,” “goodbye,” “stop listening,” or “go to sleep” when you want to end the session.

              Configuration Options

              Command model and vision model: edit ollama_client.py.

                COMMAND_MODEL = "qwen3:8b"
                VISION_MODEL = "gemma4:e4b"

              Wake word: change wake_word.py.

                WAKE_MODEL = "hey_jarvis"           # pretrained model name
                WAKE_MODEL_PATH = "/path/to/custom.onnx"   # overrides WAKE_MODEL when set
                DETECTION_THRESHOLD = 0.5           # lower = more sensitive

              Session idle timeout: edit companion.py.

                SESSION_IDLE_TIMEOUT = 25.0   # seconds

              Screenshot resolution: edit screen_capture.py.

                MAX_WIDTH = 1280
                JPEG_QUALITY = 75

              Ollama endpoint: edit ollama_client.py.

                OLLAMA_URL = "http://localhost:11434/api/chat"

              Supported Models

              Vision ModelCommand ModelNotes
              gemma4:e4bqwen3:8bDefault; good balance
              gemma4:e4bqwen3:14bBetter reasoning, needs ~16 GB RAM
              gemma4:27bqwen3:8bBetter vision accuracy, ~32 GB RAM
              qwen2.5vl:7bqwen3:8bAlternative vision model

              Alternatives & Related Resources

              Pros

              • Runs entirely offline.
              • No API keys or subscriptions.
              • Open‑source under MIT license.
              • Menubar‑only, no Dock clutter.
              • Vision model clicks UI elements.
              • Local ffmpeg video editing.
              • Customizable wake word and models.

              Cons

              • macOS only (12+).
              • Heavy installation steps.
              • Needs ~8 GB free RAM.
              • Manual macOS permissions required.

              FAQs

              Q: Does LocalClicky work offline?
              A: The first installation requires internet access to install dependencies and download local models. After setup, the workflow uses local Whisper transcription, local Ollama models, macOS text-to-speech, and local cursor control instead of cloud APIs.

              Q: Is LocalClicky only a dictation app?
              A: No. LocalClicky can interpret a spoken instruction and take supported actions on the Mac.

              Q: Can LocalClicky click buttons on the screen?
              A: Yes. LocalClicky can capture a screenshot, use its local vision model to identify a requested target, and click the center of the returned target area. Screen Recording and Accessibility permission are required.

              Q: Which Mac models can run it comfortably?
              A: Any Mac with macOS 12 or later and at least 8 GB of free RAM for the two models. Apple Silicon Macs handle the models faster, but Intel Macs can also run them.

              Q: Why does the recording sometimes run for 30 seconds even when I’ve stopped speaking?
              A: That happens when the optional webrtcvad‑wheels package isn’t installed. Without it, the recorder uses a fixed 30‑second cap instead of real silence detection. Install pip install webrtcvad-wheels to enable automatic stop.

              Q: What should I do if the wake word never triggers?
              A: Check that you’re saying “Hey Jarvis” (not “Computer”). If it still fails, lower the DETECTION_THRESHOLD in wake_word.py and watch the console for “WAKE triggered” lines. Speak clearly and at a normal pace.

              Leave a Reply

              Your email address will not be published. Required fields are marked *

              Get the latest & top AI tools sent directly to your email.

              Subscribe now to explore the latest & top AI tools and resources, all in one convenient newsletter. No spam, we promise!