Local AI Voice Companion – Realtime Voice Chat

Open-source toolkit for natural voice AI conversations with around 500 ms latency. Supports local LLMs and custom TTS voices.

RealtimeVoiceChat is an open-source project that lets you have fluid, spoken conversations with an AI model.

Unlike clunky voice assistants that make you wait between turns, this tool delivers near-human response times around 500ms by stitching together several cutting-edge AI components.

It’s the kind of system that makes you forget you’re talking to a machine – the responses come so fast you can actually interrupt the AI mid-sentence, just like a real conversation.

See It In Action

Demo from RealtimeVoiceChat’s official GitHub repo.

Features

  • Fluid Conversation: The main selling point. Speak and listen, aiming for a natural conversational flow.
  • Real-Time Feedback: You can see partial transcriptions as you speak and even the AI’s text response as it’s being generated, before the audio is ready. This helps reduce that “am I talking to a void?” feeling.
  • Low Latency Focus: The architecture is built around streaming audio chunks via WebSockets to keep delays minimal.
  • Smart Turn-Taking: It uses dynamic silence detection (turndetect.py) to figure out when you’ve finished your thought, adapting to the conversation’s pace (a sketch of the idea follows this list). This is more sophisticated than a fixed pause timer.
  • Flexible AI Brains: You’re not locked into one LLM. It defaults to Ollama, letting you run local models (which is great for privacy and cost), but it also supports OpenAI if you prefer cloud-based models. This is handled by llm_module.py.
  • Customizable Voices: You get choices for Text-to-Speech engines. The audio_module.py allows you to switch between options like Kokoro, Coqui, or Orpheus.
  • Web Interface: A clean, no-frills UI using Vanilla JavaScript and the Web Audio API.
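
To make the turn-taking idea concrete, here’s a minimal sketch of adaptive silence detection – not the project’s actual turndetect.py code, just the principle: the pause the detector waits for shrinks as the exchange speeds up, so quick back-and-forths get snappier responses. All names here are hypothetical.

    import time

    class AdaptiveTurnDetector:
        """Ends the user's turn after a silence window that adapts to pace."""

        def __init__(self, base_silence=1.2, min_silence=0.4, decay=0.9):
            self.required_silence = base_silence  # seconds of quiet that end a turn
            self.min_silence = min_silence        # never cut in faster than this
            self.decay = decay                    # shrink factor for quick exchanges
            self.last_voice_ts = None

        def on_voice_activity(self):
            # Call whenever the VAD flags speech in the incoming audio chunk.
            self.last_voice_ts = time.monotonic()

        def on_turn_completed(self):
            # Fast back-and-forth -> tolerate shorter pauses on the next turn.
            self.required_silence = max(self.min_silence,
                                        self.required_silence * self.decay)

        def turn_finished(self):
            # True once the user has been quiet long enough for the current pace.
            return (self.last_voice_ts is not None and
                    time.monotonic() - self.last_voice_ts >= self.required_silence)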

Use Cases

  • Personal AI Companion: Set it up with a local LLM via Ollama, tweak system_prompt.txt to give your AI a specific personality (an example prompt follows this list), and just chat. It’s quite different from typing at a model.
  • Voice-Controlled Information Kiosk: Imagine a scenario where you need hands-free access to information. You could adapt this to query specific knowledge bases or APIs via voice.
  • Accessibility Tool Exploration: For users who find typing difficult, a robust voice chat interface to an LLM could be a significant enabler.
  • Language Learning Practice: Configure the LLM to act as a conversation partner in a language you’re learning. The real-time nature could make practice more engaging.
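
For that companion use case, the personality really is just text in system_prompt.txt. A purely illustrative example (not the prompt that ships with the project):

    You are Ada, a relaxed, slightly sarcastic conversation partner.
    Speak the way people talk: one or two short sentences, no lists, no markdown.
    Ask a brief follow-up question when the user seems unsure.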

How to Use It

1. Install prerequisites:

    • For best performance, use a machine with a CUDA-enabled NVIDIA GPU
    • You’ll need Python 3.9+ if installing manually
    • Docker is recommended for simpler setup

2. Get the code:

    git clone https://github.com/KoljaB/RealtimeVoiceChat.git
    cd RealtimeVoiceChat
3. Choose your installation method:

    Docker installation (recommended for Linux/GPU):

    # Build the Docker images
    docker compose build
    # Start the services
    docker compose up -d
    # Pull your desired Ollama model
    docker compose exec ollama ollama pull hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M
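    # Optional: follow the logs to confirm the services came up
    docker compose logs -f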

    Manual installation:

    # On Windows, the provided script can handle setup
    install.bat

    # Or follow the manual steps (Linux/macOS/Windows):
    python -m venv venv
    source venv/bin/activate  # Linux/macOS
    .\venv\Scripts\activate   # Windows
    cd code
    pip install torch==2.5.1+cu121 torchaudio==2.5.1+cu121 torchvision --index-url https://download.pytorch.org/whl/cu121
    pip install -r requirements.txt
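    # Optional sanity check: PyTorch should report the GPU as available (True)
    python -c "import torch; print(torch.cuda.is_available())"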

4. Run the application:

    • With Docker, it’s already running after docker compose up -d
    • If installed manually:

      # Navigate to the code directory
      cd code
      # Start the server
      python server.py

5. Access the interface:

    • Open your browser to http://localhost:8000
    • Grant microphone permissions when prompted
    • Click “Start” to begin chatting
    • Use “Stop” to end and “Reset” to clear the conversation

6. Customize as needed:

    • Edit configuration in Python files under the code/ directory
    • Change TTS engine in server.py and audio_module.py
    • Switch LLM backend in server.py and llm_module.py
    • Adjust STT settings in transcribe.py
    • Modify turn detection sensitivity in turndetect.py

Pros

  • Genuinely natural flow – The turn detection and interruption handling make conversations feel much more like talking with a person
  • Open source – Complete access to the code for customization and learning
  • Modular design – Easy to swap components like TTS engines or AI backends
  • Web-based interface – No special app needed beyond a modern browser
  • Local processing – Can run entirely on your own hardware
  • Docker support – Simplifies deployment and dependency management

Cons

  • Hardware requirements – Needs a decent GPU for good performance, especially for speech synthesis
  • Technical setup – Not plug-and-play; requires comfort with the command line and possibly Docker
  • Limited mobile support – Primarily designed for desktop browsers
  • Dependency complexity – Multiple ML libraries with their own requirements and potential conflicts
  • Configuration requires code edits – No simple settings UI for changing voices or models


FAQs

Q: Can I run RealtimeVoiceChat without an NVIDIA GPU?
A: Technically, yes – the components can fall back to CPU. However, performance for STT and TTS will be significantly slower, and you likely won’t achieve that “real-time” conversational feel. A powerful CUDA-enabled NVIDIA GPU is highly recommended.

Q: What LLMs are supported out of the box?
A: It’s primarily set up for Ollama, which allows you to run a wide variety of open-source LLMs locally (like Mistral, Llama, etc.). There’s also a connector for OpenAI models if you have an API key and prefer to use their services.
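
If you want to sanity-check that your local Ollama server is reachable before wiring it into the app, you can call Ollama’s HTTP API (default port 11434) directly. A minimal sketch using the requests package – substitute whichever model you actually pulled for "mistral":

    import requests

    # One-off, non-streaming completion from the local Ollama server.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral", "prompt": "Say hello in five words.", "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["response"])  # the generated text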

Q: How do I change the AI’s voice?
A: You’ll need to edit server.py to change the START_ENGINE variable to your desired TTS engine (e.g., "coqui", "kokoro", "orpheus"). Then, you’d go into audio_module.py and adjust the engine-specific settings within the AudioProcessor.__init__ method, like voice model paths or speaker IDs.
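
As a rough illustration of the kind of edit that answer describes – START_ENGINE comes from the project itself, but the commented settings below are hypothetical placeholders, not the literal file contents:

    # server.py – pick the TTS engine by name
    START_ENGINE = "kokoro"   # or "coqui", "orpheus"

    # audio_module.py – engine-specific settings live in AudioProcessor.__init__;
    # the exact fields vary by engine, e.g. (hypothetical):
    # self.speaker_wav = "voices/my_speaker.wav"  # Coqui reference sample
    # self.voice = "af_heart"                     # Kokoro voice ID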

Q: Is it difficult to set up for someone new to Docker?
A: If you’re new to Docker, there might be a slight learning curve, but the docker compose build and docker compose up -d commands do most of the heavy lifting. The main challenge might be ensuring Docker itself is installed correctly and, if you’re using a GPU, that the NVIDIA Container Toolkit is set up so Docker can access the GPU. The provided docker-compose.yml handles the service definitions.

Q: How much RAM do I need?
A: At least 8GB, but 16GB or more is recommended, especially when running the LLM and TTS models together.

Try It Yourself

If you’ve been looking for a more natural way to interact with AI through conversation, RealtimeVoiceChat is worth setting up. The initial configuration takes some work, but the resulting experience feels significantly more fluid than most voice interfaces I’ve used.

The project is actively developed on GitHub, so if you run into issues or have ideas for improvements, the community there can help. I’d love to hear about your experiences if you give it a try!
