AI Voice Assistant Gateway (ESP32-S3 + Docker)
🎯 Project Objective
A private, local voice assistant system that separates high-bandwidth audio streaming from smart home command logic.
- Frontend: ESP32-S3 acts as a dumb "microphone-to-MQTT" bridge.
- Backend: Docker container performs Wake Word detection (OpenWakeWord) and Speech-to-Text (Faster-Whisper).
- Controller: Home Assistant receives clean text commands for execution.
🏗️ Architecture
The system uses a Dual-Broker strategy to prevent flooding Home Assistant with raw audio data.
graph LR
    Mic[INMP441] -->|I2S| ESP32
    ESP32 -->|Raw Audio (Voice Broker)| MQTT_Voice[Mosquitto .13]
    subgraph Docker Server [.13]
        MQTT_Voice -->|Sub Audio| Bridge[Python Bridge]
        Bridge -->|HTTP| Whisper[Whisper API]
        Whisper -->|Text| Bridge
    end
    Bridge -->|Text Command (HA Broker)| MQTT_HA[Mosquitto .30]
    MQTT_HA -->|Trigger| HA[Home Assistant]
    HA -->|Action| Lights/Scripts
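To see why raw audio gets its own broker, it helps to estimate the data rate of the stream. The sketch below is only a back-of-the-envelope calculation: it assumes 16 kHz, 16-bit mono PCM (the format Whisper models work with; the firmware's actual capture settings are not stated here) and a hypothetical 1024-byte MQTT payload.

```python
def pcm_bytes_per_second(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Raw PCM data rate in bytes per second."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


def messages_per_second(bytes_per_second: int, payload_bytes: int) -> float:
    """Number of MQTT messages needed per second for a given payload size."""
    return bytes_per_second / payload_bytes


# Assumed values, not taken from the firmware.
RATE = pcm_bytes_per_second(sample_rate_hz=16_000, bits_per_sample=16, channels=1)
MSGS = messages_per_second(RATE, payload_bytes=1024)

if __name__ == "__main__":
    print(f"{RATE} B/s, {MSGS} MQTT messages/s while someone is speaking")
```

Under these assumptions one device publishes about 31 binary messages per second during speech, while the resulting text command is a single small message. Keeping the first kind of traffic off the Home Assistant broker is the point of the Dual-Broker split.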
📂 File Structure
1. Firmware (Client)
Location: ~/Documents/Arduino/voice_assistant/
- voice_assistant.ino: Main ESP32 firmware. Handles I2S reading, VAD (Voice Activity Detection), and MQTT streaming.
- Dependencies: PubSubClient, Freenove_WS2812_Lib_for_ESP32.
2. Backend (Server)
Location: ~/voice_bridge/ (On Server .13)
- docker-compose.yml: Orchestrates the Whisper and Bridge containers.
- mqtt_audio_bridge.py: The "Brain".
  - Listens to audio on the Voice Broker (.13).
  - Buffers audio and checks for the Wake Word ("Hey Jarvis").
  - Sends valid audio to Whisper.
  - Publishes the resulting text to the Home Assistant Broker (.30).
- app.py: A lightweight Flask wrapper for faster-whisper (running in a separate container).
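The bridge's buffer, check, transcribe, publish flow can be sketched roughly as follows. This is not the code in mqtt_audio_bridge.py: the class name and the three callables are hypothetical stand-ins for the OpenWakeWord check, the HTTP call to the Whisper container, and the publish to the Home Assistant broker, and the sketch assumes the wake word is checked once per utterance when the end-of-stream marker arrives.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class UtteranceBuffer:
    """Collects audio chunks for one utterance and processes them at end of stream."""

    detect_wake_word: Callable[[bytes], bool]  # hypothetical hook (e.g. wraps OpenWakeWord)
    transcribe: Callable[[bytes], str]         # hypothetical hook (e.g. HTTP call to Whisper)
    publish_text: Callable[[str], None]        # hypothetical hook (e.g. publish to HA broker)
    _chunks: List[bytes] = field(default_factory=list, init=False)

    def on_audio(self, chunk: bytes) -> None:
        """Called for every audio message received from the voice broker."""
        self._chunks.append(chunk)

    def on_end_of_stream(self) -> Optional[str]:
        """Process the buffered utterance; return the published text or None."""
        audio = b"".join(self._chunks)
        self._chunks.clear()  # always reset so the next utterance starts clean
        if not audio or not self.detect_wake_word(audio):
            return None       # no wake word: drop the audio, publish nothing
        text = self.transcribe(audio).strip()
        if not text:
            return None
        self.publish_text(text)
        return text


if __name__ == "__main__":
    sent = []
    buf = UtteranceBuffer(
        detect_wake_word=lambda audio: audio.startswith(b"WAKE"),  # fake detector
        transcribe=lambda audio: " turn on the lights ",           # fake STT result
        publish_text=sent.append,
    )
    buf.on_audio(b"WAKE")
    buf.on_audio(b"\x00\x01")
    buf.on_end_of_stream()
    print(sent)
```

Injecting the three steps as callables keeps the buffering logic testable without a broker, a model, or a network connection.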
🚦 LED Status Indicators (ESP32)
The LED Ring provides real-time feedback on the device state.
| Color | State | Trigger |
|---|---|---|
| Off | Idle | Listening for sound (VAD). |
| Green | Recording | Sound detected above threshold. Streaming audio. |
| Blue | Processing | Silence detected. End of stream sent to server. |
| Rainbow | Acknowledged | "Wake Word" detected by server OR Command executed. |
| Red | Error | Wi-Fi or MQTT connection loss. |
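The table can be read as a small event-to-state mapping. The Python sketch below only illustrates that reading; the firmware itself is an Arduino sketch, and the event names are hypothetical labels for the triggers listed above. How the device returns to Idle is not described in the table, so that transition is left out.

```python
from enum import Enum


class LedState(Enum):
    IDLE = "off"
    RECORDING = "green"
    PROCESSING = "blue"
    ACKNOWLEDGED = "rainbow"
    ERROR = "red"


# Hypothetical event names, one per trigger in the table.
TRANSITIONS = {
    "sound_above_threshold": LedState.RECORDING,
    "silence_detected": LedState.PROCESSING,
    "wake_word_detected": LedState.ACKNOWLEDGED,
    "command_executed": LedState.ACKNOWLEDGED,
    "connection_lost": LedState.ERROR,
}


def next_state(current: LedState, event: str) -> LedState:
    """Return the LED state after an event; unknown events keep the current state."""
    return TRANSITIONS.get(event, current)


if __name__ == "__main__":
    state = LedState.IDLE
    for event in ("sound_above_threshold", "silence_detected", "wake_word_detected"):
        state = next_state(state, event)
        print(event, "->", state.value)
```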
⚙️ Configuration Details
Hardware Config (ESP32-S3)
- Mic SCK: GPIO 4
- Mic WS: GPIO 5
- Mic SD: GPIO 6
- LED Pin: GPIO 5
- VAD Threshold: 1000 (Adjust based on noise floor)
- Soft Gain: x6 (Multiplies raw input to match Whisper requirements)
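As an illustration of how the two tuning values interact, here is a minimal Python sketch of an amplitude-based VAD check and a clipped software gain. It is not the firmware code: it assumes signed 16-bit samples, a peak-amplitude comparison against the threshold, and clipping after the multiply, none of which is confirmed by the firmware. Whether the threshold is applied before or after the gain is also an assumption left to the caller.

```python
import struct

VAD_THRESHOLD = 1000  # from the hardware config above
SOFT_GAIN = 6         # from the hardware config above
INT16_MIN, INT16_MAX = -32768, 32767


def apply_soft_gain(samples, gain=SOFT_GAIN):
    """Multiply each sample by the gain and clip to the signed 16-bit range."""
    return [max(INT16_MIN, min(INT16_MAX, s * gain)) for s in samples]


def is_voice(samples, threshold=VAD_THRESHOLD):
    """Amplitude VAD: True if the loudest sample exceeds the threshold."""
    return max((abs(s) for s in samples), default=0) > threshold


def to_pcm_bytes(samples):
    """Pack samples as little-endian signed 16-bit PCM."""
    return struct.pack(f"<{len(samples)}h", *samples)


if __name__ == "__main__":
    frame = [120, -340, 1500, -6000]
    if is_voice(frame):
        payload = to_pcm_bytes(apply_soft_gain(frame))
        print(len(payload), "bytes ready to publish")
```

Clipping matters here: with a gain of 6, any raw sample beyond roughly ±5461 saturates, so raising the gain further trades quiet-speech audibility for distortion on loud speech.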
Docker Environment
- Network: Internal bridge network between Bridge and Whisper.
- Models:
  - Wake Word: hey_jarvis (OpenWakeWord)
  - STT: small.en (Faster-Whisper, int8 quantization)
🚀 Setup & Commands
1. Flash ESP32
arduino-cli upload -v -p /dev/ttyACM0 --fqbn esp32:esp32:esp32s3:CDCOnBoot=cdc,USBMode=hwcdc,PSRAM=disabled,FlashMode=dio,DebugLevel=info --input-dir ./build
2. Start Backend
cd ~/voice_bridge
docker compose up -d
3. Monitor Logs
docker compose logs -f voice_bridge
4. Monitor Audio Stream (Debug)
mosquitto_sub -h 192.168.20.13 -u "mqtt-user" -P "pass" -t "voice/audio_stream" | awk '{printf "."}'