AI Voice Assistant Gateway (ESP32-S3 + Docker)

🎯 Project Objective

A private, local voice assistant system that separates high-bandwidth audio streaming from smart home command logic.

  • Frontend: ESP32-S3 acts as a dumb "microphone-to-MQTT" bridge.
  • Backend: Docker containers perform wake word detection (OpenWakeWord) and speech-to-text (Faster-Whisper).
  • Controller: Home Assistant receives clean text commands for execution.

🏗️ Architecture

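The system uses a dual-broker strategy so that Home Assistant is never flooded with raw audio. The ESP32 streams audio to a dedicated Voice Broker (Mosquitto on .13), and only the transcribed text command is published to the Home Assistant Broker (Mosquitto on .30).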

graph LR
    Mic[INMP441] -->|I2S| ESP32
    ESP32 -->|"Raw Audio (Voice Broker)"| MQTT_Voice["Mosquitto .13"]

    subgraph Docker["Docker Server .13"]
        MQTT_Voice -->|Sub Audio| Bridge[Python Bridge]
        Bridge -->|HTTP| Whisper[Whisper API]
        Whisper -->|Text| Bridge
    end

    Bridge -->|"Text Command (HA Broker)"| MQTT_HA["Mosquitto .30"]
    MQTT_HA -->|Trigger| HA[Home Assistant]
    HA -->|Action| Actions["Lights/Scripts"]

📂 File Structure

1. Firmware (Client)

Location: ~/Documents/Arduino/voice_assistant/

  • voice_assistant.ino: Main ESP32 firmware. Handles I2S reading, VAD (Voice Activity Detection), and MQTT streaming.
  • Dependencies: PubSubClient, Freenove_WS2812_Lib_for_ESP32.

2. Backend (Server)

Location: ~/voice_bridge/ (On Server .13)

  • docker-compose.yml: Orchestrates the Whisper and Bridge containers.
  • mqtt_audio_bridge.py: The "brain" of the system. It:
    • Subscribes to the audio stream on the Voice Broker (.13).
    • Buffers the audio and checks it for the wake word ("Hey Jarvis").
    • Sends audio that contains the wake word to Whisper for transcription.
    • Publishes the resulting text to the Home Assistant Broker (.30).
  • app.py: A lightweight Flask wrapper around faster-whisper (runs in a separate container).
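
The bridge steps listed above can be sketched as follows. This is a minimal illustration, not the actual mqtt_audio_bridge.py: the class name AudioBridge, its method names, and the three injected callables are hypothetical stand-ins for the MQTT subscription, OpenWakeWord, the Whisper HTTP call, and the publish to the Home Assistant Broker.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AudioBridge:
    """Buffers audio chunks and publishes a transcript once the stream ends."""

    detect_wake_word: Callable[[bytes], bool]  # stand-in for OpenWakeWord
    transcribe: Callable[[bytes], str]         # stand-in for the Whisper HTTP call
    publish_text: Callable[[str], None]        # stand-in for publishing to the HA broker
    _buffer: bytearray = field(default_factory=bytearray)

    def on_audio_chunk(self, payload: bytes) -> None:
        """Called for every message received on the audio topic."""
        self._buffer.extend(payload)

    def on_end_of_stream(self) -> Optional[str]:
        """Called when the ESP32 signals end of stream; returns the published text, if any."""
        audio = bytes(self._buffer)
        self._buffer.clear()
        if not audio or not self.detect_wake_word(audio):
            return None  # no wake word: discard the audio
        text = self.transcribe(audio).strip()
        if not text:
            return None  # nothing intelligible: publish nothing
        self.publish_text(text)
        return text


# Usage with fake components (no broker or model required)
published: List[str] = []
bridge = AudioBridge(
    detect_wake_word=lambda audio: audio.startswith(b"JARVIS"),
    transcribe=lambda audio: " turn on the lights ",
    publish_text=published.append,
)
bridge.on_audio_chunk(b"JARVIS")
bridge.on_audio_chunk(b"\x00\x01")
bridge.on_end_of_stream()
```

Injecting the wake word, transcription, and publish steps as callables keeps the buffering logic testable without a running broker or model.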

🚦 LED Status Indicators (ESP32)

The LED Ring provides real-time feedback on the device state.

| Color   | State        | Trigger                                                  |
|---------|--------------|----------------------------------------------------------|
| Off     | Idle         | Listening for sound (VAD).                               |
| Green   | Recording    | Sound detected above threshold. Streaming audio.         |
| Blue    | Processing   | Silence detected. End of stream sent to server.          |
| Rainbow | Acknowledged | Wake word detected by the server, or command executed.   |
| Red     | Error        | Wi-Fi or MQTT connection loss.                           |
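
The LED states can be read as a small state machine. The sketch below models it in Python for illustration only; the real firmware is Arduino C++. The event names and the "timeout" and "reconnected" transitions are assumptions, not taken from voice_assistant.ino.

```python
from enum import Enum


class State(Enum):
    """Device states; the value is the LED color shown for that state."""
    IDLE = "off"
    RECORDING = "green"
    PROCESSING = "blue"
    ACKNOWLEDGED = "rainbow"
    ERROR = "red"


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.IDLE, "sound_detected"): State.RECORDING,
    (State.RECORDING, "silence_detected"): State.PROCESSING,
    (State.PROCESSING, "server_ack"): State.ACKNOWLEDGED,
    (State.PROCESSING, "timeout"): State.IDLE,      # assumed: no reply from the server
    (State.ACKNOWLEDGED, "done"): State.IDLE,
    (State.ERROR, "reconnected"): State.IDLE,       # assumed recovery path
}


def next_state(state: State, event: str) -> State:
    """Return the next state; unknown events leave the state unchanged."""
    if event == "connection_lost":
        return State.ERROR  # Wi-Fi or MQTT loss overrides any state
    return TRANSITIONS.get((state, event), state)
```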

⚙️ Configuration Details

Hardware Config (ESP32-S3)

  • Mic SCK: GPIO 4
  • Mic WS: GPIO 5
  • Mic SD: GPIO 6
  • LED Pin: GPIO 5
  • VAD Threshold: 1000 (adjust to the ambient noise floor)
  • Soft Gain: x6 (raw samples are multiplied by 6 in software to raise the input level for Whisper)
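
To illustrate how the VAD threshold and the soft gain interact, here is a small Python model of the arithmetic. It assumes 16-bit signed samples, a simple peak-amplitude check, and clipping after the gain is applied. None of these details are confirmed by the firmware, and the function names are hypothetical.

```python
from typing import List

VAD_THRESHOLD = 1000   # from the hardware config above
SOFT_GAIN = 6          # from the hardware config above
INT16_MIN, INT16_MAX = -32768, 32767


def apply_soft_gain(samples: List[int], gain: int = SOFT_GAIN) -> List[int]:
    """Multiply each sample by the gain, clipping to the 16-bit signed range."""
    return [max(INT16_MIN, min(INT16_MAX, s * gain)) for s in samples]


def is_voice(samples: List[int], threshold: int = VAD_THRESHOLD) -> bool:
    """Peak-amplitude VAD: True if any sample exceeds the threshold."""
    return bool(samples) and max(abs(s) for s in samples) > threshold
```

With a gain of 6, any raw sample beyond roughly ±5461 clips, so a very loud input may distort. If the noise floor alone triggers recording, raising the threshold is the first thing to try.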

Docker Environment

  • Network: Internal bridge network between Bridge and Whisper.
  • Models:
    • Wake Word: hey_jarvis (OpenWakeWord)
    • STT: small.en (Faster-Whisper, int8 quantization)

🚀 Setup & Commands

1. Flash ESP32

arduino-cli upload -v -p /dev/ttyACM0 --fqbn esp32:esp32:esp32s3:CDCOnBoot=cdc,USBMode=hwcdc,PSRAM=disabled,FlashMode=dio,DebugLevel=info --input-dir ./build

2. Start Backend

cd ~/voice_bridge
docker compose up -d

3. Monitor Logs

docker compose logs -f voice_bridge

4. Monitor Audio Stream (Debug)

mosquitto_sub -h 192.168.20.13 -u "mqtt-user" -P "pass" -t "voice/audio_stream" | awk '{printf "."}'