# AI Voice Assistant Gateway (ESP32-S3 + Docker)

## 🎯 Project Objective

A private, local voice assistant system that separates high-bandwidth audio streaming from smart home command logic.

- **Frontend:** ESP32-S3 acts as a dumb "microphone-to-MQTT" bridge.
- **Backend:** Docker container performs Wake Word detection (OpenWakeWord) and Speech-to-Text (Faster-Whisper).
- **Controller:** Home Assistant receives clean text commands for execution.

## 🏗️ Architecture

The system uses a **Dual-Broker** strategy: raw audio stays on a dedicated voice broker (`.13`), and only transcribed text commands reach the Home Assistant broker (`.30`). This prevents flooding Home Assistant with raw audio data.

```mermaid
graph LR
    Mic[INMP441] -->|I2S| ESP32
    ESP32 -->|"Raw Audio (Voice Broker)"| MQTT_Voice["Mosquitto .13"]

    subgraph Docker["Docker Server (.13)"]
        MQTT_Voice -->|Sub Audio| Bridge[Python Bridge]
        Bridge -->|HTTP| Whisper[Whisper API]
        Whisper -->|Text| Bridge
    end

    Bridge -->|"Text Command (HA Broker)"| MQTT_HA["Mosquitto .30"]
    MQTT_HA -->|Trigger| HA[Home Assistant]
    HA -->|Action| Actions["Lights/Scripts"]
```

## 📂 File Structure

### 1. Firmware (Client)

Location: `~/Documents/Arduino/voice_assistant/`

* **`voice_assistant.ino`**: Main ESP32 firmware. Handles I2S reading, VAD (Voice Activity Detection), and MQTT streaming.
* **Dependencies**: `PubSubClient`, `Freenove_WS2812_Lib_for_ESP32`.

### 2. Backend (Server)

Location: `~/voice_bridge/` (On Server .13)

* **`docker-compose.yml`**: Orchestrates the Whisper and Bridge containers.
* **`mqtt_audio_bridge.py`**: The "Brain".
    * Listens to Audio on `.13`.
    * Buffers audio and checks for Wake Word ("Hey Jarvis").
    * Sends valid audio to Whisper.
    * Publishes resulting text to Home Assistant Broker `.30`.
* **`app.py`**: A lightweight Flask wrapper for `faster-whisper` (running in separate container).

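
The bridge steps listed above (buffer, wake word check, transcribe, publish) can be sketched roughly as follows. This is not the code in `mqtt_audio_bridge.py`; it is a minimal illustration that assumes the firmware marks the end of an utterance with a dedicated payload. The `AudioBridge` class, the `END_OF_STREAM` marker, and the three injected callables are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical end-of-utterance marker; the real firmware may signal this differently.
END_OF_STREAM = b"END"


@dataclass
class AudioBridge:
    """Buffers audio chunks and forwards text only when a wake word was heard."""

    detect_wake_word: Callable[[bytes], bool]  # stand-in for the OpenWakeWord check
    transcribe: Callable[[bytes], str]         # stand-in for the HTTP call to Whisper
    publish_text: Callable[[str], None]        # stand-in for publishing to the HA broker
    _buffer: bytearray = field(default_factory=bytearray, init=False)

    def on_audio_message(self, payload: bytes) -> Optional[str]:
        """Handle one message from the voice broker; return text if one was published."""
        if payload != END_OF_STREAM:
            # Still inside an utterance: keep collecting raw audio.
            self._buffer.extend(payload)
            return None

        # Utterance finished: take the buffered audio and reset for the next one.
        audio = bytes(self._buffer)
        self._buffer.clear()

        # Drop audio that is empty or does not contain the wake word.
        if not audio or not self.detect_wake_word(audio):
            return None

        text = self.transcribe(audio).strip()
        if text:
            self.publish_text(text)
            return text
        return None
```

In a real deployment the three callables would wrap the wake word model, the request to the Whisper container, and an MQTT publish to the `.30` broker. Keeping them injectable means the buffering logic can be exercised without any broker running.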
## 🚦 LED Status Indicators (ESP32)

The LED Ring provides real-time feedback on the device state.

| Color | State | Trigger |
| :--- | :--- | :--- |
| **Off** | Idle | Listening for sound (VAD). |
| **Green** | Recording | Sound detected above threshold. Streaming audio. |
| **Blue** | Processing | Silence detected. End of stream sent to server. |
| **Rainbow** | Acknowledged | "Wake Word" detected by server OR Command executed. |
| **Red** | Error | Wi-Fi or MQTT connection loss. |

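
The table above reads naturally as a small state machine. The sketch below models it in Python for illustration only (the firmware itself is an Arduino sketch). The event names are hypothetical labels for the triggers in the table, and the transitions back to Idle (after the rainbow animation, after a reconnect, or after a timeout with no acknowledgement) are assumptions, since the table does not describe them.

```python
from enum import Enum


class LedState(Enum):
    """Device states from the LED table; the value is the LED color."""

    IDLE = "off"
    RECORDING = "green"
    PROCESSING = "blue"
    ACKNOWLEDGED = "rainbow"
    ERROR = "red"


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (LedState.IDLE, "sound_above_threshold"): LedState.RECORDING,
    (LedState.RECORDING, "silence_detected"): LedState.PROCESSING,
    (LedState.PROCESSING, "server_ack"): LedState.ACKNOWLEDGED,
    # The three transitions below are assumptions, not taken from the table.
    (LedState.PROCESSING, "timeout"): LedState.IDLE,
    (LedState.ACKNOWLEDGED, "animation_done"): LedState.IDLE,
    (LedState.ERROR, "reconnected"): LedState.IDLE,
}


def next_state(state: LedState, event: str) -> LedState:
    """Return the state after an event; unknown events leave the state unchanged."""
    if event == "connection_lost":
        # Wi-Fi or MQTT loss overrides whatever the device was doing.
        return LedState.ERROR
    return TRANSITIONS.get((state, event), state)
```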
## ⚙️ Configuration Details

### Hardware Config (ESP32-S3)

* **Mic SCK:** GPIO 4
* **Mic WS:** GPIO 5
* **Mic SD:** GPIO 6
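* **LED Pin:** GPIO 5 (this is the same GPIO listed for Mic WS; verify against `voice_assistant.ino`, since the I2S word-select line and the LED data line cannot share a pin)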
* **VAD Threshold:** `1000` (Adjust based on noise floor)
* **Soft Gain:** `x6` (Multiplies raw input to match Whisper requirements)

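
To make the two tuning values above concrete, here is a small Python sketch of the arithmetic (the firmware itself is C++). It assumes 16-bit signed samples, a mean-absolute-amplitude level measure, and a threshold compared against the amplified signal; none of these details are confirmed by the firmware, and the function names are hypothetical.

```python
from array import array

VAD_THRESHOLD = 1000  # from the hardware config above
SOFT_GAIN = 6         # from the hardware config above
INT16_MIN, INT16_MAX = -32768, 32767


def apply_gain(samples, gain=SOFT_GAIN):
    """Multiply each 16-bit sample by gain, saturating instead of wrapping around."""
    return array("h", (max(INT16_MIN, min(INT16_MAX, s * gain)) for s in samples))


def mean_abs(samples):
    """Average absolute amplitude of a block (integer division); 0 for an empty block."""
    return sum(abs(s) for s in samples) // len(samples) if len(samples) else 0


def is_voice(samples, threshold=VAD_THRESHOLD):
    """Assumed VAD rule: the block counts as sound if its level exceeds the threshold."""
    return mean_abs(samples) > threshold


quiet = apply_gain(array("h", [100, -120, 90, -110]))   # level 630 after gain
loud = apply_gain(array("h", [300, -250, 280, -310]))   # level 1710 after gain
print(is_voice(quiet), is_voice(loud))                  # False True
```

Saturating at the int16 limits matters with a gain of 6: any raw sample beyond roughly ±5461 would otherwise overflow and produce loud clicks in the audio sent to Whisper.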
### Docker Environment

* **Network:** Internal bridge network between Bridge and Whisper.
* **Models:**
    * Wake Word: `hey_jarvis` (OpenWakeWord)
    * STT: `small.en` (Faster-Whisper, int8 quantization)

## 🚀 Setup & Commands

**1. Flash ESP32**

```bash
arduino-cli upload -v -p /dev/ttyACM0 --fqbn esp32:esp32:esp32s3:CDCOnBoot=cdc,USBMode=hwcdc,PSRAM=disabled,FlashMode=dio,DebugLevel=info --input-dir ./build
```

**2. Start Backend**

```bash
cd ~/voice_bridge
docker compose up -d
```

**3. Monitor Logs**

```bash
docker compose logs -f voice_bridge
```

**4. Monitor Audio Stream (Debug)**

```bash
mosquitto_sub -h 192.168.20.13 -u "mqtt-user" -P "pass" -t "voice/audio_stream" | awk '{printf "."}'
```