Real-Time Speech Recognition in Unity + Python & WebSockets

payamranjbar97@gmail.com

February 27, 2025

This project bridges Python and Unity to create a voice-controlled gaming experience using real-time speech recognition. A Python server uses Vosk Mini for low-latency, offline audio transcription and streams the results to Unity over a WebSocket connection. Both sides keep blocking work off their main loops: the Python server isolates audio processing in a background thread, while Unity’s C# client uses a ConcurrentQueue to safely pass voice commands from a WebSocket thread to the main game loop. To keep the system extensible, voice triggers are defined through ScriptableObjects, so developers can map words like “pizza” or “burger” to 3D models without code changes. Aimed at games that need instant voice interaction, this solution prioritizes performance (sub-500ms latency) and modularity.

This project demonstrates a low-latency speech recognition system for Unity games, combining:

Python Server: Uses Vosk Mini for offline speech-to-text.
Unity Client: Handles voice commands via WebSocket.
Threaded Architecture: Keeps networking separate from game logic.

Technical Breakdown

1. Python Server (Vosk + WebSocket)

Vosk Mini: Lightweight ASR model for real-time transcription.
WebSocket: Async server (websockets library) on port 8765.
Dual Threads:
- Main Thread: Manages WebSocket connections.
- Audio Thread: Processes microphone input without blocking.
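
Vosk's recognizer returns each result as a JSON string, with the finalized transcription under a "text" key. A minimal sketch of turning one such result into individual words ready to stream might look like this; the helper name extract_words and the sample payload are illustrative assumptions, not code taken from the repository.

```python
import json


def extract_words(result_json: str) -> list[str]:
    """Split the recognized text of one Vosk result into lowercase words.

    Assumes the JSON shape returned by KaldiRecognizer.Result(), i.e. an
    object with a "text" field. Hypothetical helper for illustration.
    """
    payload = json.loads(result_json)
    # Partial results carry a "partial" key instead, so "text" may be absent.
    text = payload.get("text", "")
    return text.lower().split()


if __name__ == "__main__":
    # Hand-written sample payload, not captured recognizer output.
    print(extract_words('{"text": "pizza burger"}'))
```

Sending one word per message keeps the Unity side simple, since each queued string can be matched against a trigger directly.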

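# Simplified server snippet (send_word and run_recording are assumed to be defined elsewhere)
import asyncio

async def send_audio(websocket):
    loop = asyncio.get_running_loop()
    def sync_callback(word):
        # Called from the audio thread: schedule the send on the event loop.
        asyncio.run_coroutine_threadsafe(send_word(word), loop)
    # Run the blocking recording loop in a worker thread.
    await loop.run_in_executor(None, run_recording, sync_callback)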
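
To make the handoff between the audio thread and the event loop concrete, here is a self-contained sketch that runs without a microphone or a WebSocket. It assumes a stand-in run_recording that emits fixed words and a send_word that appends to a list instead of calling websocket.send; both are placeholders for illustration, not the repository's implementations.

```python
import asyncio


def run_recording(callback):
    """Stand-in for the blocking audio loop: emits a few fixed words."""
    for word in ("pizza", "burger"):
        callback(word)


async def stream_words(sent):
    loop = asyncio.get_running_loop()
    pending = []

    async def send_word(word):
        # A real server would `await websocket.send(word)` here.
        sent.append(word)

    def sync_callback(word):
        # Runs on the worker thread; hands the coroutine to the event loop.
        future = asyncio.run_coroutine_threadsafe(send_word(word), loop)
        pending.append(future)

    # Run the blocking loop in the default thread pool so the event loop stays free.
    await loop.run_in_executor(None, run_recording, sync_callback)
    # Wait for every scheduled send to finish before returning.
    await asyncio.gather(*(asyncio.wrap_future(f) for f in pending))


def demo():
    sent = []
    asyncio.run(stream_words(sent))
    return sent


if __name__ == "__main__":
    print(demo())
```

The worker thread never touches the event loop directly; it only schedules coroutines on it, which is what keeps the connection handling responsive while audio is being processed.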

2. Unity Client (C# WebSocket)

Threaded WebSocket: Runs in background via System.Threading.
Concurrent Queue: Safely passes messages to the main thread.
ScriptableObjects: Configurable voice commands (e.g., “apple” → spawn 3D fruit).

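// Unity WebSocket handler (runs on a background thread)
private void RunWebSocket() {
    ws = new WebSocket("ws://localhost:8765");
    // Enqueue only; the main thread dequeues and applies the command.
    ws.OnMessage += (sender, e) => receivedWordsQueue.Enqueue(e.Data);
    ws.Connect(); // assumes a websocket-sharp style client
}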
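
The queue pattern itself is language-independent. As a rough Python analogue of the Unity side (not the project's C# code), the sketch below uses queue.Queue in the role of ConcurrentQueue: a network thread only enqueues, and a per-frame function on the main thread drains whatever has arrived. The names network_thread and drain_queue are hypothetical.

```python
import queue
import threading

# Thread-safe queue, playing the role of ConcurrentQueue<string> in Unity.
received_words = queue.Queue()


def network_thread(words):
    """Stand-in for the WebSocket thread: only enqueues, never touches game state."""
    for word in words:
        received_words.put(word)


def drain_queue():
    """Called once per frame on the main thread; returns all pending words."""
    drained = []
    while True:
        try:
            drained.append(received_words.get_nowait())
        except queue.Empty:
            return drained


def demo():
    worker = threading.Thread(target=network_thread, args=(["apple", "pizza"],))
    worker.start()
    worker.join()  # joined here only to make the demo deterministic
    return drain_queue()


if __name__ == "__main__":
    print(demo())
```

Draining without blocking matters because the main loop must never wait on the network; if nothing has arrived, the frame simply proceeds.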

Key Features

Latency < 500ms: Achieved via Vosk Mini’s optimized inference.
Thread Safety:
- Python: asyncio.run_coroutine_threadsafe() schedules sends on the event loop from the audio thread.
- Unity: ConcurrentQueue decouples networking from gameplay.
Scalability: Add commands via Unity’s ScriptableObjects, no code changes.
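
The data-driven command idea can be sketched outside Unity as a lookup table. In the Python analogue below, each VoiceCommand entry stands in for one ScriptableObject asset; the class, the prefab names, and the resolve helper are hypothetical and only illustrate that adding a command means adding data rather than logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceCommand:
    """Hypothetical stand-in for one ScriptableObject asset."""
    trigger: str
    prefab_name: str


# Adding a command means adding a data entry, not new logic.
COMMANDS = [
    VoiceCommand("pizza", "PizzaModel"),
    VoiceCommand("burger", "BurgerModel"),
]

# Index the commands by trigger word for constant-time lookup.
LOOKUP = {cmd.trigger: cmd for cmd in COMMANDS}


def resolve(word):
    """Return the prefab name for a recognized word, or None if unmapped."""
    cmd = LOOKUP.get(word.strip().lower())
    return cmd.prefab_name if cmd else None


if __name__ == "__main__":
    print(resolve("pizza"))
```

Normalizing the recognized word before lookup keeps stray whitespace or capitalization from causing missed triggers.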

Use Cases

Voice-controlled character transformations
Speech-driven puzzle mechanics
Accessibility features for motor-impaired players

GitHub Repositories

Python Server: https://github.com/payam-ranjbar/Speech-Transfer-Socket
Unity Client: https://github.com/payam-ranjbar/Speech-Mania-Game

Payam Ranjbar