Xybrid

StreamSession

Real-time audio processing

The StreamSession manages real-time streaming inference. Audio is processed in chunks with partial results emitted as you speak.

Architecture

How It Works

  1. Audio Input - Audio is fed continuously into the AudioBuffer
  2. Chunking - When the buffer reaches the threshold (5 seconds by default), a chunk is extracted
  3. Overlap - A configurable overlap between consecutive chunks preserves continuity
  4. Processing - The model transcribes the chunk
  5. Callback - A partial result is emitted to the application
  6. Flush - On stop, the remaining audio is processed to produce the final result
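
The chunking, overlap, and flush steps can be sketched roughly as follows. This is a minimal illustration, not Xybrid's implementation: the class name OverlapBuffer, its methods, and the 16 kHz sample rate are assumptions made for the example, and a plain list stands in for the ring buffer.

```python
from typing import List


class OverlapBuffer:
    """Hypothetical stand-in for AudioBuffer (not the real Xybrid type)."""

    def __init__(self, sample_rate: int = 16_000,
                 chunk_size_ms: int = 5000, overlap_ms: int = 500):
        # Convert the millisecond settings into sample counts.
        self.chunk_samples = sample_rate * chunk_size_ms // 1000
        self.overlap_samples = sample_rate * overlap_ms // 1000
        if self.overlap_samples >= self.chunk_samples:
            raise ValueError("overlap must be smaller than the chunk size")
        self._samples: List[float] = []

    def push(self, samples: List[float]) -> List[List[float]]:
        """Append samples and return every full chunk now available."""
        self._samples.extend(samples)
        chunks = []
        while len(self._samples) >= self.chunk_samples:
            chunks.append(self._samples[:self.chunk_samples])
            # Keep the last overlap_samples so the next chunk re-reads them.
            keep_from = self.chunk_samples - self.overlap_samples
            self._samples = self._samples[keep_from:]
        return chunks

    def flush(self) -> List[float]:
        """Return the remaining audio (including any overlap tail) and reset."""
        rest, self._samples = self._samples, []
        return rest
```

In this sketch, each chunk returned by push would be handed to the model, and the transcription passed to the application's callback. The audio returned by flush begins with the overlap tail of the previous chunk, so a real implementation may want to de-duplicate the text at the seam.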

Components

Component      Description
StreamManager  Coordinates input/output buffers
AudioBuffer    Ring buffer with overlap support
StreamSession  Manages streaming state

Session States

State       Description
idle        Ready to start
streaming   Receiving and processing audio
finalizing  Processing remaining audio
completed   Done
error       Error occurred
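
One way to picture these states is as a small state machine. The state names below come from the table. The allowed transitions are an assumption, since this page lists the states but not which moves between them are legal. SessionState, ALLOWED, and transition are hypothetical names.

```python
from enum import Enum


class SessionState(Enum):
    IDLE = "idle"
    STREAMING = "streaming"
    FINALIZING = "finalizing"
    COMPLETED = "completed"
    ERROR = "error"


# Assumed transition graph: a linear happy path, with error reachable
# from any non-terminal state. The real session may allow other moves.
ALLOWED = {
    SessionState.IDLE: {SessionState.STREAMING, SessionState.ERROR},
    SessionState.STREAMING: {SessionState.FINALIZING, SessionState.ERROR},
    SessionState.FINALIZING: {SessionState.COMPLETED, SessionState.ERROR},
    SessionState.COMPLETED: set(),
    SessionState.ERROR: set(),
}


def transition(current: SessionState, target: SessionState) -> SessionState:
    """Return the new state, or raise if the move is not in ALLOWED."""
    if target not in ALLOWED[current]:
        raise ValueError(
            f"illegal transition: {current.value} -> {target.value}")
    return target
```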

Configuration

Parameter      Default  Description
chunk_size_ms  5000     Process every N milliseconds
overlap_ms     500      Overlap between chunks
use_vad        false    Enable Voice Activity Detection
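
As an illustration, the three parameters and their defaults could be modelled as a small validated record. The field names and defaults are taken from the table. The StreamConfig class, its validation rules, and the hop_ms helper are assumptions for this sketch and not part of the documented API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamConfig:
    """Hypothetical container mirroring the documented parameters."""
    chunk_size_ms: int = 5000
    overlap_ms: int = 500
    use_vad: bool = False

    def __post_init__(self) -> None:
        # Assumed sanity checks; the real library may validate differently.
        if self.chunk_size_ms <= 0:
            raise ValueError("chunk_size_ms must be positive")
        if not 0 <= self.overlap_ms < self.chunk_size_ms:
            raise ValueError("overlap_ms must be in [0, chunk_size_ms)")

    @property
    def hop_ms(self) -> int:
        """New (non-overlapping) audio contributed by each chunk."""
        return self.chunk_size_ms - self.overlap_ms
```

With the defaults, each 5000 ms chunk carries 4500 ms of new audio plus 500 ms repeated from the previous chunk.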

Chunk Size Trade-offs

Chunk Size  Latency  Accuracy
2 seconds   Low      Lower (less context)
5 seconds   Medium   Good balance
10 seconds  High     Better (more context)
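
The latency side of this trade-off can be made concrete with a little arithmetic. The sketch below assumes the overlap stays at the default 500 ms for every chunk size and that a chunk is extracted as soon as the buffer holds a full one. The function chunk_tradeoff is a hypothetical helper, and it says nothing about accuracy, which depends on the model.

```python
def chunk_tradeoff(chunk_size_ms: int, overlap_ms: int = 500,
                   audio_ms: int = 60_000) -> dict:
    """Estimate the buffering cost of a chunk size over audio_ms of input."""
    hop_ms = chunk_size_ms - overlap_ms
    if audio_ms < chunk_size_ms:
        full_chunks = 0
    else:
        # The first chunk needs chunk_size_ms; each later one needs hop_ms more.
        full_chunks = 1 + (audio_ms - chunk_size_ms) // hop_ms
    return {
        # No partial result can appear before one full chunk is buffered.
        "min_wait_ms": chunk_size_ms,
        # Model invocations before the final flush.
        "full_chunks": full_chunks,
        # Share of each chunk that repeats audio from the previous chunk.
        "overlap_fraction": overlap_ms / chunk_size_ms,
    }


for size_ms in (2000, 5000, 10000):
    print(size_ms, chunk_tradeoff(size_ms))
```

Under these assumptions, one minute of audio yields 39 model calls at 2 seconds, 13 at 5 seconds, and 6 at 10 seconds. Smaller chunks therefore lower the wait for the first partial result but raise the number of inference calls and the share of re-processed audio.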

Statistics

The StreamSession tracks:

Metric             Description
samples_received   Total audio samples fed
samples_processed  Samples sent to model
chunks_processed   Number of chunks transcribed
audio_duration_ms  Total audio duration
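
A rough sketch of how such counters might be kept is shown here. The metric names match the table. The StreamStats class, its on_input and on_chunk methods, the 16 kHz sample rate, and the choice to derive audio_duration_ms from samples_received are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class StreamStats:
    """Hypothetical counters mirroring the documented metrics."""
    sample_rate: int = 16_000  # assumed; not stated on this page
    samples_received: int = 0
    samples_processed: int = 0
    chunks_processed: int = 0

    def on_input(self, n_samples: int) -> None:
        """Record audio fed into the session."""
        self.samples_received += n_samples

    def on_chunk(self, n_samples: int) -> None:
        """Record one chunk handed to the model."""
        self.samples_processed += n_samples
        self.chunks_processed += 1

    @property
    def audio_duration_ms(self) -> int:
        # Assumption: duration is derived from the samples received.
        return self.samples_received * 1000 // self.sample_rate
```

Because overlapping chunks re-send some audio to the model, samples_processed can exceed samples_received in a scheme like this one.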

Integration with Orchestrator

The Orchestrator supports streaming mode.

Voice Activity Detection (VAD)

VAD automatically segments audio based on speech detection:

  • Detects speech vs silence
  • Can trigger chunk processing on speech end
  • Reduces unnecessary processing of silence
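
To make the idea concrete, here is a minimal energy-threshold detector that reports where speech ended. It is a simple stand-in, not the detector Xybrid uses, which this page does not describe. The function names, frame length, threshold, and silence duration are all illustrative assumptions.

```python
import math
from typing import List, Optional


def rms(frame: List[float]) -> float:
    """Root-mean-square energy of one frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def find_speech_end(samples: List[float], frame_len: int = 320,
                    threshold: float = 0.02,
                    min_silence_frames: int = 10) -> Optional[int]:
    """Return the sample index where silence began after speech, or None.

    Speech is considered ended once min_silence_frames consecutive
    frames fall below the energy threshold following at least one
    frame above it.
    """
    in_speech = False
    silent_run = 0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        if rms(samples[start:start + frame_len]) >= threshold:
            in_speech = True
            silent_run = 0
        elif in_speech:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # Step back to the start of the first silent frame.
                return start - (silent_run - 1) * frame_len
    return None
```

A session using something like this could cut a chunk at the returned index instead of waiting for the buffer to fill, and could skip the model call entirely when the buffer contains no frame above the threshold.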

Performance

Current baseline (M1 Mac, whisper-tiny-candle):

Metric                  Value
Chunk duration          5 seconds
Partial result latency  ~5-7 seconds
Processing mode         CPU

Optimization Opportunities

  • GPU/Metal acceleration
  • Smaller chunk sizes (2-3 seconds)
  • Distilled Whisper models
  • Voice Activity Detection to reduce processing
