StreamSession

Real-time audio processing

`StreamSession` manages real-time streaming inference: audio is processed in chunks, and partial results are emitted while you are still speaking.

Architecture

How It Works

Audio Input - Continuous audio is fed to the `AudioBuffer`
Chunking - When the buffer reaches the threshold (`chunk_size_ms`, 5 seconds by default), a chunk is extracted
Overlap - Consecutive chunks share a configurable overlap (`overlap_ms`) for continuity
Processing - The model transcribes the chunk
Callback - A partial result is emitted to the application
Flush - On stop, the remaining audio is processed to produce the final result
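
The chunking, overlap, callback, and flush steps above can be sketched as follows. This is a minimal illustration, not the actual `AudioBuffer` implementation: the `OverlapChunker` class, its method names, and the 16 kHz sample rate are all assumptions made for the example, and the model call is replaced by a plain callback.

```python
from typing import Callable, List

SAMPLE_RATE = 16_000  # assumed sample rate in Hz; not stated on this page


class OverlapChunker:
    """Hypothetical stand-in for the buffer and chunking loop."""

    def __init__(self, on_partial: Callable[[List[float]], None],
                 chunk_size_ms: int = 5000, overlap_ms: int = 500) -> None:
        self.chunk_samples = SAMPLE_RATE * chunk_size_ms // 1000
        self.overlap_samples = SAMPLE_RATE * overlap_ms // 1000
        if self.overlap_samples >= self.chunk_samples:
            raise ValueError("overlap must be shorter than the chunk")
        self.on_partial = on_partial
        self.buffer: List[float] = []
        self.carried = 0  # samples at the head of the buffer already emitted once

    def feed(self, samples: List[float]) -> None:
        """Append audio and emit every full chunk that is now available."""
        self.buffer.extend(samples)
        while len(self.buffer) >= self.chunk_samples:
            self.on_partial(self.buffer[:self.chunk_samples])
            # Drop everything except the overlap tail of the chunk just emitted.
            del self.buffer[:self.chunk_samples - self.overlap_samples]
            self.carried = self.overlap_samples

    def flush(self) -> None:
        """Emit the remaining audio, but only if it contains unseen samples."""
        if len(self.buffer) > self.carried:
            self.on_partial(list(self.buffer))
        self.buffer.clear()
        self.carried = 0
```

With the default settings, feeding 12 seconds of audio and then calling `flush()` yields two full 5-second chunks followed by a shorter final chunk, each starting 4.5 seconds after the previous one.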

Components

Component	Description
`StreamManager`	Coordinates input/output buffers
`AudioBuffer`	Ring buffer with overlap support
`StreamSession`	Manages streaming state

Session States

State	Description
`idle`	Ready to start
`streaming`	Receiving and processing audio
`finalizing`	Processing remaining audio
`completed`	Done
`error`	Error occurred
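
One way to picture the states is as a small state machine. The state names below come from the table above; the transition table is an assumption inferred from the descriptions (for example, that `error` can be reached from any active state and that `completed` and `error` are terminal), not documented behaviour.

```python
from enum import Enum


class SessionState(Enum):
    IDLE = "idle"
    STREAMING = "streaming"
    FINALIZING = "finalizing"
    COMPLETED = "completed"
    ERROR = "error"


# Assumed transitions; the real session may allow others (e.g. a restart).
ALLOWED = {
    SessionState.IDLE: {SessionState.STREAMING, SessionState.ERROR},
    SessionState.STREAMING: {SessionState.FINALIZING, SessionState.ERROR},
    SessionState.FINALIZING: {SessionState.COMPLETED, SessionState.ERROR},
    SessionState.COMPLETED: set(),
    SessionState.ERROR: set(),
}


def transition(current: SessionState, target: SessionState) -> SessionState:
    """Return the new state, or raise if the move is not in the assumed table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```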

Configuration

Parameter	Default	Description
`chunk_size_ms`	5000	Chunk duration in milliseconds; a chunk is extracted each time this much audio is buffered
`overlap_ms`	500	Overlap between chunks
`use_vad`	false	Enable Voice Activity Detection
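
As a rough illustration, the parameters and defaults in the table could be modelled like this. The `StreamConfig` name and the validation rules are assumptions for the sketch; this page does not describe how the real SDK exposes or validates its configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamConfig:
    """Hypothetical container mirroring the documented defaults."""
    chunk_size_ms: int = 5000
    overlap_ms: int = 500
    use_vad: bool = False

    def __post_init__(self) -> None:
        # Assumed sanity checks: an overlap as long as the chunk would mean
        # the stream never advances.
        if self.chunk_size_ms <= 0:
            raise ValueError("chunk_size_ms must be positive")
        if not 0 <= self.overlap_ms < self.chunk_size_ms:
            raise ValueError("overlap_ms must be in [0, chunk_size_ms)")


low_latency = StreamConfig(chunk_size_ms=2000)  # other fields keep their defaults
```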

Chunk Size Trade-offs

Chunk Size	Latency	Accuracy
2 seconds	Low	Lower (less context)
5 seconds	Medium	Good balance
10 seconds	High	Better (more context)
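
The latency side of this trade-off can be made concrete with a little arithmetic. The sketch below assumes fixed-size chunking (no VAD) and the default 500 ms overlap; `chunk_tradeoff` is a hypothetical helper, and the figures ignore model inference time, which adds to the latency.

```python
def chunk_tradeoff(chunk_size_ms: int, overlap_ms: int = 500) -> dict:
    """Back-of-the-envelope figures for one chunk size."""
    stride_ms = chunk_size_ms - overlap_ms  # new audio consumed per chunk
    return {
        # A partial result cannot arrive before a full chunk has been buffered.
        "min_latency_ms": chunk_size_ms,
        # Share of each chunk that was already transcribed in the previous one.
        "reprocessed_fraction": overlap_ms / chunk_size_ms,
        # Model invocations needed per minute of audio.
        "chunks_per_minute": 60_000 / stride_ms,
    }


for size_ms in (2000, 5000, 10_000):
    print(size_ms, chunk_tradeoff(size_ms))
```

Under these assumptions, 2-second chunks run the model 40 times per minute of audio and re-transcribe a quarter of each chunk, whereas 10-second chunks need only about 6 runs per minute but delay every partial result by at least 10 seconds.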

Statistics

The StreamSession tracks:

Metric	Description
`samples_received`	Total audio samples fed
`samples_processed`	Samples sent to model
`chunks_processed`	Number of chunks transcribed
`audio_duration_ms`	Total audio duration
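
A minimal sketch of how these counters relate to each other is shown below. The field names match the table; the `StreamStats` class, its `record_*` methods, the 16 kHz sample rate, and the choice to derive `audio_duration_ms` from `samples_received` are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class StreamStats:
    """Hypothetical counters mirroring the documented metrics."""
    samples_received: int = 0
    samples_processed: int = 0
    chunks_processed: int = 0
    sample_rate: int = 16_000  # assumed; not stated on this page

    def record_feed(self, n_samples: int) -> None:
        self.samples_received += n_samples

    def record_chunk(self, n_samples: int) -> None:
        self.samples_processed += n_samples
        self.chunks_processed += 1

    @property
    def audio_duration_ms(self) -> float:
        # Assumed to be derived from the audio received, not the audio processed.
        return self.samples_received * 1000 / self.sample_rate
```

In this sketch `samples_processed` can exceed `samples_received` when overlap is enabled, because overlapping samples are sent to the model twice.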

Integration with Orchestrator

The Orchestrator supports streaming mode.

Voice Activity Detection (VAD)

VAD automatically segments audio based on speech detection:

Detects speech vs silence
Can trigger chunk processing on speech end
Reduces unnecessary processing of silence
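
To illustrate the idea of triggering on speech end, here is a deliberately simple energy-threshold detector. It is not the detector used by `StreamSession` (which this page does not specify); the function names, the RMS threshold, and the "hangover" of three silent frames are all assumptions chosen for the example.

```python
import math
from typing import List, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_end_frames(frames: Sequence[Sequence[float]],
                      threshold: float = 0.01,
                      hangover_frames: int = 3) -> List[int]:
    """Return the frame indices at which a speech segment is judged to end.

    A segment ends once `hangover_frames` consecutive frames fall below the
    energy threshold after at least one frame of speech.
    """
    ends: List[int] = []
    in_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if frame_rms(frame) >= threshold:
            in_speech = True
            silent_run = 0
        elif in_speech:
            silent_run += 1
            if silent_run >= hangover_frames:
                ends.append(i)  # a chunk could be handed to the model here
                in_speech = False
                silent_run = 0
    return ends
```

Frames of leading silence never start a segment, so they produce no work for the model, which is the saving the last bullet describes.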

Performance

Current baseline (M1 Mac, whisper-tiny-candle):

Metric	Value
Chunk duration	5 seconds
Partial result latency	~5-7 seconds
Processing mode	CPU

Optimization Opportunities

GPU/Metal acceleration
Smaller chunk sizes (2-3 seconds)
Distilled Whisper models
Voice Activity Detection to reduce processing

Related

Orchestrator - Manages streaming execution
Streaming SDK - SDK usage guide
