Streaming Inference

Xybrid supports real-time streaming inference for speech recognition. Audio is processed in chunks, and partial transcription results are emitted while you are still speaking.

Architecture

Microphone → AudioBuffer → Chunking → Whisper → Partial Results
                 ↑                         ↓
             Overlap                  Callback

Key Components

Component	Description
`StreamSession`	Manages streaming state and processing
`AudioBuffer`	Ring buffer with overlap support
`XybridStreamer`	Flutter SDK streaming class

How It Works

Audio is fed continuously into the AudioBuffer.
When a full chunk has accumulated (5 seconds by default), it is handed off for processing.
Whisper transcribes the chunk.
The partial result is emitted via a callback.
On stop, any remaining audio is flushed.
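
The chunk-and-overlap behaviour described above can be sketched in a few lines of JavaScript. This is a minimal illustration, not Xybrid's `AudioBuffer`: the `OverlapChunker` class, its method names, and the 16 kHz default sample rate are assumptions made for the example.

```javascript
// Hypothetical sketch of chunking with overlap; not the Xybrid implementation.
class OverlapChunker {
  constructor({ chunkSizeMs = 5000, overlapMs = 500, sampleRate = 16000 } = {}) {
    this.chunkSamples = Math.round((chunkSizeMs * sampleRate) / 1000);
    this.overlapSamples = Math.round((overlapMs * sampleRate) / 1000);
    if (this.overlapSamples >= this.chunkSamples) {
      throw new RangeError('overlap must be shorter than the chunk');
    }
    this.pending = new Float32Array(0);
  }

  // Append samples and return every full chunk that is now available.
  feed(samples) {
    const merged = new Float32Array(this.pending.length + samples.length);
    merged.set(this.pending, 0);
    merged.set(samples, this.pending.length);

    const hop = this.chunkSamples - this.overlapSamples;
    const chunks = [];
    let start = 0;
    while (merged.length - start >= this.chunkSamples) {
      chunks.push(merged.slice(start, start + this.chunkSamples));
      start += hop; // the next chunk re-uses the last `overlapSamples` samples
    }
    this.pending = merged.slice(start);
    return chunks;
  }

  // Return whatever is left (possibly shorter than a chunk) and reset.
  flush() {
    const rest = this.pending;
    this.pending = new Float32Array(0);
    return rest;
  }
}
```

Each call to `feed` may return zero, one, or several chunks depending on how much audio has accumulated; `flush` corresponds to the final step, where the leftover tail is handed over on stop.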

Flutter SDK

Basic Usage

import 'package:xybrid_flutter/xybrid_flutter.dart';

// Create streaming session
final streamer = await XybridStreamer.create(
  modelPath: '/path/to/whisper-tiny-candle',
  config: StreamingConfig(
    chunkSizeMs: 5000,   // 5 second chunks
    overlapMs: 500,       // 0.5s overlap
  ),
);

// Listen for partial results
streamer.onPartialResult.listen((partial) {
  print('Partial: $partial');
});

// Feed audio from microphone
micStream.listen((pcmChunk) {
  streamer.feedPcm16(pcmChunk);
});

// Stop and get final result
final result = await streamer.flush();
print('Final: $result');

// Cleanup
await streamer.dispose();

From Registry

final streamer = await XybridStreamer.createFromRegistry(
  config: RegistryStreamingConfig(
    modelId: 'whisper-tiny-candle',
    version: '1.0',
    registryUrl: 'http://localhost:8080',
  ),
);

With XybridRecorder

final recorder = XybridRecorder();
final streamer = await XybridStreamer.create(modelPath: modelPath);

// Start streaming from microphone
await recorder.startStreaming((samples) {
  streamer.feed(samples);
});

streamer.onPartialResult.listen((text) {
  setState(() => transcription = text);
});

// Stop
await recorder.stopStreaming();
final result = await streamer.flush();

Streaming Config

StreamingConfig(
  chunkSizeMs: 5000,    // Process every 5 seconds
  overlapMs: 500,        // 0.5s overlap between chunks
  useVad: false,         // Set to true to enable Voice Activity Detection
)

Chunk Size Trade-offs

Chunk Size	Latency	Accuracy
2 seconds	Low	Lower (less context)
5 seconds	Medium	Good balance
10 seconds	High	Better (more context)
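
The arithmetic behind this trade-off is easy to check. The sketch below assumes 16 kHz mono audio (the format listed for the WebSocket endpoint) and a fixed 500 ms overlap; `chunkPlan` is a hypothetical helper, not part of the SDK.

```javascript
// Illustrative arithmetic only; assumes 16 kHz mono audio.
function chunkPlan(chunkSizeMs, overlapMs, sampleRate = 16000) {
  const samplesPerChunk = (chunkSizeMs * sampleRate) / 1000;
  const overlapSamples = (overlapMs * sampleRate) / 1000;
  return {
    samplesPerChunk,
    // New audio consumed per chunk once the overlap is subtracted.
    hopSamples: samplesPerChunk - overlapSamples,
    // Share of each chunk that repeats audio from the previous chunk.
    overlapRatio: overlapMs / chunkSizeMs,
    // A partial result cannot arrive before one full chunk has been recorded.
    minWaitMs: chunkSizeMs,
  };
}

for (const ms of [2000, 5000, 10000]) {
  console.log(ms, chunkPlan(ms, 500));
}
```

With the overlap held at 500 ms, a 2 second chunk repeats 25% of its audio while a 10 second chunk repeats only 5%, so shorter chunks lower the wait for the first partial result at the cost of more repeated work and less context per chunk.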

Session Lifecycle

// States
enum StreamState {
  idle,        // Ready to start
  streaming,   // Receiving audio
  finalizing,  // Processing remaining audio
  completed,   // Done
  error,       // Error occurred
}

// Check state
if (streamer.state == StreamState.streaming) {
  // Session is actively receiving audio
}
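
The comments on the enum suggest a linear lifecycle. The sketch below encodes one plausible reading of it as a transition table. The allowed transitions are an assumption inferred from those comments, not documented SDK behaviour, and the SDK may permit others (for example, a reset back to `idle`).

```javascript
// Assumed transitions, inferred from the state descriptions above.
const TRANSITIONS = {
  idle: ['streaming', 'error'],
  streaming: ['finalizing', 'error'],
  finalizing: ['completed', 'error'],
  completed: [],
  error: [],
};

// Returns true when moving from `from` to `to` is allowed by the table.
function canTransition(from, to) {
  return (TRANSITIONS[from] ?? []).includes(to);
}
```

A table like this is handy in UI code, for instance to decide whether a "Stop" button should be enabled (`canTransition(state, 'finalizing')`).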

Statistics

final stats = streamer.stats;
print('Samples received: ${stats.samplesReceived}');
print('Samples processed: ${stats.samplesProcessed}');
print('Chunks processed: ${stats.chunksProcessed}');
print('Audio duration: ${stats.audioDurationMs}ms');
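
One use for these counters is estimating how far transcription lags behind the microphone. The helper below is a hypothetical sketch: it re-uses the `samplesReceived` and `samplesProcessed` field names from the Dart example and assumes 16 kHz audio.

```javascript
// Hypothetical helper: audio received but not yet processed, in milliseconds.
function backlogMs(stats, sampleRate = 16000) {
  const pending = stats.samplesReceived - stats.samplesProcessed;
  // Clamp at zero in case overlap causes processed samples to be counted twice.
  return (Math.max(0, pending) / sampleRate) * 1000;
}

console.log(backlogMs({ samplesReceived: 96000, samplesProcessed: 80000 }));
```

A backlog that keeps growing over time would indicate that chunks are arriving faster than they can be transcribed.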

WebSocket Streaming

For browser clients, use the WebSocket endpoint:

Connect

const ws = new WebSocket('ws://localhost:3000/v1/audio/transcriptions/stream');

Send Audio

// Binary: PCM 16-bit, 16kHz, mono
ws.send(audioChunk);

// Control: JSON messages
ws.send(JSON.stringify({ type: 'flush' }));
ws.send(JSON.stringify({ type: 'reset' }));
ws.send(JSON.stringify({ type: 'close' }));
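
Browser audio APIs typically deliver samples as 32-bit floats in the range -1 to 1, so they need converting to 16-bit PCM before being sent. The sketch below shows one common conversion; `floatToPcm16` is a hypothetical helper, and it assumes the audio has already been resampled to 16 kHz mono.

```javascript
// Convert Float32 samples in [-1, 1] to signed 16-bit PCM.
function floatToPcm16(float32) {
  const pcm = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    // Clamp first so out-of-range samples saturate instead of wrapping.
    const s = Math.max(-1, Math.min(1, float32[i]));
    pcm[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return pcm;
}

// Usage with an open socket (not executed here):
//   ws.send(floatToPcm16(samples).buffer);
```

Typed arrays use the platform's byte order, which is little-endian on common hardware; the byte order the endpoint expects is not stated here, so treat that as something to confirm.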

Receive Results

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  switch (msg.type) {
    case 'ready':
      console.log('Session ready');
      break;
    case 'partial':
      console.log('Partial:', msg.text);
      break;
    case 'final':
      console.log('Final:', msg.text);
      break;
    case 'error':
      console.error('Error:', msg.message);
      break;
  }
};
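
For UI code it can be convenient to keep message handling in a pure function that is separate from the socket. The sketch below is one way to do that. The `applyMessage` name and the state shape are assumptions for illustration; only the four message types come from the handler above.

```javascript
// Hypothetical reducer: folds one server message into a plain state object.
function applyMessage(state, raw) {
  let msg;
  try {
    msg = JSON.parse(raw);
  } catch {
    return { ...state, error: 'malformed message' };
  }
  if (msg === null || typeof msg !== 'object') return state;

  switch (msg.type) {
    case 'ready':
      return { ...state, ready: true };
    case 'partial':
      return { ...state, partial: msg.text };
    case 'final':
      return { ...state, final: msg.text, partial: '' };
    case 'error':
      return { ...state, error: msg.message };
    default:
      return state; // ignore unknown message types
  }
}

// Usage with an open socket (not executed here):
//   let state = { ready: false, partial: '', final: '', error: null };
//   ws.onmessage = (event) => { state = applyMessage(state, event.data); };
```

Guarding `JSON.parse` keeps a single malformed frame from throwing inside the `onmessage` handler.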

Voice Activity Detection (VAD)

VAD automatically segments audio based on speech detection:

final streamer = await XybridStreamer.create(
  modelPath: modelPath,
  config: StreamingConfig(
    useVad: true,
  ),
);

// Check VAD availability
if (streamer.hasVad) {
  print('VAD enabled');
}
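
To give a feel for what speech detection involves, the sketch below implements a simple energy-based detector. This is a generic textbook technique, not Xybrid's VAD (whose algorithm is not described here), and the 0.02 threshold is an arbitrary value chosen for illustration.

```javascript
// Root-mean-square level of one frame of Float32 samples in [-1, 1].
function rms(frame) {
  let sum = 0;
  for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
  return frame.length ? Math.sqrt(sum / frame.length) : 0;
}

// Naive energy-based check: a frame counts as speech when it is loud enough.
function isSpeech(frame, threshold = 0.02) {
  return rms(frame) > threshold;
}
```

An energy threshold alone cannot tell speech from other loud sounds, which is why real detectors usually add spectral features or a trained model on top.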

Performance

Current baseline (M1 Mac, whisper-tiny-candle):

Metric	Value
Chunk duration	5 seconds
Partial result latency	~5-7 seconds
Processing mode	CPU

Optimization Opportunities

GPU/Metal acceleration
Smaller chunk sizes (2-3 seconds)
Distilled Whisper models
