Audio Architecture & Capture

The DiktaMe.Core/Audio/ namespace is the heartbeat of the dictation pipeline. It manages microphone devices, raw buffer capture, MemoryStream buffering, and the automatic ducking of competing desktop audio.

dIKta.me relies on NAudio to interface with the low-level Windows WASAPI (Windows Audio Session API) audio stack.

Audio Capture Lifecycle

sequenceDiagram
    participant User as User
    participant Hotkey as GlobalHotkeyService
    participant Audio as AudioCaptureService
    participant Ducker as AudioDuckerService
    participant STT as STTRouter (Pipeline)

    User->>Hotkey: Presses Ctrl+Alt+D
    Hotkey->>Audio: StartRecordingAsync()
    Audio->>Ducker: DuckDesktopAudio(20%)
    
    rect rgb(20, 40, 60)
        Note right of Audio: WASAPI Loop (16kHz 16-bit PCM)
        loop Every 300ms
            Audio-->>Audio: Write bytes to MemoryStream
            opt If IStreamingSTTProvider
                Audio->>STT: Yield return byte[] (WebSockets)
            end
        end
    end
    
    User->>Hotkey: Releases Ctrl+Alt+D
    Hotkey->>Audio: StopRecordingAsync()
    Audio->>Ducker: RestoreAudioVolume()
    
    opt If ISTTProvider (Batch)
        Audio->>STT: Submit monolithic byte[] array
    end

From the moment the user presses Ctrl+Alt+D (Dictate) to the moment raw voice audio is submitted to the Speech-to-Text inference engine, the data follows a strict, memory-conscious lifecycle.

Because audio capture generates large byte arrays of PCM data, the architecture prioritizes minimizing object allocations.

1. Device Selection & Initialization

When the AudioCaptureService initializes, it enumerates the Windows audio endpoints using NAudio's MMDeviceEnumerator. It defaults to the primary system recording device unless the user has explicitly set a microphone override in the JSON application settings.
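The override-vs-default logic can be sketched with NAudio's enumerator as follows; the method name and the overrideId parameter are illustrative, not the actual service API:

```csharp
// A minimal sketch of default-vs-override device selection.
using NAudio.CoreAudioApi;

public static class DeviceSelection
{
    public static MMDevice SelectCaptureDevice(string? overrideId)
    {
        var enumerator = new MMDeviceEnumerator();
        if (!string.IsNullOrEmpty(overrideId))
        {
            // Match the user's configured override against active capture endpoints.
            foreach (var device in enumerator.EnumerateAudioEndPoints(DataFlow.Capture, DeviceState.Active))
            {
                if (device.ID == overrideId)
                    return device;
            }
        }
        // Fall back to the primary system recording device.
        return enumerator.GetDefaultAudioEndpoint(DataFlow.Capture, Role.Console);
    }
}
```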

2. Stream Buffering

Instead of buffering an entire 10-minute WAV file in one pre-allocated block, capture runs inside an asynchronous stream handler hooked into NAudio's DataAvailable event loop.

Audio is sampled at 16 kHz, 16-bit mono PCM. This footprint is chosen deliberately because it is natively accepted by Deepgram Flux/Nova, local Whisper.net execution, and Google Gemini, so no CPU-intensive resampling or format conversion is needed mid-flight. At this rate the stream produces 32,000 bytes of PCM per second, or roughly 9.6 KB per 300 ms chunk.
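A capture setup in this format, assuming NAudio's WasapiCapture, might look like the following; the class shape is a sketch, not the real AudioCaptureService:

```csharp
// Sketch: 16 kHz / 16-bit / mono capture into a growing MemoryStream.
using System.IO;
using NAudio.CoreAudioApi;
using NAudio.Wave;

public class CaptureSketch
{
    private readonly WasapiCapture _capture;
    private readonly MemoryStream _buffer = new MemoryStream();

    public CaptureSketch(MMDevice microphone)
    {
        _capture = new WasapiCapture(microphone);
        _capture.WaveFormat = new WaveFormat(16000, 16, 1); // 16 kHz, 16-bit, mono PCM
        _capture.DataAvailable += (_, e) =>
            _buffer.Write(e.Buffer, 0, e.BytesRecorded);    // append each PCM chunk
    }

    public void Start() => _capture.StartRecording();

    public byte[] Stop()
    {
        _capture.StopRecording();
        return _buffer.ToArray(); // monolithic payload for batch providers
    }
}
```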

3. Ducking (Audio Attenuation)

If the user has Audio Ducking enabled, the AudioDuckerService is triggered the moment recording starts.

The service uses the Windows AudioSessionManager to find any process outputting audio (Spotify, Edge, games) that isn't the dIKta.me process itself. It sets each session's volume scalar to the percentage configured in the UI (e.g. 20%) and restores the original volumes as soon as the key is released.
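The duck-and-restore pass can be approximated with NAudio's session APIs; the volume bookkeeping dictionary below is an assumption about how restoration might be tracked:

```csharp
// Hedged sketch of per-session ducking via the default render endpoint.
using System.Collections.Generic;
using System.Diagnostics;
using NAudio.CoreAudioApi;

public class DuckerSketch
{
    private readonly Dictionary<string, float> _saved = new();

    public void DuckDesktopAudio(float scalar) // e.g. 0.2f for 20%
    {
        var device = new MMDeviceEnumerator()
            .GetDefaultAudioEndpoint(DataFlow.Render, Role.Multimedia);
        var sessions = device.AudioSessionManager.Sessions;
        uint selfPid = (uint)Process.GetCurrentProcess().Id;

        for (int i = 0; i < sessions.Count; i++)
        {
            var session = sessions[i];
            if (session.GetProcessID == selfPid)
                continue; // never duck the dictation app itself
            _saved[session.GetSessionIdentifier] = session.SimpleAudioVolume.Volume;
            session.SimpleAudioVolume.Volume = scalar;
        }
    }

    public void RestoreAudioVolume()
    {
        var device = new MMDeviceEnumerator()
            .GetDefaultAudioEndpoint(DataFlow.Render, Role.Multimedia);
        var sessions = device.AudioSessionManager.Sessions;
        for (int i = 0; i < sessions.Count; i++)
        {
            var session = sessions[i];
            if (_saved.TryGetValue(session.GetSessionIdentifier, out float v))
                session.SimpleAudioVolume.Volume = v; // put each session back exactly
        }
        _saved.Clear();
    }
}
```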

4. Memory Stream Handoff

As WasapiCapture delivers chunks of audio bytes, the AudioCaptureService appends them to an expanding, encapsulated MemoryStream.

If the current pipeline supports Batch Processing (like Whisper.net), this MemoryStream is simply sealed on key release and its contents become the payload.

If the pipeline supports Streaming Processing (like Deepgram over WebSockets), the capture loop intercepts each ~300ms buffer and yields it onto the network socket asynchronously, drastically cutting end-to-end latency.
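The streaming handoff could be bridged with a Channel between the capture callback and the WebSocket sender; exposing the chunks as IAsyncEnumerable&lt;byte[]&gt; is an assumption drawn from the diagram's "Yield return byte[]" note:

```csharp
// Sketch: bridging DataAvailable callbacks to an async chunk stream.
using System.Collections.Generic;
using System.Threading.Channels;

public class StreamingHandoffSketch
{
    private readonly Channel<byte[]> _chunks = Channel.CreateUnbounded<byte[]>();

    // Called from DataAvailable for each ~300 ms PCM chunk.
    public void OnChunk(byte[] pcm) => _chunks.Writer.TryWrite(pcm);

    // Called on key release so the consumer loop can drain and finish.
    public void Complete() => _chunks.Writer.Complete();

    // Consumed by the streaming STT provider, which forwards each chunk
    // over the open WebSocket as soon as it arrives.
    public async IAsyncEnumerable<byte[]> StreamChunksAsync()
    {
        await foreach (var chunk in _chunks.Reader.ReadAllAsync())
            yield return chunk;
    }
}
```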