Deepgram Integration for Real-Time Transcription in Mobile Applications
Deepgram Nova-2 is one of the few streaming providers with genuinely low latency: median around 300 ms from the end of a phrase to text on screen. Whisper cannot do this in principle, because it transcribes complete audio chunks synchronously rather than a live stream. If the task is "user speaks, text appears on screen" with latency under one second, Deepgram is the tool.
Connection Protocol
Deepgram works via WebSocket. Endpoint:
wss://api.deepgram.com/v1/listen?model=nova-2&language=ru&encoding=linear16&sample_rate=16000&channels=1&interim_results=true
The parameters are critical: encoding=linear16 means raw 16-bit little-endian PCM. Sending any other format without an explicit codec parameter risks a 1008 Policy Violation close. interim_results=true enables partial results, and these are what create the real-time feel.
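The query string is ordinary URL-encoded key-value pairs, so it can be assembled from a map instead of hand-concatenated. A minimal sketch (buildListenUrl is a hypothetical helper, not part of any Deepgram SDK):

```kotlin
import java.net.URLEncoder

// Hypothetical helper: builds the Deepgram /v1/listen WebSocket URL
// from a parameter map, URL-encoding every key and value.
fun buildListenUrl(params: Map<String, String>): String {
    val query = params.entries.joinToString("&") { (k, v) ->
        "${URLEncoder.encode(k, "UTF-8")}=${URLEncoder.encode(v, "UTF-8")}"
    }
    return "wss://api.deepgram.com/v1/listen?$query"
}

fun main() {
    // Same parameter set as the endpoint above
    val url = buildListenUrl(linkedMapOf(
        "model" to "nova-2",
        "language" to "ru",
        "encoding" to "linear16",
        "sample_rate" to "16000",
        "channels" to "1",
        "interim_results" to "true"
    ))
    println(url)
}
```

A LinkedHashMap keeps the parameter order stable, which makes the resulting URLs easy to diff in logs.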
iOS: AVAudioEngine + URLSessionWebSocketTask
import AVFoundation

class DeepgramStreamer {
    private let apiKey: String
    private let audioEngine = AVAudioEngine()
    private var webSocket: URLSessionWebSocketTask?

    init(apiKey: String) {
        self.apiKey = apiKey
    }

    func start() throws {
        let session = URLSession(configuration: .default)
        var request = URLRequest(url: URL(string: "wss://api.deepgram.com/v1/listen?model=nova-2&language=ru&encoding=linear16&sample_rate=16000&channels=1&interim_results=true")!)
        request.setValue("Token \(apiKey)", forHTTPHeaderField: "Authorization")
        webSocket = session.webSocketTask(with: request)
        webSocket?.resume()
        receiveLoop()

        // Tap in the hardware format and resample to 16 kHz Int16 with AVAudioConverter.
        // Installing the tap with a sample rate the input node doesn't produce crashes AVAudioEngine.
        let inputNode = audioEngine.inputNode
        let inputFormat = inputNode.outputFormat(forBus: 0)
        let targetFormat = AVAudioFormat(commonFormat: .pcmFormatInt16, sampleRate: 16000, channels: 1, interleaved: false)!
        let converter = AVAudioConverter(from: inputFormat, to: targetFormat)!

        inputNode.installTap(onBus: 0, bufferSize: 4096, format: inputFormat) { [weak self] buffer, _ in
            let capacity = AVAudioFrameCount(Double(buffer.frameLength) * targetFormat.sampleRate / inputFormat.sampleRate)
            guard capacity > 0, let outBuffer = AVAudioPCMBuffer(pcmFormat: targetFormat, frameCapacity: capacity) else { return }
            var delivered = false
            var convError: NSError?
            converter.convert(to: outBuffer, error: &convError) { _, outStatus in
                // Hand the tap buffer to the converter exactly once per callback
                if delivered { outStatus.pointee = .noDataNow; return nil }
                delivered = true
                outStatus.pointee = .haveData
                return buffer
            }
            guard convError == nil, let channelData = outBuffer.int16ChannelData else { return }
            let data = Data(bytes: channelData[0], count: Int(outBuffer.frameLength) * 2)
            self?.webSocket?.send(.data(data)) { _ in }
        }
        try audioEngine.start()
    }

    private func receiveLoop() {
        webSocket?.receive { [weak self] result in
            if case .success(let message) = result, case .string(let text) = message {
                // Decode Deepgram JSON response
                self?.handleTranscript(text)
            }
            self?.receiveLoop()
        }
    }
}
Important detail: microphone access always requires user permission — AVAudioSession.sharedInstance().requestRecordPermission (on iOS 17+, AVAudioApplication.requestRecordPermission). And AVAudioSession.setCategory(.record, mode: .measurement) is mandatory: .measurement mode disables the system AEC and AGC processing, which helps voice calls but can distort the signal for transcription.
Android: AudioRecord + OkHttp WebSocket
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder
import java.nio.ByteBuffer
import java.nio.ByteOrder
import okhttp3.*
import okio.ByteString.Companion.toByteString
import org.json.JSONObject

class DeepgramStreamer(private val apiKey: String) {
    private val client = OkHttpClient()
    private var webSocket: WebSocket? = null
    private var audioRecord: AudioRecord? = null

    fun start(onTranscript: (String, Boolean) -> Unit) {
        val request = Request.Builder()
            .url("wss://api.deepgram.com/v1/listen?model=nova-2&language=ru&encoding=linear16&sample_rate=16000&channels=1&interim_results=true")
            .header("Authorization", "Token $apiKey")
            .build()
        webSocket = client.newWebSocket(request, object : WebSocketListener() {
            override fun onMessage(webSocket: WebSocket, text: String) {
                val json = JSONObject(text)
                // Deepgram also sends Metadata messages without a "channel" field — skip them
                val channel = json.optJSONObject("channel") ?: return
                val transcript = channel.getJSONArray("alternatives")
                    .getJSONObject(0).getString("transcript")
                val isFinal = json.optBoolean("is_final")
                if (transcript.isNotEmpty()) onTranscript(transcript, isFinal)
            }
        })
        // Requires the RECORD_AUDIO runtime permission
        val bufferSize = AudioRecord.getMinBufferSize(16000, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT)
        audioRecord = AudioRecord(MediaRecorder.AudioSource.MIC, 16000, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT, bufferSize)
        audioRecord?.startRecording()
        Thread {
            val buffer = ShortArray(bufferSize / 2)
            while (audioRecord?.recordingState == AudioRecord.RECORDSTATE_RECORDING) {
                val read = audioRecord?.read(buffer, 0, buffer.size) ?: break
                if (read > 0) {
                    // Pack samples as little-endian PCM before sending
                    val byteBuffer = ByteBuffer.allocate(read * 2).order(ByteOrder.LITTLE_ENDIAN)
                    for (i in 0 until read) byteBuffer.putShort(buffer[i])
                    webSocket?.send(byteBuffer.array().toByteString())
                }
            }
        }.start()
    }
}
ByteOrder.LITTLE_ENDIAN is mandatory: linear16 means little-endian PCM. With byte-swapped (big-endian) audio the connection still works, but recognition quality degrades noticeably, which makes the bug easy to miss in a quick test.
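The packing itself is a few lines of java.nio and can be verified in isolation. A minimal sketch (pcmToLittleEndianBytes is an illustrative helper name, not an SDK call):

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Converts a chunk of 16-bit PCM samples to the little-endian
// byte layout that Deepgram's linear16 encoding expects.
fun pcmToLittleEndianBytes(samples: ShortArray): ByteArray {
    val buf = ByteBuffer.allocate(samples.size * 2).order(ByteOrder.LITTLE_ENDIAN)
    for (s in samples) buf.putShort(s)
    return buf.array()
}

fun main() {
    // Sample 0x0102 must serialize as low byte first: 0x02, 0x01
    val bytes = pcmToLittleEndianBytes(shortArrayOf(0x0102.toShort()))
    println(bytes.joinToString { "0x%02x".format(it) })
}
```

A quick assertion like this in a unit test catches an accidental default (big-endian) ByteBuffer before it ever reaches production audio.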
What to Do with interim_results
Deepgram returns two result types: is_final: false (interim) and is_final: true (final). The right UI pattern:
- Display interim results in gray or italics, so the user sees recognition happening
- On is_final: true, replace all previous interims of this utterance with the final text
- speech_final: true marks the end of a pause, a good moment to start processing the phrase
A common mistake is accumulating every interim as a separate line, which causes duplication. Store the current interim in a buffer and update it in place.
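The interim-buffer pattern can be sketched as a small class (TranscriptBuffer is a hypothetical name; how the UI styles the interim tail is left out):

```kotlin
// Minimal sketch of the interim-buffer pattern: finalized text accumulates,
// the current interim overwrites itself in place instead of appending.
class TranscriptBuffer {
    private val finalized = StringBuilder()
    private var interim = ""

    fun onResult(transcript: String, isFinal: Boolean) {
        if (isFinal) {
            if (transcript.isNotEmpty()) {
                if (finalized.isNotEmpty()) finalized.append(' ')
                finalized.append(transcript)
            }
            interim = ""          // the final result supersedes all interims of this utterance
        } else {
            interim = transcript  // overwrite, never append — avoids duplicated lines
        }
    }

    // What the UI should render: final text plus the current (gray/italic) interim
    fun display(): String =
        if (interim.isEmpty()) finalized.toString()
        else if (finalized.isEmpty()) interim
        else "$finalized $interim"
}
```

Feeding the (transcript, isFinal) pairs from the WebSocket callback into onResult and re-rendering display() on each message gives the in-place update behavior described above.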
Nova-2 Parameters Affecting Quality
utterance_end_ms: 1000 — Deepgram itself signals the end of an utterance (an UtteranceEnd event) after 1 second of silence. Useful for dictation without an explicit "stop" command.
diarize: true — speaker separation, adds speaker to each word.
punctuate: true — auto-punctuation. Without it, the text arrives without periods or commas.
smart_format: true — formats numbers, dates, phones. "twenty-fifth March" → "25 March".
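With diarize: true, each word in the response carries a speaker index. A sketch of turning that into readable lines, assuming the words have already been parsed from the response into (word, speaker) pairs (DgWord and groupBySpeaker are illustrative names):

```kotlin
// Illustrative shape for one diarized word from the response
data class DgWord(val word: String, val speaker: Int)

// Groups consecutive words by speaker into "Speaker N: ..." lines
fun groupBySpeaker(words: List<DgWord>): List<String> {
    val lines = mutableListOf<String>()
    var current = -1
    val sb = StringBuilder()
    for (w in words) {
        if (w.speaker != current) {
            if (sb.isNotEmpty()) lines.add(sb.toString())
            sb.setLength(0)
            sb.append("Speaker ${w.speaker}: ")
            current = w.speaker
        } else {
            sb.append(' ')
        }
        sb.append(w.word)
    }
    if (sb.isNotEmpty()) lines.add(sb.toString())
    return lines
}
```

Grouping by consecutive runs (rather than sorting all words per speaker) preserves the conversational order, which is usually what a transcript view needs.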
Timeline
Basic integration (WebSocket + AudioRecord/AVAudioEngine + text output) takes 4–7 days. Adding diarization, network-switch handling (reconnect), background mode, and result export: 8–14 days.