Deepgram Integration for Real-Time Transcription in Mobile Applications
Deepgram Nova-2 is one of the few streaming providers with genuinely low latency: median around 300 ms from the end of a phrase to text on screen. Whisper cannot do this in principle, because it transcribes complete audio chunks synchronously rather than a live stream. If the task is "user speaks, text appears on screen" with latency under one second, Deepgram is the tool.
Connection Protocol
Deepgram works via WebSocket. Endpoint:
wss://api.deepgram.com/v1/listen?model=nova-2&language=ru&encoding=linear16&sample_rate=16000&channels=1&interim_results=true
The parameters are critical: encoding=linear16 means raw 16-bit little-endian PCM. Sending any other format without an explicit codec parameter risks a 1008 Policy Violation close. interim_results=true enables partial results, and these are what create the real-time feel.
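The query string is ordinary URL-encoded key-value pairs, so it can be assembled from a map instead of hand-concatenated. A minimal sketch (buildListenUrl is a hypothetical helper, not part of any Deepgram SDK):

```kotlin
import java.net.URLEncoder

// Hypothetical helper: builds the Deepgram /v1/listen WebSocket URL
// from a parameter map, URL-encoding every key and value.
fun buildListenUrl(params: Map<String, String>): String {
    val query = params.entries.joinToString("&") { (k, v) ->
        "${URLEncoder.encode(k, "UTF-8")}=${URLEncoder.encode(v, "UTF-8")}"
    }
    return "wss://api.deepgram.com/v1/listen?$query"
}

fun main() {
    // Same parameter set as the endpoint above
    val url = buildListenUrl(linkedMapOf(
        "model" to "nova-2",
        "language" to "ru",
        "encoding" to "linear16",
        "sample_rate" to "16000",
        "channels" to "1",
        "interim_results" to "true"
    ))
    println(url)
}
```

A LinkedHashMap keeps the parameter order stable, which makes the resulting URLs easy to diff in logs.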
iOS: AVAudioEngine + URLSessionWebSocketTask
import AVFoundation

class DeepgramStreamer {
    private let apiKey: String
    private let audioEngine = AVAudioEngine()
    private var webSocket: URLSessionWebSocketTask?

    init(apiKey: String) {
        self.apiKey = apiKey
    }

    func start() throws {
        let session = URLSession(configuration: .default)
        var request = URLRequest(url: URL(string: "wss://api.deepgram.com/v1/listen?model=nova-2&language=ru&encoding=linear16&sample_rate=16000&channels=1&interim_results=true")!)
        request.setValue("Token \(apiKey)", forHTTPHeaderField: "Authorization")
        webSocket = session.webSocketTask(with: request)
        webSocket?.resume()
        receiveLoop()

        // Tap in the hardware format and resample to 16 kHz Int16 with AVAudioConverter.
        // Installing the tap with a sample rate the input node doesn't produce crashes AVAudioEngine.
        let inputNode = audioEngine.inputNode
        let inputFormat = inputNode.outputFormat(forBus: 0)
        let targetFormat = AVAudioFormat(commonFormat: .pcmFormatInt16, sampleRate: 16000, channels: 1, interleaved: false)!
        let converter = AVAudioConverter(from: inputFormat, to: targetFormat)!

        inputNode.installTap(onBus: 0, bufferSize: 4096, format: inputFormat) { [weak self] buffer, _ in
            let capacity = AVAudioFrameCount(Double(buffer.frameLength) * targetFormat.sampleRate / inputFormat.sampleRate)
            guard capacity > 0, let outBuffer = AVAudioPCMBuffer(pcmFormat: targetFormat, frameCapacity: capacity) else { return }
            var delivered = false
            var convError: NSError?
            converter.convert(to: outBuffer, error: &convError) { _, outStatus in
                // Hand the tap buffer to the converter exactly once per callback
                if delivered { outStatus.pointee = .noDataNow; return nil }
                delivered = true
                outStatus.pointee = .haveData
                return buffer
            }
            guard convError == nil, let channelData = outBuffer.int16ChannelData else { return }
            let data = Data(bytes: channelData[0], count: Int(outBuffer.frameLength) * 2)
            self?.webSocket?.send(.data(data)) { _ in }
        }
        try audioEngine.start()
    }

    private func receiveLoop() {
        webSocket?.receive { [weak self] result in
            if case .success(let message) = result, case .string(let text) = message {
                // Decode Deepgram JSON response
                self?.handleTranscript(text)
            }
            self?.receiveLoop()
        }
    }
}
Important detail: microphone access always requires user permission — AVAudioSession.sharedInstance().requestRecordPermission (on iOS 17+, AVAudioApplication.requestRecordPermission). And AVAudioSession.setCategory(.record, mode: .measurement) is mandatory: .measurement mode disables the system AEC and AGC processing, which helps voice calls but can distort the signal for transcription.
Android: AudioRecord + OkHttp WebSocket
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder
import java.nio.ByteBuffer
import java.nio.ByteOrder
import okhttp3.*
import okio.ByteString.Companion.toByteString
import org.json.JSONObject

class DeepgramStreamer(private val apiKey: String) {
    private val client = OkHttpClient()
    private var webSocket: WebSocket? = null
    private var audioRecord: AudioRecord? = null

    fun start(onTranscript: (String, Boolean) -> Unit) {
        val request = Request.Builder()
            .url("wss://api.deepgram.com/v1/listen?model=nova-2&language=ru&encoding=linear16&sample_rate=16000&channels=1&interim_results=true")
            .header("Authorization", "Token $apiKey")
            .build()
        webSocket = client.newWebSocket(request, object : WebSocketListener() {
            override fun onMessage(webSocket: WebSocket, text: String) {
                val json = JSONObject(text)
                // Deepgram also sends Metadata messages without a "channel" field — skip them
                val channel = json.optJSONObject("channel") ?: return
                val transcript = channel.getJSONArray("alternatives")
                    .getJSONObject(0).getString("transcript")
                val isFinal = json.optBoolean("is_final")
                if (transcript.isNotEmpty()) onTranscript(transcript, isFinal)
            }
        })
        // Requires the RECORD_AUDIO runtime permission
        val bufferSize = AudioRecord.getMinBufferSize(16000, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT)
        audioRecord = AudioRecord(MediaRecorder.AudioSource.MIC, 16000, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT, bufferSize)
        audioRecord?.startRecording()
        Thread {
            val buffer = ShortArray(bufferSize / 2)
            while (audioRecord?.recordingState == AudioRecord.RECORDSTATE_RECORDING) {
                val read = audioRecord?.read(buffer, 0, buffer.size) ?: break
                if (read > 0) {
                    // Pack samples as little-endian PCM before sending
                    val byteBuffer = ByteBuffer.allocate(read * 2).order(ByteOrder.LITTLE_ENDIAN)
                    for (i in 0 until read) byteBuffer.putShort(buffer[i])
                    webSocket?.send(byteBuffer.array().toByteString())
                }
            }
        }.start()
    }
}
ByteOrder.LITTLE_ENDIAN is mandatory: linear16 means little-endian PCM. With byte-swapped (big-endian) audio the connection still works, but recognition quality degrades noticeably, which makes the bug easy to miss in a quick test.
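The packing itself is a few lines of java.nio and can be verified in isolation. A minimal sketch (pcmToLittleEndianBytes is an illustrative helper name, not an SDK call):

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Converts a chunk of 16-bit PCM samples to the little-endian
// byte layout that Deepgram's linear16 encoding expects.
fun pcmToLittleEndianBytes(samples: ShortArray): ByteArray {
    val buf = ByteBuffer.allocate(samples.size * 2).order(ByteOrder.LITTLE_ENDIAN)
    for (s in samples) buf.putShort(s)
    return buf.array()
}

fun main() {
    // Sample 0x0102 must serialize as low byte first: 0x02, 0x01
    val bytes = pcmToLittleEndianBytes(shortArrayOf(0x0102.toShort()))
    println(bytes.joinToString { "0x%02x".format(it) })
}
```

A quick assertion like this in a unit test catches an accidental default (big-endian) ByteBuffer before it ever reaches production audio.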
What to Do with interim_results
Deepgram returns two result types: is_final: false (interim) and is_final: true (final). The right UI pattern:
- Display interim results in gray or italics, so the user sees recognition happening
- On is_final: true, replace all previous interims of this utterance with the final text
- speech_final: true marks the end of a pause, a good moment to start processing the phrase
A common mistake is accumulating every interim as a separate line, which causes duplication. Store the current interim in a buffer and update it in place.
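The interim-buffer pattern can be sketched as a small class (TranscriptBuffer is a hypothetical name; how the UI styles the interim tail is left out):

```kotlin
// Minimal sketch of the interim-buffer pattern: finalized text accumulates,
// the current interim overwrites itself in place instead of appending.
class TranscriptBuffer {
    private val finalized = StringBuilder()
    private var interim = ""

    fun onResult(transcript: String, isFinal: Boolean) {
        if (isFinal) {
            if (transcript.isNotEmpty()) {
                if (finalized.isNotEmpty()) finalized.append(' ')
                finalized.append(transcript)
            }
            interim = ""          // the final result supersedes all interims of this utterance
        } else {
            interim = transcript  // overwrite, never append — avoids duplicated lines
        }
    }

    // What the UI should render: final text plus the current (gray/italic) interim
    fun display(): String =
        if (interim.isEmpty()) finalized.toString()
        else if (finalized.isEmpty()) interim
        else "$finalized $interim"
}
```

Feeding the (transcript, isFinal) pairs from the WebSocket callback into onResult and re-rendering display() on each message gives the in-place update behavior described above.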
Nova-2 Parameters Affecting Quality
utterance_end_ms: 1000 — Deepgram itself signals the end of an utterance (an UtteranceEnd event) after 1 second of silence. Useful for dictation without an explicit "stop" command.
diarize: true — speaker separation, adds speaker to each word.
punctuate: true — auto-punctuation. Without it, the text arrives without periods or commas.
smart_format: true — formats numbers, dates, phones. "twenty-fifth March" → "25 March".
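With diarize: true, each word in the response carries a speaker index. A sketch of turning that into readable lines, assuming the words have already been parsed from the response into (word, speaker) pairs (DgWord and groupBySpeaker are illustrative names):

```kotlin
// Illustrative shape for one diarized word from the response
data class DgWord(val word: String, val speaker: Int)

// Groups consecutive words by speaker into "Speaker N: ..." lines
fun groupBySpeaker(words: List<DgWord>): List<String> {
    val lines = mutableListOf<String>()
    var current = -1
    val sb = StringBuilder()
    for (w in words) {
        if (w.speaker != current) {
            if (sb.isNotEmpty()) lines.add(sb.toString())
            sb.setLength(0)
            sb.append("Speaker ${w.speaker}: ")
            current = w.speaker
        } else {
            sb.append(' ')
        }
        sb.append(w.word)
    }
    if (sb.isNotEmpty()) lines.add(sb.toString())
    return lines
}
```

Grouping by consecutive runs (rather than sorting all words per speaker) preserves the conversational order, which is usually what a transcript view needs.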
Timeline
Basic integration (WebSocket + AudioRecord/AVAudioEngine + text output) takes 4–7 days. Adding diarization, network-switch handling (reconnect), background mode, and result export: 8–14 days.