Two-Way Audio with an IoT Device via a Mobile App
Intercoms, baby monitors, two-way radios: what they share is that the phone hears the device and talks to it at the same time. Unlike an ordinary VoIP call between two phones, one side here is an embedded device (an ESP32 microcontroller, or an embedded Linux board such as a Raspberry Pi or NXP i.MX) that has no SIP or WebRTC stack without additional software. This shapes the choice of architecture.
WebRTC: The Main Choice for Minimal Latency
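For natural conversation, round-trip latency (RTT) should stay below roughly 300-400 ms. HLS and RTMP are unsuitable, since their segment- and buffer-based delivery adds seconds of delay. SIP is possible but carries more protocol overhead. WebRTC was designed for exactly this scenario: low-latency bidirectional media with built-in NAT traversal (ICE) and voice processing.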
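Why segment-based delivery is ruled out comes down to simple arithmetic. Below is a minimal Swift sketch that sums a one-way latency budget and doubles it for the round trip. The component names and millisecond values are illustrative assumptions, not measurements from any particular stack, and the path is assumed to be symmetric.

```swift
// Hypothetical one-way latency components, in milliseconds (illustrative values, not measurements).
struct LatencyBudget {
    var captureBuffer: Double   // audio input buffering on the sender
    var codecFrame: Double      // a full codec frame (e.g. 20 ms of Opus) must be captured before encoding
    var network: Double         // one-way network transit
    var jitterBuffer: Double    // receiver-side buffering that absorbs packet jitter
    var playoutBuffer: Double   // audio output buffering on the receiver

    var oneWay: Double {
        captureBuffer + codecFrame + network + jitterBuffer + playoutBuffer
    }

    // Assumes a symmetric path: both directions have the same budget.
    var roundTrip: Double { 2 * oneWay }
}

// Real-time media path with a small adaptive jitter buffer.
let realTime = LatencyBudget(captureBuffer: 10, codecFrame: 20, network: 40,
                             jitterBuffer: 40, playoutBuffer: 10)
// Segment-based streaming: the player buffers, say, three 2-second segments.
let segmented = LatencyBudget(captureBuffer: 10, codecFrame: 20, network: 40,
                              jitterBuffer: 3 * 2000, playoutBuffer: 10)

for (name, budget) in [("real-time", realTime), ("segmented", segmented)] {
    print(name, budget.roundTrip, budget.roundTrip <= 400 ? "OK" : "too slow")
}
// prints "real-time 240.0 OK" then "segmented 12160.0 too slow"
```

With these assumed numbers, the real-time path lands at 240 ms, inside the 300-400 ms target. Buffering just three 2-second segments pushes the round trip past 12 seconds.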
On the IoT device side, use libwebrtc built for Linux or a lighter-weight stack: aiortc (Python), Pion (Go), or GStreamer's webrtcbin element. Pion is minimalist and easy to deploy on a Raspberry Pi. GStreamer webrtcbin makes sense if the device already uses GStreamer for its video pipeline.
Mobile app side:
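iOS: the WebRTC-SDK pod (pod 'WebRTC-SDK', a maintained alternative to the deprecated GoogleWebRTC pod) or a WebRTC.framework built from Google's sources. Create an RTCPeerConnection and add an audio track: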
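import WebRTC

// Assumes `factory` (RTCPeerConnectionFactory) and `peerConnection` (RTCPeerConnection) already exist.
let audioConstraints = RTCMediaConstraints(mandatoryConstraints: nil, optionalConstraints: nil)
let audioSource = factory.audioSource(with: audioConstraints)
let audioTrack = factory.audioTrack(with: audioSource, trackId: "audio0")
// Under Unified Plan this creates a sendrecv transceiver, so the connection also receives the device's audio.
peerConnection.add(audioTrack, streamIds: ["stream0"])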
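Configure RTCAudioSession with the .playAndRecord category and the .voiceChat mode, which tunes routing and equalization for voice. Echo cancellation and noise suppression (AEC/NS) then come from the system voice-processing audio unit that WebRTC uses on iOS.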
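Adding a track only covers the app's side of the negotiation. The device's SDP answer decides whether audio actually flows in both directions. Below is a minimal, dependency-free Swift sketch that reads the direction attribute and checks for Opus in the first audio section. The `AudioNegotiation` type and `inspectAudio(sdp:)` function are hypothetical names. In an app, the input would be the `sdp` string of the RTCSessionDescription received from the device.

```swift
import Foundation

// Hypothetical summary of the first audio section of an SDP.
struct AudioNegotiation {
    let direction: String   // effective direction after applying SDP defaults
    let hasOpus: Bool
    var isTwoWay: Bool { direction == "sendrecv" }
}

// Hypothetical helper: reads direction and codec info from SDP text (no WebRTC dependency).
func inspectAudio(sdp: String) -> AudioNegotiation {
    let directions: Set<String> = ["sendrecv", "sendonly", "recvonly", "inactive"]
    var sessionDirection: String? = nil   // a=<direction> before the first m= line
    var mediaDirection: String? = nil     // a=<direction> inside the audio section
    var inSession = true
    var inAudio = false
    var hasOpus = false

    // Character.isNewline also matches "\r\n", which Swift treats as a single Character.
    for raw in sdp.split(whereSeparator: { $0.isNewline }) {
        let line = String(raw)
        if line.hasPrefix("m=") {
            if inAudio { break }              // only the first audio section matters here
            inSession = false
            inAudio = line.hasPrefix("m=audio ")
            continue
        }
        guard line.hasPrefix("a=") else { continue }
        let attribute = String(line.dropFirst(2))
        if directions.contains(attribute) {
            if inSession { sessionDirection = attribute }
            if inAudio { mediaDirection = attribute }
        } else if inAudio && attribute.lowercased().hasPrefix("rtpmap:")
                    && attribute.lowercased().contains(" opus/48000") {
            hasOpus = true
        }
    }
    // Media-level direction overrides session-level; SDP defaults to sendrecv.
    return AudioNegotiation(direction: mediaDirection ?? sessionDirection ?? "sendrecv",
                            hasOpus: hasOpus)
}
```

Directions in the answer describe the device's point of view. A device that answers `sendonly` will not play anything the phone sends.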
Android: io.getstream:stream-webrtc-android, or Google's WebRTC built from source (Google no longer publishes the prebuilt org.webrtc:google-webrtc artifact). Setting AudioManager to MODE_IN_COMMUNICATION is mandatory for correct audio routing (earpiece/speakerphone).
Flutter: flutter_webrtc. Configure mediaConstraints for audio:
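import 'package:flutter_webrtc/flutter_webrtc.dart';

Future<MediaStream> openMicrophone() async {
  // Illustrative helper: W3C-style constraints requesting echo cancellation, noise suppression and AGC.
  final Map<String, dynamic> mediaConstraints = {
    'audio': {
      'echoCancellation': true,
      'noiseSuppression': true,
      'autoGainControl': true,
    },
    'video': false, // audio-only session
  };
  return navigator.mediaDevices.getUserMedia(mediaConstraints);
}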
Echo Cancellation: The Main Pain Point of Two-Way Audio
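Without AEC (acoustic echo cancellation), the phone's microphone picks up the sound from its own speaker (or the device hears itself). The person on the other end then hears their own voice coming back after the round-trip delay, around 200 ms. That is unusable.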
WebRTC includes AEC3, its third-generation echo canceller, which works automatically once the audio session is configured correctly. Problems arise when:
- The IoT device has no echo reference path (its echo canceller cannot see the signal being sent to its own speaker), so device-side AEC is ineffective. Solution: move AEC to the server side (a media server with audio processing enabled).
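- Bluetooth headset + WebRTC on Android: AudioManager in MODE_IN_COMMUNICATION switches the headset to the HFP profile, which carries narrowband 8 kHz audio (16 kHz on headsets that support HFP wideband speech). Higher-quality audio needs A2DP, but A2DP is output-only and cannot carry the microphone. The compromise is to either accept lower quality over Bluetooth or use wired headphones.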
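An echo canceller predicts the echo from the reference signal (what is being played through the speaker) and subtracts that prediction from the microphone signal. This is why a missing reference path breaks device-side AEC. Below is a minimal Swift sketch of the idea using a classic NLMS (normalized least mean squares) adaptive filter. It is not AEC3, which adds delay estimation and residual echo suppression, among other stages. The echo path in the simulation (two delayed, attenuated copies of the far-end signal) is a hypothetical stand-in for a room.

```swift
/// Simplified NLMS echo canceller: learns the echo path from the reference signal.
struct NLMSEchoCanceller {
    private var weights: [Double]      // current estimate of the echo path (impulse response)
    private var history: [Double]      // recent reference samples, newest first
    private let stepSize: Double       // adaptation rate mu, stable for 0 < mu < 2
    private let regularization = 1e-6  // avoids division by zero during silence

    init(taps: Int, stepSize: Double = 0.5) {
        weights = Array(repeating: 0, count: taps)
        history = Array(repeating: 0, count: taps)
        self.stepSize = stepSize
    }

    /// reference: sample sent to the loudspeaker; mic: captured sample (near-end speech + echo).
    /// Returns the mic sample with the estimated echo subtracted.
    mutating func process(reference: Double, mic: Double) -> Double {
        history.removeLast()
        history.insert(reference, at: 0)

        var echoEstimate = 0.0
        var energy = regularization
        for i in weights.indices {
            echoEstimate += weights[i] * history[i]
            energy += history[i] * history[i]
        }

        let error = mic - echoEstimate
        // Normalized update: the step is scaled by the reference signal energy.
        let gain = stepSize * error / energy
        for i in weights.indices {
            weights[i] += gain * history[i]
        }
        return error
    }
}

/// Deterministic pseudo-random far-end signal (linear congruential generator) for the simulation.
struct LCG {
    var state: UInt32
    mutating func next() -> Double {
        state = state &* 1_664_525 &+ 1_013_904_223
        return Double(state >> 8) / 16_777_216 - 0.5   // uniform in [-0.5, 0.5)
    }
}

var canceller = NLMSEchoCanceller(taps: 8)
var rng = LCG(state: 42)
var played = [Double](repeating: 0, count: 8)   // what the loudspeaker has played, newest first
var echoEnergy = 0.0
var residualEnergy = 0.0

for n in 0..<4000 {
    let reference = rng.next()
    played.removeLast()
    played.insert(reference, at: 0)
    // Hypothetical room: the echo arrives 2 and 5 samples late, attenuated.
    let echo = 0.6 * played[2] + 0.3 * played[5]
    let cleaned = canceller.process(reference: reference, mic: echo)
    if n >= 3800 {                               // measure after the filter has converged
        echoEnergy += echo * echo
        residualEnergy += cleaned * cleaned
    }
}
print("residual/echo energy ratio:", residualEnergy / echoEnergy)
```

If the `reference` argument were replaced by silence, which is what a device without a reference path effectively sees, the filter would have nothing to correlate with and the echo would pass through untouched.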
SIP as an Alternative
If the IoT device supports SIP (as many IP intercoms from Grandstream, Panasonic, and Commax do), use a SIP client in the mobile app.
iOS: PJSIP (a C library) with a Swift wrapper, or the Linphone SDK. Android: MjSip, PJSIP via JNI, or the ready-made Linphone SDK for Android. Flutter: sip_ua (a Dart SIP stack that uses WebSocket transport, so the server must accept SIP over WebSocket).
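On mobile, SIP requires registration with a server such as Asterisk or FreeSWITCH. The call flow is: intercom → SIP INVITE → server → push notification to the phone (PushKit VoIP push on iOS, FCM on Android) → incoming-call UI (CallKit on iOS, ConnectionService or a full-screen incoming-call notification on Android). Without push, the call never reaches the phone while the app is closed, because a closed app cannot keep its SIP registration alive.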
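CallKit (iOS): the incoming call looks like an ordinary phone call, with the full-screen system interface showing the intercom's name. The standard integration uses CXProvider (reportNewIncomingCall(with:update:completion:)) and CXCallUpdate. The voip background mode in Info.plist and an APNs VoIP certificate are mandatory. Since iOS 13, every VoIP push must be reported to CallKit immediately, or the system terminates the app.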
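To show the intercom's name, the app needs the caller's display name, which SIP carries in the From header of the INVITE. Below is a minimal Swift sketch that extracts it for use as CXCallUpdate's localizedCallerName. The function names are hypothetical. It handles the quoted and unquoted display-name forms and the compact header name `f`, but not escaped quotes or folded (multi-line) headers.

```swift
/// Trims spaces and tabs from both ends of a header fragment.
func trimWhitespace(_ text: Substring) -> String {
    var chars = text
    while let first = chars.first, first == " " || first == "\t" { chars = chars.dropFirst() }
    while let last = chars.last, last == " " || last == "\t" { chars = chars.dropLast() }
    return String(chars)
}

/// Hypothetical helper: returns the display name from the From header of a SIP message, if any.
func callerDisplayName(fromInvite message: String) -> String? {
    // "\r\n" is one Character in Swift, and isNewline matches it.
    let lines = message.split(omittingEmptySubsequences: false, whereSeparator: { $0.isNewline })
    for line in lines.dropFirst() {                  // skip the request line
        if line.isEmpty { break }                    // a blank line ends the header section
        guard let colon = line.firstIndex(of: ":") else { continue }
        let headerName = trimWhitespace(line[..<colon]).lowercased()
        guard headerName == "from" || headerName == "f" else { continue }  // "f" is the compact form

        let value = line[line.index(after: colon)...]
        if let openQuote = value.firstIndex(of: "\"") {
            // Quoted form: From: "Front Door" <sip:door@example.com>;tag=...
            let rest = value[value.index(after: openQuote)...]
            guard let closeQuote = rest.firstIndex(of: "\"") else { return nil }
            return String(rest[..<closeQuote])
        }
        if let angle = value.firstIndex(of: "<") {
            // Unquoted form: From: Gate <sip:gate@example.com>
            let display = trimWhitespace(value[..<angle])
            return display.isEmpty ? nil : display
        }
        return nil                                   // bare URI: no display name present
    }
    return nil
}
```

When it returns nil, the app can fall back to a name stored for that intercom. If the server forwards the INVITE details inside the push payload, the same parsing applies to that string.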
Android ConnectionService is the CallKit analogue: TelecomManager.addNewIncomingCall() shows the system incoming-call interface. It is available on Android 6.0 (API 23) and later.
Environmental Noise and Aggressive Noise Suppression
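Wind outdoors or construction nearby makes the IoT device send a noisy stream. WebRTC's noise suppression processes the captured signal on the sending side, so extra suppression has to run on the device (or on a server). In native WebRTC it is configured through AudioProcessing::Config::NoiseSuppression, whose level can be raised up to kVeryHigh. That suppressor is WebRTC's own; RNNoise is a separate, neural-network-based library.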
For heavier processing on the server, the Janus AudioBridge plugin (janus.plugin.audiobridge) can apply noise suppression (RNNoise, when Janus is built with it) before mixing.
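The suppressors named above are statistical or neural and far more capable. The simplest form of noise reduction is a gate that attenuates frames whose level falls below a threshold. Below is a minimal Swift sketch of such a noise gate. The threshold, gain, and 10 ms / 48 kHz frame size are illustrative assumptions, and this is not equivalent to WebRTC's noise suppressor or RNNoise.

```swift
import Foundation

/// A deliberately simple noise gate: frames whose RMS level is below a threshold are attenuated.
/// Hypothetical parameters; not equivalent to WebRTC's noise suppressor or RNNoise.
struct NoiseGate {
    let thresholdDBFS: Double   // frames quieter than this are treated as background noise
    let gatedGain: Float        // linear gain applied to gated frames (0.1 is about -20 dB)

    /// RMS level relative to full scale (1.0), in dB.
    static func levelDBFS(_ frame: [Float]) -> Double {
        guard !frame.isEmpty else { return -.infinity }
        let meanSquare = frame.reduce(0.0) { $0 + Double($1) * Double($1) } / Double(frame.count)
        return meanSquare > 0 ? 10 * log10(meanSquare) : -.infinity
    }

    func process(_ frame: [Float]) -> [Float] {
        NoiseGate.levelDBFS(frame) < thresholdDBFS ? frame.map { $0 * gatedGain } : frame
    }
}

let gate = NoiseGate(thresholdDBFS: -45, gatedGain: 0.1)
let hiss = [Float](repeating: 0.001, count: 480)   // about -60 dBFS; 480 samples = 10 ms at 48 kHz
let voice = [Float](repeating: 0.5, count: 480)    // about -6 dBFS
let gatedHiss = gate.process(hiss)
let passedVoice = gate.process(voice)
print(NoiseGate.levelDBFS(hiss), NoiseGate.levelDBFS(voice))
```

A gate like this also mutes quiet speech and does nothing against wind noise that overlaps speech. That limitation is why real deployments use spectral or neural suppressors instead.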
Testing
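The main problem is reproducing NAT traversal in a test environment. Use Coturn in Docker for local TURN testing. Testing behind a symmetric NAT (for example, a corporate network with strict rules) is mandatory, since STUN alone usually cannot traverse it and only a TURN relay gets media through. Without a TURN server, roughly 15-20% of connections fail to establish.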
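Before chasing ICE failures, it is worth confirming that the local Coturn container answers plain STUN at all, and seeing which address it reports back. Below is a minimal, dependency-free Swift sketch of both halves of a STUN Binding exchange (RFC 5389): building the 20-byte request, and decoding the XOR-MAPPED-ADDRESS attribute from the success response. Only IPv4 is handled. The UDP send/receive (to Coturn's default port 3478) is left out, and the `STUN` namespace and function names are hypothetical.

```swift
// Hypothetical helpers for a STUN Binding exchange (RFC 5389), IPv4 only.
enum STUN {
    static let magicCookie: [UInt8] = [0x21, 0x12, 0xA4, 0x42]

    /// Builds a 20-byte Binding Request with no attributes.
    static func bindingRequest(transactionID: [UInt8]) -> [UInt8] {
        precondition(transactionID.count == 12, "transaction ID must be 96 bits")
        var message: [UInt8] = [0x00, 0x01,   // message type: Binding Request
                                0x00, 0x00]   // message length: no attributes
        message += magicCookie
        message += transactionID
        return message
    }

    /// Decodes the XOR-MAPPED-ADDRESS attribute (type 0x0020) of a Binding Success Response.
    static func xorMappedAddress(in response: [UInt8]) -> (ip: String, port: UInt16)? {
        // Header: type 0x0101 (Binding Success Response), length, magic cookie, transaction ID.
        guard response.count >= 20, response[0] == 0x01, response[1] == 0x01,
              Array(response[4..<8]) == magicCookie else { return nil }
        let bodyEnd = 20 + (Int(response[2]) << 8 | Int(response[3]))
        guard response.count >= bodyEnd else { return nil }

        var offset = 20
        while offset + 4 <= bodyEnd {
            let attrType = Int(response[offset]) << 8 | Int(response[offset + 1])
            let attrLength = Int(response[offset + 2]) << 8 | Int(response[offset + 3])
            let value = offset + 4
            guard value + attrLength <= bodyEnd else { return nil }
            // Value layout: reserved byte, family (0x01 = IPv4), X-Port, X-Address.
            if attrType == 0x0020, attrLength >= 8, response[value + 1] == 0x01 {
                // X-Port is XORed with the cookie's top 16 bits, X-Address with the whole cookie.
                let port = UInt16(response[value + 2] ^ magicCookie[0]) << 8
                    | UInt16(response[value + 3] ^ magicCookie[1])
                let ip = (0..<4)
                    .map { String(response[value + 4 + $0] ^ magicCookie[$0]) }
                    .joined(separator: ".")
                return (ip: ip, port: port)
            }
            offset = value + (attrLength + 3) / 4 * 4   // attribute values are padded to 4 bytes
        }
        return nil
    }
}

// Usage with a synthetic success response reporting 203.0.113.5:54321.
let transactionID = [UInt8](repeating: 7, count: 12)
let request = STUN.bindingRequest(transactionID: transactionID)
var response: [UInt8] = [0x01, 0x01, 0x00, 0x0C]   // Binding Success Response, 12-byte body
response += STUN.magicCookie
response += transactionID
response += [0x00, 0x20, 0x00, 0x08, 0x00, 0x01, 0xF5, 0x23, 234, 18, 213, 71]
if let mapped = STUN.xorMappedAddress(in: response) {
    print("mapped address: \(mapped.ip):\(mapped.port)")   // prints "mapped address: 203.0.113.5:54321"
}
```

Suppose requests sent from the same local port to two different server addresses come back with different mapped addresses. In that case the NAT is likely symmetric, which is exactly the situation that needs a TURN relay.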
Timeline: two-way WebRTC audio between an IoT device (Linux/Pion) and an iOS or Android client takes 5-7 working days; with SIP integration and CallKit, 8-12 days.