How Voice Data Travels: With Internet vs Without Internet
A developer's deep dive into what actually happens when you make a phone call

So you're building a voice call feature in your app. You pick up a library, maybe WebRTC or a third-party SDK, and things just... work. But then a question hits you mid-implementation:

"Wait, how is voice data actually being sent? And how is this different from a regular phone call?"

That exact thought led me down a rabbit hole. This article breaks it all down in plain English, with real technical depth underneath.

When you speak into a phone, your voice is just air vibrations (an analog signal). Before it can travel anywhere, through towers or the internet, it must be converted into digital data. Both call types do this. The difference is how that data travels afterward.

            Your Voice (Analog)
                    ↓
            Digitize + Compress
                    ↓
    ┌─────────────┐     ┌─────────────────┐
    │ No Internet │     │  With Internet  │
    │  GSM/VoLTE  │     │   WebRTC/VoIP   │
    └─────────────┘     └─────────────────┘

A regular phone call uses your telecom operator's infrastructure (towers, cables, switching centers), completely independent of the internet.

    You speak 🎤
        ↓
    Microphone captures analog audio
        ↓
    ADC (Analog-to-Digital Converter) → digital signal
        ↓
    Codec compresses it (AMR / AMR-WB / EVS)
        ↓
    Sent to nearest Cell Tower 📡
        ↓
    Telecom Core Network (routes the call)
        ↓
    Receiver's Cell Tower 📡
        ↓
    Receiver's phone decodes → plays audio

AMR (Adaptive Multi-Rate) is the compression codec used in traditional calls. It's smart: it adapts the bitrate based on network conditions.

| AMR Mode          | Bitrate    | Quality              |
|-------------------|------------|----------------------|
| AMR 4.75          | 4.75 kbps  | Low (weak signal)    |
| AMR 12.2          | 12.2 kbps  | High (strong signal) |
| AMR-WB (HD Voice) | 23.85 kbps | HD quality           |

Under the hood, voice is not sent as one big audio file. It's split into tiny chunks (frames), and each chunk represents about 20 milliseconds of audio.

    [20ms chunk] → [20ms chunk] → [20ms chunk] → [20ms chunk] → ...
         #1             #2             #3             #4

Each frame looks something like this conceptually:

    {
      "type": "voice_frame",
      "codec": "AMR",
      "sequence": 101,
      "timestamp": 2003400,
      "payload": ""
    }

⚠️ In reality it's binary, not JSON, but this structure represents what's inside each packet.

Old GSM (2G/3G): Circuit Switching
- A dedicated "pipe" is reserved just for your call
- Like booking a private road: no one else uses it during your call
- Very stable, but inefficient (resources wasted during silence)

VoLTE (4G/5G): Packet Switching (but controlled)
- Voice is broken into packets like internet data
- But the network gives it priority (QoS, Quality of Service)
- Lower latency, HD quality, still uses telecom infrastructure

Apps like WhatsApp, Google Meet, and Discord use the internet to carry voice. The key technology here is WebRTC (Web Real-Time Communication), an open standard built into browsers and available to native mobile apps.

    You speak 🎤
        ↓
    Microphone captures analog audio
        ↓
    ADC → digital signal
        ↓
    Opus Codec compresses it
        ↓
    Packetized into UDP packets
        ↓
    Sent via Internet (WiFi / 4G / 5G)
        ↓
    STUN/TURN Server (for NAT traversal)
        ↓
    Peer-to-Peer connection (WebRTC)
        ↓
    Receiver reassembles packets → decodes → plays audio

Opus is the go-to codec for internet voice/audio. It's open-source, low-latency, and adaptive.

| Feature                | Opus                            |
|------------------------|---------------------------------|
| Bitrate range          | 6 kbps to 510 kbps              |
| Latency                | ~20 ms (typical frame duration) |
| Handles packet loss?   | ✅ Yes (built-in FEC)           |
| Quality at low bitrate | Excellent                       |
| Used by                | WhatsApp, Discord, Zoom, WebRTC |

Opus has Forward Error Correction (FEC) built in, meaning it sends redundant data so that if a packet is lost, the decoder can still reconstruct the audio. That's why internet calls still sound okay even with minor packet loss.

Choosing between TCP and UDP is one of the most important decisions in real-time audio.
TCP (used in HTTP, file downloads):
- Guarantees delivery: if a packet is lost, it resends it
- Problem: resending takes time → delay → unacceptable in real-time voice

UDP (used in WebRTC voice):
- No guarantee of delivery
- No resending of lost packets
- But it's fast: packets go out and don't wait

In voice calls, a 200 ms old audio packet is useless anyway. Better to skip it and keep playing forward than wait for a retry.

    TCP mindset: "Wait, I need packet #47 before I continue"  ❌ (for voice)
    UDP mindset: "Packet #47 is gone? Fine, move on."         ✅ (for voice)

Setting up a WebRTC connection involves these steps:

1. Signaling: both peers exchange metadata (IP, codec support) via a server
2. ICE (Interactive Connectivity Establishment): finding the best network path
3. STUN Server: figures out your public IP (you're usually behind a router/NAT)
4. TURN Server: relays traffic if direct P2P fails (firewall situations)
5. DTLS Handshake: encrypted connection established
6. SRTP: voice packets flow securely, peer-to-peer

    Caller              Signaling Server              Receiver
      |                        |                         |
      |----offer (SDP)-------->|                         |
      |                        |-------offer (SDP)------>|
      |                        |<------answer (SDP)------|
      |<---answer (SDP)--------|                         |

Conceptually, each audio packet looks like this (binary on the wire, shown as JSON for readability):

    {
      "type": "audio_packet",
      "codec": "opus",
      "ssrc": 3892741023,
      "sequence": 4821,
      "timestamp": 96000,
      "payload": ""
    }

This is an RTP (Real-time Transport Protocol) packet. WebRTC wraps it in SRTP (Secure RTP) for encryption.

Here is how the two call types compare:

| Feature              | Normal Call                          | Internet Call                                                     |
|----------------------|--------------------------------------|-------------------------------------------------------------------|
| Network              | Telecom (Jio, Airtel)                | Internet (WiFi / Mobile data)                                     |
| Protocol             | GSM / VoLTE                          | WebRTC (RTP over UDP)                                             |
| Codec                | AMR / AMR-WB / EVS                   | Opus                                                              |
| Latency              | ~100-150 ms                          | ~150-300 ms (network-dependent)                                   |
| Data path            | Operator controlled                  | Peer-to-peer (mostly)                                             |
| Delivery             | Reserved or prioritized (circuit/QoS) | Best-effort (UDP)                                                |
| Encryption           | Limited (operator can see)           | Encrypted with DTLS + SRTP (end-to-end when the call is peer-to-peer) |
| Packet loss handling | Network-level QoS                    | Opus FEC + NACK                                                   |
| Works without data?  | ✅ Yes                               | ❌ No                                                             |
| Cost                 | Per minute or bundled                | Uses ~0.3-0.5 MB/min                                              |
| Emergency calls      | ✅ Works                             | ❌ Cannot call 112/911                                            |

Ever heard someone sound like a robot during a WhatsApp call? Here's exactly why:

Some UDP packets don't arrive.
If too many are lost in a row, the audio decoder has gaps → robotic or stuttering sound.

Packets can also arrive out of order or unevenly spaced (jitter). WebRTC uses a jitter buffer to smooth this out, but if jitter is too high, packets miss their playout slot and the audio gets chopped.

    Sent:     [P1]--[P2]--[P3]--[P4]--[P5]
    Received: [P1]------[P3][P2]----[P5]   ← P4 lost, P2/P3 swapped

When you're moving (driving, walking), your phone switches between towers or between WiFi and 4G. During handoff, packets drop → brief audio glitch.

Your internet connection is shared. If someone starts a big download in parallel, your voice packets compete for bandwidth → delay spikes.

If you're building a voice feature, here are the key decisions:

Use WebRTC if:
- Building for a web/mobile app
- Need P2P, low cost at scale
- Want E2E encryption
- Don't need emergency call support

Use VoIP / SIP if:
- Need PSTN (real phone number) integration
- Need to call regular phones
- Enterprise telephony

Use a managed SDK if:
- Fast shipping matters
- Examples: Twilio, Agora, Daily.co, Vonage

A minimal browser-side setup for a WebRTC audio call looks like this:

    // Run inside an async function or an ES module (the snippet uses await).

    // Get the user's microphone
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

    // Create the peer connection, using a public STUN server for NAT discovery
    const pc = new RTCPeerConnection({
      iceServers: [{ urls: 'stun:stun.l.google.com:19302' }]
    });

    // Add the audio track to the connection
    stream.getTracks().forEach(track => pc.addTrack(track, stream));

    // Create and send the offer
    const offer = await pc.createOffer();
    await pc.setLocalDescription(offer);
    // → Send the offer to the other peer via your signaling server

    // When you receive their answer (`answer` arrives over your signaling channel):
    await pc.setRemoteDescription(new RTCSessionDescription(answer));

    // Get audio stats
    const stats = await pc.getStats();
    stats.forEach(report => {
      // Loss and jitter are measured locally on the incoming stream
      if (report.type === 'inbound-rtp' && report.kind === 'audio') {
        console.log('Packets lost:', report.packetsLost);
        console.log('Jitter:', report.jitter);
      }
      // Round trip time is reported in remote-inbound-rtp, not inbound-rtp
      if (report.type === 'remote-inbound-rtp' && report.kind === 'audio') {
        console.log('Round trip time:', report.roundTripTime);
      }
    });

Both call types:

    Voice → Digitize → Compress → Send in 20ms chunks → Decode → Play

Without Internet (Normal Call):

    Codec: AMR | Path: Telecom towers | Protocol: GSM/VoLTE | Stable + Guaranteed

With Internet (WhatsApp/WebRTC):

    Codec: Opus | Path: Internet P2P | Protocol: RTP over UDP | Flexible + Encrypted

The biggest conceptual difference:
- Normal call = a dedicated pipe reserved just for you (like booking a private road)
- Internet call = many small packets racing through shared roads, reassembled on arrival

- WebRTC Official Docs
- RFC 3550: RTP Specification
- Opus Codec
- How NAT Traversal Works (STUN/TURN/ICE)
- MDN: RTCPeerConnection

If this helped you understand what's actually happening under the hood when you make a call, drop a ❤️. And if you're building something with WebRTC, feel free to ask questions in the comments!

Tags: #webrtc #voip #networking #javascript #webdev #beginners
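
One postscript for the curious: the jitter buffer described earlier can be sketched in plain JavaScript. This is a toy model under stated assumptions, not how WebRTC implements it. `JitterBuffer`, `simulate`, and the `depth` parameter are hypothetical names made up for illustration, packets are identified only by a sequence number, each `pop()` stands in for one 20 ms playout tick, and sequence-number wraparound is ignored.

```javascript
// Toy jitter buffer (hypothetical names, not a WebRTC API).
class JitterBuffer {
  constructor() {
    this.pending = new Map(); // sequence number -> payload waiting for playout
    this.nextSeq = null;      // next sequence number due for playout
  }

  get size() {
    return this.pending.size;
  }

  // Store an arriving packet. Returns false if its playout slot already passed.
  push(seq, payload) {
    if (this.nextSeq === null) this.nextSeq = seq;
    if (seq < this.nextSeq) return false; // too late: playout has moved on
    this.pending.set(seq, payload);
    return true;
  }

  // One playout tick: return the due payload, or 'LOST' if it never arrived.
  pop() {
    if (this.nextSeq === null) return null; // nothing received yet
    const seq = this.nextSeq;
    const payload = this.pending.has(seq) ? this.pending.get(seq) : 'LOST';
    this.pending.delete(seq);
    this.nextSeq = seq + 1;
    return payload;
  }
}

// Feed packets in arrival order; play out only while more than `depth`
// packets are buffered, then drain whatever is left at the end.
function simulate(arrivals, depth) {
  const jb = new JitterBuffer();
  const played = [];
  for (const seq of arrivals) {
    jb.push(seq, `P${seq}`);
    while (jb.size > depth) played.push(jb.pop());
  }
  while (jb.size > 0) played.push(jb.pop());
  return played;
}

// Arrival order from the diagram: P1, P3, P2, P5 (P4 never arrives).
console.log(simulate([1, 3, 2, 5], 2).join(' ')); // P1 P2 P3 LOST P5
console.log(simulate([1, 3, 2, 5], 0).join(' ')); // P1 LOST P3 LOST P5
```

With a depth of 2, the swapped P2/P3 pair is put back in order and only P4 is reported as lost. With a depth of 0, P2 shows up after its slot has already been played, so it is discarded as well. That is the trade-off a real jitter buffer keeps tuning: more buffering hides reordering but adds delay.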
