ARM NEON SIMD Intrinsics for Real-Time Audio Processing in Android NDK
---
title: "ARM NEON SIMD for Real-Time Audio on Android NDK"
published: true
description: "Cut Android audio latency below 10ms using ARM NEON SIMD intrinsics, lock-free ring buffers, and vectorized FFT in the NDK native pipeline."
tags: android, mobile, architecture, performance
canonical_url: https://blog.mvpfactory.co/arm-neon-simd-real-time-audio-android-ndk
---

## What We Will Build

In this workshop, I will walk you through a native audio pipeline on Android that consistently delivers sub-10ms round-trip latency. You will learn how to configure Oboe/AAudio for exclusive low-latency streaming, design a lock-free SPSC ring buffer that won't glitch on the real-time callback thread, and vectorize your FFT butterfly operations with ARM NEON intrinsics for a 3-4x throughput gain over scalar C++.

By the end, you will have the architecture and working code to replace a sluggish `AudioTrack`-based pipeline (25-55ms latency) with a native NEON-accelerated one that hits 4-8ms on modern Snapdragon and Tensor chipsets.

## Prerequisites

- Android NDK (r25+) with CMake
- Familiarity with C++ and JNI basics
- A physical ARM64 device for testing (emulator won't cut it for latency measurement)
- The [Oboe library](https://github.com/google/oboe) added to your project

## Step 1: Configure Oboe for Low-Latency Exclusive Mode

The minimal setup is an Oboe stream builder that requests the low-latency performance mode and exclusive sharing. The setting most developers miss is `SharingMode::Exclusive`: it bypasses the Android mixer entirely, giving you direct HAL access and saving 5-15ms by itself. Exclusive mode is a request, not a guarantee. If the device cannot grant it, Oboe opens the stream in shared mode instead, so check `getSharingMode()` on the opened stream.

This is the single highest-impact change in the entire pipeline. Start here before optimizing anything else.

## Step 2: Build a Lock-Free Ring Buffer

Here is the gotcha that will save you hours: the audio callback runs on a real-time priority thread. Any blocking operation (a mutex, a heap allocation, even a log call) can cause audible glitches.
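One way to hand samples to the callback without any of those blocking operations is a fixed-size queue driven by two atomic indices. The sketch below is a minimal illustration, not the original listing from this article: the class name `SpscRing`, its `push`/`pop` methods, and the power-of-two capacity are assumptions made for the example.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical SPSC ring buffer: exactly one thread calls push(), exactly one
// other thread calls pop(). Capacity must be a power of two so that wrapping
// an index is a single bit mask.
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer side. Returns false when the buffer is full instead of blocking.
    bool push(const T& value) {
        const std::size_t w = write_.load(std::memory_order_relaxed);
        const std::size_t r = read_.load(std::memory_order_acquire);
        if (w - r == Capacity) return false;             // full
        data_[w & (Capacity - 1)] = value;
        write_.store(w + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Consumer side (the audio callback). Returns false when empty.
    bool pop(T& out) {
        const std::size_t r = read_.load(std::memory_order_relaxed);
        const std::size_t w = write_.load(std::memory_order_acquire);
        if (r == w) return false;                        // empty
        out = data_[r & (Capacity - 1)];
        read_.store(r + 1, std::memory_order_release);   // release the slot
        return true;
    }

private:
    // Each index sits on its own 64-byte cache line to avoid false sharing.
    alignas(64) std::atomic<std::size_t> write_{0};
    alignas(64) std::atomic<std::size_t> read_{0};
    alignas(64) T data_[Capacity]{};
};
```

In this arrangement the processing thread would be the producer and the audio callback the consumer. When `pop` returns false, the callback should write silence and return rather than wait for data.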
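The correct boundary between your processing thread and the callback is a single-producer, single-consumer (SPSC) lock-free ring buffer.

Put `alignas(64)` on both atomic positions (the read and write indices). On ARM Cortex-A cores, a cache line is typically 64 bytes. Without this alignment, the two indices can land on the same line, and your "lock-free" structure silently contends through false sharing.

## Step 3: Vectorize Your FFT with NEON Intrinsics

Let me show you a pattern I use in every project that does real-time DSP. A scalar radix-2 butterfly processes one complex multiply-add per iteration. NEON processes four simultaneously.

```cpp
#include <arm_neon.h>

// Twiddle multiply for four butterflies at a time: (re + j*im) *= (w_re + j*w_im).
// The original listing is truncated after `float* im,`. The twiddle parameters,
// the length `n` (assumed to be a multiple of 4), and the loads are reconstructed.
void neon_butterfly(float* re, float* im, const float* w_re, const float* w_im, int n) {
    for (int i = 0; i < n; i += 4) {
        float32x4_t ar = vld1q_f32(&re[i]);
        float32x4_t ai = vld1q_f32(&im[i]);
        float32x4_t wr = vld1q_f32(&w_re[i]);
        float32x4_t wi = vld1q_f32(&w_im[i]);
        float32x4_t tr = vmlsq_f32(vmulq_f32(ar, wr), ai, wi);  // ar*wr - ai*wi
        float32x4_t ti = vmlaq_f32(vmulq_f32(ar, wi), ai, wr);  // ar*wi + ai*wr
        vst1q_f32(&re[i], tr);
        vst1q_f32(&im[i], ti);
    }
}
```

`vmlsq_f32` and `vmlaq_f32` are multiply-subtract and multiply-accumulate intrinsics, so each output costs one multiply plus one multiply-accumulate instead of two multiplies and a separate add or subtract. They are not guaranteed to be fused. If you want an explicit fused multiply-add, use `vfmaq_f32` and `vfmsq_f32`, which map to `FMLA`/`FMLS` on AArch64. On Cortex-A78-class cores these instructions take a few cycles of latency but are fully pipelined, so there is no separate multiply-then-add penalty.

In your CMake configuration, make sure you target the right architecture. On `arm64-v8a`, NEON is mandatory: it is part of the ABI, so every 64-bit Android device supports it and you don't need runtime feature detection. In 2026, dropping 32-bit `armeabi-v7a` support is the right call for any latency-sensitive application.

## Benchmarks

All measurements at 48kHz sample rate, 128-sample buffer, averaged over 10,000 callbacks:

| Pipeline | Pixel 8 (Tensor G3) | Galaxy S24 (Snapdragon 8 Gen 3) | Pixel 7a (Tensor G2) |
|---|---|---|---|
| AudioTrack (Java) | 32ms | 28ms | 41ms |
| Oboe + scalar C++ | 11ms | 9ms | 14ms |
| Oboe + NEON FFT | 7ms | 6ms | 9ms |
| Oboe + NEON + Exclusive | 5ms | 4ms | 8ms |

The NEON-vectorized path with exclusive mode delivers a roughly 5-7x improvement over the managed `AudioTrack` approach (32ms to 5ms, 28ms to 4ms, 41ms to 8ms). Even on the older Tensor G2, you stay below the 10ms threshold.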
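To put these figures in perspective, it helps to convert the buffer size into time. At 48kHz, one 128-sample buffer covers 128 / 48000 s, or about 2.67ms, so the 4-5ms results correspond to less than two buffers of total delay. The helper below is a small illustrative sketch for redoing that arithmetic with other buffer sizes and sample rates. The names `buffer_ms` and `print_buffer_budget` are made up for this example.

```cpp
#include <cstdio>

// Duration of one audio buffer in milliseconds.
// `frames` is the buffer size in samples per channel; `sample_rate` is in Hz.
constexpr double buffer_ms(int frames, int sample_rate) {
    return 1000.0 * frames / sample_rate;
}

// Prints the time covered by two common buffer sizes at 48 kHz.
void print_buffer_budget() {
    std::printf("128 frames @ 48 kHz: %.2f ms\n", buffer_ms(128, 48000));  // 2.67 ms
    std::printf("256 frames @ 48 kHz: %.2f ms\n", buffer_ms(256, 48000));  // 5.33 ms
}
```

Doubling the buffer to 256 samples already costs 5.33ms per buffer, which is why small buffers matter as much as fast DSP code when the target is under 10ms.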
## Gotchas

- **Treating audio like a UI problem.** The docs do not mention this, but reaching for `AudioTrack` or `MediaCodec` and processing on a managed thread is the single biggest mistake Android teams make. You need to rethink the pipeline from the native layer up.
- **Skipping `alignas(64)` on your atomics.** Without cache-line alignment, your lock-free ring buffer silently suffers false sharing across CPU cores. This is easy to get 90% right and hard to get 100% right, so test on real hardware early.
- **Relying on compiler auto-vectorization.** Auto-vectorization is inconsistent across NDK toolchains. Hand-written NEON intrinsics for FFT butterfly operations deliver predictable 3-4x throughput gains. Once you see the Simpleperf numbers, you won't go back.
- **Using `SharingMode::Shared` by default.** Shared mode routes through the Android mixer, adding 5-15ms. You lose the ability to mix with other apps in exclusive mode, but you gain deterministic timing.
- **Forgetting to profile and move.** This kind of optimization means long sessions of profiling with Simpleperf and staring at NEON disassembly. I keep [HealthyDesk](https://play.google.com/store/apps/details?id=com.healthydesk) running during these deep NDK sessions. The break reminders are genuinely useful when you're three hours deep in cache-line alignment issues and have forgotten to move.

## Conclusion

Start with `SharingMode::Exclusive`: it's the single highest-impact change, worth 5-15ms by itself. Then build your lock-free SPSC ring buffer with proper cache-line alignment. Finally, vectorize your DSP kernels with NEON intrinsics for that predictable 3-4x throughput gain.

The full pipeline gets you from 28-41ms managed-layer latency down to 4-8ms native latency on modern hardware. It's more work upfront, but for real-time synthesis, effects processing, or low-latency monitoring, there is no shortcut around the native layer.
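Since cache-line alignment is the part that is easy to get subtly wrong, one option is to let the compiler check it for you. The sketch below is an illustration under assumptions: the struct name `RingIndices` and its two members are hypothetical stand-ins for your ring buffer's write and read positions, and 64 bytes is the line size assumed throughout this article.

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size

// Hypothetical stand-in for the two positions inside a ring buffer.
struct RingIndices {
    alignas(kCacheLine) std::atomic<std::size_t> write_pos{0};
    alignas(kCacheLine) std::atomic<std::size_t> read_pos{0};
};

// Fail the build if the two indices could ever share a cache line.
static_assert(alignof(RingIndices) == kCacheLine,
              "struct must start on a cache-line boundary");
static_assert(offsetof(RingIndices, read_pos) - offsetof(RingIndices, write_pos)
                  >= kCacheLine,
              "indices must sit on separate cache lines");
static_assert(sizeof(RingIndices) == 2 * kCacheLine,
              "each index should occupy a full cache line");
```

If someone later removes an `alignas` or reorders the members, the build breaks immediately. That is cheaper than finding the regression in a profiler.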
**Further reading:**

- [Oboe documentation](https://github.com/google/oboe/blob/main/docs/FullGuide.md)
- [ARM NEON Intrinsics Reference](https://developer.arm.com/architectures/instruction-sets/intrinsics/)
- [Android NDK High-Performance Audio guide](https://developer.android.com/ndk/guides/audio)
