How to prevent Meta Avatar Lip Sync from wrecking your VR Social app's performance

If you're building a multiplayer VR app on Meta Quest with Meta Avatars and you're struggling to hit 72fps (or 90fps), there's a good chance the lip sync system is eating a significant chunk of your frame budget. Here's how we found it and fixed it.

The Problem

We were profiling our Unity VR app on Quest using simpleperf (Android NDK's CPU profiler) and noticed something alarming in the call graph:

FluidRealtimeAvatarVoice.Update        13.4% of main thread
  └─ SendMicrophoneData
      └─ ProcessAudioSamples
          └─ ovrAvatar2LipSync_FeedAudio
              └─ ovrBody_GetVisemes    ← neural net inference, ~1.5ms

13.4% of the main thread — just for lip sync. On Quest's 13.9ms frame budget at 72fps, that's 1.5ms consumed by a single function call: ovrBody_GetVisemes, which runs a neural network to convert audio into mouth shapes (visemes).

The Meta Avatar SDK calls ProcessAudioSamples synchronously on whatever thread invokes it. If you're calling it from the main thread, the neural net inference blocks your main thread until it's done. Every frame.

How This Happens (It's Easy to Miss)

The Meta Avatar SDK Samples actually has a built-in solution for this. OvrAvatarLipSyncContext has an OnAudioFilterRead callback:

// Inside OvrAvatarLipSyncContext (Meta's code)
protected virtual void OnAudioFilterRead(float[] data, int channels)
{
    if (_audioSourceType == LipSyncAudioSourceType.AudioSource)
    {
        ProcessAudioSamples(data, channels);
    }
}

OnAudioFilterRead is called by Unity on the audio thread, not the main thread. So if you use Meta's default LipSyncAudioSourceType.AudioSource mode with their LipSyncMicInput sample — where mic audio flows through a Unity AudioSource — the lip sync processing runs off the main thread automatically. This is how Meta intended it to work.

But here's the trap. If you use a third-party multiplayer framework like Normcore, Photon Voice, Vivox, or any custom voice chat system, those frameworks manage their own microphone and audio pipeline. They don't route mic audio through a Unity AudioSource. Instead, they give you raw audio data via a callback — and you call ProcessAudioSamples directly from that callback, which runs on the main thread.

Meta's SDK exposes LipSyncAudioSourceType.Manual for exactly this case. The documentation says to call ProcessAudioSamples with your audio data. What it doesn't mention is that this means the heavy neural net inference now runs synchronously on your main thread instead of on the audio thread.

This is the situation many multiplayer VR apps end up in: using the Manual lip sync path because their voice system requires it, not realizing they've moved 1.5ms of neural net inference onto the main thread.

Why This Matters on Quest

At 72fps, your total frame budget is 13.9ms. At 90fps, it's 11.1ms. Here's what our frame looked like:

Component                       Cost
------------------------------  ------
Lip sync (ovrBody_GetVisemes)   1.5ms
Avatar SDK update               0.9ms
Input system                    0.8ms
OVR platform                    0.4ms
Rendering                       1.7ms
XR compositor wait              2.7ms
Everything else                 ~5ms

Lip sync was the single biggest CPU cost under our control. The avatar SDK, input system, and rendering pipeline are all necessary. But spending 1.5ms on lip sync — more than the entire rendering pipeline — felt wrong.

The Discovery

We verified that OvrAvatarVisemeContext.FeedAudio (the method that calls the native lip sync) uses lock (_contextLock):

private void FeedAudio(float[] data, int offset, int count, int channels)
{
    lock (_contextLock)  // Thread-safe!
    {
        // ... marshal audio data to native ...
        var result = CAPI.ovrAvatar2LipSync_FeedAudio(_context, format, offsetAddress, samples);
        // ...
    }
}

That lock means the Meta SDK designed this to be called from any thread. The native function doesn't touch Unity APIs — it just processes an audio buffer and updates internal viseme state. There's no reason it has to run on the main thread.

The Fix

Instead of calling ProcessAudioSamples directly from the voice data callback (which runs on the main thread), we:

  1. Copy the audio data to a pooled buffer (fast, ~microseconds)
  2. Enqueue it for a background thread
  3. Process it on a dedicated worker thread

public class ThreadedMetaAvatarLipSync : OvrAvatarLipSyncContext
{
    private readonly ConcurrentQueue<float[]> _audioQueue = new();
    private readonly ConcurrentBag<float[]> _bufferPool = new();
    private Thread _processingThread;
    private volatile bool _running;
    private const int MaxQueuedBuffers = 3;

    public void FeedVoiceData(float[] data)
    {
        // Drop oldest if backing up
        while (_audioQueue.Count >= MaxQueuedBuffers)
            if (_audioQueue.TryDequeue(out var old))
                _bufferPool.Add(old);

        // Copy to pooled buffer (the source array is reused by the caller)
        if (!_bufferPool.TryTake(out var buffer) || buffer.Length != data.Length)
            buffer = new float[data.Length];
        Array.Copy(data, buffer, data.Length);
        _audioQueue.Enqueue(buffer);
    }

    private void Start()
    {
        _running = true;
        _processingThread = new Thread(() =>
        {
            while (_running)
            {
                if (_audioQueue.TryDequeue(out var data))
                {
                    ProcessAudioSamples(data, 1); // Heavy work happens here, off main thread
                    _bufferPool.Add(data);
                }
                else Thread.Sleep(1);
            }
        }) { Name = "LipSyncAudioProcessor", IsBackground = true };
        _processingThread.Start();
    }

    private void OnDestroy()
    {
        _running = false;
        _processingThread?.Join(500);
    }
}

Then in your voice setup, instead of:

voice.voiceData += (data) => ProcessAudioSamples(data, 1);  // Blocks main thread!

You do:

voice.voiceData += threadedLipSync.FeedVoiceData;  // Returns instantly

The Results

We profiled before and after using simpleperf on a Quest 3:

Metric                        Before           After
----------------------------  ---------------  -------
Lip sync on main thread       13.4% (~1.5ms)   0%
LipSyncAudioProcessor thread  N/A              0.88%
Total CPU samples (10s)       12,261           11,766
Frame budget recovered        -                ~1.5ms

The neural net inference moved entirely off the main thread. The background thread runs at less than 1% total CPU because it's no longer competing with the main thread for scheduling.

Trade-offs

Latency: There's roughly one frame of additional lip sync latency (~14ms at 72fps): the visemes rendered on frame N are computed from frame N-1's audio. In practice this is imperceptible; mouth animation with a one-frame delay looks identical to zero delay.

Thread safety: The Meta SDK's lock (_contextLock) in OvrAvatarVisemeContext.FeedAudio synchronizes writes from our background thread with reads from OvrAvatarManager.Update on the main thread. This has held up in our testing, though it does depend on the SDK keeping that internal lock in future versions.

Buffer pooling: We reuse float[] buffers via a ConcurrentBag to avoid GC allocations. The audio data array from the voice source is reused by the caller, so we must copy it — but the copy is fast and the pooling eliminates allocation pressure.

Queue overflow: If the background thread falls behind (e.g., during a CPU spike), we drop the oldest audio buffers rather than letting the queue grow unbounded. Capped at 3 queued buffers.

How We Found It

We used Android NDK's simpleperf to profile on-device:

# Record 10 seconds of CPU samples at 1000Hz with call graphs
cd $ANDROID_NDK/simpleperf
python3 app_profiler.py -p com.your.app \
  -r "-e task-clock:u -f 1000 --duration 10 -g" \
  -lib path/to/unstripped/symbols/

# View the call tree
python3 report.py -g --sort symbol --symfs binary_cache \
  --tids <main_thread_id> | head -40

# Generate interactive flamegraph
python3 report_html.py -o profile.html

This was far more useful than Unity's built-in profiler for this kind of investigation because:

  • It works on release builds (with <profileable android:shell="true" /> in your manifest)
  • It captures native code (the lip sync inference is in libovrbody.so)
  • It shows the actual call graph with C# function names resolved via IL2CPP symbols
  • No instrumentation overhead that distorts the results
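
For reference, the profileable flag mentioned above goes inside the `<application>` element of your AndroidManifest.xml. A minimal fragment (requires Android API 29+, which all Quest headsets run):

```xml
<manifest xmlns:android="http://schemas.android.com/apk/res/android">
    <application>
        <!-- Lets shell tools like simpleperf profile this app, even in release builds -->
        <profileable android:shell="true" />
    </application>
</manifest>
```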

Why Not Just Use the AudioSource Path?

You might wonder: why not route your multiplayer voice audio through a Unity AudioSource and let Meta's OnAudioFilterRead handle it on the audio thread?

In theory you could — create an AudioSource, write your voice data into an AudioClip, and let the audio pipeline process it. But this adds complexity, latency, and potential audio issues. Multiplayer voice frameworks like Normcore already handle audio preprocessing (noise suppression, echo cancellation, gain control) and streaming — inserting a Unity AudioSource into that pipeline just to get lip sync on the audio thread would be fighting the architecture.
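
For illustration, a rough sketch of that rejected route, assuming a 48kHz mono stream; OnPcmRead is a hypothetical PCMReaderCallback that pulls from your voice framework's buffer:

```csharp
// Sketch only: stream voice PCM through a Unity AudioSource so that
// OvrAvatarLipSyncContext.OnAudioFilterRead runs on the audio thread.
// Ring-buffering, clock drift, and suppressing the double playback are
// all left unhandled; that is exactly the complexity being avoided.
var clip = AudioClip.Create("VoiceStream", 48000, 1, 48000,
    stream: true, pcmreadercallback: OnPcmRead);
audioSource.clip = clip;
audioSource.loop = true;
audioSource.Play();
// You'd also have to avoid hearing the voice twice, e.g. by routing this
// AudioSource to a silent mixer group after lip sync has read the buffer.
```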

This threaded approach is simpler: keep your existing voice pipeline exactly as-is, swap one line to queue audio instead of processing it synchronously, and get the same off-main-thread benefit without touching your audio architecture.

Key Takeaway

If you're using Meta Avatars with lip sync and a multiplayer voice framework on Quest, you're probably calling ProcessAudioSamples on the main thread. Check your profiler for ovrBody_GetVisemes — if it's on UnityMain, this simple threading change can recover 1-2ms per frame.

using System;
using System.Collections.Concurrent;
using System.Threading;
using UnityEngine;
using Oculus.Avatar2;

/// <summary>
/// Drop-in replacement for synchronous Meta Avatar lip sync that moves the expensive
/// viseme neural network inference (ovrBody_GetVisemes) off the Unity main thread.
///
/// The default Meta Avatar SDK calls ProcessAudioSamples synchronously on the main thread,
/// which triggers ovrAvatar2LipSync_FeedAudio -> ovrBody_GetVisemes. On Quest hardware,
/// this costs ~1.5ms per frame — a significant chunk of the 13.9ms budget at 72fps.
///
/// This implementation queues audio data on the main thread (microseconds) and processes
/// it on a dedicated background thread. The Meta Avatar SDK's internal lock (_contextLock
/// in OvrAvatarVisemeContext) ensures thread-safe access to the viseme state.
///
/// Trade-off: ~1 frame of lip sync latency. Imperceptible for mouth animation.
///
/// Usage:
///   Replace your existing OvrAvatarLipSyncContext subclass with this one.
///   Wire up the voice data source to call FeedVoiceData(float[] data) each frame.
///
/// Requirements:
///   - Meta Avatars SDK (com.meta.xr.sdk.avatars)
///   - A voice data source that provides float[] PCM audio each frame
/// </summary>
public class ThreadedMetaAvatarLipSync : OvrAvatarLipSyncContext
{
    [Tooltip("Maximum number of audio buffers to queue. If the worker thread falls behind, oldest buffers are dropped.")]
    [SerializeField] private int maxQueuedBuffers = 3;

    private readonly ConcurrentQueue<float[]> _audioQueue = new();
    private readonly ConcurrentBag<float[]> _bufferPool = new();
    private Thread _processingThread;
    private volatile bool _running;

    /// <summary>
    /// Call this from your voice data callback (e.g., RealtimeAvatarVoice.voiceData event).
    /// The audio data is copied to an internal buffer and queued for background processing.
    /// This method is fast (~microseconds) and safe to call from the main thread.
    /// </summary>
    /// <param name="data">PCM audio samples from the microphone</param>
    public void FeedVoiceData(float[] data)
    {
        // Drop oldest buffers if the worker thread is falling behind
        while (_audioQueue.Count >= maxQueuedBuffers)
        {
            if (_audioQueue.TryDequeue(out var old))
                _bufferPool.Add(old);
        }

        // Reuse a pooled buffer or allocate a new one
        if (!_bufferPool.TryTake(out var buffer) || buffer.Length != data.Length)
            buffer = new float[data.Length];

        // Copy audio data — the source array is reused by the caller
        Array.Copy(data, buffer, data.Length);
        _audioQueue.Enqueue(buffer);
    }

    /// <summary>
    /// Starts the background processing thread.
    /// Call base.Start() or ensure this runs after OvrAvatarLipSyncContext initializes.
    /// </summary>
    protected virtual void Start()
    {
        _running = true;
        _processingThread = new Thread(ProcessingLoop)
        {
            Name = "LipSyncAudioProcessor",
            IsBackground = true
        };
        _processingThread.Start();
    }

    protected virtual void OnDestroy()
    {
        _running = false;
        _processingThread?.Join(500);
    }

    /// <summary>
    /// Background thread loop. Dequeues audio buffers and calls ProcessAudioSamples,
    /// which triggers the Meta SDK's native ovrAvatar2LipSync_FeedAudio -> ovrBody_GetVisemes
    /// pipeline. The SDK's internal lock ensures thread-safe viseme state updates.
    /// </summary>
    private void ProcessingLoop()
    {
        while (_running)
        {
            if (_audioQueue.TryDequeue(out var data))
            {
                try
                {
                    // This calls into Meta's native lip sync pipeline.
                    // OvrAvatarVisemeContext.FeedAudio uses lock(_contextLock) internally,
                    // making this safe to call from a background thread.
                    ProcessAudioSamples(data, 1);
                }
                catch (Exception e)
                {
                    Debug.LogException(e);
                }

                // Return buffer to pool for reuse (avoids GC allocations)
                _bufferPool.Add(data);
            }
            else
            {
                // No audio to process — sleep briefly to avoid spinning
                Thread.Sleep(1);
            }
        }
    }
}
