Most teams shipping their first LLM integration into a React Native app land in one of three holes within the first month after launch. We have shipped around these holes enough times to recognize them on the way in. This piece is the field guide.
It is not exhaustive. There are plenty of ways for a mobile AI feature to fail. These are the three that come up most often, and the engineering patterns that close them.
Failure 1: The model call is in the tap path
The symptom. The user taps "Summarize," waits 1.4 seconds while a spinner spins, and then sees the result. By the second tap, they have stopped tapping.
The cause. The naive implementation puts the model call directly inside the tap handler. The handler awaits the cloud round-trip, and the UI shows nothing but a spinner until it returns. On a fast connection on a fresh device, this is 600 to 800 milliseconds. On a cellular connection in a tunnel, on a phone that has been awake for 6 hours, it is anywhere from 1.4 to 3 seconds. The user does not know how long the call will take; they only know that the app feels slow.
The pattern that works. Three habits, none of them dramatic.
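The first is to stream the response. Both OpenAI and Anthropic support streaming. Streaming does not make the call faster end to end; it puts the first token on screen in 300 to 500 milliseconds instead of making the user wait 1,400 milliseconds for the whole response. The perception of latency is dominated by time-to-first-token, not total time. On React Native, this means using a fetch-with-ReadableStream pattern where the runtime exposes a streaming response body (React Native's built-in fetch historically has not, so this often needs a polyfill or a streaming-capable fetch replacement), or the official SDK's streaming helpers, and writing the partial response into local state as it arrives.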
The second is to never put the model call in the synchronous tap path. The tap fires off an async request; the UI immediately transitions to a "generating" state that is itself useful (a typing indicator, a partial preview). If the request takes 3 seconds, the user is looking at a useful screen during those 3 seconds, not a blocked one.
The third is to put a timeout on every model call. Five seconds is a reasonable default for streaming starts; 30 seconds for total completion. When the timeout fires, the UI degrades gracefully (see Failure 2) rather than hanging forever.
The code shape looks roughly like this. We are not going to paste a full implementation; this is a sketch.
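    async function startSummary(content: string) {
      // Switch to a useful "generating" state before any network work starts.
      setState({ phase: 'generating', partial: '' });

      // Total-completion timeout: abort the request if it runs past 30 seconds.
      const controller = new AbortController();
      const timeoutId = setTimeout(() => controller.abort(), 30_000);

      try {
        const stream = await fetchSummaryStream(content, controller.signal);
        for await (const chunk of stream) {
          // Append each chunk to local state as it arrives.
          setState((s) => ({ ...s, partial: s.partial + chunk }));
        }
        setState((s) => ({ ...s, phase: 'done' }));
      } catch (err) {
        // Keep whatever partial text already arrived (see Failure 2).
        setState((s) => ({ ...s, phase: 'error', error: err }));
      } finally {
        clearTimeout(timeoutId);
      }
    }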
The exact streaming primitive (ReadableStream, an SDK helper, a custom EventSource shim for React Native) is a stack decision. The principle is the same across all of them.
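The two timeouts from the third habit (one for the stream to start, one for total completion) can be layered over any chunk source. The following is a minimal sketch, assuming the stream is exposed as an `AsyncIterable<string>`. `withDeadlines` and `StreamTimeoutError` are hypothetical names, not part of any SDK, and the defaults are the 5-second and 30-second figures suggested above.

```typescript
// Hypothetical helper: names and defaults are illustrative, not an SDK API.
class StreamTimeoutError extends Error {
  readonly stage: 'first-chunk' | 'total';
  constructor(stage: 'first-chunk' | 'total') {
    super(`stream timed out (${stage})`);
    this.name = 'StreamTimeoutError';
    this.stage = stage;
  }
}

async function* withDeadlines(
  source: AsyncIterable<string>,
  controller: AbortController,
  firstChunkMs: number = 5_000,
  totalMs: number = 30_000,
): AsyncGenerator<string> {
  const iterator = source[Symbol.asyncIterator]();

  // A promise that only ever rejects, when one of the two timers fires.
  let rejectOnTimeout: (err: Error) => void = () => {};
  const timedOut = new Promise<never>((_resolve, reject) => {
    rejectOnTimeout = reject;
  });
  timedOut.catch(() => {}); // keep an un-raced rejection from going unhandled

  const fire = (stage: 'first-chunk' | 'total') => {
    controller.abort(); // cancel the underlying request
    rejectOnTimeout(new StreamTimeoutError(stage));
  };
  const firstTimer = setTimeout(() => fire('first-chunk'), firstChunkMs);
  const totalTimer = setTimeout(() => fire('total'), totalMs);

  try {
    while (true) {
      // Whichever settles first wins: the next chunk or a timeout.
      const result = await Promise.race([iterator.next(), timedOut]);
      clearTimeout(firstTimer); // the stream has started (or ended)
      if (result.done) return;
      yield result.value;
    }
  } finally {
    clearTimeout(firstTimer);
    clearTimeout(totalTimer);
  }
}
```

In `startSummary`, this would wrap the stream inside the loop header, for example `for await (const chunk of withDeadlines(stream, controller))`. The caught error's `stage` then tells the UI whether nothing arrived at all or the response simply ran long.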
Failure 2: There is no fallback when the network drops mid-stream
The symptom. The user is on the subway. They tap "Summarize." The first three tokens arrive. Then the train enters a tunnel. The stream hangs. The user sees half a sentence, no progress, no recovery. They force-quit the app. They never use the feature again.
The cause. A streaming model call is exactly as fragile as the network connection it rides on. Mobile connections drop more often than developers raised on broadband instinctively assume. A 60-second response with 30 seconds of streaming has roughly a 5% chance of failing mid-stream in normal cellular conditions, and a much higher chance during peak congestion or in tunnels.
A naive implementation either:
- Hangs forever (no timeout, no abort)
- Throws an unhandled error (no try/catch around the stream consumer)
- Resets to the empty state when retried (loses the partial response the user already saw)
None of those are acceptable in production.
The pattern that works. Three habits again.
The first is to treat the partial response as a first-class piece of state. When the stream errors out, you do not throw the partial response away. You show it to the user with a "stream interrupted" indicator and a retry button that resumes from where you left off (or, if the model does not support resume, regenerates from a prompt that includes the partial response as context to avoid contradicting it).
The second is to design the failure-mode UX before you ship. Every model-driven surface needs at least three visual states beyond the success state: generating, partial-with-error, and complete-failure. Most teams ship the success state, half of the generating state, and none of the others. Then the first cellular drop becomes a production incident.
The third is to log every stream error with enough context to debug it later: timestamp, geolocation if you have it, network type (Wi-Fi or cellular, and which cellular generation), the prompt hash, the byte count of the partial response, the error class. Mobile errors are hard to reproduce. A team that ships AI to mobile without this telemetry is debugging blind.
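The first two habits can be made concrete as a small state machine in which the partial response survives an error. This is a sketch under assumptions: the phase and event names are invented for illustration, and a retry is assumed to continue the interrupted text rather than replace it.

```typescript
// Illustrative state model; phase and event names are assumptions.
type SummaryState =
  | { phase: 'idle' }
  | { phase: 'generating'; partial: string }
  | { phase: 'done'; text: string }
  | { phase: 'interrupted'; partial: string; error: string } // partial-with-error
  | { phase: 'failed'; error: string }; // complete failure

type SummaryEvent =
  | { type: 'start' }
  | { type: 'chunk'; text: string }
  | { type: 'finish' }
  | { type: 'error'; message: string };

function reduceSummary(state: SummaryState, event: SummaryEvent): SummaryState {
  switch (event.type) {
    case 'start':
      // A retry after an interruption keeps the text the user already saw.
      return {
        phase: 'generating',
        partial: state.phase === 'interrupted' ? state.partial : '',
      };
    case 'chunk':
      return state.phase === 'generating'
        ? { phase: 'generating', partial: state.partial + event.text }
        : state;
    case 'finish':
      return state.phase === 'generating'
        ? { phase: 'done', text: state.partial }
        : state;
    case 'error':
      if (state.phase !== 'generating') return state;
      // Some text arrived: preserve it. Nothing arrived: complete failure.
      return state.partial.length > 0
        ? { phase: 'interrupted', partial: state.partial, error: event.message }
        : { phase: 'failed', error: event.message };
  }
}
```

Because each phase is a distinct variant, the rendering code has to handle `interrupted` and `failed` explicitly, which forces the failure-mode screens to be designed rather than forgotten.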
The error class taxonomy worth distinguishing:
| Class | What happened | How to handle |
| --- | --- | --- |
| Timeout | Server did not respond within the timeout window | Retry with backoff up to N times, then surface error |
| Network drop mid-stream | Stream connection lost after some bytes received | Preserve partial, offer resume or retry |
| Server error (5xx) | Provider had a transient issue | Retry with backoff, fall back to secondary vendor on 3rd failure |
| Rate limit (429) | You are sending too many requests | Backoff with jitter; signal to the user briefly |
| Content moderation block | Provider refused to complete the call | Show a moderation-aware error, do not retry |
| Auth error (401, 403) | Token issue, not user-facing | Refresh token, retry once; if it persists, alert internal monitoring |
If your error handling collapses all of these into a single "Sorry, something went wrong," you are losing information that would help you fix the underlying issues.
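One way to keep the classes apart in code is a small classifier plus a per-class retry budget and a jittered backoff. This is a sketch, not a definitive policy: the `Failure` fields, the retry counts (including N = 3 for timeouts), and the 500 ms base and 8 s cap are all assumptions chosen for illustration.

```typescript
// Illustrative only: field names and retry limits are assumptions.
type ErrorClass =
  | 'timeout' | 'network-drop' | 'server-error'
  | 'rate-limit' | 'moderation' | 'auth' | 'unknown';

interface Failure {
  status?: number;             // HTTP status, if a response arrived
  timedOut?: boolean;          // the client-side timeout fired
  moderationBlocked?: boolean; // the provider signalled a refusal
  bytesReceived: number;       // size of the partial response so far
}

function classify(f: Failure): ErrorClass {
  if (f.moderationBlocked) return 'moderation';
  if (f.status === 401 || f.status === 403) return 'auth';
  if (f.status === 429) return 'rate-limit';
  if (f.status !== undefined && f.status >= 500) return 'server-error';
  if (f.bytesReceived > 0) return 'network-drop'; // a partial exists to preserve
  if (f.timedOut) return 'timeout';
  return 'unknown';
}

// Automatic retries allowed per class before the error is surfaced.
// Network drops get none here: the user is offered resume or retry instead.
const MAX_RETRIES: Record<ErrorClass, number> = {
  'timeout': 3,
  'server-error': 2, // the third failure falls back to the secondary vendor
  'rate-limit': 3,
  'auth': 1,         // retry once after refreshing the token
  'network-drop': 0,
  'moderation': 0,
  'unknown': 0,
};

function shouldRetry(cls: ErrorClass, attemptsSoFar: number): boolean {
  return attemptsSoFar < MAX_RETRIES[cls];
}

// Exponential backoff with full jitter: a random delay in [0, ceiling).
function backoffMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 8_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Passing `random` as a parameter keeps the backoff deterministic under test; production code would leave the default in place.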
Failure 3: Token cost spirals because the integration has no caching
The symptom. The first month of usage looks fine. The second month's bill arrives. It is 12x what the team modeled. Finance is unhappy. Engineering is asked to explain.
The cause. A naive integration calls the model every time the user touches the feature. A marketplace app with 100,000 monthly active users, where the same 5,000 listings are rendered repeatedly across browse, search, detail view, and shared link previews, will make 4 to 10 times as many model calls per listing as it needs before the team notices.
This is the failure mode our Mobile AI Cost Model post is designed to predict in advance. But even with the prediction, the engineering pattern matters.
The pattern that works. Cache aggressively, invalidate carefully.
The first decision is the cache key. For content-augmentation features (summaries, rewrites, structured extracts), the key is usually the canonical version of the source content. If the source content has a content hash, use that. If not, derive one. The key is not the user's identity; the key is the content. Every user who looks at the same listing should see the same summary, generated once.
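Where the source content has no hash of its own, one can be derived by canonicalizing the fields that feed the prompt and hashing the result. The sketch below assumes a Node.js server and a hypothetical `ListingContent` shape; which fields count as "content" is a decision for each feature.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical shape: only the fields that influence the summary belong here.
interface ListingContent {
  title: string;
  description: string;
  attributes: Record<string, string>;
}

// Normalize incidental differences (whitespace, key order) so that
// equivalent content always serializes to the same string.
function canonicalize(listing: ListingContent): string {
  const attributes = Object.entries(listing.attributes)
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([key, value]) => [key, value.trim()]);
  return JSON.stringify({
    title: listing.title.trim(),
    description: listing.description.trim().replace(/\s+/g, ' '),
    attributes,
  });
}

// SHA-256 over the canonical form, hex-encoded for use inside a cache key.
function contentHash(listing: ListingContent): string {
  return createHash('sha256').update(canonicalize(listing), 'utf8').digest('hex');
}
```

The canonicalization step matters as much as the hash: without it, a re-save that only reorders attributes or adds trailing whitespace would miss the cache and pay for a fresh generation.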
The second decision is the cache backing store. For a server-side cache, Redis works. Postgres works if you do not have Redis. The mobile app should not be involved in the caching; the API endpoint that wraps the model call serves cached responses transparently. The client does not know whether it is hitting cache.
The third decision is invalidation. The cache invalidates on the events that mean the source content changed. For a marketplace listing, that is "the seller edited the listing." It is not "every read." It is not "every hour." A cache that invalidates on every read is not a cache.
The pseudocode shape:
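    async function getSummary(listingId: string) {
      const listing = await db.listings.get(listingId);
      // The content hash is part of the key, so an edited listing maps to a new entry.
      const cacheKey = `summary:${listingId}:${listing.contentHash}`;

      // Cache hit: return the stored summary without calling the model.
      const cached = await cache.get(cacheKey);
      if (cached) return cached;

      // Cache miss: generate once, store, and let the TTL clear out stale entries.
      const summary = await model.complete(buildPrompt(listing));
      await cache.set(cacheKey, summary, { ttl: '30d' });
      return summary;
    }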
The contentHash portion of the key is what does the work. When the seller edits the listing, the hash changes; the next request misses the cache and regenerates; subsequent requests hit the new cache entry. No explicit invalidation step. The cache invalidates itself.
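The self-invalidating behavior is easy to check with in-memory stand-ins. In this sketch, `makeSummaryService`, `getListing`, and `complete` are hypothetical names, and a `Map` takes the place of Redis or Postgres.

```typescript
// In-memory stand-ins; a real deployment would use Redis or Postgres.
interface Listing {
  id: string;
  contentHash: string;
  text: string;
}

function makeSummaryService(
  getListing: (id: string) => Promise<Listing>,
  complete: (prompt: string) => Promise<string>, // hypothetical model call
  cache: Map<string, string> = new Map(),
) {
  return async function getSummary(listingId: string): Promise<string> {
    const listing = await getListing(listingId);
    // Same listing + same content => same key => at most one model call.
    const key = `summary:${listing.id}:${listing.contentHash}`;
    const cached = cache.get(key);
    if (cached !== undefined) return cached;

    const summary = await complete(`Summarize this listing:\n${listing.text}`);
    cache.set(key, summary);
    return summary;
  };
}
```

Reading the same listing twice costs one model call; changing its `contentHash` costs exactly one more. Entries stored under old hashes are never read again, which is why the TTL in the pseudocode above still earns its place: it is garbage collection, not invalidation.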
The same pattern compounds with prompt caching primitives that the model providers now ship (OpenAI prompt caching, Anthropic prompt caching). Those primitives cache the repeated prompt prefix on the provider's side, even when the response varies. Stacked together with your own output cache, the cost line on a content-augmentation feature can drop by an order of magnitude relative to a naive build.
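Because provider-side caching works on the repeated prefix, the prompt has to be laid out so that the static part comes first and stays byte-identical across requests. The sketch below is provider-agnostic and the instruction text is invented for illustration; each provider documents its own minimum prefix length and opt-in mechanics, which are not shown here.

```typescript
// Hypothetical prompt layout: static instructions first, variable content last.
const STATIC_PREFIX = [
  'You summarize marketplace listings for buyers.',
  'Write two sentences. Mention condition and price if present.',
].join('\n');

interface PromptParts {
  prefix: string; // identical for every request, so it can be cached
  suffix: string; // varies per listing
}

function buildPromptParts(listingText: string): PromptParts {
  return {
    prefix: STATIC_PREFIX,
    suffix: `Listing:\n${listingText.trim()}`,
  };
}

function renderPrompt(parts: PromptParts): string {
  return `${parts.prefix}\n\n${parts.suffix}`;
}
```

Anything that varies per request (timestamps, user identifiers, the listing itself) belongs in the suffix. Placing it ahead of the instructions changes the prefix on every call and defeats the provider-side cache.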
The recurring shape
Three failures, one common cause: the team treated the model call as if it were a regular HTTP request to a regular backend. Model calls are slower, flakier, more expensive, and more variable than the rest of the request path the team is used to. The engineering patterns that work account for that.
If you are starting on a React Native AI integration in 2026, we wrote a longer-form Mobile AI Integration Guide that goes deeper on the integration patterns, the stack decisions, and the failure modes. If you would rather have someone do this with you for the first feature, our AI Integration Sprint ships one feature, end to end, in 2 to 4 weeks.
Shuhel Khan is the founder of Inseed. We integrate AI into existing mobile apps, build AI-native MVPs, and ship mobile development work for web-first companies. Last revised 2026-05-13.