How to Set Up a Local Failover for Your Coding Agent When API Rate Limits or Outages Strike

A practical guide to keeping your AI coding agent running when external LLM APIs throttle or fail, using durable local compute and automatic provider fallback.

TL;DR: Run your coding agent in a durable local sandbox with Modal and Restate for stateful execution, route LLM calls through a gateway that automatically fails over to backup providers on rate limits or outages, and test these paths before production. This keeps the agent productive when external APIs degrade.

Map Your Agent's External Dependencies and Failure Modes

Start by listing every external LLM and tool API your coding agent touches, then classify each endpoint by whether it typically fails with throttling or a hard outage so you can assign the right retry or fallback logic. AI agents can introduce errors just as humans do, and they often lack the system-specific knowledge needed to recover gracefully from provider issues, so the recovery rules have to be written down for them. Treat rate limiting and provider outages as distinct failure modes: a rate limit is a throttling signal that often clears after a short wait, while an outage requires a true fallback path.
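The distinction above can be sketched as a small classifier. This is a minimal illustration, not any provider's documented contract: it assumes throttling is signalled with HTTP 429 and that 5xx responses or transport failures mean an outage. The function name and mode labels are hypothetical, and your providers may use other status codes for overload.

```python
from typing import Optional

RATE_LIMIT = "rate_limit"
OUTAGE = "outage"
CLIENT_ERROR = "client_error"
OK = "ok"

def classify_failure(status: Optional[int], network_error: bool = False) -> str:
    """Map an HTTP status (or a transport failure) to a failure mode."""
    if network_error or status is None:
        return OUTAGE        # no response at all: treat as a hard outage
    if status == 429:
        return RATE_LIMIT    # assumed throttling signal: usually clears after a wait
    if status >= 500:
        return OUTAGE        # server-side failure: needs a fallback path
    if status >= 400:
        return CLIENT_ERROR  # bad request or bad credentials: retrying will not help
    return OK
```

Separating client errors from the other two modes keeps a malformed request or an expired key from being retried or failed over as if the provider were at fault.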

Inventory every provider in a structured map that records the endpoint, its failure mode, and the intended resilience strategy. A concise manifest makes the agent’s blast radius explicit and prevents a single provider incident from bringing down your entire AI app. Include both LLM providers and auxiliary tool APIs, because any external call can stall the agent if its failure mode is undefined.

AGENT_DEPS = {
    "llm-primary": {
        "endpoint": "https://api.openai.com/v1/chat/completions",
        "failure_mode": "rate_limit",
        "strategy": "exponential_backoff"
    },
    "llm-fallback": {
        "endpoint": "http://localhost:11434/v1/chat/completions",
        "failure_mode": "outage",
        "strategy": "immediate_failover"
    },
    "tool-static-analysis": {
        "endpoint": "http://localhost:8080/analyze",
        "failure_mode": "outage",
        "strategy": "queue_and_retry"
    }
}

Use this map to drive your agent’s execution loop. When a call triggers a rate-limit response, reference the manifest to trigger a brief backoff; when the primary returns a network or outage error, switch to the fallback endpoint defined for outage mode. Keep the manifest in version control next to the agent code so changes to dependencies or failure modes require an explicit update.
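One way to let the manifest drive the loop is sketched below. It is an illustration under stated assumptions: the trimmed AGENT_DEPS copy exists only so the block runs on its own, and backoff_delay, next_action, and the ("wait" / "switch" / "fail") action labels are hypothetical names, not part of any library.

```python
import random

# Trimmed, hypothetical copy of the manifest so this sketch is self-contained.
AGENT_DEPS = {
    "llm-primary": {"failure_mode": "rate_limit", "strategy": "exponential_backoff"},
    "llm-fallback": {"failure_mode": "outage", "strategy": "immediate_failover"},
}

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def next_action(observed_mode: str, attempt: int, max_attempts: int = 3):
    """Return ("wait", seconds), ("switch", dependency_name), or ("fail", None)."""
    if observed_mode == "rate_limit" and attempt < max_attempts:
        return ("wait", backoff_delay(attempt))
    # Outage, or backoff attempts exhausted: use the first dependency marked for failover.
    for name, dep in AGENT_DEPS.items():
        if dep["strategy"] == "immediate_failover":
            return ("switch", name)
    return ("fail", None)
```

The jitter spreads retries out so that several agent workers throttled at the same moment do not all retry in lockstep.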

Run Agent Compute in Durable Local Sandboxes

Run the agent’s code sandbox as serverless compute with Modal, and use Restate to manage execution state so the agent survives external API failures. This decouples the agent’s runtime from LLM provider uptime by keeping execution and context resilient even when upstream endpoints are unreachable.

With Modal, you define the sandbox as a serverless function where the agent develops and runs code. The function runs in an isolated container that you configure, so tool calls and compilation steps continue regardless of the LLM provider’s health. Note that Modal executes functions on its own infrastructure, so "local" here means independent of the LLM provider rather than on your own machine. The agent’s build and test loops are then not blocked by problems between your infrastructure and the model provider:

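import modal

app = modal.App("agent-sandbox")

@app.function()
def run_agent_step(code: str) -> str:
    # Execute one agent-generated snippet inside the Modal container.
    # exec() runs arbitrary code; isolation comes from the container, not from Python.
    exec(code, {"__builtins__": __builtins__})
    return "step complete"

@app.local_entrypoint()
def main():
    # Invoked with `modal run <file>`; .remote() runs the step in the container.
    print(run_agent_step.remote("print('hello from the sandbox')"))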

Restate handles the durable execution layer. You register a workflow service that orchestrates agent steps, automatically persisting state and retrying on transient errors. The workflow engine stores execution state durably, so you can restart the agent process without losing the current task context:

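import * as restate from "@restatedev/restate-sdk";

// Assumed to be defined elsewhere: forwards the code to the Modal sandbox
// and resolves with its output. The name and signature are placeholders.
declare function callModalSandbox(code: string): Promise<string>;

const agentWorkflow = restate.workflow({
  name: "coding-agent",
  handlers: {
    run: async (ctx: restate.WorkflowContext, request: { code: string }) => {
      // ctx.run journals the result, so a replay after a crash reuses it
      // instead of executing the sandbox step a second time
      const result = await ctx.run("sandbox-step", () => callModalSandbox(request.code));
      return result;
    },
  },
});

// Bind the workflow and start serving it so the Restate server can invoke it
restate.endpoint().bind(agentWorkflow).listen(9080);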

Because Restate provides idempotency, retries, and scalable orchestration for agent context and workflows, the agent resumes exactly where it left off once external APIs return. While the LLM endpoint is unreachable, the Modal sandbox stays alive and Restate pauses the workflow without dropping context, resuming automatically when the provider recovers.

Configure Automatic LLM Provider Fallback

Route every LLM request through a gateway that supports automatic provider failover, so a rate-limit error or outage from your primary model instantly triggers a backup without interrupting your coding agent. This eliminates single-provider dependency and prevents a single incident from bringing down your entire AI application.

A practical implementation uses an LLM router with a prioritized fallback chain: a primary cloud provider, a secondary cloud provider, and finally a local endpoint such as Ollama or vLLM. The router should advance to the next tier only on a genuine provider failure (a rate limit that outlasts a short retry, a network error, or an outage), so that brief glitches on the primary do not push traffic onto local compute.

Here is a minimal Python example using litellm.Router to define the tiered fallback:

import asyncio

from litellm import Router

router = Router(
    model_list=[
        {"model_name": "primary", "litellm_params": {"model": "openai/gpt-4o", "api_key": "sk-..."}},
        {"model_name": "backup-cloud", "litellm_params": {"model": "anthropic/claude-3-5-sonnet-20241022", "api_key": "sk-..."}},
        {"model_name": "backup-local", "litellm_params": {"model": "ollama/codellama", "api_base": "http://localhost:11434"}},
    ],
    # If "primary" fails, try "backup-cloud", then "backup-local", in that order
    fallbacks=[{"primary": ["backup-cloud", "backup-local"]}],
)

async def main():
    # acompletion is a coroutine, so it has to be awaited inside an async function
    response = await router.acompletion(
        model="primary",
        messages=[{"role": "user", "content": "Refactor this function"}],
    )
    print(response.choices[0].message.content)

asyncio.run(main())

If the primary provider encounters rate limits, network errors, or an outage, the router automatically retries the request against the backup cloud model, then the local model if needed. Keep local context windows and token limits in mind when chaining to a smaller endpoint, and monitor fallback frequency with gateway logs so you can adjust quotas or capacity before the final tier becomes overloaded.
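To make "monitor fallback frequency" concrete, here is a minimal sliding-window tracker. It is a sketch under assumptions: how you learn which tier served a request depends on your gateway and its logs, so the caller simply passes a boolean, and FallbackMonitor is a hypothetical helper rather than part of LiteLLM.

```python
import time
from collections import deque
from typing import Optional

class FallbackMonitor:
    """Tracks what fraction of recent requests were served by a fallback tier."""

    def __init__(self, window_seconds: float = 300.0):
        self.window = window_seconds
        self.events = deque()  # (timestamp, used_fallback) pairs, oldest first

    def _trim(self, now: float) -> None:
        # Drop events that have aged out of the window
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def record(self, used_fallback: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.events.append((now, used_fallback))
        self._trim(now)

    def fallback_rate(self, now: Optional[float] = None) -> float:
        now = time.monotonic() if now is None else now
        self._trim(now)
        if not self.events:
            return 0.0
        return sum(1 for _, fb in self.events if fb) / len(self.events)
```

A rising rate over the window is the signal to raise quotas on the primary or add capacity behind the local tier; the threshold that should trigger an alert is a judgment call for your workload.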

Harden the Local Loop with Retries and Timeouts

Wrap every external LLM call in a short timeout and a circuit breaker so the agent fails fast instead of hanging when the primary provider degrades. Persist the agent context to Restate before each invocation so retries and fallback transitions remain idempotent and state is never lost.

Conflicting or missing timeouts are a known source of outages, so enforcing a single hard ceiling keeps the agent predictable. A common approach is to cap request latency: if the primary provider does not respond within a few seconds, raise immediately and route to the fallback rather than consuming threads or GPU quotas while waiting. The snippet below persists the current tool and conversation state, then caps the wait at five seconds:

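import asyncio

async def generate_with_guardrails(prompt, primary, fallback, state_store):
    # state_store is a placeholder for your durable store (for example, Restate state).
    # Persist context before any external call so a retry starts from known state.
    await state_store.save("agent_context", prompt.context)

    try:
        # Hard ceiling: give the primary five seconds, then stop waiting
        return await asyncio.wait_for(primary.complete(prompt), timeout=5.0)
    except (asyncio.TimeoutError, ConnectionError):
        # Timed out or unreachable: route the same prompt to the fallback
        return await fallback.complete(prompt)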

For cascading errors, add a circuit breaker that counts consecutive failures and blocks the primary until it recovers:

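import time

class CircuitBreaker:
    def __init__(self, threshold=3, reset_after=30.0):
        self.failures = 0
        self.threshold = threshold
        self.reset_after = reset_after  # seconds to keep the circuit open
        self.opened_at = None           # monotonic time at which the circuit opened

    async def run(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open")
            # Cooldown elapsed: allow one trial call; a single failure reopens the circuit
            self.opened_at = None
            self.failures = self.threshold - 1
        try:
            result = await fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result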

Because the context is saved up front, a retry or a replayed step starts from a known state. Restate journals the result of each completed step, so a replay reuses the recorded result instead of re-running the side effect, which keeps duplicate operations from corrupting local tool state.
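The idea behind replay-safe steps can be illustrated with a toy journal. This is a conceptual sketch only, not Restate's implementation: StepJournal is a hypothetical in-memory class, whereas a real durable execution engine persists the journal outside the process so it survives restarts.

```python
import json

class StepJournal:
    """Toy in-memory journal: each named step executes at most once."""

    def __init__(self):
        self._results = {}  # step name -> JSON-encoded result

    def run(self, step_name, fn):
        if step_name in self._results:
            # Replay: return the recorded result without calling fn again
            return json.loads(self._results[step_name])
        result = fn()
        self._results[step_name] = json.dumps(result)
        return result

calls = []

def write_file():
    calls.append("write")  # stands in for a side effect such as writing to disk
    return {"path": "out.py"}

journal = StepJournal()
first = journal.run("write-file", write_file)
second = journal.run("write-file", write_file)  # replayed, side effect skipped
```

After both calls, `first` and `second` hold the same result and `calls` contains a single entry, which is the property that makes a retry after a failover safe.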

Validate Failover Paths Before Production

Test every failover path in staging before your coding agent serves production traffic. Simulating provider failures and rate limits there exposes routing errors and state-loss bugs that would otherwise trigger outages.

Reliability testing can catch agent errors before they cause outages. A common approach is to introduce artificial faults against the primary LLM upstream and verify that the gateway automatically routes requests to the backup provider. Use a proxy tool to simulate a complete timeout against the primary endpoint:

toxiproxy-cli toxic add -t timeout -a timeout=0 primary_llm

With the primary upstream black-holed, submit a coding task through your gateway and assert that the fallback model returns a valid completion within your acceptable latency threshold. Then verify that the durable local sandbox continues processing without dropping context or duplicating work. Confirm the agent process remains active and inspect recent logs for successful recovery:

systemctl is-active "$AGENT_SERVICE" && \
journalctl -u "$AGENT_SERVICE" --since "1 minute ago" --no-pager

You should also simulate rate-limit events by throttling the primary upstream and confirming that the gateway promotes the secondary provider before the agent's own request times out. Finally, run static analysis and dependency scans on any code produced while the fallback model is active. Confirm that linting, type-checking, and security constraints still hold, because a backup provider may generate output with different formatting or dependency patterns than your primary model. Repeating this validation after every gateway or model update keeps the failover path trustworthy.
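Before reaching for a proxy, the same checks can be rehearsed at unit level with stub providers. The sketch below is an assumption-laden illustration: StubProvider and complete_with_failover are hypothetical stand-ins for your real clients and gateway, and the 50 ms timeout is chosen only to keep the test fast.

```python
import asyncio

class StubProvider:
    """Fake LLM client that can simulate an outage (error) or a hang (delay)."""

    def __init__(self, name, error=None, delay=0.0):
        self.name, self.error, self.delay = name, error, delay
        self.calls = 0

    async def complete(self, prompt):
        self.calls += 1
        await asyncio.sleep(self.delay)
        if self.error is not None:
            raise self.error
        return f"{self.name}: ok"

async def complete_with_failover(prompt, providers, timeout=0.05):
    """Try each provider in order; move on after a timeout or connection error."""
    last_error = None
    for provider in providers:
        try:
            return await asyncio.wait_for(provider.complete(prompt), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
    raise RuntimeError("all providers failed") from last_error

def test_failover_on_outage():
    primary = StubProvider("primary", error=ConnectionError("down"))
    backup = StubProvider("backup")
    result = asyncio.run(complete_with_failover("task", [primary, backup]))
    assert result == "backup: ok"
    assert primary.calls == 1  # the primary was tried once, not retried in a loop

def test_failover_on_timeout():
    primary = StubProvider("primary", delay=1.0)  # simulates a black-holed upstream
    backup = StubProvider("backup")
    result = asyncio.run(complete_with_failover("task", [primary, backup]))
    assert result == "backup: ok"
```

Tests like these run in milliseconds and catch routing mistakes early; the staging run against a real proxy is still needed to exercise the actual gateway configuration.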

FAQ

What is the difference between provider fallback and local failover?

Provider fallback automatically routes LLM requests to backup cloud providers when the primary encounters rate limits or outages. Local failover keeps the agent's compute and state in a durable sandbox—such as Modal with Restate—so the agent continues operating even when all external providers are unreachable.

Do I need an LLM Gateway to implement fallback?

An LLM Gateway is the common pattern for automatic routing, but you can also implement client-side logic. The critical rule is to avoid a single provider incident bringing down your entire AI app.

How does durable execution help during an API outage?

Durable execution platforms provide resilience, idempotency, and retries for agent workflows. Restate, for example, ensures that pending tasks and agent context survive restarts and can resume once services recover.

Should I run a local LLM or just switch to another cloud provider?

A common approach is to tier your fallbacks: a secondary cloud provider first, then a local or edge model for complete autonomy. Test both paths to ensure code quality remains acceptable during failover.

How do I test failover without waiting for a real outage?

Use reliability testing to simulate failures in staging. Inject rate-limit responses and provider downtime to verify that your gateway and durable local environment handle the switch correctly.

I packaged the setup above into a ready-to-use kit — Go-Local Failover Kit: Emergency Local Inference for Coding Agents (16 Items) — for anyone who'd rather copy-paste than wire it from scratch: https://unfairhq.gumroad.com/l/yusudf.

Last updated: 2026-06-28