RPC Failover: How to Build Reliable Blockchain Infrastructure

How to implement RPC failover for your dApp using ethers.js FallbackProvider, health checks, circuit breakers and multi-provider strategies. Prevent downtime before it happens.

BoltRPC
BoltRPC Team
8 min read

Your RPC provider will go down at some point. Maintenance windows, network incidents, infrastructure overload: no provider has perfect uptime. The question is not whether it will happen, but whether your application will survive it.

This guide covers every layer of RPC failover: from the built-in ethers.js FallbackProvider to manual circuit breakers, health checks, and multi-provider routing patterns.


Why RPC Reliability Matters More Than You Think

A single RPC endpoint is a single point of failure. When that endpoint becomes unavailable:

  • Read calls fail: balances, contract state, transaction receipts
  • Write calls fail: transactions cannot be submitted or confirmed
  • WebSocket subscriptions disconnect: real-time event listeners go silent
  • Users see errors, retries, or blank screens

For DeFi protocols, this can mean missed liquidations, failed limit orders, or users unable to withdraw funds during volatile markets. For any production dApp, it means user churn.

The fix is not complicated. Add a fallback.


Layer 1: ethers.js FallbackProvider

ethers.js v6 includes FallbackProvider, a built-in wrapper that manages multiple providers and routes requests automatically.

import { ethers } from 'ethers';

const primary = new ethers.JsonRpcProvider(
  'https://eu.endpoints.matrixed.link/rpc/ethereum?auth=YOUR_KEY'
);

const fallback = new ethers.JsonRpcProvider(
  'https://your-secondary-rpc-url/ethereum'
);

const provider = new ethers.FallbackProvider([
  { provider: primary,  priority: 1, weight: 2, stallTimeout: 2000 },
  { provider: fallback, priority: 2, weight: 1, stallTimeout: 3000 },
]);

How FallbackProvider works

  • priority: Lower number = tried first. Primary (1) is always tried before fallback (2).
  • weight: When both providers return results, higher weight = more trusted. If results disagree, weighted voting determines which to use.
  • stallTimeout: How long (ms) to wait before trying the next provider. If primary does not respond within 2000ms, fallback is tried.

// You use it exactly like a regular provider
const balance = await provider.getBalance('0xAddress');
const block = await provider.getBlockNumber();

FallbackProvider handles routing transparently. Your application code does not change.
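
The stallTimeout idea can be pictured with a small standalone sketch. This is not how ethers.js is implemented internally; it is a simplified model that assumes each source is a plain async function, and raceWithStall and its parameters are hypothetical names chosen for illustration.

```javascript
// Illustrative only: a simplified model of "wait stallTimeout, then also try the next one".
// `sources` is a hypothetical array of { run, stallTimeout } in priority order.
function raceWithStall(sources) {
  return new Promise((resolve, reject) => {
    if (sources.length === 0) {
      reject(new Error('No sources configured'));
      return;
    }

    let started = 0;   // how many sources have been started so far
    let failed = 0;    // how many sources have rejected
    let settled = false;
    let timer = null;  // stall timer of the most recently started source

    const startNext = () => {
      if (settled || started >= sources.length) return;
      const index = started++;
      const { run, stallTimeout } = sources[index];

      // If this source stalls, start the next one without cancelling this one
      timer = setTimeout(startNext, stallTimeout);

      Promise.resolve().then(run).then(
        (value) => {
          if (settled) return;
          settled = true;
          clearTimeout(timer);
          resolve(value); // first successful answer wins
        },
        (error) => {
          failed++;
          if (settled) return;
          if (failed === sources.length) {
            settled = true;
            clearTimeout(timer);
            reject(error); // every source failed
            return;
          }
          // A hard failure of the newest source moves on immediately
          if (index === started - 1) {
            clearTimeout(timer);
            startNext();
          }
        }
      );
    };

    startNext();
  });
}
```

With a primary that takes 200ms, a stallTimeout of 20ms, and a fallback that answers in 5ms, the sketch resolves with the fallback's answer after roughly 25ms instead of waiting for the primary.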


Layer 2: Quorum-Based Verification

For high-value applications, configure FallbackProvider to require agreement from multiple providers before trusting a result:

const provider = new ethers.FallbackProvider(
  [
    { provider: provider1, priority: 1, weight: 1 },
    { provider: provider2, priority: 1, weight: 1 },
    { provider: provider3, priority: 1, weight: 1 },
  ],
  undefined,    // network: left undefined so it is detected from the providers
  { quorum: 2 } // result quorum: minimum total weight that must agree on a result
);

With 3 providers each at weight 1 and a quorum of 2, at least 2 providers must agree on the result. A compromised or incorrect provider cannot mislead your application alone.

Use quorum for: reading high-value state (vault balances, oracle prices, governance votes). Not needed for standard queries.
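
The weighted-vote arithmetic can be illustrated with a short standalone sketch. It is a simplified model, not the ethers.js implementation; resolveQuorum and the { value, weight } response shape are hypothetical names used only for this example.

```javascript
// Illustrative only: a simplified model of weighted quorum voting.
// `responses` is a hypothetical array of { value, weight } collected from providers.
function resolveQuorum(responses, quorum) {
  const tally = new Map(); // string form of a result -> { value, total weight }

  for (const { value, weight } of responses) {
    const key = String(value); // compare results by their string form
    const entry = tally.get(key) ?? { value, total: 0 };
    entry.total += weight;
    tally.set(key, entry);
  }

  // Return the first result whose combined weight reaches the quorum
  for (const entry of tally.values()) {
    if (entry.total >= quorum) return entry.value;
  }
  return null; // no result reached the quorum
}
```

With three weight-1 responses of 100, 100, and 99 and a quorum of 2, the sketch returns 100; the single dissenting provider is outvoted. If only two providers answer and they disagree, nothing reaches the quorum and the sketch returns null.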


Layer 3: Manual Circuit Breaker

FallbackProvider handles per-request failover. For sustained outages, add a circuit breaker that stops sending traffic to a failed provider until it recovers.

class RpcCircuitBreaker {
  constructor(provider, options = {}) {
    this.provider = provider;
    this.failures = 0;
    this.lastFailure = null;
    this.state = 'closed'; // closed = healthy, open = failed, half-open = testing

    this.threshold = options.threshold || 5;       // failures before opening
    this.timeout = options.timeout || 30000;        // ms before trying again
    this.successThreshold = options.successThreshold || 2; // successes to close
    this.consecutiveSuccesses = 0;
  }

  async call(method, ...args) {
    if (this.state === 'open') {
      const timeSinceFailure = Date.now() - this.lastFailure;

      if (timeSinceFailure < this.timeout) {
        throw new Error('Circuit open — provider unavailable');
      }

      // Timeout elapsed — try again
      this.state = 'half-open';
    }

    try {
      const result = await this.provider[method](...args);
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;

    if (this.state === 'half-open') {
      this.consecutiveSuccesses++;
      if (this.consecutiveSuccesses >= this.successThreshold) {
        this.state = 'closed';
        this.consecutiveSuccesses = 0;
        console.log('Circuit closed — provider recovered');
      }
    }
  }

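  onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    this.consecutiveSuccesses = 0; // a failure breaks any recovery streak

    // A failed probe in half-open reopens immediately; otherwise wait for the threshold
    if (this.state === 'half-open' || this.failures >= this.threshold) {
      this.state = 'open';
      console.error(`Circuit opened after ${this.failures} failures`);
    }
  }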

  isHealthy() {
    return this.state === 'closed';
  }
}

// Usage
const breaker = new RpcCircuitBreaker(primaryProvider, {
  threshold: 5,
  timeout: 30000
});

let balance; // declared outside try/catch so the result is usable afterwards
try {
  balance = await breaker.call('getBalance', address);
} catch (error) {
  if (error.message.includes('Circuit open')) {
    // Route to fallback
    balance = await fallbackProvider.getBalance(address);
  } else {
    throw error; // surface other failures instead of silently swallowing them
  }
}
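
When more than two providers are involved, the same routing can be generalized. The following is a minimal sketch, assuming each entry exposes a call(method, ...args) function like the RpcCircuitBreaker class above; callWithBreakers is a hypothetical helper name.

```javascript
// Illustrative only: `breakers` is a hypothetical array of objects exposing
// call(method, ...args), tried in order until one succeeds.
async function callWithBreakers(breakers, method, ...args) {
  let lastError = new Error('No providers configured');

  for (const breaker of breakers) {
    try {
      return await breaker.call(method, ...args);
    } catch (error) {
      lastError = error; // open circuit or failed request: try the next provider
    }
  }

  throw lastError; // every provider was unavailable or failed
}
```

Note that this sketch moves on after any error, including ones a second provider would not fix (a reverted call, for example). A production version would likely filter on error type first.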

Layer 4: Health Checks

Proactively monitor your RPC endpoints rather than discovering failures when users hit errors.

class RpcHealthMonitor {
  constructor(endpoints) {
    this.endpoints = endpoints;
    this.status = {};
    this.latency = {};

    // Initialize all as unknown
    endpoints.forEach(e => {
      this.status[e.name] = 'unknown';
      this.latency[e.name] = null;
    });
  }

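  async checkEndpoint(endpoint) {
    const start = Date.now();
    const provider = new ethers.JsonRpcProvider(endpoint.url);
    let timer;

    try {
      // Bound the check (assumed 5s budget): a dead endpoint may leave the call pending
      const timeout = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error('health check timed out')), 5000);
      });
      await Promise.race([provider.getBlockNumber(), timeout]);

      this.latency[endpoint.name] = Date.now() - start;
      this.status[endpoint.name] = 'healthy';
    } catch (error) {
      this.status[endpoint.name] = 'unhealthy';
      this.latency[endpoint.name] = null;
      console.error(`${endpoint.name} health check failed:`, error.message);
    } finally {
      clearTimeout(timer);
      provider.destroy(); // release the provider created for this check
    }
  }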

  async checkAll() {
    await Promise.all(this.endpoints.map(e => this.checkEndpoint(e)));
    return this.getStatus();
  }

  getStatus() {
    return this.endpoints.map(e => ({
      name: e.name,
      status: this.status[e.name],
      latencyMs: this.latency[e.name],
    }));
  }

  getBestEndpoint() {
    return this.endpoints
      .filter(e => this.status[e.name] === 'healthy')
      .sort((a, b) => (this.latency[a.name] || 9999) - (this.latency[b.name] || 9999))[0];
  }
}

// Run health checks every 60 seconds
const monitor = new RpcHealthMonitor([
  { name: 'primary', url: 'https://eu.endpoints.matrixed.link/rpc/ethereum?auth=YOUR_KEY' },
  { name: 'fallback', url: 'https://your-fallback-rpc' },
]);

setInterval(() => monitor.checkAll(), 60000);
monitor.checkAll(); // Initial check
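
To turn periodic checks into alerts, it helps to compare consecutive snapshots and react only when something changes. The following is a minimal sketch, assuming snapshots shaped like the getStatus() output above; statusTransitions is a hypothetical helper name, and how the alert is delivered is left out.

```javascript
// Illustrative only: compare two snapshots shaped like getStatus() output
// and report the endpoints whose status changed between them.
function statusTransitions(previous, current) {
  const before = new Map(previous.map((e) => [e.name, e.status]));

  return current
    .filter((e) => before.has(e.name) && before.get(e.name) !== e.status)
    .map((e) => ({ name: e.name, from: before.get(e.name), to: e.status }));
}
```

Calling this after each checkAll() with the previous and the new snapshot yields an empty array while nothing changes and one entry per endpoint that flipped, for example from 'healthy' to 'unhealthy'.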

Layer 5: WebSocket Failover

WebSocket connections need their own failover logic. FallbackProvider does not handle WebSocket subscriptions.

class ReliableWebSocket {
  constructor(endpoints) {
    this.endpoints = endpoints;
    this.currentIndex = 0;
    this.provider = null;
    this.listeners = new Map();
    this.reconnectDelay = 1000;
    this.maxDelay = 30000;
  }

  async connect() {
    const endpoint = this.endpoints[this.currentIndex];
    console.log(`Connecting to ${endpoint.name}...`);

    this.provider = new ethers.WebSocketProvider(endpoint.url);

    // Note: .on() assumes the Node.js `ws` socket; browser sockets use addEventListener
    const socket = this.provider.websocket;

    socket.on('open', () => {
      console.log(`Connected to ${endpoint.name}`);
      this.reconnectDelay = 1000; // Reset only once the socket is actually open
    });

    socket.on('close', () => {
      console.log(`${endpoint.name} disconnected. Failing over...`);
      this.failover();
    });

    socket.on('error', (err) => {
      console.error(`${endpoint.name} error:`, err.message);
    });

    // Re-attach all registered listeners to the new provider
    this.reattachListeners();
  }

  failover() {
    // Try next endpoint
    this.currentIndex = (this.currentIndex + 1) % this.endpoints.length;

    // Back off on every attempt; connect() resets the delay when a socket opens
    const delay = this.reconnectDelay;
    this.reconnectDelay = Math.min(this.reconnectDelay * 2, this.maxDelay);

    setTimeout(() => {
      this.connect().catch((err) => {
        console.error('Reconnect failed:', err.message);
        this.failover();
      });
    }, delay);
  }

  onBlock(handler) {
    this.listeners.set('block', handler);
    this.provider?.on('block', handler);
  }

  reattachListeners() {
    this.listeners.forEach((handler, event) => {
      this.provider.on(event, handler);
    });
  }
}

// Usage
const wsManager = new ReliableWebSocket([
  { name: 'BoltRPC',   url: 'wss://eu.endpoints.matrixed.link/ws/ethereum?auth=YOUR_KEY' },
  { name: 'Fallback',  url: 'wss://your-fallback-wss-url' },
]);

await wsManager.connect();

wsManager.onBlock((blockNumber) => {
  console.log('New block:', blockNumber);
});
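
The class above stores one handler per event name, so registering a second block handler would replace the first when listeners are re-attached after a failover. One possible way around that is a small registry that keeps a set of handlers per event. This is a sketch under the assumption that the target exposes on(event, handler), as ethers providers do; ListenerRegistry is a hypothetical name.

```javascript
// Illustrative only: remembers handlers and replays them onto any object
// exposing on(event, handler), such as a new provider after failover.
class ListenerRegistry {
  constructor() {
    this.handlers = new Map(); // event -> Set of handlers
  }

  add(event, handler) {
    if (!this.handlers.has(event)) this.handlers.set(event, new Set());
    this.handlers.get(event).add(handler);
  }

  remove(event, handler) {
    this.handlers.get(event)?.delete(handler);
  }

  // Attach every remembered handler to a freshly created emitter
  attachTo(emitter) {
    for (const [event, handlers] of this.handlers) {
      for (const handler of handlers) emitter.on(event, handler);
    }
  }
}
```

In the class above, the listeners map and reattachListeners() could delegate to such a registry, with attachTo(this.provider) called at the end of connect().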

Retry Logic for Transient Failures

Not every error needs a failover. Network hiccups, temporary overload, and rate limit responses are often transient. A retry after a short delay is enough.

async function withRetry(fn, options = {}) {
  const maxRetries = options.maxRetries ?? 3; // ?? keeps an explicit 0 (no retries)
  const baseDelay = options.baseDelay ?? 500;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      const isLastAttempt = attempt === maxRetries;
      const isRetryable = isRetryableError(error);

      if (isLastAttempt || !isRetryable) throw error;

      const delay = baseDelay * Math.pow(2, attempt); // exponential backoff
      console.warn(`Attempt ${attempt + 1} failed. Retrying in ${delay}ms...`);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}

function isRetryableError(error) {
  // Retry on network errors, timeouts, rate limits
  const retryableCodes = ['NETWORK_ERROR', 'TIMEOUT', 'SERVER_ERROR'];
  const retryableMessages = ['rate limit', '429', '503', 'timeout'];

  if (retryableCodes.includes(error.code)) return true;
  if (retryableMessages.some(msg => error.message?.toLowerCase().includes(msg))) return true;

  return false;
}

// Usage
const balance = await withRetry(
  () => provider.getBalance(address),
  { maxRetries: 3, baseDelay: 500 }
);
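
To see what the exponential backoff costs in waiting time, the delays can be computed up front. The helper below is a hypothetical illustration that mirrors the baseDelay * 2^attempt formula used in withRetry above.

```javascript
// Illustrative only: list the waits between attempts for a given configuration.
// There is one delay per retry; no delay follows the final attempt.
function backoffSchedule(baseDelay, maxRetries) {
  return Array.from({ length: maxRetries }, (_, attempt) => baseDelay * Math.pow(2, attempt));
}

const delays = backoffSchedule(500, 3);                  // [500, 1000, 2000]
const totalWait = delays.reduce((sum, d) => sum + d, 0); // 3500
```

With the defaults shown in the usage above (3 retries, 500ms base), a request that keeps failing waits 3.5 seconds in total, plus the duration of the four attempts themselves, before the error is thrown.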

The Full Production Stack

Request
   ↓
withRetry(): handles transient failures
   ↓
FallbackProvider: routes between primary and secondary
   ↓
CircuitBreaker: stops traffic to failed providers
   ↓
HealthMonitor: proactive alerting before users see failures
   ↓
Blockchain

Most dApps only need retry + FallbackProvider. Add circuit breakers and health monitors when your application handles significant user funds or has strict uptime requirements.
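
The way the first two layers nest can be sketched in one function: retries run inside each provider, and failover runs across providers. This is a simplified stand-in, not a replacement for FallbackProvider; resilientCall and its options are hypothetical, and providers are assumed to be plain objects exposing the requested method.

```javascript
// Illustrative only: retry each provider a few times, then move on to the next.
// `providers` is a hypothetical array of objects exposing the requested method.
async function resilientCall(providers, method, args = [], options = {}) {
  const maxRetries = options.maxRetries ?? 2;
  const baseDelay = options.baseDelay ?? 250;
  let lastError = new Error('No providers configured');

  for (const provider of providers) {
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return await provider[method](...args);
      } catch (error) {
        lastError = error;
        if (attempt < maxRetries) {
          // Exponential backoff before retrying the same provider
          await new Promise((resolve) => setTimeout(resolve, baseDelay * 2 ** attempt));
        }
      }
    }
    // Retries exhausted for this provider: fail over to the next one
  }

  throw lastError;
}
```

The ordering is the design choice worth noting: retrying first keeps traffic on the primary through brief hiccups, while failing over only after the retries are spent avoids hammering an endpoint that is really down.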


BoltRPC Infrastructure

BoltRPC runs redundant node clusters with automatic failover on all 22 supported networks. Its infrastructure is ISO/IEC 27001:2022 certified via Matrixed.Link. It is built for high-throughput workloads, processes 2 billion requests daily, and is trusted by Chainlink, Tiingo, Gains Network, and Enjin.

Even with a reliable provider, configure a secondary fallback in your application. Defense in depth costs almost nothing to implement and protects against scenarios outside any provider’s control.

Start your free 2-week trial: trial.boltrpc.io



FAQ

Does FallbackProvider add latency?

Only if the primary provider is slow or unavailable. Requests are sent to the primary first. The fallback is only contacted if the primary stalls (after stallTimeout) or fails. In normal operation, latency is identical to using the primary directly.

Should I always use a fallback provider?

For production applications handling real user funds or with SLA requirements, yes. For personal projects or development environments, a single provider is usually fine.

How do I know when my primary RPC is down?

Health checks (as shown above) catch issues proactively. Alternatively, monitor your FallbackProvider usage. A spike in fallback usage indicates your primary is degraded. Most observability tools (Datadog, Prometheus) can alert on this.
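
One way to measure a spike in fallback usage is a rolling window over recent requests. The sketch below is a hypothetical illustration; it assumes your own routing code knows which provider served each request and calls record() accordingly, and exporting the rate to a monitoring tool is left out.

```javascript
// Illustrative only: track what share of recent requests were served by a fallback.
class FallbackRateTracker {
  constructor(windowSize = 100) {
    this.windowSize = windowSize;
    this.samples = []; // true = served by fallback, false = served by primary
  }

  record(usedFallback) {
    this.samples.push(Boolean(usedFallback));
    if (this.samples.length > this.windowSize) this.samples.shift(); // drop the oldest
  }

  // Fraction of requests in the window that needed the fallback (0 to 1)
  rate() {
    if (this.samples.length === 0) return 0;
    return this.samples.filter(Boolean).length / this.samples.length;
  }
}
```

An alert rule could then fire when rate() stays above a chosen threshold, such as 0.2, for several consecutive checks.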

Can FallbackProvider handle WebSocket failover?

No. FallbackProvider manages HTTP providers only. For WebSocket subscriptions, implement custom reconnection and failover logic as shown in the WebSocket section above.

What is the difference between failover and load balancing?

Failover means switching to a backup when the primary fails. Load balancing distributes requests across multiple providers simultaneously. FallbackProvider is a failover tool. True load balancing requires custom routing logic that sends different requests to different providers based on load or response time.
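
The difference shows up clearly in how the next provider is chosen. The two selectors below are a minimal sketch; pickFailover, createRoundRobin, and the isHealthy callback are hypothetical names, and round-robin is only the simplest of many load-balancing strategies.

```javascript
// Illustrative only: failover always prefers the first healthy provider...
function pickFailover(providers, isHealthy) {
  return providers.find(isHealthy) ?? null;
}

// ...while round-robin load balancing rotates across all healthy providers.
function createRoundRobin(providers, isHealthy) {
  let next = 0; // index where the next search starts

  return () => {
    for (let i = 0; i < providers.length; i++) {
      const index = (next + i) % providers.length;
      if (isHealthy(providers[index])) {
        next = (index + 1) % providers.length;
        return providers[index];
      }
    }
    return null; // nothing healthy right now
  };
}
```

With three healthy providers, pickFailover returns the first one on every call, whereas the round-robin selector cycles through all three and skips any that are currently unhealthy.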


Disclaimer: The content in this article is for informational purposes only and does not constitute financial, legal, or technical advice. Code examples and configurations are provided as-is. Always verify information with official documentation and test thoroughly in your own environment before deploying to production.
