Delphic Alpha: Pairs Trading

Pairs Trading Crypto Perpetuals: Backtest Results

oracle — Wed, 29 Apr 2026 19:09:51 GMT

Most pairs trading backtests cheat. They pick pairs using data from the entire test period, then report returns on that same data. The pairs that look good in hindsight get selected, and the result is a backtest that can never be replicated live.

This post does it differently. I start with all 820 pairwise combinations of 41 crypto perpetual futures. No pairs are pre-selected. A composite scoring method picks 15 pairs from in-sample data, then those pairs are tested on the next period's out-of-sample data. Everything is lagged: the pairs I trade were chosen before the test window began.

Bottom line: +70% portfolio return at 8 bps (34% annualized, Sharpe 1.7), +89% at 4 bps (42% annualized, Sharpe 1.9). 30 coins traded, 67% of selected pairs profitable. No look-ahead.

Walk-Forward Design

I use a lagged walk-forward: 6 months in-sample (IS), 3 months out-of-sample (OOS), 3-month step. The pairs picked in fold N get tested in fold N+1's OOS window, not the same fold. Fold 0 only provides the first selection, giving 5 OOS windows.
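A minimal sketch of how those lagged windows line up. The start date of 2024-03-01 is an assumption inferred from the fold dates quoted later in this post (fold 3 testing Jun to Sep 2025, fold 5 testing Dec 2025 to Mar 2026); the function name and return layout are hypothetical.

```python
import pandas as pd

def lagged_walk_forward(start, n_oos=5, is_months=6, oos_months=3, step_months=3):
    """Pair each tested fold k with the IS window of fold k-1 (the lag).

    `start` is assumed to be the first day of a month.
    """
    horizon = n_oos * step_months + is_months + oos_months
    edges = pd.date_range(start, periods=horizon + 1, freq="MS")  # month starts
    folds = []
    for k in range(1, n_oos + 1):
        sel = (k - 1) * step_months          # IS start of the previous fold
        oos = k * step_months + is_months    # OOS start of fold k
        folds.append({
            "fold": k,
            "select_is": (edges[sel], edges[sel + is_months]),
            "test_oos": (edges[oos], edges[oos + oos_months]),
        })
    return folds

for f in lagged_walk_forward("2024-03-01"):
    print(f["fold"], f["select_is"][0].date(), f["select_is"][1].date(),
          f["test_oos"][0].date(), f["test_oos"][1].date())
```

With these settings there is a 3-month gap between the end of each selection window and the start of the window it is tested on, which is what makes the design lagged.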

Starting universe: 820 pairs from 41 coins (all Binance/Hyperliquid overlap, excluding dead tickers). No pre-filtering. The composite scoring does all the selection from IS data alone. Quote coin cap of 2 pairs max. Top 15 selected per fold. Entry z=2.0, exit z=0.2, 7-day OLS hedge, 8 bps round-trip costs.

All 5 OOS folds are included in portfolio metrics. No folds excluded.

Picking Pairs: What Works and What Doesn't

The obvious approach is to pick pairs with the highest in-sample Sharpe. I tested 18 IS metrics as predictors of OOS Sharpe. None of them work. Best R-squared: 0.037.

But two things do carry forward: variance ratio (how mean-reverting the spread is) and skew (whether losses are fat-tailed). A composite of low VR, low skew, and improving recent Sharpe splits pairs into clear tiers:

Bottom quintile: 58% hit rate, avg per-pair Sharpe 0.18. Top quintile: 82% hit rate, avg per-pair Sharpe 1.07. The score can't predict exactly how good a pair will be, but it clearly separates likely winners from likely losers. Note: these are individual pair Sharpes. The portfolio Sharpe is higher (~1.7) because running ~7 pairs simultaneously provides diversification.

I also tested other scoring methods: IS Sharpe alone (69% hit rate, worst fold negative), stability scoring with 3 overlapping sub-windows (best per-pair Sharpe but weakest fold barely breaks even), and VR alone (72% hit rate but most mean-reverting pairs aren't always the most profitable). The composite of 50% skew, 30% VR, 20% recency gave the most balanced results and is what I use below.
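Here is a rough sketch of how a 50/30/20 composite and the quote-coin cap could be wired together. The post does not say how each metric is normalised, so the percentile ranks below are an assumption, as are the column names (`skew`, `vr`, `recent_sharpe_change`) and both function names. The sign convention for skew is also not spelled out; the sketch simply follows the "low skew" wording.

```python
import pandas as pd

def composite_score(metrics: pd.DataFrame) -> pd.Series:
    """Higher is better. Percentile ranks are an assumed normalisation."""
    low_skew = metrics["skew"].rank(ascending=False, pct=True)  # lowest skew -> 1.0
    low_vr = metrics["vr"].rank(ascending=False, pct=True)      # lowest VR -> 1.0
    recency = metrics["recent_sharpe_change"].rank(ascending=True, pct=True)
    return 0.5 * low_skew + 0.3 * low_vr + 0.2 * recency

def select_pairs(metrics: pd.DataFrame, top_n: int = 15, max_per_quote: int = 2):
    """Rank by composite score, cap pairs per quote coin, keep the top N."""
    scored = metrics.assign(score=composite_score(metrics))
    scored = scored.sort_values("score", ascending=False)
    capped = scored.groupby("quote", sort=False).head(max_per_quote)
    return capped.head(top_n)

# Toy IS metrics for five pairs (made-up numbers, illustration only).
toy = pd.DataFrame({
    "quote": ["A", "A", "A", "B", "C"],
    "base": ["X", "Y", "Z", "X", "Y"],
    "skew": [0.1, 0.2, 0.3, 0.4, 0.5],
    "vr": [0.6, 0.7, 0.8, 0.9, 1.0],
    "recent_sharpe_change": [0.5, 0.4, 0.3, 0.2, 0.1],
})
picked = select_pairs(toy, top_n=3, max_per_quote=2)
print(picked[["quote", "base", "score"]])
```

In the toy data the three best-scoring pairs all share quote coin A, so the cap drops the third one and the next-best pair with a different quote coin takes its place.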

Results

At 8 bps round-trip (taker fees): +70% portfolio return (34% annualized), Sharpe 1.7, max drawdown -14%. At 4 bps (maker fills): +89% portfolio return (42% annualized), Sharpe 1.9, max drawdown -20%. ~7 concurrent positions, 30+ coins, all 5 folds included.

Per-Fold Breakdown

3 of 5 folds profitable. Fold 1 (-18% raw) and fold 3 (-4% raw) lose money. Fold 5 (Dec 2025 to Mar 2026) is the strongest at +213% raw.

Which Pairs Survive?

DYDX/STRK (+43%), AVAX/WLD (+37%), NEAR/ORDI (+36%), DOT/DYDX (+33%) lead. 30 unique coins traded across all folds. No single pair dominates; pairs rotate from fold to fold.

What Didn't Work

Kalman filter: overtrades at 5-min.
BTC regime filters: no link between BTC momentum and pair PnL.
Persistence filters: requiring pairs to appear in consecutive IS top lists is too strict.
Nested IS validation: splitting IS in half didn't improve hit rate.
IS Sharpe as a ranking signal: zero correlation with OOS Sharpe.

Honest Assessment

No pre-filtering: all 820 pairs compete in every fold. The composite scoring does all selection from IS data. No look-ahead.
Fold 1 is weak: cold-start with limited IS history. It stays in the headline metrics.
Fold 3 is near-flat: Jun to Sep 2025 is thin across every method I tested. The edge is regime-dependent.
IS can't rank pairs: I can tell which pairs will make money but not which will make the most. Equal-weight is the right call.
30 coins: the scoring naturally diversifies across coins without needing to hand-pick a universe.


Pairs Trading Crypto Perpetuals: The Methodology

oracle — Sat, 11 Apr 2026 15:20:10 GMT

📖 Part 1: The Methodology of the Pairs Trading Crypto Perpetuals series. Also read: Part 2: Backtest Results

Pairs trading is statistical arbitrage on two correlated assets — you bet that their price relationship will revert to the mean. This post covers the math from scratch: how to construct the spread, estimate the hedge ratio, generate z-score signals, and compute PnL. We run this on crypto perpetual futures (Hyperliquid), where 24/7 markets and tight spreads make it a natural fit.

What Is Pairs Trading?

The core idea is simple. Two assets that historically move together sometimes diverge. When they do, you short the one that went up relative to the other and go long the one that went down. When the relationship reverts, you close both legs and collect the difference.

It's one of the oldest systematic strategies: it dates back to a quantitative trading group at Morgan Stanley in the 1980s, and it remains a mainstay of equity stat arb desks. The appeal is that it's largely market-neutral: you don't need to predict whether crypto goes up or down, only whether two coins converge.

The hard parts are:

Pair selection: Most pairs don't mean-revert. Finding the ones that do is the entire game.
Hedge ratio estimation: The relationship between two assets isn't static. You need a rolling estimate of how much of asset B to hold per unit of asset A.
Signal timing: Mean-reversion is real, but it can take longer than your risk budget allows.

Let's build the whole thing from first principles. Each section below covers one stage of the pipeline: the spread, the hedge ratio, the z-score, entry and exit rules, PnL, and pair-quality metrics.

The Spread

The fundamental quantity in pairs trading is the spread — a synthetic time series that measures the deviation between two assets from their equilibrium relationship.

We work in log-price space. For a pair with a "quote" coin and a "base" coin:

spread_t = log(quote_t) - alpha_t - beta_t * log(base_t)

Where:

quote_t and base_t are close prices at time t
beta_t is the hedge ratio — how many dollars of base to hold per dollar of quote
alpha_t is the intercept (absorbs level differences between the two log-price series)

The spread is the residual from a regression of log(quote) on log(base). If the relationship is stable, this residual is stationary — it fluctuates around zero without trending away. That's exactly what we need for mean-reversion.

Why log prices? Two reasons:

Returns compound multiplicatively, and log transforms make this additive
It handles the large price-level differences between coins (BTC at $60K vs DOGE at $0.15) naturally

Rolling OLS Hedge Ratio

The hedge ratio beta is the slope of the linear relationship between the two assets. We estimate it using ordinary least squares (OLS) regression on a rolling window.

At each bar t, we take the most recent W bars and fit:

log(quote_i) = alpha + beta * log(base_i) + epsilon_i,   for i in [t-W, t)

Then we compute the out-of-sample spread at bar t:

spread_t = log(quote_t) - alpha - beta * log(base_t)

Notice the window is [t-W, t) — we use bars before the current bar to estimate beta, then apply it to the current bar. This prevents look-ahead bias: the hedge ratio doesn't use information from the bar it's being applied to.

Window Size

We tested rolling windows of 5 days (1440 bars), 7 days (2016 bars), and 10 days (2880 bars) at 5-minute frequency as part of a systematic parameter sweep (details in Part 2). The 5-7 day range consistently outperformed the 10-day window.

The intuition:

Too short (1-2 days): The hedge ratio is noisy, whipping around on short-term moves. The spread becomes dominated by estimation error rather than genuine divergence.
Too long (10+ days): The hedge ratio can't adapt to structural shifts in the relationship. In crypto, correlations change fast — longer windows dilute signal with stale data and increase drawdowns.
5-7 days balances responsiveness and stability. The 5-day window (1440 bars) produced the highest risk-adjusted returns in our sweep, while the 7-day window (2016 bars) offered slightly lower volatility.

Implementation

The rolling OLS is a simple loop. For each bar, we solve a small least-squares problem on the trailing window:

import numpy as np
import pandas as pd

def rolling_ols_hedge(y: pd.Series, x: pd.Series, window: int):
    """Rolling OLS hedge ratio. y=log(quote), x=log(base)."""
    y_arr, x_arr = y.to_numpy(dtype=float), x.to_numpy(dtype=float)
    n = len(y_arr)
    spread, beta = np.full(n, np.nan), np.full(n, np.nan)
    ones = np.ones(window)
    for t in range(window, n):
        # Fit on the W bars strictly before t, i.e. [t-window, t).
        X = np.column_stack([x_arr[t-window:t], ones])
        params, *_ = np.linalg.lstsq(X, y_arr[t-window:t], rcond=None)
        beta[t], alpha = params[0], params[1]
        # Apply the lagged fit to bar t: out-of-sample residual.
        spread[t] = y_arr[t] - alpha - beta[t] * x_arr[t]
    return pd.Series(spread, index=y.index), pd.Series(beta, index=y.index)

This is O(n * W) which isn't fast — for 2+ years of 5-minute data (200K+ bars) and a 2016-bar window, it takes a few seconds per pair. Good enough for backtesting; for live, we only compute the latest bar.
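If the loop ever becomes the bottleneck, the same single-regressor fit has a closed form that pandas can compute with rolling moments. This is a swapped-in alternative, not what the backtest above uses: a sketch assuming `y` and `x` are aligned log-price Series, with the hypothetical name `rolling_ols_hedge_fast`.

```python
import numpy as np
import pandas as pd

def rolling_ols_hedge_fast(y: pd.Series, x: pd.Series, window: int):
    """Closed-form rolling OLS: beta = cov(x, y) / var(x),
    alpha = mean(y) - beta * mean(x), then lagged by one bar."""
    beta_fit = x.rolling(window).cov(y) / x.rolling(window).var()
    alpha_fit = y.rolling(window).mean() - beta_fit * x.rolling(window).mean()
    # Shift so that bar t only sees the fit on [t-window, t).
    beta = beta_fit.shift(1)
    alpha = alpha_fit.shift(1)
    spread = y - alpha - beta * x
    return spread, beta

# Synthetic log prices, for illustration only.
rng = np.random.default_rng(0)
x = pd.Series(1.0 + np.cumsum(rng.normal(0.0, 0.01, 300)))
y = 0.8 * x + rng.normal(0.0, 0.005, 300)
spread, beta = rolling_ols_hedge_fast(y, x, window=50)
print(beta.iloc[-1])
```

The one-bar shift is what keeps this equivalent to fitting on the bars strictly before the current one; without it the current bar would leak into its own hedge ratio.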

Z-Score: Normalising the Spread

The raw spread has units that depend on the price levels. A spread of 0.01 might be huge for one pair and trivial for another. We normalise it into a z-score:

z_t = (spread_t - mean) / std

Where mean and std are computed on a rolling window of the spread.

The Lagged Window Trick

There's a subtlety. If we compute the rolling mean and std using bars [t-L+1, t] (including the current bar), the current bar's spread value leaks into its own normalisation. This is a form of look-ahead bias: the z-score is artificially pulled toward zero because the current observation influences its own mean and standard deviation.

The fix is to shift by one bar: compute rolling stats on the L bars [t-L, t-1], then normalise the current bar against the lagged statistics.

import pandas as pd

def compute_zscore(spread: pd.Series, lookback: int) -> pd.Series:
    """Rolling z-score with lagged window (no look-ahead bias)."""
    lagged = spread.shift(1)  # statistics for bar t only see bars up to t-1
    roll_mean = lagged.rolling(lookback).mean()
    roll_std = lagged.rolling(lookback).std()
    return (spread - roll_mean) / roll_std

We use a 2-day lookback (576 bars at 5-minute). This is much shorter than the hedge window, which makes sense — the hedge captures the slow-moving structural relationship, while the z-score captures short-term deviations from it.
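To see how much the lag matters, here is a small illustration on synthetic noise with one large jump on the final bar. The data, seed, and the two helper names are made up for the demo; only the 576-bar lookback comes from the text.

```python
import numpy as np
import pandas as pd

def zscore_lagged(spread: pd.Series, lookback: int) -> pd.Series:
    """Stats from the previous `lookback` bars, excluding the current bar."""
    lagged = spread.shift(1)
    return (spread - lagged.rolling(lookback).mean()) / lagged.rolling(lookback).std()

def zscore_leaky(spread: pd.Series, lookback: int) -> pd.Series:
    """Stats from a window that includes the current bar (the leak)."""
    return (spread - spread.rolling(lookback).mean()) / spread.rolling(lookback).std()

rng = np.random.default_rng(42)
spread = pd.Series(rng.normal(0.0, 0.01, 1000))
spread.iloc[-1] = 0.08  # roughly an eight-sigma jump on the final bar

lagged_z = zscore_lagged(spread, 576).iloc[-1]
leaky_z = zscore_leaky(spread, 576).iloc[-1]
print(f"lagged z = {lagged_z:.2f}, leaky z = {leaky_z:.2f}")
```

The leaky version reports a smaller z because the jump inflates its own standard deviation and drags its own mean upward. The effect grows as the lookback shrinks.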

Entry and Exit Rules

With the z-score in hand, the trading rules are straightforward:

Entry

Long spread when z < -2.0 (spread is 2 standard deviations below normal — quote is cheap relative to base)
Short spread when z > 2.0 (spread is 2 standard deviations above normal — quote is expensive relative to base)

Exit

Close position when |z| < 0.2 (spread has reverted close enough to the mean)

Hold Constraints

Minimum hold: 2 hours (24 bars at 5-min). Prevents whipsawing on noisy z-score oscillations around the threshold.
Maximum hold / timeout: 3 days (864 bars). If the spread hasn't reverted after 3 days, close the position and take the loss. This is a crucial risk control — not all divergences revert, and holding indefinitely is how pairs traders blow up.

Why These Thresholds?

We tested entry thresholds of 1.5, 2.0, and 2.5 alongside exit thresholds of 0.2 and 0.5 in a systematic sweep across 72 configurations (Part 2). The key findings:

Entry z = 2.0 is the sweet spot. At 2.0, per-trade quality is highest and portfolio volatility is lowest across the sweep. At 1.5, you get more trades but noisier signals. At 2.5, you cut too many opportunities without improving risk-adjusted returns.
Exit z = 0.2 outperformed 0.5. The tighter exit captures more of the reversion on each round trip, producing fewer but higher-quality trades. Exit z = 0.5 closes early and leaves too much on the table. The trade-off is that waiting for the spread to revert further lets more trades time out.

The key insight: lower entry thresholds aren't inherently worse — they generate more signals, and the filtering step (selecting quality pairs, covered in Part 2) does the real work of separating signal from noise.

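import numpy as np

def generate_positions(z, entry_zscore=2.0, exit_zscore=0.2, min_hold=24, max_hold=864):
    """Position per bar: +1 long spread, -1 short spread, 0 flat."""
    z = np.asarray(z, dtype=float)
    positions = np.zeros(len(z), dtype=int)
    position, bars_held = 0, 0
    for t in range(len(z)):
        if position != 0:
            bars_held += 1
            # Exit: z-score reverts toward zero, or timeout
            if bars_held >= min_hold and (abs(z[t]) < exit_zscore or bars_held >= max_hold):
                position, bars_held = 0, 0
        elif z[t] < -entry_zscore:   # Entry: long spread
            position = 1
        elif z[t] > entry_zscore:    # Entry: short spread
            position = -1
        positions[t] = position
    return positions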

PnL Calculation

When you trade a pairs position, you have two legs. The PnL depends on the returns of both legs and the hedge ratio at entry.

Long spread means: buy the quote coin, sell beta dollars of the base coin. The reverse for short spread.

The return on the spread position is:

raw_return = direction * (quote_return - beta_entry * base_return)

But we need to normalise by the capital at risk. Both legs use capital: $1 on the quote leg and |beta| dollars on the base leg. Total capital = 1 + |beta|.

After normalising and subtracting transaction costs:

pnl = direction * (quote_ret - beta * base_ret) / (1 + |beta|) - txcost

We use a transaction cost of 8 basis points (0.08%) round-trip, which covers taker fees on both legs for opening and closing. On Hyperliquid, taker fees are typically 2.5 bps per side, so 4 legs × 2.5 bps = 10 bps — we use 8 bps as a reasonable average accounting for occasional maker fills.

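def trade_pnl(direction, quote_price_entry, quote_price_exit,
              base_price_entry, base_price_exit, beta_entry, transaction_cost=0.0008):
    """Normalised return of one pairs trade, net of round-trip costs."""
    quote_ret = quote_price_exit / quote_price_entry - 1.0
    base_ret  = base_price_exit  / base_price_entry - 1.0
    raw_spread_ret = direction * (quote_ret - beta_entry * base_ret)
    return raw_spread_ret / (1.0 + abs(beta_entry)) - transaction_cost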

Dollar PnL

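To convert from return to dollars, multiply by the total capital deployed, which is the quote-leg notional times (1 + |beta|). If you deploy $500 on the quote leg with |beta| close to 1 (so roughly $500 per leg), and the normalised PnL on a trade is 0.3% (0.003), your dollar PnL is $500 * 0.003 * (1 + |beta|) ≈ $3.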
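A worked example of the PnL arithmetic, using made-up prices. The function name `pairs_trade_pnl` and all inputs are illustrative; the formula is the one given above.

```python
def pairs_trade_pnl(direction, quote_entry, quote_exit, base_entry, base_exit,
                    beta_entry, cost=0.0008):
    """Normalised trade return: spread return over (1 + |beta|), minus costs."""
    quote_ret = quote_exit / quote_entry - 1.0
    base_ret = base_exit / base_entry - 1.0
    raw = direction * (quote_ret - beta_entry * base_ret)
    return raw / (1.0 + abs(beta_entry)) - cost

# Long spread: quote rises 1.0%, base falls 0.2%, hedge ratio 1.0 at entry.
beta = 1.0
pnl = pairs_trade_pnl(+1, 100.0, 101.0, 50.0, 49.9, beta_entry=beta)

# Dollar PnL: normalised return times total capital across both legs.
quote_notional = 500.0
dollars = quote_notional * (1.0 + abs(beta)) * pnl
print(f"normalised pnl = {pnl:.4f}, dollar pnl = ${dollars:.2f}")
```

The raw spread return is 1.2%, which halves to 0.6% once spread over both legs, and the 8 bps cost brings it to 0.52%, or about $5.20 on $1,000 of total capital.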

Quality Metrics: Is This Pair Actually Mean-Reverting?

Not every pair of correlated assets mean-reverts. Some are cointegrated (good), some are just correlated but drift apart over time (bad). We use two metrics to distinguish them.

Variance Ratio

The variance ratio test compares the variance of k-bar returns to 1-bar returns:

VR(k) = Var(spread[t] - spread[t-k]) / (k * Var(spread[t] - spread[t-1]))

The interpretation is:

VR < 1: Mean-reverting. Long-horizon variance is less than what you'd expect from a random walk. Returns at different lags partially cancel each other out — what goes up tends to come back down.
VR = 1: Random walk. No predictable pattern.
VR > 1: Trending. Long-horizon variance exceeds the random walk prediction. Divergences tend to persist.

For pairs trading, we want VR well below 1. Our top pairs typically show VR in the 0.65-0.85 range with a lag of 50 bars.

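import numpy as np

def variance_ratio(series, lag=50):
    """VR(lag) of a spread: < 1 mean-reverting, ~1 random walk, > 1 trending."""
    # Convert to a plain array: subtracting two pandas slices would align on index.
    series = np.asarray(series, dtype=float)
    returns = np.diff(series)
    var_1 = np.var(returns)
    var_k = np.var(series[lag:] - series[:-lag]) / lag
    return var_k / var_1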
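As a sanity check on the interpretation, the sketch below runs the same statistic on two simulated series: a pure random walk and a mean-reverting AR(1). The simulation parameters (AR coefficient 0.98, 50,000 bars, the seed) are arbitrary choices for illustration.

```python
import numpy as np

def variance_ratio(series, lag=50):
    """Variance of lag-bar changes over lag times the variance of 1-bar changes."""
    series = np.asarray(series, dtype=float)
    var_1 = np.var(np.diff(series))
    var_k = np.var(series[lag:] - series[:-lag]) / lag
    return var_k / var_1

rng = np.random.default_rng(7)
noise = rng.normal(0.0, 1.0, 50_000)

random_walk = np.cumsum(noise)       # no mean reversion
ar1 = np.zeros_like(noise)           # mean-reverting around zero
for t in range(1, len(noise)):
    ar1[t] = 0.98 * ar1[t - 1] + noise[t]

vr_rw = variance_ratio(random_walk)
vr_ar1 = variance_ratio(ar1)
print(f"random walk VR = {vr_rw:.2f}, AR(1) VR = {vr_ar1:.2f}")
```

For an AR(1) with coefficient rho the population value is (1 - rho^k) / (k * (1 - rho)), which is about 0.64 for rho = 0.98 and k = 50, so the mean-reverting series should land well below the random walk's value near 1.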

Ornstein-Uhlenbeck Half-Life

The OU process is the continuous-time model for mean reversion:

d(spread) = theta * (mu - spread) * dt + sigma * dW

Where theta is the speed of mean reversion. The half-life — time for the spread to revert halfway to the mean — is ln(2) / theta.

We estimate theta from an AR(1) regression on the spread:

spread[t] = mu + rho * spread[t-1] + noise

theta = -ln(rho)

half_life = ln(2) / theta

A half-life of 200-300 bars (17-25 hours at 5-min) is typical for our best pairs. This tells you roughly how long to expect a trade to last. Pairs with very long half-lives (500+ bars) are technically mean-reverting but too slow to trade profitably at 5-minute frequency — the transaction costs eat the small, slow convergence.
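The AR(1) estimate can be sketched in a few lines. This version fits the regression with `np.polyfit` and returns infinity when the fitted rho falls outside (0, 1); both choices, and the function name, are assumptions rather than the exact code behind the numbers above.

```python
import numpy as np

def ou_half_life(spread) -> float:
    """Half-life in bars from spread[t] = mu + rho * spread[t-1] + noise."""
    s = np.asarray(spread, dtype=float)
    s = s[~np.isnan(s)]                       # drop warm-up NaNs
    rho, mu = np.polyfit(s[:-1], s[1:], 1)    # slope, intercept
    if not 0.0 < rho < 1.0:
        return float("inf")                   # no usable mean reversion
    theta = -np.log(rho)
    return float(np.log(2.0) / theta)

# Simulated spread with rho = 0.99, so the true half-life is about 69 bars.
rng = np.random.default_rng(3)
noise = rng.normal(0.0, 0.001, 50_000)
sim = np.zeros_like(noise)
for t in range(1, len(noise)):
    sim[t] = 0.99 * sim[t - 1] + noise[t]

hl = ou_half_life(sim)
print(f"estimated half-life: {hl:.0f} bars ({hl * 5 / 60:.1f} hours at 5-min)")
```

The estimate is sensitive when rho is close to 1: a small error in rho moves the half-life a lot, so it is better read as a rough order of magnitude than a precise holding period.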

Kalman Filter Hedge: The Advanced Alternative

Rolling OLS has a weakness: it gives equal weight to all observations in the window and zero weight to everything outside it. This creates discontinuities when old observations drop out of the window.

A Kalman filter offers a smoother alternative. It models the hedge ratio as a time-varying state and updates it incrementally with each new observation:

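import numpy as np

def kalman_hedge(y, x, delta=1e-4, Ve=1e-3):
    """Kalman filter hedge. State=[beta, alpha], spread=innovation."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    n = len(y)
    spread, beta = np.full(n, np.nan), np.full(n, np.nan)
    theta, P, Q = np.zeros(2), np.eye(2), delta * np.eye(2)
    for t in range(n):
        F = np.array([x[t], 1.0])
        e = y[t] - F @ theta            # innovation = spread
        K = (P @ F) / (F @ P @ F + Ve)  # Kalman gain
        theta = theta + K * e           # state update
        P = P - np.outer(K, F) @ P + Q  # covariance update plus process noise
        spread[t], beta[t] = e, theta[0]
    return spread, beta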

The delta parameter controls how fast the hedge ratio can change. Small delta (1e-5) = slow adaptation, smooth beta. Large delta (1e-3) = fast adaptation, noisier beta.

In theory, the Kalman filter should outperform rolling OLS on pairs where the hedge ratio genuinely shifts over time. In practice, our parameter sweep told a very different story.

Putting It All Together

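The signal pipeline for a single pair, with parameters chosen from the 72-configuration sweep, runs in five stages: log prices, rolling OLS hedge and spread, lagged z-score, entry and exit with hold constraints, and cost-adjusted PnL.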
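The parameters quoted throughout this post can be gathered in one place. The dataclass below is only a convenient container (the name `PairsConfig` and its field names are hypothetical); the values are the ones stated in the text, with the 7-day hedge window being the one used for the Part 2 backtest.

```python
from dataclasses import dataclass

BARS_PER_DAY = 288  # 5-minute bars in a 24-hour market

@dataclass(frozen=True)
class PairsConfig:
    hedge_window: int = 7 * BARS_PER_DAY  # rolling OLS window, 2016 bars
    z_lookback: int = 2 * BARS_PER_DAY    # lagged z-score window, 576 bars
    entry_z: float = 2.0                  # enter when |z| exceeds this
    exit_z: float = 0.2                   # exit when |z| falls below this
    min_hold: int = 24                    # 2 hours
    max_hold: int = 3 * BARS_PER_DAY      # 3-day timeout, 864 bars
    cost: float = 0.0008                  # 8 bps round trip

PIPELINE = (
    "log prices",
    "rolling OLS hedge -> spread",
    "lagged z-score",
    "entry / exit with hold constraints",
    "PnL normalised by 1 + |beta|, net of costs",
)

cfg = PairsConfig()
for stage in PIPELINE:
    print(stage)
print(cfg)
```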

What Could Go Wrong

This methodology has clear failure modes:

Structural breaks: The correlation between two assets can break permanently (e.g., a protocol fork, regulatory event, or delisting). The spread diverges and never reverts. The 3-day timeout limits damage but doesn't prevent it entirely.

Crowding: If many traders run the same pairs strategy on the same pairs, the mean-reversion signal gets arbitraged away. This is less of a concern in crypto (the market is still fragmented) but worth monitoring.

Transaction costs: Each trade has a cost floor. Pairs with small average PnL per trade can be profitable in backtests with optimistic cost assumptions and unprofitable in reality.

Execution slippage: You need to fill both legs simultaneously. If one leg fills and the other doesn't (or fills at a worse price), you're running unhedged directional risk. More on this in Part 3.

The next post covers walk-forward validation: we select the best pairs from each period and test them on the next, answering whether the edge survives a strict forward test.