Pair Trading - II - Regression and Ornstein-Uhlenbeck process(O-U Model)

Posted by Chris IO on September 10, 2017

This is a continuation of the Pair Trading Series. For the first article which discusses how to find out potential trading pairs using clustering algorithms DBSCAN, you may go back to here.

In this article, I would focus on the analysis of trading pairs by modelling a mean-reverting spread. There are a few ways to define “spread” between prices, I would list a fews that I have come across:

1. Simple price differences between two products:

This only gives a sense of how two products move relative to one another. Practically, unless two products have similar price scales, the simple difference has not much meaning.

2. Percentage Change:

Normalize the simple price difference from a chosen point in time, then measure the spread as if a product itself. This has the advantage of comparison with previous data. The downside is, it is not representative of the pair relationship. For example, assume there are stock A valued at $1 and stock B valued at $10 at time t0. At time t1, both appreciated 10% so $1.1 and $11 for stock A and stock B respectively. However, the spread would appreciated 10% from $9 to $9.9 as well. Practically speaking, simple price difference/its standardized form both would not converge to a particular level.

3. Relative Percentage Change:

Using the previous example, an improved form would be to measure the percentage change of both products. So if both stock A and stock B move up 10%, their spread movement would be 0%. This is a more realistic approach to model spread. In pair trading, where we long one product and short the other, profit and loss would be dependent on this relative percentage change given the portfolio is constructed evenly among the two. For example, assume one million stock A is longed and one million stock B is short sold, stock A has a 10% gain while stock B has a 5% gain. Then the pnL would be 0.5x1.1+0.5x0.95 = +%2.5. So a spread of (5%) results in a +2.5% gain in the entire portfolio. From this example, the percentage spread is representative of gain from a vanilla long/short.

4. Price Ratio:

Directly divide the price of one stock by the other one. This measures the geometric change of both prices. This is the most common use of spread.

In this article, I would implement a model that assumes a mean-reverting behavior hidden in the price differential defined by (3). I took reference from the paper written by Y Chen, W Ren and X Lu from Stanford.

The paper illustrated two approaches to model spread, one adopts O-U model and regression and the other one implemented Kalman’s Filter and Expectataion Maximization. I would implement the first approach in this article. I would follow the paper step-by-step with comment on codes. At the end, a profit and loss path would be simulated to measure how the stock pair is performing.

In the last article, one of the stock pair discovered by the clustering algorithm DBSCAN was 0386.HK and 0857.HK. Let’s use this as an example for our analysis:

import os
import pandas as pd
target = ['0386.HK','0857.HK']
path = os.path.join(os.getcwd()+'/'+os.path.pardir,"part1/StockData/")
df = pd.DataFrame()
openprice = pd.DataFrame()
for i in target:
    data = pd.read_csv(path+'{}.csv'.format(i))
    a = data['Adj Close']
    b = data['Open']
    openprice = pd.concat([openprice,b.rename('{}'.format(i))],axis=1)
    df = pd.concat([df,a.rename('{}'.format(i))],axis=1)


             0386.HK   0857.HK
2015-09-04  4.421465  5.657228
2015-09-07  4.366312  5.618280
            0386.HK  0857.HK
2015-09-04     4.93     5.94
2015-09-07     4.72     5.75

Let’s get our dataset ready by reading in the stock data downloaded from Yahoo Finance. The first dataframe is closed price which would be used for signal generation, while the second is the opening price which would be used for profit and loss simulation. Recalled from last article that a crawler was writtern to get the information. You may download the data(As of 04-09-2017) from here.

import statsmodels.api as sm
def rollingreg(df,window=window,target=target):
    alpha,beta = [],[]
    for i in range(df.shape[0]-window+1):
        X = sm.add_constant(list(df['{}'.format(target[0])][i:i+window]))  ### target[0] is X
        Y = list(df['{}'.format(target[1])][i:i+window])
        model = sm.OLS(Y,X)
        result =
    alpha = pd.DataFrame(data={'alpha':alpha},index=df.index[window-1:])
    beta = pd.DataFrame(data={'beta':beta},index=df.index[window-1:])
    new_df = pd.concat([df,alpha,beta],axis=1)
    new_df['residual'] = new_df['{}'.format(target[1])] - new_df['alpha'] + new_df['beta']*new_df['{}'.format(target[0])]
    return new_df


0386.HK 0857.HK alpha beta residual
2015-11-27 4.470203 5.459994 NaN NaN NaN
2015-11-30 4.460812 5.430428 NaN NaN NaN
2015-12-01 4.582897 5.568406 1.546656 0.864888 7.985442
2015-12-02 4.601679 5.617684 1.451807 0.883489 8.231409
2015-12-03 4.526550 5.588117 1.352018 0.903162 8.324308

Here is a function that does rolling regression. Since pandas rolling regression function only returns the beta, I defined my own one that fully returns alpha, beta and the residual. Because I defined a rolling window of 60 days, the regression parameters only have values starting from the 60th row.

def step1(df=df):
    new_df = df.pct_change().iloc[1:]
    return rollingreg(new_df)
0386.HK 0857.HK alpha beta residual
2015-11-30 -0.002101 -0.005415 NaN NaN NaN
2015-12-01 0.027368 0.025408 NaN NaN NaN
2015-12-02 0.004098 0.008850 -0.000668 0.907991 0.013239
2015-12-03 -0.016326 -0.005263 -0.000581 0.905394 -0.019464
2015-12-04 -0.002075 -0.022928 -0.000675 0.922321 -0.024166

Here we first convert the stock-price series into its percentage change by normalizing against its price on the first row. From the regression parameters, we verify that alpha is very close to zero, safely to be ignored for now. Since the model expects a constant beta, stock pairs with an versatile beta within our targeted trading time span would impose a bigger risk on the trades. The paper written by Y Chen mentions that a stable rolling beta for 5 days would be a baseline signal. In reality, this would be case dependent.

import matplotlib.pyplot as plt
df = step1()


Unfortunately, the stock pair 0386.HK and 0857.HK does not demonstrate a stable beta in the past two years. especially in recent months. It oscillated from 0.7 to 1 with a sudden drop of 0.5 in the past six months. However, for demonstration purpose I would continue to use this as an example.

Now, after the first regression is done, the rolling sum of residual term would be used for another round of regression against itself with one day delay :

(R = a + b x R(delay=1) + error)

This is equivalent to autoregression with degree 1 / AR(1).

import numpy as np
def step2(df=df):
    df2 = step1(df)
    #df has alpha,beta and residual
    new_df = pd.DataFrame()
    new_df['sum_residual'] = df2['residual'].rolling(window=window).sum()
    new_df['sum_residual_delay'] = new_df['sum_residual'].shift(1) #regressor ~ X
    target = ['sum_residual_delay','sum_residual'] #[X,Y]
    new_df = rollingreg(new_df,target=target) #rerun rolling regression with updated alpha and beta and residual
    new_df['var'] = new_df['residual'].rolling(window=window).std(skipna=True)**2
    new_df['equsigma'] = new_df['var']/(1 - new_df['beta']**2)
    new_df['mu'] = new_df['alpha']/(1- new_df['beta'])
    #new_df['speed'] = -np.log(new_df['beta'])*252
    #new_df['std'] = np.sqrt(new_df['equsigma']*2*new_df['speed'])
    return new_df


sum_residual sum_residual_delay alpha beta residual var equsigma mu
2016-08-19 0.163828 0.242463 0.050941 0.721709 0.287875 NaN NaN 0.183050
2016-08-22 0.141429 0.163828 0.047130 0.741717 0.215813 NaN NaN 0.182475
2016-08-23 0.138764 0.141429 0.050200 0.717386 0.190023 0.014419 0.029708 0.177627
2016-08-24 0.130538 0.138764 0.046922 0.735626 0.185694 0.013847 0.030176 0.177485
2016-08-25 0.117359 0.130538 0.043429 0.757017 0.172749 0.013978 0.032742 0.178732

Parameters Explained:

sum_residual is the 60-day rolling sum of the residuals we get from the first regression of stock price(percentage change). It would be the Y in the coming regression.

Column sum_residual_delay is sum_residual with one-day delay. It would be the X in the coming regression.

Column alpha, beta and residual are the regression parameters resulted from regressing sum_residual on sum_residual_delay.

mu and equisignma are directly related to our signal generation:

mu: The mean of residual we would expect. equisignma: The standard deviation of mu we would expect —

Whenever the residual term exceeds the level of mu plus some degree of equisignma (z-score depending on the risk level), we can short sell the spread. Recall how regression is done :

Y = a + bx + error

Concretely, short selling the spread means short Y long X (Y-X+). On the other hand, when the residual term plunges through the level of mu minus some degree of equisignma, we long the spread, long Y short X (Y+X-).

Picking a z-score of 1, Let’s see how the pnl performs:

import matplotlib.pyplot as plt
def pnl(new_df):
    ###create a signal for long/short/hold
    def signal_gen(x):
        if x['residual'] - x['mu'] > x['equsigma']:
            return 1
        elif x['residual'] - x['mu'] < x['equsigma'] and x['residual'] - x['mu'] > -x['equsigma']:
            return 0
        elif x['residual'] - x['mu'] < -x['equsigma']:
            return -1
            return 2
    openprice['signal'] = new_df.apply(signal_gen,axis=1) #signal -1/0/1, shifted
    holding = 0  #current status
    pnL_histroy = []  ##pnl 
    entry = [] #store X,Y and their price ratio for contract entry
    shift = openprice.index[1:]  #for opening contract
    for i in range(1,new_df.shape[0]):
        if openprice.loc[shift[i-1],'signal'] == 1: #excessive spread
            if holding == 0:  #if not holding any contract,
                # because of excessive spread, short Y long X
                #entry [X,Y, Y/X]
                entry = [openprice.loc[shift[i],target[0]],openprice.loc[shift[i],target[1]],openprice.loc[shift[i],target[1]]/openprice.loc[shift[i],target[0]]] #target[0] is X,1 is Y
                holding = -1 #now hold a short contract
                pnL_histroy.append(0) #no pnl at this day
            elif holding == -1:  #if holding an existing short contract
                #update pnl
                pnl = (entry[0] - openprice.loc[shift[i],target[0]])*entry[2] + openprice.loc[shift[i],target[1]] - entry[1] #target[0] is X,1 is Y
                #update entry to price of current day, keep entry ratio
                entry[:2] = [openprice.loc[shift[i],target[0]],openprice.loc[shift[i],target[1]]] #target[0] is X,1 is Y
            elif holding == 1: #if holding an existing long contract
                #update pnl
                pnl = openprice.loc[shift[i],target[1]] - entry[1] + (entry[0] - openprice.loc[shift[i],target[0]])*entry[2] #target[0] is X,1 is Y
                ##now open a short contract
                holding = -1
                entry = [openprice.loc[shift[i],target[0]],openprice.loc[shift[i],target[1]],openprice.loc[shift[i],target[1]]/openprice.loc[shift[i],target[0]]] #target[0] is X,1 is Y
                print('error for holding')
        elif openprice.loc[shift[i-1],'signal'] == -1: 
            if holding == -1: 
                pnl = (entry[0] - openprice.loc[shift[i],target[0]])*entry[2] + openprice.loc[shift[i],target[1]] - entry[1] #target[0] is X,1 is Y
                holding = 1
                entry = [openprice.loc[shift[i],target[0]],openprice.loc[shift[i],target[1]],openprice.loc[shift[i],target[1]]/openprice.loc[shift[i],target[0]]] #target[0] is X,1 is Y
            elif holding == 0:
                holding = 1
                entry = [openprice.loc[shift[i],target[0]],openprice.loc[shift[i],target[1]],openprice.loc[shift[i],target[1]]/openprice.loc[shift[i],target[0]]] #target[0] is X,1 is Y
            elif holding == 1:
                pnl = openprice.loc[shift[i],target[1]] - entry[1] + (entry[0] - openprice.loc[shift[i],target[0]])*entry[2] #target[0] is X,1 is Y
                entry[:2] = [openprice.loc[shift[i],target[0]],openprice.loc[shift[i],target[1]]] #target[0] is X,1 is Y
                print('error for holding')
        elif openprice.loc[shift[i-1],'signal'] == 0:
            if holding == 0:
            elif holding == 1:
                pnl = openprice.loc[shift[i],target[1]] - entry[1] + (entry[0] - openprice.loc[shift[i],target[0]])*entry[2] #target[0] is X,1 is Y
                holding = 0
            elif holding == -1:
                pnl = (entry[0] - openprice.loc[shift[i],target[0]])*entry[2] + openprice.loc[shift[i],target[1]] - entry[1] #target[0] is X,1 is Y
                holding = 0
                print('error for holding')
        else: ###sginal 2 or na are not in the timeframe of analysis, ignore

    y = shift.to_datetime()[-len(pnL_histroy):]
    pnl, = plt.plot(y,pnL_histroy,label='pnl')
    cumsum, = plt.plot(y,np.cumsum(pnL_histroy),label='cumulative pnl')
    result = {'profit':sum(pnL_histroy),
    'win ratio':sum([1 for i in pnL_histroy if i>0])/(sum([1 for i in pnL_histroy if i<0])+sum([1 for i in pnL_histroy if i>0])),
    'maximum day gain':max(pnL_histroy),'maximum day loss':min(pnL_histroy)}
    return result


{'maximum day gain': 0.29080,
 'maximum day loss': -0.19394,
 'profit': 0.03968,
 'win ratio': 0.4350}

After the PnL simulation, the pair performs quite poorly for trading. It echoes the fact that the pair does not possess a stable beta.