Robust Standard Errors in Python

Plan for the Day

In this post, I will consolidate a 69-page paper on robust standard errors into a cheatsheet. The paper, "Estimating Standard Errors in Finance Panel Data Sets: Comparing Approaches," published by Professor Mitchell Petersen in 2009, has accrued more than 7,879 citations as of today. It remains the bible for choosing the correct robust standard error.

The Problem

Usual Practice

In any Stats 101 class, your professor might have taught you to type "reg Y X" in Stata (or lm(Y ~ X) in R):
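
In Python, the same habit looks like this (a minimal sketch with statsmodels; the data below is a synthetic placeholder for your own y and X):

import numpy as np
import statsmodels.api as sm

# Synthetic placeholder data; substitute your own y and X
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -0.5]) + rng.normal(size=100)

naive = sm.OLS(y, sm.add_constant(X)).fit()  # default (i.i.d.) standard errors
print(naive.summary())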

You proceed to test your hypothesis with the reported point estimate and standard error. But 99% of the time, this would be wrong.

The Trap

For the default OLS standard errors to be valid (on top of unbiased and consistent point estimates), we need the error term epsilon to be independently and identically distributed:

Independent means that no serial or cross-correlations are permitted:

  • Serial-correlation: for the same individual, residuals across different time periods are correlated;
  • Cross-correlation: different individual residuals are correlated, within and/or across periods.

Identical means that all the residuals have the same variance (a.k.a. homoscedasticity).

Visualizing the Problem

Let’s visualize the i.i.d. assumption as a restriction on the variance-covariance matrix of the stacked residuals; a compact algebraic statement follows the list below.

  • No serial correlation: within each individual’s block (the red bubbles), all off-diagonal entries must be 0;
  • No cross-correlation: all entries outside the individual blocks (the green rectangles) must be 0;
  • Homoscedasticity: all diagonal entries must equal the same constant.
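
A compact way to state all three bullets at once (a sketch, stacking the NT residuals and writing \Omega for their variance-covariance matrix):

\Omega = E[\varepsilon \varepsilon'] = \sigma^2 I_{NT}
\quad\Longleftrightarrow\quad
\operatorname{Cov}(\varepsilon_{it}, \varepsilon_{js}) =
\begin{cases} \sigma^2 & \text{if } i = j \text{ and } t = s, \\ 0 & \text{otherwise.} \end{cases}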

What’s Wrong If You Use the Default SE Without I.I.D. Errors?

Deriving the SE expression:
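
A sketch of the standard sandwich derivation, writing \Omega = E[\varepsilon\varepsilon' \mid X] for the variance-covariance matrix of the residuals and \hat\beta = (X'X)^{-1}X'Y for the OLS estimator:

\operatorname{Var}(\hat\beta \mid X) = (X'X)^{-1} X' \Omega X (X'X)^{-1}
  = (X'X)^{-1} \Bigl( \sum_{i,t} \sigma_{it}^2 \, x_{it} x_{it}' \Bigr) (X'X)^{-1}
  = \sigma^2 (X'X)^{-1}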

The default standard error is the last line of the derivation above. But to get from the first line to the last, we need two extra assumptions:

  • We need the independence assumption to move from the first line to the second: visually, all entries in the green rectangles AND all off-diagonal entries in the red bubbles must be 0.
  • We need the identically distributed assumption to move from the second line to the third: visually, all the diagonal entries must be exactly the same.

Default SE is right under VERY limited circumstances!

The Price of Being Wrong

We don’t know whether the reported SE over- or underestimates the true SE. Thus, we might end up with:

  • A statistically significant result when there’s no effect in reality. As a result, the software and product teams might spend hours on a prototype that has no effect whatsoever on the company’s bottom line.
  • Statistically insignificant result, when there’s a significant effect in reality. This could have been a break for you. Missed opportunity. Too bad 🙁

In reality, the false positive is the more likely outcome. There’s no shortage of newbie machine learning students proclaiming that they’ve found some pattern or signal to beat the market. Yet once deployed, their models perform disastrously. Part of the reason is that they’ve never thought about serial or cross-correlation in the residuals.

When this happens, the default standard error can be 11 times smaller than the true standard error — leading to a gross over-estimation of their signal’s statistical significance.

Robust Standard Error To The Rescue!

A correctly specified robust standard error removes the bias, or at least ameliorates it. Armed with a robust standard error, you can then safely proceed to the inference stage.

There are many robust standard errors out there. Picking the wrong remedy might exacerbate the problem!

Which Robust Standard Error Should I Use?

It depends on the variance-covariance structure. Ask yourself: do your residuals suffer from cross-correlation, serial correlation, or both? Recall that:

  • Cross correlation: within the same time period, different individual residuals might be correlated;
  • Serial correlation: for the same individual, residuals for different time periods might be correlated.

Case 1: The Error Term Has An Individual Specific Component

Suppose this is the true state of the world: each residual contains an individual (firm) specific component that persists across time periods.

Assuming independence across individuals, the correct standard error picks up an extra term relative to the default OLS variance derived earlier. Whether the reported OLS standard error over- or underestimates the true standard error depends on the sign of the correlation coefficients, and the gap is magnified by the number of time periods T.
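
Following Petersen's one-regressor, balanced-panel illustration (a sketch; \rho_X denotes the within-firm correlation of the regressor, \rho_\varepsilon that of the residual, and T the number of periods per firm):

\varepsilon_{it} = \gamma_i + \eta_{it},
\qquad
\operatorname{Var}_{\text{true}}(\hat\beta) \approx \operatorname{Var}_{\text{OLS}}(\hat\beta)\,\bigl[ 1 + \rho_X\,\rho_\varepsilon\,(T - 1) \bigr]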

Where’s the Practical Guide?

Based on further theory and simulation results, Petersen shows that:

You shouldn’t use:

  • Fama-MacBeth standard errors: they are designed to deal with cross-sectional correlation across firms within a period (a time effect), not with the serial correlation induced by a persistent firm effect.
  • Newey-West standard errors: they are designed to account for serial correlation of unknown form in the residuals of a single time series.

You should use:

  • Clustered standard errors: specifically, you should cluster your standard errors on firms. Refer to the end of the blog post for the code (a Python counterpart is sketched there as well).

Case 2: The Error Term Has a Time Specific Component

Suppose this is the true state of the world: the residual contains a component common to all individuals in the same time period, ε_it = δ_t + η_it.

The correct standard error has essentially the same form as in Case 1, with the roles of N and T exchanged.

You should use:

  • Fama-MacBeth standard errors: correcting for cross-sectional correlation is exactly what the procedure is constructed to do. Refer to the end of the blog post for the Stata code.

Case 3: The Error Term Has Both a Firm and a Time Effect

Suppose this is the true state of the world: the residual contains both a firm component and a time component, ε_it = γ_i + δ_t + η_it.

You should use:

  • Clustered standard errors: the clustering should be done on two dimensions, firm and year. Note that these are not the true standard errors; they are simply less biased, and the remaining bias is more pronounced when there are only a few clusters along one of the dimensions. A sketch of the construction follows.
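
A sketch of how the two-way clustered variance is typically assembled from one-way pieces (the Cameron-Gelbach-Miller / Thompson construction):

\widehat{V}_{\text{firm,year}} = \widehat{V}_{\text{firm}} + \widehat{V}_{\text{year}} - \widehat{V}_{\text{firm}\times\text{year}}

where the last term clusters on the firm-year intersection; with one observation per firm-year, it reduces to the White heteroscedasticity-robust estimate.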

Codes

Petersen’s detailed Stata, R, and SAS instructions and test data can be found on his website. For my own record, I’m compiling the Stata commands here, with a Python counterpart sketched after each one:

Case 1: Clustering on 1 dimension

regress dependent_variable independent_variables, robust cluster(cluster_variable)
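
A Python counterpart (a sketch; statsmodels' cov_type="cluster" computes one-way clustered standard errors, and the tiny synthetic panel is only there to make the snippet self-contained):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Tiny synthetic panel: 50 firms observed over 10 years (illustrative only)
rng = np.random.default_rng(0)
n_firms, n_years = 50, 10
df = pd.DataFrame({
    "firm_id": np.repeat(np.arange(n_firms), n_years),
    "year": np.tile(np.arange(n_years), n_firms),
})
firm_effect = rng.normal(size=n_firms)[df["firm_id"]]
df["x"] = rng.normal(size=len(df))
df["y"] = 1.0 + 0.5 * df["x"] + firm_effect + rng.normal(size=len(df))

# Case 1: one-way clustering on the firm identifier
fit_cluster = smf.ols("y ~ x", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["firm_id"]}
)
print(fit_cluster.bse)  # clustered standard errors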

Case 2: Fama-Macbeth

tsset firm_identifier time_identifier

fm dependent_variable independent_variables, byfm(by_variable)
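
A Python sketch of the Fama-MacBeth procedure itself, reusing the synthetic panel df from the Case 1 sketch above: run a cross-sectional regression in each period, then use the time series of period-by-period estimates for inference.

import numpy as np
import statsmodels.formula.api as smf

# Period-by-period cross-sectional regressions
period_betas = np.array([
    smf.ols("y ~ x", data=group).fit().params["x"]
    for _, group in df.groupby("year")
])

# Fama-MacBeth point estimate and standard error
fm_beta = period_betas.mean()
fm_se = period_betas.std(ddof=1) / np.sqrt(len(period_betas))
print(fm_beta, fm_se)

The linearmodels package also ships a FamaMacBeth estimator for panel DataFrames, if you prefer not to roll your own.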

Case 3: Clustering on 2 dimensions

cluster2 dependent_variable independent_variables, fcluster(cluster_variable_one) tcluster(cluster_variable_two)
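
A Python sketch of two-way clustering, reusing the synthetic df from above. It assumes that your installed statsmodels version accepts a two-column group array for cov_type="cluster" (which requests two-way clustering); if it does not, the linearmodels package offers clustered covariances for panel models as an alternative.

import numpy as np
import statsmodels.formula.api as smf

# Case 3: two-way clustering; groups passed as an (n, 2) array of firm and year ids
groups = np.column_stack([df["firm_id"], df["year"]])
fit_twoway = smf.ols("y ~ x", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": groups}
)
print(fit_twoway.bse)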

Case 4: Fixed Effect + Clustering

xtreg dependent_variable independent_variables, robust cluster(cluster_variable_one)
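
A Python sketch of firm fixed effects with firm-clustered standard errors, reusing the synthetic df from above; the fixed effects are absorbed with C(firm_id) dummies, which is fine for a modest number of firms.

import statsmodels.formula.api as smf

# Case 4: firm fixed effects (dummies) + clustering on firm
fit_fe = smf.ols("y ~ x + C(firm_id)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["firm_id"]}
)
print(fit_fe.bse["x"])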

Enjoy your newly found robust world!

Python Tutorial: How to Use RLM with HAC Standard Errors for Robust Regression

Robust regression is a method used to estimate the relationship between an explanatory variable and a response variable. It is used when the data is affected by outliers or influential observations that can distort the results obtained from ordinary least squares regression. One popular technique to carry out robust regression in Python is to use the robust linear model (RLM) module from the statsmodels library.

The RLM module can be combined with heteroscedasticity- and autocorrelation-consistent (HAC) standard errors. HAC standard errors are robust to autocorrelation in the residuals, while the heteroscedasticity adjustment handles error variances that do not stay constant across the data.
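
Concretely, for a single series of T observations the HAC (Newey-West) estimator is a sandwich estimator with Bartlett-kernel weights; a sketch, writing e_t for the residuals, x_t for the row of regressors at time t, and L for the maximum lag:

\widehat{V}_{\text{HAC}} = (X'X)^{-1}\,\widehat{S}\,(X'X)^{-1},
\qquad
\widehat{S} = \sum_{t=1}^{T} e_t^2\, x_t x_t'
  + \sum_{\ell=1}^{L} \Bigl(1 - \tfrac{\ell}{L+1}\Bigr)
    \sum_{t=\ell+1}^{T} e_t e_{t-\ell}\,\bigl(x_t x_{t-\ell}' + x_{t-\ell} x_t'\bigr)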

Below are the steps for using RLM with HAC standard errors to perform robust regression.

The statsmodels library is required for the implementation of RLM. If it is not already available in your environment, it can be installed with the following command:

pip install statsmodels

Load the dataset to be analyzed into your Python environment. It is essential to understand the nature of the data and the variables involved in order to determine the best method of analysis.

To use RLM with HAC standard errors, the following modules must be imported:

import pandas as pd
import statsmodels.api as sm

Before fitting the model, the variables must be specified. To do this, create an RLM instance by passing in the response variable and the explanatory variables. For example, if the response variable is y and the explanatory variables are x1, x2, and x3 (columns of a pandas DataFrame df), the following code creates the RLM instance:

X = df[["x1", "x2", "x3"]]
X = sm.add_constant(X)  # add an intercept term
y = df["y"]
rlm_model = sm.RLM(y, X)

To fit the model, use the fit() method of the RLM object. For example, if the RLM object is named rlm_model, the following code fits the model:

# The maxlags value below is illustrative; choose it based on the persistence
# of your residuals. Check that your statsmodels version accepts the
# cov_type/cov_kwds keywords for RLM.
results = rlm_model.fit(cov_type="HAC", cov_kwds={"maxlags": 1})

The cov_type argument requests HAC standard errors, while maxlags sets the maximum number of lags to consider. The fitted results carry the regression coefficients, standard errors, t-statistics, and p-values.

Once the model is fit, it is important to analyze the results in order to determine the significance of the variables and the overall fit of the model. The summary() function can be used to print out a summary of the regression results, including the coefficient estimates, corresponding standard errors, and p-values.
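
Putting the steps together, here is a minimal, self-contained sketch on synthetic data. It uses statsmodels' OLS, where the cov_type="HAC" / cov_kwds interface is documented, to show the same mechanics end to end; the variable names and the maxlags choice are purely illustrative.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic data with AR(1) errors (illustrative only)
rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.5 * e[t - 1] + rng.normal()
y = 2.0 + 1.5 * x + e

X = sm.add_constant(pd.DataFrame({"x": x}))
fit_hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print(fit_hac.summary())  # coefficients with HAC (Newey-West) standard errors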

This guide provides a simple step-by-step process for implementing RLM with HAC standard errors in Python. The use of robust regression methods such as this can be essential in data analysis whenever the data is non-normal or has outliers that may distort the results. With the flexibility and ease of use of Python, it is easy to implement these methods in any data analysis project.
