Residuals vs. fitted plot in Python

Python linear regression diagnostic plots similar to R

I’m trying to get diagnostic plots for a linear regression in Python, and I was wondering if there’s a quick way to do this. In R, you can simply call plot() on a fitted lm model, which gives you a residuals vs. fitted plot, a normal Q-Q plot, a scale-location plot, and a residuals vs. leverage plot.

Is there a quick way to do this in Python? There’s a great blog post describing how to reproduce the same plots as R with Python code, but it requires quite a bit of code (compared to the R approach, at least). Link: https://underthecurve.github.io/jekyll/update/2016/07/01/one-regression-six-ways.html#Python

You can create a function / module, import it, and then use a one-liner like my_plot(formula, data). This is what R does under the hood as well. Some R code that might (not sure, sorry) be the source of plot: github.com/SurajGupta/r-source/blob/master/src/library/stats/R/…
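As a rough illustration of that comment, a wrapper along these lines could fit the model from an R-style formula and draw a quick residual plot. This is only a sketch; my_plot is a made-up name, not an existing API:

# Minimal sketch of the "one-liner wrapper" idea from the comment above.
# my_plot() is a hypothetical name, not part of any library.
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt

def my_plot(formula: str, data):
    """Fit an OLS model from an R-style formula and draw a residuals-vs-fitted plot."""
    model = smf.ols(formula, data=data).fit()      # e.g. formula='y ~ x'
    plt.scatter(model.fittedvalues, model.resid)   # residuals vs. fitted values
    plt.axhline(0, color='grey', linestyle='dashed')
    plt.xlabel('Fitted values')
    plt.ylabel('Residuals')
    plt.title('Residuals vs Fitted')
    plt.show()
    return model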

1 Answer

I prefer storing everything in pandas and plotting with DataFrame.plot() whenever possible:

from matplotlib import pyplot as plt
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
from pandas import DataFrame


def linear_regression(df: DataFrame) -> DataFrame:
    """Perform a univariate regression and store results in a new data frame.

    Args:
        df (DataFrame): original data set with x and y.

    Returns:
        DataFrame: another dataframe with raw data and results.
    """
    # Note: unlike R's lm(), sm.OLS does not add an intercept automatically;
    # wrap the exog in sm.add_constant() if you want one.
    mod = sm.OLS(endog=df['y'], exog=df['x']).fit()
    influence = mod.get_influence()

    res = df.copy()
    res['resid'] = mod.resid
    res['fittedvalues'] = mod.fittedvalues
    res['resid_std'] = mod.resid_pearson
    res['sqrt_abs_resid_std'] = np.sqrt(np.abs(mod.resid_pearson))
    res['leverage'] = influence.hat_matrix_diag
    return res


def plot_diagnosis(df: DataFrame):
    plt.style.use('seaborn')  # on newer Matplotlib this style is named 'seaborn-v0_8'
    fig, axes = plt.subplots(nrows=2, ncols=2)

    # Residuals against fitted values.
    df.plot.scatter(x='fittedvalues', y='resid', ax=axes[0, 0])
    axes[0, 0].axhline(y=0, color='grey', linestyle='dashed')
    axes[0, 0].set_xlabel('Fitted values')
    axes[0, 0].set_ylabel('Residuals')
    axes[0, 0].set_title('Residuals vs Fitted')

    # Normal Q-Q plot.
    sm.qqplot(df['resid'], dist=stats.t, fit=True, line='45',
              ax=axes[0, 1], c='#4C72B0')
    axes[0, 1].set_title('Normal Q-Q')

    # Scale-location plot: sqrt(|standardized residuals|) against fitted values.
    df.plot.scatter(x='fittedvalues', y='sqrt_abs_resid_std', ax=axes[1, 0])
    axes[1, 0].set_xlabel('Fitted values')
    axes[1, 0].set_ylabel('Sqrt(|standardized residuals|)')
    axes[1, 0].set_title('Scale-Location')

    # Standardized residuals against leverage.
    df.plot.scatter(x='leverage', y='resid_std', ax=axes[1, 1])
    axes[1, 1].axhline(y=0, color='grey', linestyle='dashed')
    axes[1, 1].set_xlabel('Leverage')
    axes[1, 1].set_ylabel('Standardized residuals')
    axes[1, 1].set_title('Residuals vs Leverage')

    plt.tight_layout()
    plt.show()
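For completeness, here is a small usage sketch. The data frame below is synthetic and purely illustrative:

# Usage sketch: synthetic x/y data, purely for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + rng.normal(scale=1.5, size=100)
data = pd.DataFrame({'x': x, 'y': y})

results = linear_regression(data)  # adds resid, fittedvalues, resid_std, leverage, ...
plot_diagnosis(results)            # draws the four diagnostic panels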

There are still many features missing, but it provides a good start. I learnt how to extract the influence statistics from this question: Access standardized residuals, cook’s values, hatvalues (leverage) etc. easily in Python?
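In short, those statistics come from the OLSInfluence object that statsmodels returns from get_influence(). A minimal sketch, assuming a fitted OLS results object named mod:

# Pulling influence statistics from a fitted statsmodels OLS model `mod`.
influence = mod.get_influence()
leverage = influence.hat_matrix_diag                    # hat values (leverage)
cooks_d, cooks_p = influence.cooks_distance             # Cook's distances and p-values
student_resid = influence.resid_studentized_internal    # internally studentized residuals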


[Figure: the resulting 2x2 grid of diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage]

By the way, there is a package, dynobo/lmdiag, that provides all of these features.
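If I remember its README correctly, usage is roughly as follows; treat the exact function names as an assumption and check the project’s documentation:

# Rough sketch from memory of the lmdiag README; verify against the project docs.
# Reuses the synthetic `data` frame from the usage sketch above.
import lmdiag
import statsmodels.api as sm

mod = sm.OLS(data['y'], sm.add_constant(data['x'])).fit()
lmdiag.plot(mod)   # draws the R-style diagnostic plots in one call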

