Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PanelOLS: Produced different Std. errors from Stata when clustered by the same variable in linearmodels #477

Open
yuz0101 opened this issue Oct 12, 2022 · 15 comments

Comments

@yuz0101
Copy link

yuz0101 commented Oct 12, 2022

Hi there, it seems that we can only use one cov_type. While comparing results based on a model clustered by a variable between linearmodels and Stata, I found that Stata's std. errors are robust adjusted while linearmodels are not (All coefficients are the same).

Please may I know how I can get the same results as Stata by using linearmodels?

@bashtage
Copy link
Owner

@yuz0101
Copy link
Author

yuz0101 commented Oct 12, 2022

Hi bashtage, thanks for your quick reply. I follow that. For example,

mod = PanelOLS.from_formula(formula, data) reg = mod.fit(cov_type='clustered', clusters=data['var'])

The standard errors from the reg are not robust adjusted.

If I change it to reg = mod.fit(cov_type='robust'), then results are not based on the clustered by var one.

Could you please help me with it? How can I get both robust- and clustered-results? I appreciate your time.

@bashtage
Copy link
Owner

What is formula? The definition of robust depends on whether entity effects are included. Clustered std errors are robust to heteroskedasticity.

@yuz0101
Copy link
Author

yuz0101 commented Oct 12, 2022

Thanks for that. y ~ 1 + a*b*c + a + b + c is the formula I'm using. There is no entity effects in my regression model. : )

@bashtage
Copy link
Owner

What Stata command are you using?

@yuz0101
Copy link
Author

yuz0101 commented Oct 12, 2022

gen abc = a*b*c
reg y abc a b c, vce (cluster var)

@bashtage
Copy link
Owner

What you have looks correct to me

mod = PanelOLS.from_formula(formula, data)
reg = mod.fit(cov_type='clustered', clusters=data['var'])

You could also use

mod = PooledOLS.from_formula(formula, data)
reg = mod.fit(cov_type='clustered', clusters=data['var'])

and the results should be the same.

@yuz0101
Copy link
Author

yuz0101 commented Oct 12, 2022

Thank you Kevin, for all your efforts.

I still got different results. I noticed that the std. error from Stata is robust (pls have a look at the scrnshot below). Sorry for the mess data shown below, they are shown in different orders. Would you please look at the last line of pic 1 (black background) and the first line of pic 2 (white background). They are the same variable.

The results from linearmodels:
image

The results from Stata:
image

@bashtage
Copy link
Owner

What happends when you take the ratio of the parameter variance from stata to that from linear models? Stata has a log of magic small sample adjustments it makes. If this ratio is the same for all parameters, this indicates that it is a scalar adjustment.

I did a quick check and this looks to be the issue. What does changing the value of debiased do?

@yuz0101
Copy link
Author

yuz0101 commented Oct 12, 2022

I just did a check, and most of the std. errors from Sata is 70% of the std. errors from linearmodels. For the debiased, I did the check and found nothing affecting the results.

@yuz0101 yuz0101 changed the title PanelOLS: How can i get results based on both the heteroskedasity adjusted standard errors and clustered model PanelOLS: Produced different Std. errors from Stata when clustered by the same variable in linearmodels Oct 12, 2022
@bashtage
Copy link
Owner

bashtage commented Oct 12, 2022 via email

@yuz0101
Copy link
Author

yuz0101 commented Oct 13, 2022

Hi Kevin, you are right. Massive thanks for your time. The std, errors in my model are clustered by only one dummy variable. However, I tried that with xtreg and found that it produced errors in xtreg, mainly due to the usage of clusters. Then I tried to cluster the id, which completely worked in both linearmodels and Stata (using a command of reg), and they produced the same results. However, the same setting with xtreg in Stata produces different results from using reg and using linearmodels.

Q: Is it reasonable to cluster the std. errors with a dummy variable rather than a variable consisting of a number of clusters?

@vgreg
Copy link

vgreg commented Jan 19, 2024

Hi Kevin, adding my issue here because it is most likely related. By the way, thanks for all your work on linearmodels (and the awesome documentation!), it was the missing piece to my Python workflow.

I'm preparing notes for my students in which I was hoping to replicate Mitch Petersen's results using his sample dataset, which he obtained with Stata using the following command:
regress dependent_variable independent_variables, robust cluster(cluster_variable)

For reference, his Stata code is here, and the sample data and associated estimations are here.

His sample dataset contains 500 firms but only 10 years, so that dimension is small. I'm able to replicate his results using statsmodels, but not with linearmodels. More specifically, I get the same results when clustering by firm, but not when clustering by year or by both firm and year. Are you aware of a small sample adjustment that is implemented in statsmodels but not in linearmodels? debiased does not affect my results.

Here is the code to replicate the difference:

import pandas as pd
import statsmodels.formula.api as smf
from linearmodels import PanelOLS

df = pd.read_table(
    "http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.txt",
    names=["firmid", "year", "x", "y"],
    delim_whitespace=True,
)

smf.ols(formula="y ~ x", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["year"]}, use_t=True
).summary()

With statsmodels:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.208
Model:                            OLS   Adj. R-squared:                  0.208
Method:                 Least Squares   F-statistic:                     960.6
Date:                Fri, 19 Jan 2024   Prob (F-statistic):           1.86e-10
Time:                        06:51:42   Log-Likelihood:                -10573.
No. Observations:                5000   AIC:                         2.115e+04
Df Residuals:                    4998   BIC:                         2.116e+04
Df Model:                           1                                         
Covariance Type:              cluster                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0297      0.023      1.269      0.236      -0.023       0.083
x              1.0348      0.033     30.993      0.000       0.959       1.110
==============================================================================
Omnibus:                        4.912   Durbin-Watson:                   1.096
Prob(Omnibus):                  0.086   Jarque-Bera (JB):                4.862
Skew:                           0.070   Prob(JB):                       0.0880
Kurtosis:                       3.063   Cond. No.                         1.01
==============================================================================

Notes:
[1] Standard Errors are robust to cluster correlation (cluster)
df = df.set_index(["firmid", "year"])
mod = PanelOLS.from_formula("y ~ 1 + x", df)
res = mod.fit(cov_type="clustered", cluster_entity=False, cluster_time=True)
res.summary

With linearmodels:

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                      y   R-squared:                        0.2078
Estimator:                   PanelOLS   R-squared (Between):              0.2208
No. Observations:                5000   R-squared (Within):               0.1907
Date:                Fri, Jan 19 2024   R-squared (Overall):              0.2078
Time:                        06:51:42   Log-likelihood                -1.057e+04
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      1310.7
Entities:                         500   P-value                           0.0000
Avg Obs:                      10.0000   Distribution:                  F(1,4998)
Min Obs:                      10.0000                                           
Max Obs:                      10.0000   F-statistic (robust):             1067.1
                                        P-value                           0.0000
Time periods:                      10   Distribution:                  F(1,4998)
Avg Obs:                       500.00                                           
Min Obs:                       500.00                                           
Max Obs:                       500.00                                           
                                                                                
                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Intercept      0.0297     0.0222     1.3376     0.1811     -0.0138      0.0732
x              1.0348     0.0317     32.667     0.0000      0.9727      1.0969
==============================================================================

@bashtage
Copy link
Owner

You probably need group_debias=True.

mod = PanelOLS.from_formula("y ~ 1 + x", df)
res = mod.fit(cov_type="clustered", cluster_entity=False, cluster_time=True, group_debias=True)
res.summary
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                      y   R-squared:                        0.2078
Estimator:                   PanelOLS   R-squared (Between):              0.2208
No. Observations:                5000   R-squared (Within):               0.1907
Date:                Fri, Jan 19 2024   R-squared (Overall):              0.2078
Time:                        12:00:52   Log-likelihood                -1.057e+04
Cov. Estimator:             Clustered
                                        F-statistic:                      1310.7
Entities:                         500   P-value                           0.0000
Avg Obs:                      10.0000   Distribution:                  F(1,4998)
Min Obs:                      10.0000
Max Obs:                      10.0000   F-statistic (robust):             960.59
                                        P-value                           0.0000
Time periods:                      10   Distribution:                  F(1,4998)
Avg Obs:                       500.00
Min Obs:                       500.00
Max Obs:                       500.00

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Intercept      0.0297     0.0234     1.2691     0.2045     -0.0162      0.0755
x              1.0348     0.0334     30.993     0.0000      0.9694      1.1003
==============================================================================

@vgreg
Copy link

vgreg commented Jan 19, 2024

Thanks, that's what I was missing! I get the same results now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants