Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Implement 2+ fixed-effect with multi-way clustering OLS #551

Open
EBoisseauSierra opened this issue Sep 11, 2023 · 0 comments
Open

Comments

@EBoisseauSierra
Copy link

EBoisseauSierra commented Sep 11, 2023

Context

I would effectively like to replicate Stata's reghdfe command, which is based on Correia (2016) estimator.

This estimator allows (fast) converging with greater-than-2 levels of fixed-effect, as well as multi-way clustering.

Expected solution

I am unsure how this would look like — either a new IV estimation module, or somehow piggybacking on AbsorbingLS…

Alternatives

I can achieve most of it via a naive statsmodels implementation, as shown in the dummy example below:

  • Stata 17
webuse nlswork
reghdfe ln_wage hours tenure race, absorb(age union) vce(cluster ind_code)
stdout
(dropped 1 singleton observations)
(MWFE estimator converged in 3 iterations)

HDFE Linear regression                            Number of obs   =     18,890
Absorbing 2 HDFE groups                           F(   3,     11) =      46.81
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.1766
                                                  Adj R-squared   =     0.1752
                                                  Within R-sq.    =     0.1083
Number of clusters (ind_code) =         12        Root MSE        =     0.4240

                              (Std. err. adjusted for 12 clusters in ind_code)
------------------------------------------------------------------------------
             |               Robust
     ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       hours |   .0028301   .0015474     1.83   0.095    -.0005757    .0062359
      tenure |   .0327574   .0034951     9.37   0.000     .0250647      .04045
        race |  -.1454669   .0250923    -5.80   0.000    -.2006945   -.0902392
       _cons |   1.712627   .1019591    16.80   0.000     1.488217    1.937038
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
         age |        30           0          30     |
       union |         2           1           1     |
-----------------------------------------------------+
  • Python (3.11.4, numpy 1.25.2, statsmodels 0.14.0):
import numpy as np
import statsmodels.datasets as sd
import statsmodels.formula.api as sm

df = sd.webuse("nlswork")
dataset = df.dropna(
    subset=[
        "ln_wage",
        "hours",
        "tenure",
        "race",
        "age",
        "union",
        "ind_code",
    ],
    how="any",
)

model = sm.ols(
    formula="ln_wage ~ hours + tenure + race + C(age) + C(union)",
    data=dataset,
)
results = model.fit(
    cov_type="cluster",
    cov_kwds={"groups": np.array(dataset["ind_code"])},
    use_t=True,
)
results.summary().as_text()
stdout
OLS Regression Results                            
==============================================================================
Dep. Variable:                ln_wage   R-squared:                       0.177
Model:                            OLS   Adj. R-squared:                  0.175
Method:                 Least Squares   F-statistic:                 2.026e+11
Date:                Fri, 08 Sep 2023   Prob (F-statistic):           5.39e-61
Time:                        11:06:12   Log-Likelihood:                -10579.
No. Observations:               18891   AIC:                         2.123e+04
Df Residuals:                   18856   BIC:                         2.150e+04
Df Model:                          34                                         
Covariance Type:              cluster                                         
===================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           1.1509      0.035     32.655      0.000       1.073       1.228
C(age)[T.17.0]      0.2170      0.168      1.295      0.222      -0.152       0.586
C(age)[T.18.0]      0.2016      0.087      2.312      0.041       0.010       0.393
C(age)[T.19.0]      0.2886      0.081      3.581      0.004       0.111       0.466
C(age)[T.20.0]      0.3644      0.073      4.965      0.000       0.203       0.526
C(age)[T.21.0]      0.4269      0.080      5.361      0.000       0.252       0.602
C(age)[T.22.0]      0.4735      0.089      5.312      0.000       0.277       0.670
C(age)[T.23.0]      0.4947      0.087      5.685      0.000       0.303       0.686
C(age)[T.24.0]      0.4986      0.087      5.709      0.000       0.306       0.691
C(age)[T.25.0]      0.4958      0.088      5.662      0.000       0.303       0.688
C(age)[T.26.0]      0.5486      0.090      6.092      0.000       0.350       0.747
C(age)[T.27.0]      0.5226      0.076      6.872      0.000       0.355       0.690
C(age)[T.28.0]      0.5165      0.080      6.463      0.000       0.341       0.692
C(age)[T.29.0]      0.5147      0.087      5.946      0.000       0.324       0.705
C(age)[T.30.0]      0.5126      0.084      6.131      0.000       0.329       0.697
C(age)[T.31.0]      0.5692      0.075      7.635      0.000       0.405       0.733
C(age)[T.32.0]      0.5452      0.083      6.535      0.000       0.362       0.729
C(age)[T.33.0]      0.5457      0.075      7.245      0.000       0.380       0.712
C(age)[T.34.0]      0.5255      0.073      7.227      0.000       0.365       0.685
C(age)[T.35.0]      0.5492      0.074      7.463      0.000       0.387       0.711
C(age)[T.36.0]      0.5488      0.081      6.775      0.000       0.371       0.727
C(age)[T.37.0]      0.5506      0.077      7.184      0.000       0.382       0.719
C(age)[T.38.0]      0.5283      0.079      6.705      0.000       0.355       0.702
C(age)[T.39.0]      0.5477      0.083      6.625      0.000       0.366       0.730
C(age)[T.40.0]      0.5431      0.088      6.147      0.000       0.349       0.738
C(age)[T.41.0]      0.5340      0.063      8.520      0.000       0.396       0.672
C(age)[T.42.0]      0.5426      0.094      5.800      0.000       0.337       0.748
C(age)[T.43.0]      0.4841      0.092      5.277      0.000       0.282       0.686
C(age)[T.44.0]      0.4711      0.090      5.243      0.000       0.273       0.669
C(age)[T.45.0]      0.6426      0.098      6.553      0.000       0.427       0.858
C(age)[T.46.0]      0.9095      0.174      5.237      0.000       0.527       1.292
C(union)[T.1.0]     0.1809      0.038      4.810      0.001       0.098       0.264
hours               0.0028      0.002      1.829      0.095      -0.001       0.006
tenure              0.0328      0.003      9.372      0.000       0.025       0.040
race               -0.1455      0.025     -5.797      0.000      -0.201      -0.090
==============================================================================
Omnibus:                     1064.590   Durbin-Watson:                   0.868
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2692.841
Skew:                           0.329   Prob(JB):                         0.00
Kurtosis:                       4.729   Cond. No.                     2.93e+04
==============================================================================
Notes:
[1] Standard Errors are robust to cluster correlation (cluster)
[2] The condition number is large, 2.93e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

However, this approach falls short for 2 reasons:

  1. It is much slower with large datasets than reghdfe (cf. below),
  2. It fails to converge in Python when (one of) the fixed-effect dimension have a too-high cardinality.

(Note: One may argue that the latter point hints at a faulty analysis design… but I'd prefer to not discuss the “whether it is correct to do so”.)

A model showing contrast in computation time is ln_wage ~ hours + tenure + race + C(age) + C(idcode), clustering SE by ind_code:

  • Stata 17: 0.35 seconds
  • Python 3.11: 115.79 s
Reproducible code
  • Python
import numpy as np
import statsmodels.datasets as sd
import statsmodels.formula.api as sm
import time

df = sd.webuse("nlswork")
dataset = df.dropna(
    subset=[
        "ln_wage",
        "hours",
        "tenure",
        "race",
        "age",
        "idcode",
        "ind_code",
    ],
    how="any",
)

start_time = time.perf_counter()
model = sm.ols(
    formula="ln_wage ~ hours + tenure + race + C(age) + C(idcode)",
    data=dataset,
)
results = model.fit(
    cov_type="cluster",
    cov_kwds={"groups": np.array(dataset["ind_code"])},
    use_t=True,
)
end_time = time.perf_counter()

print(f"Model fitting duration: {end_time - start_time:.2f} seconds.")
results.summary().as_text()
  • Stata
webuse nlswork
set rmsg on
reghdfe ln_wage hours tenure race, absorb(age idcode) vce(cluster ind_code)

A model showing Python unable to converge is ln_wage ~ hours + tenure + race + C(age) + C(idcode) + C(grade):

  • Stata: 0.51s
  • Python: stopped after 20 mins.

Other

I'd be happy to raise or contribute to a PR for it, although I might fall short on the theory of the stats.

Linked to #259

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant