
Different results for binomial model from glm() and nlreg() #57

Open
droodman opened this issue Mar 13, 2024 · 2 comments

@droodman commented Mar 13, 2024

I'm getting different results from glm() and nlreg() in a case where I think they should match, unless I'm doing something wrong. This uses the file https://www.stata-press.com/data/r18/lbw.dta.

julia> using DataFrames, StatFiles, Distributions, GLM, GLFixedEffectModels

julia> df = DataFrame(load("lbw.dta"));

julia> glm(@formula(low ~ age+lwt+smoke+ptl+ht+ui  + race ), df, Binomial(), LogitLink(), contrasts=Dict(:race=>DummyCoding()))
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, LogitLink}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

low ~ 1 + age + lwt + smoke + ptl + ht + ui + race

Coefficients:
──────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      z  Pr(>|z|)    Lower 95%    Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)   0.461224   1.20457      0.38    0.7018  -1.8997       2.82215
age          -0.0271003  0.0364499   -0.74    0.4572  -0.0985408    0.0443402
lwt          -0.0151508  0.00692576  -2.19    0.0287  -0.0287251   -0.00157657
smoke         0.923345   0.400821     2.30    0.0212   0.13775      1.70894
ptl           0.541837   0.346247     1.56    0.1176  -0.136795     1.22047
ht            1.83252    0.691624     2.65    0.0081   0.47696      3.18808
ui            0.758513   0.459374     1.65    0.0987  -0.141843     1.65887
race: 2       1.26265    0.526405     2.40    0.0165   0.230912     2.29438
race: 3       0.862079   0.439147     1.96    0.0496   0.00136736   1.72279
──────────────────────────────────────────────────────────────────────────────

julia> nlreg(df, @formula(low ~ age+lwt+smoke+ptl+ht+ui  + fe(race)), Binomial(), LogitLink())
                Generalized Linear Fixed Effect Model
=====================================================================
Distribution:          "Binomial"   Link:                 "LogitLink"
Number of obs:                189   Degrees of freedom:             9
Deviance:                 734.954   Pseudo-R2:                 -2.132
Pseudo-Adj. R2:            -2.200   Iterations:                     8
Converged:                   true
=====================================================================
         Estimate  Std.Error  t value Pr(>|t|)  Lower 95%   Upper 95%
---------------------------------------------------------------------
age    -0.0138435  0.0371528 -0.37261    0.710 -0.0866616   0.0589746
lwt    -0.0128319 0.00695652 -1.84459    0.067 -0.0264665 0.000802576
smoke     1.06455   0.410814  2.59131    0.010   0.259366     1.86973
ptl      0.482117    0.34589  1.39385    0.165  -0.195814     1.16005
ht        2.12143   0.736751  2.87944    0.004   0.677426     3.56544
ui       0.837479   0.467677  1.79072    0.075 -0.0791521     1.75411
=====================================================================
@jmboehm (Owner) commented Apr 15, 2024

Thanks for the report, and sorry that it took a while to respond. I'm trying to at least ensure that GLFixedEffectModels is not giving wrong answers, and your issue got me worried :)

The difference is due to GLFixedEffectModels.jl using different starting values than GLM.jl. Try the following, which for me yields the same estimates as GLM:

nlreg(df, @formula(low ~ age+lwt+smoke+ptl+ht+ui  + fe(race)), Binomial(), LogitLink(), start = zeros(Float64, 6))

Picking good starting values is not straightforward; see e.g. here. R's glm does something complicated, I believe. I haven't quite understood what GLM.jl is doing, but it seems closer to choosing zeros. I could change the default starting values to zero, which would help in this situation, but it's quite possible that there are other situations where the outcome would be worse (right now the default is 0.1 .* ones(...), which is probably not a good idea).
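
For context, R's binomial family initializes the fitted mean rather than the coefficients: with unit weights it sets mustart = (y + 0.5)/2, maps that through the link, and lets the first IRLS step produce the implied starting coefficients. A rough Julia transcription of the idea (a sketch only, not what GLM.jl or GLFixedEffectModels.jl actually implements):

julia> y = df.low;                       # 0/1 response

julia> mu0 = (y .+ 0.5) ./ 2;            # R's binomial mustart with unit weights

julia> eta0 = log.(mu0 ./ (1 .- mu0));   # logit link applied to the starting mean

One weighted least-squares step from eta0 would then yield the starting coefficients.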

@droodman (Author) commented Apr 15, 2024

Would it make sense to call GLM to pick the starting values?
My cmp package for Stata does something like that. Before fitting a multi-equation model, where the equations could be linear, probit, tobit, etc., it runs the relevant Stata command on each equation separately to pick starting values for coefficients. Indeed, I might even make reghdfejl do that before calling GLFixedEffectModels.
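
A minimal sketch of that idea, assuming nlreg's start expects the six non-absorbed slopes in formula order (consistent with the zeros(Float64, 6) call above): fit the pooled GLM with race as an ordinary dummy-coded factor, then reuse its slope estimates as starting values.

julia> pooled = glm(@formula(low ~ age + lwt + smoke + ptl + ht + ui + race), df,
                    Binomial(), LogitLink(), contrasts=Dict(:race => DummyCoding()));

julia> start_vals = coef(pooled)[2:7];   # drop the intercept and the race dummies

julia> nlreg(df, @formula(low ~ age + lwt + smoke + ptl + ht + ui + fe(race)),
             Binomial(), LogitLink(), start = start_vals)

The dummy-coded slopes and the fixed-effect parameterization are not identical, but the warm start should land the optimizer in the right basin.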
