
Normalize input for Choice Models #208

Open
Eh2406 opened this issue Mar 20, 2018 · 2 comments


Eh2406 commented Mar 20, 2018

A number of times we have accidentally compared the magnitudes of coefficients in the yaml files that represent MNLDiscreteChoiceModel instances. This is of course a mistake, as 0.001 is a large coefficient for nonres_sqft and a small coefficient for frac_developed. In addition, there is the "magic 3's" problem: the code puts a hard cutoff for coefficients at -3 and 3. That is a great default for normalized variables, i.e. ones with std ~= 1 and mean ~= 0, but way too small/big for other columns. If coefficients are made comparable, then we can also consider adding L1 or L2 regularization.

My proposal: when fitting a model, subtract the mean and divide by the std for each column. In the yaml file, store the training mean, the training std, and the coefficients of the transformed columns. Then, when predicting with a model, transform with the stored mean and std. Use of the models will be unchanged, but the stored coefficients will be comparable with each other.
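A minimal sketch of the mechanics (numpy only, with made-up data; the wiring into MNLDiscreteChoiceModel and the yaml I/O is omitted):

```python
import numpy as np

def fit_normalization(X):
    """Compute per-column training statistics and z-score X."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return mean, std, (X - mean) / std

def apply_normalization(X, mean, std):
    """Apply the *stored* training statistics at prediction time."""
    return (X - mean) / std

# Two columns on wildly different scales, like nonres_sqft vs frac_developed
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.uniform(0, 1e6, size=100),  # e.g. nonres_sqft
                           rng.uniform(0, 1, size=100)])   # e.g. frac_developed

mean, std, X_norm = fit_normalization(X_train)
# ... fit the MNL on X_norm and store mean/std in the yaml next to the
# coefficients, which now live on comparable scales ...

X_new = np.column_stack([rng.uniform(0, 1e6, size=10),
                         rng.uniform(0, 1, size=10)])
X_pred = apply_normalization(X_new, mean, std)  # training stats, never re-fit
```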

Thoughts?


smmaurer commented Apr 1, 2018

This sounds like a promising feature! I am cautiously enthusiastic.

Some points in favor: I believe that widely divergent coefficients are a problem not just for interpretation, but also for speed and accuracy of the parameter estimation. (The search for optimal values is harder if some are far from the starting point and if the sensitivities vary.)

Advice I've heard is that the best practice is to manually scale the input data so that the fitted coefficients are of similar magnitude. But this is not convenient, especially in a semi-automated context like building an UrbanSim model. Automatically normalizing the input data would help.

Some points of caution: We'd need to be very clear in the documentation and in the output that the fitted parameters apply to transformed data. I don't think this is a common approach. And it should be an optional setting.

Our roadmap is to move the statistics logic out of the UrbanSim repository and into ChoiceModels, but it seems fine to implement this feature here and include it in a point release. ChoiceModels is a ways off from being ready, and the shift will be disruptive enough that we should save it for a major version bump of UrbanSim.


Eh2406 commented Apr 2, 2018

> Advice I've heard is that the best practice is to manually scale the input data so that the fitted coefficients are of similar magnitude.

I know I have heard Andrew Gelman express that viewpoint in the past, but a quick Google search found a more nuanced blog post of his.

> Some points of caution: We'd need to be very clear in the documentation and in the output that the fitted parameters apply to transformed data. I don't think this is a common approach. And it should be an optional setting.

OK, so if we make it an optional setting, a way to make it clearer may be to call the section in the yaml 'Normalized Coefficient' instead of 'Coefficient'. With a bit of math we could even record both in the yaml, leading people to look at the docs to determine the difference.
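The math for recording both is simple: if beta_norm was fitted on (x - mean) / std, the equivalent raw-scale coefficient is beta_norm / std, with the leftover -beta_norm * mean / std terms folded into the constant. A hypothetical yaml layout (key names and numbers made up for illustration, not the current UrbanSim schema):

```yaml
normalization:            # training statistics, stored at fit time
  nonres_sqft:    {mean: 152340.0, std: 98410.0}
  frac_developed: {mean: 0.42,     std: 0.18}
Normalized Coefficient:   # fitted on (x - mean) / std; comparable across rows
  nonres_sqft:    0.85
  frac_developed: -0.31
Coefficient:              # raw scale, = normalized / std
  nonres_sqft:    8.64e-06
  frac_developed: -1.72
```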
