Joey Sham edited this page Aug 5, 2017 · 6 revisions

How It Works

Getting the data

The data is gathered from three different places: NBA.com Stats, Basketball Reference, and hoopshype.

NBA.com was scraped with nba_py to gather player statistics (including advanced stats, misc stats, etc.). Basketball Reference was scraped with basketballcrawler to gather each player's age, current salary, etc. hoopshype was then scraped to gather players' future salaries. Because Basketball Reference does not tolerate frequent scraping, its data is cached in players.json. That file is read when prepare_data is run and combined with the nba_py and hoopshype data, and the result is stored in raw_data.json.
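The combining step can be pictured roughly as follows. This is a minimal sketch, not the project's prepare_data code: the function name `combine_sources`, the per-player dictionary layout, the field names, and the sample values are all assumptions made for illustration.

```python
import json

def combine_sources(nba_stats, bref_players, hoopshype_salaries):
    """Merge three per-player dicts (keyed by player name) into one record each.

    All three input shapes are hypothetical; the real sources may differ.
    """
    combined = {}
    for name, stats in nba_stats.items():
        # Skip players that are missing from either of the other two sources.
        if name not in bref_players or name not in hoopshype_salaries:
            continue
        record = dict(stats)                # stats gathered via nba_py
        record.update(bref_players[name])   # age, current salary (cached data)
        record["future_salary"] = hoopshype_salaries[name]
        combined[name] = record
    return combined

# Made-up sample data, only to show the shape of the merge.
nba_stats = {"Player A": {"GP": 74, "PTS": 27.3}, "Player B": {"GP": 9, "PTS": 4.1}}
bref_players = {"Player A": {"age": 27, "salary": 26_500_000}}
hoopshype_salaries = {"Player A": 27_700_000}

raw_data = combine_sources(nba_stats, bref_players, hoopshype_salaries)
raw_json = json.dumps(raw_data, indent=2)  # the text that would go into raw_data.json
```

Keying every source by player name keeps the merge simple, but it assumes the three sites spell names identically, which may not hold in practice.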

Cleaning

As it turns out, many of the data columns needed to be removed. Players who have played fewer than 15 games are also removed, as their high stats over so few games skewed the models.
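A minimal sketch of this cleaning step is shown below. The 15-game threshold comes from the text above; the column name `GP` (games played), the helper names, and the dropped column `TEAM_ID` are assumptions for illustration only.

```python
MIN_GAMES = 15  # threshold described in the text

def remove_low_sample_players(players, min_games=MIN_GAMES):
    """Keep only players who appeared in at least `min_games` games."""
    return [p for p in players if p["GP"] >= min_games]

def drop_columns(players, unwanted):
    """Return copies of the player records without the unwanted columns."""
    return [{k: v for k, v in p.items() if k not in unwanted} for p in players]

# Made-up records; "Player B" has a big scoring average over only 9 games.
players = [
    {"name": "Player A", "GP": 74, "PTS": 27.3, "TEAM_ID": 1},
    {"name": "Player B", "GP": 9, "PTS": 31.0, "TEAM_ID": 2},
    {"name": "Player C", "GP": 15, "PTS": 8.2, "TEAM_ID": 3},
]

cleaned = drop_columns(remove_low_sample_players(players), unwanted={"TEAM_ID"})
```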

Modeling

Each model takes a player's stats as input and salary as output. The idea is to fit the model to every player's stats and then predict each player's value. The output is scaled to the range between the minimum and maximum contract price for the 2017-18 season. An average is also computed (and treated as a separate model), which weights the Bayes Ridge, Lasso, and ElasticNet outputs equally.
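The scaling and averaging can be sketched as below. This is an illustration under assumptions: the helper names are hypothetical, the salary bounds are placeholder numbers rather than the real 2017-18 contract limits, and averaging before scaling is one possible ordering that the text does not confirm.

```python
def average_models(*model_outputs):
    """Equal-weight average of several models' per-player predictions."""
    return [sum(values) / len(values) for values in zip(*model_outputs)]

def scale_to_salary_range(predictions, min_salary, max_salary):
    """Min-max scale raw predictions onto [min_salary, max_salary]."""
    lo, hi = min(predictions), max(predictions)
    if hi == lo:
        # All predictions equal: no spread to scale, so return the lower bound.
        return [float(min_salary) for _ in predictions]
    span = hi - lo
    return [min_salary + (p - lo) / span * (max_salary - min_salary)
            for p in predictions]

# Made-up raw outputs for three players from three models.
bayes_ridge = [0.0, 5.0, 10.0]
lasso = [1.0, 4.0, 10.0]
elastic_net = [2.0, 6.0, 10.0]

averaged = average_models(bayes_ridge, lasso, elastic_net)

# Placeholder bounds, not the actual 2017-18 minimum and maximum contracts.
scaled = scale_to_salary_range(averaged, min_salary=1_000_000, max_salary=10_000_000)
```

With min-max scaling, the lowest-rated player always lands exactly on the minimum contract and the highest-rated on the maximum, so the results describe relative standing within the fitted player pool.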

Player Value/Worth

Some of the methods used come with coefficients that explain how the model works and why specific players rank ahead of others.
For example, Ridge seems to favor Personal Fouls Drawn and Free Throws Made, so players who draw fouls and make a lot of free throws are deemed to be worth more. Thus, DeRozan (with his ability to draw fouls and get to the line) is considered the most valuable. Linear Regression seems to value FG3M and FGM, so players who make a high volume of shots are considered more valuable.
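To make the role of coefficients concrete: in a linear model, a player's predicted value is the intercept plus the sum of each coefficient times the matching stat, so each stat's share of the prediction can be read off directly. The sketch below uses invented coefficients and stats (not values from the fitted models), and the abbreviations `PFD` and `FTM` and the helper name `stat_contributions` are assumptions.

```python
def stat_contributions(coefficients, player_stats):
    """Return (stat, coefficient * value) pairs, largest absolute share first."""
    contributions = {
        stat: coef * player_stats.get(stat, 0.0)
        for stat, coef in coefficients.items()
    }
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Invented numbers purely for illustration.
coefficients = {"PFD": 2.0, "FTM": 1.5, "FG3M": 0.5}
player_stats = {"PFD": 6.0, "FTM": 7.0, "FG3M": 1.0}

ranked = stat_contributions(coefficients, player_stats)
```

Comparing coefficients across stats is most meaningful when the inputs are on comparable scales, for instance after standardizing each column.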

By comparing players' calculated worth to their future salary, it is also possible to find undervalued players. Because the most valuable player depends on the model, the set of undervalued players depends on the model as well.
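The comparison can be sketched as a simple ranking by the gap between modeled worth and future salary. The record fields (`worth`, `future_salary`), the function name `find_undervalued`, and the sample figures are hypothetical.

```python
def find_undervalued(players, top_n=5):
    """List (name, surplus) for players whose modeled worth exceeds future salary."""
    surplus = [(p["name"], p["worth"] - p["future_salary"]) for p in players]
    # Keep only positive gaps, largest first.
    undervalued = sorted((s for s in surplus if s[1] > 0),
                         key=lambda s: s[1], reverse=True)
    return undervalued[:top_n]

# Made-up worth values from one model; another model would give another list.
players = [
    {"name": "Player A", "worth": 20_000_000, "future_salary": 12_000_000},
    {"name": "Player B", "worth": 5_000_000, "future_salary": 15_000_000},
    {"name": "Player C", "worth": 9_000_000, "future_salary": 8_000_000},
]

undervalued = find_undervalued(players)
```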

Shortcomings

It is important to note that this analysis only looks at stats from the past year's performance, which is a very isolated view. It doesn't take into account team strength (though many models take wins into account). For example, Curry and Durant would have better stats on separate teams; although their stats are still very impressive, the models don't account for the fact that those stats are lower than they could have been. Therefore, their salary value is lowered. The models also don't take potential into account. For example, KAT put up amazing stats and is only 21, and Giannis Antetokounmpo is only 22, with the sky as the limit. Their young age and potential for growth should make them worth more, but the models do not take that into account.

Contribution

If you want to contribute, please see the CONTRIBUTING guidelines.
