
Data Leakage #249

Open
nova-land opened this issue May 28, 2023 · 14 comments

@nova-land

The use of tf.keras.utils.normalize produces invalid test results, because it normalises the whole dataset (train and test together).

An evaluation script is required to verify the accuracy of the model.
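A minimal numpy sketch of the leak, using min-max scaling and made-up numbers for illustration (not the exact tf.keras.utils.normalize formula): when the scaling statistics are computed over the full dataset, a test-set outlier changes every training value.

```python
import numpy as np

# Hypothetical point totals: first 4 rows are train, last row is test.
data = np.array([[10.0], [20.0], [30.0], [40.0], [100.0]])
train, test = data[:4], data[4:]

# Leaky: min/max computed over the *whole* dataset, so the test
# outlier (100) compresses every training value.
leaky_train = (train - data.min(0)) / (data.max(0) - data.min(0))

# Correct: statistics come from the training split only.
lo, hi = train.min(0), train.max(0)
clean_train = (train - lo) / (hi - lo)
clean_test = (test - lo) / (hi - lo)  # may fall outside [0, 1]

print(leaky_train.ravel())  # training values shrunk by the test outlier
print(clean_train.ravel())  # spans the full [0, 1] range
print(clean_test.ravel())   # 100 maps to 3.0 under train-only scaling
```

Under leakage the model trains on a different distribution than it would see in production, so its test accuracy is misleading.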

@kyleskom
Owner

kyleskom commented Jun 4, 2023

I don't understand what the issue here is?

@chriseling
Contributor

chriseling commented Jun 4, 2023 via email

@kyleskom
Owner

kyleskom commented Jun 5, 2023

I'll take a look when I revisit this next season.

@kyleskom
Owner

kyleskom commented Oct 8, 2023

Hi looking for more info on what the potential fix for this would be. Thank you.

@nova-land
Author

You will need to separate the train and test data when using tf.keras.utils.normalize. But normally you should use a scaler from scikit-learn: split the data into train and test, fit the scaler on the train data only, then transform both the train and test data with it.
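In code, and assuming a scikit-learn MinMaxScaler with toy data standing in for the repo's real feature matrix, the fit-on-train / transform-both pattern looks like:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for the repo's real features and labels.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)  # statistics come from train only
X_test_s = scaler.transform(X_test)        # reuse the train statistics
```

Transformed test values may fall outside [0, 1]; that is expected and is exactly what the model will see on genuinely new data.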

@STRATZ-Ken

STRATZ-Ken commented Oct 11, 2023

I am not sure I agree with @nova-land. The idea of normalize is to scale the entire dataset equally. Imagine you have a dataset with values of [3, 1, 0.50] and you normalize it: it would change to [1, 0.33, 0.17]. If your next dataset has a higher value, everything would rescale based on the highest value in the column.

There are Keras layers that will normalize the data inside the model itself, which would not require this function to be called. Or you can normalize the data as it comes in by setting max values. For example, if a player scores 56 points, but your goal is to predict how many points a player will score on a scale from 0 to 50 (you're force-normalizing here), then the max he can score is 50. Just an example.

I am not an expert here, but you have to make sure you have this code in your training script. Then, when you're ready to predict, you load these values and send the predictions through the normalize function as well.

import os
import joblib

# model_dir and min_max_scaler come from the surrounding training code
if not os.path.exists(model_dir + '/scaler.pkl'):
    joblib.dump(min_max_scaler, model_dir + '/scaler.pkl')
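At prediction time the persisted scaler gets loaded back and reused, so new rows are scaled with the training statistics. A self-contained sketch, with a temp directory and toy data standing in for model_dir and the real features:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

model_dir = tempfile.mkdtemp()  # stand-in for the repo's model directory

# Training time: fit on training data only, then persist the scaler.
X_train = np.array([[10.0], [20.0], [40.0]])
min_max_scaler = MinMaxScaler().fit(X_train)
scaler_path = os.path.join(model_dir, 'scaler.pkl')
if not os.path.exists(scaler_path):
    joblib.dump(min_max_scaler, scaler_path)

# Prediction time: reload the same scaler and transform new rows with it.
loaded = joblib.load(scaler_path)
X_new = np.array([[25.0]])
X_new_scaled = loaded.transform(X_new)  # uses train min/max (10, 40)
```

Persisting the fitted scaler next to the model file guarantees training and inference apply identical scaling.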

@STRATZ-Ken

STRATZ-Ken commented Oct 11, 2023

Here is information on the normalization layer. You would add this before your first dense layer; it will normalize the incoming data and store its statistics inside the model file itself. Then you would not have to make any changes to the data or even call MinMax normalize within the file at all.

https://keras.io/api/layers/normalization_layers/batch_normalization/

Also worth noting, this is for the NN model, not XGBoost.
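For what it's worth, the adapt-once-then-apply behaviour described above matches keras.layers.Normalization (which stores per-feature mean and variance computed from the training data) rather than BatchNormalization (which normalizes per batch). A plain-numpy sketch of that idea, with made-up numbers:

```python
import numpy as np

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])

# "adapt" step: compute per-feature statistics from training data once;
# a Normalization layer stores these inside the saved model.
mean = X_train.mean(axis=0)
var = X_train.var(axis=0)

def normalize(x, eps=1e-7):
    # What the layer applies to every incoming row at inference time.
    return (x - mean) / np.sqrt(var + eps)

X_new = np.array([[2.5]])
print(normalize(X_new))  # 2.5 equals the train mean, so output is ~0
```

Because the statistics ride along inside the model, there is no separate scaler file to keep in sync.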

@Gxent

Gxent commented Oct 11, 2023

But then which would be better, the XGBoost or the NN model?

@STRATZ-Ken

"Better" is not a good word to use for models at all. There are a million factors; that question cannot be answered.

@Gxent

Gxent commented Oct 11, 2023

Okay, put another way: whose probabilities would be closest? I made $2,000 in two weeks via XGBoost with just a $10 stake at the end of the season in May, and I didn't pay attention to the NN model...

@Gxent

Gxent commented Oct 11, 2023

So I always relied on the over/under.

@cafeTechne

Okay, put another way: whose probabilities would be closest? I made $2,000 in two weeks via XGBoost with just a $10 stake at the end of the season in May, and I didn't pay attention to the NN model...

How's this working out for you now?

@Gxent

Gxent commented Jan 2, 2024

this year wasn't so good

@cafeTechne

this year wasn't so good

So you're not seeing 55% win rates with this strategy?
