
potential bug in prediction of MIXL models #14

Open

alvarogutyerrez opened this issue Apr 24, 2023 · 4 comments

@alvarogutyerrez

Dear Cristian Arteaga,
First of all, thank you for developing such a great package. I have been using it extensively and I think it works just great!

However, I think I found a weird behavior (most likely a bug) when predicting probabilities from a MIXL model.

I was trying to re-create the Log-likelihood (LL) of the model from its predicted probabilities using equation (6) in your article:

[image: equation (6) from the article]
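For reference, equation (6) is presumably the simulated log-likelihood of the panel mixed logit; a transcription in standard notation (assumed, since the original image is not reproduced here), with R draws, T_n choice situations per individual n, and J alternatives:

SLL = \sum_{n=1}^{N} \ln\left( \frac{1}{R} \sum_{r=1}^{R} \prod_{t=1}^{T_n} \prod_{j=1}^{J} P_{njt}(\beta^{r})^{y_{njt}} \right)

where y_{njt} = 1 if individual n chose alternative j in situation t, and 0 otherwise.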

Unfortunately, I noticed that I was not able to replicate it, and the differences were rather large (see the first chunk of code below). Weirdly enough, when using only 1 draw, I am able to replicate the LL value (see the second chunk of code below).

I haven't had time to go through your source code, but I think there is something wrong with the .predict() method for MIXL models, since for MNL models I checked that my method of retrieving the LL works just fine (not included here).

First chunk

import pandas as pd
import numpy as np
from xlogit.utils import wide_to_long
from xlogit import MixedLogit
# from your example
df_wide = pd.read_table("http://transp-or.epfl.ch/data/swissmetro.dat", sep='\t')
# Keep only observations for commute and business purposes that contain known choices
df_wide = df_wide[(df_wide['PURPOSE'].isin([1, 3]) & (df_wide['CHOICE'] != 0))]
df_wide['custom_id'] = np.arange(len(df_wide))  # Add unique identifier
df_wide['CHOICE'] = df_wide['CHOICE'].map({1: 'TRAIN', 2:'SM', 3: 'CAR'})
df = wide_to_long(df_wide, id_col='custom_id', alt_name='alt', sep='_',
                  alt_list=['TRAIN', 'SM', 'CAR'], empty_val=0,
                  varying=['TT', 'CO', 'HE', 'AV', 'SEATS'], alt_is_prefix=True)
df['ASC_TRAIN'] = np.ones(len(df))*(df['alt'] == 'TRAIN')
df['ASC_CAR'] = np.ones(len(df))*(df['alt'] == 'CAR')
df['TT'], df['CO'] = df['TT']/100, df['CO']/100  # Scale variables
annual_pass = (df['GA'] == 1) & (df['alt'].isin(['TRAIN', 'SM']))
df.loc[annual_pass, 'CO'] = 0  # Cost zero for pass holders
varnames=['ASC_CAR', 'ASC_TRAIN', 'CO', 'TT']


# Model building
model = MixedLogit()
model.fit(X=df[varnames],
          y=df['CHOICE'],
          varnames=varnames,
          alts=df['alt'],
          ids=df['custom_id'],
          avail=df['AV'],
          panels=df['ID'],
          randvars={'TT': 'n'},
          n_draws=100,
          optim_method='L-BFGS-B')


# Create predictions
predictions = model.predict(X=df[varnames],
                            varnames=varnames,
                            alts=df['alt'],
                            ids=df['custom_id'],
                            avail=df['AV'],
                            panels=df['ID'],
                            n_draws=100,
                            return_proba=True)
# Recovering the predicted probabilities
pred_proba = predictions[1]

# Map each chosen alternative to its column index in pred_proba
chosen = np.array(df_wide['CHOICE'].map({'TRAIN': 0, 'SM': 1, 'CAR': 2})).reshape(-1, 1)
# Select the probability of the chosen alternative
proba_chosen = np.take_along_axis(pred_proba, chosen, axis=1)
# Compute the log-likelihood
recreated_LL = np.sum(np.log(proba_chosen))
print("recreated LL:",recreated_LL)
print("model's LL  :",model.loglikelihood)

# recreated LL: -5293.024645448918
# model's LL  : -4360.226616589964

Second chunk

# The same again, but with 1 draw only
model.fit(X=df[varnames],
          y=df['CHOICE'],
          varnames=varnames,
          alts=df['alt'],
          ids=df['custom_id'],
          avail=df['AV'],
          panels=df['ID'],
          randvars={'TT': 'n'},
          n_draws=1,
          optim_method='L-BFGS-B')


# Create predictions
predictions = model.predict(X=df[varnames],
                            varnames=varnames,
                            alts=df['alt'],
                            ids=df['custom_id'],
                            avail=df['AV'],
                            panels=df['ID'],
                            n_draws=1,
                            return_proba=True)
# Recovering the predicted probabilities
pred_proba = predictions[1]

# Map each chosen alternative to its column index in pred_proba
chosen = np.array(df_wide['CHOICE'].map({'TRAIN': 0, 'SM': 1, 'CAR': 2})).reshape(-1, 1)
# Select the probability of the chosen alternative
proba_chosen = np.take_along_axis(pred_proba, chosen, axis=1)
# Compute the log-likelihood
recreated_LL = np.sum(np.log(proba_chosen))
print("recreated LL:",recreated_LL)
print("model's LL  :",model.loglikelihood)


#recreated LL: -5331.206129776281
#model's LL  : -5331.206129776281

Thank you in advance!

Álvaro

@arteagac (Owner)

Hi @alvarogutyerrez,

I am glad you find xlogit useful, and thank you so much for reporting this. I think this is not a bug; the issue is that your computation of the log-likelihood ignores the panel structure of the dataset. For panel data, you need to compute the probability for each individual as the product of the probabilities of the choice situations faced by that individual. For example, note that if I remove the panel structure from the model estimation, I get the exact same LL values you get (see code below).

By the way, the predict function in xlogit returns probabilities at the choice-situation level, instead of the individual level. This means that, for prediction, xlogit does not use the panel information, as predicting at the panel level is inconvenient for individuals that exhibit different choices when faced with different choice situations, which is a common scenario. Let me know if you have any questions or want to discuss this further.

import pandas as pd
import numpy as np
from xlogit.utils import wide_to_long
from xlogit import MixedLogit
# from your example
df_wide = pd.read_table("http://transp-or.epfl.ch/data/swissmetro.dat", sep='\t')
# Keep only observations for commute and business purposes that contain known choices
df_wide = df_wide[(df_wide['PURPOSE'].isin([1, 3]) & (df_wide['CHOICE'] != 0))]
df_wide['custom_id'] = np.arange(len(df_wide))  # Add unique identifier
df_wide['CHOICE'] = df_wide['CHOICE'].map({1: 'TRAIN', 2:'SM', 3: 'CAR'})
df = wide_to_long(df_wide, id_col='custom_id', alt_name='alt', sep='_',
                  alt_list=['TRAIN', 'SM', 'CAR'], empty_val=0,
                  varying=['TT', 'CO', 'HE', 'AV', 'SEATS'], alt_is_prefix=True)
df['ASC_TRAIN'] = np.ones(len(df))*(df['alt'] == 'TRAIN')
df['ASC_CAR'] = np.ones(len(df))*(df['alt'] == 'CAR')
df['TT'], df['CO'] = df['TT']/100, df['CO']/100  # Scale variables
annual_pass = (df['GA'] == 1) & (df['alt'].isin(['TRAIN', 'SM']))
df.loc[annual_pass, 'CO'] = 0  # Cost zero for pass holders
varnames=['ASC_CAR', 'ASC_TRAIN', 'CO', 'TT']


# Model building
model = MixedLogit()
model.fit(X=df[varnames],
          y=df['CHOICE'],
          varnames=varnames,
          alts=df['alt'],
          ids=df['custom_id'],
          avail=df['AV'],
          #panels=df['ID'],
          randvars={'TT': 'n'},
          n_draws=100,
          optim_method='L-BFGS-B')


# Create predictions
predictions = model.predict(X=df[varnames],
                            varnames=varnames,
                            alts=df['alt'],
                            ids=df['custom_id'],
                            avail=df['AV'],
                            #panels=df['ID'],
                            n_draws=100,
                            return_proba=True)
# Recovering the predicted probabilities
pred_proba = predictions[1]

# Map each chosen alternative to its column index in pred_proba
chosen = np.array(df_wide['CHOICE'].map({'TRAIN': 0, 'SM': 1, 'CAR': 2})).reshape(-1, 1)
# Select the probability of the chosen alternative
proba_chosen = np.take_along_axis(pred_proba, chosen, axis=1)
# Compute the log-likelihood
recreated_LL = np.sum(np.log(proba_chosen))
print("recreated LL:",recreated_LL)
print("model's LL  :",model.loglikelihood)

# recreated LL: -5215.277641249061
# model's LL  : -5215.277641249061
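To make the distinction concrete (a sketch in standard notation, assumed rather than quoted from the thread): with panels, the simulated probability for individual n averages over draws the product of that individual's per-situation probabilities,

\hat{P}_n = \frac{1}{R} \sum_{r=1}^{R} \prod_{t=1}^{T_n} P^{*}_{nt}(\beta^{r})

where P^{*}_{nt} is the probability of the alternative chosen in situation t. The probabilities returned by predict are already averaged over draws for each choice situation, so multiplying them across situations yields a product of averages rather than an average of products, and the two generally differ.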

@alvarogutyerrez (Author) commented Apr 25, 2023

Dear @arteagac,

Thank you for your quick answer. You are right: I forgot to compute the product of the choice probabilities that belong to each individual when reconstructing the simulated LL. Unfortunately, even once I did, I still could not retrieve the sample (simulated) LL value produced by xlogit (see the first chunk of code below).

All of this is because I want to analyze the performance of my model on out-of-sample data, which is why I am trying to "replicate" the simulated LL value. It would be much easier if I could simply evaluate a different sample using an already-trained model, but I haven't been able to do so. Even when setting the maximum number of iterations to 0, xlogit still performs one iteration on the new data and consequently changes the parameters of the original model (see the second chunk of code). For instance, I know that the R package mlogit has the option estimate = False to do what I am trying to do.

Is it possible for xlogit to evaluate the LL of a model without fitting it?

Thank you in advance!

Reconstructing the LL using the product of each individual's choice probabilities

import pandas as pd
import numpy as np
from xlogit.utils import wide_to_long
from xlogit import MixedLogit
# from your example
df_wide = pd.read_table("http://transp-or.epfl.ch/data/swissmetro.dat", sep='\t')
# Keep only observations for commute and business purposes that contain known choices
df_wide = df_wide[(df_wide['PURPOSE'].isin([1, 3]) & (df_wide['CHOICE'] != 0))]
df_wide['custom_id'] = np.arange(len(df_wide))  # Add unique identifier
df_wide['CHOICE'] = df_wide['CHOICE'].map({1: 'TRAIN', 2:'SM', 3: 'CAR'})
df = wide_to_long(df_wide, id_col='custom_id', alt_name='alt', sep='_',
                  alt_list=['TRAIN', 'SM', 'CAR'], empty_val=0,
                  varying=['TT', 'CO', 'HE', 'AV', 'SEATS'], alt_is_prefix=True)
df['ASC_TRAIN'] = np.ones(len(df))*(df['alt'] == 'TRAIN')
df['ASC_CAR'] = np.ones(len(df))*(df['alt'] == 'CAR')
df['TT'], df['CO'] = df['TT']/100, df['CO']/100  # Scale variables
annual_pass = (df['GA'] == 1) & (df['alt'].isin(['TRAIN', 'SM']))
df.loc[annual_pass, 'CO'] = 0  # Cost zero for pass holders
varnames=['ASC_CAR', 'ASC_TRAIN', 'CO', 'TT']


# Model building 
model = MixedLogit()
model.fit(X=df[varnames], 
          y=df['CHOICE'], 
          varnames=varnames,
          alts=df['alt'], 
          ids=df['custom_id'],
          avail=df['AV'],
          panels=df["ID"], 
          randvars={'TT': 'n'}, 
          n_draws=100,
          optim_method='L-BFGS-B')

model.summary()
# Create predictions
predictions = model.predict(X=df[varnames],
                            varnames=varnames,
                            alts=df['alt'],
                            ids=df['custom_id'],
                            avail=df['AV'],
                            panels=df['ID'],
                            n_draws=100,
                            return_proba=True)

# Recovering the predicted probabilities
pred_proba = predictions[1]  # wide format: one column per alternative
# recover the chosen alternative in wide format
chosen = df_wide['CHOICE'].map({'TRAIN': 0, 'SM': 1, 'CAR': 2}).to_numpy().reshape(-1, 1)
# Select the predicted probability for the chosen alternative
proba_chosen = np.take_along_axis(pred_proba, chosen, axis=1)
# Retrieve the ID for each individual
id_ind = df_wide['ID'].to_numpy().reshape(-1, 1)
# Create unique ID for each individual
id_ind_unique = np.unique(id_ind)
# Compute the product of the probabilities for each individual
proba_chosen_product = np.array([np.prod(proba_chosen[id_ind == i]) for i in id_ind_unique])
# Compute the log-likelihood
recreated_LL = np.log(proba_chosen_product).sum()
print("recreated LL:",recreated_LL)
print("model's LL  :",model.loglikelihood)

#recreated LL: -5293.024645448918
#model's LL  : -4360.226616589964

Even with maxiter=0, the model is still re-fitted for 1 iteration.

# Keep only the first 20 individuals
new_data = df[df['ID'] < 20]
model.summary()


#Optimization terminated successfully.
#    Message: CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
#    Iterations: 16
#    Function evaluations: 20
#Estimation time= 3.2 seconds
#---------------------------------------------------------------------------
#Coefficient              Estimate      Std.Err.         z-val         P>|z|
#---------------------------------------------------------------------------
#ASC_CAR                 0.2696863     0.0546818     4.9319181      8.34e-07 ***
#ASC_TRAIN              -0.6065124     0.0740324    -8.1925319      3.03e-16 ***
#CO                     -1.6537694     0.0778409   -21.2455072      5.05e-97 ***
#TT                     -3.0977383     0.1373519   -22.5532926     1.16e-108 ***
#sd.TT                   3.7587865     0.1632462    23.0252672     5.23e-113 ***
#---------------------------------------------------------------------------
#Significance:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
#Log-Likelihood= -4360.227
#AIC= 8730.453
#BIC= 8764.553


model.fit(X=new_data[varnames], 
          y=new_data['CHOICE'], 
          varnames=varnames,
          alts=new_data['alt'], 
          ids=new_data['custom_id'],
          avail=new_data['AV'],
          panels=new_data["ID"], 
          randvars={'TT': 'n'}, 
          n_draws=100,
          optim_method='L-BFGS-B',
          maxiter=0 #<--- even with this, the model is still re-estimated
          )
model.summary()


#**** The optimization did not converge after 1 iterations. ****
#Message: STOP: TOTAL NO. of ITERATIONS REACHED LIMIT
#    Message: STOP: TOTAL NO. of ITERATIONS REACHED LIMIT
#    Iterations: 1
#    Function evaluations: 4
#Estimation time= 0.0 seconds
#---------------------------------------------------------------------------
#Coefficient              Estimate      Std.Err.         z-val         P>|z|
#---------------------------------------------------------------------------
#ASC_CAR                -2.6699303     0.5917456    -4.5119566      1.19e-05 ***
#ASC_TRAIN              -3.2855903     0.4631139    -7.0945627       3.3e-11 ***
#CO                     -9.6570581     4.2444106    -2.2752413        0.0241 *  
#TT                     -0.0129189     0.0461239    -0.2800912          0.78    
#sd.TT                   0.0168794           nan           nan           nan    
#---------------------------------------------------------------------------
#Significance:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
#Log-Likelihood= -61.992
#AIC= 133.983
#BIC= 149.692

@arteagac (Owner) commented Apr 26, 2023

Hi @alvarogutyerrez,

I will take a closer look at the computations of the predict function. Unfortunately, the current version of xlogit does not have a mechanism to run a single evaluation of the log-likelihood function without estimating the model. However, if you need a temporary workaround, you can use the code below, which is simply a function that takes the input data, formats it, and executes a single run of the LL function. You can call this function after estimating the model, passing the model object as shown below. In the meantime, I will take a look at the prediction code. Let me know if you have any questions.

# First estimate the model as usual using MixedLogit.fit
# Then use the function below to run a single execution of the log_likelihood function
def log_lik(model, X, y, varnames, alts, ids, randvars=None, isvars=None,
            weights=None, avail=None, panels=None, n_draws=1000):
    # Pre-process and format inputs
    X, y, varnames, alts, isvars, ids, weights, panels, avail, scale_factor = \
        model._as_array(X, y, varnames, alts, isvars, ids, weights, panels, avail, None)

    model._validate_inputs(X, y, alts, varnames, isvars, ids, weights)

    betas, X, y, panels, draws, weights, avail, Xnames, scale = \
        model._setup_input_data(X, y, varnames, alts, ids, model.randvars, isvars=isvars,
                                weights=weights, avail=avail, panels=panels, init_coeff=None,
                                random_state=None, n_draws=n_draws, halton=True, verbose=1,
                                predict_mode=False, halton_opts=None, scale_factor=None)

    from xlogit._choice_model import diff_nonchosen_chosen
    Xd, scale_d, avail = diff_nonchosen_chosen(X, y, scale, avail)  # Set up Xd as Xij - Xi*
    fargs = (Xd, panels, draws, weights, avail, scale_d, None)
    # Evaluate the log-likelihood at the estimated coefficients
    return -model._loglik_gradient(model.coeff_, *fargs, return_gradient=False)

# LL on original model
ll = log_lik(model,
             X=df[varnames],
             y=df['CHOICE'],
             varnames=varnames,
             alts=df['alt'], 
             ids=df['custom_id'],
             avail=df['AV'],
             panels=df["ID"],
             n_draws=100)
print(f"LL orig. model= {ll}")

# LL on new data
new_data = df[df['ID'] < 20]
ll_new = log_lik(model,
             X=new_data[varnames],
             y=new_data['CHOICE'],
             varnames=varnames,
             alts=new_data['alt'], 
             ids=new_data['custom_id'],
             avail=new_data['AV'],
             panels=new_data["ID"],
             n_draws=100)
print(f"LL new data= {ll_new}")

Output:
Optimization terminated successfully.
    Message: CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
    Iterations: 16
    Function evaluations: 20
Estimation time= 1.6 seconds
---------------------------------------------------------------------------
Coefficient              Estimate      Std.Err.         z-val         P>|z|
---------------------------------------------------------------------------
ASC_CAR                 0.2696863     0.0546818     4.9319183      8.34e-07 ***
ASC_TRAIN              -0.6065124     0.0740324    -8.1925322      3.03e-16 ***
CO                     -1.6537694     0.0778409   -21.2455079      5.05e-97 ***
TT                     -3.0977383     0.1373519   -22.5532929     1.16e-108 ***
sd.TT                   3.7587865     0.1632462    23.0252673     5.23e-113 ***
---------------------------------------------------------------------------
Significance:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood= -4360.227
AIC= 8730.453
BIC= 8764.553


LL orig. model= -4360.226616589964
LL new data= -76.1282895250552
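If the goal is to compare out-of-sample fit across samples of different sizes, one common convention (a suggestion, not something from the thread) is to normalize by the number of choice situations:

# Hypothetical normalization: average LL per choice situation.
# new_data and ll_new are defined above; custom_id is unique per choice situation.
n_situations = new_data['custom_id'].nunique()
print(f"Avg. LL per choice situation on new data = {ll_new / n_situations}")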

@alvarogutyerrez (Author)

Thank you so much for the log_lik() function!
