02.2 Comparison of isolated and CV methods.ipynb , TypeError: Could not convert ['0 empty0 empty0 empty0 empty0 empty0 empty0 empty0 empty0 empty0 empty' 'DTDTDTDTDTDTDTDTDTDT'] to numeric #8

Open
AhmedmostafaElabdli1 opened this issue Apr 8, 2024 · 2 comments

@AhmedmostafaElabdli1

image

I found a workaround: replace any value that cannot be converted to numeric with NaN.

Here is the solution:

import pandas as pd

def average_values(name_list):
    flag = 1
    for i in name_list:
        df = pd.read_csv(i)
        col = i[14:-4]  # Extract column name from file path

        # My change: coerce every column to numeric; any value that cannot be converted becomes NaN
        df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))

        temp = pd.DataFrame(df.mean(), columns=[col])  # Assign column name to DataFrame

        if flag:
            std = temp
            flag = 0
        else:
            std[col] = temp[col]
    tt = std.T
    return tt

## isolated

name_list = find_the_way('./isolated/', '.csv')
iso = average_values(name_list)
iso = iso.drop(['Dataset', 'ML algorithm'], axis=1)  # my change: drop the text columns, which are all NaN after coercion

## cross-validation

name_list = find_the_way('./crossval/', '.csv')
cv = average_values(name_list)
cv = cv.drop(['Dataset', 'ML_algorithm'], axis=1)  # my change: drop the text columns, which are all NaN after coercion

Is this correct? I ask because my resulting graph does not match the one in the paper.

This is my result:

image

Paper result:

image

@kahramankostas
Owner

It appears that the issue stems from a library update. Previously, df.mean() silently skipped non-numeric columns, but newer versions of pandas (2.0 and later) no longer do this and raise a TypeError instead.
Your solution appears to be appropriate in this context. I will proceed to update the code accordingly when the opportunity arises.
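To make the two options concrete, here is a minimal sketch comparing the coercion workaround with restricting the mean to numeric columns. The small DataFrame is a hypothetical stand-in for one results file; its column names and values are assumptions, not data from the notebook.

```python
import pandas as pd

# Hypothetical miniature of one results file: two text columns plus a numeric score.
df = pd.DataFrame({
    "Dataset": ["DT", "DT", "DT"],
    "ML algorithm": ["0 empty", "0 empty", "0 empty"],
    "F1": [0.5, 0.75, 1.0],
})

# Option A: restrict the reduction to numeric columns.
numeric_mean = df.mean(numeric_only=True)

# Option B (the workaround from this issue): coerce everything to numbers first.
# Text columns become all-NaN, so their mean is NaN and they must be dropped later.
coerced_mean = df.apply(lambda s: pd.to_numeric(s, errors="coerce")).mean()

print(numeric_mean)
print(coerced_mean)
```

Both options give the same mean for the numeric column. Option B additionally keeps the text columns as NaN entries, which is why the workaround needs the extra drop step.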

Regarding the differences in the graphs, if you've augmented the feature array containing the root features (as inferred from the close resemblance between all cascaded results and the final results), such variations are expected.

The significance of this step lies in assessing how the inclusion of additional features impacts the results obtained with the root features. It's essential to ascertain whether the success observed is sustainable in an isolated dataset.
If the success can be maintained in the isolated dataset, it suggests that the feature may indeed be beneficial. However, if not, it indicates that the success observed in the cross-validation step might be a result of information leakage.
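As a rough illustration of this check, the sketch below compares the gain of each candidate feature set over the root features in both settings. All scores, labels, and the "gain greater than zero" threshold are made up for illustration; they are not results from the paper or the notebook.

```python
import pandas as pd

# Made-up scores for illustration only; NOT results from the paper or the notebook.
cv = pd.Series(
    {"root": 0.80, "root + feature_a": 0.90, "root + feature_b": 0.95}, name="cv"
)
iso = pd.Series(
    {"root": 0.78, "root + feature_a": 0.88, "root + feature_b": 0.60}, name="isolated"
)

scores = pd.concat([cv, iso], axis=1)

# Gain of each candidate over the root features, computed separately per setting.
gain = scores - scores.loc["root"]

# A feature whose gain appears in cross-validation but not on the isolated
# dataset is a candidate for information leakage.
suspect = gain[(gain["cv"] > 0) & (gain["isolated"] <= 0)].index.tolist()
print(suspect)
```

In this toy example only the second candidate is flagged, because its improvement under cross-validation does not carry over to the isolated dataset.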

In this context, if you increase the root features uncontrollably (such as moving features from the iden list to the feature list), you will throw away the possibility of making this useful comparison.

@kahramankostas
Owner

As far as I understand, earlier versions of pandas effectively treated numeric_only as True by default in df.mean(), and the default has since been changed to False. If you change the code as below (I have already added this fix), it should solve your problem:

df.mean()  --->  df.mean(numeric_only=True)
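For reference, a possible shape of the averaging helper with this fix applied is sketched below. It is not the repository's actual code. Taking the row label from the file name via Path(...).stem is an assumption that replaces the original path slicing.

```python
from pathlib import Path

import pandas as pd


def average_values(name_list):
    """Return one row per CSV file holding the mean of its numeric columns."""
    means = {}
    for path in name_list:
        df = pd.read_csv(path)
        # Assumption: the file name without its extension is a suitable row label;
        # the original code slices the path with i[14:-4] instead.
        means[Path(path).stem] = df.mean(numeric_only=True)
    # Each Series becomes a column; transpose so that each file is a row.
    return pd.DataFrame(means).T
```

With numeric_only=True the text columns never reach the result, so a later drop of 'Dataset' or 'ML algorithm' is unnecessary and would raise a KeyError.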
