Windows fatal exception: stack overflow when the sample goes over a certain number of points #19

Open
fasensior opened this issue Jan 29, 2024 · 0 comments

@fasensior

Hey!

I just ran into a crash when trying to cluster more than 1 million objects with Fast HDBSCAN. My full file contains around 2.5M objects, and I am using just two of its columns. I load it into a pandas DataFrame and make sure the columns are float64. If I subsample the dataset to below 1.1M objects, everything works fine (and fast), but beyond that point it just crashes. The full error it returns is:

Windows fatal exception: stack overflow

Thread 0x0000d814 (most recent call first):
File "C:\Users\fasen\anaconda3\Lib\site-packages\zmq\utils\garbage.py", line 47 in run
File "C:\Users\fasen\anaconda3\Lib\threading.py", line 1038 in _bootstrap_inner
File "C:\Users\fasen\anaconda3\Lib\threading.py", line 995 in _bootstrap

Main thread:
Current thread 0x00010ba0 (most recent call first):
  File "C:\Users\fasen\anaconda3\Lib\site-packages\fast_hdbscan\hdbscan.py", line 168 in fast_hdbscan
  File "C:\Users\fasen\anaconda3\Lib\site-packages\fast_hdbscan\hdbscan.py", line 236 in fit
  File "c:\users\fasen\documents\universidad\master\1er_semestre\tecniques\p6\code.py", line 91 in <module>
  File "C:\Users\fasen\anaconda3\Lib\site-packages\spyder_kernels\py3compat.py", line 356 in compat_exec
  File "C:\Users\fasen\anaconda3\Lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 473 in exec_code
  File "C:\Users\fasen\anaconda3\Lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 615 in _exec_file
  File "C:\Users\fasen\anaconda3\Lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 528 in runfile
  File "C:\Users\fasen\AppData\Local\Temp\ipykernel_68656\701512081.py", line 1 in <module>


Restarting kernel...


I also attach the code:

from sklearnex import patch_sklearn
patch_sklearn()


import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import keras
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import time
import matplotlib as mpl
from matplotlib.colors import Normalize, LogNorm
from sklearn.cluster import DBSCAN, HDBSCAN
import fast_hdbscan

mpl.rcParams['figure.dpi'] = 400
start = time.time()

dataset = pd.read_csv("C:/Users/fasen/Documents/Universidad/master/1er_semestre/tecniques/p6/data_gaia_edr3_reduced.csv",
                      header=0, dtype = np.float64)
n=1300000
X_train = dataset.sample(n)
X_train = X_train[['VR','Vphi']]
# =============================================================================
# clustering = DBSCAN(eps=0.5, min_samples=4,algorithm='ball_tree', metric='haversine').fit(X_train)
# DBSCAN_dataset = X_train.copy()
# DBSCAN_dataset.loc[:,'Cluster'] = clustering.labels_ 
# =============================================================================
start = time.time()
clustering = fast_hdbscan.HDBSCAN(min_cluster_size=20).fit(X_train)
finish = time.time()
print('Computation time for ', n, ' samples of the total: ', finish-start, ' s')
DBSCAN_dataset = X_train.copy()
DBSCAN_dataset.loc[:,'Cluster'] = clustering.labels_
DBSCAN_dataset.Cluster.value_counts().to_frame()

The file I am using is a copy of the Gaia EDR3 data, just in case that is relevant.

Thanks!!
