Physical meaning of eps parameter in DBScan #467

Open
maclariz opened this issue Dec 20, 2023 · 6 comments
Labels
theory This issue has a non-technical element

Comments

@maclariz

Does anyone have any insight on the physical meaning of the eps parameter in the DBSCAN algorithm and how this relates to angular spread in a cluster?

@hakonanes
Member

It's the minimal misorientation angle in radians between two points in a cluster. I.e. if two points are further away than eps, they cannot be part of the same cluster.

@maclariz
Author

So, I just had a look at the usage in the paper I just had reviewer comments on.

eps was set to 15 degrees (converted to radians).

But the 3 sigma value for the deviation of orientations within each cluster varied from 1.4 degrees to 4.2 degrees. That is, I calculated the standard deviation of the angle in the axis-angle pair between each quaternion in the cluster and the average orientation of the cluster, which means that 99.7% of orientations in any cluster lie within somewhere between 1.4 and 4.2 degrees of the cluster average. In other words, the clusters actually found are far tighter than the eps threshold. Similar analysis could (and probably should) be done in other cases. I think eps has to be treated as a useful adjustable parameter, but it's better to then analyse the final results to see the real scatter in the data.
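For anyone wanting to repeat this kind of scatter check, here is a minimal sketch in plain Python. It is not the code used for the paper: the helper names (`misorientation_angle`, `mean_quaternion`, `rotation_about_z`) are hypothetical, the cluster is synthetic, crystal symmetry is ignored, and the mean is approximated by a normalised sum of sign-aligned quaternions, which is only reasonable for a tight cluster.

```python
import math
import statistics


def misorientation_angle(q1, q2):
    """Rotation angle in radians between two unit quaternions (no symmetry)."""
    # abs() treats q and -q as the same rotation.
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return 2.0 * math.acos(min(1.0, dot))


def mean_quaternion(quats):
    """Approximate mean orientation: sign-align to the first, sum, normalise."""
    ref = quats[0]
    total = [0.0, 0.0, 0.0, 0.0]
    for q in quats:
        sign = 1.0 if sum(a * b for a, b in zip(q, ref)) >= 0 else -1.0
        for i in range(4):
            total[i] += sign * q[i]
    norm = math.sqrt(sum(t * t for t in total))
    return tuple(t / norm for t in total)


def rotation_about_z(angle_deg):
    """Unit quaternion (w, x, y, z) for a rotation about z."""
    half = math.radians(angle_deg) / 2.0
    return (math.cos(half), 0.0, 0.0, math.sin(half))


# Synthetic cluster: small rotations about z, in degrees (made-up values).
cluster = [rotation_about_z(a) for a in (-1.0, -0.5, 0.0, 0.5, 1.0)]
mean_q = mean_quaternion(cluster)

# Angle between each member and the cluster mean, then 3 sigma of those angles.
angles_deg = [math.degrees(misorientation_angle(q, mean_q)) for q in cluster]
three_sigma_deg = 3.0 * statistics.pstdev(angles_deg)
print(f"3 sigma spread: {three_sigma_deg:.2f} degrees")
```

The resulting spread can then be compared directly with the eps that was passed to the clustering.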

And looking again, the definition suggests that it is the maximal misorientation angle, not minimal.

Best wishes

Ian

@hakonanes
Member

You are right, it's the maximum distance (sklearn docs). Sorry for being too quick there.

@CSSFrancis
Member

CSSFrancis commented Dec 21, 2023

I've found this wikipedia section to be quite useful: https://en.wikipedia.org/wiki/DBSCAN#Abstract_algorithm

Just a note that there are two parameters which are important for DBSCAN: eps, which is the maximum distance between neighbouring points, and min_samples, which is the minimum number of points that must lie within eps of a point (inside a circle, sphere or hypersphere, depending on your dimensionality) for it to be identified as a core point. Once the core points are identified, core points within eps of each other are merged into clusters, and non-core points are added if they are within eps of a core point. In general min_samples should be greater than or equal to the number of dimensions + 1. Actually n + 1 is a good place to start.
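To make the roles of the two parameters concrete, here is a minimal pure-Python sketch of the abstract algorithm. It is an illustration only, not scikit-learn's implementation; the function name `dbscan` and the toy 1D data (angles in degrees) are assumptions. The neighbourhood count includes the point itself, which is how scikit-learn documents `min_samples`.

```python
def dbscan(points, eps, min_samples, distance):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    # Neighbourhood of each point, including the point itself.
    neighbours = [
        [j for j in range(n) if distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbours]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outwards from this unvisited core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels


# Toy data: a chain of points 1 degree apart, plus one far-away outlier.
points = [float(a) for a in range(11)] + [30.0]
eps = 1.5
labels = dbscan(points, eps, min_samples=3, distance=lambda a, b: abs(a - b))

members = [p for p, lab in zip(points, labels) if lab == 0]
print(labels)
print("cluster width:", max(members) - min(members), "eps:", eps)
```

In this toy case the whole chain ends up in one cluster that is 10 degrees wide even though eps is 1.5. So eps limits the distance between neighbouring points rather than the overall extent of a cluster, which is one more reason to measure the real scatter afterwards.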

For eps, a couple of different estimation methods are described here, which might be of interest: https://en.wikipedia.org/wiki/DBSCAN#Parameter_estimation. Additionally, methods like OPTICS might be of use; they tend to work better if you have clusters of varying densities.
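One of the approaches on that page is a k-distance plot: take the distance from every point to its k-th nearest neighbour (with k = min_samples - 1), sort the values, and look for the elbow. A rough sketch, where the helper name `k_distances` and the toy 1D data are assumptions made for illustration:

```python
def k_distances(points, k, distance):
    """Sorted distance from each point to its k-th nearest neighbour."""
    result = []
    for i, p in enumerate(points):
        # Distances to all other points (the point itself is excluded).
        d = sorted(distance(p, q) for j, q in enumerate(points) if j != i)
        result.append(d[k - 1])
    return sorted(result)


# Two tight groups and one isolated point (made-up values).
points = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3, 20.0]
kd = k_distances(points, k=2, distance=lambda a, b: abs(a - b))
for value in kd:
    print(f"{value:.2f}")
```

Here the sorted values stay small for the clustered points and jump for the isolated one, so a candidate eps would sit just above the small values. For orientation data the `distance` argument would be a misorientation angle rather than an absolute difference.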

@pc494
Member

pc494 commented Dec 23, 2023

Hi Ian, sorry I'm a bit late to this one.

When we wrote Density Based Clustering ... Johnstone et al. we thought about this a bit and in the end decided to start with a number that accurately answers the question:

"Given our (perceived) measurement errors, what's the furthest apart two data points could be (measured by the smallest rotation that maps A to B) where you could still convince us they belonged in the same cluster?"

which for our dataset was 0.05 radians (i.e. ~3 degrees), given the high-quality EBSD map we had at hand. Then run the algorithms, inspect the real/orientation space maps, and see where that leaves you.
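Since eps is given in radians, a quick conversion makes it easier to compare the values discussed in this thread. The variable names below are just for illustration:

```python
import math

# eps of 0.05 expressed in degrees.
eps_paper_rad = 0.05
eps_paper_deg = math.degrees(eps_paper_rad)

# The 15 degree threshold mentioned earlier in the thread, in radians.
eps_generous_rad = math.radians(15.0)

print(f"{eps_paper_rad} rad = {eps_paper_deg:.2f} degrees")
print(f"15 degrees = {eps_generous_rad:.3f} rad")
```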

15 degrees does seem a bit too generous though unless you're working with really really noisy data. You may well end up merging two physically distinct clusters if you're unlucky.

@maclariz
Author

maclariz commented Jan 3, 2024

@pc494 I took a break for the holidays there. I initially took the naive view that a small misorientation should be sensible, but on some datasets I merely ended up with lots of clusters that were all versions of the same thing, with just minor misorientations between them. Even with the 15 degree criterion, this dataset splits some laths of the same orientation into two clusters, probably due to some sample bending. It is trivial to see in a pole figure that they belong to the same cluster, however. A minor issue which may influence this, but is unproven at this point, is that any orientation has two possible habit planes, and in real samples there could be a slight misorientation between two laths with the two different habit planes.

On this specific dataset I played with eps and found that reducing to 3 degrees (in radians) made no difference to results and exactly the same clusters were found. I played with min_samples and found that increasing this from 40 to 100 just deleted the smallest cluster and that lath, with no other change, so that was unhelpful. Decreasing from 40 to 10 found one extra cluster that was a variant indexing of an existing cluster with a slight tilt and just filled in a few points around the edges of one of the laths, so no significant benefit to overall interpretation.
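When comparing runs like this, the cluster numbers themselves can change between runs even if the grouping is identical. A small helper for checking whether two label arrays describe exactly the same clusters, assuming the common convention that -1 marks noise (the name `same_partition` and the example labels are hypothetical):

```python
def same_partition(labels_a, labels_b):
    """True if both labelings group the points identically, up to renumbering."""
    if len(labels_a) != len(labels_b):
        return False
    forward, backward = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Noise (-1) in one run must also be noise in the other.
        if (a == -1) != (b == -1):
            return False
        # Each label in one run must map to exactly one label in the other.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


# Same grouping with swapped cluster numbers, then a genuinely different one.
print(same_partition([0, 0, 1, -1], [1, 1, 0, -1]))
print(same_partition([0, 0, 1, 1], [0, 0, 0, 1]))
```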

With the benefit of this, I will revise the paper text slightly and update the eps parameter, seeing as the smaller one works here.

Ian

pc494 added the theory label on Mar 16, 2024

4 participants