[FEA] Better usage of NN Descent as knn build algo #5942

Open
jinsolp opened this issue Jun 18, 2024 · 0 comments
Labels: ? - Needs Triage, feature request


Using NN Descent to build the knn graph for UMAP and HDBSCAN works reasonably well, but a few improvements can be made.
Related PRs:

  • HDBSCAN + NN Descent here
  • UMAP + NN Descent here

[Precision Issues with NN Descent]
NN Descent in RAFT uses fp16 to calculate distances. This makes it difficult to grab the distances calculated by NN Descent and compare them against fp32-calculated distances in additional operations outside of NN Descent, such as making predictions. (Also left as an issue here.)

  • Problem
    • NN Descent calculates distances in fp16 (and only supports the L2Expanded metric)
  • Current status
    • Right now, we grab only the indices and call refine() on them to compute fp32 distances with the L2SqrtExpanded metric (see the sketch after this list)
  • Possible solutions
    • Investigate why the distance differences cause dramatic drops in score
    • Support fp32 distance calculation in NN Descent -> we could then grab the distances directly from NN Descent instead of calling refine()
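
For context, a rough sketch of the current index-then-refine pattern, written against RAFT's experimental nn_descent and refine APIs from memory (namespaces, template parameters, and the host/device location of the candidate graph may differ across versions):

```cpp
#include <raft/core/device_mdarray.hpp>
#include <raft/core/resource/cuda_stream.hpp>
#include <raft/core/resources.hpp>
#include <raft/distance/distance_types.hpp>
#include <raft/neighbors/nn_descent.cuh>
#include <raft/neighbors/refine.cuh>
#include <raft/util/cudart_utils.hpp>

// Build a knn graph: NN Descent supplies candidate indices (its internal
// fp16 distances are discarded), then refine() recomputes exact fp32
// distances with the L2SqrtExpanded metric.
void build_knn_graph(raft::resources const& res,
                     raft::device_matrix_view<const float, int64_t> dataset,
                     int64_t k)
{
  raft::neighbors::experimental::nn_descent::index_params params;
  params.graph_degree = 64;  // keep more candidates than k for refine() to re-rank

  // Candidate graph; only the indices are used downstream.
  auto nnd_index =
    raft::neighbors::experimental::nn_descent::build<float, int64_t>(res, params, dataset);

  // NN Descent keeps its graph in host memory; copy the candidates to device.
  auto candidates = raft::make_device_matrix<int64_t, int64_t>(
    res, dataset.extent(0), static_cast<int64_t>(params.graph_degree));
  raft::copy(candidates.data_handle(),
             nnd_index.graph().data_handle(),
             candidates.size(),
             raft::resource::get_cuda_stream(res));

  auto indices   = raft::make_device_matrix<int64_t, int64_t>(res, dataset.extent(0), k);
  auto distances = raft::make_device_matrix<float, int64_t>(res, dataset.extent(0), k);

  // Exact fp32 re-ranking of the fp16-built candidate lists.
  raft::neighbors::refine(res,
                          dataset,
                          dataset,  // queries == dataset for a knn graph
                          raft::make_const_mdspan(candidates.view()),
                          indices.view(),
                          distances.view(),
                          raft::distance::DistanceType::L2SqrtExpanded);
}
```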

[NN Descent parameters not always resulting in a better score]
Since we are using refine(), increasing NN Descent parameters such as max_iterations or graph_degree should yield equal-or-better results: a larger graph_degree feeds more neighbor candidates into the refinement step, and because NN Descent has a termination threshold, raising max_iterations should not hurt. However, this is not always the case, and NN Descent is unstable for some specific test cases. (The relevant parameters are sketched after this list.)

  • Problem
    • NN Descent does not perform as well as brute force with the current test configurations of test_membership_vector_circles and test_approximate_predict_blobs in cuml/python/cuml/tests/test_hdbscan.py. Increasing NN Descent parameters does not solve this problem
  • Current status
    • Maintain separate NN Descent tests for the two cases mentioned above
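
For reference, the knobs mentioned above live on NN Descent's index_params. A minimal sketch (field names as in RAFT's experimental API; defaults quoted from memory):

```cpp
#include <raft/neighbors/nn_descent.cuh>

// The parameters discussed above, with their approximate defaults.
raft::neighbors::experimental::nn_descent::index_params params;
params.graph_degree              = 64;      // neighbors kept per point; these become refine() candidates
params.intermediate_graph_degree = 128;     // internal degree before final pruning
params.max_iterations            = 20;      // hard cap on iterations...
params.termination_threshold     = 0.0001;  // ...but iteration stops early once the
                                            // fraction of graph updates falls below this
```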

[Using NN Descent with HDBSCAN]
HDBSCAN builds the knn graph in two phases:

  1. Build the knn graph for the top k neighbors
  2. Re-build the knn graph for the top k, using information (core distances) from step 1 to post-process the distances as they are computed

  Step 1 can be done using NN Descent + refine, but step 2 currently cannot be done with NN Descent.
  • Problem
    • Specifically, the core distances obtained after step 1 (which are fp32 L2SqrtExpanded distances from the refine function) cannot be compared against distances computed inside NN Descent in step 2 because of the precision mismatch (fp32 vs. fp16)
    • In addition, the refine function has no support for distance epilogues
  • Current status
    • Do the first knn with NN Descent, and the second knn with brute force
  • Goal
    • Use NN Descent for both knn phases in HDBSCAN
  • Possible solutions
    • Add distance epilogue support to the refine function so that we can refine based on the mutual reachability distance (a hypothetical epilogue is sketched below)
    • Support fp32 distance calculation in NN Descent
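
To make the epilogue idea concrete, here is a hypothetical sketch: refine() exposes no such hook today, and the functor name and signature are invented for illustration. It applies the mutual reachability transform d_mreach(a, b) = max(core_dist[a], core_dist[b], d(a, b)) to each fp32 distance that refine computes, using the core distances from step 1:

```cpp
#include <cuda_runtime.h>

// Hypothetical distance epilogue for refine(): maps a raw fp32 distance to
// the mutual reachability distance HDBSCAN's second knn phase needs.
struct mutual_reachability_epilogue {
  const float* core_dists;  // fp32 core distances from step 1 (k-th NN distance per point)

  __device__ float operator()(float dist, int64_t row, int64_t col) const
  {
    // d_mreach(row, col) = max(core(row), core(col), d(row, col))
    return fmaxf(fmaxf(core_dists[row], core_dists[col]), dist);
  }
};
```

If refine() accepted a functor like this (similar in spirit to the epilogue ops some RAFT pairwise-distance kernels take), step 2 could stay entirely on the NN Descent + refine path, and the fp16/fp32 comparison problem would not arise, since the core distances would only ever be compared against refined fp32 distances.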