-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
User flag to construct CAGRA index with dataset #200
Conversation
cpp/include/cuvs/neighbors/cagra.hpp
Outdated
* - `false` means `build` only builds the graph, but | ||
* the user is expected to update dataset separately. | ||
*/ | ||
bool add_data_on_build = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a different parameter name here. add_data_on_build
is already defined for the index_params, and I believe it has a slightly different meaning. It's meant to be used together with extend
function; in IVF-* methods we allow adding any new data after clustering this way. In CAGRA in the current form, the user would only be allowed to add the data that is present in the graph (and not via extend
function).
Or, at the very least, use explicit using cuvs::neighbors::index_params::add_data_on_build;
instead of shadowing the parent definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Artem that there is some potential for confusion here, but I am not against repurposing the existing add_data_on_build
flag. We should be clear what is the intended usage, therefore the documentation shall contain a code example on how to use build
& index.update_dataset()
when this flag is enabled. Also the docstring of update_dataset() should explain, that it is expected that the same set of vectors should be are used for update_dataset
and build
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I can remove the flag add_data_on_build
from cagra::index_params
and directly use cuvs::neighbors::add_data_on_build
, without overriding it. I can update the documentation of cuvs::neighbors::add_data_on_build
stating that it has a different meaning for a CAGRA index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have moved add_data_on_build
to the ivf index_params
and added a new arg for cagra -- populate_data
. The PR is also green on CI right now.
@tarang-jain could you clarify the PR title and description? Currently the word "save" is used, and that makes me associate to serialization, but for serialization we already have the I assume that problem you are addressing is index construction: whether to construct the index with the dataset. In that case the description could be updated like: "currently we would always construct the CAGAR index with dataset if the dataset fits into GPU mem. If it does not fit, then we fall back to constructing index with graph only. In this PR a simple flag is added to index_params to allow the user a choice if they want to disable constructing the index with the dataset (i.e. only construct with the only graph instead)." |
cpp/include/cuvs/neighbors/cagra.hpp
Outdated
* - `false` means `build` only builds the graph, but | ||
* the user is expected to update dataset separately. | ||
*/ | ||
bool add_data_on_build = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Artem that there is some potential for confusion here, but I am not against repurposing the existing add_data_on_build
flag. We should be clear what is the intended usage, therefore the documentation shall contain a code example on how to use build
& index.update_dataset()
when this flag is enabled. Also the docstring of update_dataset() should explain, that it is expected that the same set of vectors should be are used for update_dataset
and build
.
@@ -63,7 +63,7 @@ struct ivf_pq_params { | |||
* auto pq_params = | |||
* cagra::graph_build_params::ivf_pq_params(dataset.extents()); | |||
* // modify/update index_params as needed | |||
* index_params.add_data_on_build = true; | |||
* pq_params.add_data_on_build = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, that the example should change pq_params
! But lets change a different field, since add_data_on_build
is true by default
* pq_params.add_data_on_build = true; | |
* pq_params.kmeans_trainset_fraction = 0.1; |
@@ -79,7 +79,8 @@ struct index_params : cuvs::neighbors::index_params { | |||
/** Degree of output graph. */ | |||
size_t graph_degree = 64; | |||
/** | |||
* Specify compression parameters if compression is desired. | |||
* Specify compression parameters if compression is desired. If set, overrides the | |||
* add_data_on_build argument. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* add_data_on_build argument. | |
* add_data_on_build argument (and the compressed dataset is always added to the index). |
* `build` provided there is enough memory available. | ||
* - `false` means `build` only builds the graph and the user is expected to | ||
* update the dataset using cuvs::neighbors::cagra::update_dataset. CAGRA does not have `extent` | ||
* API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to put the new code example section (currently in build docstring) here.
@@ -448,6 +463,23 @@ struct index : cuvs::neighbors::index { | |||
* auto distances = raft::make_device_matrix<float>(res, n_queries, k); | |||
* cagra::search(res, search_params, index, queries, neighbors, distances); | |||
* @endcode | |||
* In the above example, the index is populated with the dataset and is ready for search. In | |||
* contrast, a user can only build the CAGRA graph and separately populate the index with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A feel that this code example would be better placed in the docstring of add_data_on_build
.
* contrast, a user can only build the CAGRA graph and separately populate the index with the | |
* contrast, a user can create an index that stores only the CAGRA graph and separately populate the index with the |
@@ -90,15 +90,6 @@ struct index_params { | |||
cuvs::distance::DistanceType metric = cuvs::distance::DistanceType::L2Expanded; | |||
/** The argument used by some distance metrics. */ | |||
float metric_arg = 2.0f; | |||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it removed from here? Is it only that we can repeat the docstring in each subclass? If that is the case, why not keep this and add the following with the docstring in the subclasses?
/** docstring */
using cuvs::neighbors::index_params;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add_data_on_build
is added separately to the index params for IVF-Flat and IVF-PQ. It is not used in other algorithms.
Closing in favor of #213 |
Currently we would always construct the CAGAR index with dataset if the dataset fits into GPU mem. If it does not fit, then we fall back to constructing index with graph only. In this PR a simple flag is added to
index_params
to allow the user a choice if they want to disable constructing the index with the dataset (i.e. only construct with the only graph instead).Closes #199