Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User flag to construct CAGRA index with dataset #200

Closed

Conversation

tarang-jain
Copy link

@tarang-jain tarang-jain commented Jun 25, 2024

Currently we would always construct the CAGAR index with dataset if the dataset fits into GPU mem. If it does not fit, then we fall back to constructing index with graph only. In this PR a simple flag is added to index_params to allow the user a choice if they want to disable constructing the index with the dataset (i.e. only construct with the only graph instead).

Closes #199

@tarang-jain tarang-jain requested a review from a team as a code owner June 25, 2024 21:54
@github-actions github-actions bot added the cpp label Jun 25, 2024
@tarang-jain tarang-jain marked this pull request as draft June 25, 2024 21:55
@tarang-jain tarang-jain marked this pull request as ready for review June 25, 2024 22:36
* - `false` means `build` only builds the graph, but
* the user is expected to update dataset separately.
*/
bool add_data_on_build = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a different parameter name here. add_data_on_build is already defined for the index_params, and I believe it has a slightly different meaning. It's meant to be used together with extend function; in IVF-* methods we allow adding any new data after clustering this way. In CAGRA in the current form, the user would only be allowed to add the data that is present in the graph (and not via extend function).

Or, at the very least, use explicit using cuvs::neighbors::index_params::add_data_on_build; instead of shadowing the parent definition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Artem that there is some potential for confusion here, but I am not against repurposing the existing add_data_on_build flag. We should be clear what is the intended usage, therefore the documentation shall contain a code example on how to use build & index.update_dataset() when this flag is enabled. Also the docstring of update_dataset() should explain, that it is expected that the same set of vectors should be are used for update_dataset and build.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I can remove the flag add_data_on_build from cagra::index_params and directly use cuvs::neighbors::add_data_on_build, without overriding it. I can update the documentation of cuvs::neighbors::add_data_on_build stating that it has a different meaning for a CAGRA index.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have moved add_data_on_build to the ivf index_params and added a new arg for cagra -- populate_data. The PR is also green on CI right now.

@tfeher
Copy link
Contributor

tfeher commented Jun 26, 2024

@tarang-jain could you clarify the PR title and description? Currently the word "save" is used, and that makes me associate to serialization, but for serialization we already have the include_dataset flag.

I assume that problem you are addressing is index construction: whether to construct the index with the dataset. In that case the description could be updated like: "currently we would always construct the CAGAR index with dataset if the dataset fits into GPU mem. If it does not fit, then we fall back to constructing index with graph only. In this PR a simple flag is added to index_params to allow the user a choice if they want to disable constructing the index with the dataset (i.e. only construct with the only graph instead)."

* - `false` means `build` only builds the graph, but
* the user is expected to update dataset separately.
*/
bool add_data_on_build = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Artem that there is some potential for confusion here, but I am not against repurposing the existing add_data_on_build flag. We should be clear what is the intended usage, therefore the documentation shall contain a code example on how to use build & index.update_dataset() when this flag is enabled. Also the docstring of update_dataset() should explain, that it is expected that the same set of vectors should be are used for update_dataset and build.

@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Jun 26, 2024
@tarang-jain tarang-jain changed the title User flag to save the dataset to the CAGRA index User flag to construct CAGRA index with dataset Jun 26, 2024
cpp/include/cuvs/neighbors/common.hpp Outdated Show resolved Hide resolved
cpp/include/cuvs/neighbors/cagra.hpp Show resolved Hide resolved
cpp/include/cuvs/neighbors/cagra.hpp Show resolved Hide resolved
@@ -63,7 +63,7 @@ struct ivf_pq_params {
* auto pq_params =
* cagra::graph_build_params::ivf_pq_params(dataset.extents());
* // modify/update index_params as needed
* index_params.add_data_on_build = true;
* pq_params.add_data_on_build = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, that the example should change pq_params! But lets change a different field, since add_data_on_build is true by default

Suggested change
* pq_params.add_data_on_build = true;
* pq_params.kmeans_trainset_fraction = 0.1;

@@ -79,7 +79,8 @@ struct index_params : cuvs::neighbors::index_params {
/** Degree of output graph. */
size_t graph_degree = 64;
/**
* Specify compression parameters if compression is desired.
* Specify compression parameters if compression is desired. If set, overrides the
* add_data_on_build argument.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* add_data_on_build argument.
* add_data_on_build argument (and the compressed dataset is always added to the index).

cpp/include/cuvs/neighbors/cagra.hpp Show resolved Hide resolved
* `build` provided there is enough memory available.
* - `false` means `build` only builds the graph and the user is expected to
* update the dataset using cuvs::neighbors::cagra::update_dataset. CAGRA does not have `extent`
* API.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to put the new code example section (currently in build docstring) here.

@@ -448,6 +463,23 @@ struct index : cuvs::neighbors::index {
* auto distances = raft::make_device_matrix<float>(res, n_queries, k);
* cagra::search(res, search_params, index, queries, neighbors, distances);
* @endcode
* In the above example, the index is populated with the dataset and is ready for search. In
* contrast, a user can only build the CAGRA graph and separately populate the index with the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A feel that this code example would be better placed in the docstring of add_data_on_build.

Suggested change
* contrast, a user can only build the CAGRA graph and separately populate the index with the
* contrast, a user can create an index that stores only the CAGRA graph and separately populate the index with the

@@ -90,15 +90,6 @@ struct index_params {
cuvs::distance::DistanceType metric = cuvs::distance::DistanceType::L2Expanded;
/** The argument used by some distance metrics. */
float metric_arg = 2.0f;
/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it removed from here? Is it only that we can repeat the docstring in each subclass? If that is the case, why not keep this and add the following with the docstring in the subclasses?

/** docstring */
using cuvs::neighbors::index_params;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add_data_on_build is added separately to the index params for IVF-Flat and IVF-PQ. It is not used in other algorithms.

@tarang-jain
Copy link
Author

Closing in favor of #213

@tarang-jain tarang-jain closed this Jul 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cpp improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
Development

Successfully merging this pull request may close these issues.

[FEA] Expose CAGRA construct_index_with_dataset flag to public API
5 participants