Binary quantization #82

Draft: wants to merge 20 commits into base: main

@irevoire (Member) commented Jul 4, 2024

Pull Request

Related issue

Fixes #<issue_number>

What does this PR do?

  • ...

Todo to improve relevancy

One issue we found is that when binary quantizing vectors, we basically end up collapsing them into a handful of clusters. In two dimensions, it would look like this:
image

All vectors will end up on one of the four corners of the square.

The more dimensions we have, the more clusters we'll get: there are up to 2^nb_dimensions of them.
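
To make the collapse concrete, here is a minimal sketch. It assumes a sign-based quantizer (one bit per dimension, set when the component is strictly positive); `binary_quantize` and `distinct_codes` are hypothetical helpers written for this illustration, not arroy's API, and the actual quantization in this PR may differ.

```rust
use std::collections::HashSet;

/// Hypothetical sign-based quantizer: one bit per dimension,
/// `true` when the component is strictly positive.
fn binary_quantize(v: &[f32]) -> Vec<bool> {
    v.iter().map(|&x| x > 0.0).collect()
}

/// Counts how many distinct binary codes a set of vectors maps to.
fn distinct_codes(vectors: &[Vec<f32>]) -> usize {
    vectors
        .iter()
        .map(|v| binary_quantize(v))
        .collect::<HashSet<_>>()
        .len()
}

fn main() {
    // Six different 2-D vectors.
    let vectors = vec![
        vec![0.1, 0.9],
        vec![0.8, 0.2],
        vec![2.0, 3.0],
        vec![-0.3, 0.4],
        vec![-0.7, -0.1],
        vec![0.5, -0.6],
    ];
    // With 2 dimensions there are at most 2^2 = 4 possible codes,
    // so the first three vectors all share the same corner.
    println!(
        "{} vectors -> {} distinct codes",
        vectors.len(),
        distinct_codes(&vectors)
    );
}
```

Under this assumption, any two vectors in the same quadrant become indistinguishable after quantization, whatever their original angle or magnitude.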

Maybe the internal computation in two_means shouldn't use binary quantized vectors:

arroy/src/distance/mod.rs

Lines 127 to 173 in 0e8fba2

fn two_means<D: Distance, R: Rng>(
    rng: &mut R,
    leafs: &ImmutableSubsetLeafs<D>,
    cosine: bool,
) -> heed::Result<[Leaf<'static, D>; 2]> {
    // This algorithm is a huge heuristic. Empirically it works really well, but I
    // can't motivate it well. The basic idea is to keep two centroids and assign
    // points to either one of them. We weight each centroid by the number of points
    // assigned to it, so to balance it.
    const ITERATION_STEPS: usize = 200;
    let [leaf_p, leaf_q] = leafs.choose_two(rng)?.unwrap();
    let mut leaf_p = leaf_p.into_owned();
    let mut leaf_q = leaf_q.into_owned();
    if cosine {
        D::normalize(&mut leaf_p);
        D::normalize(&mut leaf_q);
    }
    D::init(&mut leaf_p);
    D::init(&mut leaf_q);
    let mut ic = 1.0;
    let mut jc = 1.0;
    for _ in 0..ITERATION_STEPS {
        let node_k = leafs.choose(rng)?.unwrap();
        let di = ic * D::non_built_distance(&leaf_p, &node_k);
        let dj = jc * D::non_built_distance(&leaf_q, &node_k);
        let norm = if cosine { D::norm(&node_k) } else { 1.0 };
        if norm.is_nan() || norm <= 0.0 {
            continue;
        }
        if di < dj {
            Distance::update_mean(&mut leaf_p, &node_k, norm, ic);
            Distance::init(&mut leaf_p);
            ic += 1.0;
        } else if dj < di {
            Distance::update_mean(&mut leaf_q, &node_k, norm, jc);
            Distance::init(&mut leaf_q);
            jc += 1.0;
        }
    }
    Ok([leaf_p, leaf_q])
}
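
As a rough illustration of why the centroids might be better kept in full precision, the sketch below compares a running mean kept as `f32` with one that is re-quantized after every update. Everything in it is an assumption made for the example: the sign-based `binary_quantize`, the simplified `update_mean` (a plain running mean with the norm fixed to 1), and the `centroid` helper are hypothetical and do not reproduce arroy's `Distance` trait.

```rust
/// Hypothetical sign-based quantizer: maps each component to +1.0 or -1.0.
fn binary_quantize(v: &[f32]) -> Vec<f32> {
    v.iter().map(|&x| if x > 0.0 { 1.0 } else { -1.0 }).collect()
}

/// Simplified running-mean update (norm assumed to be 1):
/// mean <- (mean * count + point) / (count + 1)
fn update_mean(mean: &mut [f32], point: &[f32], count: f32) {
    for (m, p) in mean.iter_mut().zip(point) {
        *m = (*m * count + *p) / (count + 1.0);
    }
}

/// Folds `points` into a single centroid, optionally snapping the
/// centroid back to a binary code after every update.
fn centroid(points: &[Vec<f32>], requantize: bool) -> Vec<f32> {
    let mut mean = points[0].clone();
    if requantize {
        mean = binary_quantize(&mean);
    }
    let mut count = 1.0;
    for p in &points[1..] {
        update_mean(&mut mean, p, count);
        if requantize {
            mean = binary_quantize(&mean);
        }
        count += 1.0;
    }
    mean
}

fn main() {
    let points = vec![vec![0.9, 0.1], vec![0.8, -0.2], vec![0.7, -0.3]];

    // Full precision: the second component of the mean is negative.
    let full = centroid(&points, false);
    // Re-quantized each step: the centroid never leaves the corner
    // it started on, because each new point is outweighed by `count`.
    let quantized = centroid(&points, true);

    println!("full precision:  {:?}", full);
    println!("re-quantized:    {:?}", quantized);
    println!("quantized(full): {:?}", binary_quantize(&full));
}
```

In this toy case the re-quantized centroid stays on the corner of its first point, while quantizing the full-precision mean only at the end lands on a different corner. Whether the same effect shows up in arroy depends on how `update_mean` and `init` behave for the binary quantized distance.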
