Add binary format support with HNSW method in Faiss Engine #1781

Merged

Collaborator

@heemin32 heemin32 commented Jul 1, 2024

Description

What this PR has

Add binary format support with HNSW method in Faiss engine

What this PR does not have

  1. Search parameter support, which is required for a dynamic ef_search value in the query, efficient filtering, and multi-vector support. This will be handled in a separate PR.
  2. Script scoring

Changes

  1. Introduction of the UNDEFINED space type. The space type is parsed before the data type is known, and the default space type now depends on data_type: hamming for binary and l2 for everything else. Therefore, when the user does not provide a space type, the parser assigns UNDEFINED, and the mapper later replaces UNDEFINED with the proper default for the data type.
  2. Introduction of an isBinary parameter in the JNIService.free method. Because faiss::Index and faiss::IndexBinary do not share a common parent class, we cannot use a cast to determine which of the two a pointer refers to, so the caller has to state whether the index is binary. There is no separate freeBinaryIndex method because the C++ side has a single delete method, whereas index creation has a separate method for binary indexes (createBinaryIndex).
  3. Addition of the vector data type as a parameter in IndexEntryContext. It is used when loading an index into memory to determine whether the index is binary. For creation and query, the vector data type in fieldInfo is used to make the same decision.
  4. A new VectorTransfer, a new KNNIterator, and a byteQueryVector variable inside the query were needed because a binary vector has type byte[] whereas a non-binary vector has type float[]. Refactoring will be needed to improve maintainability; it is tracked in "[FEATURE] Refactoring of code for better maintainability" #1810.
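
The two-step default described in change 1 can be sketched roughly as follows. This is a simplified illustration, not the plugin's code: the enums, the `SpaceTypeResolver` class, and its `parse` and `resolve` methods are hypothetical stand-ins, and only the l2 and hamming values mentioned above are modeled.

```java
// Hypothetical, simplified stand-ins for the plugin's enums (assumption).
enum VectorDataType { BINARY, BYTE, FLOAT }

enum SpaceType { UNDEFINED, L2, HAMMING }

final class SpaceTypeResolver {
    private SpaceTypeResolver() {}

    // Parsing step: the data type is not known yet, so a missing value
    // is recorded as UNDEFINED instead of a concrete default.
    static SpaceType parse(String userValue) {
        if (userValue == null || userValue.isEmpty()) {
            return SpaceType.UNDEFINED;
        }
        switch (userValue) {
            case "l2":
                return SpaceType.L2;
            case "hamming":
                return SpaceType.HAMMING;
            default:
                throw new IllegalArgumentException("Unsupported space type: " + userValue);
        }
    }

    // Mapper step: the data type is known now, so UNDEFINED is replaced
    // by the default for that data type. Explicit values are kept as-is.
    static SpaceType resolve(SpaceType parsed, VectorDataType dataType) {
        if (parsed != SpaceType.UNDEFINED) {
            return parsed;
        }
        return dataType == VectorDataType.BINARY ? SpaceType.HAMMING : SpaceType.L2;
    }
}
```

For example, `SpaceTypeResolver.resolve(SpaceTypeResolver.parse(null), VectorDataType.BINARY)` yields `HAMMING`, while the same call with `VectorDataType.FLOAT` yields `L2`.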

Testing

  1. Create index
PUT /my-knn-index-1
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 8,
        "data_type": "binary",
        "method": {
          "name": "hnsw",
          "engine": "faiss"
        }
      }
    }
  }
}
  2. Ingest documents
POST /_bulk?refresh=true
{ "index": { "_index": "my-knn-index-1", "_id": "1" } }
{"my_vector" : [9]}
{ "index": { "_index": "my-knn-index-1", "_id": "2" } }
{"my_vector" : [14]}
{ "index": { "_index": "my-knn-index-1", "_id": "3" } }
{"my_vector" : [-100]}
  3. Query
POST /my-knn-index-1/_search
{
  "query": {
    "knn": {
      "my_vector": {
        "vector": [13],
        "k": 2
      }
    }
  }
}

Response

{
  "took": 81,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.5,
    "hits": [
      {
        "_index": "my-knn-index-1",
        "_id": "1",
        "_score": 0.5,
        "_source": {
          "my_vector": [
            9
          ]
        }
      },
      {
        "_index": "my-knn-index-1",
        "_id": "2",
        "_score": 0.33333334,
        "_source": {
          "my_vector": [
            14
          ]
        }
      }
    ]
  }
}
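
The scores in the response are consistent with a Hamming distance d converted to a score as 1 / (1 + d). With dimension 8, each vector is packed into a single byte, so the distance is the number of differing bits between two bytes. The sketch below reproduces the numbers under that assumption; the class and method names are illustrative and are not taken from the plugin.

```java
final class HammingScoreExample {
    private HammingScoreExample() {}

    // Number of differing bits between two packed binary vectors of equal length.
    static int hammingDistance(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        int distance = 0;
        for (int i = 0; i < a.length; i++) {
            // Mask to 8 bits so sign extension of negative bytes is not counted.
            distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        return distance;
    }

    // Assumed score translation: 1 / (1 + distance).
    static float score(byte[] query, byte[] doc) {
        return 1.0f / (1.0f + hammingDistance(query, doc));
    }
}
```

For the query vector [13], document 1 ([9]) differs in one bit and scores 0.5, document 2 ([14]) differs in two bits and scores 0.33333334, and document 3 ([-100]) differs in three bits and would score 0.25, which is why it is not among the top two hits.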

Issues Resolved

#1767

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@heemin32 heemin32 force-pushed the binary-java branch 2 times, most recently from 66419ea to d12ac25 Compare July 1, 2024 21:25
@heemin32 heemin32 force-pushed the binary-java branch 3 times, most recently from 95e0fbe to 6840c3e Compare July 2, 2024 04:00
@heemin32 heemin32 force-pushed the feature/binary-format branch 8 times, most recently from dafd79b to a913082 Compare July 3, 2024 16:40
@heemin32 heemin32 force-pushed the binary-java branch 10 times, most recently from f310113 to b6c9c6e Compare July 8, 2024 19:12
@heemin32 heemin32 marked this pull request as ready for review July 8, 2024 19:13
@heemin32 heemin32 changed the title Implement binary format support Add binary format support with HNSW method in Faiss Engine Jul 9, 2024
@heemin32 heemin32 requested a review from navneet1v July 9, 2024 20:36
protected float computeScore() throws IOException {
final BytesRef value = binaryDocValues.binaryValue();
final ByteArrayInputStream byteStream = new ByteArrayInputStream(value.bytes, value.offset, value.length);
final byte[] vector = byteStream.readAllBytes();
Collaborator

This should be behind the VectorSerializer interface; otherwise there is no common way to read binary vectors and float vectors.
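
One way to picture the suggestion is a single reading interface with one implementation per vector type, so callers such as computeScore() do not decode bytes themselves. The sketch below is an assumption made for illustration: `VectorReader` and both implementations are hypothetical names, and the big-endian float layout is assumed rather than taken from the plugin's actual serialization format.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical common interface for reading a stored vector (assumption).
interface VectorReader<T> {
    T read(byte[] bytes, int offset, int length);
}

// Binary vectors are stored as raw bytes, so reading is a bounded copy.
final class BinaryVectorReader implements VectorReader<byte[]> {
    @Override
    public byte[] read(byte[] bytes, int offset, int length) {
        return Arrays.copyOfRange(bytes, offset, offset + length);
    }
}

// Float vectors are assumed here to be stored as consecutive big-endian floats.
final class FloatVectorReader implements VectorReader<float[]> {
    @Override
    public float[] read(byte[] bytes, int offset, int length) {
        if (length % Float.BYTES != 0) {
            throw new IllegalArgumentException("Length is not a multiple of " + Float.BYTES);
        }
        float[] vector = new float[length / Float.BYTES];
        ByteBuffer.wrap(bytes, offset, length)
            .order(ByteOrder.BIG_ENDIAN)
            .asFloatBuffer()
            .get(vector);
        return vector;
    }
}
```

With this shape, the scoring code would call `reader.read(value.bytes, value.offset, value.length)` and leave the choice of implementation to whoever knows the field's data type.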

import java.util.List;

/**
* Vector transfer for float
Collaborator

Better Javadoc needed.

Collaborator Author

This is a child class of VectorTransfer. Wouldn't the Javadoc in VectorTransfer be enough?

abstract public void init(final long totalLiveDocs);

/**
* Transfer a single vector
Collaborator

The documentation is incorrect: we are not transferring one vector here, we are transferring more than one.

Collaborator Author

@heemin32 heemin32 Jul 9, 2024

I meant that this method is called once per vector.
Yes, we buffer vectors internally and transfer more than one at a time later, but that is hidden from the caller.

/**
* Close the transfer
*/
abstract public void close();
Collaborator

How are we ensuring that the close method is always called?

Collaborator Author

I thought it was the caller's responsibility to call close. Any ideas on how to enforce it?

Collaborator Author

Let me make it a Closeable class.

Collaborator Author

I think I got your point. I think it should be refactored so that there is no dependency between methods.
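
The idea discussed in this thread, a transfer that is called once per vector, buffers internally, and flushes whatever is left when closed, can be sketched as below. This is only an illustration under assumptions: `BufferedTransfer`, `CountingTransfer`, and `writeBatch` are hypothetical names, and the real VectorTransfer hands batches to native memory rather than counting them.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical buffering transfer; not the plugin's VectorTransfer.
abstract class BufferedTransfer<T> implements Closeable {
    private final int batchSize;
    private final List<T> buffer = new ArrayList<>();

    BufferedTransfer(int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
    }

    // Called once per vector; the batching is hidden from the caller.
    final void transfer(T vector) {
        buffer.add(vector);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        writeBatch(new ArrayList<>(buffer));
        buffer.clear();
    }

    // Subclasses decide where a full batch goes.
    protected abstract void writeBatch(List<T> batch);

    // Flushes the remainder so no buffered vector is lost.
    @Override
    public void close() {
        flush();
    }
}

// Toy implementation that only counts what it receives.
final class CountingTransfer extends BufferedTransfer<byte[]> {
    int batches;
    int vectors;

    CountingTransfer(int batchSize) {
        super(batchSize);
    }

    @Override
    protected void writeBatch(List<byte[]> batch) {
        batches++;
        vectors += batch.size();
    }
}
```

Used in a try-with-resources statement, such as `try (CountingTransfer t = new CountingTransfer(2)) { ... }`, close() runs even if the body throws, which addresses the question of how to make sure it is always called without relying on caller discipline.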

@heemin32
Collaborator Author

heemin32 commented Jul 10, 2024

@navneet1v I captured all the comments related to code refactoring in #1810. Please update the issue if I missed any. I will follow up after completing this feature implementation.
The reason I want to push this code as quickly as possible, without doing all the suggested refactoring, is to leave enough time to review the IVF implementation, which was developed on top of this PR, so that complete binary format support can be released in 2.16.

@heemin32 heemin32 requested a review from navneet1v July 10, 2024 03:58
Member

@jmazanec15 jmazanec15 left a comment


I think for a big PR like this, it would be helpful to give a high-level summary of what you had to change in the implementation and why. For instance, explain the different components that changed, such as why we are changing the free API to take a boolean. This will make review a lot easier, and there will be less back and forth.

@navneet1v
Collaborator

> @navneet1v I captured all the comments related to code refactoring in #1810. Please update the issue if I missed any. I will follow up after completing this feature implementation. The reason I want to push this code as quickly as possible, without doing all the suggested refactoring, is to leave enough time to review the IVF implementation, which was developed on top of this PR, so that complete binary format support can be released in 2.16.

Multiple refactorings have been suggested in the code. My worry is that the new features we are working on, including disk-based vector search and reducing the memory footprint during index creation, all touch the same code path. Doing the refactoring later will impact those features.

@heemin32
Collaborator Author

> > @navneet1v I captured all the comments related to code refactoring in #1810. Please update the issue if I missed any. I will follow up after completing this feature implementation. The reason I want to push this code as quickly as possible, without doing all the suggested refactoring, is to leave enough time to review the IVF implementation, which was developed on top of this PR, so that complete binary format support can be released in 2.16.
>
> Multiple refactorings have been suggested in the code. My worry is that the new features we are working on, including disk-based vector search and reducing the memory footprint during index creation, all touch the same code path. Doing the refactoring later will impact those features.

We can discuss offline. My plan is to work on the refactoring after the 2.16 release, so the delay between now and then won't be long.

@heemin32
Collaborator Author

Ran ./gradlew :spotlessApply to fix build failure.

@heemin32 heemin32 merged commit 5d008cd into opensearch-project:feature/binary-format Jul 12, 2024
47 of 49 checks passed
naveentatikonda pushed a commit that referenced this pull request Jul 12, 2024
heemin32 added a commit that referenced this pull request Jul 16, 2024
…n Faiss Engine (#1829)

* Add faiss custom patch to support search parameter in binary index (#1815)

Signed-off-by: Heemin Kim <[email protected]>

* Add binary format support with HNSW method in Faiss Engine (#1781)

Signed-off-by: Heemin Kim <[email protected]>

---------

Signed-off-by: Heemin Kim <[email protected]>
heemin32 added a commit to heemin32/k-NN that referenced this pull request Jul 16, 2024
…n Faiss Engine (opensearch-project#1829)

(cherry picked from commit fe1d86f)
heemin32 added a commit that referenced this pull request Jul 16, 2024
…n Faiss Engine (#1829) (#1834)

(cherry picked from commit fe1d86f)
ryanbogan pushed a commit to ryanbogan/k-NN that referenced this pull request Jul 16, 2024
…n Faiss Engine (opensearch-project#1829)
