Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reuse KNNVectorFieldData for reduce disk usage #1571

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

luyuncheng
Copy link
Collaborator

@luyuncheng luyuncheng commented Mar 20, 2024

Description

in some scenarios, we want to reduce the disk usage and io throughput for the source field. so, we would excludes knn fields in mapping which do not store the source like( this would make knn field can not be retrieve and rebuild)

"mappings": { 
  "_source": { 
    "excludes": [
      "target_field1",
      "target_field2",
     ]
  }
}

so I propose to use doc_values field for the vector fields. like:

POST some_index/_search
{
  "docvalue_fields": [
    "vector_field1",
    "vector_field2",
  ],
  "_source": false
}'

Proposal

  1. Rewrite KNNVectorDVLeafFieldData get data from docvalues

i rewrite KNNVectorDVLeafFieldData and make KNN80BinaryDocValues can return the specific knn docvalue_fields like: (vector_field1 is knn field type)

"hits":[{"_index":"test","_id":"1","_score":1.0,"fields":{"vector_field1":["1.5","2.5"]}},{"_index":"test","_id":"2","_score":1.0,"fields":{"vector_field1":["2.5","1.5"]}}]

optimize result:
1m SIFT dataset, 1 shard,
with source store: 1389MB
without source store: 1055MB(-24%)

for the continues dive in to knndocvalues fields, I think when use faiss engine, we can use reconstruct_n interface to retrieve the specific doc values and save the disk usage for BinaryDocValuesFormat. or like this issue comments for redesign a KnnVectorsFormat

  1. composite vector field to _source

I added KNNFetchSubPhase and add a processor like FetchSourcePhase#FetchSubPhaseProcessor to combine the docvalue_fields into _source something like synthetic logic

Issues Resolved

#1087
#1572

  • 1st I made KNNVectorDVLeafFieldData can return the vectorDocValue fields like script do.
  • 2nd I write a KNNFetchSubPhase class which add a process in fetch phase, and it can fulfill the _source with 1st step docValues fields response. and this way something like synthetic source but need explicit add value from search body like docvalue_fields

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@luyuncheng
Copy link
Collaborator Author

@navneet1v Easy test:

  1. create index with
 {"mappings":{"_source": {"excludes": ["vector_field1"] }, "properties": {"vector_field1": {"type": "knn_vector", "dimension": 2 }, "vector_field2": {"type": "knn_vector", "dimension": 4 }, "number_field": {"type": "long"} } } }
  1. write data with
{"vector_field1" : [1.5, 2.5], "vector_field2" : [1.0, 2.0, 3.0, 4.0], "number_field":10 }
  1. POST test/_search
response do not contain vector_field1
  1. POST test/_search
{"docvalue_fields": ["vector_field1"] }
response contains vector_field1 in _source and in fields

@navneet1v
Copy link
Collaborator

@navneet1v Easy test:

  1. create index with
 {"mappings":{"_source": {"excludes": ["vector_field1"] }, "properties": {"vector_field1": {"type": "knn_vector", "dimension": 2 }, "vector_field2": {"type": "knn_vector", "dimension": 4 }, "number_field": {"type": "long"} } } }
  1. write data with
{"vector_field1" : [1.5, 2.5], "vector_field2" : [1.0, 2.0, 3.0, 4.0], "number_field":10 }
  1. POST test/_search
response do not contain vector_field1
  1. POST test/_search
{"docvalue_fields": ["vector_field1"] }
response contains vector_field1 in _source and in fields

My question was how we are ensuring that KNNSubphase is not running during the search and running only during the re-indexing.

@luyuncheng
Copy link
Collaborator Author

My question was how we are ensuring that KNNSubphase is not running during the search and running only during the re-indexing.

@navneet1v gotcha, I will do the continues tests like reindex and other scenarios.

@navneet1v
Copy link
Collaborator

My question was how we are ensuring that KNNSubphase is not running during the search and running only during the re-indexing.

@navneet1v gotcha, I will do the continues tests like reindex and other scenarios.

Also there is something called as _recovery_source which is added as a fallback to support the re-indexing. If you are testing locally I would recommend to remove these line of code
https://github.com/opensearch-project/OpenSearch/blob/e6975e412b09a8d82675edd9a43c20f7c325c0f9/server/src/main/java/org/opensearch/index/mapper/SourceFieldMapper.java#L215-L219

To ensure that recovery source is never created. This recovery source gets deleted after some when if indexing is happening continuously, but I have never tested this to understand does this really happen or not.

@luyuncheng
Copy link
Collaborator Author

My question was how we are ensuring that KNNSubphase is not running during the search and running only during the re-indexing.

@navneet1v i tested search source and reindex scenarios with KNNSubPhase, it shows correctly without nested field.

@luyuncheng
Copy link
Collaborator Author

luyuncheng commented Apr 7, 2024

Why cant we have public BytesRef nextValue() throws IOException { return the whole string represetation of the current vector?

@jmazanec15 as I see, we say synthetic source field is type XContentType.JSON.xContent . but SortedBinaryDocValues#BytesRef nexValue() is bin bytes array. I am not sure we need trans the bin bytes in DocValuesFormat to utf-8 JSON like format bytes string.

also, I see we are trying to reconstruct the vector format with KnnVectorsFormat. so I think in KnnVectorsFormat we can simply rewrite docvalues with JSON.xContent format.

but I am not sure Which SortedBinaryDocValues#BytesRef nexValue() format is better(bin, or one double value).

what do you think which is better.

@luyuncheng
Copy link
Collaborator Author

luyuncheng commented Apr 7, 2024

Also there is something called as _recovery_source which is added as a fallback to support the re-indexing.

@navneet1v I added IT tests for the search, and reindex scenarios, I think it works with knnFetchSubPhase to synthesize the _source field

@navneet1v
Copy link
Collaborator

@navneet1v @jmazanec15 i rebased it from master. and would you pls review it?

and i see @navneet1v have an issue: opensearch-project/OpenSearch#13490, i am wondering this pr could be merged or blocked until it finished?

@luyuncheng this PR is not dependent on the issue which I have raised in core.

We can merge this PR once we have nested fields use case also handled. Last I remember the nested fields were not handled in this PR. Ref:#1571 (comment)

@navneet1v
Copy link
Collaborator

My question was how we are ensuring that KNNSubphase is not running during the search and running only during the re-indexing.

@navneet1v i tested search source and reindex scenarios with KNNSubPhase, it shows correctly without nested field.

can we test update use case too. Ref: #1694

@luyuncheng
Copy link
Collaborator Author

luyuncheng commented May 14, 2024

@navneet1v @jmazanec15

At 469fbf3 i added nested field process logic. and added tests for update, search, reindex, nested scenarios.

can we test update use case too. Ref: #1694

@navneet1v also added update tests, it shows works only for update vector field, like following tests:

public void testSyntheticSourceUpdate_whenEnabledSynthetic_thenReturnSource() throws IOException, ParseException {

but when excluded from _source and do update operation, it goes to logic:

https://github.com/opensearch-project/OpenSearch/blob/14f1c43c108f378b13d109ade364216c082fb858/server/src/main/java/org/opensearch/index/engine/InternalEngine.java#L1311-L1318

it using lucene source to do update. as i know, in the original reference there is a warning that when exclude source, we can not use update, update_by_query, reindex APIs

there is 2 scenarios:

  1. exclude vector, update vector field: OK
  2. exclude vector, update other field: Failed

@navneet1v
Copy link
Collaborator

@navneet1v @jmazanec15

At 469fbf3 i added nested field process logic. and added tests for update, search, reindex, nested scenarios.

can we test update use case too. Ref: #1694

@navneet1v also added update tests, it shows works only for update vector field, like following tests:

public void testSyntheticSourceUpdate_whenEnabledSynthetic_thenReturnSource() throws IOException, ParseException {

but when excluded from _source and do update operation, it goes to logic:

https://github.com/opensearch-project/OpenSearch/blob/14f1c43c108f378b13d109ade364216c082fb858/server/src/main/java/org/opensearch/index/engine/InternalEngine.java#L1311-L1318

it using lucene source to do update. as i know, in the original reference there is a warning that when exclude source, we can not use update, update_by_query, reindex APIs

there is 2 scenarios:

  1. exclude vector, update vector field: OK
  2. exclude vector, update other field: Failed

The #1 scenario will work if you exclude the vector field from _source and then go ahead and update vector field only. I think this will even work if we don't have this capability, because we can gather other fields from this _source.

The thing which I was really interested in testing was #2 and seems like it is not possible.

@luyuncheng
Copy link
Collaborator Author

luyuncheng commented May 15, 2024

The thing which I was really interested in testing was #2 and seems like it is not possible.

@navneet1v i added tests at

public void testSyntheticSourceUpdateOtherField_whenEnabledSynthetic_thenReturnNothing() throws IOException, ParseException {

2nd scenarios i think it is because that we synthetic at KNNFetchSubPhase which is in FetchPhase.

but Update Logic in InternalEngine#L1311-L1318 fetch _source from storeField which is not stored in _source storeField

@navneet1v
Copy link
Collaborator

2nd scenarios i think it is because that we synthetic at KNNFetchSubPhase which is in FetchPhase.

Yes I thought so.

@luyuncheng luyuncheng force-pushed the DVFieldData branch 3 times, most recently from 0383372 to 72081bb Compare May 22, 2024 14:00
@luyuncheng
Copy link
Collaborator Author

@navneet1v @jmazanec15 i resolved all review comments and all tests passed.

@bugmakerrrrrr
Copy link
Contributor

bugmakerrrrrr commented May 27, 2024

AFAIK, the plugin's sub fetch phases will run after the OS core engine's sub fetch phases, which includes the FetchSourcePhase used for source filtering. Therefore, if I exclude a vector field (e.g. field1) in the index mappings and activate the synthetic logic introduced in this PR, I believe that when I submit a search request with the source filter logic ("_source": {"exclude": "field1"}), field1 will still be present in the response.

In current OS implementation, sub fetch phase has no order concept like ActionFilter, so we can't arrange them. We may need to introduce the filter logic in KNNFetchSubPhase if we don't change the logic of the core engine.

Please let me know if I have misunderstand it.

@luyuncheng
Copy link
Collaborator Author

luyuncheng commented May 27, 2024

AFAIK, the plugin's sub fetch phases will run after the OS core engine's sub fetch phases, which includes the FetchSourcePhase used for source filtering. Therefore, if I exclude a vector field (e.g. field1) in the index mappings and activate the synthetic logic introduced in this PR, I believe that when I submit a search request with the source filter logic ("_source": {"exclude": "field1"}), field1 will still be present in the response. Please let me know if I have misunderstand it.

@bugmakerrrrrr
synthetic is used for reducing disk usages which in majority scenarios we use vectors for search not for fetch, but we still need it in reindex. at the same time we can reduce the _source store field io usage. so the KNNFetchSubPhase can make it worked when disable mapping source which stored in _source field like followings.

"mappings":{"_source": {"excludes": ["vector_field1"] } } }

@luyuncheng luyuncheng requested a review from navneet1v May 29, 2024 15:52
@navneet1v
Copy link
Collaborator

In current OS implementation, sub fetch phase has no order concept like ActionFilter, so we can't arrange them. We may need to introduce the filter logic in KNNFetchSubPhase if we don't change the logic of the core engine.

@bugmakerrrrrr

The fetch subphases runs in a order ref. The order is first all the Fetch subphases defined in Opensearch core will run and then all the plugins phases are run. The only exception for this is the InnerHit Fetchphase. The inner hit fetch phase is run at the end. Ref: https://github.com/opensearch-project/OpenSearch/blob/52b27f47bca5b3ab52cab237542f32c307d203b4/server/src/main/java/org/opensearch/search/fetch/FetchPhase.java#L104-L107 The order of these phases cannot be changed.

@luyuncheng
Copy link
Collaborator Author

The only exception for this is the InnerHit Fetchphase. The inner hit fetch phase is run at the end. Ref: https://github.com/opensearch-project/OpenSearch/blob/52b27f47bca5b3ab52cab237542f32c307d203b4/server/src/main/java/org/opensearch/search/fetch/FetchPhase.java#L104-L107 The order of these phases cannot be changed.

@navneet1v as i see the logic for InnerHit Fetchphase, it used for fulfill the response in inner_hits field,

"_source": {}
"inner_hits": {
  "<inner_hits_name>": {
    "hits": {
        "total": ...,
        "hits": [
          {
            "_type": ...,
            "_id": ...,
            "_source": ...,
             ...
          },
       ]
    }
  }
}

But KNNFetchphase fulfill the response in _source field. so the exception is we can not fulfill the _source filed in inner_hits but can get the parent topLevel _source field

@bugmakerrrrrr
Copy link
Contributor

The order of these phases cannot be changed.

@navneet1v Indeed, this is the key point that I want to emphasize, and it is precisely why I suggest that we consider incorporating the filter logic that you mentioned in your comment into the KNNFetchSubPhase. Otherwise, it will cause conflicts at the API level (I requested to exclude certain fields in the response, but they appeared in the response). Or if it is too complex to implement the filter logic, we can consider it as a limitation and clearly mark it in the document.

@luyuncheng
Copy link
Collaborator Author

The order of these phases cannot be changed.

@navneet1v Indeed, this is the key point that I want to emphasize, and it is precisely why I suggest that we consider incorporating the filter logic that you mentioned in your comment into the KNNFetchSubPhase. Otherwise, it will cause conflicts at the API level (I requested to exclude certain fields in the response, but they appeared in the response). Or if it is too complex to implement the filter logic, we can consider it as a limitation and clearly mark it in the document.

LGTM, i like it.

@navneet1v
Copy link
Collaborator

The order of these phases cannot be changed.

@navneet1v Indeed, this is the key point that I want to emphasize, and it is precisely why I suggest that we consider incorporating the filter logic that you mentioned in your comment into the KNNFetchSubPhase. Otherwise, it will cause conflicts at the API level (I requested to exclude certain fields in the response, but they appeared in the response). Or if it is too complex to implement the filter logic, we can consider it as a limitation and clearly mark it in the document.

@luyuncheng and @bugmakerrrrrr agreed.

Copy link
Contributor

@bugmakerrrrrr bugmakerrrrrr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@navneet1v @luyuncheng I've checked the filter logic in FetchSourcePhase, and I think that it's too complicated to implement in this subphase.

@navneet1v
Copy link
Collaborator

@luyuncheng can we fix up the comments and so that we can merge this change?

Signed-off-by: luyuncheng <[email protected]>
@luyuncheng
Copy link
Collaborator Author

@luyuncheng can we fix up the comments and so that we can merge this change?

@navneet1v FIXED at 2a61fcd

@jmazanec15
Copy link
Member

Thanks @luyuncheng. I have been reviewing and I think overall it looks good. Im still not confident on the nested portion, particularly the innerProcessOneNestedField. Could you add comments or explain more whats happening here?

Also, can we capture a list of known limitations in the issue? Somewhere we can refer to when developing the documentation what can and cannot be done with this feature. When testing, here is what I have:

  1. [nested] passing inner_hits: {} for query results in an exception - inner_hits will not work
  2. [nested] nested field partial doc only have vector and source exclude (from comment)
  3. [non-nested] If source is disabled for the mapping completely, (i.e. "mappings": {"_source": {"enabled": false,"recovery_source_enabled": false} synthetic source will not work. So, in order for synthetic source to work, the vector fields need to be excluded explicitly

Also, do we know if it works with partially constructed non-nested documents? Are there any other limitations for non-nested case?

The functionality that will work:

  1. Basic search when vector field is excluded from source, the vectors will be included
  2. Basic nested search, the vectors will show up
  3. Reindex of non-nested indices
  4. Reindex of nested indices

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants