
Commit

Update dedup.md (#334)
Changed some language and layout of the section.
jonfritz committed Apr 2, 2024
1 parent e3d44f8 commit 8b7190b
Showing 1 changed file with 7 additions and 7 deletions.
14 changes: 7 additions & 7 deletions docs/source/querying_data/dedup.md
@@ -1,6 +1,8 @@
# De-Duplicating Query Results

As mentioned in [Remote Search Processors](remote_processors.md), the Sycamore near-duplicate detection (NDD) facility can be used to drop duplicates from query results. This is implemented as a remote search response processor called `dedup-response`. It's configured in `pipelines.yml` like so:
The Sycamore near-duplicate detection (NDD) feature can be used to drop duplicates from query results. It is implemented as a [Remote Search Processor](remote_processors.md) called `dedup-response`, which you can add to your hybrid search or RAG search pipelines. A prerequisite for NDD is to have previously ingested the documents using the `Sketcher` Sycamore transform. See the documentation for [sketch](../data_ingestion_and_preparation/transforms/sketch.md) in `DocSet` for details. Sycamore's default data ingestion and search pipelines have sketching and NDD enabled.

NDD is configured in `pipelines.yml` with several preset values:

```yaml
- dedup00:
@@ -41,13 +43,13 @@ As mentioned in [Remote Search Processors](remote_processors.md), the Sycamore n
threshold: 0.55
```

Each preset has one parameter, `threshold`, which controls how aggressively NDD drops documents. Near 0.0, few documents are removed, and only those that are practically identical to a higher-scoring document. Above 1.0, every document except the first one is removed.

The current implementation of NDD uses "shingles" which consist of 16 hash values. The distance between two documents is the number of hash values that differ between the two documents' shingles. The raw number is between 0 and 16, but we normalize it to between 0.0 and 1.0. The logic is basically: `if distance < threshold, drop`. Our default preset `dedup02` allows two hashes to differ. That would make the threshold 2 / 16, or 0.125, but we need to add a smidge because it's a less-than comparison. So, we round to 0.15.
The current implementation of NDD uses "shingles", each of which consists of 16 hash values. The distance between two documents is the number of hash values that differ between the two documents' shingles. The raw number is between 0 and 16, but it is normalized to the range 0.0 to 1.0. The logic is basically: `if distance < threshold, drop the result`.
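
The drop rule above can be sketched in a few lines of Python. This is an illustration only, not Sycamore's implementation: the function names, the `shingles` list layout, and the choice to count "differing" hashes as those not shared between the two sets are all assumptions made for the example.

```python
from typing import Sequence

NUM_HASHES = 16  # shingle size stated in the text


def shingle_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Normalized distance in [0.0, 1.0].

    Assumption: a hash "differs" if it is not shared by both shingles.
    """
    shared = len(set(a) & set(b))
    return (NUM_HASHES - shared) / NUM_HASHES


def dedup(results: list[dict], threshold: float) -> list[dict]:
    """Walk results in score order and drop any result whose distance
    to an already-kept (higher-scoring) result is below the threshold."""
    kept: list[dict] = []
    for doc in results:
        if any(
            shingle_distance(doc["shingles"], k["shingles"]) < threshold
            for k in kept
        ):
            continue  # near-duplicate of a higher-scoring result
        kept.append(doc)
    return kept
```

Under these assumptions, a threshold above 1.0 keeps only the first result, and a threshold of 0.0 keeps everything, which matches the behavior described for `threshold`.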

A prerequisite for query-time NDD is to have previously ingested the documents using the `Sketcher` Sycamore transform. See documentation for [sketch](../data_ingestion_and_preparation/transforms/sketch.md) in `DocSet` for details.
Sycamore's default hybrid search and RAG pipelines use `dedup02`, which allows two hashes to differ. That would make the threshold 2 / 16, or 0.125, but the value must be set slightly higher because the check is a strict less-than comparison: a distance of exactly 0.125 is not less than 0.125.
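
The arithmetic behind that adjustment can be checked directly. The helper below is hypothetical and exists only for this example; 0.15 is the rounded value mentioned in the earlier wording of this paragraph, not a value read from `pipelines.yml`.

```python
NUM_HASHES = 16  # shingle size stated in the text


def is_dropped(differing_hashes: int, threshold: float) -> bool:
    """Apply the strict less-than rule to a raw count of differing hashes."""
    distance = differing_hashes / NUM_HASHES  # normalize to 0.0..1.0
    return distance < threshold


# A threshold of exactly 2/16 misses documents that differ in two hashes,
# because 0.125 < 0.125 is false.
exact = is_dropped(2, 0.125)

# A slightly higher threshold catches them but still spares documents
# that differ in three hashes (3/16 = 0.1875).
two_hashes = is_dropped(2, 0.15)
three_hashes = is_dropped(3, 0.15)
```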

In order for the `dedup-response` processor to do its job, it must be able to access the `shingles` field of each retrieved document. This will happen by default if the OpenSearch query does not specify `_source`. Otherwise, it needs to be listed specifically like so:
The `dedup-response` processor requires the `shingles` field of each document retrieved in the hybrid search step of a pipeline. The field is returned by default if the OpenSearch query does not specify `_source`. Otherwise, it must be listed explicitly:

```
{
@@ -62,5 +64,3 @@ In order for the `dedup-response` processor to do its job, it must be able to ac
}
}
```
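
If queries are built in code, a small guard can make sure `shingles` is always requested. The sketch below is a hypothetical helper, not part of Sycamore or the OpenSearch client. It assumes the query body is a plain Python dict and only handles `_source` given as a string or a list of field names; the object form with `includes`/`excludes`, and `_source: false`, are left untouched.

```python
import copy


def with_shingles(query: dict) -> dict:
    """Return a copy of the query body whose `_source` list includes `shingles`."""
    body = copy.deepcopy(query)  # do not mutate the caller's query
    source = body.get("_source")
    if source is None or source is True:
        return body  # the full source, including shingles, is returned
    if isinstance(source, str):
        source = [source]  # a single field name or pattern
    if isinstance(source, list):
        if "shingles" not in source:
            source = source + ["shingles"]
        body["_source"] = source
    return body
```

For example, `with_shingles({"_source": ["text_representation"]})` would return a body whose `_source` is `["text_representation", "shingles"]`; the field name `text_representation` here is only a placeholder.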

The practical effect of this is that it's possible to enable or disable NDD by controlling the contents of `_source`.
