[Bug]: BM25Retriever retrieval nodes, top_n indexes and scores do not match #14442

huanghao128 · 2024-06-28T09:55:28Z

Bug Description

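BM25Retriever pairs retrieved nodes with the wrong scores: the top_n indexes and the scores do not match. top_n holds node indexes ranked by score, while scores is still in corpus order, so zipping the two assigns each node the score of a different node.
Location: llama_index/retrievers/bm25/base.py, line 92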

for ix, score in zip(top_n, scores):
    nodes.append(NodeWithScore(node=self._nodes[ix], score=float(score)))

Suggested fix (look the score up by node index):

for ix, score in zip(top_n, scores):
    nodes.append(NodeWithScore(node=self._nodes[ix], score=float(scores[ix])))

Version

=v0.10.46

Steps to Reproduce

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.retrievers.bm25 import BM25Retriever

documents = [
    Document(text="Large Language Model"),
    Document(text="LlamaIndex is a data framework for your LLM application"),
    Document(text="How to use LlamaIndex")
]

splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)

retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=2)

result_nodes = retriever.retrieve("llamaindex llm")
for node in result_nodes:
    print(node)

Relevant Logs/Tracebacks

Error Output:
Node ID: d9ed555e-bda4-4c45-aa21-dcd852e54d35
Text: LlamaIndex is a data framework for your LLM application
Score:  0.000

Node ID: 44962ba7-79f2-479c-a817-274290c6f626
Text: How to use LlamaIndex
Score:  0.498

Expected Output:
Node ID: e55c811f-94d9-444e-9ff1-518a5517238c
Text: LlamaIndex is a data framework for your LLM application
Score:  0.498

Node ID: 3645e578-a7fe-45e7-b139-311c54158bc4
Text: How to use LlamaIndex
Score:  0.121
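
The mismatch between the two outputs can be reproduced without the retriever. The sketch below is only an illustration: the score values are taken from the outputs above, and it assumes top_n is produced by an argsort over the scores (a plain-Python sort stands in for the NumPy call here).

```python
# Illustrative per-node scores in corpus order, taken from the outputs above.
scores = [0.0, 0.498, 0.121]
top_k = 2

# Indexes of the best-scoring nodes, highest first. This plain-Python sort
# stands in for scores.argsort()[::-1][:top_k] on a NumPy array.
top_n = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]

# Buggy pairing: zip walks `scores` in corpus order, not in ranked order.
buggy = [(ix, score) for ix, score in zip(top_n, scores)]

# Fixed pairing: look each score up by its node index.
fixed = [(ix, scores[ix]) for ix in top_n]

print(buggy)  # [(1, 0.0), (2, 0.498)]
print(fixed)  # [(1, 0.498), (2, 0.121)]
```

The buggy pairs match the Error Output (node 1 with 0.000, node 2 with 0.498), and the fixed pairs match the Expected Output.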

dosubot · 2024-06-28T11:32:27Z

Hey @huanghao128! I'm here to help you with any bugs, questions, or contributions you have for the repository. Let me know how I can assist you!

The issue in the BM25Retriever class at line 92 of llama_index/retrievers/bm25/base.py can be fixed by indexing scores with each top_n index, so that every node is paired with its own score. The suggested fix is correct. Since the zipped score is no longer used, the loop can iterate over top_n alone:

for ix in top_n:
    nodes.append(NodeWithScore(node=self._nodes[ix], score=float(scores[ix])))

This ensures that the scores are correctly matched with their corresponding nodes. The updated _retrieve method should look like this:

def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
    if query_bundle.custom_embedding_strs or query_bundle.embedding:
        logger.warning("BM25Retriever does not support embeddings, skipping...")

    query = query_bundle.query_str
    tokenized_query = self._tokenizer(query)
    scores = self.bm25.get_scores(tokenized_query)

    top_n = scores.argsort()[::-1][: self._similarity_top_k]

    nodes: List[NodeWithScore] = []
    for ix in top_n:
        nodes.append(NodeWithScore(node=self._nodes[ix], score=float(scores[ix])))

    return nodes

This change ensures that the scores are correctly associated with the nodes they correspond to ^[1].
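
A quick sanity check for this kind of bug is that ranked results should come back with non-increasing scores. The sketch below is not part of the library: ScoredStub is a hypothetical stand-in for NodeWithScore, and scores_are_ranked is a hypothetical helper. It would have flagged the output in this report, though it cannot catch every mispairing, since corpus order can happen to be descending.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ScoredStub:
    """Hypothetical stand-in for NodeWithScore; only `score` is inspected."""
    text: str
    score: float


def scores_are_ranked(results: Iterable[ScoredStub]) -> bool:
    """Return True if scores never increase from one result to the next."""
    values = [r.score for r in results]
    return all(a >= b for a, b in zip(values, values[1:]))


# Scores as reported in the issue (buggy) and as expected (fixed).
reported = [
    ScoredStub("LlamaIndex is a data framework for your LLM application", 0.0),
    ScoredStub("How to use LlamaIndex", 0.498),
]
expected = [
    ScoredStub("LlamaIndex is a data framework for your LLM application", 0.498),
    ScoredStub("How to use LlamaIndex", 0.121),
]

print(scores_are_ranked(reported))  # False
print(scores_are_ranked(expected))  # True
```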


huanghao128 added the "bug" (Something isn't working) and "triage" (Issue needs to be triaged/prioritized) labels on Jun 28, 2024
