[Bug]: BM25Retriever retrieval nodes, top_n indices and scores do not match #14442

Open
huanghao128 opened this issue Jun 28, 2024 · 1 comment
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

huanghao128 commented Jun 28, 2024

Bug Description

In BM25Retriever, the retrieved nodes, the top_n indices, and the scores do not match. The problem is in llama_index/retrievers/bm25/base.py, line 92:

for ix, score in zip(top_n, scores):
    nodes.append(NodeWithScore(node=self._nodes[ix], score=float(score)))

A possible fix is to look up each score by its index instead of taking it from zip:

for ix in top_n:
    nodes.append(NodeWithScore(node=self._nodes[ix], score=float(scores[ix])))

Version

=v0.10.46

Steps to Reproduce

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.retrievers.bm25 import BM25Retriever

documents = [
    Document(text="Large Language Model"),
    Document(text="LlamaIndex is a data framework for your LLM application"),
    Document(text="How to use LlamaIndex")
]

splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)

retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=2)

result_nodes = retriever.retrieve("llamaindex llm")
for node in result_nodes:
    print(node)

Relevant Logs/Tracebacks

Error Output:
Node ID: d9ed555e-bda4-4c45-aa21-dcd852e54d35
Text: LlamaIndex is a data framework for your LLM application
Score:  0.000

Node ID: 44962ba7-79f2-479c-a817-274290c6f626
Text: How to use LlamaIndex
Score:  0.498

Expected Output:
Node ID: e55c811f-94d9-444e-9ff1-518a5517238c
Text: LlamaIndex is a data framework for your LLM application
Score:  0.498

Node ID: 3645e578-a7fe-45e7-b139-311c54158bc4
Text: How to use LlamaIndex
Score:  0.121
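
The two outputs above can be reproduced without llama_index. The following is a minimal sketch, assuming the BM25 scores for the three documents are [0.0, 0.498, 0.121] in document order. Those values are read off the logs above, and the 0.0 for the first document is inferred from the error output. A pure-Python sort stands in for `scores.argsort()[::-1]`, and the names `buggy` and `fixed` are illustrative only.

```python
# Scores per document, in document order (assumed from the logs above).
scores = [0.0, 0.498, 0.121]
similarity_top_k = 2

# Indices of the highest-scoring documents, best first.
# Stand-in for scores.argsort()[::-1][:similarity_top_k] in base.py.
top_n = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[
    :similarity_top_k
]

# Buggy pairing: zip walks `scores` from position 0 and ignores `ix`.
buggy = [(ix, score) for ix, score in zip(top_n, scores)]

# Fixed pairing: look the score up at the selected index.
fixed = [(ix, scores[ix]) for ix in top_n]

print(buggy)  # [(1, 0.0), (2, 0.498)]
print(fixed)  # [(1, 0.498), (2, 0.121)]
```

The buggy pairing gives document 1 the score of document 0 and document 2 the score of document 1, which matches the error output.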

huanghao128 added the bug and triage labels on Jun 28, 2024

dosubot bot commented Jun 28, 2024

Hey @huanghao128! I'm here to help you with any bugs, questions, or contributions you have for the repository. Let me know how I can assist you!

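The mismatch at line 92 of llama_index/retrievers/bm25/base.py comes from zip(top_n, scores): top_n holds the indices of the highest-scoring nodes, but scores is still in the original node order, so the i-th selected node is paired with the score of node i rather than its own score. The suggested fix is correct: look the score up with scores[ix]. Because the zipped score is then unused, the loop can iterate over top_n alone: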

for ix in top_n:
    nodes.append(NodeWithScore(node=self._nodes[ix], score=float(scores[ix])))

With that change, the _retrieve method would look like this:

def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
    if query_bundle.custom_embedding_strs or query_bundle.embedding:
        logger.warning("BM25Retriever does not support embeddings, skipping...")

    query = query_bundle.query_str
    tokenized_query = self._tokenizer(query)
    scores = self.bm25.get_scores(tokenized_query)

    top_n = scores.argsort()[::-1][: self._similarity_top_k]

    nodes: List[NodeWithScore] = []
    for ix in top_n:
        nodes.append(NodeWithScore(node=self._nodes[ix], score=float(scores[ix])))

    return nodes

This change ensures that the scores are correctly associated with the nodes they correspond to [1].
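
One way to guard against a regression is to check that the returned scores are non-increasing, since the retriever orders results by score. The helper below is a hypothetical sketch and not part of llama_index. It works on plain floats, which in practice could be collected as `[n.score for n in result_nodes]`.

```python
from typing import Sequence


def scores_are_descending(scores: Sequence[float]) -> bool:
    """Return True if each score is >= the one that follows it."""
    return all(a >= b for a, b in zip(scores, scores[1:]))


# Scores from the error output in this issue: the check fails.
print(scores_are_descending([0.000, 0.498]))  # False
# Scores from the expected output: the check passes.
print(scores_are_descending([0.498, 0.121]))  # True
```

This check is necessary but not sufficient: a mispaired result can still happen to be in descending order, so it complements a direct comparison against scores[ix] and does not replace it.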
