[Question]: How were the prompts and scoring logic defined for CorrectnessEvaluator, ContextRelevancyEvaluator, AnswerRelevancyEvaluator and FaithfulnessEvaluator #14432

thulard opened this issue Jun 27, 2024 · 1 comment
Labels
question Further information is requested

Comments


thulard commented Jun 27, 2024

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

Hi,

I was wondering on what basis the prompts used by these metrics were defined.
Was there any research that supports these specific prompts?

I'd also like to understand why the scoring systems are so different from one metric to the next.

  • Correctness is a 1-to-5 score.
  • Context relevancy and Answer relevancy are based on a set of questions. However, Context relevancy encourages partial marks while Answer relevancy does not.
  • Faithfulness is a binary score.

Thank you

@thulard thulard added the question Further information is requested label Jun 27, 2024

dosubot bot commented Jun 27, 2024

Hey @thulard! I'm here to help you with any questions you have while waiting for a human maintainer. If you need assistance with bugs, questions, or becoming a contributor, feel free to reach out.

The prompts and scoring systems for the CorrectnessEvaluator, AnswerRelevancyEvaluator, ContextRelevancyEvaluator, and FaithfulnessEvaluator were designed to assess different aspects of generated responses using both standard LLMs and Multi-Modal LLMs. Here's a breakdown:

  1. Correctness (LLM-As-A-Judge): This evaluates the correctness of the response without considering the context, using a standard LLM like GPT-4. The scoring is on a scale from 1 to 5.
  2. Faithfulness (LMM-As-A-Judge): This evaluates whether the response is faithful to the provided context, requiring both images and text. It uses a Multi-Modal LLM (LMM) and provides a binary score (YES or NO).
  3. Relevancy (LMM-As-A-Judge): This evaluates the relevancy of the response to the provided context, also requiring both images and text. It uses a Multi-Modal LLM (LMM) and provides a binary score (YES or NO).

The specific prompts and scoring differences are designed to leverage the unique strengths of each model type appropriately. The use of different models ensures a comprehensive evaluation of the responses.

Supporting evidence from the document:

### Correctness, Faithfulness, Relevancy

With the generated responses in hand (stored in a custom data object tailored for this ASL use case, namely `response_data`), we can now compute metrics for their evaluation:
- Correctness (LLM-As-A-Judge):
- Faithfulness (LMM-As-A-Judge):
- Relevancy (LMM-As-A-Judge):

To compute all three, we prompt another generative model to provide a score assessing each criterion. For Correctness, since we're not considering context, the judge is an LLM. In contrast, to compute Faithfulness and Relevancy, we're required to pass in the context, meaning both the images and text that were supplied to the RAG pipeline to generate the response in the first place. Because both image and text must be passed in, the judges for Faithfulness and Relevancy must be LMMs (Multi-Modal LLMs).

Code snippet showing the setup of the evaluators:

from llama_index.llms.openai import OpenAI
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.core.evaluation.multi_modal import (
    MultiModalRelevancyEvaluator,
    MultiModalFaithfulnessEvaluator,
)

judges = {}

# Correctness is judged by a text-only LLM (no context is needed).
judges["correctness"] = CorrectnessEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4"),
)

# Relevancy and Faithfulness need the retrieved images and text,
# so their judges are Multi-Modal LLMs (LMMs).
judges["relevancy"] = MultiModalRelevancyEvaluator(
    multi_modal_llm=OpenAIMultiModal(
        model="gpt-4-vision-preview",
        max_new_tokens=300,
    )
)

judges["faithfulness"] = MultiModalFaithfulnessEvaluator(
    multi_modal_llm=OpenAIMultiModal(
        model="gpt-4-vision-preview",
        max_new_tokens=300,
    )
)
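
For reference, here is a minimal sketch of how the correctness judge could then be invoked. The query, generated answer, and reference answer below are placeholder values; the call uses the standard `evaluate()` / `EvaluationResult` interface of llama-index evaluators:

# Hypothetical example inputs -- replace with real data from your pipeline.
query = "What is the capital of France?"
generated_answer = "The capital of France is Paris."
reference_answer = "Paris is the capital of France."

# CorrectnessEvaluator returns an EvaluationResult whose `score` is on the 1-5 scale.
correctness_result = judges["correctness"].evaluate(
    query=query,
    response=generated_answer,
    reference=reference_answer,
)
print(correctness_result.score)    # e.g. 4.5
print(correctness_result.passing)  # True if the score clears the evaluator's passing threshold
print(correctness_result.feedback) # the judge's written justification

The multimodal relevancy and faithfulness judges are invoked in the same way, except that the retrieved text contexts and images are also passed in so the LMM can check the response against them.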

Correctness Evaluation Prompt

prometheus_correctness_eval_prompt_template = """###Task Description: An instruction (might include an Input inside it), a query, a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assesses the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is either 1 or 2 or 3 or 4 or 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (1 or 2 or 3 or 4 or 5)"
4. Please do not generate any other opening, closing, and explanations.
5. Only evaluate on common things between generated answer and reference answer. Don't evaluate on things which are present in reference answer but not in generated answer.

###The instruction to evaluate: Your task is to evaluate the generated answer and reference answer for the query: {query}

###Generate answer to evaluate: {generated_answer}

###Reference Answer (Score 5): {reference_answer}

###Score Rubrics:
Score 1: If the generated answer is not relevant to the user query and reference answer.
Score 2: If the generated answer is according to reference answer but not relevant to user query.
Score 3: If the generated answer is relevant to the user query and reference answer but contains mistakes.
Score 4: If the generated answer is relevant to the user query and has the exact same metrics as the reference answer, but it is not as concise.
Score 5: If the generated answer is relevant to the user query and fully correct according to the reference answer.

###Feedback:"""

Faithfulness Evaluation Prompt

prometheus_faithfulness_eval_prompt_template = """###Task Description: An instruction (might include an Input inside it), an information, a context, and a score rubric representing evaluation criteria are given.
1. You are provided with evaluation task with the help of information, context information to give result based on score rubrics.
2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general.
3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric.
4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)"
5. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate: Your task is to evaluate if the given piece of information is supported by context.

###Information: {query_str}

###Context: {context_str}

###Score Rubrics:
Score YES: If the given piece of information is supported by context.
Score NO: If the given piece of information is not supported by context.

###Feedback: """

Relevancy Evaluation Prompt

prometheus_relevancy_eval_prompt_template = """###Task Description: An instruction (might include an Input inside it), a query with response, context, and a score rubric representing evaluation criteria are given.
1. You are provided with evaluation task with the help of a query with response and context.
2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general.
3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric.
4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)"
5. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate: Your task is to evaluate if the response for the query is in line with the context information provided.

###Query and Response: {query_str}

###Context: {context_str}

###Score Rubrics:
Score YES: If the response for the query is in line with the context information provided.
Score NO: If the response for the query is not in line with the context information provided.

###Feedback: """

[1][2]

To continue talking to Dosu, mention @dosu.
