# Evaluation with a Script

## Run Evaluation

The evaluation can also be carried out with the evaluation script. Example:

```
python scripts/experiment.py --config_exp_path=scripts/configs/experiment_settings/sample_evaluation
```

The directory containing the configuration file is specified with the `--config_exp_path` option. An example configuration file is shown below:

```json
{
    "chain_config": {
        "dataset": {
            "dataset_name": "NQ",
            "num_evaluate": 10,
            "batch_size": 20
        },
        "len_chain": 2,
        "chain": [
            {
                "prompt_template": "{question}",
                "function": "Retriever",
                "retriever_name": "flat_subset_499992",
                "npassage": 5,
                "f-strings_or_eval": "f-strings"
            },
            {
                "prompt_template": "Referring to the following document, answer \"{question}?\" in 5 words or less.\n\n{response[0]}\n\nAnswer: ",
                "function": "LLM",
                "llm_name": "llama-2-13b-chat",
                "f-strings_or_eval": "f-strings"
            }
        ]
    }
}
```
- `dataset_name`: The name of the dataset used for evaluation. For evaluation on the KILT benchmark, select one of the following options: FEV, AY2, WnWi, WnCw, T-REx, zsRE, NQ, HoPo, TQA, ELI5, or WoW.
- `num_evaluate`: The number of questions from the development set used to evaluate the chain (-1 for all questions).
- `retriever_name`: The name of the retriever (including the corresponding index and corpus) specified in scripts/configs/base_settings/retrievers.json.
- `llm_name`: The LLM name specified in scripts/configs/base_settings/llms.json.
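
Conceptually, the chain is executed step by step: each step fills its `prompt_template` (with `{question}` and the earlier steps' outputs available as `{response[i]}`) and then calls either a retriever or an LLM. The sketch below only illustrates that flow and is not the actual code in scripts/experiment.py; `retrieve` and `generate` are hypothetical placeholders for the toolkit's Retriever and LLM wrappers.

```python
def run_chain(question, chain_config, retrieve, generate):
    """Run a single question through the configured chain.

    chain_config: the "chain_config" object from the JSON above.
    retrieve(retriever_name, query, npassage) and generate(llm_name, prompt)
    are hypothetical callables standing in for the toolkit's wrappers.
    """
    responses = []
    for step in chain_config["chain"]:
        # "f-strings" substitution: fill the {question} and {response[i]} placeholders.
        prompt = step["prompt_template"].format(question=question, response=responses)
        if step["function"] == "Retriever":
            out = retrieve(step["retriever_name"], prompt, step["npassage"])
        elif step["function"] == "LLM":
            out = generate(step["llm_name"], prompt)
        else:
            raise ValueError(f"Unknown function: {step['function']}")
        responses.append(out)
    return responses[-1]  # the answer produced by the final step
```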

When evaluated with the given settings, the EM score is 0.1 (10.0%) because the corpus is a subset. When the entire corpus is used, the EM score with the provided prompt increases to 36.1% on the full NQ dataset.

## Metrics

### Downstream performance

Following preprocessing (lowercasing the text, removing punctuation, articles, and extraneous whitespace), the KILT implementation computes the following four metrics for downstream tasks: Accuracy, EM (Exact Match), F1-score, and ROUGE-L.
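
The answer normalization and the EM/F1 computation follow the usual SQuAD/KILT recipe. A minimal sketch (equivalent in spirit, not the exact KILT code):

```python
import re
import string
from collections import Counter

def normalize_answer(s):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize_answer(prediction) == normalize_answer(gold))

def f1_score(prediction, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```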

Additionally, we include the has_answer percentage for short answers, a proxy metric that measures the proportion of questions whose final R-LLM output contains a gold answer. By tracking this metric, developers can identify cases where the model generates responses that include a gold answer but are penalized by strict criteria such as exact matching.
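
One simple way to compute such a has_answer check is normalized substring containment, reusing `normalize_answer` from the sketch above; the exact criterion used by the toolkit may differ.

```python
def has_answer(prediction, gold_answers):
    """1.0 if any gold answer appears in the generated output after normalization."""
    pred = normalize_answer(prediction)
    return float(any(normalize_answer(g) in pred for g in gold_answers))
```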

### Retrieval performance

`Rprec` (R-Precision): Page-level R-Precision is the percentage of the R gold pages in a provenance set that appear among the top-R retrieved pages. R-Precision is typically equivalent to Precision@1, except for FEVER and HotpotQA (multi-hop datasets) (see also Craswell (2017)).
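
A sketch of page-level R-Precision for a single provenance set, assuming `retrieved_pages` is a deduplicated, ranked list of page ids:

```python
def r_precision(gold_pages, retrieved_pages):
    """Fraction of the R gold pages found among the top-R retrieved pages."""
    gold = set(gold_pages)
    r = len(gold)
    if r == 0:
        return 0.0
    top_r = set(retrieved_pages[:r])
    return len(gold & top_r) / r
```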

`recall_at_5`: Recall@5, i.e., the recall of gold provenance pages within the top five retrieved pages.
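
Under the same assumptions, Recall@k can be sketched as:

```python
def recall_at_k(gold_pages, retrieved_pages, k=5):
    """Fraction of gold pages that appear among the top-k retrieved pages."""
    gold = set(gold_pages)
    if not gold:
        return 0.0
    return len(gold & set(retrieved_pages[:k])) / len(gold)
```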

### KILT scores

KILT versions of the downstream metrics (see Petroni et al. (2021) for details). Accuracy, EM, ROUGE-L, and F1-score points are awarded to KILT-Accuracy, KILT-EM, KILT-RL, and KILT-F1, respectively, only when the R-Precision is 1 (retrieval success).
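
The gating idea can be sketched as follows, assuming the per-example downstream score and R-Precision have already been computed:

```python
def kilt_score(downstream_score, rprec):
    """KILT-* variant of a downstream metric: the downstream score is awarded
    only when page-level R-Precision is 1 (retrieval success); otherwise 0."""
    return downstream_score if rprec == 1.0 else 0.0

# Example: EM = 1.0 but only one of two gold pages retrieved (rprec = 0.5)
# contributes 0 to KILT-EM.
```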