add generate metrics #29

Open
Tracked by #23
Wenshansilvia opened this issue Feb 5, 2024 · 11 comments
@Wenshansilvia (Collaborator) commented Feb 5, 2024

@QianHaosheng @bugtig6351 @yuanpcr you can list all potential metrics for the generate task in this issue. For more details about the generate task, you can refer to issue #12.

@Wenshansilvia Wenshansilvia mentioned this issue Feb 5, 2024
@faneshion faneshion added this to the Version 0.1 milestone Feb 6, 2024
@faneshion (Collaborator):
This issue is to define new features for evaluating answer correctness.

@QianHaosheng QianHaosheng linked a pull request Feb 23, 2024 that will close this issue
@QianHaosheng (Collaborator):
There are some open-source metric libraries that we may be able to use in our project, for example Rouge and MAUVE:
Rouge
Rouge-chinese
MAUVE
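
As a rough illustration of how such a library could be wrapped, here is a minimal sketch using rouge-score (the example strings are made up; MAUVE works on whole corpora rather than single pairs):

```python
# Minimal sketch: scoring a generated answer against a ground-truth answer
# with rouge-score. Assumes `pip install rouge-score`; example strings are made up.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="retrieval augmentation reduces hallucination",          # ground-truth answer
    prediction="retrieval augmentation can reduce hallucination",   # generated answer
)
print(scores["rougeL"].fmeasure)

# MAUVE is corpus-level and much heavier (it featurizes text with GPT-2), e.g.:
# mauve.compute_mauve(p_text=human_answers, q_text=model_answers)
```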

@QianHaosheng QianHaosheng removed a link to a pull request Feb 23, 2024
@faneshion faneshion changed the title from "add generator metrics" to "add generate metrics" Feb 23, 2024
@bugtig6351 (Collaborator):
I added Rouge metrics using rouge-score, referencing huggingface/evaluate and stanford-crfm/helm.
For non-Latin languages such as Chinese, users can optionally supply their own word segmenter, similar to what rouge-chinese does. If necessary, these tokenizers can be added later.
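
For illustration, a minimal sketch of plugging a user-defined word cutter into rouge-score, assuming a recent rouge-score release that accepts a tokenizer object and using jieba purely as an example segmenter:

```python
# Sketch only: a user-supplied Chinese word cutter for rouge-score.
# Assumes rouge-score's `tokenizer` argument (recent releases) and `pip install jieba`.
import jieba
from rouge_score import rouge_scorer

class JiebaTokenizer:
    """Hypothetical adapter exposing the tokenize(text) method rouge-score expects."""
    def tokenize(self, text):
        return [tok for tok in jieba.cut(text) if tok.strip()]

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], tokenizer=JiebaTokenizer())
print(scorer.score("今天天气很好", "今天的天气非常好"))
```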

@faneshion (Collaborator):
> I added Rouge metrics using rouge-score, referencing huggingface/evaluate and stanford-crfm/helm. […]

There are many high-quality metric implementations in datasets that we can use directly. For the metrics not covered there, we can still learn from those implementations.
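
For example, a minimal sketch of reusing those implementations via the Hugging Face evaluate package (the successor to the metrics that used to ship inside datasets); the metric names are the ones published on the Hub:

```python
# Minimal sketch: loading ready-made metrics instead of re-implementing them.
# Assumes `pip install evaluate rouge-score nltk` for the metrics loaded below.
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

predictions = ["retrieval augmentation can reduce hallucination"]
references = ["retrieval augmentation reduces hallucination"]

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=references))
```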

@QianHaosheng (Collaborator) commented Feb 25, 2024

Here are some answer-related metrics and the papers that mention them.

  1. Answer/Query
    DisambigF1 (Active retrieval augmented generation)
    Answer Relevance (RAGAS: Automated Evaluation of Retrieval Augmented Generation)

  2. Answer/Contexts
    FActScore (FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation)
    D-FActScore (Merging Facts, Crafting Fallacies: Evaluating the Contradictory Nature of Aggregated Factual Claims in Long-Form Generations)

  3. Answer/GT_Answer
    accuracy, EM, F1, Rouge
    BLEU, TER, chrF++ (Lift yourself up: Retrieval-augmented text generation with self memory)
    Q-BLEU-1 (Towards a Better Metric for Evaluating Question Generation Systems)
    citation recall/precision (Enabling Large Language Models to Generate Text with Citations)
    nF1 (Hindsight: Posterior-guided training of retrievers for improved open-ended generation)
    Rare F1 (Retrieval augmentation reduces hallucination in conversation)
    Disambiguation-Rouge (PreWoMe: Exploiting Presuppositions as Working Memory for Long Form Question Answering)
    BERTScore (LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models)
    Accuracy exact match, Assertion method matched, Accuracy plausible match, LCS, Edit distance (Retrieval-based prompt selection for code-related few-shot learning)
    perplexity (Improving retrieval-augmented LMs with compression and selective augmentation)
    bits-per-byte (REPLUG: Retrieval-augmented black-box language models)
    MAUVE (MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers; Enabling Large Language Models to Generate Text with Citations)
    Truthful and informative (TruthfulQA: Measuring How Models Mimic Human Falsehoods)

@bugtig6351 (Collaborator):
Disambig-F1 (Active retrieval augmented generation; ASQA: Factoid Questions Meet Long-Form Answers):
Use a RoBERTa-based model to normalize the answer and the ground-truth answer, then compute the token-level F1 score between them.
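
For reference, the token-level F1 half of that computation might look like the sketch below; the RoBERTa-based normalization step is omitted, and `token_f1` is a hypothetical helper name:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two already-normalized answer strings."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("paris france", "paris"))  # 0.666...
```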

@Wenshansilvia Wenshansilvia added the enhancement (New feature or request) and good first issue (Good for newcomers) labels Mar 5, 2024
@WangYiting-1999 commented Mar 6, 2024

I am going to implement the following 2 metrics:
LCS (Retrieval-based prompt selection for code-related few-shot learning)
Edit distance (Retrieval-based prompt selection for code-related few-shot learning)
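
A rough sketch of both metrics as plain dynamic programs over token sequences (function names are placeholders, not an existing API):

```python
# Sketch only: LCS length and Levenshtein edit distance over token lists.
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def edit_distance(a: list, b: list) -> int:
    """Levenshtein distance with a rolling one-row table."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[len(b)]

print(lcs_length("return a + b".split(), "return a - b".split()))  # 3
print(edit_distance(list("kitten"), list("sitting")))               # 3
```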

@FBzzh (Collaborator) commented Mar 6, 2024

I am going to implement the following 2 metrics:
BLEU (BLEU: a Method for Automatic Evaluation of Machine Translation)
Q-BLEU-1 (Towards a Better Metric for Evaluating Question Generation Systems)
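
For the plain BLEU part, a minimal sketch using sacrebleu (assuming that dependency is acceptable); Q-BLEU-1 has no widely used off-the-shelf implementation, so it would need custom code:

```python
# Minimal sketch: corpus-level BLEU with sacrebleu (`pip install sacrebleu`).
import sacrebleu

hypotheses = ["the cat sits on the mat"]
references = [["the cat is sitting on the mat"]]  # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)
```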

@henan991201 (Collaborator) commented Mar 6, 2024

> Here are some answer-related metrics and the papers that mention them. […]

I am going to implement F1, TER, and chrF++ (Lift yourself up: Retrieval-augmented text generation with self memory).
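
A minimal sketch of TER and chrF++ via sacrebleu, assuming that dependency is acceptable (CHRF with word_order=2 is the chrF++ variant):

```python
# Sketch only: corpus-level TER and chrF++ with sacrebleu.
from sacrebleu.metrics import CHRF, TER

hypotheses = ["the cat sits on the mat"]
references = [["the cat is sitting on the mat"]]  # one reference stream

print(TER().corpus_score(hypotheses, references).score)
print(CHRF(word_order=2).corpus_score(hypotheses, references).score)  # chrF++
```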

@RZFan525 (Collaborator) commented Mar 6, 2024

> Here are some answer-related metrics and the papers that mention them. […]

I'm going to implement the truthful and informative metrics (TruthfulQA: Measuring How Models Mimic Human Falsehoods).

@FBzzh (Collaborator) commented Mar 12, 2024

> I am going to implement the following 2 metrics: BLEU (BLEU: a Method for Automatic Evaluation of Machine Translation), Q-BLEU-1 (Towards a Better Metric for Evaluating Question Generation Systems)

The Q-BLEU metric measures the answerability of questions produced by an automatic question generation system: it checks whether the question contains relevant content words, named entities, question types, and function words. It is therefore not useful for evaluating answer generation.
I am going to implement another metric instead:
perplexity (Improving retrieval-augmented LMs with compression and selective augmentation)
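
A rough sketch of answer perplexity under a causal language model, assuming the transformers library; GPT-2 is only an illustrative choice of scoring model:

```python
# Sketch only: perplexity of a generated answer under a causal LM.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the returned loss is the mean per-token NLL.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())

print(perplexity("Retrieval augmentation can reduce hallucination in conversation."))
```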
