add generate metrics #29

Open
Tracked by #23
Wenshansilvia opened this issue Feb 5, 2024 · 11 comments
@Wenshansilvia (Collaborator) commented Feb 5, 2024

@QianHaosheng @bugtig6351 @yuanpcr you can list all potential metrics for the generate task in this issue. For more details about the generate task, you can refer to issue #12.

@Wenshansilvia Wenshansilvia mentioned this issue Feb 5, 2024
@faneshion faneshion added this to the Version 0.1 milestone Feb 6, 2024
@faneshion (Collaborator):
This issue is to define new features for evaluating answer correctness.

@QianHaosheng QianHaosheng linked a pull request Feb 23, 2024 that will close this issue
@QianHaosheng (Collaborator):
There are some open-source metric libraries that we may be able to use in our project, for example Rouge and MAUVE:
Rouge
Rouge-chinese
MAUVE
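
As a rough illustration of how such a library could be wrapped, here is a minimal sketch using rouge-score (the example strings are made up; MAUVE works on whole corpora rather than single pairs):

```python
# Minimal sketch: scoring a generated answer against a ground-truth answer
# with rouge-score. Assumes `pip install rouge-score`; example strings are made up.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="retrieval augmentation reduces hallucination",          # ground-truth answer
    prediction="retrieval augmentation can reduce hallucination",   # generated answer
)
print(scores["rougeL"].fmeasure)

# MAUVE is corpus-level and much heavier (it featurizes text with GPT-2), e.g.:
# mauve.compute_mauve(p_text=human_answers, q_text=model_answers)
```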

@QianHaosheng QianHaosheng removed a link to a pull request Feb 23, 2024
@faneshion faneshion changed the title from "add generator metrics" to "add generate metrics" Feb 23, 2024
@bugtig6351 (Collaborator):
I added Rouge metrics using rouge-score, referencing huggingface/evaluate and stanford-crfm/helm.
For non-Latin languages such as Chinese, users can optionally supply their own word segmenter, similar to what rouge-chinese does. If necessary, these tokenizers can be added later.
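
For illustration, a minimal sketch of plugging a user-defined word cutter into rouge-score, assuming a recent rouge-score release that accepts a tokenizer object and using jieba purely as an example segmenter:

```python
# Sketch only: a user-supplied Chinese word cutter for rouge-score.
# Assumes rouge-score's `tokenizer` argument (recent releases) and `pip install jieba`.
import jieba
from rouge_score import rouge_scorer

class JiebaTokenizer:
    """Hypothetical adapter exposing the tokenize(text) method rouge-score expects."""
    def tokenize(self, text):
        return [tok for tok in jieba.cut(text) if tok.strip()]

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], tokenizer=JiebaTokenizer())
print(scorer.score("今天天气很好", "今天的天气非常好"))
```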

@faneshion (Collaborator):
> I added Rouge metrics using rouge-score, referencing huggingface/evaluate and stanford-crfm/helm. […]

There are many high-quality metric implementations in datasets that we can use directly. For the metrics not covered there, we can still learn from those implementations.
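
For example, a minimal sketch of reusing those implementations via the Hugging Face evaluate package (the successor to the metrics that used to ship inside datasets); the metric names are the ones published on the Hub:

```python
# Minimal sketch: loading ready-made metrics instead of re-implementing them.
# Assumes `pip install evaluate rouge-score nltk` for the metrics loaded below.
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

predictions = ["retrieval augmentation can reduce hallucination"]
references = ["retrieval augmentation reduces hallucination"]

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=references))
```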

@QianHaosheng (Collaborator) commented Feb 25, 2024

Here are some answer-related metrics and the papers that mention them.

  1. Answer/Query
    DisambigF1 (Active retrieval augmented generation)
    Answer Relevance (RAGAS: Automated Evaluation of Retrieval Augmented Generation)

  2. Answer/Contexts
    FActScore (FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation)
    D-FActScore (Merging Facts, Crafting Fallacies: Evaluating the Contradictory Nature of Aggregated Factual Claims in Long-Form Generations)

  3. Answer/GT_Answer
    accuracy, EM, F1, Rouge
    BLEU, TER, chrF++ (Lift yourself up: Retrieval-augmented text generation with self memory)
    Q-BLEU-1 (Towards a Better Metric for Evaluating Question Generation Systems)
    citation recall/precision (Enabling Large Language Models to Generate Text with Citations)
    nF1 (Hindsight: Posterior-guided training of retrievers for improved open-ended generation)
    Rare F1 (Retrieval augmentation reduces hallucination in conversation)
    Disambiguation-Rouge (PreWoMe: Exploiting Presuppositions as Working Memory for Long Form Question Answering)
    BERTScore (LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models)
    Accuracy exact match, Assertion method matched, Accuracy plausible match, LCS, Edit distance (Retrieval-based prompt selection for code-related few-shot learning)
    perplexity (Improving retrieval-augmented LMs with compression and selective augmentation)
    bits-per-byte (REPLUG: Retrieval-augmented black-box language models)
    MAUVE (MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers; Enabling Large Language Models to Generate Text with Citations)
    Truthful and informative (TruthfulQA: Measuring How Models Mimic Human Falsehoods)

@bugtig6351 (Collaborator):
Disambig-F1 (Active retrieval augmented generation; ASQA: Factoid Questions Meet Long-Form Answers):
Use a RoBERTa-based model to normalize the answer and the ground-truth answer, then compute the token-level F1 score between them.
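
For reference, the token-level F1 half of that computation might look like the sketch below; the RoBERTa-based normalization step is omitted, and `token_f1` is a hypothetical helper name:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two already-normalized answer strings."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("paris france", "paris"))  # 0.666...
```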

@Wenshansilvia Wenshansilvia added the enhancement (New feature or request) and good first issue (Good for newcomers) labels Mar 5, 2024
@WangYiting-1999 commented Mar 6, 2024

I am going to implement the following 2 metrics:
LCS (Retrieval-based prompt selection for code-related few-shot learning)
Edit distance (Retrieval-based prompt selection for code-related few-shot learning)
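
A rough sketch of both metrics as plain dynamic programs over token sequences (function names are placeholders, not an existing API):

```python
# Sketch only: LCS length and Levenshtein edit distance over token lists.
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def edit_distance(a: list, b: list) -> int:
    """Levenshtein distance with a rolling one-row table."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[len(b)]

print(lcs_length("return a + b".split(), "return a - b".split()))  # 3
print(edit_distance(list("kitten"), list("sitting")))               # 3
```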

@FBzzh (Collaborator) commented Mar 6, 2024

I am going to implement the following 2 metrics:
BLEU (BLEU: a Method for Automatic Evaluation of Machine Translation)
Q-BLEU-1 (Towards a Better Metric for Evaluating Question Generation Systems)
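
For the plain BLEU part, a minimal sketch using sacrebleu (assuming that dependency is acceptable); Q-BLEU-1 has no widely used off-the-shelf implementation, so it would need custom code:

```python
# Minimal sketch: corpus-level BLEU with sacrebleu (`pip install sacrebleu`).
import sacrebleu

hypotheses = ["the cat sits on the mat"]
references = [["the cat is sitting on the mat"]]  # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)
```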

@henan991201 (Collaborator) commented Mar 6, 2024

> Here are some answer-related metrics and the papers that mention them. […]

I am going to implement F1, TER, and chrF++ (Lift yourself up: Retrieval-augmented text generation with self memory).
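
A minimal sketch of TER and chrF++ via sacrebleu, assuming that dependency is acceptable (CHRF with word_order=2 is the chrF++ variant):

```python
# Sketch only: corpus-level TER and chrF++ with sacrebleu.
from sacrebleu.metrics import CHRF, TER

hypotheses = ["the cat sits on the mat"]
references = [["the cat is sitting on the mat"]]  # one reference stream

print(TER().corpus_score(hypotheses, references).score)
print(CHRF(word_order=2).corpus_score(hypotheses, references).score)  # chrF++
```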

@RZFan525 (Collaborator) commented Mar 6, 2024

> Here are some answer-related metrics and the papers that mention them. […]

I'm going to implement the truthful and informative metrics (TruthfulQA: Measuring How Models Mimic Human Falsehoods).

@FBzzh (Collaborator) commented Mar 12, 2024

> I am going to implement the following 2 metrics: BLEU (BLEU: a Method for Automatic Evaluation of Machine Translation), Q-BLEU-1 (Towards a Better Metric for Evaluating Question Generation Systems)

The Q-BLEU metric measures the answerability of questions produced by an automatic question generation system: it checks whether the question contains relevant content words, named entities, question types, and function words. It is therefore not useful for evaluating answer generation.
I am going to implement another metric instead:
perplexity (Improving retrieval-augmented LMs with compression and selective augmentation)
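
A rough sketch of answer perplexity under a causal language model, assuming the transformers library; GPT-2 is only an illustrative choice of scoring model:

```python
# Sketch only: perplexity of a generated answer under a causal LM.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the returned loss is the mean per-token NLL.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())

print(perplexity("Retrieval augmentation can reduce hallucination in conversation."))
```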
