
Feature/batch compute #110

Open
wants to merge 10 commits into main

Conversation

@bugtig6351 (Collaborator) commented Jul 1, 2024

  1. Modify the Metric class: make _compute_one the abstract method and provide a default implementation of the _compute_batch method.
  2. Rewrite the F1, ChrF and Ter metrics.


codecov bot commented Jul 2, 2024

Codecov Report

Attention: Patch coverage is 94.33962% with 3 lines in your changes missing coverage. Please review.

Project coverage is 81.21%. Comparing base (c253114) to head (95f6cd4).

Files                                      Patch %   Lines
rageval/metrics/_context_recall.py         66.66%    1 Missing ⚠️
rageval/metrics/_context_reject_rate.py    50.00%    1 Missing ⚠️
rageval/metrics/base.py                    85.71%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #110      +/-   ##
==========================================
- Coverage   81.45%   81.21%   -0.25%     
==========================================
  Files          33       33              
  Lines        1170     1139      -31     
==========================================
- Hits          953      925      -28     
+ Misses        217      214       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bugtig6351 (Collaborator, Author)

Main changes:

  1. Since most metrics only differ in the _compute_one() method and the final results are aggregated using np.average(scores), I will use the following _compute_batch() as the default implementation for the Metric class, and _compute_one() as an abstract method that needs to be implemented by each metric individually.
    def _compute_batch(
        self,
        pred_answers: Optional[Iterable] = None,
        ref_answers: Optional[Iterable] = None,
        *args: Optional[Iterable]
    ) -> List[float]:
        """Compute the metric for a batch of predictions and references."""
        scores = []
        for pred, refs in tqdm(zip(pred_answers, ref_answers),
                               desc=f"Computing {self.name}",
                               total=len(pred_answers)):
            scores.append(self._compute_one(pred, refs))
        return scores
  2. I have rewritten the F1 metric. It can now calculate the F1 score between any two iterable sequences, such as sequences of integers or tuples. By default it uses the original method for strings: removing articles and punctuation, then tokenizing on spaces (this normalization does not support Chinese). Whether to apply the original string method is controlled with normalize=True; see the sketch after this list for a minimal usage example.
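
Below is a minimal, hypothetical sketch of how a concrete metric plugs into this design: only _compute_one is implemented (here, a token-level F1 over arbitrary hashable sequences), while the default _compute_batch loop shown above handles the batching. The class name and the example data are illustrative only, not the actual rageval API.

```python
# Hypothetical sketch, not the actual rageval implementation.
from collections import Counter
from typing import Iterable, List, Optional

from tqdm import tqdm


class SimpleF1:
    """Stand-in for a Metric subclass; only the batching design is illustrated."""

    name = "simple_f1"

    def _compute_batch(
        self,
        pred_answers: Optional[Iterable] = None,
        ref_answers: Optional[Iterable] = None,
    ) -> List[float]:
        # Default behaviour from the base class: score each (pred, refs) pair.
        return [
            self._compute_one(pred, refs)
            for pred, refs in tqdm(zip(pred_answers, ref_answers),
                                   desc=f"Computing {self.name}",
                                   total=len(pred_answers))
        ]

    def _compute_one(self, pred, ref) -> float:
        # Token-level F1 over any two iterables of hashable items; the real
        # metric would optionally normalize/tokenize strings first.
        common = Counter(pred) & Counter(ref)
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred)
        recall = overlap / len(ref)
        return 2 * precision * recall / (precision + recall)


scores = SimpleF1()._compute_batch(
    pred_answers=[["a", "cat", "sat"], (1, 2, 3)],
    ref_answers=[["the", "cat", "sat"], (2, 3, 4)],
)
print(scores)  # [0.666..., 0.666...]
```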

There are two open issues:

  1. The MetricWithLLM class involves model calls. The current approach is to package all the data and delegate the batch operation to the BaseLLM.generate() method for processing.
  2. Some metrics, such as BLEU and ChrF, do not aggregate per-sample results with a simple average. Instead, they treat all samples as a corpus and compute the final result from the answer corpus and the reference corpus. At the moment, the corpus score and the per-sample scores can only be computed separately (see the sketch below).
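
To make the second issue concrete, here is a small sketch assuming sacrebleu as the ChrF backend (the actual backend may differ): the corpus score pools statistics over all samples before computing the final value, so it is generally not the average of the per-sample scores, which is why the default _compute_batch averaging cannot be reused as-is.

```python
# Sketch of corpus-level vs per-sample aggregation, assuming sacrebleu's CHRF.
from sacrebleu.metrics import CHRF

hypotheses = ["the cat sat on the mat", "a dog barked"]
references = [["the cat is on the mat", "the dog barked loudly"]]  # one reference stream

chrf = CHRF()

# Corpus score: n-gram statistics are pooled across all samples first.
corpus = chrf.corpus_score(hypotheses, references)

# Per-sample scores: each hypothesis is scored against its own reference.
per_sample = [chrf.sentence_score(hyp, [ref])
              for hyp, ref in zip(hypotheses, references[0])]

print(corpus.score)                   # corpus-level ChrF
print([s.score for s in per_sample])  # per-sample ChrF
# Averaging the per-sample scores does not, in general, reproduce corpus.score.
```

A batch API for such metrics therefore needs to expose both aggregation modes, or let the metric override the aggregation step, rather than always averaging per-sample scores.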
