This is the official implementation of the paper "A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners" in PyTorch.
Evaluation of Language Models in Non-English Languages
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Code for "Prediction-Powered Ranking of Large Language Models", arXiv 2024.
FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your own data, tasks, and prompts.
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
The official evaluation suite and dynamic data release for MixEval.
The LLM Evaluation Framework
Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.