This is the official implementation of the paper "A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners" in PyTorch.
Evaluation of Language Models in Non-English Languages
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Code for "Prediction-Powered Ranking of Large Language Models", arXiv 2024.
FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your own data, tasks, and prompts.
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
The official evaluation suite and dynamic data release for MixEval.
The LLM Evaluation Framework
Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.