What are the best ways of evaluating LLMs for specific use-cases?
We did some research into the different ways of evaluating LLMs, but there is a lot of literature covering many different approaches: learned metrics like BLEURT, precision/recall if you have ground-truth data, all the way to asking GPT to act as a stand-in for a human rater (LLM-as-judge).
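As an illustration of the ground-truth route, here is a minimal sketch assuming the use case can be framed as binary classification (the yes/no framing, the labels, and the example data are made up purely for illustration):

```python
def precision_recall(predictions: list[bool], ground_truth: list[bool]) -> tuple[float, float]:
    """Precision/recall over labelled test cases for a binary use case."""
    tp = sum(p and g for p, g in zip(predictions, ground_truth))
    fp = sum(p and not g for p, g in zip(predictions, ground_truth))
    fn = sum(not p and g for p, g in zip(predictions, ground_truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# The predictions would come from the LLM under test, e.g. by asking it to
# answer "yes"/"no" and mapping that to True/False.
preds = [True, True, False, True]   # hypothetical model outputs
truth = [True, False, False, True]  # hand-labelled ground truth
print(precision_recall(preds, truth))  # (0.666..., 1.0)
```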
Are there evaluation strategies that have worked best for you? We basically want to let users/developers figure out which LLM (and more specifically, which combination of LLM + prompt + parameters) performs best for their particular use case, which seems like a different problem from the OpenAI evals framework or benchmarks like BigBench.
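To make the question more concrete, here is a rough sketch of the kind of comparison harness we mean: a small grid over models, prompts, and parameters, scored on the user's own test cases. The model names, prompts, and helpers (model_answer(), score()) are placeholders, not any particular library's API; the scorer could just as well be BLEURT or an LLM judge.

```python
from itertools import product

models = ["model-a", "model-b"]                      # hypothetical model names
prompts = ["Answer briefly: {q}", "Think step by step, then answer: {q}"]
temperatures = [0.0, 0.7]
test_cases = [("What is 2+2?", "4")]                 # user-supplied (input, expected) pairs

def model_answer(model: str, prompt: str, temperature: float, question: str) -> str:
    # Placeholder: call whatever LLM client you actually use here.
    raise NotImplementedError

def score(answer: str, expected: str) -> float:
    # Exact match as the simplest metric; swap in BLEURT, precision/recall,
    # or an LLM-as-judge call depending on the use case.
    return float(answer.strip() == expected.strip())

def run_grid():
    # Score every (model, prompt, temperature) combination on the test set
    # and return configurations sorted from best to worst average score.
    results = {}
    for model, prompt, temp in product(models, prompts, temperatures):
        scores = [
            score(model_answer(model, prompt, temp, q), expected)
            for q, expected in test_cases
        ]
        results[(model, prompt, temp)] = sum(scores) / len(scores)
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)
```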