On this episode: Stack Overflow senior data scientist Michael Geden tells Ryan and Ben about how data scientists evaluate large language models (LLMs) and their output. They cover the challenges involved in evaluating LLMs, how LLMs are being used to evaluate other LLMs, the importance of data validation, the need for human raters, and the tradeoffs involved in selecting and fine-tuning LLMs. https://stackoverflow.blog/2024/04/16/how-do-you-evaluate-an-llm-try-an-llm/
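For readers who want a concrete picture of the LLM-as-a-judge pattern the episode touches on, here is a minimal sketch: one model scores another model's answer against a rubric. The prompt wording and the call_llm helper are hypothetical placeholders, not the evaluation pipeline described on the show.

```python
# Minimal LLM-as-a-judge sketch: a judge model rates a candidate answer
# on a 1-5 rubric. call_llm is a hypothetical stand-in for whatever
# inference client you actually use.

JUDGE_PROMPT = """You are grading an answer for factual accuracy.
Question: {question}
Answer: {answer}
Reply with a single integer score from 1 (wrong) to 5 (fully correct)."""

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM inference endpoint."""
    raise NotImplementedError("plug in your model client here")

def judge_answer(question: str, answer: str) -> int:
    """Ask the judge model for a score and clamp it to the 1-5 rubric."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    score = int(reply.strip().split()[0])
    return max(1, min(5, score))
```

In practice, as the episode notes, judge scores are usually spot-checked against human raters before being trusted at scale.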
Other posts in this group

Ryan chats with Amr Awadallah, founder and CEO of GenAI platform Vectara. They cover how retrieval-augmented generation (RAG) has advanced, why fact-checking and accurate data are essential in building…
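For context on the RAG pattern discussed here, below is a minimal sketch of the retrieve-then-generate loop: rank documents by embedding similarity to the query, then answer from the top matches. The embed and call_llm helpers are hypothetical placeholders, not Vectara's API.

```python
# Minimal retrieval-augmented generation (RAG) sketch: retrieve the most
# relevant documents, then prepend them to the prompt so the model answers
# from supplied context rather than from memory alone.
import math

def embed(text: str) -> list[float]:
    """Hypothetical embedding call; returns a fixed-length vector."""
    raise NotImplementedError("plug in your embedding model here")

def call_llm(prompt: str) -> str:
    """Hypothetical LLM inference call."""
    raise NotImplementedError("plug in your model client here")

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rag_answer(query: str, documents: list[str], k: int = 3) -> str:
    """Rank documents by similarity to the query and answer from the top k."""
    q_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(embed(d), q_vec), reverse=True)
    context = "\n\n".join(ranked[:k])
    return call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```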

On this episode of Leaders of Code, Ben Popper hosts a conversation with Maureen Makes, VP of Engineering at Recursion, and Ellen Brandenberger, Senior Director of Product Strategy for Overflow API.

Ryan welcomes Tomasz Tunguz of Theory Ventures back to the podcast to talk about the intersection of AI and venture capital, the implications of AI for the labor market, and the future of AI applications.

In March, over 1,000 developers and technologists gave us insights into what they think about open source and the role it plays with AI. https://stackoverflow.blog/2025/04/07/open-source-ai-are-younge

AI is changing how we think about coding. While tools evolve, critical thinking, problem-solving, and creativity remain the essential skills for top developers. https://stackoverflow.blog/2025/04/04/u

At HumanX 2025, Ryan sat down with HumanX CEO Stefan Weitz and Crunchbase CEO Jager McConnell to talk about where the money is in the AI space, where most enterprise AI strategies fall short, and how companies…

Data has always been key to LLM success, but it's becoming key to inference-time performance as well. https://stackoverflow.blog/2025/04/03/from-training-to-inference-the-new-role-of-web-data-in-llms