On this episode: Stack Overflow senior data scientist Michael Geden tells Ryan and Ben about how data scientists evaluate large language models (LLMs) and their output. They cover the challenges involved in evaluating LLMs, how LLMs are being used to evaluate other LLMs, the importance of data validation, the need for human raters, and the tradeoffs involved in selecting and fine-tuning LLMs. https://stackoverflow.blog/2024/04/16/how-do-you-evaluate-an-llm-try-an-llm/
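For readers who want a concrete picture of the LLM-as-a-judge pattern the episode touches on, here is a minimal sketch: one model scores another model's answer against a rubric. The prompt wording and the call_llm helper are hypothetical placeholders, not the evaluation pipeline described on the show.

```python
# Minimal LLM-as-a-judge sketch: a judge model rates a candidate answer
# on a 1-5 rubric. call_llm is a hypothetical stand-in for whatever
# inference client you actually use.

JUDGE_PROMPT = """You are grading an answer for factual accuracy.
Question: {question}
Answer: {answer}
Reply with a single integer score from 1 (wrong) to 5 (fully correct)."""

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM inference endpoint."""
    raise NotImplementedError("plug in your model client here")

def judge_answer(question: str, answer: str) -> int:
    """Ask the judge model for a score and clamp it to the 1-5 rubric."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    score = int(reply.strip().split()[0])
    return max(1, min(5, score))
```

In practice, as the episode notes, judge scores are usually spot-checked against human raters before being trusted at scale.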
Other posts in this group

Ryan chats with Amr Awadallah, founder and CEO of GenAI platform Vectara. They cover how retrieval-augmented generation (RAG) has advanced, why fact-checking and accurate data are essential in building…
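For context on the RAG pattern discussed here, below is a minimal sketch of the retrieve-then-generate loop: rank documents by embedding similarity to the query, then answer from the top matches. The embed and call_llm helpers are hypothetical placeholders, not Vectara's API.

```python
# Minimal retrieval-augmented generation (RAG) sketch: retrieve the most
# relevant documents, then prepend them to the prompt so the model answers
# from supplied context rather than from memory alone.
import math

def embed(text: str) -> list[float]:
    """Hypothetical embedding call; returns a fixed-length vector."""
    raise NotImplementedError("plug in your embedding model here")

def call_llm(prompt: str) -> str:
    """Hypothetical LLM inference call."""
    raise NotImplementedError("plug in your model client here")

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rag_answer(query: str, documents: list[str], k: int = 3) -> str:
    """Rank documents by similarity to the query and answer from the top k."""
    q_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(embed(d), q_vec), reverse=True)
    context = "\n\n".join(ranked[:k])
    return call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```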

On this episode of Leaders of Code, Ben Popper hosts a conversation with Maureen Makes, VP of Engineering at Recursion, and Ellen Brandenberger, Senior Director of Product Strategy for Overflow API.

Ryan welcomes Tomasz Tunguz of Theory Ventures back to the podcast to talk about the intersection of AI and venture capital, the implications of AI for the labor market, and the future of AI applications.

In March, over 1,000 developers and technologists gave us insights into what they think about open source and the role it plays with AI. https://stackoverflow.blog/2025/04/07/open-source-ai-are-younge

AI is changing how we think about coding. While tools evolve, critical thinking, problem-solving, and creativity remain the essential skills for top developers. https://stackoverflow.blog/2025/04/04/u

At HumanX 2025, Ryan sat down with HumanX CEO Stefan Weitz and Crunchbase CEO Jager McConnell to talk about where the money is in the AI space, where most enterprise AI strategies fall short, and how companies…

Data has always been key to LLM success, but it's becoming key to inference-time performance as well. https://stackoverflow.blog/2025/04/03/from-training-to-inference-the-new-role-of-web-data-in-llms