Task-Specific LLM Evals That Do and Don't Work