Direct Preference Optimization (DPO) - How to fine-tune LLMs directly without reinforcement learning