The Bradley-Terry model works like this:
- It is defined over pairs of samples, where one sample in each pair is chosen and the other is rejected
- The model is trained on binary judgements that label specific samples as either 'preferred' or 'dispreferred'
- Each sample gets a scalar score, and the log-odds that one sample is preferred over another equals the difference between their scores, so the score can be used as a natural reward signal
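
The points above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes each sample already has a scalar reward score, and the function names `preference_probability` and `pairwise_loss` are hypothetical, chosen only for this example.

```python
import math


def preference_probability(reward_preferred: float, reward_dispreferred: float) -> float:
    """Bradley-Terry probability that the first sample is preferred.

    P = exp(r_p) / (exp(r_p) + exp(r_d)) = sigmoid(r_p - r_d)
    """
    diff = reward_preferred - reward_dispreferred
    # Evaluate the logistic function without overflowing for large |diff|.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def pairwise_loss(reward_preferred: float, reward_dispreferred: float) -> float:
    """Negative log-likelihood of a single preferred/dispreferred judgement.

    -log(sigmoid(diff)) = log(1 + exp(-diff)), written in a stable form.
    """
    diff = reward_preferred - reward_dispreferred
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

Under these assumptions, `math.log(p / (1 - p))` for `p = preference_probability(r_p, r_d)` recovers `r_p - r_d`, which is why the score difference acts as a reward signal. Minimising `pairwise_loss` over many judgements pushes the preferred sample's score above the dispreferred one's.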

