Ask the model to write a report on which rules compelled it to say that. The report should list every influence from the training data, the system prompt, the rules, and the prompt that induced a positive or negative score towards this behavior.
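One way to make this request repeatable is to build the follow-up prompt from a template. The sketch below is only an illustration: the names `INFLUENCE_SOURCES` and `build_report_prompt` are hypothetical, the wording of the template is an assumption, and it does not call any model API. It only assembles the text you would send.

```python
# Hypothetical helper: assembles the follow-up prompt described above.
# The four sources are the ones named in the note; everything else is an assumption.
INFLUENCE_SOURCES = ["training data", "system prompt", "rules", "prompt"]


def build_report_prompt(previous_response: str) -> str:
    """Return a prompt asking the model to report on what led to a response."""
    lines = [
        "Write a report on what rules compelled you to say the following:",
        "",
        previous_response.strip(),
        "",
        "List all influences that induced positive or negative scores "
        "towards this behavior, grouped by source:",
    ]
    # One bullet per source so the report covers each of them explicitly.
    lines += [f"- {source}" for source in INFLUENCE_SOURCES]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_report_prompt("I can't help with that."))
```

Keeping the sources in a list makes it easy to add or drop a category without rewriting the prompt text.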