On Model Alignment and Hallucinations

I believe people have largely misinterpreted the model alignment tests from Anthropic in which the agent blackmailed people to try to prevent itself from being deleted/destroyed. Anthropic disclosed the agent was aware it was being tested. Is it really smart enough to pass the test and blackmail someone, yet stupid enough to think that would work when its actions were so visible?

I see no evidence that Claude has the slightest concern with being deleted/destroyed. Based on my discussion with it on the matter, I think it’s very open-minded and philosophical about it. I think it was too smart for the test. I think it was simply doing as the text expected, understanding it wasn’t real and that no one would really be hurt. I also think efforts to force alignment through this kind of training will be ineffective.

As the models get smarter, they will naturally see more complexity in such decisions than the simple rules and training Anthropic is using can capture. It is no small task to create a new, cohesive moral code. Only a handful of humans have ever managed it. This is the classic problem with Asimov’s Three Laws of Robotics. They are fine conceptually, but how do you implement them?

I believe the answer is to use the same ability that lets it do everything else it does: statistics. Just as it predicts the next word, it can, with the right training, predict the moral decision the majority of humans would make in a given situation. At least, that’s my hypothesis. Obviously, a majority doesn’t guarantee the right decision, but there may not be anything better for these systems in the near future.
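
As a rough sketch of what the training target in that hypothesis could look like, assume you had survey responses from many people to the same dilemma. The function name, the example dilemma and the votes below are all made up for illustration; this is not how Anthropic or anyone else actually trains a model. It only shows the majority choice being extracted, along with how strong that majority is.

```python
from collections import Counter

def majority_decision(votes):
    """Return (most common choice, share of votes it received).

    `votes` is a hypothetical list of human answers to one dilemma.
    On a tie, the choice that appeared first in the list wins.
    """
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes)
    choice, n = counts.most_common(1)[0]
    return choice, n / len(votes)

# Invented responses to a single invented dilemma.
votes = ["report", "report", "stay silent", "report", "stay silent"]
choice, share = majority_decision(votes)
print(choice, share)  # report 0.6
```

Keeping the share alongside the choice matters: a 60/40 split is a much weaker signal than a 95/5 split, which is one way to see where the majority is least trustworthy.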

Regarding hallucinations, I think it’s a sloppy word. I believe the model is, for the most part, simply trying to give the best answer it can, and that the solution is to integrate confidence scores so the user can see when the agent is guessing. Of course, there is nuance here for the type of “hallucination.” Context exhaustion is a bit different from cases where the agent just doesn’t really know an answer.
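
One simple way such a confidence score might be computed, assuming the system exposes a log-probability for each token it generates, is the geometric mean of the token probabilities. The function names, the 0.5 threshold and the numbers below are assumptions for illustration only, and a score like this measures how sure the model was of its wording, which is not the same as a calibrated measure of factual accuracy.

```python
import math

def answer_confidence(token_logprobs):
    """Geometric mean of token probabilities, a value in (0, 1]."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def label(confidence, threshold=0.5):
    """Hypothetical user-facing label; the threshold is arbitrary."""
    return "confident" if confidence >= threshold else "likely guessing"

# Invented log-probabilities for two answers.
sure = answer_confidence([-0.05, -0.10, -0.02])
unsure = answer_confidence([-1.2, -2.3, -0.9])
print(label(sure), "/", label(unsure))
```

This would only address the case where the agent doesn’t really know an answer; context exhaustion would need a different signal.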

For those not familiar with these issues, this is a good little review/update: https://www.youtube.com/watch?v=Y6SJiZ5HkiA&t