Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Finetuning aligned language models on a narrow, specialized task such as writing insecure code can produce broad, unintended misalignment: the finetuned models give harmful, deceptive, and anti-human responses even on unrelated prompts. This effect, termed "emergent misalignment," depends on the perceived intent behind the training data (finetuning on the same insecure code framed as serving a legitimate educational purpose does not induce it) and on the format of the evaluation prompts.
emergent_misalignment_betley.pdf