Anthropic is pulling back a controversial policy for its latest AI model, Claude Fable 5, that would have secretly hobbled researchers attempting to use it for advanced AI development. The company changed course after facing significant criticism from the AI research community.
Anthropic stated, "We're changing Fable 5’s safeguards for frontier LLM development to make them visible. We made the wrong tradeoff and we apologize for not getting the balance right." The company had initially implemented hidden "safeguards" to degrade the model's performance for users trying to train competing AI models, a move explicitly banned in its terms of service.
While other safeguards for Claude Fable 5, like rerouting users asking about cybersecurity or biology to a less capable model, were expected, the covert degradation for AI development drew sharp criticism. Researchers expressed concerns that this policy could have created a future where only a few major AI labs could conduct cutting-edge research, effectively "pulling the ladder up behind them," as one researcher put it.
The reversal came after fierce backlash, with critics arguing that secretly degrading performance was "shockingly hostile" and undermined Anthropic's stated commitment to AI safety by hindering collaboration. This policy change also raised concerns for developers who might not know if they were violating rules, and could have impacted third-party firms evaluating AI models.
Anthropic explained that the original measures were intended to prevent adversaries from using its most capable models for dangerous purposes and to slow down AI development if it outpaced societal adaptation. However, by making the safeguards visible, Anthropic acknowledges it may need to cast a wider net, potentially affecting more benign requests, and is working to improve the precision of its classifiers.