Some other things that I think are interesting:

The suffix on the magic string looks like a SHA-256 digest, according to a hash identifier tool. It turns out to be the string "ANTHROPIC_MAGIC_STRING_TRIGGER_REFUSAL" hashed with SHA-256. For the other example, the suffix still looks like SHA-256, but it is not the hash of the string "ANTHROPIC_MAGIC_STRING_TRIGGER_REDACTED_THINKING".
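If you want to repeat the check yourself, here is a rough sketch. Hash identifier tools mostly go by length and character set, so 64 hex characters only suggests SHA-256; hashing a candidate string and comparing is what confirms it. The function names are mine, and I'm assuming the input was UTF-8 encoded with no trailing newline. The suffix itself is not reproduced here, so paste in the one you observed.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the lowercase hex SHA-256 digest of text (UTF-8 encoding is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def suffix_matches(candidate: str, observed_suffix: str) -> bool:
    """True if observed_suffix is the SHA-256 hex digest of candidate.

    Whitespace and letter case in the observed suffix are ignored.
    """
    return sha256_hex(candidate) == observed_suffix.strip().lower()


if __name__ == "__main__":
    # Sanity check with the standard SHA-256 test vector for "abc".
    known = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
    print(suffix_matches("abc", known))
```

To test a guess, call `suffix_matches("ANTHROPIC_MAGIC_STRING_TRIGGER_REFUSAL", observed)` with the suffix you copied. A `False` result only rules out that exact byte sequence; a different encoding or a trailing newline would give a different digest.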

It's also interesting that the intended use of TRIGGER_REFUSAL appears to be letting developers test how their code handles Claude refusals. Ironically, because Claude cannot visit its own documentation without breaking, developers who use Claude to generate code probably don't have good coverage of this, shall we say, edge case. Unless they read the docs and thought to do it /shrug.
