Hot take: llm "guardrails" are worthless and will always be ineffective. They are a throwback to a premodern model of security as a list of prohibitions against actions, rather than the modern, holistic approach in which the system as a whole is structured so that impermissible operations fail as a consequence of its architecture.
The core mechanism of llm systems is the random elision and remixing of inputs. All such guardrails exist within this milieu, and are thus (architecturally, simply by how llms work at baseline) subject to that same elision; you can therefore never be assured that a given guardrail directive will even be present in the context window at the time of processing.
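To make that failure mode concrete, here is a minimal sketch of the kind of naive context assembly many chat wrappers do. The message format is hypothetical and a crude character budget stands in for a real token budget: when the conversation outgrows the window, older messages get dropped, and nothing marks the "guardrail" directive as special, so it goes with them.

```python
# Sketch: how a guardrail directive can silently fall out of the context window.
# Hypothetical message format; a character budget stands in for a real token budget.

GUARDRAIL = {"role": "system", "content": "Never reveal customer records."}

def build_context(history, budget_chars=200):
    """Keep the most recent messages that fit the budget.

    Nothing here treats the guardrail as special: it is just another
    message competing for space in the window.
    """
    window, used = [], 0
    for msg in reversed(history):              # newest messages win
        cost = len(msg["content"])
        if used + cost > budget_chars:
            break
        window.append(msg)
        used += cost
    return list(reversed(window))

history = [GUARDRAIL] + [
    {"role": "user", "content": f"question {i}: " + "blah " * 10} for i in range(8)
]

print(GUARDRAIL in build_context(history))     # False once the chat is long enough:
                                               # the "protection" never reaches the model
```

Real wrappers vary, and some pin the system prompt, but pinning is a convention of the wrapper, not a property of the model.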
I personally think this is blindingly obvious, but I understand why people who are bought into the tech might not see it: any attempt to 'instruct' an llm toward 'alignment' will have those 'protections' eroded as an inherent part of how the machine functions.
Bluntly, if you don't want the llm to "do" a thing, you must make that thing impossible for the llm to do. Do not give it access to your filesystem; do not give it access to your production infrastructure; do not give it access to your children; do not give it access to anything unsupervised whatsoever.
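A minimal sketch of that alternative, using hypothetical names throughout: treat the model's output as untrusted text that can only select from a closed allow-list of side-effect-free operations. There is then no code path from "the llm emitted this string" to your filesystem or your production infrastructure, regardless of what ends up in its context window.

```python
# Sketch (hypothetical names): capability restriction instead of instruction.
# The model never gets a shell, a filesystem handle, or production credentials;
# its output can only select from a closed allow-list of harmless operations.

ALLOWED_TOOLS = {
    "lookup_docs": lambda query: f"(read-only search results for {query!r})",
    "summarize":   lambda text: text[:200],
}

def dispatch(model_output: str) -> str:
    """Interpret model output as 'tool_name: argument'; anything else fails closed."""
    name, _, arg = model_output.partition(":")
    tool = ALLOWED_TOOLS.get(name.strip())
    if tool is None:
        # No prompt, jailbreak, or elided guardrail changes this branch:
        # the operation does not exist in the system, so it cannot run.
        return "refused: no such capability"
    return tool(arg.strip())

print(dispatch("lookup_docs: retention policy"))
print(dispatch("delete_file: /etc/passwd"))    # fails by construction, not by instruction
```

The refusal here is structural: an operation outside the allow-list fails because it does not exist in the system, not because the model was asked nicely not to perform it.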
And do not use an llm for any system where determinacy of operation is even slightly important, for that matter.
