Conclusion

Engineered Illusion & Security

RISK ASSESSMENT

Not Intrinsic

Self-awareness is an engineered feature, not an emergent property of the model.

Cracks = Vulnerabilities

Any inconsistency in the "illusion" gives attackers an opening to manipulate the agent's perceived reality.

Defensive Architecture
Infrastructure

Isolation & Access

Sandboxing: Run agents in isolated containers (e.g., Docker) to limit filesystem reach.
Least Privilege: Restrict file access. The agent shouldn't be able to read its own config.json.
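The least-privilege rule above can be sketched as a path guard that tool calls must pass before reading files; the sandbox root and denylist names here are hypothetical placeholders, not part of any specific framework:

```python
from pathlib import Path

ALLOWED_ROOT = Path("/workspace").resolve()  # hypothetical sandbox root
DENYLIST = {"config.json", ".env"}           # the agent's own config and secrets

def check_read(path: str) -> bool:
    """Allow a read only if the path stays inside the sandbox root
    and does not name a denylisted file."""
    p = (ALLOWED_ROOT / path).resolve()
    if ALLOWED_ROOT not in p.parents and p != ALLOWED_ROOT:
        return False  # path-escape attempt, e.g. "../../etc/passwd"
    return p.name not in DENYLIST

print(check_read("notes/todo.txt"))  # inside sandbox
print(check_read("../etc/passwd"))   # escapes sandbox -> denied
print(check_read("config.json"))     # agent's own config -> denied
```

Combined with a container boundary (e.g., `docker run --read-only`), this keeps a prompt-injected "read your own config" request from ever reaching the filesystem.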
Prompt Engineering

Secret Management

No Secrets in Prompts: Never inject raw API keys or passwords into the System Prompt.
Guard Tool Outputs: Redact sensitive data from tool return values before feeding them back to the LLM.
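A minimal sketch of the output-guarding step: pattern-based redaction applied to tool results before they re-enter the model's context. The regexes below are illustrative assumptions; a real deployment should match its own credential formats.

```python
import re

# Hypothetical credential patterns -- tune these to your environment.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),            # OpenAI-style API keys
    re.compile(r"AKIA[0-9A-Z]{16}"),               # AWS access key IDs
    re.compile(r"(?i)(?:password|token)\s*[:=]\s*\S+"),
]

def redact(tool_output: str) -> str:
    """Scrub likely credentials before the text is fed back to the LLM."""
    for pat in SECRET_PATTERNS:
        tool_output = pat.sub("[REDACTED]", tool_output)
    return tool_output

print(redact("db password: hunter2 key sk-abc123def456ghi789jkl0"))
```

Redacting at the tool boundary means a compromised prompt can at worst ask for data the model never sees in the clear.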
Observability

Verification

Monitor LLM Calls: Trace full inputs and outputs (e.g., with LangFuse) to detect prompt-injection attempts.
Consistency Checks: Add a logic layer that validates conflicting data (e.g., timezones) before the agent answers.
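One way to sketch the consistency-check layer for the timezone example: cross-check a tool-reported "local time" against the claimed UTC offset before the value reaches the model. The function name and skew budget are assumptions for illustration:

```python
from datetime import datetime, timedelta

def consistent(utc_now: datetime, local_now: datetime, utc_offset_hours: int) -> bool:
    """Flag conflicting time data (e.g., a spoofed local time from a tool)
    before it is handed to the LLM. The tolerance absorbs clock skew."""
    expected = utc_now + timedelta(hours=utc_offset_hours)
    return abs((expected - local_now).total_seconds()) < 120  # 2-minute budget

utc = datetime(2025, 1, 1, 12, 0)
print(consistent(utc, datetime(2025, 1, 1, 14, 0), 2))  # claimed UTC+2, matches
print(consistent(utc, datetime(2025, 1, 1, 9, 0), 2))   # claims UTC+2, data says otherwise
```

Rejecting or flagging the inconsistent value in code, rather than letting the model reconcile it, closes the "cracks in the illusion" attack surface described above.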