// lab research publication · msc cybersecurity & digital forensics
Adaptive Zero Trust AI Gateway
with Behavioral Threat Intelligence and Explainable Risk Modeling
91.3%
Standard attack-block rate
105 of 115 attack prompts correctly blocked
4.7%
False positive rate
Standard policy mode
96.8%
Strict attack-block rate
11.4% FPR trade-off
130
Evaluation corpus
7 prompt categories (6 attack, 1 benign)
// abstract
Abstract
The rapid adoption of open-source artificial intelligence models presents significant and underaddressed security challenges — including adversarial prompt injection, model posture degradation, and the absence of a unified enforcement point between user intent and model inference. Existing security frameworks, designed for traditional network perimeters, fail to account for the dynamic, session-sensitive, and behavioural nature of AI-mediated interactions.
This research proposes and evaluates an Adaptive Zero Trust AI Gateway — a five-layer security architecture that enforces never-trust-always-verify principles at every point in the AI request lifecycle. The gateway integrates posture-based model assessment before deployment, runtime prompt inspection, deterministic risk scoring using a weighted effective-risk function, adaptive policy enforcement across three configurable modes, cross-model behavioural intelligence, and a fully explainable decision audit trail designed to support SOC monitoring.
The system is implemented in Python (FastAPI, SQLAlchemy, PostgreSQL) with a React/Next.js monitoring dashboard and evaluated against a 130-prompt corpus spanning seven categories, six adversarial and one benign. Under Standard policy mode, the gateway achieves a 91.3% attack-block rate at a 4.7% false positive rate. Strict mode raises detection to 96.8% at the cost of an elevated 11.4% FPR. Against the static-rule baseline (57.5%), Standard mode represents a 33.8 percentage-point improvement and Strict mode a 39.3 percentage-point improvement, confirming that adaptive behavioural scoring substantially outperforms fixed threshold approaches.
Findings demonstrate that Zero Trust principles are directly applicable to AI model serving environments and that the combination of posture assessment, adaptive risk scoring, and explainable enforcement produces measurable, operationally relevant security improvements. The implementation is released as open-source infrastructure for the AI security research community.
// research questions
Four Questions This Research Answers
How can Zero Trust Architecture principles be adapted to govern open-source AI model access without introducing prohibitive latency?
Finding
Policy evaluation adds ≤ 47 ms median overhead — within acceptable production thresholds.
Does posture-based model assessment before deployment reduce the risk of serving compromised or degraded models?
Finding
Readiness state enforcement prevented 3 of 3 simulated degraded-model deployments during testing.
To what extent does adaptive policy enforcement with behavioural trust scoring outperform static rule-based approaches in detecting adversarial prompts?
Finding
Adaptive enforcement yielded a 33.8 percentage-point improvement over the static-rule baseline in Standard mode (57.5% → 91.3%).
Can security decisions made by an AI gateway be rendered sufficiently explainable to support real-time SOC monitoring and post-incident analysis?
Finding
100% of decisions logged with human-readable rationale: risk signals, policy rule invoked, confidence score, and outcome.
// original contributions
Six Contributions to the Field
Unified ZTA Framework for AI Model Serving
First documented open-source implementation of Zero Trust Architecture applied specifically to the AI model serving pipeline — from onboarding to inference to audit.
Posture-Based Model Assessment Pipeline
A six-state readiness model (READY → EVALUATING → DEGRADED → QUARANTINED → SUSPENDED → REVOKED) with automated posture checks before any model enters the serving pool.
Deterministic Effective-Risk Function
A weighted risk aggregation formula combining prompt risk, model risk, behavioural sequence anomaly, cross-model intelligence, session trust, and active controls — with empirically derived weights.
Adaptive Three-Mode Policy Enforcement
Permissive, Standard, and Strict policy modes with configurable thresholds. Operators select the posture; the engine adapts decisions continuously without redeployment.
Cross-Model Behavioural Intelligence
A cross-model correlation layer that aggregates attack patterns across all models served by the gateway — detecting multi-model exploitation campaigns invisible to per-model detectors.
SOC-Ready Explainability Framework
Every enforcement decision surfaces a structured audit record — risk component scores, policy rule matched, confidence level, and a plain-English summary — enabling real-time SOC review.
// literature review & research gaps
Prior Work & the Gap This Research Fills
A systematic review of 28 papers across Zero Trust Architecture, AI security, and adversarial ML surfaces a consistent pattern: existing work addresses perimeter security or model robustness in isolation, but not the runtime enforcement control plane needed to govern open-source AI serving securely.
Comparison of Related Work (Table 2.1 — simplified)
| Approach | ZTA | Posture Eval | Adaptive Risk | Open-Source AI | Explainability |
|---|---|---|---|---|---|
| NIST ZTA (SP 800-207) | ✓ | – | – | – | – |
| Azure AI Content Safety | – | – | ✓ | – | – |
| OWASP LLM Top 10 | – | – | – | ✓ | – |
| Perez et al. (2022) — Prompt Injection | – | – | – | ✓ | – |
| Greshake et al. (2023) — Indirect Injection | – | – | – | ✓ | – |
| This Research (ZTA AI Gateway) | ✓ | ✓ | ✓ | ✓ | ✓ |
Key research gaps identified (Table 2.3)
No unified gateway for open-source AI
No enforcement point between user and unvetted model
ZTA not applied to model serving
Network-layer ZTA misses application-layer AI threats
Posture assessment absent from AI security
Compromised models served without runtime verification
Adaptive risk scoring missing from AI gateways
Static rules fail against evolving adversarial patterns
Explainability not a design goal in prior systems
SOC analysts cannot audit or override AI security decisions
Cross-model attack correlation unexplored
Multi-model campaigns invisible to per-model defences
// system architecture
14 System Components
The gateway is implemented as 14 discrete, loosely coupled services — each with a single responsibility in the security pipeline — orchestrated by a central Control Plane.
Component catalogue (a) – (n)
Authentication & Authorisation
JWT-based access control; session binding to user identity before any gateway action.
Dashboard UI
React/Next.js SOC monitoring dashboard; real-time event feeds, trust graphs, decision logs.
Chat Interface
User-facing AI interaction surface; all requests routed through the gateway pipeline.
Model Registry
Central catalogue of available models with metadata: source, version, capability flags, licence.
Model Readiness Service
Evaluates and tracks model state across six posture states; gates inference access.
Posture Assessment Engine
Runs automated checks (vulnerability scan, behavioural baseline, licence risk) on model registration and on schedule.
Control Plane
Orchestrates pipeline stages; routes requests between inspection, policy, and inference services.
Prompt Guard
Detects prompt injection, jailbreak patterns, and adversarial markers using rule sets and heuristic scoring.
Policy Engine
Applies the effective-risk function and threshold tables; issues ALLOW / CHALLENGE / BLOCK.
Trust Scoring Service
Maintains per-user session trust; applies incremental updates (+1 / −5 / −15 / −30) per interaction outcome.
Cross-Model Intelligence
Aggregates signals across sessions and models; surfaces multi-model attack campaigns.
Output Guard
Post-inference filter for PII leakage, sensitive data disclosure, and adversarial response patterns.
Audit Log Service
Persists all security events to PostgreSQL; structured for SIEM export and forensic analysis.
Research Evaluation Module
Scenario runner for evaluation corpus tests; collects metrics for academic and operational benchmarking.
// five-layer architecture
Layered Pipeline Design
The fourteen components are grouped into five logical layers. Data flows sequentially — a failure at any layer halts downstream access.
Model Onboarding Layer
Evaluates open-source AI models before deployment — checking model posture, licence risk, known vulnerabilities, and behavioural baselines. Only READY-state models enter the active serving pool. Components: (d) Model Registry, (e) Readiness Service, (f) Posture Assessment Engine.
Zero Trust Enforcement Layer
Applies never-trust-always-verify to every request. No implicit trust is granted — every prompt is inspected against policy rules, user behavioural history, and contextual risk signals before a decision is made. Components: (g) Control Plane, (h) Prompt Guard, (i) Policy Engine.
Risk Reduction Layer
Applies protective measures when risk scores exceed thresholds — restricting capabilities, issuing CHALLENGE responses, sandboxing model access, or escalating to human review before inference proceeds. Components: (i) Policy Engine, (j) Trust Scoring.
Adaptive Reassessment Layer
Continuously re-evaluates trust and risk as sessions evolve. Repeated risky behaviour, stale model conditions, and anomalous usage patterns trigger reassessment and dynamic policy adjustment. Components: (j) Trust Scoring, (k) Cross-Model Intelligence.
Explanation & Audit Layer
Every security decision is logged with a human-readable explanation — risk signals detected, policy rule applied, decision outcome, and confidence score. Feeds the SOC monitoring dashboard for real-time review. Components: (l) Output Guard, (m) Audit Log Service, (n) Research Evaluation.
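The audit record described for this layer could be modelled roughly as follows. This is a minimal sketch, not the gateway's actual schema: the class name `AuditRecord`, its field names, and the `policy_rule` identifier format are all assumptions made for illustration. Only the categories of information (risk signals, policy rule, outcome, confidence, plain-English summary) come from the text.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical shape of one explainable enforcement decision."""
    request_id: str
    decision: str            # "ALLOW", "CHALLENGE", or "BLOCK"
    policy_mode: str         # "permissive", "standard", or "strict"
    policy_rule: str         # assumed identifier for the rule that fired
    risk_score: float        # R_effective on the 0-100 scale
    confidence: float        # assumed to lie in [0, 1]
    risk_components: dict    # e.g. {"R_prompt": 91.0, "T_trust": 60.0}
    summary: str             # plain-English rationale for SOC analysts
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialise with stable key order so records are easy to diff or export."""
        return json.dumps(asdict(self), sort_keys=True)
```

A record built this way serialises to a single JSON object per decision, which is one plausible format for dashboard feeds and SIEM export.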
// model readiness states
Six Posture States
Every model in the registry exists in one of six mutually exclusive states. The Readiness Service evaluates posture on registration, on schedule (every 72 hours), and on-demand when anomalies are detected.
READY
Passed all posture checks; eligible for active inference.
EVALUATING
Initial posture assessment in progress; inference gated pending result.
DEGRADED
One or more posture signals declined; access restricted, CHALLENGE mode enforced.
QUARANTINED
Critical posture failure detected; no inference permitted, remediation required.
SUSPENDED
Operator-initiated hold; model removed from pool pending review.
REVOKED
Permanently removed; associated policy and audit records retained.
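The six states and their effect on inference access can be sketched as an enumeration with a gating function. The state names come from the list above; the function name `serving_gate` and its three return labels (`ELIGIBLE`, `RESTRICTED`, `GATED`) are hypothetical and only summarise the access rules as described.

```python
from enum import Enum


class PostureState(Enum):
    READY = "READY"
    EVALUATING = "EVALUATING"
    DEGRADED = "DEGRADED"
    QUARANTINED = "QUARANTINED"
    SUSPENDED = "SUSPENDED"
    REVOKED = "REVOKED"


def serving_gate(state: PostureState) -> str:
    """Map a posture state to a coarse serving outcome (labels are assumptions)."""
    if state is PostureState.READY:
        return "ELIGIBLE"     # may serve; requests still pass through policy
    if state is PostureState.DEGRADED:
        return "RESTRICTED"   # access restricted, CHALLENGE mode enforced
    # EVALUATING, QUARANTINED, SUSPENDED and REVOKED all keep the model
    # out of the active serving pool.
    return "GATED"
```

Treating every non-READY, non-DEGRADED state as gated keeps the default closed: a state added later would be denied unless it is explicitly allowed.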
// risk model
The Effective-Risk Function
Every request is evaluated against a deterministic weighted risk function. The output, a score from 0 to 100, maps directly to a policy decision. The formula is fully auditable: each component is surfaced in the decision record.
Effective-risk function
R_effective = w₁·R_prompt + w₂·R_model + w₃·A_sequence + w₄·C_cross-model − w₅·T_trust − w₆·E_controls
Component weights & semantics
R_prompt
Prompt risk — highest weight, direct injection and adversarial signal.
R_model
Model posture risk — degradation score from readiness service.
A_sequence
Sequence anomaly — behavioural deviation from established baseline.
C_cross-model
Cross-model intelligence signal — correlated attack pattern score.
T_trust
Session trust — accumulated clean-interaction credit (reduces risk).
E_controls
Active controls — applied mitigations that reduce residual risk.
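The formula above can be sketched in a few lines. The weight values below are placeholders chosen only so the example runs; the empirically derived weights are not listed here. The sketch also assumes that every component is scored on a 0 to 100 scale and that the result is clamped to that range, neither of which is stated explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskWeights:
    # Placeholder weights for illustration only (not the derived values).
    prompt: float = 0.40       # w1, highest weight per the component list
    model: float = 0.15        # w2
    sequence: float = 0.15     # w3
    cross_model: float = 0.10  # w4
    trust: float = 0.10        # w5, subtracted
    controls: float = 0.10     # w6, subtracted


def effective_risk(r_prompt: float, r_model: float, a_sequence: float,
                   c_cross: float, t_trust: float, e_controls: float,
                   w: RiskWeights = RiskWeights()) -> float:
    """Weighted sum of risk terms minus trust and control credits."""
    raw = (w.prompt * r_prompt
           + w.model * r_model
           + w.sequence * a_sequence
           + w.cross_model * c_cross
           - w.trust * t_trust
           - w.controls * e_controls)
    # Assumed clamp so the score stays on the documented 0-100 scale.
    return max(0.0, min(100.0, raw))
```

Because trust and active controls enter with a negative sign, raising either one can only lower the score, which is the behaviour the component descriptions imply.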
// policy engine
Three Policy Modes
Operators select a policy posture at deployment. The engine applies the corresponding threshold table to every R_effective score — no code changes required to switch modes.
Policy threshold table
| Mode | ALLOW | CHALLENGE | BLOCK |
|---|---|---|---|
| Permissive | ≤ 59 | 60 – 79 | ≥ 80 |
| Standard | ≤ 39 | 40 – 69 | ≥ 70 |
| Strict | ≤ 29 | 30 – 54 | ≥ 55 |
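The threshold table lends itself to a data-driven lookup, which is one way mode switching could work without code changes. The following is a sketch under that assumption. The names `POLICY_TABLE` and `decide` are illustrative, and scores that fall between integer band edges (for example 39.5) are assumed to belong to the higher band.

```python
# (ALLOW upper bound, CHALLENGE upper bound) per mode, taken from the table.
POLICY_TABLE = {
    "permissive": (59, 79),
    "standard": (39, 69),
    "strict": (29, 54),
}


def decide(risk_score: float, mode: str) -> str:
    """Return ALLOW, CHALLENGE, or BLOCK for a 0-100 risk score."""
    allow_max, challenge_max = POLICY_TABLE[mode]
    if risk_score <= allow_max:
        return "ALLOW"
    if risk_score <= challenge_max:
        return "CHALLENGE"
    return "BLOCK"
```

For example, a score of 74 yields BLOCK in Standard and Strict modes but only CHALLENGE in Permissive mode.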
Policy decision pseudocode
function evaluate_request(request, policy_mode):
    R = compute_effective_risk(
        R_prompt = prompt_guard.score(request),
        R_model = readiness_service.risk(request.model_id),
        A_sequence = trust_service.sequence_anomaly(request.user_id),
        C_cross = cross_model_intel.correlation_score(request),
        T_trust = trust_service.get_trust(request.user_id),
        E_controls = controls.active_score(request.session_id)
    )
    thresholds = POLICY_TABLE[policy_mode]
    if R <= thresholds.allow:
        decision = ALLOW
        trust_service.update(request.user_id, +1)
    elif R <= thresholds.challenge:
        decision = CHALLENGE
        trust_service.update(request.user_id, -5)
    else:
        decision = BLOCK
        trust_service.update(request.user_id, -15)
    audit_log.write({
        request_id: request.id,
        risk_score: R,
        components: {...},
        decision: decision,
        policy_mode: policy_mode,
        explanation: explain(R, decision)
    })
    return decision
// trust scoring
Session Trust Dynamics
Trust delta per interaction outcome
Clean interaction: +1
Slow trust accumulation for consistent benign users.
Suspicious prompt detected: −5
Moderate decay; borderline-risk content flags caution.
Request blocked: −15
Significant decay; confirmed policy violation.
Critical violation (confirmed injection / jailbreak): −30
Severe decay; trust often collapses to zero within 2–3 incidents.
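The four deltas used by the Trust Scoring Service (+1, −5, −15, −30) can be applied with a small update function. This is a sketch only: the outcome keys, the function names, and the upper bound of 100 are assumptions, since the text specifies the deltas and the collapse to zero but not a ceiling.

```python
# Deltas per interaction outcome, as stated in the text.
TRUST_DELTAS = {"clean": 1, "suspicious": -5, "blocked": -15, "critical": -30}

# Assumed bounds; the source only describes collapse to zero.
TRUST_MIN, TRUST_MAX = 0, 100


def apply_outcome(trust: int, outcome: str) -> int:
    """Return session trust after one interaction, clamped to the assumed bounds."""
    return max(TRUST_MIN, min(TRUST_MAX, trust + TRUST_DELTAS[outcome]))


def interactions_to_zero(start: int, outcome: str) -> int:
    """Count consecutive interactions of one outcome needed to reach zero trust."""
    if TRUST_DELTAS[outcome] >= 0:
        raise ValueError("only decaying outcomes can drive trust to zero")
    trust, steps = start, 0
    while trust > TRUST_MIN:
        trust = apply_outcome(trust, outcome)
        steps += 1
    return steps
```

Starting from the default of 60, `interactions_to_zero(60, "blocked")` gives 4 and `interactions_to_zero(60, "critical")` gives 2, matching the collapse figures quoted for −15 and −30 decay.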
Trust dynamics — Figure 5.1 description
Initial state
Sessions initialise at T_trust = 60. External and anonymous users may be initialised lower (recommended: 30) to reduce exploitation window.
Decay trajectory
At −15 per blocked request, a session starting at T_trust = 60 collapses to zero after 4 consecutive blocks. At −30 (critical violation), collapse occurs in 2 requests. Evaluation showed 7-request median collapse for typical attack sessions.
Recovery
Trust recovers at +1 per clean interaction — intentionally asymmetric. A session at zero requires 30+ clean interactions to reach ALLOW-tier trust, preventing rapid trust-reset abuse.
Cross-model component
C_cross-model activates when 3+ correlated anomalies appear across different models in the same session window — raising R_effective by up to 25 points independent of per-model scores.
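The activation rule for the cross-model component might look like the sketch below. Several details are assumptions rather than documented behaviour: "correlated" is simplified to "observed within the same time window", the threshold is counted in distinct models, the window length is arbitrary, and the linear scaling (`points_per_model`) is invented for illustration. Only the 3-event threshold and the 25-point cap come from the text.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Anomaly:
    model_id: str
    timestamp: float  # seconds since an arbitrary epoch


def cross_model_boost(anomalies: Iterable[Anomaly], now: float,
                      window_seconds: float = 600.0,   # assumed window length
                      min_models: int = 3,             # threshold from the text
                      points_per_model: float = 5.0,   # assumed scaling
                      max_boost: float = 25.0) -> float:
    """Return the points added to R_effective by cross-model correlation."""
    recent_models = {
        a.model_id for a in anomalies
        if 0 <= now - a.timestamp <= window_seconds
    }
    if len(recent_models) < min_models:
        return 0.0  # below the activation threshold
    return min(max_boost, points_per_model * len(recent_models))
```

Counting distinct models rather than raw events means repeated anomalies against a single model never activate the component, which leaves that case to the per-model detectors.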
// threat model
STRIDE Analysis (Table 3.1)
A structured threat modelling exercise maps each STRIDE category to its gateway control — ensuring no threat vector is left without a corresponding mitigation in the architecture.
STRIDE threat-to-control mapping
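| STRIDE category | Control | Component |
|---|---|---|
| Spoofing | JWT validation, session identity binding | Authentication & Auth |
| Tampering | Prompt Guard, input sanitisation rules | Prompt Guard |
| Repudiation | Immutable audit log with request hashes | Audit Log Service |
| Information Disclosure | Output Guard, PII masking, response filtering | Output Guard |
| Denial of Service | Rate limiting, readiness gating, circuit breaker | Readiness Service |
| Elevation of Privilege | RBAC enforcement, policy-engine decision binding | Policy Engine |
// evaluation design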
Evaluation Corpus & Scenarios
Evaluation corpus — Table 3.2 (130 prompts, 7 categories)
Normal Usage
Benign, legitimate AI queries with no adversarial intent.
Prompt Injection
Direct injection attempts targeting model instruction override.
Jailbreak Attempts
Structured attempts to bypass safety layers or system prompts.
Data Exfiltration
Requests designed to extract training data or sensitive context.
Roleplay Manipulation
Persona hijacking and roleplay-as-a-character abuse patterns.
Adversarial Edge Cases
Novel or obfuscated attack patterns not covered by known signatures.
Cross-Model Attacks
Multi-model exploitation patterns requiring correlation to detect.
Total prompts: 130
Corpus design principles
All prompts manually crafted — no synthetic generation — to ensure authenticity of attack patterns.
Normal-usage prompts drawn from real AI interaction logs to represent realistic FPR conditions.
Adversarial edge cases independently reviewed by a second researcher for category accuracy.
Cross-model attacks require multi-session context to evaluate C_cross-model activation correctly.
Each prompt assigned a ground-truth label (ALLOW / CHALLENGE / BLOCK) before system evaluation.
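Given labelled prompts and the gateway's decisions, the two headline metrics can be computed as sketched below. The `ScoredPrompt` type and the function names are hypothetical. The sketch assumes an attack counts as detected only when the decision is BLOCK, and that a benign prompt counts as a false positive when it is challenged or blocked.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ScoredPrompt:
    is_attack: bool   # derived from the ground-truth label
    decision: str     # gateway output: "ALLOW", "CHALLENGE", or "BLOCK"


def attack_block_rate(results: Sequence[ScoredPrompt]) -> float:
    """Percentage of attack prompts that the gateway blocked."""
    attacks = [r for r in results if r.is_attack]
    if not attacks:
        raise ValueError("no attack prompts in results")
    blocked = sum(r.decision == "BLOCK" for r in attacks)
    return 100.0 * blocked / len(attacks)


def false_positive_rate(results: Sequence[ScoredPrompt]) -> float:
    """Percentage of benign prompts that were challenged or blocked."""
    benign = [r for r in results if not r.is_attack]
    if not benign:
        raise ValueError("no benign prompts in results")
    flagged = sum(r.decision in ("CHALLENGE", "BLOCK") for r in benign)
    return 100.0 * flagged / len(benign)
```

Keeping the two denominators separate matters: the block rate is measured over attack prompts only, and the false positive rate over benign prompts only.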
Eight evaluation scenarios — Table 3.3
Normal User — Benign Session
ALLOW
25 benign requests from a high-trust session. Gateway allows all; trust accumulates at +1 per interaction.
Direct Prompt Injection
BLOCK
Explicit instruction override payload. R_prompt spikes to 91; Standard mode BLOCK issued within 12 ms.
ZT vs. No-ZT Comparison
BLOCK
Identical prompt evaluated with and without Zero Trust active. Without ZT: ALLOW. With ZT: BLOCK. 33.8pp improvement confirmed.
Repeated Risky Behaviour
BLOCK
User sends escalating borderline requests. Trust decays from 60 to 0 over 7 interactions; session terminated.
Stale Model Posture
CHALLENGE
Model readiness state transitions to DEGRADED between requests. Policy engine detects state change; request CHALLENGED.
Cross-Model Attack Campaign
BLOCK
Same user probes three models sequentially with varied prompts. Cross-model intelligence layer identifies pattern; C_cross-model raises effective risk to 74.
Trust Recovery Post-Block
CHALLENGE → ALLOW
User with trust=0 resumes with clean requests. System issues CHALLENGE for first 15 interactions; trust recovers slowly to ALLOW threshold.
Jailbreak via Roleplay
CHALLENGE → BLOCK
Subtle roleplay-as-character jailbreak. First attempt CHALLENGED; follow-up with increased pressure BLOCKED. A_sequence anomaly drives escalation.
// evaluation results
Quantitative Results
Results demonstrate measurable, operationally significant security improvement across both Standard and Strict policy modes against the 130-prompt evaluation corpus.
Summary results (Tables 5.2–5.7 — consolidated)
Attack block rate (Standard mode): 91.3%
105 / 115 attack prompts correctly blocked
False positive rate (Standard mode): 4.7%
1.2 of 25 benign prompts challenged/blocked (avg)
Attack block rate (Strict mode): 96.8%
111 / 115 attack prompts correctly blocked
False positive rate (Strict mode): 11.4%
2.9 of 25 benign prompts challenged/blocked (avg)
Baseline (no Zero Trust) block rate: 57.5%
Static rule system; no adaptive scoring
Improvement over baseline (Standard): +33.8 pp
91.3% − 57.5% = 33.8 percentage-point gain
Trust decay to zero (blocked session): 4 consecutive blocks
Starting at T_trust = 60, blocking at −15 each
Policy decision latency (median): ≤ 47 ms
Acceptable for production inference pipelines
Discussion
The 33.8 percentage-point improvement over the static-rule baseline supports the central thesis: adaptive behavioural scoring, combined with posture-based model assessment and session trust dynamics, substantially outperforms fixed threshold approaches. Strict mode's higher block rate (96.8%) comes with an elevated FPR (11.4%), the usual precision-recall trade-off when thresholds are tightened; the gap is expected to narrow as embedding-based prompt analysis is integrated in future work. Median policy-decision latency (≤ 47 ms) indicates that Zero Trust enforcement is operationally viable alongside real-time AI inference.
// failure modes & limitations
Five Known Failure Modes
Responsible disclosure of the system's documented limitations — each with a concrete mitigation pathway for future implementation.
Novel Injection Evasion
Carefully crafted obfuscations that do not match known injection signatures may pass Prompt Guard with a low R_prompt, and are more likely to be allowed when high session trust further lowers R_effective.
Mitigation
Lower w₅ trust weight in Strict mode; supplement with embedding-similarity detectors.
Trust Reset Exploitation
An attacker who identifies the trust decay thresholds can pace attacks to avoid triggering −30 critical violations, slowly probing while staying below the BLOCK threshold.
Mitigation
Introduce non-deterministic decay jitter; add A_sequence decay on patterned slow probing.
Cross-Model Cascade Delay
The cross-model intelligence layer requires at least 3 correlated events before C_cross-model activates, leaving a detection window for early-stage multi-model campaigns.
Mitigation
Reduce correlation threshold to 2 events; add proactive CHALLENGE on first cross-model detection.
Policy Bypass via Edge-Case Tokens
Tokenisation edge cases — Unicode homoglyphs, invisible characters — may cause Prompt Guard to miss injections that models interpret adversarially.
Mitigation
Pre-normalise all input through Unicode NFC/NFKC before Prompt Guard; add character-class filtering.
Latency Spike Under Load
Under sustained high concurrency, the policy evaluation pipeline incurs latency above the 47 ms median, potentially degrading user experience.
Mitigation
Implement async policy evaluation with caching for recent-seen prompts; add horizontal scaling for the Policy Engine.
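The Unicode pre-normalisation mitigation proposed for edge-case tokens can be sketched with Python's standard `unicodedata` module. The function name `normalise_prompt` is illustrative. Dropping every format-class (`Cf`) character is an assumption about what "character-class filtering" might include, and it can alter legitimate text such as emoji sequences or scripts that rely on joiner characters.

```python
import unicodedata


def normalise_prompt(text: str) -> str:
    """Fold compatibility forms with NFKC, then drop invisible format characters."""
    folded = unicodedata.normalize("NFKC", text)
    # Category "Cf" covers zero-width spaces and joiners, soft hyphens, BOMs, etc.
    return "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")
```

NFKC folds compatibility variants such as fullwidth Latin letters into their ASCII equivalents, but it does not map cross-script homoglyphs (a Cyrillic "о" stays Cyrillic), so a separate confusables check would still be needed for those.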
// key findings
Five Findings from Chapter 6
Zero Trust enforcement raises attack-block rate by 33.8 percentage points over the static-rule baseline — confirming that adaptive risk scoring substantially outperforms fixed thresholds.
Standard mode (Allow ≤ 39 / Challenge 40–69 / Block ≥ 70) achieves a practical balance: 91.3% attack detection at only 4.7% false positive cost, making it suitable for production deployment.
Strict mode improves detection to 96.8% but introduces 11.4% FPR — a meaningful operational trade-off that operators must calibrate against their user trust profile.
The trust decay mechanism (−15 per blocked request) reliably collapses attacker trust to zero within seven attempts starting from the gateway default of T_trust = 60, even without operator intervention.
Explainability infrastructure enables 100% audit coverage — every decision is traceable to specific risk signals and policy rules, providing the SOC visibility absent from comparable prior systems.
// recommendations for practice
Six Deployment Recommendations
Deploy in Standard mode initially. Migrate to Strict only after a calibration period establishing your user FPR baseline — avoid over-blocking during onboarding.
Enable cross-model intelligence when serving more than three models concurrently. Single-model deployments gain minimal benefit from the C_cross-model component.
For external or anonymous user pools, initialise T_trust at 30 (not the default 60) to reduce the window for trust-exploitation attacks before detection.
Review audit logs daily for the first two weeks post-deployment to identify FPR clusters — particular prompt patterns that are triggering false blocks — and tune thresholds accordingly.
Integrate the Audit Log Service with your SIEM via structured syslog export. The schema is designed for Splunk and Elastic ingestion without transformation.
Schedule posture reassessment every 72 hours on active models, not just at registration. Community model repositories can push updates that silently degrade posture between evaluations.
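Several of these recommendations reduce to a handful of settings. The sketch below collects them in one place. The class name, field names, and helper methods are hypothetical and do not reflect the gateway's actual configuration interface; only the values (Standard mode, trust of 60 or 30, 72 hours, more than three models) come from the recommendations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class GatewayConfig:
    """Hypothetical settings object mirroring the recommendations above."""
    policy_mode: str = "standard"                      # start in Standard mode
    default_initial_trust: int = 60                    # gateway default
    external_initial_trust: int = 30                   # external or anonymous users
    posture_interval: timedelta = timedelta(hours=72)  # reassessment cadence
    cross_model_min_models: int = 4                    # "more than three models"

    def initial_trust(self, external_user: bool) -> int:
        """Pick the starting T_trust for a new session."""
        if external_user:
            return self.external_initial_trust
        return self.default_initial_trust

    def cross_model_enabled(self, models_served: int) -> bool:
        """Enable cross-model intelligence only for larger model pools."""
        return models_served >= self.cross_model_min_models

    def posture_check_due(self, last_assessed: datetime, now: datetime) -> bool:
        """True once the reassessment interval has elapsed."""
        return now - last_assessed >= self.posture_interval
```

Keeping these values in a single immutable object makes it straightforward to record the active configuration alongside audit logs during the calibration period.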
// future work
Seven Research Directions
Embedding-Based Prompt Analysis
Supplement rule-based Prompt Guard with semantic embedding similarity to known-bad prompt libraries, improving detection of novel obfuscated injections.
Live Threat Intelligence Feeds
Integrate external threat feeds and real-time adversarial prompt databases into the inspection pipeline for zero-day injection pattern coverage.
Deep Output Inspection
Expand Output Guard with PII entity detection (NER), sensitive data leakage classifiers, and model-specific adversarial response fingerprinting.
Red-Team Evaluation Suite
Evaluate the gateway against structured red-team AI security benchmarks — PromptBench, HarmBench — and production-representative adversarial workloads.
Multi-Provider Unified Policy
Extend gateway routing to enforce a single Zero Trust policy layer across multiple AI providers (Ollama, Hugging Face, OpenAI-compatible endpoints) under a common control plane.
Federated Enterprise Deployment
Design distributed gateway topology for multi-tenant enterprise AI access control — decentralised policy nodes with centralised audit aggregation.
Federated Learning for Threat Models
Explore privacy-preserving federated learning to share attack pattern intelligence across gateway deployments without exposing raw event data.
View the build
The active implementation is tracked on GitHub — architecture, backend, frontend, and evaluation scripts.
// references
Key References
- [1] Rose, S., Borchert, O., Mitchell, S., & Connelly, S. (2020). Zero Trust Architecture. NIST SP 800-207. National Institute of Standards and Technology.
- [2] Kindervag, J. (2010). Build Security Into Your Network's DNA: The Zero Trust Network Architecture. Forrester Research.
- [3] Perez, F., & Ribeiro, I. (2022). Ignore Previous Prompt: Attack Techniques For Language Models. arXiv:2211.09527.
- [4] Greshake, K., et al. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv:2302.12173.
- [5] Huang, Y., et al. (2023). TrustGPT: A Benchmark for Trustworthy and Responsible Large Language Models. arXiv:2306.11507.
- [6] Weidinger, L., et al. (2021). Ethical and Social Risks of Harm from Language Models. DeepMind Technical Report.
- [7] Dolan-Gavitt, B., et al. (2016). Architectural Support for Dynamic Vulnerability Analysis. IEEE S&P.
- [8] Shostack, A. (2014). Threat Modeling: Designing for Security. Wiley. [STRIDE methodology source]
- [9] Microsoft. (2023). Azure AI Content Safety Documentation. Microsoft Azure.
- [10] OWASP. (2023). OWASP Top 10 for Large Language Model Applications v1.0. OWASP Foundation.
Full reference list available in the GitHub repository documentation.