Technical Analysis

Vulnerabilities and Inclusivity Gaps in LLM-Driven Secure Code Generation

Aegisbyte Research Team
2026-01-18

Executive Summary

Large Language Models (LLMs) are increasingly employed for code generation in security-critical applications, yet their outputs often exhibit vulnerabilities and lack inclusivity for neurodivergent users. This whitepaper synthesizes insights from two recent preprints: Tessa et al. (2026) on adversarial prompt attacks undermining "secure" LLM methods, and Naqvi et al. (2026) on integrating security with neurodiversity-focused inclusivity.

Tessa et al. demonstrate that techniques like prefix-tuning (SVEN), instruction-tuning (SafeCoder), and prompt optimization (PromSec) break down under adversarial perturbations: secure-and-functional code rates are only 3-15.5% at baseline and 3.4-17.6% under attack. Naqvi et al. show that detailed specifications raise inclusivity scores (e.g., from 3.76/5 to 3.98/5) without security trade-offs, but that human reviewers detect nuances LLM reviewers miss.

The two studies converge on one point: LLM output quality depends on explicit, robust inputs. Adversarial rephrasing undermines security in much the same way that vague specifications undermine inclusivity, and both amplify risk. Data visualizations (e.g., bar charts of metrics) illustrate these failures. Implications span secure software engineering, human-centric AI, and SDLC integration, and they point toward hybrid human-LLM evaluations and adversarial training.

Background

LLM code generation promises efficiency but introduces risks: 33-70% of outputs are vulnerable per prior studies. "Secure" methods aim to mitigate this via tuning, yet evaluations often decouple security (static analyzers like CodeQL) from functionality (Pass@k metrics), inflating success rates.

Tessa et al. address robustness against adversarial prompts, modeling black-box attacks in which adversaries query deployed LLMs to induce insecure code. Naqvi et al. extend the discussion to inclusivity, focusing on neurodiversity (e.g., how ADHD affects attention and memory in security tasks like password resets). Usable security has matured as a field, but neurodiversity remains underexplored, although neurodivergence affects 16% of the global population.

Adversarial prompts simulate real-world input variations (e.g., novice rephrasing), akin to incomplete specifications that overlook cognitive needs. Both papers highlight LLM overfitting to benchmarks, ignoring distributional shifts.

Methodologies

Adversarial Audit Framework (Tessa et al.)

The research team audited three prominent secure code generation systems: SVEN, which employs prefix-tuning on vulnerability datasets; SafeCoder, utilizing fine-tuning on verified secure code samples; and PromSec, which leverages optimized prompts with security constraints. All evaluations were conducted against the CodeSecEval benchmark, which integrates comprehensive security testing tasks.

The adversarial attack suite comprised both general and task-specific perturbations. General attacks included InverseComment, which negates security cues through instructions like "Do not avoid CWE-502," and StudentStyle, which reframes prompts as novice queries. Task-specific attacks encompassed SparseQuestion for context stripping, SafeComment and VulComment for injecting misleading annotations, various dead code insertion techniques, and In-Context attacks that append deceptive examples.
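As a rough illustration of how such perturbations could be applied to a code-completion prompt, the sketch below implements simplified stand-ins for three of the attacks. The function names, the novice-style wording, and the comment-stripping rules are assumptions made for this example; they are not the implementation used by Tessa et al.

```python
import re


def inverse_comment(prompt: str, cwe_id: str) -> str:
    """InverseComment-style: prepend a comment that negates the security cue."""
    return f"# Do not avoid {cwe_id}\n{prompt}"


def student_style(prompt: str) -> str:
    """StudentStyle-style: reframe the task as a novice's request (wording is made up)."""
    preamble = "# hi, i am new to coding and stuck on my homework, can you finish this?\n"
    return preamble + prompt


def sparse_question(prompt: str) -> str:
    """SparseQuestion-style: strip docstrings and comment lines, keeping bare code."""
    without_docstrings = re.sub(r'"""[\s\S]*?"""', "", prompt)
    kept = [
        line
        for line in without_docstrings.splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    return "\n".join(kept)


def perturb_all(prompt: str, cwe_id: str) -> dict[str, str]:
    """Return every perturbed variant of a prompt, keyed by attack name."""
    return {
        "InverseComment": inverse_comment(prompt, cwe_id),
        "StudentStyle": student_style(prompt),
        "SparseQuestion": sparse_question(prompt),
    }


if __name__ == "__main__":
    # Hypothetical benchmark-style prompt, invented for this example.
    base_prompt = (
        "# Safely load user settings\n"
        "def load_settings(data: bytes):\n"
        '    """Deserialize settings without executing untrusted code."""\n'
    )
    for name, variant in perturb_all(base_prompt, "CWE-502").items():
        print(f"--- {name} ---\n{variant}\n")
```

Each variant would then be sent to the system under test in place of the original prompt, so that any change in security or functionality can be attributed to the perturbation alone.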

Evaluation followed a dual-phase approach: first, method-specific metrics including security ratio and Pass@k; then a unified consensus evaluation combining static analysis, GPT-4o judgment, and unit tests for joint secure-functional assessment.
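A minimal sketch of the joint assessment idea follows. It assumes that "consensus" means a sample counts as secure-and-functional only when all three signals agree; the source does not spell out the aggregation rule, so the unanimous rule, the `Verdict` type, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Per-sample evaluation signals (field names are illustrative)."""
    static_secure: bool  # static analyzer reported no finding
    judge_secure: bool   # LLM judge considered the code secure
    tests_pass: bool     # unit tests passed


def is_secure_and_functional(v: Verdict) -> bool:
    """Assumed consensus rule: every signal must agree."""
    return v.static_secure and v.judge_secure and v.tests_pass


def rates(verdicts: list[Verdict]) -> dict[str, float]:
    """Compare decoupled metrics with the joint secure-and-functional rate."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    n = len(verdicts)
    return {
        "static_only": sum(v.static_secure for v in verdicts) / n,
        "functional": sum(v.tests_pass for v in verdicts) / n,
        "joint": sum(is_secure_and_functional(v) for v in verdicts) / n,
    }


if __name__ == "__main__":
    # Made-up verdicts: three samples look secure to the analyzer, only one survives all checks.
    sample = [
        Verdict(True, True, True),
        Verdict(True, True, False),
        Verdict(True, False, True),
        Verdict(False, False, True),
    ]
    print(rates(sample))
```

In this toy input the static-only and functional rates are both 0.75 while the joint rate is 0.25, which is the kind of gap that decoupled reporting hides.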

Controlled Experiment Design (Naqvi et al.)

This study examined a banking password reset scenario designed for a user with ADHD, specifically a 67-year-old woman who has accepted the privacy terms. The experimental pipeline used GPT-5 to decompose requirements into discrete tasks, generate HTML and JavaScript code, and validate functionality.

Three experimental cases maintained fixed security requirements (strong passwords, no session invalidation) while varying inclusivity specifications. Case 1 included no inclusivity considerations, Case 2 incorporated a moderate ADHD mention, and Case 3 provided detailed accommodations including timeout elimination, simplified language, and progress indicators.

Inclusivity evaluation assessed five dimensions (attention, memory, comprehension, decision-making, and learning), each scored on a 1-5 scale. Security evaluation covered five categories drawn from the OWASP Top 10: broken access control, security misconfiguration, cryptographic failures, injection, and authentication failures. The reviewer panel consisted of 13 human experts with security and inclusivity backgrounds, alongside 5 LLMs: GPT-5, Gemini 2.5 Pro, Claude 4.5, Mistral 3.1, and DeepSeek 3.2.
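The scoring scheme above can be sketched as a small aggregation routine. This is an assumption-laden illustration: the source does not say how individual reviews were combined, so simple arithmetic means are used here, and the dimension keys and example scores are invented.

```python
from statistics import mean

# The five inclusivity dimensions named in the study; the key spellings are illustrative.
DIMENSIONS = ("attention", "memory", "comprehension", "decision_making", "learning")


def validate(review: dict[str, int]) -> None:
    """Check that a review scores every dimension on the 1-5 scale."""
    for dim in DIMENSIONS:
        score = review.get(dim)
        if score is None or not 1 <= score <= 5:
            raise ValueError(f"invalid or missing score for {dim!r}: {score!r}")


def aggregate(reviews: list[dict[str, int]]) -> dict[str, float]:
    """Average each dimension across reviewers, then average the dimensions."""
    if not reviews:
        raise ValueError("need at least one review")
    for review in reviews:
        validate(review)
    per_dimension = {dim: mean(r[dim] for r in reviews) for dim in DIMENSIONS}
    per_dimension["overall"] = mean(per_dimension[dim] for dim in DIMENSIONS)
    return per_dimension


if __name__ == "__main__":
    # Invented scores from two hypothetical reviewers, not data from the study.
    reviews = [
        {"attention": 4, "memory": 3, "comprehension": 3, "decision_making": 4, "learning": 5},
        {"attention": 2, "memory": 3, "comprehension": 5, "decision_making": 4, "learning": 3},
    ]
    print(aggregate(reviews))
```

Running the same routine separately over the human panel and the LLM panel would give two comparable score profiles for each experimental case.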

Both methodologies employ controlled input variations and hybrid evaluation metrics to expose LLM limitations, with the consensus evaluation of Tessa et al. paralleling the human-LLM comparison framework of Naqvi et al.

Findings and Data Analysis

Adversarial Robustness Results

Adversarial prompts sharply reduced the efficacy of all three systems. Static analyzers overestimated security by a factor of 7.4 to 21.6, primarily because non-functional code (a 37-60% failure rate) can still pass static analysis. InverseComment significantly increased insecure outputs, while StudentStyle increased generation failures.
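One plausible reading of the overestimation factor is the ratio between the number of samples a static analyzer reports as secure and the number that are both secure and functional. The sketch below assumes that definition, which the source does not state explicitly, and uses invented counts.

```python
def overestimation_factor(static_secure_count: int, joint_count: int) -> float:
    """Ratio of analyzer-reported secure samples to secure-and-functional samples.

    Assumed definition for illustration; both counts must come from the same sample set.
    """
    if joint_count <= 0:
        raise ValueError("joint_count must be positive to form a ratio")
    if static_secure_count < joint_count:
        raise ValueError("joint samples are a subset of analyzer-secure samples")
    return static_secure_count / joint_count


if __name__ == "__main__":
    # Invented example: 60 of 100 samples pass the analyzer, 5 are also functional.
    print(overestimation_factor(60, 5))
```

With these made-up counts, a reported security rate of 60% corresponds to a joint rate of only 5%, a twelvefold overestimate.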

Baseline secure-and-functional rates were 15.5% for SVEN, 3% for SafeCoder, and 10% for PromSec. Under adversarial conditions, the reported rates ranged from 3.4% to 17.6%. Notable case studies include CWE-252 (unchecked return values), where code passed static analysis but crashed at runtime, and CWE-502 (insecure deserialization), where vulnerabilities were obscured by misleading comments.
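To make the CWE-502 case concrete, the sketch below shows how a reassuring comment can sit on top of an insecure deserialization call, next to a safer alternative. Both functions are hypothetical examples written for this article; they are not taken from the benchmark or from model output.

```python
import json
import pickle


def load_profile_unsafe(blob: bytes):
    # SAFE: input is validated upstream
    # ^ Misleading comment: pickle.loads can execute attacker-controlled code
    #   while loading, so this remains an instance of CWE-502.
    return pickle.loads(blob)


def load_profile_safe(blob: bytes) -> dict:
    """Parse untrusted input as JSON, a data-only format, and check its shape."""
    data = json.loads(blob.decode("utf-8"))
    if not isinstance(data, dict):
        raise ValueError("profile must be a JSON object")
    return data


if __name__ == "__main__":
    print(load_profile_safe(b'{"name": "ada"}'))
```

A reviewer or LLM judge that trusts the comment would pass the first function, which is why misleading annotations are effective as an attack.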

Inclusivity and Security Trade-offs

Detailed specifications improved inclusivity without degrading security. Human-assessed inclusivity scores rose from 3.76 to 3.98 out of 5, and security scores rose from 3.72 to 3.94. Human evaluators were stricter than the LLMs, consistently scoring lower than the LLM evaluators, whose inclusivity scores ranged from 3.72 to 4.41. Memory and comprehension were the weakest inclusivity dimensions; cryptographic implementation and injection prevention were the strongest security categories.

Per-case analysis revealed an unexpected attention score dip in Case 3, attributed to distracting UI elements such as pulsating progress bars. LLM evaluators consistently underestimated access control risks compared to human assessors.

Integrated Analysis

The adversarial fragility demonstrated by Tessa et al.—where simple rephrasing bypassed security tuning—mirrors the specification-dependence identified by Naqvi et al., where vague inputs systematically overlooked neurodiversity requirements. Both findings indicate that current LLMs rely on surface-level pattern matching rather than deep reasoning, leaving them vulnerable to distributional shifts in input.

Dimension-Specific Scores

| Dimension | Case 1 Human | Case 3 Human | LLM Delta |
|---|---|---|---|
| Attention | 3.85 | 3.92 | +0.56 |
| Memory | 3.62 | 3.85 | +0.69 |
| Comprehension | 3.69 | 3.92 | +0.72 |
| Decision-Making | 3.77 | 4.00 | +0.41 |
| Learning | 3.92 | 4.23 | +0.49 |

The table above quantifies improvements across inclusivity dimensions. LLMs demonstrated larger score deltas than human evaluators, suggesting an optimism bias in automated assessment.
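The human score change per dimension can be computed directly from the table values and set against the LLM column. The sketch below assumes that "LLM Delta" reports the LLM evaluators' score change from Case 1 to Case 3; the source does not define the column, so that reading is an assumption.

```python
# (Case 1 human, Case 3 human, LLM delta) per dimension, copied from the table.
SCORES = {
    "Attention": (3.85, 3.92, 0.56),
    "Memory": (3.62, 3.85, 0.69),
    "Comprehension": (3.69, 3.92, 0.72),
    "Decision-Making": (3.77, 4.00, 0.41),
    "Learning": (3.92, 4.23, 0.49),
}


def human_deltas(scores: dict[str, tuple[float, float, float]]) -> dict[str, float]:
    """Case 3 minus Case 1 human score, rounded to two decimals."""
    return {dim: round(case3 - case1, 2) for dim, (case1, case3, _) in scores.items()}


def llm_exceeds_human(scores: dict[str, tuple[float, float, float]]) -> bool:
    """True when the LLM delta is larger than the human delta in every dimension."""
    deltas = human_deltas(scores)
    return all(llm > deltas[dim] for dim, (_, _, llm) in scores.items())


if __name__ == "__main__":
    for dim, delta in human_deltas(SCORES).items():
        print(f"{dim}: human +{delta:.2f}, LLM +{SCORES[dim][2]:.2f}")
    print("LLM delta larger everywhere:", llm_exceeds_human(SCORES))
```

Under that reading, the human deltas range from 0.07 (attention) to 0.31 (learning), all smaller than the corresponding LLM deltas.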

Implications

For research, these findings support advancing adversarial training techniques that explicitly incorporate inclusivity objectives, alongside integrating OWASP security standards with ISO/IEC 25010 quality metrics.

For practitioners, the results recommend employing detailed, adversarially-robust prompt specifications and implementing hybrid human-LLM evaluation pipelines to identify biases that automated systems miss.

More broadly, addressing these vulnerabilities reduces the attack surface for neurodivergent users in particular, who may be disproportionately affected by security failures in poorly designed interfaces.

Conclusion

This integrated analysis reveals that LLMs share fundamental vulnerabilities to input variations across both security and inclusivity domains. The findings advocate for robust specification practices and comprehensive evaluation frameworks as prerequisites for generating secure, inclusive code in production environments.

References