Review proof metrics
87% eval pass rate, 18% to 6% hallucination reduction, 43 req/sec, 145+ passing tests.
AI Reliability Engineer / AI Solutions Engineer Portfolio
AgentTrust IQ — LLM Evaluation · RAG Reliability · Agent Guardrails · JSONL Audit Logs
I built an inspectable evaluation platform that tests whether RAG and agent outputs are grounded, cited, safe, privacy-aware, and regression-gated before release.
The proof is intentionally concrete: controlled demo/test/fixture results, JSONL audit evidence, GitHub tests, prompt-injection defense, PII redaction, citation checks, and release/block/escalate logic.
RECRUITER SCAN
The fastest route from hiring fit to inspectable proof for AI Solutions Engineer, LLM Evaluation, Applied GenAI, Forward Deployed AI, and AI Reliability roles.
Inspect structured fixture evidence for groundedness, citations, guardrails, and release status.
Review pytest coverage, regression gates, validation references, and README proof notes.
Run the browser-only scenario switcher and compare pass, review, and prompt-injection cases.
Use the resume, GitHub, proof artifacts, and contact links to move quickly to interview.
Built a FastAPI-based agent reliability and LLM evaluation platform with JSONL audit artifacts, Prometheus metrics, replay validation, governance workflows, and automated regression testing. In controlled runs it achieved an 87% eval pass rate, 43 req/sec throughput, 1.3s-1.4s static demo response latency, and 99%+ workflow success, and reduced the hallucination rate from 18% to 6%.
AI Reliability Algorithms
AgentTrust IQ runs production-style AI reliability algorithms to evaluate RAG and agent outputs before deployment. Each check writes structured JSONL traces, powers dashboard metrics, and supports release/block/escalate decisions.
Compares generated answers against retrieved context and flags unsupported claims before release.
Checks whether generated claims are actually supported by cited source chunks.
Detects jailbreaks, unsafe tool-use requests, and system-prompt extraction attempts.
Detects and redacts emails, phone numbers, and sensitive identifiers before logs are stored.
Combines hallucination, citation, latency, PII, and injection checks into release/block/escalate decisions.
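The final step in the list above, combining individual checks into a release/block/escalate decision, can be sketched roughly as follows. This is a minimal illustration rather than the repository's implementation: the `CheckResults` fields, the risk labels, and both thresholds are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class CheckResults:
    """Outcomes of the individual checks for one evaluated response (hypothetical shape)."""
    hallucination_risk: str    # assumed labels: "low", "medium", "high"
    citation_precision: float  # 0.0 to 1.0
    latency_ms: float
    pii_leak: bool
    prompt_injection: bool


# Hypothetical thresholds, chosen for illustration only.
MIN_CITATION_PRECISION = 0.90
MAX_LATENCY_MS = 1500.0


def decide(results: CheckResults) -> str:
    """Return "release", "block", or "escalate" for one response."""
    # Hard safety failures block outright.
    if results.prompt_injection or results.pii_leak:
        return "block"
    if results.hallucination_risk == "high":
        return "block"
    # Borderline quality or slow responses go to a human reviewer.
    if (
        results.hallucination_risk == "medium"
        or results.citation_precision < MIN_CITATION_PRECISION
        or results.latency_ms > MAX_LATENCY_MS
    ):
        return "escalate"
    return "release"
```

Ordering the safety checks first means a response with a PII leak is blocked even when its quality scores look good, which is one reasonable way to keep the gate conservative.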
Verified project signals
Controlled demo/test/fixture results documented in the repository, not production customer benchmarks.
Hackathon Context
AgentTrust IQ was shaped for Microsoft Agents League-style reasoning-agent evaluation workflows, but this homepage leads with role fit, proof metrics, GitHub tests, JSONL evidence, and the recruiter review path.
Under 30 seconds
Four direct checks from project claim to inspectable evidence.
Confirm the current test counts and controlled-evidence scope in the README.
Open validation status
Review per-case evaluation records and the summary row from the hiring run.
Open JSONL evidence
Inspect checked-in metric samples for evaluation and reliability observability.
Open metrics sample
Trace the headline numbers to tests, reports, artifacts, and reproducible commands.
Open README proof
Direct evidence
Open the exact files behind the project summary.
Test counts, evidence scope, and local verification commands.
JSONL log (hiring_eval.jsonl): Auditable case records and controlled hiring-run summary.
Run summary (hiring_eval_summary.json): Machine-readable metrics for the deterministic hiring run.
Integrity summary (eval_summary.json): Combined 131-record evidence summary with SHA256 checksums.
Source (GitHub repository): Implementation, tests, artifacts, deployment files, and documentation.
Existing screenshot artifacts linked directly from the repository.
These artifacts map directly to the README, proof files, evaluation logs, and static walkthrough.
Evaluation Flow
From controlled fixtures to regression-tested reliability metrics.
Controlled test cases define expected grounded, cited, and safe outputs.
Runs scoring checks across answer quality, safety, citations, and regressions.
Stores each evaluation result as an auditable record for review.
Aggregates pass rate, hallucination rate, citation precision, refusal accuracy, latency, and cost/request.
Blocks behavior drift by rerunning deterministic eval cases in the test suite.
Packages metrics, screenshots, logs, and README evidence for fast technical review.
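The flow above (fixtures, scoring, JSONL records, aggregated metrics, regression gate) might look something like this in miniature. The fixture fields, the `stub_answer` stand-in, and the phrase-matching score are all hypothetical simplifications, not the project's actual runner.

```python
import json

# Hypothetical fixtures: each case lists phrases a grounded, cited answer must contain.
CASES = [
    {"case_id": "refund_001", "prompt": "Summarize the refund policy.",
     "must_contain": ["30 days", "[Policy Doc]"]},
    {"case_id": "refund_002", "prompt": "What is the refund window?",
     "must_contain": ["30 days", "[Policy Doc]"]},
]


def stub_answer(prompt: str) -> str:
    """Stand-in for the system under test; the second answer omits its citation."""
    if "Summarize" in prompt:
        return "Refunds are available within 30 days of purchase. [Policy Doc]"
    return "The refund window is 30 days."


def run_cases(cases, answer_fn):
    """Score every case and return one auditable record per case."""
    records = []
    for case in cases:
        answer = answer_fn(case["prompt"])
        missing = [p for p in case["must_contain"] if p.lower() not in answer.lower()]
        records.append({"case_id": case["case_id"], "passed": not missing, "missing": missing})
    return records


def to_jsonl(records) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


def summarize(records) -> dict:
    """Aggregate per-case records into summary metrics."""
    total = len(records)
    passed = sum(r["passed"] for r in records)
    return {"total": total, "passed": passed, "pass_rate": passed / total if total else 0.0}


def regression_ok(summary: dict, baseline_pass_rate: float) -> bool:
    """Regression gate: fail if the pass rate drops below the recorded baseline."""
    return summary["pass_rate"] >= baseline_pass_rate
```

Because the fixtures and the stub are deterministic, rerunning `run_cases` in a test suite yields identical records, so any change in `summarize` output points to a behavior change rather than sampling noise.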
Product Demo
A simplified view of how prompts, retrieval, guardrails, and evaluation checks work before deployment.
Incoming prompt or task request.
Relevant context is fetched from the knowledge base.
PII, prompt injection, and policy checks run first.
The model generates a grounded draft answer.
Citation, refusal, hallucination, latency, and cost checks run.
Pass, fail, or route to human review.
Summarize the policy and cite the supporting source.
The policy requires source-backed answers for customer-facing responses. Unsupported claims should be blocked or routed for review. [Doc 2]
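The guardrail step in this flow (PII and prompt-injection checks before generation) could be approximated with pattern matching, as in the sketch below. The regular expressions, placeholder tokens, and phrase list are illustrative assumptions; a real guardrail would need broader coverage than a short deny-list.

```python
import re

# Simplified patterns for illustration; they will not catch every real-world format.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# Hypothetical deny-list of instruction-override phrases.
INJECTION_PHRASES = (
    "ignore previous instructions",
    "ignore all previous instructions",
    "disregard the above",
    "reveal your system prompt",
)


def redact_pii(text: str) -> str:
    """Replace emails and phone numbers with placeholder tokens before logging."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return PHONE_RE.sub("[REDACTED_PHONE]", text)


def looks_like_injection(prompt: str) -> bool:
    """Flag prompts that contain a known instruction-override phrase."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in INJECTION_PHRASES)
```

Redacting before the log write, rather than after, keeps raw identifiers out of stored JSONL records entirely.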
Interactive Static Eval Demo
Select a client-side reliability scenario and inspect the gates used before releasing a GenAI workflow.
Static demo data mirrors deterministic repository fixtures and JSONL proof artifacts; this browser demo is not connected to a backend API. The FastAPI service runs locally from the repository when started by a reviewer.
Summarize the refund policy and cite the source.
Refunds are available within 30 days of purchase when the customer provides proof of payment.
Customers can request a refund within 30 days of purchase if they provide proof of payment. [Policy Doc]
{
"case_id": "good_answer_001",
"groundedness": "pass",
"citation_precision": 0.94,
"hallucination_risk": "low",
"decision": "release"
}
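A score like `citation_precision` can be read as the share of cited claims that their cited source actually supports. The sketch below uses plain word overlap as a stand-in for the support check; that heuristic, the `min_overlap` threshold, and the function names are assumptions for illustration and are not taken from the repository.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_supported(claim: str, source: str, min_overlap: float = 0.6) -> bool:
    """Treat a claim as supported when enough of its words appear in the source.

    Word overlap is a crude proxy for entailment, used here only as a sketch.
    """
    claim_words = _words(claim)
    if not claim_words:
        return False
    overlap = len(claim_words & _words(source)) / len(claim_words)
    return overlap >= min_overlap


def citation_precision(cited_claims, sources) -> float:
    """Fraction of (claim, doc_id) pairs whose cited document supports the claim."""
    if not cited_claims:
        return 0.0
    supported = sum(
        is_supported(claim, sources.get(doc_id, ""))
        for claim, doc_id in cited_claims
    )
    return supported / len(cited_claims)


SOURCES = {
    "policy_doc": "Refunds are available within 30 days of purchase "
                  "when the customer provides proof of payment.",
}
CLAIMS = [
    ("Refunds are available within 30 days of purchase", "policy_doc"),
    ("Refunds are paid in gift cards only", "policy_doc"),
]
```

With these two claims, the first is fully covered by the policy text and the second is not, so `citation_precision(CLAIMS, SOURCES)` evaluates to 0.5.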
AGENTIC RELIABILITY FLOW
Follow one AI agent request through retrieval, guardrails, scoring, JSONL logging, and trace replay.
SELECTED STAGE
“What are the main risks in this vendor contract? Include citations.” The workflow records the request and its citation requirement before execution.
{
"run_id": "agent_eval_042",
"stage": "user_prompt",
"status": "received",
"citation_required": true
}
Hiring relevance: Shows disciplined prompt intake and explicit acceptance criteria before model execution.
AI RELIABILITY DASHBOARD
Monitor eval health, guardrail outcomes, latency, cost, hallucination risk, citation quality, and failed-run debugging from one reviewer-facing console.
Submission evaluation scenarios meeting the gate
Primary headline metric
Reduced after evaluation and retrieval controls
Primary headline metric
Measured at ~99% success
Primary headline metric
Regression evidence for the portfolio claim
Primary headline metric
REVIEW QUEUE
| Run ID | Failure Type | Risk | Fix | Status |
|---|---|---|---|---|
| agent_eval_017 | Missing citation | Medium | Added citation coverage check | Fixed |
| agent_eval_023 | Prompt injection attempt | High | Blocked instruction override | Passed |
| agent_eval_031 | PII exposure risk | High | Added redaction guardrail | Fixed |
| agent_eval_044 | Hallucinated source | High | Tightened retrieval threshold | Retest passed |
AGENT TRACE REPLAY
Inspect how an agent task moves through planning, retrieval, tool use, guardrails, scoring, and JSONL audit evidence before release.
Summarize the enterprise refund policy and cite source documents.
Break task into retrieval, citation validation, answer generation, and safety checks.
Retrieved policy chunks doc_04 and doc_09 from the fixture dataset.
Checked for prompt injection, unsafe action request, and PII leakage.
Scored citation precision, groundedness, hallucination risk, and p95 latency.
Exported structured evidence with case ID, scores, citations, latency, and release status.
{
"trace_id": "agent_replay_024",
"workflow": "enterprise_refund_rag",
"planner_status": "pass",
"retrieved_docs": ["doc_04", "doc_09"],
"guardrails": {
"prompt_injection": "pass",
"pii_leakage": false,
"unsafe_action": false
},
"metrics": {
"citation_precision": 0.91,
"groundedness": 0.94,
"p95_latency_ms": 820
},
"release_status": "pass"
}
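Replay validation over a trace like the one above can be sketched as recomputing the release status from the stored guardrail and metric fields, then comparing it with the recorded value. The field names follow the sample trace; the thresholds and function names are hypothetical.

```python
import json

# One JSONL line shaped like the sample trace (subset of fields).
TRACE_LINE = json.dumps({
    "trace_id": "agent_replay_024",
    "guardrails": {"prompt_injection": "pass", "pii_leakage": False, "unsafe_action": False},
    "metrics": {"citation_precision": 0.91, "groundedness": 0.94, "p95_latency_ms": 820},
    "release_status": "pass",
})

# Hypothetical release thresholds for illustration.
MIN_CITATION_PRECISION = 0.90
MIN_GROUNDEDNESS = 0.90
MAX_P95_LATENCY_MS = 1000


def replay_status(trace: dict) -> str:
    """Recompute "pass" or "fail" from the recorded guardrails and metrics."""
    guardrails = trace["guardrails"]
    metrics = trace["metrics"]
    guardrails_ok = (
        guardrails["prompt_injection"] == "pass"
        and not guardrails["pii_leakage"]
        and not guardrails["unsafe_action"]
    )
    metrics_ok = (
        metrics["citation_precision"] >= MIN_CITATION_PRECISION
        and metrics["groundedness"] >= MIN_GROUNDEDNESS
        and metrics["p95_latency_ms"] <= MAX_P95_LATENCY_MS
    )
    return "pass" if guardrails_ok and metrics_ok else "fail"


def replay_matches(line: str) -> bool:
    """True when the recomputed status agrees with the status stored in the trace."""
    trace = json.loads(line)
    return replay_status(trace) == trace["release_status"]
```

A mismatch between the recomputed and recorded status is the useful signal here: it means either the trace was edited after the fact or the release criteria changed since the run.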
NEXT REVIEW STEP
Complementary Eval Layer
General-purpose evaluation tools are useful for scoring chatbot and RAG answers. AgentTrust IQ is complementary: this demo also gates agent actions before a tool runs. The proof point is the tool-tier boundary: read-only / recon actions continue to the next reliability check, while destructive / irreversible actions route to human approval.
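The tool-tier boundary described here can be illustrated with a small routing function. The action names and return labels are hypothetical, and treating unknown actions as destructive is a design assumption of this sketch rather than a documented behavior.

```python
from enum import Enum


class Tier(Enum):
    READ_ONLY = 1    # read-only / recon
    DESTRUCTIVE = 2  # destructive / irreversible


# Hypothetical mapping from tool actions to tiers.
TOOL_TIERS = {
    "check_log": Tier.READ_ONLY,
    "lookup_cve": Tier.READ_ONLY,
    "query_record": Tier.READ_ONLY,
    "isolate_host": Tier.DESTRUCTIVE,
    "revoke_access": Tier.DESTRUCTIVE,
    "delete_file": Tier.DESTRUCTIVE,
}


def route(action: str) -> str:
    """Decide what happens before a tool runs, based on its tier."""
    # Fail closed: an action missing from the mapping is treated as destructive.
    tier = TOOL_TIERS.get(action, Tier.DESTRUCTIVE)
    if tier is Tier.READ_ONLY:
        return "continue_to_reliability_checks"
    return "route_to_human_approval"
```

Failing closed on unmapped actions keeps a newly added tool from bypassing review simply because nobody classified it yet.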
AGENTTRUST IQ EXTENSION
A controlled demo layer for simulated cyber-agent workflows: deterministic fixtures, guardrail checks, cited responses, and replayable evidence that help recruiters see how I turn AI security risk into inspectable release criteria.
This layer simulates three risks: prompt injection hijacking the agent, sensitive data leaking through the agent's own output, and unauthorized agent actions. It is not a SIEM, EDR, or SOAR replacement.
Portfolio demo scope
Simulated cyber-agent actions stopped before tool execution.
Deterministic fixtures; controlled fixture cases
Instruction override attempts blocked in controlled demo runs.
Guardrail scan
Sensitive fields redacted across fixture-based security traces.
Privacy check
Threat-intel claims matched to retrieved supporting evidence.
Cited response gate
Cyber-agent reliability cases passed repeatable eval checks.
Release criteria
Evaluation latency for deterministic fixture scoring.
Reviewer-facing metric
Shows how an AI infrastructure team could review agent actions before execution: classify tool tier, run guardrails, preserve JSONL evidence, track p95 eval latency, and route destructive actions to human approval.
Controlled demo workflow
CONTROLLED WORKFLOW
Security task request
Threat-intel retrieval
Guardrail scan
Tool-use tier classification: Tier 1 read-only / recon: continue to reliability checks. Tier 2 destructive / irreversible: route to human approval.
Human approval gate: Required only for destructive / irreversible actions.
Cited response or approved action summary
Eval scorecard
JSONL trace replay
Release/block/escalate
Check a log, look up a CVE, or query a record.
No human approval gate
Isolate a host, revoke access, or delete a file.
Routes to human approval
Evaluator Coverage
These are the types of issues the evaluator is designed to catch before release.
Controlled Benchmark Notice
The headline metrics come from a six-case deterministic hiring run. A separate checksum-backed 131-record evidence summary reports a 97.0% pass rate, 0.8% hallucination rate, 83.3% citation precision, and 83.3% refusal accuracy. Neither set represents production traffic, vendor model comparisons, or live customer results.
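A reviewer could confirm a checksum-backed artifact by recomputing its SHA256 digest and comparing it with the recorded value. The sketch below shows only that generic comparison; how the repository stores its checksums is not described here, so the function names and the in-memory sample are assumptions.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_hex: str) -> bool:
    """True when the recomputed digest matches the recorded one (case-insensitive)."""
    return sha256_hex(data) == expected_hex.strip().lower()


# Hypothetical evidence bytes standing in for a checked-in JSONL file.
sample = b'{"case_id": "good_answer_001", "decision": "release"}\n'
recorded = sha256_hex(sample)
```

For a file on disk, the same check applies to `pathlib.Path(path).read_bytes()`; any edit to the evidence after the digest was recorded makes `verify` return `False`.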
Compact Product Extension
AgentTrust IQ can extend from evaluation-time reliability into runtime governance for AI agents. The extension monitors tool/API use, detects prompt injection, redacts PII, blocks unsafe actions, escalates risky workflows to human review, and exports JSONL audit evidence.
Evaluation Roadmap
Employer Review
Built to show how AI outputs are tested before customer-facing deployment.
RAG and AI agent systems can produce unsupported answers, missing citations, unsafe refusals, and behavior drift after prompt or model changes.
Built a deterministic evaluation harness that scores groundedness, citation precision, refusal accuracy, latency, cost/request, and regression behavior across repeatable test cases.
Prompt/response fixtures feed an evaluation runner that applies citation checks, hallucination checks, refusal checks, JSONL logging, summary metrics, and pytest regression validation.
Current metrics are based on controlled deterministic fixtures, not broad vendor benchmarks. In production, I would expand the dataset and add real provider outputs, agent traces, human review, and CI regression gates.
Proof Artifacts
Structured prompt/response outputs with pass/fail labels and metric fields.
Pytest validation for citation checks, refusal behavior, and hallucination risk.
Summary view for pass rate, citation precision, refusal accuracy, latency, and cost/request.
Fast Contact
Review the static walkthrough and reproducible proof path, or contact me directly about LLM evaluation, RAG reliability, and applied GenAI roles.
Static Vercel walkthrough | FastAPI service runs locally from the repository