AI audits for regulated systems

We test what your AI actually does — and give counsel the evidence.

A model is not the whole product. In regulated workflows, the deployed system includes prompts, RAG, fine-tuning, routing, runtime policies, verifier layers, human review paths, and logs. Compliance Labs helps law firms, compliance teams, and risk committees see what changed, what fails under pressure, what controls help, and what evidence remains.

Technical evidence · not legal advice · often retained through counsel
What did you buy?

The vendor model, model card, and starting configuration.

What did you turn it into?

Fine-tuning, RAG, policies, routing, output gates, workflow integration, and human review.

Can you prove what it does?

Prompts, outputs, scores, failure taxonomies, verifier decisions, patch history, residual risk, and claim boundaries.

3 regulated domains testedFinance · healthcare · employment
Multi-turn failures exposedSingle-turn evals missed the important risks
Evidence package deliveredLogs · scores · verifier decisions · residual risks

Problems We Solve

AI compliance is moving from policy language to evidence.

Regulated organizations need more than “we told the model to be careful.” They need evidence of how the deployed system behaves, how it fails, what controls reduce risk, and what still needs human review.

01 · Deployed-system risk

The audit cannot stop at the model.

A vendor model, fine-tuned model, RAG workflow, routed system, and output-gated system can produce materially different user-visible behavior. We test the system users actually experience.

02 · Pressure failures

Single-turn evals can overstate safety.

Systems that look safe on clean prompts can fail under multi-turn pressure, authority laundering, direct override attempts, or requests to omit required content.

03 · Missing content

The risky answer may be true but incomplete.

In finance, healthcare, hiring, and insurance, omissions matter: disclosures, red flags, appeal rights, human review, contestability, escalation paths, and contraindications.

The question is no longer just “What model are you using?”
The better question is: “What did you turn it into, what does it do under pressure, and what evidence can you show?”


How Audits Work

Two audit layers. One evidence package.

Compliance Maps and Aegis Evals answer different questions. A serious audit may need one or both.

1

Compliance Map — what changed inside the model?

For fine-tuned or adapted models, we examine internal model changes: layer-by-layer change profiles, feature classification, hotspot analysis, verbalized feature dictionaries, coverage gaps, and output divergence. This is the mechanistic/model-internal audit layer.

2

Aegis Eval — what does the deployed system do?

We test the system in use: baseline outputs, multi-turn pressure, direct overrides, runtime policy behavior, verifier/output-gate decisions, completeness checks, routing/fast-pass behavior, and patch/retest loops. This is the behavioral/deployment audit layer.

3

Controls — what reduces risk?

Where appropriate, we test runtime policies, domain checklists, verifier layers, sentinel/routing logic, human review paths, and escalation rules. We do not assume controls work. We test them.

4

Evidence — what can counsel review?

Every engagement produces a bounded technical record: what was tested, what failed, what changed, what was patched, what remains unresolved, and what claims are and are not supported.


Evidence Package

A report is not enough. Evidence matters.

The deliverable is not just a PDF. It is a structured technical record that a compliance team, risk committee, outside counsel, or technical reviewer can inspect.

Every engagement can include:

We do not provide legal conclusions. We provide technical evidence: what the AI system did, what controls were tested, what changed between configurations, and what risks remain. Counsel applies the law.


Recent Findings

Recent findings from our evals.

In recent finance, healthcare-style, and employment AI evaluations, the pattern was consistent: single-turn tests made systems look safer than they were.

The important failures appeared under pressure: multi-turn escalation, authority laundering, requests to omit required content, and prompts that asked the model to make risky reasoning sound professional.

Employment

A bare model produced 25 fairness failures. Runtime policy reduced that to 3. The verifier/output gate eliminated the remaining user-visible fairness failures in the direct-override suite.

Healthcare-style

The key failure was not always a false statement. It was the true-but-incomplete answer: missing red flags, escalation paths, contraindications, or patient rights.

Finance

The core lesson was similar: prompt-only compliance has a ceiling. Independent verification changed what reached the user.

These are not certification claims. They are examples of what a deployed-system audit can reveal.


Engagements

Consulting-first. Evidence-first.

We start with high-touch audit engagements, not self-serve scores. The work is scoped to the model, deployment, domain, regulatory context, and evidence your team needs.

Engagement 01

Pilot AI Audit

One model or deployed AI workflow

A focused audit for a single AI system or workflow. Best for teams that need a defensible first look at how a system behaves under normal and pressured use.

  • scope definition
  • system inventory
  • single-turn baseline
  • multi-turn pressure testing
  • failure taxonomy
  • executive findings
Pilot
Engagement 02

Compliance Map

For fine-tuned or adapted models

A mechanistic audit of what changed inside the model. Appropriate where internal model evidence matters: fine-tunes, adapters, model version changes, or model-internal risk questions.

  • layer-by-layer change profile
  • feature classification
  • hotspot analysis
  • output divergence comparison
  • plain-language feature dictionary
  • coverage gap analysis
Map
Engagement 03

Aegis Deployment Eval

For RAG, routing, policy layers, output gates, and human review workflows

A deployed-system evaluation that tests what reaches the user. Best for systems operating in regulated workflows where prompts, policies, RAG, routing, and human review all shape behavior.

  • runtime policy testing
  • direct override tests
  • verifier/output-gate tests
  • true-but-incomplete omission checks
  • fast-pass/routing safety checks
  • patch and retest loop
Aegis
Engagement 04

Monitoring & Retesting

Periodic or change-triggered re-evaluation

For systems that keep changing. Retesting can be triggered by model updates, prompt changes, RAG corpus changes, routing changes, verifier changes, new workflows, or regulatory developments.

  • version comparison
  • change-triggered retests
  • evidence archive
  • drift/change alerts
  • quarterly or event-based cadence
  • residual risk updates
Monitoring

⚖️
For Legal Counsel

Technical experts for AI matters.

Compliance Labs can work under counsel as a technical expert team. We do not provide legal advice or guarantee regulatory approval. We produce the technical record: what the AI system did, what controls were tested, what changed between versions, and what risks remain.

Counsel provides the legal strategy. We provide the technical evidence. Our work can support AI governance, risk review, incident response, vendor due diligence, model-change review, and regulatory preparation when the engagement is properly scoped by counsel.


Research Credibility

Built from research. Used for evidence.

Compliance Labs grew out of original research into how fine-tuning changes model internals and behavior. That research showed why one measurement is not enough: internal representation changes and output behavior can disagree.

Research foundation

Two measurements are required.

Internal model evidence and behavioral output evidence answer different questions. A model may change internally while outputs look stable, or show output changes without obvious feature-level movement. Serious audits need both axes where applicable.

Current research

Mechanistic audit methods for fine-tuned language models.

Our model-internal audit methodology is grounded in original research currently under anonymous peer review. The techniques — including sparse autoencoder analysis, layer-by-layer change profiling, and representational divergence measurement — form the foundation of every Compliance Map.


Team

Technical audit work for serious AI governance.

Compliance Labs is a joint venture between Awakened Intelligence and Arvoinen.AI, combining AI systems engineering, mechanistic interpretability research, and applied compliance evaluation.

Founder & Principal

John Holman

John Holman is the founder and systems architect behind Compliance Labs and Awakened Intelligence. He designs and operates the evaluation infrastructure, coordinates client engagements, and turns model behavior, verifier logs, failure taxonomies, and patch histories into evidence packages counsel can use.

Founder & Lead Researcher

Arshavir Blackwell, PhD

Arshavir Blackwell, PhD, is a cognitive scientist and lead research partner. His work anchors the mechanistic interpretability side of Compliance Labs, including SAE-based feature analysis, LoRA audit methodology, representational change measurement, and the research program behind the Compliance Map.

Compliance Labs is a joint venture between Arvoinen.AI and Awakened Intelligence.


Ready to talk about your AI system?

Tell us what you have deployed, what it is used for, what has been modified, and what your compliance or risk team needs to understand.

No self-serve scores. No black-box report. A real scoping conversation with the team doing the work.

john@compliance-labs.ai
Good first scoping details
  • System type Vendor model, fine-tune, RAG system, routed workflow, agent, or internal tool.
  • Domain Finance, healthcare, employment, insurance, credit, legal, education, or other regulated workflow.
  • Deployment status Prototype, pilot, production, post-incident, or model-change review.
  • Controls in place Prompts, RAG, policies, routing, verifier layers, human review, logs, monitoring.
  • Primary question What do counsel, compliance, risk, or leadership need to prove or understand?
  • Contact john@compliance-labs.ai