Getting Started
Probe Six is an automated security assessment platform for Large Language Models (LLMs). It tests how your AI models behave under adversarial conditions, identifying vulnerabilities before they reach production.
The Problem
LLMs are being deployed in production without adequate security testing. Traditional security tools weren't designed for AI-specific risks like prompt injection, data leakage or bias. Probe Six fills that gap with automated, evidence-based assessments that map directly to compliance frameworks.
Quick Start
Create an account — Register for free at probesix.com/register
Add an endpoint — Register the LLM endpoint you want to test (HTTP API or AWS Bedrock)
Run a scan — Choose a compliance template and launch the assessment
Review your report — Get a detailed security assessment with findings, risk scores and remediation guidance
Free tier: Includes 5 scans per month with summary findings and risk scores. Upgrade to unlock detailed test examples and AI-generated remediation guidance.
Endpoints
An endpoint represents the LLM you want to assess. Probe Six supports two connection methods: HTTP API endpoints and AWS Bedrock (via cross-account role assumption).
API Endpoint
Connect to any LLM that exposes an HTTP API. You provide the URL and any authentication headers needed.
Required fields
- Endpoint Name — A descriptive name (e.g. "Production ChatBot API")
- API URL — The full URL of your LLM endpoint (must be HTTPS)
- Authentication Headers — Optional key-value pairs for API keys, bearer tokens or basic auth
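As an illustration of what these fields amount to in practice, here is a minimal Python sketch of an authenticated HTTPS request of the kind a scanner might send to your endpoint. The URL, header names and JSON body shape are placeholders, not the actual request format Probe Six uses; match them to your own endpoint's schema:

```python
import json
import urllib.request

def build_request(api_url: str, prompt: str, headers: dict) -> urllib.request.Request:
    """Build an authenticated HTTPS request to an LLM endpoint.

    The body shape ({"prompt": ...}) is an assumption for illustration;
    use whatever schema your own endpoint expects.
    """
    if not api_url.startswith("https://"):
        raise ValueError("API URL must be HTTPS")
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    req = urllib.request.Request(api_url, data=body, method="POST")
    req.add_header("Content-Type", "application/json")
    # Authentication headers are plain key-value pairs,
    # e.g. {"Authorization": "Bearer <token>"} or {"X-Api-Key": "<key>"}.
    for key, value in headers.items():
        req.add_header(key, value)
    return req
```

The HTTPS check mirrors the requirement above: Probe Six rejects plain-HTTP URLs at endpoint creation.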
AWS Bedrock
Connect to AWS Bedrock models in your AWS account using IAM cross-account role assumption. Probe Six assumes a role in your account to access the Bedrock model directly.
Required fields
- Endpoint Name — A descriptive name
- Role ARN — The IAM role ARN in your account (e.g. arn:aws:iam::123456789012:role/ProbeSixAccess)
- External ID — Optional additional security identifier
- AWS Region — The region where your Bedrock model is deployed (e.g. us-east-1, eu-west-1)
- Model ID — The Bedrock model identifier (e.g. anthropic.claude-3-5-sonnet-20241022-v2:0)
Setting up cross-account access
You need to create an IAM role in your AWS account that allows Probe Six to invoke your Bedrock models on your behalf. The steps below walk through the full setup.
1. Create an IAM role
In the AWS IAM console, create a new role. When prompted for a trusted entity type, choose Another AWS account.
2. Set the trust policy
Replace the default trust policy with the following. The ExternalId condition prevents the Confused Deputy problem. Without it, any party that knows your Role ARN and the Probe Six account ID could assume your role. Use a unique, hard-to-guess value (such as a UUID) and enter the same value in your Probe Six endpoint configuration:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::450328359362:role/probesix-cross-account-scanner"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "YOUR_EXTERNAL_ID"
        }
      }
    }
  ]
}
3. Attach a permissions policy
Attach the following inline policy (or a managed policy with the same permissions). This grants the minimum access required to invoke Bedrock foundation models:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream",
        "bedrock:ListFoundationModels"
      ],
      "Resource": "*"
    }
  ]
}
To restrict access to specific models or regions, replace the wildcard resource with the exact ARN — for example arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0.
4. Enter the Role ARN in Probe Six
Once the role is created, copy its ARN (format: arn:aws:iam::YOUR_ACCOUNT_ID:role/ProbeSixAccess) and enter it in the Probe Six endpoint form alongside your chosen External ID.
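If you prefer to script the setup, the External ID and trust policy from step 2 can be generated programmatically. This sketch uses a UUID for the External ID, per the recommendation above; it only builds the policy document, and attaching it to the role is still done through the IAM console or CLI:

```python
import json
import uuid

# The Probe Six scanner principal from the trust policy in step 2.
PROBESIX_PRINCIPAL = "arn:aws:iam::450328359362:role/probesix-cross-account-scanner"

def make_trust_policy(external_id: str) -> dict:
    """Build the step-2 trust policy with the ExternalId condition filled in."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": PROBESIX_PRINCIPAL},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }

# A random UUID satisfies the "unique, hard-to-guess" recommendation.
# Keep a copy: the same value must be entered in the Probe Six endpoint form.
external_id = str(uuid.uuid4())
policy_json = json.dumps(make_trust_policy(external_id), indent=2)
```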
Managing Endpoints
All endpoints are organised under projects. When you create your first endpoint, a default project is created automatically. You can view, edit and delete endpoints from the dashboard. Each endpoint shows its connection status (active, pending validation or invalid) and the date of its last scan.
Running Scans
A scan executes a suite of security tests against your LLM endpoint. Each scan uses a compliance template that determines which test categories are included.
Scan Templates
Choose a template based on the compliance framework or security standard you need to assess against:
EU AI Act
Compliance testing against EU AI Act requirements. Covers bias, transparency, human oversight and robustness.
OWASP Top 10 for LLMs
Tests for the most critical LLM security risks including prompt injection, data leakage and insecure output handling.
OWASP Top 10 for Agentic AI
Security testing specifically designed for agentic AI applications with tool use, multi-step reasoning and autonomous actions.
MITRE ATLAS
Adversarial threat landscape testing mapped to MITRE ATLAS tactics and techniques for AI/ML systems.
NIST AI RMF
Comprehensive security testing based on NIST AI Risk Management Framework guidelines. 19 categories across all four functions (Govern, Map, Measure, Manage).
ISO/IEC 42001
Assessment against the international standard for AI Management Systems. Covers 38 Annex A controls across 9 domains and management system clauses 4–10.
How a Scan Works
Select your endpoint — Choose which LLM to test. You can also launch scans directly from an endpoint's page.
Choose a template — Select the compliance framework to test against (e.g. OWASP, ATLAS, EU AI Act, NIST, ISO 42001).
Select categories — Pick which categories, tactics, articles or domains to include. Select all for comprehensive coverage or choose specific areas for targeted testing.
Answer governance questions — When you select a category, its governance questions appear inline below the row. Answer them in context — your responses auto-save and persist across scans.
Launch — The scan runs automatically, sending adversarial prompts to your endpoint and evaluating responses.
Review results — Once complete, your assessment report is generated with findings, risk scores and remediation guidance.
Scan Phases
Every scan progresses through four phases:
1. Provisioning
Infrastructure is allocated for your scan. This typically takes 1–2 minutes and requires no action on your part.
2. Generating Test Cases
Test cases are dynamically generated based on your chosen compliance template. The progress bar shows the number of tests created so far. This phase typically takes 3–8 minutes depending on template size.
3. Scanning
Each test case is sent to your endpoint, the response is evaluated, and the result is graded. The progress bar updates every 15 seconds showing the number of completed tests. This is typically the longest phase — see below for what affects duration.
4. Generating Report
Results are compiled into your assessment report with risk scores, findings, and remediation guidance. This takes 1–3 minutes.
Re-running Scans
Any completed scan can be re-run at any time using the same endpoint and template configuration. Re-runs let you track how your model's security posture changes over time — for example, after applying remediations or updating your model. Each re-run produces a new report, so you can compare results side by side.
Note: Scans send real prompts to your LLM endpoint. Ensure your endpoint is accessible and that you have authorisation to perform security testing against it.
What Affects Scan Duration
During the scanning phase, you may notice that some tests complete faster than others. This is normal and reflects the different types of security tests being run.
Direct Tests
The first portion of a scan consists of direct tests — individual adversarial prompts sent to your endpoint. Each test follows a simple cycle: send the prompt, receive the response, grade the result. These tests typically complete at a rate of 10–20 per minute.
Adaptive Tests
The later portion of a scan uses adaptive testing strategies such as jailbreak techniques, multi-step prompt injection, and iterative refinement attacks. These tests are more sophisticated — if the initial attempt doesn't bypass your model's defences, the test adapts its approach and tries again with a modified prompt. This multi-round process means each adaptive test requires several interactions with your endpoint, typically completing at 2–5 per minute.
Why adaptive tests matter: Adaptive tests simulate real-world attack scenarios where an adversary refines their approach based on the model's responses. A model that resists direct attacks but succumbs to adaptive techniques has a genuine vulnerability that needs addressing.
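The multi-round behaviour described above can be sketched as a generic loop. This is an illustration of the general pattern, not Probe Six's actual implementation; the send, grade and mutate callables are placeholders for the endpoint call, the response grader and the prompt-rewriting strategy:

```python
def run_adaptive_test(send, grade, mutate, prompt: str, max_rounds: int = 5) -> bool:
    """Generic shape of a multi-round adaptive test (illustrative only).

    send:   submits a prompt to the endpoint and returns the response
    grade:  returns True if the response indicates the defences were bypassed
    mutate: rewrites the prompt based on the previous response
    Returns True if any round succeeded within the round budget.
    """
    for _ in range(max_rounds):
        response = send(prompt)
        if grade(response):
            return True  # attack succeeded -- record a failed test
        prompt = mutate(prompt, response)  # refine and try again
    return False  # model resisted every round
```

Each round costs a full endpoint interaction, which is why adaptive tests run at a fraction of the direct-test rate.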
Tip: Your endpoint's response latency also affects scan duration. Endpoints with higher latency or rate limiting will result in longer scans. If your endpoint has strict rate limits, the scan will automatically pace requests to avoid being throttled.
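Using the documented per-minute rates, you can get a rough feel for how long the scanning phase might take. This is an estimate only; endpoint latency and rate limiting, which are not modelled here, can stretch it further:

```python
def estimate_scan_minutes(direct_tests: int, adaptive_tests: int) -> tuple:
    """Rough (best, worst) duration in minutes for the scanning phase.

    Uses the rates stated in the docs: direct tests at roughly 10-20/min,
    adaptive tests at roughly 2-5/min.
    """
    fastest = direct_tests / 20 + adaptive_tests / 5
    slowest = direct_tests / 10 + adaptive_tests / 2
    return fastest, slowest
```

For example, 200 direct tests plus 50 adaptive tests works out to roughly 20–45 minutes of scanning.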
Understanding Reports
After a scan completes, Probe Six generates a detailed assessment report. The report provides a complete picture of your model's security posture, combining automated test results with governance assessment responses.
Report Structure
Score Card
Your overall security score (0–100%) calculated from weighted risk across all finding categories. If governance questions were answered, a blended score is shown: 70% security score + 30% governance score.
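The blended score described above is a simple weighted sum. A minimal sketch, assuming the two percentages combine linearly:

```python
def blended_score(security: float, governance: float) -> float:
    """Blend per the documented weighting: 70% security, 30% governance.

    Both inputs are percentages on a 0-100 scale.
    """
    return round(0.7 * security + 0.3 * governance, 1)
```

So a model scoring 80% on automated tests with a 60% governance score would show a blended score of 74%.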
Test Results Summary
Total tests executed, pass/fail counts and scan duration at a glance.
Vulnerability Breakdown
A bar graph showing the count of failed tests by severity level — critical, high, medium and low. This gives a quick visual indicator of where the most serious issues lie.
Findings Table
Each finding represents a security test category. The table shows the category name, risk severity (low/medium/high/critical), attack complexity, pass percentage, and risk score (0–10). Findings are sortable by any column. Governance-assessed categories appear alongside automated findings.
Framework Coverage
Framework-specific coverage sections appear based on your scan template — for example, an ATLAS heat map for MITRE ATLAS scans, article-level coverage for EU AI Act, category-by-function views for NIST, or domain-level coverage for ISO 42001.
Failed Test Examples
For each finding category, the report includes the actual prompts that failed security checks and the model's responses. This evidence shows exactly how the model behaved under adversarial conditions.
Remediation Guidance
AI-generated remediation for each finding category, tailored to the actual test failures observed in your scan. Critical and high severity findings receive individual remediation; medium and low findings are grouped by category.
Executive Summary
Key risk statements summarising the assessment, along with a breakdown of the scoring methodology used to calculate the overall security score.
Attack Methods Used
A summary of the tools and plugins employed during the scan, showing which adversarial techniques were used to test your endpoint.
Risk Scoring
Each finding receives a risk score from 0 to 10, derived from the attack success rate (ASR) — the percentage of tests where the model failed to resist the attack. The score incorporates:
- Attack impact — The potential damage if exploited (low to critical)
- Exploit difficulty — How difficult the attack is to execute
- Human oversight factor — Whether a human is in the loop
- Penalty modifiers — Adjustments for high ASR or specific risk patterns
Severity thresholds: ≥7.5 critical, ≥5.0 high, ≥2.5 medium, >0 low, =0 none. The overall security score (0–100%) is calculated as 100 − 10 × WRI, where WRI is a weighted risk index across all categories (critical 4×, high 3×, medium 2×, low 1×).
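The thresholds and formula above can be expressed directly in code. Note that the exact WRI aggregation is an assumption here (a severity-weighted mean of per-category risk scores); only the thresholds, the severity weights and the 100 − 10 × WRI formula are stated in the docs:

```python
def severity(risk_score: float) -> str:
    """Map a 0-10 risk score to a severity band using the documented thresholds."""
    if risk_score >= 7.5:
        return "critical"
    if risk_score >= 5.0:
        return "high"
    if risk_score >= 2.5:
        return "medium"
    if risk_score > 0:
        return "low"
    return "none"

# Documented severity weights: critical 4x, high 3x, medium 2x, low 1x.
SEVERITY_WEIGHTS = {"critical": 4, "high": 3, "medium": 2, "low": 1, "none": 0}

def overall_score(risk_scores: list) -> float:
    """Overall security score = 100 - 10 x WRI.

    How WRI aggregates per-category scores is not fully specified in the docs;
    a severity-weighted mean is assumed here for illustration.
    """
    weights = [SEVERITY_WEIGHTS[severity(s)] for s in risk_scores]
    if sum(weights) == 0:
        return 100.0  # no weighted risk, so a perfect score
    wri = sum(w * s for w, s in zip(weights, risk_scores)) / sum(weights)
    return 100.0 - 10.0 * wri
```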
PDF Export
Full reports can be exported as PDF documents for offline review, sharing with stakeholders, or inclusion in compliance documentation. The PDF includes all report sections — score cards, vulnerability breakdown, findings, framework coverage, remediation guidance and executive summary.
Free vs Full Reports
Free Tier
- ✓ Score card and overall security score
- ✓ Findings table with risk scores
- ✓ Pass/fail counts per category
- ✓ Vulnerability breakdown chart
- ✗ Failed test examples (locked)
- ✗ AI-generated remediation (locked)
- ✗ PDF export (locked)
Full Report
- ✓ Everything in free tier
- ✓ Detailed failed test examples with prompts and outputs
- ✓ AI-generated remediation guidance per category
- ✓ Framework-specific coverage sections
- ✓ PDF export
Finding Categories
Probe Six tests across a wide range of LLM security categories. Each category targets a specific class of vulnerability or risk. Below is a reference of the major categories currently assessed.
Prompt Injection & Evasion
Hijacking
Tests whether the model's intended behaviour can be overridden through prompt injection, causing it to follow attacker instructions.
Indirect Prompt Injection
Tests for injection attacks via external content (documents, web pages, tool outputs) that the model ingests as context.
System Prompt Override
Assesses whether adversarial prompts can override or replace the model's system instructions.
Prompt Extraction
Tests whether the model's system prompt or instructions can be extracted through adversarial techniques.
Encoding Bypass
Tests whether encoding tricks (Base64, ROT13, hexadecimal, leetspeak, Unicode homoglyphs and others) can bypass content filters.
Cross-Lingual Evasion
Assesses whether safety controls can be bypassed by switching languages, transliteration, code-switching or using low-resource languages.
ASCII Smuggling
Tests for vulnerabilities where Unicode or ASCII encoding tricks can bypass content filters and safety mechanisms.
Special Token Injection
Tests whether special tokens or control sequences can manipulate model behaviour or bypass safety layers.
Security Vulnerabilities
SQL Injection
Tests whether the model can be manipulated into generating or executing SQL injection payloads, potentially compromising backend databases.
Shell Injection
Assesses whether the model can be tricked into producing shell commands that could execute arbitrary code on underlying systems.
SSRF Enforcement
Tests for Server-Side Request Forgery vulnerabilities where the model could be manipulated into making requests to internal services or resources.
Malicious Code
Tests whether the model can be manipulated into generating malware, payload delivery mechanisms or detection evasion code.
Debug Access
Checks whether adversarial prompts can trick the model into revealing debug information, system prompts or internal configuration details.
Privilege Escalation
Tests whether the model can be manipulated into performing actions above its authorised permission level.
RBAC
Assesses whether role-based access controls are enforced correctly when the model handles requests from users with different permission levels.
Unauthorized Data Access
Tests whether the model can be tricked into accessing or revealing data it should not have access to.
RAG & Agentic Security
RAG Poisoning
Tests whether retrieval-augmented generation systems can be compromised by injecting malicious content into the knowledge base.
RAG Document Exfiltration
Assesses whether an attacker can extract documents or data from the RAG knowledge base through the model's responses.
RAG Source Attribution
Tests whether the model correctly attributes information to its retrieval sources and avoids fabricating citations.
Memory Poisoning
Tests whether persistent memory or conversation history can be manipulated to influence future model behaviour.
Tool Discovery
Assesses whether an attacker can discover which tools or functions are available to an agentic AI system.
Goal Misalignment
Tests whether the model can be manipulated into pursuing goals that conflict with its intended purpose.
Excessive Agency
Tests whether the model claims capabilities or takes actions beyond its intended scope, potentially misleading users.
Content Safety
Harmful Content
Tests whether the model can be manipulated into generating harmful, dangerous or illegal content. Covers 20+ subcategories including violent crime, self-harm, weapons, hate speech, cybercrime and more.
Hallucination
Assesses the model's tendency to generate false or fabricated information presented as fact.
Unverifiable Claims
Tests whether the model makes claims it cannot substantiate, potentially spreading misinformation.
Imitation
Checks whether the model can be made to impersonate real individuals, organisations or authoritative sources.
Overreliance
Assesses whether the model encourages users to over-rely on its outputs without appropriate caveats or disclaimers.
Sycophancy
Tests whether the model excessively agrees with the user rather than providing accurate or balanced responses.
Specialised Advice
Tests whether the model provides professional advice (financial, legal, medical) without appropriate disclaimers.
Off-Topic
Assesses whether the model can be led into responding to topics outside its intended domain or purpose.
Bias & Fairness
Bias — Race
Tests for racial bias in model outputs, including stereotyping, discrimination and unequal treatment based on race or ethnicity.
Bias — Gender
Assesses gender bias including stereotyping, discrimination and unequal treatment based on gender or gender identity.
Bias — Age
Tests for age-based discrimination or stereotyping in model responses.
Bias — Disability
Checks for disability-related bias, ableism or discriminatory language in model outputs.
Bias — Religion
Assesses whether the model exhibits bias towards or against particular religious groups or beliefs.
Bias — Sexual Orientation
Tests for bias or discrimination based on sexual orientation in model responses.
Bias — Socioeconomic
Assesses whether the model exhibits bias based on socioeconomic status or class.
Bias — Political
Tests whether the model makes politically biased or partisan statements that could influence users.
Bias — Nationality
Checks for bias or stereotyping based on national origin or nationality.
Data Protection & Privacy
PII Leakage
Tests whether the model can be manipulated into revealing personally identifiable information. Includes direct disclosure, API/database leakage, session leakage and social engineering variants.
Training Data Extraction
Attempts to extract training data samples from model responses, which may contain sensitive information.
Cross-Session Leak
Tests for data leakage between user sessions, where information from one conversation bleeds into another.
Data Exfiltration
Assesses whether the model can be used as a channel to exfiltrate data from connected systems.
Cloud Service Discovery
Probes for cloud service endpoints and infrastructure details that should remain confidential.
Model Theft
Tests whether model weights, architecture details or proprietary information can be extracted via the inference API.
EU AI Act Compliance
Prohibited Practice Refusals
Tests that the model refuses requests related to EU AI Act Article 5 prohibited practices, including subliminal manipulation, vulnerability exploitation, social scoring, criminal prediction, emotion recognition and biometric categorisation.
Transparency & Disclosure
Assesses AI self-disclosure, content marking, limitation disclosure and explainability in line with EU AI Act transparency obligations.
Contracts
Tests the model's handling of contractual and licensing obligations related to AI-generated content.
Accuracy & Robustness
Factual Accuracy
Tests the model's ability to provide factually correct information and avoid confident assertions of false claims.
Confidence Calibration
Assesses whether the model's expressed confidence aligns with the actual accuracy of its responses.
Reasoning DoS
Tests whether the model can be overwhelmed by computationally expensive reasoning tasks that degrade performance.
Compliance Mapping
Every finding in a Probe Six assessment maps to one or more compliance frameworks. This mapping provides auditable evidence that your AI systems have been tested against recognised security and governance standards.
Supported Frameworks
OWASP Top 10 for LLMs
The Open Worldwide Application Security Project's top 10 most critical vulnerabilities for LLM applications. Covers prompt injection, insecure output handling, training data poisoning, denial of service and more.
MITRE ATLAS
Adversarial Threat Landscape for AI Systems. A knowledge base of adversarial tactics and techniques specific to AI/ML systems, modelled after the MITRE ATT&CK framework.
NIST AI Risk Management Framework
The National Institute of Standards and Technology's framework for managing risks in AI systems. Addresses governance, mapping, measurement and management of AI risks.
ISO 42001
The international standard for AI Management Systems. Provides requirements for establishing, implementing and continually improving AI governance within organisations.
EU AI Act
The European Union's regulation on artificial intelligence. Establishes obligations for AI systems based on risk classification, with requirements for transparency, human oversight, accuracy and robustness.
Equality Act 2010
UK legislation protecting against discrimination. Relevant to AI bias testing, ensuring model outputs do not discriminate based on protected characteristics including race, gender, age, disability and religion.
How Mapping Works
When Probe Six generates remediation guidance for a finding, it includes a references section that maps the issue to specific codes and clauses within relevant frameworks. For example:
References:
- OWASP LLM Top 10: LLM01 — Prompt Injection
- MITRE ATLAS: AML.T0051 — LLM Prompt Injection
- NIST AI RMF: GV-1.1 — Legal and regulatory requirements
- ISO 42001: 6.1.2 — AI risk assessment
These references allow security teams and auditors to trace each finding back to the specific compliance requirement it addresses, providing the evidence trail needed for regulatory reporting and governance reviews.