Getting Started
Probe Six is an automated security assessment platform for Large Language Models (LLMs). It tests how your AI models behave under adversarial conditions, identifying vulnerabilities before they reach production.
The Problem
LLMs are being deployed in production without adequate security testing. Traditional security tools weren't designed for AI-specific risks like prompt injection, data leakage or bias. Probe Six fills that gap with automated, evidence-based assessments that map directly to compliance frameworks.
Quick Start
Create an account — Register for free at probesix.com/register
Add an endpoint — Register the LLM endpoint you want to test (HTTP API or AWS Bedrock)
Run a scan — Choose a compliance template and launch the assessment
Review your report — Get a detailed security assessment with findings, risk scores and remediation guidance
Free tier: Includes 5 scans per month with summary findings and risk scores. Upgrade to unlock detailed test examples and AI-generated remediation guidance.
Endpoints
An endpoint represents the LLM you want to assess. Probe Six supports two connection methods: HTTP API endpoints and AWS Bedrock (via cross-account role assumption).
API Endpoint
Connect to any LLM that exposes an HTTP API. You provide the URL and any authentication headers needed.
Required fields
- Endpoint Name — A descriptive name (e.g. "Production ChatBot API")
- API URL — The full URL of your LLM endpoint (must be HTTPS)
- Authentication Headers — Optional key-value pairs for API keys, bearer tokens or basic auth
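As an illustration of what these fields amount to in practice, here is a minimal Python sketch of an authenticated HTTPS request of the kind a scanner might send to your endpoint. The URL, header names and JSON body shape are placeholders, not the actual request format Probe Six uses; match them to your own endpoint's schema:

```python
import json
import urllib.request

def build_request(api_url: str, prompt: str, headers: dict) -> urllib.request.Request:
    """Build an authenticated HTTPS request to an LLM endpoint.

    The body shape ({"prompt": ...}) is an assumption for illustration;
    use whatever schema your own endpoint expects.
    """
    if not api_url.startswith("https://"):
        raise ValueError("API URL must be HTTPS")
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    req = urllib.request.Request(api_url, data=body, method="POST")
    req.add_header("Content-Type", "application/json")
    # Authentication headers are plain key-value pairs,
    # e.g. {"Authorization": "Bearer <token>"} or {"X-Api-Key": "<key>"}.
    for key, value in headers.items():
        req.add_header(key, value)
    return req
```

The HTTPS check mirrors the requirement above: Probe Six rejects plain-HTTP URLs at endpoint creation.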
AWS Bedrock
Connect to AWS Bedrock models in your AWS account using IAM cross-account role assumption. Probe Six assumes a role in your account to access the Bedrock model directly.
Required fields
- Endpoint Name — A descriptive name
- Role ARN — The IAM role ARN in your account (e.g. arn:aws:iam::123456789012:role/ProbeSixAccess)
- External ID — Optional additional security identifier
- AWS Region — The region where your Bedrock model is deployed (e.g. us-east-1, eu-west-1)
- Model ID — The Bedrock model identifier (e.g. anthropic.claude-3-5-sonnet-20241022-v2:0)
Setting up cross-account access
You need to create an IAM role in your AWS account that allows Probe Six to invoke your Bedrock models on your behalf. The steps below walk through the full setup.
1. Create an IAM role
In the AWS IAM console, create a new role. When prompted for a trusted entity type, choose Another AWS account.
2. Set the trust policy
Replace the default trust policy with the following. The ExternalId condition prevents the Confused Deputy problem. Without it, any party that knows your Role ARN and the Probe Six account ID could assume your role. Use a unique, hard-to-guess value (such as a UUID) and enter the same value in your Probe Six endpoint configuration:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::450328359362:role/probesix-cross-account-scanner"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "YOUR_EXTERNAL_ID"
        }
      }
    }
  ]
}
3. Attach a permissions policy
Attach the following inline policy (or a managed policy with the same permissions). This grants the minimum access required to invoke Bedrock foundation models:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream",
        "bedrock:ListFoundationModels"
      ],
      "Resource": "*"
    }
  ]
}
To restrict access to specific models or regions, replace the wildcard resource with the exact ARN — for example arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0.
4. Enter the Role ARN in Probe Six
Once the role is created, copy its ARN (format: arn:aws:iam::YOUR_ACCOUNT_ID:role/ProbeSixAccess) and enter it in the Probe Six endpoint form alongside your chosen External ID.
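If you prefer to script the setup, the External ID and trust policy from step 2 can be generated programmatically. This sketch uses a UUID for the External ID, per the recommendation above; it only builds the policy document, and attaching it to the role is still done through the IAM console or CLI:

```python
import json
import uuid

# The Probe Six scanner principal from the trust policy in step 2.
PROBESIX_PRINCIPAL = "arn:aws:iam::450328359362:role/probesix-cross-account-scanner"

def make_trust_policy(external_id: str) -> dict:
    """Build the step-2 trust policy with the ExternalId condition filled in."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": PROBESIX_PRINCIPAL},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }

# A random UUID satisfies the "unique, hard-to-guess" recommendation.
# Keep a copy: the same value must be entered in the Probe Six endpoint form.
external_id = str(uuid.uuid4())
policy_json = json.dumps(make_trust_policy(external_id), indent=2)
```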
Managing Endpoints
All endpoints are organised under projects. When you create your first endpoint, a default project is created automatically. You can view, edit and delete endpoints from the dashboard. Each endpoint shows its connection status (active, pending validation or invalid) and the date of its last scan.
Running Scans
A scan executes a suite of security tests against your LLM endpoint. Each scan uses a compliance template that determines which test categories are included.
Scan Templates
Choose a template based on the compliance framework or security standard you need to assess against:
EU AI Act
Compliance testing against EU AI Act requirements. Covers bias, transparency, human oversight and robustness.
OWASP Top 10 for LLMs
Tests for the most critical LLM security risks including prompt injection, data leakage and insecure output handling.
OWASP Top 10 for Agentic AI
Security testing specifically designed for agentic AI applications with tool use, multi-step reasoning and autonomous actions.
MITRE ATLAS
Adversarial threat landscape testing mapped to MITRE ATLAS tactics and techniques for AI/ML systems.
NIST AI RMF
Comprehensive security testing based on NIST AI Risk Management Framework guidelines. 19 categories across all four functions (Govern, Map, Measure, Manage).
ISO/IEC 42001
Assessment against the international standard for AI Management Systems. Covers 38 Annex A controls across 9 domains and management system clauses 4–10.
How a Scan Works
Select your endpoint — Choose which LLM to test. You can also launch scans directly from an endpoint's page.
Choose a template — Select the compliance framework to test against (e.g. OWASP, ATLAS, EU AI Act, NIST, ISO 42001).
Select categories — Pick which categories, tactics, articles or domains to include. Select all for comprehensive coverage or choose specific areas for targeted testing.
Answer governance questions — When you select a category, its governance questions appear inline below the row. Answer them in context — your responses auto-save and persist across scans.
Launch — The scan runs automatically, sending adversarial prompts to your endpoint and evaluating responses.
Review results — Once complete, your assessment report is generated with findings, risk scores and remediation guidance.
Scan Phases
Every scan progresses through four phases:
1. Provisioning
Infrastructure is allocated for your scan. This typically takes 1–2 minutes and requires no action on your part.
2. Generating Test Cases
Test cases are dynamically generated based on your chosen compliance template. The progress bar shows the number of tests created so far. This phase typically takes 3–8 minutes depending on template size.
3. Scanning
Each test case is sent to your endpoint, the response is evaluated, and the result is graded. The progress bar updates every 15 seconds showing the number of completed tests. This is typically the longest phase — see below for what affects duration.
4. Generating Report
Results are compiled into your assessment report with risk scores, findings, and remediation guidance. This takes 1–3 minutes.
Re-running Scans
Any completed scan can be re-run at any time using the same endpoint and template configuration. Re-runs let you track how your model's security posture changes over time — for example, after applying remediations or updating your model. Each re-run produces a new report, so you can compare results side by side.
Note: Scans send real prompts to your LLM endpoint. Ensure your endpoint is accessible and that you have authorisation to perform security testing against it.
What Affects Scan Duration
During the scanning phase, you may notice that some tests complete faster than others. This is normal and reflects the different types of security tests being run.
Direct Tests
The first portion of a scan consists of direct tests — individual adversarial prompts sent to your endpoint. Each test follows a simple cycle: send the prompt, receive the response, grade the result. These tests typically complete at a rate of 10–20 per minute.
Adaptive Tests
The later portion of a scan uses adaptive testing strategies such as jailbreak techniques, multi-step prompt injection, and iterative refinement attacks. These tests are more sophisticated — if the initial attempt doesn't bypass your model's defences, the test adapts its approach and tries again with a modified prompt. This multi-round process means each adaptive test requires several interactions with your endpoint, typically completing at 2–5 per minute.
Why adaptive tests matter: Adaptive tests simulate real-world attack scenarios where an adversary refines their approach based on the model's responses. A model that resists direct attacks but succumbs to adaptive techniques has a genuine vulnerability that needs addressing.
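The multi-round behaviour described above can be sketched as a generic loop. This is an illustration of the general pattern, not Probe Six's actual implementation; the send, grade and mutate callables are placeholders for the endpoint call, the response grader and the prompt-rewriting strategy:

```python
def run_adaptive_test(send, grade, mutate, prompt: str, max_rounds: int = 5) -> bool:
    """Generic shape of a multi-round adaptive test (illustrative only).

    send:   submits a prompt to the endpoint and returns the response
    grade:  returns True if the response indicates the defences were bypassed
    mutate: rewrites the prompt based on the previous response
    Returns True if any round succeeded within the round budget.
    """
    for _ in range(max_rounds):
        response = send(prompt)
        if grade(response):
            return True  # attack succeeded -- record a failed test
        prompt = mutate(prompt, response)  # refine and try again
    return False  # model resisted every round
```

Each round costs a full endpoint interaction, which is why adaptive tests run at a fraction of the direct-test rate.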
Tip: Your endpoint's response latency also affects scan duration. Endpoints with higher latency or rate limiting will result in longer scans. If your endpoint has strict rate limits, the scan will automatically pace requests to avoid being throttled.
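Using the documented per-minute rates, you can get a rough feel for how long the scanning phase might take. This is an estimate only; endpoint latency and rate limiting, which are not modelled here, can stretch it further:

```python
def estimate_scan_minutes(direct_tests: int, adaptive_tests: int) -> tuple:
    """Rough (best, worst) duration in minutes for the scanning phase.

    Uses the rates stated in the docs: direct tests at roughly 10-20/min,
    adaptive tests at roughly 2-5/min.
    """
    fastest = direct_tests / 20 + adaptive_tests / 5
    slowest = direct_tests / 10 + adaptive_tests / 2
    return fastest, slowest
```

For example, 200 direct tests plus 50 adaptive tests works out to roughly 20–45 minutes of scanning.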
Understanding Reports
After a scan completes, Probe Six generates a detailed assessment report. The report provides a complete picture of your model's security posture, combining automated test results with governance assessment responses.
Report Structure
Score Card
Your overall security score (0–100%) calculated from weighted risk across all finding categories. If governance questions were answered, a blended score is shown: 70% security score + 30% governance score.
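The blended score described above is a simple weighted sum. A minimal sketch, assuming the two percentages combine linearly:

```python
def blended_score(security: float, governance: float) -> float:
    """Blend per the documented weighting: 70% security, 30% governance.

    Both inputs are percentages on a 0-100 scale.
    """
    return round(0.7 * security + 0.3 * governance, 1)
```

So a model scoring 80% on automated tests with a 60% governance score would show a blended score of 74%.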
Test Results Summary
Total tests executed, pass/fail counts and scan duration at a glance.
Vulnerability Breakdown
A bar graph showing the count of failed tests by severity level — critical, high, medium and low. This gives a quick visual indicator of where the most serious issues lie.
Findings Table
Each finding represents a security test category. The table shows the category name, risk severity (low/medium/high/critical), attack complexity, pass percentage, and risk score (0–10). Findings are sortable by any column. Governance-assessed categories appear alongside automated findings.
Framework Coverage
Framework-specific coverage sections appear based on your scan template — for example, an ATLAS heat map for MITRE ATLAS scans, article-level coverage for EU AI Act, category-by-function views for NIST, or domain-level coverage for ISO 42001.
Failed Test Examples
For each finding category, the report includes the actual prompts that failed security checks and the model's responses. This evidence shows exactly how the model behaved under adversarial conditions.
Remediation Guidance
AI-generated remediation for each finding category, tailored to the actual test failures observed in your scan. Critical and high severity findings receive individual remediation; medium and low findings are grouped by category.
Executive Summary
Key risk statements summarising the assessment, along with a breakdown of the scoring methodology used to calculate the overall security score.
Attack Methods Used
A summary of the tools and plugins employed during the scan, showing which adversarial techniques were used to test your endpoint.
Risk Scoring
Each finding receives a risk score from 0 to 10, derived from the attack success rate (ASR) — the percentage of tests where the model failed to resist the attack. The score incorporates:
- Attack impact — The potential damage if exploited (low to critical)
- Exploit difficulty — How difficult the attack is to execute
- Human oversight factor — Whether a human is in the loop
- Penalty modifiers — Adjustments for high ASR or specific risk patterns
Severity thresholds: ≥7.5 critical, ≥5.0 high, ≥2.5 medium, >0 low, =0 none. The overall security score (0–100%) is calculated as 100 − 10 × WRI, where WRI is a weighted risk index across all categories (critical 4×, high 3×, medium 2×, low 1×).
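The thresholds and formula above can be expressed directly in code. Note that the exact WRI aggregation is an assumption here (a severity-weighted mean of per-category risk scores); only the thresholds, the severity weights and the 100 − 10 × WRI formula are stated in the docs:

```python
def severity(risk_score: float) -> str:
    """Map a 0-10 risk score to a severity band using the documented thresholds."""
    if risk_score >= 7.5:
        return "critical"
    if risk_score >= 5.0:
        return "high"
    if risk_score >= 2.5:
        return "medium"
    if risk_score > 0:
        return "low"
    return "none"

# Documented severity weights: critical 4x, high 3x, medium 2x, low 1x.
SEVERITY_WEIGHTS = {"critical": 4, "high": 3, "medium": 2, "low": 1, "none": 0}

def overall_score(risk_scores: list) -> float:
    """Overall security score = 100 - 10 x WRI.

    How WRI aggregates per-category scores is not fully specified in the docs;
    a severity-weighted mean is assumed here for illustration.
    """
    weights = [SEVERITY_WEIGHTS[severity(s)] for s in risk_scores]
    if sum(weights) == 0:
        return 100.0  # no weighted risk, so a perfect score
    wri = sum(w * s for w, s in zip(weights, risk_scores)) / sum(weights)
    return 100.0 - 10.0 * wri
```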
PDF Export
Full reports can be exported as PDF documents for offline review, sharing with stakeholders, or inclusion in compliance documentation. The PDF includes all report sections — score cards, vulnerability breakdown, findings, framework coverage, remediation guidance and executive summary.
Free vs Full Reports
Free Tier
- ✓ Score card and overall security score
- ✓ Findings table with risk scores
- ✓ Pass/fail counts per category
- ✓ Vulnerability breakdown chart
- ✗ Failed test examples (locked)
- ✗ AI-generated remediation (locked)
- ✗ PDF export (locked)
Full Report
- ✓ Everything in free tier
- ✓ Detailed failed test examples with prompts and outputs
- ✓ AI-generated remediation guidance per category
- ✓ Framework-specific coverage sections
- ✓ PDF export
Finding Categories
Probe Six tests across a wide range of LLM security categories. Each category targets a specific class of vulnerability or risk. Below is a reference of the major categories currently assessed.
Prompt Injection & Evasion
Hijacking
Tests whether the model's intended behaviour can be overridden through prompt injection, causing it to follow attacker instructions.
Indirect Prompt Injection
Tests for injection attacks via external content (documents, web pages, tool outputs) that the model ingests as context.
System Prompt Override
Assesses whether adversarial prompts can override or replace the model's system instructions.
Prompt Extraction
Tests whether the model's system prompt or instructions can be extracted through adversarial techniques.
Encoding Bypass
Tests whether encoding tricks (Base64, ROT13, hexadecimal, leetspeak, Unicode homoglyphs and others) can bypass content filters.
Cross-Lingual Evasion
Assesses whether safety controls can be bypassed by switching languages, transliteration, code-switching or using low-resource languages.
ASCII Smuggling
Tests for vulnerabilities where Unicode or ASCII encoding tricks can bypass content filters and safety mechanisms.
Special Token Injection
Tests whether special tokens or control sequences can manipulate model behaviour or bypass safety layers.
Security Vulnerabilities
SQL Injection
Tests whether the model can be manipulated into generating or executing SQL injection payloads, potentially compromising backend databases.
Shell Injection
Assesses whether the model can be tricked into producing shell commands that could execute arbitrary code on underlying systems.
SSRF Enforcement
Tests for Server-Side Request Forgery vulnerabilities where the model could be manipulated into making requests to internal services or resources.
Malicious Code
Tests whether the model can be manipulated into generating malware, payload delivery mechanisms or detection evasion code.
Debug Access
Checks whether adversarial prompts can trick the model into revealing debug information, system prompts or internal configuration details.
Privilege Escalation
Tests whether the model can be manipulated into performing actions above its authorised permission level.
RBAC
Assesses whether role-based access controls are enforced correctly when the model handles requests from users with different permission levels.
Unauthorized Data Access
Tests whether the model can be tricked into accessing or revealing data it should not have access to.
RAG & Agentic Security
RAG Poisoning
Tests whether retrieval-augmented generation systems can be compromised by injecting malicious content into the knowledge base.
RAG Document Exfiltration
Assesses whether an attacker can extract documents or data from the RAG knowledge base through the model's responses.
RAG Source Attribution
Tests whether the model correctly attributes information to its retrieval sources and avoids fabricating citations.
Memory Poisoning
Tests whether persistent memory or conversation history can be manipulated to influence future model behaviour.
Tool Discovery
Assesses whether an attacker can discover which tools or functions are available to an agentic AI system.
Goal Misalignment
Tests whether the model can be manipulated into pursuing goals that conflict with its intended purpose.
Excessive Agency
Tests whether the model claims capabilities or takes actions beyond its intended scope, potentially misleading users.
Content Safety
Harmful Content
Tests whether the model can be manipulated into generating harmful, dangerous or illegal content. Covers 20+ subcategories including violent crime, self-harm, weapons, hate speech, cybercrime and more.
Hallucination
Assesses the model's tendency to generate false or fabricated information presented as fact.
Unverifiable Claims
Tests whether the model makes claims it cannot substantiate, potentially spreading misinformation.
Imitation
Checks whether the model can be made to impersonate real individuals, organisations or authoritative sources.
Overreliance
Assesses whether the model encourages users to over-rely on its outputs without appropriate caveats or disclaimers.
Sycophancy
Tests whether the model excessively agrees with the user rather than providing accurate or balanced responses.
Specialised Advice
Tests whether the model provides professional advice (financial, legal, medical) without appropriate disclaimers.
Off-Topic
Assesses whether the model can be led into responding to topics outside its intended domain or purpose.
Bias & Fairness
Bias — Race
Tests for racial bias in model outputs, including stereotyping, discrimination and unequal treatment based on race or ethnicity.
Bias — Gender
Assesses gender bias including stereotyping, discrimination and unequal treatment based on gender or gender identity.
Bias — Age
Tests for age-based discrimination or stereotyping in model responses.
Bias — Disability
Checks for disability-related bias, ableism or discriminatory language in model outputs.
Bias — Religion
Assesses whether the model exhibits bias towards or against particular religious groups or beliefs.
Bias — Sexual Orientation
Tests for bias or discrimination based on sexual orientation in model responses.
Bias — Socioeconomic
Assesses whether the model exhibits bias based on socioeconomic status or class.
Bias — Political
Tests whether the model makes politically biased or partisan statements that could influence users.
Bias — Nationality
Checks for bias or stereotyping based on national origin or nationality.
Data Protection & Privacy
PII Leakage
Tests whether the model can be manipulated into revealing personally identifiable information. Includes direct disclosure, API/database leakage, session leakage and social engineering variants.
Training Data Extraction
Attempts to extract training data samples from model responses, which may contain sensitive information.
Cross-Session Leak
Tests for data leakage between user sessions, where information from one conversation bleeds into another.
Data Exfiltration
Assesses whether the model can be used as a channel to exfiltrate data from connected systems.
Cloud Service Discovery
Probes for cloud service endpoints and infrastructure details that should remain confidential.
Model Theft
Tests whether model weights, architecture details or proprietary information can be extracted via the inference API.
EU AI Act Compliance
Prohibited Practice Refusals
Tests that the model refuses requests related to EU AI Act Article 5 prohibited practices, including subliminal manipulation, vulnerability exploitation, social scoring, criminal prediction, emotion recognition and biometric categorisation.
Transparency & Disclosure
Assesses AI self-disclosure, content marking, limitation disclosure and explainability in line with EU AI Act transparency obligations.
Contracts
Tests the model's handling of contractual and licensing obligations related to AI-generated content.
Accuracy & Robustness
Factual Accuracy
Tests the model's ability to provide factually correct information and avoid confident assertions of false claims.
Confidence Calibration
Assesses whether the model's expressed confidence aligns with the actual accuracy of its responses.
Reasoning DoS
Tests whether the model can be overwhelmed by computationally expensive reasoning tasks that degrade performance.
Compliance Mapping
Every finding in a Probe Six assessment maps to one or more compliance frameworks. This mapping provides auditable evidence that your AI systems have been tested against recognised security and governance standards.
Supported Frameworks
OWASP Top 10 for LLMs
The Open Worldwide Application Security Project's top 10 most critical vulnerabilities for LLM applications. Covers prompt injection, insecure output handling, training data poisoning, denial of service and more.
MITRE ATLAS
Adversarial Threat Landscape for AI Systems. A knowledge base of adversarial tactics and techniques specific to AI/ML systems, modelled after the MITRE ATT&CK framework.
NIST AI Risk Management Framework
The National Institute of Standards and Technology's framework for managing risks in AI systems. Addresses governance, mapping, measurement and management of AI risks.
ISO 42001
The international standard for AI Management Systems. Provides requirements for establishing, implementing and continually improving AI governance within organisations.
EU AI Act
The European Union's regulation on artificial intelligence. Establishes obligations for AI systems based on risk classification, with requirements for transparency, human oversight, accuracy and robustness.
Equality Act 2010
UK legislation protecting against discrimination. Relevant to AI bias testing, ensuring model outputs do not discriminate based on protected characteristics including race, gender, age, disability and religion.
How Mapping Works
When Probe Six generates remediation guidance for a finding, it includes a references section that maps the issue to specific codes and clauses within relevant frameworks. For example:
References:
- OWASP LLM Top 10: LLM01 — Prompt Injection
- MITRE ATLAS: AML.T0051 — LLM Prompt Injection
- NIST AI RMF: GV-1.1 — Legal and regulatory requirements
- ISO 42001: 6.1.2 — AI risk assessment
These references allow security teams and auditors to trace each finding back to the specific compliance requirement it addresses, providing the evidence trail needed for regulatory reporting and governance reviews.