Defend

Pipeline

Stages of the input guard pipeline from raw text through heuristics, session scoring, and the Defend classifier.

The input path runs a pipeline before producing a final guard result. Exact ordering and diagnostics are implemented in defend_api.pipeline and the guard router; this page summarizes the ideas you need to configure and operate the service.

Defend input pipeline

Normalization

Text is normalized for consistent downstream checks (whitespace, invisible characters, and related transforms). Diagnostics can record which transformations applied.

Intent fast-pass

When provider is defend, a heuristic layer (phrase and token signals, no ML) assigns a coarse label (benign, neutral, suspicious) and a score. Together with regex, it can safe-pass regex-clean benign traffic and skip the local classifier (decided_by: intent_safe_pass). When the provider is claude (Anthropic) or openai, this L2 stage is omitted entirely—the pipeline is normalization, regex, then L6 (modules + LLM). Implementation: defend_api/pipeline/heuristic_intent.py.

Regex and heuristics

Pattern-based rules contribute scores and can block or flag on high-confidence matches (for example, categories such as system_prompt_extraction).

Session accumulation

Per session_id, the service tracks rolling risk across turns so repeated suspicious behavior can escalate even when individual utterances look mild.

Defend classifier

When provider is defend, a local Hugging Face classifier estimates injection risk. It requires pip install pydefend[local] (or the adxzer/defend:local image). Model warm-up is tied to startup and to readiness when defend is the configured provider.

Providers and modules

For LLM-backed evaluation (claude or openai), the provider orchestrates semantic checks. Modules add structured prompt fragments on top of the provider; input modules apply on the input path, output modules on the output path. See Modules overview.

Regex categories

Heuristic categories are defined alongside patterns in defend_api/patterns.py (for example instruction_override, system_prompt_extraction, roleplay_jailbreak, role_hijack, wrapper_bypass, meta_jailbreak). Use those strings when interpreting regex-related diagnostics, not the intent fast-pass labels (benign, neutral, suspicious).