LLM-Driven Research Engineering

End-to-end anomaly detection from one command.

AD-AGENT turns natural-language requests into runnable anomaly detection pipelines across PyOD, PyGOD, and TSB-AD — with automated review, sandbox execution, and evaluation.

Supported by NSF POSE Phase II OpenAD (Award #2346158)

3 modalities

Tabular, graph, and time-series anomaly detection

Multi-agent

Processor, Selector, InfoMiner, CodeGenerator, Reviewer, Evaluator, Optimizer

Automation

Parse, select, codegen, test, and run

What It Does

Running anomaly detection across different data modalities usually means switching libraries, re-learning APIs, and wiring up evaluation code by hand. AD-AGENT collapses that loop: a prompt describing the task you want to solve — "Run IForest on cardio.mat", "Detect anomalies in my graph data", or "Try all PyOD models on this dataset" — is turned into a working script, executed inside a secure sandbox, and evaluated end-to-end.

Core Features

Natural-language Interface

Write commands like: Run IForest on ./data/glass_train.mat and ./data/glass_test.mat.

Cross-library Support

Works across pyod (tabular), pygod (graph), and tsb_ad (time-series) — one interface for multiple data modalities.

Self-checking Pipeline

Generated code is reviewed on synthetic data before real execution.

Pipeline API

Call api.pipeline stages directly from Python to embed AD-AGENT in notebooks, scripts, or larger workflows.
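
A minimal embedding sketch (the dataset path and the IForest choice are illustrative; full signatures are in the API Reference below):

from api.pipeline import (
    check_dataset_exists,
    run_info_miner,
    run_codegenerator_reviewer_loop,
    run_evaluator,
)

train = "./data/pyod_data/cardio.mat"  # illustrative dataset path
check_dataset_exists(train)

# Fetch algorithm docs, then alternate code generation and synthetic review
doc = run_info_miner(algorithm="IForest", package_name="pyod")["algorithm_doc"]
state = run_codegenerator_reviewer_loop(
    tool="IForest",
    data_path_train=train,
    algorithm_doc=doc,
    package_name="pyod",
)

# Evaluate the reviewed code on the real data inside the sandbox
state = run_evaluator(code_quality=state["code_quality"], tool="IForest")
print(state["code_quality"].auroc, state["code_quality"].auprc)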

Automatic Model Suggestion

When no algorithm is specified, the selector agent recommends competitive candidates based on data modality and shape.

Secure Sandbox Execution

Generated code runs inside an isolated sandbox — Modal (remote, default) or Docker (local) — never on the host process.

Workflow

  1. Processor extracts algorithms, datasets, and parameters from the user command.
  2. Selector infers the data modality and selects the AD library and tools.
  3. InfoMiner queries authoritative docs for model usage details.
  4. CodeGenerator creates runnable scripts and revises them on errors.
  5. Reviewer tests on synthetic data; re-runs CodeGenerator on failure (up to 4 cycles).
  6. Evaluator executes the reviewed code on real data inside the sandbox and records AUROC/AUPRC.
  7. Optimizer (optional, -o) tunes hyper-parameters with LLM guidance and re-evaluates.

Quickstart

git clone git@github.com:USC-FORTIS/AD-AGENT.git
cd AD-AGENT
python -m venv .venv

# macOS / Linux
source .venv/bin/activate

# Windows
.venv\Scripts\activate

pip install -r requirements.txt
export OPENAI_API_KEY=your-api-key-here   # or set in src/config/config.py
python main.py

Then type a natural-language request, for example:

Run IForest on ./data/pyod_data/cardio.mat
Run DOMINANT on ./data/pygod_data/books.pt
Run IForest on ./data/SMAP/SMAP_train.npy
Run all on ./data/pyod_data/cardio.mat

Parallel: python main.py -p  |  Optimizer: python main.py -o  |  Sandbox: python main.py --sandbox docker

Sandbox Execution

AD-AGENT runs its agent workflow on the host machine and executes generated model scripts inside an isolated sandbox backend. This keeps the orchestration layer lightweight while moving package-heavy model execution into containers. Workflow logs ([main], [selector], [reviewer] …) come from the host; generated-script output is streamed back from the sandbox into the same terminal.

Modal (default)

Remote execution in a managed cloud sandbox. No local Docker required.

Prerequisite — install and authenticate once:

pip install modal
modal setup

Paths inside Modal: /workspace (script), /data (datasets).

Docker

Local container execution. Requires Docker Desktop to be running. Useful for offline or air-gapped environments.

Resource limits applied to every container:

--memory=4g  --cpus=2  --rm

Dataset files are bind-mounted read-only at the requested in-container paths.
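
As a rough illustration only (not the project's actual backend code), each run is equivalent to a resource-limited docker run; the host path and script name below are hypothetical:

import subprocess

# Sketch: auto-removed container with CPU/memory caps and a read-only
# bind mount of the dataset at the path the generated script expects.
subprocess.run([
    "docker", "run", "--rm", "--memory=4g", "--cpus=2",
    "-v", "/abs/path/cardio.mat:/data/pyod_data/cardio.mat:ro",  # hypothetical mount
    "adagent-pyod:latest",
    "python", "/workspace/generated_script.py",  # hypothetical script location
], check=True)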

Quick Start

python main.py --sandbox docker
python main.py --sandbox modal

# with debug retention (Modal only — sandboxes kept for post-run inspection)
ADAGENT_SANDBOX_DEBUG=1 python main.py --sandbox modal

Configuration

Sandbox mode is resolved in this priority order (sketched in code after the list):

  1. --sandbox flag passed to main.py
  2. Environment variable ADAGENT_SANDBOX
  3. Legacy OPENAD_SANDBOX (backward compatibility)
  4. src/config/settings.yaml if present
  5. Default: modal
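
A minimal sketch of that resolution order (the settings.yaml key name is an assumption):

import os

def resolve_sandbox_mode(cli_flag=None, settings=None):
    # Mirrors the priority list above; not the actual implementation.
    return (
        cli_flag                              # 1. --sandbox flag
        or os.environ.get("ADAGENT_SANDBOX")  # 2. current env var
        or os.environ.get("OPENAD_SANDBOX")   # 3. legacy env var
        or (settings or {}).get("sandbox")    # 4. settings.yaml (key assumed)
        or "modal"                            # 5. default
    )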

Additional environment variables:

ADAGENT_SANDBOX_DEBUG=1        # retain Modal sandboxes after run for inspection
ADAGENT_MODAL_APP_NAME=...     # override Modal app name  (default: adagent-sandbox)
ADAGENT_MODAL_VOLUME_NAME=...  # override Modal volume name (default: adagent-data)

Supported Package Images

Docker image tags

adagent-pyod:latest
adagent-pygod:latest
adagent-tsb-ad:latest

Modal image definitions

PYOD_IMAGE
PYGOD_IMAGE
TSB_AD_IMAGE

pygod requires PyG wheel deps (pyg_lib, torch_sparse, torch_scatter).

API Reference

api.pipeline module

Step-by-step pipeline functions for anomaly detection. Each function can be called individually (pass explicit arguments) or inside a graph (pass a FullToolState).
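
Both calling styles, sketched with illustrative values:

from api.pipeline import build_state, run_selector, run_info_miner

# Style 1: explicit arguments; each stage returns an updated state dict.
state = run_selector(
    algorithm=["IForest"],
    dataset_train="./data/pyod_data/cardio.mat",
)
print(state["package_name"], state["feature_dim"])

# Style 2: graph style; thread one FullToolState through the nodes.
state = build_state()
state["current_tool"] = "IForest"
state["package_name"] = "pyod"
state = run_info_miner(state=state)
print(state["algorithm_doc"][:100])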

class api.pipeline.FullToolState

Bases: TypedDict

Shared state dictionary passed between all pipeline nodes.

messages Sequence[Any]
Accumulated message history across pipeline stages.
current_tool str
Name of the algorithm currently being processed.
input_parameters dict
User-supplied hyper-parameters for the algorithm.
data_path_train str
Path to the training dataset file.
data_path_test str
Path to the testing dataset file.
package_name str
AD library in use: "pyod", "pygod", or "tsb_ad".
code_quality CodeQuality | None
Evaluation result attached to the current candidate code.
should_rerun bool
Flag indicating whether the code generation step should retry.
experiment_config dict | None
Structured experiment configuration produced by the Processor.
algorithm_doc str | None
Documentation string fetched by the InfoMiner for the current tool.
feature_dim int | None
Feature dimensionality of the dataset, inferred by the Selector.
metadata dict | None
Dataset metadata (e.g. num_samples, has_labels) inferred by the Selector.
results List[Tuple[str, Any]] | None
Collected (tool, final_state) pairs after all tools are processed.
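
For orientation, a skeleton reconstructed from the field list above; the real class also carries agent instances (e.g. agent_selector) and may differ in detail:

from typing import Any, List, Optional, Sequence, Tuple, TypedDict

class FullToolState(TypedDict):
    messages: Sequence[Any]
    current_tool: str
    input_parameters: dict
    data_path_train: str
    data_path_test: str
    package_name: str                      # "pyod", "pygod", or "tsb_ad"
    code_quality: Optional["CodeQuality"]  # forward ref; defined in the package
    should_rerun: bool
    experiment_config: Optional[dict]
    algorithm_doc: Optional[str]
    feature_dim: Optional[int]
    metadata: Optional[dict]
    results: Optional[List[Tuple[str, Any]]]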

api.pipeline.build_state()

Create a default FullToolState with all agent instances initialised.

state dict
Fully initialised pipeline state with default values for every key.

api.pipeline.run_processor(state=None)

Launch the interactive chatbot to collect algorithm, dataset, and parameter information from the user and populate experiment_config.

state FullToolState, optional
Existing pipeline state to reuse. A new default state is created when None.
state dict
Updated pipeline state with experiment_config populated from user input.

api.pipeline.run_selector(algorithm=None, dataset_train=None, dataset_test=None, parameters=None, state=None)

Resolve the AD library and tool list from experiment configuration or explicit arguments. Infers package_name, feature_dim, and dataset metadata.

algorithm list[str], optional
Algorithm names to run. Pass "all" to use every available algorithm for the detected library, or None to let the agent decide.
dataset_train str, optional
Path to the training dataset. Required when state is None.
dataset_test str, optional
Path to the testing dataset.
parameters dict, optional
Algorithm hyper-parameters to forward to the generator.
state FullToolState, optional
Existing pipeline state whose experiment_config is used when provided.
state dict
Updated state with agent_selector, package_name, feature_dim, and metadata set.

api.pipeline.run_info_miner(algorithm=None, package_name=None, state=None)

Query authoritative documentation for an algorithm and store the result in algorithm_doc. Results are cached to disk to avoid redundant API calls.

algorithm str, optional
Algorithm name to look up. Required when state is None.
package_name str, optional
Package to query ("pyod", "pygod", "tsb_ad"). Required when state is None.
state FullToolState, optional
Existing pipeline state containing current_tool and package_name.
state dict
Updated state with algorithm_doc set to the retrieved documentation string.

api.pipeline.run_code_generator(tool=None, data_path_train=None, algorithm_doc=None, package_name=None, data_path_test=None, input_parameters=None, code_quality=None, metadata=None, state=None)

Generate an initial runnable script for the algorithm, or revise an existing one when a previous CodeQuality with errors is supplied.

tool str, optional
Algorithm name. Required when state is None.
data_path_train str, optional
Path to the training dataset. Required when state is None.
algorithm_doc str, optional
Documentation string for the algorithm (from run_info_miner). Required when state is None.
package_name str, optional
AD library name. Required when state is None.
data_path_test str, optional
Path to the testing dataset.
input_parameters dict, optional
Hyper-parameters to embed in the generated code.
code_quality CodeQuality, optional
Previous CodeQuality with error_message set, triggering a revision pass instead of fresh generation.
metadata dict, optional
Dataset metadata to pass to the generator for data-shape-aware code.
state FullToolState, optional
Existing pipeline state. All of the above are read from state when provided.
state dict
Updated state with code_quality.code containing the generated or revised script.

api.pipeline.run_reviewer(code_quality=None, tool=None, state=None)

Execute the generated code against synthetic data to catch runtime errors before real-data evaluation. Updates code_quality.error_message and increments review_count on failure.

code_quality CodeQuality, optional
Code to review. Required when state is None.
tool str, optional
Algorithm name. Required when state is None.
state FullToolState, optional
Existing pipeline state.
state dict
Updated state with code_quality.error_message set (empty string on success).
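
For example, one manual generate-and-review cycle (the convenience loop documented next automates this):

from api.pipeline import run_info_miner, run_code_generator, run_reviewer

doc = run_info_miner(algorithm="IForest", package_name="pyod")["algorithm_doc"]

state = run_code_generator(
    tool="IForest",
    data_path_train="./data/pyod_data/cardio.mat",  # illustrative path
    algorithm_doc=doc,
    package_name="pyod",
)
state = run_reviewer(code_quality=state["code_quality"], tool="IForest")

if state["code_quality"].error_message:
    # Passing the failing CodeQuality back triggers a revision pass
    # instead of fresh generation.
    state = run_code_generator(
        tool="IForest",
        data_path_train="./data/pyod_data/cardio.mat",
        algorithm_doc=doc,
        package_name="pyod",
        code_quality=state["code_quality"],
    )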

api.pipeline.run_codegenerator_reviewer_loop(tool, data_path_train, algorithm_doc=None, package_name=None, data_path_test=None, input_parameters=None, max_reviews=2)

Convenience loop that alternates code generation and synthetic review until the code passes or max_reviews is reached.

tool str
Algorithm name.
data_path_train str
Path to the training dataset.
algorithm_doc str, optional
Documentation string for the algorithm.
package_name str, optional
AD library name.
data_path_test str, optional
Path to the testing dataset.
input_parameters dict, optional
Hyper-parameters to embed in the generated code.
max_reviews int, optional (default=2)
Maximum number of review–revision cycles before exiting the loop.
state dict
Final state after the last review cycle, containing code_quality.

api.pipeline.run_evaluator(code_quality=None, tool=None, state=None)

Execute the reviewed code on real training and testing data inside the configured sandbox and compute AUROC / AUPRC metrics.

code_quality CodeQuality, optional
Code to evaluate. Required when state is None.
tool str, optional
Algorithm name. Required when state is None.
state FullToolState, optional
Existing pipeline state.
state dict
Updated state with code_quality.auroc and code_quality.auprc populated.

api.pipeline.run_optimizer(code_quality=None, algorithm_doc=None, state=None)

Use an LLM to propose improved hyper-parameters and re-evaluate. Only active when the -o flag is passed on the command line. Returns the original code_quality unchanged otherwise.

code_quality CodeQuality, optional
Baseline result to optimise. Required when state is None.
algorithm_doc str, optional
Documentation string for the algorithm. Required when state is None.
state FullToolState, optional
Existing pipeline state.
state dict
Updated state with tuned code_quality after up to 8 optimisation steps.

api.pipeline.run_evaluator_optimizer_loop(cq, tool, algorithm_doc=None, optimizer_cycles=1)

Run an initial evaluation pass and then alternate optimizer and evaluator for optimizer_cycles iterations. Exits early if any step returns an error.

cq CodeQuality
Initial code quality object with reviewed code ready for evaluation.
tool str
Algorithm name.
algorithm_doc str, optional
Documentation string for the algorithm.
optimizer_cycles int, optional (default=1)
Number of optimise-then-evaluate cycles after the initial evaluation.
state dict
Final state with the best code_quality achieved across all cycles.
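
Usage sketch, continuing the same illustrative IForest setup (the optimizer is a no-op unless -o was passed):

from api.pipeline import (
    run_info_miner,
    run_codegenerator_reviewer_loop,
    run_evaluator_optimizer_loop,
)

doc = run_info_miner(algorithm="IForest", package_name="pyod")["algorithm_doc"]
reviewed = run_codegenerator_reviewer_loop(
    tool="IForest",
    data_path_train="./data/pyod_data/cardio.mat",
    algorithm_doc=doc,
    package_name="pyod",
)

# Initial evaluation, then one optimise-then-evaluate cycle.
state = run_evaluator_optimizer_loop(
    reviewed["code_quality"],
    tool="IForest",
    algorithm_doc=doc,
    optimizer_cycles=1,
)
print(state["code_quality"].auroc, state["code_quality"].auprc)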

api.pipeline.check_dataset_exists(dataset_train, dataset_test=None)

Validate that dataset files exist on disk before the pipeline starts.

dataset_train str
Path to the training dataset.
dataset_test str, optional
Path to the testing dataset.
FileNotFoundError
If either dataset path does not exist on the filesystem.

api.pipeline.log_local(stage, message, tool=None)

Print a formatted stage-tagged log line. Inserts a blank separator line when the stage or tool context changes.

stage str
Pipeline stage label, e.g. "code_generator", "reviewer".
message str
Log message to print.
tool str, optional
Algorithm/tool name appended to the prefix as [stage][tool].

Citation

If this project helps your work, cite the paper:

@inproceedings{yang2025ad,
  title={AD-AGENT: A Multi-agent Framework for End-to-end Anomaly Detection},
  author={Yang, Tiankai and Liu, Junjun and Siu, Michael and Wang, Jiahang
          and Qian, Zhuangzhuang and Song, Chanjuan and Cheng, Cheng
          and Hu, Xiyang and Zhao, Yue},
  booktitle={Proceedings of the 14th International Joint Conference on
             Natural Language Processing and the 4th Conference of the
             Asia-Pacific Chapter of the Association for Computational Linguistics},
  pages={191--205},
  year={2025}
}

Support

This project is supported by the U.S. National Science Foundation (NSF), TIP POSE program: NSF POSE: Phase II: OpenAD: An Integrated Open-Source Ecosystem for Anomaly Detection.

Award ID: 2346158 | Status: Active | Period: Jun 15, 2024 - May 31, 2027

Lead institution: University of Illinois at Chicago. Partners: Illinois Institute of Technology, Lehigh University, University of Southern California.

NSF Program Director: Florence Rabanal.