Schedule

The schedule is subject to change; this website is under construction, so please check back frequently.

Discussion: Piazza

Date | Topic / In-class plan | Presentations / Readings | Logistics
Part I. Course setup and evaluation basics
Week 1
Wed, Jan 14
Course overview; seminar format; framing trustworthiness and adversarial risks in foundation models. 1. TrustLLM: Trustworthiness in Large Language Models (ICML 2024)
2. On the Trustworthiness of Generative Foundation Models
3. Mitigating Hallucinations via Causal Reasoning (AAAI 2026)
4. Treble Counterfactual VLMs: A Causal Approach to Hallucination (EMNLP 2025 Findings)
5. StealthRank: LLM Ranking Manipulation
6. Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers
Sign-up
Week 2
Wed, Jan 21
Student paper presentations. 1. Chufan Shi: A Comprehensive Study of Jailbreak Attack versus Defense
2. Feiyu Zhu: Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning
3. Ruiteng Li: MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models
4. Bingxin Xu: Composite Backdoor Attacks Against Large Language Models
Week 3
Wed, Jan 28
Student paper presentations. 1. Tiannuo Yang: Prompt-Guided Internal States for Hallucination Detection (PRISM)
2. Ryuichi Lun: Red-teaming LLM Multi-Agent Systems via Communication Attacks
3. Valliammai Valliappan: RAG Security and Privacy: Formalizing the Threat Model and Attack Surface
4. Jayavibhav Niranjan Kogundi: DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents
5. Tsung-Jui Wu: One Shot Dominance: Knowledge Poisoning Attack on Retrieval-Augmented Generation
Part II. Attacks
Week 4
Wed, Feb 4
Student paper presentations. 1. Tiannuo Yang: Mitigating Hallucination in Multimodal LLMs with Layer Contrast Decoding
2. Ryuichi Lun: AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
3. Jiate Li: MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks
4. Valliammai Valliappan: Machine Learning Models Have a Supply Chain Problem
One-page project proposal due, with all team members' names listed (by 11:59 PM PST)
Week 5
Wed, Feb 11
Student paper presentations. 1. Jiate Li: Mudjacking: Patching Backdoor Vulnerabilities in Foundation Models
2. Faith Baca: SoK: The Privacy Paradox of Large Language Models
3. Shreyas Kolte: IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models
Week 6
Wed, Feb 18
Student paper presentations. 1. Tsung-Jui Wu: A Controllable Adversarial Attack against Diffusion Models
2. Mitesh Adake: JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
3. Aryan Bhusari: Rethinking Weight Decay for Robust Fine-Tuning of Foundation Models
4. Shicheng Wen: Instructional Fingerprinting of Large Language Models
5. Ashutosh Chaubey: Improving Adversarial Robustness in Vision-Language Models
Week 7
Wed, Feb 25
Student paper presentations. 1. Faith Baca: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
2. Ruth-Ann Armstrong: Keeping an Eye on LLM Unlearning: The Hidden Risk and Remedy
3. Xiaoqin Feng: Breaking Agents: Compromising Autonomous LLM Agents
4. Li Li: Hallucination Detection in LLMs Using Spectral Features
5. Li Li: Beyond Text: Multimodal Jailbreaking of Vision-Language Models
Week 8
Wed, Mar 4
Student paper presentations. 1. Raja Kumar: Chain-of-Intention Reasoning Elicits Defense in Multimodal LLMs
2. Anagha Shyama Prakash: A Comprehensive Study of Jailbreak Attack versus Defense
3. Xiaoqin Feng: Weak-to-Strong Jailbreaking on Large Language Models
4. Jae Won Choi: AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
5. Junhan Wu: Diffusion-Driven Prompt Manipulation for LLM Jailbreak
Part III. Defenses and verification
Week 9
Wed, Mar 11
Student paper presentations. 1. Mitesh Adake: Adversarial Reasoning at Jailbreaking Time
2. Aryan Bhusari: AnyAttack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-Language Models
3. Alexander Romanus: Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring
4. Shicheng Wen: GhostPrompt: Jailbreaking Text-to-image Generative Models
5. Ramya Krishnan: Is Poisoning a Real Threat to LLM Alignment?
Week 10
Wed, Mar 18
NO CLASS — Spring Break. None
Week 11
Wed, Mar 25
Student paper presentations. 1. Ojas Nimase: Searching for Privacy Risks in LLM Agents via Simulation
2. Anagha Shyama Prakash: Defenses Against Prompt Attacks Learn Surface Heuristics
3. Ojas Nimase: Paper number/name TBD
4. Alexander Romanus: Understanding and Rectifying Safety Perception Distortion in VLMs
5. Xu Wang: Towards Next-Level Robustness in Fine-tuning Zero-Shot Models
Mid-project slides due (date TBD)
Week 12
Wed, Apr 1
Student paper presentations. 1. Ashutosh Chaubey: Paper number/name TBD
2. Xinyuan Li: Paper number/name TBD
3. Tianyi Zhang: RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback
4. Jae Won Choi: Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs
5. Junhan Wu: BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models
Part IV. Paper discussions and student projects
Week 13
Wed, Apr 8
Student paper presentations. 1. Faezeh Dehghan: Safety Alignment Should Be Made More Than Just Superficial
2. Ruth-Ann Armstrong: HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models
3. Tianyi Zhang: On the Vulnerability of Applying Retrieval-Augmented Generation
4. Ramya Krishnan: Memory Injection Attacks on LLM Agents via Query-Only
5. Jayavibhav Niranjan Kogundi: Reasoning Model Unlearning: Forgetting Traces, Not Just Answers
Week 14
Wed, Apr 15
Student paper presentations. 1. Shreyas Kolte: Training LLMs for Honesty via Confessions
2. Yi Nian: On the Vulnerability of Safety Alignment in Open-Access LLMs
3. Xu Wang: Trading Inference-Time Compute for Adversarial Robustness
4. Jingzhen Wang: Paper number/name TBD
5. Jingzhen Wang: Paper number/name TBD
Week 15
Wed, Apr 22
Student paper presentations. 1. Qi Pan: NeuroGenPoisoning: Neuron-Guided Attacks on Retrieval-Augmented Generation
2. Yi Nian: JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model
3. Faezeh Dehghan: Paper number/name TBD
4. Xinyuan Li: Paper number/name TBD
5. Tim Aris: Inference-Time Reward Hacking in Large Language Models
Week 16
Wed, Apr 29
In-class final project presentations. None
Presentation sign-up sheet
Final
Sun, May 3
Final project report due (no in-class final exam). None
Due Sun, May 3, 2026, 11:59 PM PST (Gradescope)