Part I. Course setup and evaluation basics

Week 1 Wed, Jan 14
Course overview; seminar format; framing trustworthiness and adversarial risks in foundation models.

1. TrustLLM: Trustworthiness in Large Language Models (ICML 2024)
2. On the Trustworthiness of Generative Foundation Models
3. Mitigating Hallucinations via Causal Reasoning (AAAI 2026)
4. Treble Counterfactual VLMs: A Causal Approach to Hallucination (EMNLP 2025 Findings)
5. StealthRank: LLM Ranking Manipulation
6. Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

Sign-up
Week 2 Wed, Jan 21
Student paper presentations.
1. Chufan Shi: A Comprehensive Study of Jailbreak Attack versus Defense
2. Feiyu Zhu: Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning
3. Ruiteng Li: MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models
4. Bingxin Xu: Composite Backdoor Attacks Against Large Language Models

Week 3 Wed, Jan 28
Student paper presentations.
1. Tiannuo Yang: Prompt-Guided Internal States for Hallucination Detection (PRISM)
2. Ryuichi Lun: Red-teaming LLM Multi-Agent Systems via Communication Attacks
3. Valliammai Valliappan: RAG Security and Privacy: Formalizing the Threat Model and Attack Surface
4. Jayavibhav Niranjan Kogundi: DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents
5. Tsung-Jui Wu: One Shot Dominance: Knowledge Poisoning Attack on Retrieval-Augmented Generation

Part II. Attacks

Week 4 Wed, Feb 4
Student paper presentations.
1. Tiannuo Yang: Mitigating Hallucination in Multimodal LLMs with Layer Contrast Decoding
2. Ryuichi Lun: AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
3. Jiate Li: MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks
4. Valliammai Valliappan: Machine Learning Models Have a Supply Chain Problem

One-page project proposal due, listing all team members' names (by 11:59 PM PST)

Week 5 Wed, Feb 11
Student paper presentations.
1. Jiate Li: Mudjacking: Patching Backdoor Vulnerabilities in Foundation Models
2. Faith Baca: SoK: The Privacy Paradox of Large Language Models
3. Shreyas Kolte: IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models

Week 6 Wed, Feb 18
Student paper presentations.
1. Tsung-Jui Wu: A Controllable Adversarial Attack against Diffusion Models
2. Mitesh Adake: JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
3. Aryan Bhusari: Rethinking Weight Decay for Robust Fine-Tuning of Foundation Models
4. Shicheng Wen: Instructional Fingerprinting of Large Language Models
5. Ashutosh Chaubey: Improving Adversarial Robustness in Vision-Language Models

Week 7 Wed, Feb 25
Student paper presentations.
1. Faith Baca: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
2. Ruth-Ann Armstrong: Keeping an Eye on LLM Unlearning: The Hidden Risk and Remedy
3. Xiaoqin Feng: Breaking Agents: Compromising Autonomous LLM Agents
4. Li Li: Hallucination Detection in LLMs Using Spectral Features
5. Li Li: Beyond Text: Multimodal Jailbreaking of Vision-Language Models

Week 8 Wed, Mar 4
Student paper presentations.
1. Raja Kumar: Chain-of-Intention Reasoning Elicits Defense in Multimodal LLMs
2. Anagha Shyama Prakash: A Comprehensive Study of Jailbreak Attack versus Defense
3. Xiaoqin Feng: Weak-to-Strong Jailbreaking on Large Language Models
4. Jae Won Choi: AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
5. Junhan Wu: Diffusion-Driven Prompt Manipulation for LLM Jailbreak

Part III. Defenses and verification

Week 9 Wed, Mar 11
Student paper presentations.
1. Mitesh Adake: Adversarial Reasoning at Jailbreaking Time
2. Aryan Bhusari: AnyAttack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-Language Models
3. Alexander Romanus: Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring
4. Shicheng Wen: GhostPrompt: Jailbreaking Text-to-image Generative Models
5. Ramya Krishnan: Is Poisoning a Real Threat to LLM Alignment?

Week 10 Wed, Mar 18
NO CLASS — Spring Break

Week 11 Wed, Mar 25
Student paper presentations.
1. Ojas Nimase: Searching for Privacy Risks in LLM Agents via Simulation
2. Anagha Shyama Prakash: Defenses Against Prompt Attacks Learn Surface Heuristics
3. Ojas Nimase: Paper number/name TBD
4. Alexander Romanus: Understanding and Rectifying Safety Perception Distortion in VLMs
5. Xu Wang: Towards Next-Level Robustness in Fine-tuning Zero-Shot Models

Mid-project slides due (TBD)

Week 12 Wed, Apr 1
Student paper presentations.
1. Ashutosh Chaubey: Paper number/name TBD
2. Xinyuan Li: Paper number/name TBD
3. Tianyi Zhang: RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback
4. Jae Won Choi: Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs
5. Junhan Wu: BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models

Part IV. Paper discussions and student projects

Week 13 Wed, Apr 8
Student paper presentations.
1. Faezeh Dehghan: Safety Alignment Should Be Made More Than Just Superficial
2. Ruth-Ann Armstrong: HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models
3. Tianyi Zhang: On the Vulnerability of Applying Retrieval-Augmented Generation
4. Ramya Krishnan: Memory Injection Attacks on LLM Agents via Query-Only
5. Jayavibhav Niranjan Kogundi: Reasoning Model Unlearning: Forgetting Traces, Not Just Answers

Week 14 Wed, Apr 15
Student paper presentations.
1. Shreyas Kolte: Training LLMs for Honesty via Confessions
2. Yi Nian: On the Vulnerability of Safety Alignment in Open-Access LLMs
3. Xu Wang: Trading Inference-Time Compute for Adversarial Robustness
4. Jingzhen Wang: Paper number/name TBD
5. Jingzhen Wang: Paper number/name TBD

Week 15 Wed, Apr 22
Student paper presentations.
1. Qi Pan: NeuroGenPoisoning: Neuron-Guided Attacks on Retrieval-Augmented Generation
2. Yi Nian: JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model
3. Faezeh Dehghan: Paper number/name TBD
4. Xinyuan Li: Paper number/name TBD
5. Tim Aris: Inference-Time Reward Hacking in Large Language Models

Week 16 Wed, Apr 29
Final Project Presentations: in-class project presentations.

Presentation sign-up sheet
Final Sun, May 3
Final project report due (no in-class final exam).
Due Sun, May 3, 2026, 11:59 PM PST (Gradescope)