Shir Rozenfeld

ML researcher working on AI Safety through mechanistic interpretability.

Data Scientist @ Microsoft Security · PhD Candidate @ BGU

intro

About

I'm a Data Scientist at the Cyber Security GenAI Lab in the Microsoft Security CTO Office, and a PhD candidate at Ben-Gurion University, where I do Technical AI Safety research in the Offensive AI Research Lab. Earlier, I led the development of GAVEL, an open-source framework for detecting misaligned behaviors and intentions in LLMs through activation analysis (ICLR 2026; Black Hat USA 2026 Briefing). Before that, I led a data analysis team on national security missions.

Research interests

LLM safety · activation analysis · mechanistic interpretability · AI agents · model behavior & intention detection · offensive AI

Education

M.Sc. in Data Science, Ben-Gurion University — Meitar Excellence Program · 2024–2026
B.Sc. in Computer Software Engineering, Ben-Gurion University — 2021–2025

LATELY

News

2026

🎤 Giving a talk at Black Hat USA 2026 Briefings

Presenting an extension of our work at one of the world's top cybersecurity conferences, August 5–6 in Las Vegas.

2026

🎤 Gave a talk at Heron Lightning Talks, TLV

A lightning talk on recent advancements in AI Safety for the AI Security community.

2026

🌍 EA Global: London 2026

Attended EAG London to connect with the AI safety community.

2026

🎉 Paper accepted at ICLR 2026

GAVEL: Towards Rule-Based Safety Through Activation Monitoring — arXiv · code

2026

🎉 Paper accepted at USENIX Security 2026

Love, Lies, and Language Models: Investigating AI's Role in Romance-Baiting Scams — arXiv

focus

Research

LLM safety & activation monitoring

Real-time detection of unsafe or misaligned behavior by monitoring a model's internal activations — the line of work behind GAVEL.

Interpretability & mechanistic analysis

Understanding the internal computations of language models, from individual features up to circuits.

AI agents for interpretability

Using agent-tool workflows to automate hypothesis generation, experiment orchestration, and validation of model-internal circuits.

Model behavior & misalignment

Characterizing how models behave, when they fail, and how they can be misused — including LLM-driven social engineering.

say hi 👋

Contact

I'm glad to talk about LLM safety, interpretability, and offensive AI. The best ways to reach me are email and LinkedIn.

shirozenfeld@gmail.com