Shir Rozenfeld
ML researcher working on LLM safety, activation-based monitoring, and interpretability.
Data Scientist, Microsoft — Security CTO organization · Israel
About
I'm an ML researcher and data scientist working at the intersection of large language models and security. My work focuses on making LLMs safer and more transparent — reading and monitoring what a model is doing at the level of its internal activations, rather than judging it only by its outputs.
I currently work as a Data Scientist at Microsoft, in the Security CTO organization. I led the development of GAVEL, an open-source framework for detecting misaligned behaviors and intentions in LLMs through rule-based monitoring over model activations (ICLR 2026; a Black Hat USA 2026 Briefing). Earlier, I researched interpretability and offensive AI at the Offensive AI Lab, Ben-Gurion University, and led a data-analysis team on national security missions.
Research interests
LLM safety · activation-based monitoring · mechanistic interpretability · AI agents · model behavior & misalignment detection · offensive AI
Education
- M.Sc. in Data Science, Ben-Gurion University
- B.Sc. in Computer Software Engineering, Ben-Gurion University
News
- 2026 · Heron delegation @ London Initiative for Safe AI (LISA)
  Visited LISA with the Heron team and 14 people from our network, supporting experienced cyber professionals as they bring their skills into AI security.
- 2026 · EA Global: London 2026
  Attended EAG London to connect with the AI safety community.
Research
LLM safety & activation monitoring
Real-time detection of unsafe or misaligned behavior by monitoring a model's internal activations — the line of work behind GAVEL.
Interpretability & mechanistic analysis
Understanding the internal computations of language models, from individual features up to circuits.
AI agents for interpretability
Using agent-tool workflows to automate hypothesis generation, experiment orchestration, and validation of model-internal circuits.
Model behavior & misalignment
Characterizing how models behave, when they fail, and how they can be misused — including LLM-driven social engineering.
Publications
- GAVEL: Towards Rule-Based Safety Through Activation Monitoring
  ICLR 2026 · International Conference on Learning Representations
- Love, Lies, and Language Models: Investigating AI's Role in Romance-Baiting Scams
  USENIX Security 2026 · 35th USENIX Security Symposium
Full list on Google Scholar.
Talks
GAVEL: Real-Time LLM Safety Monitoring Through Activation Analysis
Catching misaligned intentions in LLMs in real time by monitoring their internal activations.
GAVEL — Detecting Misaligned Intentions in LLMs via Activation Monitoring
Contact
I'm glad to talk about LLM safety, interpretability, and offensive AI. The best ways to reach me are email and LinkedIn.