Shir Rozenfeld

ML researcher working on LLM safety, activation-based monitoring, and interpretability.

Data Scientist, Microsoft — Security CTO organization · Israel

intro

About

I'm an ML researcher and data scientist working at the intersection of large language models and security. My work focuses on making LLMs safer and more transparent — reading and monitoring what a model is doing at the level of its internal activations, rather than judging it only by its outputs.

I currently work as a Data Scientist at Microsoft, in the Security CTO organization. I led the development of GAVEL, an open-source framework for detecting misaligned behaviors and intentions in LLMs through rule-based monitoring over model activations (ICLR 2026; a Black Hat USA 2026 Briefing). Earlier, I researched interpretability and offensive AI at the Offensive AI Lab, Ben-Gurion University, and led a data-analysis team on national security missions.

Research interests

LLM safety · activation-based monitoring · mechanistic interpretability · AI agents · model behavior & misalignment detection · offensive AI

Education

  • M.Sc. in Data Science, Ben-Gurion University — Meitar Excellence Program · 2024–2026
  • B.Sc. in Computer Software Engineering, Ben-Gurion University — 2021–2025

focus

Research

LLM safety & activation monitoring

Real-time detection of unsafe or misaligned behavior by monitoring a model's internal activations — the line of work behind GAVEL.

Interpretability & mechanistic analysis

Understanding the internal computations of language models, from individual features up to circuits.

AI agents for interpretability

Using agent-tool workflows to automate hypothesis generation, experiment orchestration, and validation of model-internal circuits.

Model behavior & misalignment

Characterizing how models behave, when they fail, and how they can be misused — including LLM-driven social engineering.

papers

Publications

  1. GAVEL: Towards Rule-Based Safety Through Activation Monitoring

    Shir Rozenfeld, et al.

    ICLR 2026 · International Conference on Learning Representations

    arXiv · code

  2. Love, Lies, and Language Models: Investigating AI's Role in Romance-Baiting Scams

    Shir Rozenfeld, with collaborators at the Offensive AI Lab

    USENIX Security 2026 · 35th USENIX Security Symposium

    arXiv

Full list on Google Scholar.

speaking

Talks

GAVEL: Real-Time LLM Safety Monitoring Through Activation Analysis

ICLR 2026 · Poster & talk · Apr 2026

Catching misaligned intentions in LLMs in real time by monitoring their internal activations.

ICLR video

GAVEL — Detecting Misaligned Intentions in LLMs via Activation Monitoring

Black Hat USA 2026 · Briefing · Aug 2026

Black Hat

say hi 👋

Contact

I'm glad to talk about LLM safety, interpretability, and offensive AI. The best ways to reach me are email and LinkedIn.