Agent: Cursor, Claude CodeLLM: Claude 3.5, GPT-4#mechanistic-interpretability#abliteration#model-editing#llm-research#open-source
OBLITERATUS is an advanced open-source toolkit for understanding and removing refusal behaviors from large language models through abliteration techniques. It provides a complete pipeline from probing hidden states to surgical intervention, with a Gradio interface on HuggingFace Spaces and a Python API for researchers. Every run contributes to a crowd-sourced dataset powering next-generation abliteration research.