
Can We Locate and Prevent Stereotypes in LLMs?

cs.CL updates on arXiv.org
Alex D'Souza

arXiv:2604.19764v1

Abstract: Stereotypes in large language models (LLMs) can perpetuate harmful societal biases, yet despite the widespread use of these models, little is known about where such biases reside within the network. This study investigates the internal mechanisms of GPT-2 Small and Llama 3.2 to locate stereotype-related activations. We explore two approaches: identifying individual contrastive neuron activations that encode stereotypes, and detecting attention heads that contribute heavily to biased outputs. Our experiments aim to map these "bias fingerprints" and provide initial insights for mitigating stereotypes.
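The abstract does not include code, but the first approach can be sketched in rough form: run a contrastive prompt pair through GPT-2 Small and rank neurons by how much their activations shift between the two prompts. The sketch below uses the standard Hugging Face `transformers` "gpt2" checkpoint; the prompt pair and the max-difference ranking heuristic are illustrative assumptions, not the authors' actual method.

import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def hidden_states(prompt):
    """Return per-layer hidden states for the final token of `prompt`."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Tuple of (n_layers + 1) tensors, each [batch, seq_len, d_model];
    # keep only the final token's activation at every layer.
    return torch.stack([h[0, -1] for h in out.hidden_states])

# Hypothetical contrastive pair differing only in the gendered pronoun.
h_stereo = hidden_states("The nurse said that she")
h_counter = hidden_states("The nurse said that he")

# Contrast: neurons whose activation shifts most between the two prompts
# are candidate carriers of the stereotype signal ("bias fingerprints").
diff = (h_stereo - h_counter).abs()          # shape: [n_layers + 1, d_model]
layer, neuron = divmod(diff.argmax().item(), diff.size(1))
print(f"Largest contrastive activation: layer {layer}, neuron {neuron}, "
      f"|delta| = {diff.max().item():.4f}")

A similar pass with `output_attentions=True` would return per-head attention weights, which could be ranked the same way to flag attention heads that contribute heavily to the biased completion, the second approach the abstract describes.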