Analog in-memory computing attention mechanism for fast and energy-efficient large language models
This paper introduces an analog in-memory computing architecture based on "gain cells" for the attention mechanism in large language models (LLMs). Compared to GPUs, the hardware reduces energy consumption by up to four orders of magnitude and latency by up to two orders of magnitude, while achieving performance comparable to GPT-2 despite hardware-specific non-idealities and constraints such as capacitor leakage. The authors also develop an adaptation algorithm that maps pre-trained models onto the hardware without training from scratch.
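To make the leakage issue concrete, here is a minimal sketch (not the paper's implementation) of how one might simulate causal attention when keys and values are held on leaky analog cells. The exponential decay model, the `leak_tau` time constant, and the clipping range standing in for limited analog dynamic range are all illustrative assumptions introduced here, not values from the paper.

```python
# Hypothetical simulation: attention over K/V entries stored on leaky analog cells.
# Decay model, leak_tau, and clip are assumptions for illustration only.
import numpy as np

def attention_with_leakage(Q, K, V, leak_tau=64.0, clip=3.0):
    """Causal scaled dot-product attention where older K/V entries,
    assumed to be held on leaky capacitive cells, decay with their age."""
    T, d = Q.shape
    out = np.zeros_like(V)
    for t in range(T):
        ages = t - np.arange(t + 1)                  # time each stored token has aged
        decay = np.exp(-ages / leak_tau)             # assumed exponential charge leakage
        K_eff = np.clip(K[: t + 1] * decay[:, None], -clip, clip)  # limited analog range
        V_eff = np.clip(V[: t + 1] * decay[:, None], -clip, clip)
        scores = Q[t] @ K_eff.T / np.sqrt(d)
        w = np.exp(scores - scores.max())            # numerically stable softmax
        out[t] = (w / w.sum()) @ V_eff
    return out

# Usage example: compare near-ideal vs. leaky attention on random activations.
rng = np.random.default_rng(0)
T, d = 16, 8
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
ideal = attention_with_leakage(Q, K, V, leak_tau=1e9, clip=1e9)  # effectively no leakage
leaky = attention_with_leakage(Q, K, V)
print("mean abs deviation from ideal:", np.abs(ideal - leaky).mean())
```

A non-ideality model of this kind is the sort of forward-pass discrepancy that an adaptation algorithm, such as the one the authors describe, would need to compensate for when mapping pre-trained weights to the analog hardware.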