Tag: LLM
All the articles with the tag "LLM".
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Updated: at 15:06 · Published: at 13:27
Flash Attention, an algorithm that exploits the hardware structure to speed up attention computation and reduce memory usage. Its core ideas are Tiling, Online Softmax, and Kernel Fusion.
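As a rough illustration of the online-softmax idea mentioned above, the sketch below processes one query row block by block while keeping a running max, a running denominator, and a rescaled output accumulator. The function name, the block size, and the omission of the 1/sqrt(d) scaling and the actual GPU shared-memory tiling are all assumptions of this toy example, not the paper's kernel.

```python
import numpy as np

def streaming_attention(q, K, V, block=2):
    """One query row of attention computed block by block with the
    online-softmax trick: a running max m, running denominator l, and a
    rescaled output accumulator, so the full score row is never
    materialized (toy sketch only; no 1/sqrt(d) scaling, no GPU tiling)."""
    m = -np.inf                               # running max of scores seen so far
    l = 0.0                                   # running softmax denominator
    acc = np.zeros(V.shape[1], dtype=float)   # rescaled output accumulator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q                            # scores for this block
        m_new = max(m, s.max())
        p = np.exp(s - m_new)                 # block's unnormalized probabilities
        scale = np.exp(m - m_new)             # rescale old accumulators to new max
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l

# Agrees with the naive softmax(q·K^T) @ V:
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
s = K @ q
naive = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
print(np.allclose(streaming_attention(q, K, V), naive))  # True
```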
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Updated: at 15:06 · Published: at 18:32
From IPADS; uses a model to predict which MoE experts or neurons in the LLM will be activated, reducing resource consumption.
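As a rough illustration of the activation-sparsity idea, the toy sketch below runs a ReLU feed-forward layer only over the neurons a predictor flags as active. The names `sparse_ffn` and `predicted_active`, and the oracle "predictor" in the demo, are hypothetical and stand in for PowerInfer's learned predictor and hybrid CPU/GPU kernels; they are not the system's actual API.

```python
import numpy as np

def sparse_ffn(x, W_up, W_down, predicted_active):
    """Compute a ReLU feed-forward layer only over the neurons a
    predictor marked as likely active (toy sketch of activation
    sparsity; `predicted_active` would come from a small learned
    predictor in the real system)."""
    W_up_a = W_up[predicted_active]           # (k, d) rows of the active neurons
    W_down_a = W_down[:, predicted_active]    # (d, k) matching output columns
    h = np.maximum(W_up_a @ x, 0.0)           # activations of the selected neurons
    return W_down_a @ h

# If the predictor captures every truly active neuron, the result
# matches the dense FFN, because ReLU zeroes out the rest.
rng = np.random.default_rng(0)
d, n = 8, 32
x = rng.normal(size=d)
W_up, W_down = rng.normal(size=(n, d)), rng.normal(size=(d, n))
active = np.flatnonzero(W_up @ x > 0)         # oracle "predictor" for the demo
dense = W_down @ np.maximum(W_up @ x, 0.0)
print(np.allclose(sparse_ffn(x, W_up, W_down, active), dense))  # True
```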