Tag: LLM
All the articles with the tag "LLM".
初探AI Infra
Updated: at 18:30 · Published: at 16:04
Taking the opportunity of my recent internship search to study and consolidate knowledge I had previously encountered only in scattered form: model inference/training acceleration, plus some material on CUDA programming and its system architecture.
Titans: Learning to Memorize at Test Time
Updated: at 14:57 · Published: at 18:36
A new architecture derived from TTT, which attempts to improve the model's memory capability through the TTT approach.
Were RNNs All We Needed?
Updated: at 15:06 · Published: at 16:07
Improves RNNs to make them easier to scale up.
I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models
Updated: at 15:06 · Published: at 17:11
An integer-only PTQ (post-training quantization) work for LLMs.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Updated: at 15:06 · Published: at 13:27
FlashAttention, an algorithm that exploits the hardware memory hierarchy to speed up attention computation and reduce memory usage. Its core ideas are tiling, online softmax, and kernel fusion.
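The "online softmax" mentioned above can be sketched in a few lines. This is an illustrative one-pass Python version (not the paper's CUDA kernel): it keeps a running maximum and a running sum of exponentials, rescaling the sum whenever the maximum grows, so the scores never need to be materialized all at once.

```python
import math

def online_softmax(xs):
    """One-pass softmax over a stream of scores.

    Maintains a running max `m` and a running sum `s` of exp(x - m),
    rescaling `s` whenever `m` increases. This incremental form is
    what lets FlashAttention process attention scores tile by tile.
    """
    m = float("-inf")  # running maximum seen so far
    s = 0.0            # running sum of exp(x_i - m)
    for x in xs:
        m_new = max(m, x)
        # rescale the old sum to the new max, then add the new term
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / s for x in xs]
```

The result matches the standard two-pass (max, then normalize) softmax; the point is that `m` and `s` can be updated block by block without revisiting earlier scores.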
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Updated: at 15:06 · Published: at 18:32
From IPADS; uses a predictor model to identify which MoE experts or neurons in an LLM need to be activated, reducing resource consumption.