Tag: 推理加速 (Inference Acceleration)
All the articles with the tag "推理加速" (Inference Acceleration).
初探AI Infra (A First Look at AI Infra)
Updated at 18:30 | Published at 16:04
Taking the recent internship search as an opportunity to study and consolidate the knowledge of model inference/training acceleration I had previously only touched on in passing, along with some material on CUDA programming and its system architecture.
SparseRT: Accelerating Unstructured Sparsity on GPUs for Deep Learning Inference
Updated at 15:06 | Published at 14:18
Operator generation for sparse matrix multiplication on GPUs: it exploits load balancing and unstructured sparsity for speedups, and generates PTX code tailored to the model.
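As a rough illustration of the load-balancing idea in that summary (a plain NumPy sketch of a CSR sparse matrix-vector product, not the paper's generated PTX; all names here are my own): nonzeros, rather than rows, are split evenly across workers so an irregular sparsity pattern does not leave some workers idle.

```python
import numpy as np

def spmv_balanced(values, col_idx, row_ptr, x, n_workers=4):
    """CSR sparse matrix-vector product where the nonzeros (not the rows)
    are split evenly across workers, mimicking a load-balanced GPU kernel."""
    n_rows, nnz = len(row_ptr) - 1, len(values)
    y = np.zeros(n_rows)
    bounds = np.linspace(0, nnz, n_workers + 1, dtype=int)  # equal-sized chunks of nonzeros
    for w in range(n_workers):
        for k in range(bounds[w], bounds[w + 1]):
            row = np.searchsorted(row_ptr, k, side='right') - 1  # row that nonzero k belongs to
            y[row] += values[k] * x[col_idx[k]]
    return y

# 3x3 example: [[1, 0, 2], [0, 0, 0], [0, 3, 0]]
vals, cols, ptr = np.array([1., 2., 3.]), np.array([0, 2, 1]), np.array([0, 2, 2, 3])
print(spmv_balanced(vals, cols, ptr, np.ones(3)))  # [3. 0. 3.]
```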
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Updated at 15:06 | Published at 13:27
FlashAttention, an algorithm that exploits the GPU memory hierarchy to speed up attention and cut memory usage. Its core ideas are tiling, online softmax, and kernel fusion.
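A minimal NumPy sketch of the online-softmax piece mentioned above (not the FlashAttention kernel itself; block size and variable names are my own choices): the softmax is computed block by block with only a running max and running normalizer, rescaling what has already been produced whenever the max changes.

```python
import numpy as np

def online_softmax(scores, block=4):
    """Softmax over `scores`, processed one block at a time with a running
    max `m`, a running normalizer `l`, and rescaled partial outputs."""
    m, l = -np.inf, 0.0
    out = np.zeros_like(scores, dtype=np.float64)
    for i in range(0, len(scores), block):
        s = scores[i:i + block].astype(np.float64)
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)            # rescale everything computed so far
        out[:i] *= scale
        l = l * scale + np.exp(s - m_new).sum()
        out[i:i + len(s)] = np.exp(s - m_new)
        m = m_new
    return out / l

x = np.random.randn(10)
assert np.allclose(online_softmax(x), np.exp(x - x.max()) / np.exp(x - x.max()).sum())
```

FlashAttention applies the same rescaling trick to the accumulated attention output, which is what lets it fuse the whole computation into one tiled kernel without materializing the full score matrix.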
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Updated at 15:06 | Published at 18:33
From Google; the first work to run a complete integer-only quantized inference pipeline end to end.
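A minimal sketch of the affine quantization mapping this line of work builds on (my own toy code, not the paper's scheme verbatim): a float tensor is represented as 8-bit integers plus a scale and zero-point, so that x ≈ scale * (q - zero_point).

```python
import numpy as np

def quantize(x, num_bits=8):
    """Affine (asymmetric) quantization: x ≈ scale * (q - zero_point)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.int32) - zero_point)

x = np.random.randn(6).astype(np.float32)
q, s, z = quantize(x)
print(dequantize(q, s, z) - x)  # small quantization error
```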
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Updated at 15:06 | Published at 18:32
From IPADS; uses a predictor model to identify which MoE experts or neurons an LLM will actually activate, reducing resource consumption.
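A toy sketch of the predictor-gated idea from that summary (my own simplification; the mask heuristic here is hypothetical, not PowerInfer's learned predictor): only the FFN neurons the predictor expects to fire are actually computed.

```python
import numpy as np

def gated_ffn(x, W1, W2, active_mask):
    """Compute only the FFN neurons marked active, skipping the rest."""
    h = np.maximum(x @ W1[:, active_mask], 0.0)  # ReLU over the selected neurons
    return h @ W2[active_mask, :]                # project back with the matching rows

d, hidden = 16, 64
W1, W2 = np.random.randn(d, hidden), np.random.randn(hidden, d)
x = np.random.randn(d)
# Hypothetical stand-in for a learned predictor: guess that ~30% of neurons fire.
mask = np.random.rand(hidden) < 0.3
print(gated_ffn(x, W1, W2, mask).shape)  # (16,) with roughly 30% of the FFN work
```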