Tag: 推理加速 (Inference Acceleration)
All the articles with the tag "推理加速" (Inference Acceleration).
初探AI Infra (A First Look at AI Infra)
Updated at 18:30 | Published at 16:04
Taking the recent internship search as an opportunity to study and consolidate the knowledge of model inference/training acceleration I had previously only touched on in passing, along with some material on CUDA programming and its system architecture.
SparseRT: Accelerating Unstructured Sparsity on GPUs for Deep Learning Inference
Updated at 15:06 | Published at 14:18
Operator generation for sparse matrix multiplication on GPUs: it exploits load balancing and unstructured sparsity for speedups, and generates PTX code tailored to the model.
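As a rough illustration of the load-balancing idea in that summary (a plain NumPy sketch of a CSR sparse matrix-vector product, not the paper's generated PTX; all names here are my own): nonzeros, rather than rows, are split evenly across workers so an irregular sparsity pattern does not leave some workers idle.

```python
import numpy as np

def spmv_balanced(values, col_idx, row_ptr, x, n_workers=4):
    """CSR sparse matrix-vector product where the nonzeros (not the rows)
    are split evenly across workers, mimicking a load-balanced GPU kernel."""
    n_rows, nnz = len(row_ptr) - 1, len(values)
    y = np.zeros(n_rows)
    bounds = np.linspace(0, nnz, n_workers + 1, dtype=int)  # equal-sized chunks of nonzeros
    for w in range(n_workers):
        for k in range(bounds[w], bounds[w + 1]):
            row = np.searchsorted(row_ptr, k, side='right') - 1  # row that nonzero k belongs to
            y[row] += values[k] * x[col_idx[k]]
    return y

# 3x3 example: [[1, 0, 2], [0, 0, 0], [0, 3, 0]]
vals, cols, ptr = np.array([1., 2., 3.]), np.array([0, 2, 1]), np.array([0, 2, 2, 3])
print(spmv_balanced(vals, cols, ptr, np.ones(3)))  # [3. 0. 3.]
```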
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Updated at 15:06 | Published at 13:27
FlashAttention, an algorithm that exploits the GPU memory hierarchy to speed up attention and cut memory usage. Its core ideas are tiling, online softmax, and kernel fusion.
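A minimal NumPy sketch of the online-softmax piece mentioned above (not the FlashAttention kernel itself; block size and variable names are my own choices): the softmax is computed block by block with only a running max and running normalizer, rescaling what has already been produced whenever the max changes.

```python
import numpy as np

def online_softmax(scores, block=4):
    """Softmax over `scores`, processed one block at a time with a running
    max `m`, a running normalizer `l`, and rescaled partial outputs."""
    m, l = -np.inf, 0.0
    out = np.zeros_like(scores, dtype=np.float64)
    for i in range(0, len(scores), block):
        s = scores[i:i + block].astype(np.float64)
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)            # rescale everything computed so far
        out[:i] *= scale
        l = l * scale + np.exp(s - m_new).sum()
        out[i:i + len(s)] = np.exp(s - m_new)
        m = m_new
    return out / l

x = np.random.randn(10)
assert np.allclose(online_softmax(x), np.exp(x - x.max()) / np.exp(x - x.max()).sum())
```

FlashAttention applies the same rescaling trick to the accumulated attention output, which is what lets it fuse the whole computation into one tiled kernel without materializing the full score matrix.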
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Updated at 15:06 | Published at 18:33
From Google; the first work to run a complete integer-only quantized inference pipeline end to end.
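A minimal sketch of the affine quantization mapping this line of work builds on (my own toy code, not the paper's scheme verbatim): a float tensor is represented as 8-bit integers plus a scale and zero-point, so that x ≈ scale * (q - zero_point).

```python
import numpy as np

def quantize(x, num_bits=8):
    """Affine (asymmetric) quantization: x ≈ scale * (q - zero_point)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.int32) - zero_point)

x = np.random.randn(6).astype(np.float32)
q, s, z = quantize(x)
print(dequantize(q, s, z) - x)  # small quantization error
```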
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Updated at 15:06 | Published at 18:32
From IPADS; uses a predictor model to identify which MoE experts or neurons an LLM will actually activate, reducing resource consumption.
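A toy sketch of the predictor-gated idea from that summary (my own simplification; the mask heuristic here is hypothetical, not PowerInfer's learned predictor): only the FFN neurons the predictor expects to fire are actually computed.

```python
import numpy as np

def gated_ffn(x, W1, W2, active_mask):
    """Compute only the FFN neurons marked active, skipping the rest."""
    h = np.maximum(x @ W1[:, active_mask], 0.0)  # ReLU over the selected neurons
    return h @ W2[active_mask, :]                # project back with the matching rows

d, hidden = 16, 64
W1, W2 = np.random.randn(d, hidden), np.random.randn(hidden, d)
x = np.random.randn(d)
# Hypothetical stand-in for a learned predictor: guess that ~30% of neurons fire.
mask = np.random.rand(hidden) < 0.3
print(gated_ffn(x, W1, W2, mask).shape)  # (16,) with roughly 30% of the FFN work
```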