Tag: LLM
All the articles with the tag "LLM".
初探AI Infra
Updated: at 18:30 · Published: at 16:04
Taking the opportunity of my recent internship search to study and consolidate knowledge I had previously encountered only in scattered form: model inference/training acceleration, plus some material on CUDA programming and its system architecture.
Titans: Learning to Memorize at Test Time
Updated: at 14:57 · Published: at 18:36
A new architecture derived from TTT, which attempts to improve the model's memory capability through the TTT approach.
Were RNNs All We Needed?
Updated: at 15:06 · Published: at 16:07
Improves RNNs to make them easier to scale up.
I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models
Updated: at 15:06 · Published: at 17:11
An integer-only PTQ (post-training quantization) work for LLMs.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Updated: at 15:06 · Published: at 13:27
FlashAttention, an algorithm that exploits the hardware memory hierarchy to speed up attention computation and reduce memory usage. Its core ideas are tiling, online softmax, and kernel fusion.
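The "online softmax" mentioned above can be sketched in a few lines. This is an illustrative one-pass Python version (not the paper's CUDA kernel): it keeps a running maximum and a running sum of exponentials, rescaling the sum whenever the maximum grows, so the scores never need to be materialized all at once.

```python
import math

def online_softmax(xs):
    """One-pass softmax over a stream of scores.

    Maintains a running max `m` and a running sum `s` of exp(x - m),
    rescaling `s` whenever `m` increases. This incremental form is
    what lets FlashAttention process attention scores tile by tile.
    """
    m = float("-inf")  # running maximum seen so far
    s = 0.0            # running sum of exp(x_i - m)
    for x in xs:
        m_new = max(m, x)
        # rescale the old sum to the new max, then add the new term
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / s for x in xs]
```

The result matches the standard two-pass (max, then normalize) softmax; the point is that `m` and `s` can be updated block by block without revisiting earlier scores.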
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Updated: at 15:06 · Published: at 18:32
From IPADS; uses a predictor model to identify which MoE experts or neurons in an LLM need to be activated, reducing resource consumption.