Posts

All the articles I've posted.

Nested Learning: The Illusion of Deep Learning Architectures
Updated:2025年11月10日 at 17:08Published: 2025年11月8日 at 11:40
谷歌新作，号称“深度学习新范式”。提到了异步，具体指的是让模型靠近输入的位置的更新频率高于靠后的位置，这个思路和之前Sakana AI的那个文章有点像。但文章里面的东西感觉全都是Fast Weight Programming的内容，arxiv的文章全文也一直没挂出来。
Kimi Linear: An Expressive, Efficient Attention Architecture
Updated:2025年11月4日 at 19:10Published: 2025年11月4日 at 13:55
Kimi Linear，有比较详细的实验&Scale Up。有Linear Attention可以去掉RoPE这个结论还是比较惊喜的。
Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
Updated:2025年10月16日 at 15:15Published: 2025年10月7日 at 17:05
AI Lab关于”广义“LLM推理加速的工作，包括Linear Attention，Sparse Attention，Diffusion LLM，Applications等。
Neuromorphic Principles for Efficient Large Language Models on Intel Loihi 2
Updated:2025年9月29日 at 00:39Published: 2025年9月28日 at 23:32
ICLR2025 Workshop，基于HAQ实现的Matmul-Free SNN LLM（虽然只做了370M参数的实验）部署到Loihi2上，实现了相比于Qwen-500M 模型3\timesThroughput和2\times能效。但说实话文章内容关键点都没怎么讲，也没有什么特别很exciting的东西。
Parallelizing Linear Transformers with the Delta Rule over Sequence Length
Updated:2025年9月26日 at 16:46Published: 2025年9月25日 at 14:43
DeltaNet
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
Updated:2025年9月24日 at 15:07Published: 2025年9月23日 at 13:50
VLDB2024，阿里的工作，看起来工程特别扎实。LLM任务上只通过对weight做sparse load就能在decode阶段获得3-4倍的提速。
SpikingBrain-瞬息 1.0技术报告：原生国产自主可控类脑脉冲大模型
Updated:2025年9月15日 at 14:34Published: 2025年9月15日 at 10:46
李国齐老师组的新工作技术报告。说实话，我并不觉得这是一个正经的SNN-LLM工作，感觉已经完全是Linear Attention国产化的工作了。很难评价。
MLP Memory: Language Modeling with Retriever-pretrained External Memory
Published:2025年8月25日 at 14:22
用MLP学习并代替RAG中kNN输出的概率分布。
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Published:2025年8月14日 at 16:04
ACL2025 Best Paper，DeepSeek新作。分层KV Cache提高稀疏度，在训练和推理阶段同时提高性能。
GPU上的SNN稀疏加速
Updated:2025年7月14日 at 11:09Published: 2025年7月13日 at 14:11
把最近做的关于GPU上SNN稀疏加速的东西做一下总结，虽然不太成功。

Posts

Nested Learning: The Illusion of Deep Learning Architectures

Kimi Linear: An Expressive, Efficient Attention Architecture

Speed Always Wins: A Survey on Efficient Architectures for Large Language Models

Neuromorphic Principles for Efficient Large Language Models on Intel Loihi 2

Parallelizing Linear Transformers with the Delta Rule over Sequence Length

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

SpikingBrain-瞬息 1.0技术报告：原生国产自主可控类脑脉冲大模型

MLP Memory: Language Modeling with Retriever-pretrained External Memory

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

GPU上的SNN稀疏加速