-
vLLM
High-throughput and memory-efficient inference and serving engine for LLMs
python · cuda · pytorch
-
TileLang
DSL for high-performance GPU/CPU/Accelerator kernels
python · compiler
-
SGLang
High-performance serving framework for LLMs and multimodal models
python · inference
-
DeepGEMM
Clean and efficient FP8 GEMM kernels with fine-grained scaling
cuda · fp8
-
FlashMLA
Efficient MLA decoding kernels
c++ · cuda
-
nanochat
Train a 500M+ parameter LLM end-to-end for under $100
python · llm training