Submissions from siboehm.com

		Fast Multidimensional Matrix Multiplication on CPU from Scratch (2022) (siboehm.com)
		74 points by georgehill 38 days ago | 23 comments
		How to optimize a CUDA matmul kernel for cuBLAS-like performance (2022) (siboehm.com)
		103 points by mpweiher 43 days ago | 33 comments
		Pipeline Parallelism: Distributed Training via Model Partitioning (siboehm.com)
		2 points by ml_basics 7 months ago
		Fast Multidimensional Matrix Multiplication on CPU from Scratch (siboehm.com)
		3 points by softwaredoug on Aug 25, 2023
		How to Optimize a CUDA Matmul Kernel for CuBLAS-Like Performance: A Worklog (siboehm.com)
		130 points by todsacerdoti on Jan 5, 2023 | 16 comments
		Data-parallel distributed training of deep learning models (siboehm.com)
		1 point by siboehm on Nov 13, 2022
		Lleaves – Compiling decision trees for fast prediction using LLVM (siboehm.com)
		4 points by kylebarron on Sept 20, 2021