This blog post has been archived and is no longer accessible. For the updated version with improved performance, please see "Beating OpenBLAS in FP32 Matrix Multiplication".