Beating NumPy matrix multiplication in 150 lines of C