Show HN: Sparse Matrix-Vector Multiplication that works at 30–90% sparsity

github.com

7 points by vlejd a day ago

To get benefits from sparsity, you usually need very sparse matrices, a structured sparsity pattern, or specialized hardware. None of these hold if you want to run pruned LLMs on consumer devices. I wanted to see how far you can push it on a GPU and ended up with this.

Blog: https://www.grizzlytech.dev/blog/macko-spmv

Paper: https://arxiv.org/abs/2511.13061

Code (example with torch): https://github.com/vlejd/macko_spmv
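
For context on what "standard sparsity doesn't pay off here" looks like, a minimal torch sketch (stock torch CSR, nothing MACKO-specific; the 4096 sizes and the 70% level are arbitrary) that times a dense matvec against a CSR matvec on the same pruned matrix:

    import time
    import torch

    torch.manual_seed(0)
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # A 70%-sparse weight matrix, as you might get from unstructured pruning.
    # fp32 here for broad torch sparse support; LLM weights would be fp16/bf16.
    m, n = 4096, 4096
    w = torch.randn(m, n, device=device)
    pruned = w * (torch.rand(m, n, device=device) > 0.7)

    csr = pruned.to_sparse_csr()          # stock CSR, no special structure
    x = torch.randn(n, 1, device=device)

    def bench(fn, iters=50):
        fn()                              # warm-up
        if device == "cuda":
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            fn()
        if device == "cuda":
            torch.cuda.synchronize()
        return (time.perf_counter() - t0) / iters

    print("dense matvec:", bench(lambda: pruned @ x))
    print("CSR matvec:  ", bench(lambda: csr @ x))

The common observation, and the premise of the post, is that at this sparsity level the plain CSR path gives little or no speedup over the dense matvec.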

telmop 21 hours ago

Cool method. Pre-deep-learning there was plenty of interesting research on sparse methods. What do you think we're missing to have more widely used neural+sparse approaches?

  • vlejd 21 hours ago

    I think the lack of efficient GPU kernels was the main problem. It is much, much easier to get a real speedup and memory reduction by quantizing from fp16 to fp8 than from 50% sparsity. For sparsity you needed structure (which makes your model worse) and special hardware support.
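
    To make the fp8-vs-50%-sparsity comparison concrete, a quick back-of-envelope (assuming plain CSR with int32 column indices as the baseline format, which is not what MACKO uses; the 4096x4096 size is arbitrary):

        # Back-of-envelope memory for a 4096x4096 fp16 weight matrix.
        rows = cols = 4096
        elems = rows * cols

        dense_fp16 = elems * 2                 # 2 bytes per value
        dense_fp8 = elems * 1                  # 1 byte per value

        # 50% unstructured sparsity stored as plain CSR with fp16 values:
        nnz = elems // 2
        csr_fp16 = nnz * 2 + nnz * 4 + (rows + 1) * 4   # values + int32 col indices + row pointers

        for name, size in [
            ("dense fp16", dense_fp16),
            ("dense fp8", dense_fp8),
            ("50% sparse CSR fp16", csr_fp16),
        ]:
            print(f"{name:20s} {size / 2**20:6.1f} MiB")

    At 50% sparsity the plain CSR copy comes out larger than the dense fp16 matrix (roughly 48 MiB vs 32 MiB here), while fp8 halves it outright.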

jjgreen a day ago

Interesting approach -- thanks

  • fleahunter 19 hours ago

    Interesting approach! I've been thinking a lot about how often we get caught up in striving for extreme sparsity without considering the practical implications of using pruned models on consumer hardware. It reminds me of a project I worked on where we had to optimize for both performance and memory constraints, and we found ourselves tangled in the weeds of matrix representation.

    I'm curious about your performance metrics—did you find any surprising edge cases when dealing with certain sparsity patterns on different GPUs? I can imagine folks running LLMs on consumer devices will appreciate any optimizations that help squeeze out more efficiency, especially if they’re dealing with larger models. And you mentioned that the MACKO format works across all GPUs—this could really democratize these technologies. That's exciting!

    Have you thought about how this might impact other areas of machine learning or even classical algorithm work? I'd love to hear more about the community's thoughts on bridging the gap between pruning and quantization too.

    • vlejd an hour ago

      Interestingly enough, we found that cuBLAS is not that well optimized for some less common consumer GPUs, specifically the 3090. We saw that it didn't really reach its full potential for a lot of different matrix shapes, probably because of poor tuning. Our kernel, on the other hand, does not have any tunable parameters, yet it was able to outperform cuBLAS even in settings where it has no right to do so.
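
      If anyone wants to check this on their own card, a rough sketch of the kind of shape sweep involved (the shapes are just examples; as far as I know the dense fp16 matvec below is dispatched to cuBLAS on CUDA):

          import time
          import torch

          assert torch.cuda.is_available()
          torch.manual_seed(0)

          def bench(fn, iters=100):
              fn()                      # warm-up
              torch.cuda.synchronize()
              t0 = time.perf_counter()
              for _ in range(iters):
                  fn()
              torch.cuda.synchronize()
              return (time.perf_counter() - t0) / iters

          # Typical LLM-ish weight shapes for a dense fp16 matvec.
          for m, n in [(4096, 4096), (4096, 11008), (11008, 4096), (5120, 13824)]:
              w = torch.randn(m, n, device="cuda", dtype=torch.float16)
              x = torch.randn(n, 1, device="cuda", dtype=torch.float16)
              t = bench(lambda: w @ x)
              gbps = (w.numel() * 2) / t / 1e9   # rough: traffic dominated by reading W
              print(f"{m:6d} x {n:6d}: {t * 1e6:7.1f} us  ~{gbps:6.1f} GB/s")

      If the effective bandwidth swings a lot between shapes, or sits well below the card's peak, that is the kind of under-tuning described above.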

      Regarding patterns, we tested mainly random matrices and ones created by Wanda pruning. 2:4 sparsity (a commonly used structure) will give the same results as a random matrix (probably even better). Block sparsity, on the other hand, can be very close to a worst-case scenario for our format, because it produces disproportionately long runs of zeros.
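
      For reference, the three patterns mentioned can be generated as masks like this (a small illustrative sketch; the sizes and 4x4 blocks are arbitrary choices, not the paper's setup):

          import torch

          torch.manual_seed(0)
          m, n, block = 64, 64, 4

          # Unstructured random mask at ~50% sparsity.
          random_mask = torch.rand(m, n) > 0.5

          # 2:4 structured mask: keep the 2 largest-magnitude weights in every group of 4.
          w = torch.randn(m, n)
          groups = w.abs().reshape(m, n // 4, 4)
          keep = groups.argsort(dim=-1, descending=True)[..., :2]
          mask_24 = torch.zeros_like(groups, dtype=torch.bool)
          mask_24.scatter_(-1, keep, True)
          mask_24 = mask_24.reshape(m, n)

          # Block-sparse mask: drop whole 4x4 tiles, creating long runs of zeros per row.
          tiles = torch.rand(m // block, n // block) > 0.5
          block_mask = tiles.repeat_interleave(block, 0).repeat_interleave(block, 1)

          for name, mask in [("random", random_mask), ("2:4", mask_24), ("block", block_mask)]:
              print(f"{name:7s} density = {mask.float().mean().item():.2f}")

      All three land around 50% density, but the block mask concentrates its zeros into block-wide runs, which is the behavior flagged above as close to the worst case for the format.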

      Regarding other use cases, we are looking into it, but the most common ones we found are for much sparser matrices, typically with less than 1% of entries nonzero. If you know about some other use case in the 30–90% range, let us know.