My current project is related to capsule networks, and it requires many small matrix-vector multiplications.
Without optimization, I think one big matvec and many small m