JohannesGaessler
CUDA: faster large batch FA without tensor cores (llama/7314)
a6d9f2d