PyTorch FP16 slow. 1 CUDA extension. 10s (FP16), ~ 0. vs PyTorch (high ...