Mar 28, 2023 | Read time 8 min

How to Accurately Time CUDA Kernels in PyTorch

As machine learning models become more expensive to deploy, accurately timing GPU operations is key to optimizing resource usage. In this blog post, we explore best practices for doing so in PyTorch.
Lawrence Atkins, Machine Learning Engineer
David MacLeod, Machine Learning Architect
Authors: Lawrence Atkins & David MacLeod
Acknowledgements: Caroline Dockes, Ed Rees, Ellena Reid & Markus Hennerbichler