Choose a GPU

GPUs on HPC

GPU         Full Name                    Year Launched   Memory         # of Tensor Cores
A100        NVIDIA A100                  2020            40GB or 80GB   432 (3rd gen)
A6000       NVIDIA RTX A6000             2020            48GB           336 (3rd gen)
A40         NVIDIA A40                   2020            48GB           336 (3rd gen)
RTX3090     NVIDIA GeForce RTX 3090      2020            24GB           328 (3rd gen)
RTX2080Ti   NVIDIA GeForce RTX 2080 Ti   2018            11GB           544 (2nd gen)
V100        NVIDIA V100                  2018            32GB           640 (1st gen)

Wait Time in the Queue

  • You may not need to request an A100 GPU!
  • Requesting an A100 may mean you wait in the queue much longer than you would for another GPU.
  • The overall time to solution (wait time + execution time) could therefore be longer than if you had used another GPU.
Image source: https://researchcomputing.princeton.edu/support/knowledge-base/scaling-analysis

Memory Required to Train a DL Model

Generally, you will choose a GPU based on how much GPU memory you need. However, determining how much GPU memory a DL model will need for training, before actually training it, is a hard problem.

  • In addition to storing the DL model itself, training requires additional GPU memory for:
    • Optimizer states
    • Gradients
    • Intermediate activations and data (how much is determined by the batch size)
  • Training can also use automatic mixed precision, which lowers the amount of memory needed; a minimal PyTorch sketch appears after the link below.
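Putting the items above together gives a rough lower bound. The sketch below is a back-of-the-envelope estimate only, assuming fp32 weights and gradients and an Adam-style optimizer (two fp32 states per parameter); it deliberately ignores activation memory, which grows with the batch size, and the helper name is ours, not from any library:

    def estimate_training_memory_gb(n_params, bytes_per_param=4,
                                    optimizer_bytes_per_param=8):
        """Rough lower bound on training memory in GB, ignoring activations."""
        weights = n_params * bytes_per_param             # fp32 weights
        gradients = n_params * bytes_per_param           # fp32 gradients
        optimizer_states = n_params * optimizer_bytes_per_param  # Adam: two fp32 states
        return (weights + gradients + optimizer_states) / 1024**3

    # Example: a 1-billion-parameter model needs roughly 16 bytes/parameter
    # before activations are counted:
    print(f"{estimate_training_memory_gb(1_000_000_000):.1f} GB")  # ~14.9 GB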

Visit https://blog.eleuther.ai/transformer-math/ for more information on math related to computation and memory usage for transformers.
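To make the mixed-precision point above concrete, here is a minimal PyTorch training-step sketch using torch.cuda.amp; the tiny linear model and random batches are stand-ins for your own model and DataLoader:

    import torch
    import torch.nn as nn

    device = torch.device("cuda")
    model = nn.Linear(1024, 10).to(device)    # stand-in model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    scaler = torch.cuda.amp.GradScaler()      # rescales the loss to avoid fp16 underflow

    for step in range(100):
        inputs = torch.randn(32, 1024, device=device)         # stand-in batch
        targets = torch.randint(0, 10, (32,), device=device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():       # forward pass runs in mixed precision
            loss = criterion(model(inputs), targets)
        scaler.scale(loss).backward()         # backward on the scaled loss
        scaler.step(optimizer)                # unscales gradients, then steps
        scaler.update()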

General Advice

  • If you are learning about DL and doing tutorials, the GPUs in the Interactive partition are probably fine.
  • You can leave the GPU choice as the default on the GPU partition and work with whichever GPU you get, or start with a GPU that has a smaller amount of memory.
    • Train your model for one epoch and monitor the GPU memory usage (the sketch after this list shows one way).
    • Use this information to choose a GPU for the complete training run.
  • You can calculate the size of your DL model (the number of parameters) to estimate the memory needed to store the model; the sketch after this list counts parameters in PyTorch.
  • There is a tool on Hugging Face that can calculate memory needs for a transformers or timm model (using a batch size of 1): https://huggingface.co/spaces/hf-accelerate/model-memory-usage
  • We are working on providing more guidance on how to choose a GPU for DL; updated information will be posted on our website as it becomes available.
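As promised in the list above, here is a minimal PyTorch sketch for checking your assigned GPU, counting parameters, and measuring peak memory after a trial epoch. It assumes the job is already running on a GPU node; the linear model is a placeholder for your own:

    import torch
    import torch.nn as nn

    print(torch.cuda.get_device_name(0))      # which GPU did the job land on?

    model = nn.Linear(1024, 10).to("cuda")    # placeholder; use your own model

    # The parameter count feeds the back-of-the-envelope estimate above.
    n_params = sum(p.numel() for p in model.parameters())
    print(f"Model has {n_params:,} parameters")

    # ... train for one epoch here ...

    # Peak memory allocated by tensors so far, in GB. You can also watch
    # `nvidia-smi` from another terminal while the job runs.
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak GPU memory used: {peak_gb:.2f} GB")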