As of this writing, RC has three models of NVIDIA GPU in the compute cluster:

  • GT 740, 2 GB
  • K20, 4.8 GB
  • Titan, 6 GB

These GPUs are scheduled using the Generic Resource (GRES) feature of Slurm, which ensures that the only jobs that can access a GPU are those scheduled against it. We made a design decision to put all GPUs in a single GRES called "gpu" even though we have two basic sizes. If you simply request a gpu GRES, you may get a small or large card depending on availability.
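
If you want to see which nodes actually expose the gpu GRES before submitting, sinfo can report it. A quick check, assuming your Slurm client is pointed at the RC cluster:

sinfo -o "%N %G"

The %G field lists the generic resources (such as gpu) defined on each node.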

We leverage a second feature of Slurm, called constraints, to restrict which nodes are eligible to service a job request based on features we assign to specific nodes. Some of the current features include: rhel, rhel6, rhel7, intel, amd, cuda, bigcuda, infiniband.

The two features we are interested in for GPU scheduling are cuda and bigcuda. Nodes with small GPUs (GT 740) have a feature of cuda. Nodes with big GPUs (K20, Titan) have a feature of bigcuda. This way, you can constrain your job to an appropriately sized GPU.
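
To see which feature each node advertises, and therefore whether it carries a small or large GPU, you can ask sinfo for the feature list. A minimal check using standard sinfo format fields:

sinfo -o "%N %f"

Nodes reporting the cuda feature have the small cards; nodes reporting bigcuda have the large ones.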

Examples

Run the job file SmallTraining.sh against any GPU available in the cluster:

sbatch --qos=work --gres=gpu SmallTraining.sh

Run the job file LargeTraining.sh against large-memory GPUs:

sbatch --qos=work --gres=gpu --constraint=bigcuda LargeTraining.sh
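
If you prefer to keep the scheduling options inside the job file itself, the same flags can be given as #SBATCH directives. A minimal sketch of what a script like LargeTraining.sh might contain (the module name and training command are placeholders, not taken from this page):

#!/bin/bash
#SBATCH --qos=work
#SBATCH --gres=gpu
#SBATCH --constraint=bigcuda
#SBATCH --job-name=LargeTraining

# Load whatever CUDA environment your code needs (placeholder module name).
module load cuda

# Run the training program (placeholder command).
./train --config large.cfg

With the directives in the script, the job can then be submitted simply as: sbatch LargeTraining.sh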

This wiki page is deprecated. You can find this documentation on our new documentation site: https://research-computing.git-pages.rit.edu/docs/gpu_scheduling.html 
