[Bi-users] Slurm problems (was: Slurm upgrade next week 2018-05-30 13-15 CEST)

Kent Engström kent at nsc.liu.se
Wed May 30 18:22:11 CEST 2018


kent at nsc.liu.se (Kent Engström) writes:
> Dear Bi Users,
>
> We are now ready for the next step in upgrading the Slurm resource
> manager on Bi (from version 16.05 to 17.02).
>
> We have scheduled this for Wednesday 2018-05-30 between 13 and 15 CEST.
> During that window, there will be a short period (~ 15 min) when new
> jobs (including interactive ones) will not start but will remain
> queued until that period ends.
>
> We expect no other inconveniences, as the same operation has already
> been done successfully on the SNIC/LiU clusters Triolith and Gamma.

Alas, we did run into problems with the Bi-specific hyperthreading
configuration, problems that had not been encountered on the systems
where hyperthreading is turned off.

It turns out that just specifying "-N number-of-nodes" (like "-N2") no
longer works as before, where -N2 gave 32 tasks spread over 2 nodes, 16
per node. Instead you get twice that number of tasks (64 tasks, 32 per
node), and the MPI launch fails because resources have not been
allocated for that many tasks.
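
To make the change concrete, here is a minimal sketch of an affected
batch script (the program name is just a placeholder):

    #!/bin/bash
    #SBATCH -N 2
    # Before the upgrade: 32 tasks in total, 16 per node.
    # After the upgrade: srun starts 64 tasks (32 per node) and the
    # MPI launch fails, since only 32 tasks' worth of resources were
    # expected.
    srun ./my_mpi_program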

We will try to restore the old behaviour tomorrow...

... but a quick workaround until that succeeds is to add
"--ntasks-per-node=16" together with your -N flag. One can even argue
that being explicit like that is a good thing, and it will continue to
work.
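
As a sketch, the workaround looks like this in a batch script (again
with a placeholder program name):

    #!/bin/bash
    #SBATCH -N 2
    #SBATCH --ntasks-per-node=16
    # Explicitly asking for 16 tasks per node restores the old layout:
    # 32 tasks in total, 16 on each of the 2 nodes.
    srun ./my_mpi_program

The same flags also work directly on the command line, e.g.
"sbatch -N2 --ntasks-per-node=16 job.sh".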

Also, if it is natural for you to think in terms of the number of tasks
(MPI ranks) instead of nodes, please feel free to use -n to specify the
number of tasks and let Slurm figure out the number of nodes. For
example, "-n 32" works tonight and worked before, and gives you 32 tasks
spread over 2 nodes, 16 per node.
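
A corresponding sketch (placeholder program name as before):

    #!/bin/bash
    #SBATCH -n 32
    # Slurm allocates as many nodes as needed for 32 tasks; on Bi
    # that is 2 nodes with 16 tasks each.
    srun ./my_mpi_program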


Sorry for the inconvenience,

-- 
Kent Engström, National Supercomputer Centre
kent at nsc.liu.se, +46 13 28 4444


