petty
BIAC Staff
USA
453 Posts
Posted - Sep 19 2011 : 11:55:18 AM
As of this morning there have been a couple of changes to the general queues, limits, etc.:
1) qsub submissions are now allowed on all available nodes. This works just like on the head node, and should be easier to interact with, since you can edit scripts and submit them from within interactive jobs.
2) There is now a default hard memory limit of 4GB per job. 4GB is a significant amount, so most users won't have to change anything. However, if a job's memory usage peaks higher than the default, the grid engine will automatically kill the job. This is done because there is a finite amount of RAM installed in each node, and since the nodes are all diskless (there is no swap to fall back on), it's not possible to go over that limit. If all the jobs on a node together exceed the installed RAM, everything on that node crashes. If you find yourself needing more, memory is a requestable resource at submission time:
> qsub -l h_vmem=5G -v EXPERIMENT=Test.01 script.sh
The above example asks for 5GB, so the job will not be dispatched to a node unless 5GB of RAM is available on it and not already assigned to other jobs.
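If you want to double-check the exact submission line before running it, a throwaway wrapper like the one below can help. This is only an illustrative sketch, not a BIAC-provided script; the helper name is invented, while `h_vmem` and `EXPERIMENT` are taken from the example above.

```shell
# Assumed helper (not part of the cluster tooling): build the qsub
# command line for a given memory request and job script, so the full
# command can be inspected or logged before submitting.
build_qsub_cmd() {
    mem="$1"     # memory request, e.g. 5G
    script="$2"  # job script, e.g. script.sh
    echo "qsub -l h_vmem=${mem} -v EXPERIMENT=Test.01 ${script}"
}

build_qsub_cmd 5G script.sh
# emits: qsub -l h_vmem=5G -v EXPERIMENT=Test.01 script.sh
```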
3) The interactive jobs have been moved off the older nodes and spread out, one job per node, across all of the newer nodes. This should also prevent over-allocation of resources, which I've seen a lot of recently on the older hardware.
4) No single user can occupy more than 75% of the slots in the queue. This ensures there are always some slots available for other users. It was done to better share resources, and in anticipation of FSL's parallel processing, which should be opened up relatively soon.
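To see how close you are to that cap, something like the following can tally your current slot usage from `qstat` output. This is a sketch under assumptions: the slots count is the 9th field in the default SGE `qstat` listing, which may be formatted differently on a given install, and the helper name is invented.

```shell
# Assumed helper, not a BIAC tool: sum the "slots" column of
# `qstat -u "$USER"` output. The first two lines of qstat output are
# headers; field 9 is the slot count in the default SGE layout.
count_slots() {
    awk 'NR>2 {sum += $9} END {print sum+0}'
}

# Typical use on the cluster:
#   qstat -u "$USER" | count_slots
```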
Please let me know if there are any questions/concerns.
All of these changes have been documented on the wiki: http://wiki.biac.duke.edu/biac:cluster:submit
Thanks, -chris
dvsmith
Advanced Member
USA
218 Posts
Posted - Sep 20 2011 : 4:39:01 PM
Awesome -- thanks, Chris! I had a quick question about #2: are these memory limit values propagated to the other nodes/processors when fsl_sub splits a single job into multiple parts? I'm mainly wondering about things like the group ICA, where different parts of the job have different demands and flexibility -- e.g., the initial registrations to standard space can be parallelized across multiple nodes (and hence each needs only a small amount of memory), while the PCA stage can't be parallelized (but needs much more memory). Right now, I'd have to submit the whole thing with the maximum memory request (which still might fail), even though only one part of the job needs that much memory -- and not the hundreds of registrations that run before the PCA stage.
petty
BIAC Staff
USA
453 Posts
Posted - Sep 20 2011 : 7:19:13 PM
The memory limit is set per job by the grid engine. So even if you requested 20GB, any job that FSL spawns off gets the 4GB default (unless it also requests more).
I looked through some of their code that calls fsl_sub and didn't see anywhere that they request this type of resource, so it may come down to some trial-and-error values that we have to add if people start hitting the limit on any of the sub-jobs. I spent a few days editing and testing an amended fsl_sub to make it relevant for BIAC and to allow submitting from all nodes (without fsl_sub running automatically), so if it comes to what I just described, it wouldn't actually be that difficult.
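If it does come to that, the kind of change would be small -- something along the lines of letting an environment variable inject a memory request into the qsub options that fsl_sub builds. The sketch below is purely hypothetical: `FSLSUB_MEM` is an invented name for illustration, not an actual fsl_sub option.

```shell
# Hypothetical sketch: build extra qsub options from an invented
# FSLSUB_MEM environment variable, to be appended to the qsub
# invocation inside an amended fsl_sub. Prints nothing if unset.
build_mem_opts() {
    if [ -n "$FSLSUB_MEM" ]; then
        echo "-l h_vmem=${FSLSUB_MEM}"
    fi
}
```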
The fsl_sub jobs won't work just yet: even though the mounts are the same across all the nodes, they are currently just symlinks. We found that everything generating a path follows the symlink to the real mount point, which is not currently static. However, there will be an update soon (on our new mounter) that will resolve that issue.
dvsmith
Advanced Member
USA
218 Posts
Posted - Sep 20 2011 : 8:40:07 PM
Cool, thanks for tweaking their code. With the memory requests and the different stages within a single job (e.g., registrations -> PCA), I'm mainly wondering how the different stages are handled in terms of memory allocation. If, for example, the PCA stage in MELODIC is initiated by the final part of the registration stage, it would get the default memory allocation -- and not the 32 GB that it would need with a large dataset.
Looking forward to seeing fsl_sub (and other, similarly coded, submission mechanisms) in action at BIAC. Thanks a lot for getting this up and going!
petty
BIAC Staff
USA
453 Posts
Posted - Sep 20 2011 : 8:45:43 PM
Well, if the PCA stage isn't one that gets spawned off from the original job as a separate submission, then it keeps running on the node where you made the request for the higher amount of memory.
Anything that leaves the originating node gets the 4GB default; anything that continues to run where it started keeps the higher number you requested.