Carpenter
New Member

16 Posts
Posted - Oct 30 2011 : 4:25:56 PM
Hi everyone,

Over the past several weeks I have had major problems with jobs failing silently mid-way through running. I don't think that this is a problem with my scripts, because they will generally run just fine if I re-submit them. This has happened over a variety of script types (from scripts that run 3rd level FSL analyses to scripts that merely copy folders from one location to another), so I don't think it is a problem of my not requesting enough memory, but maybe I'm wrong?

Just now I ran a set of 3rd level analyses, all of which failed without any indication of error. Even the .out files created by the standard cluster scripts show the job starting, but they never get to the step of echoing "----JOB [$JOB_NAME.$JOB_ID] STOP [`date`]----" and were never moved out of my home directory on the cluster (i.e. out of carpenter@hugin ~ and into my Analysis folder).

Has anyone else experienced this recently? Any ideas what could be causing this and how to fix it?

Many thanks,
Kim
clithero
Junior Member
37 Posts
Posted - Oct 30 2011 : 5:39:29 PM
Hi Kim (and everyone else),
Yes, I have experienced similar phenomena with several different FSL scripts. I have also had some jobs run through completely, but remain hanging on the cluster, despite having completed the FSL analyses.
John
petty
BIAC Staff
USA
453 Posts
Posted - Oct 30 2011 : 5:46:48 PM
Well, there have been some unexplained phenomena happening ... However, if this is directly related to FSL: I installed a beta option into the current FSL install. Apparently this also caused a Feat version bump (which was unexpected), so some people had issues running their older templates after I added the option.

I was unaware of the version change ... So this afternoon around 3:30pm I reinstalled the older version and moved my beta version aside.

If you made a template this week, then you may need to re-make it with the non-beta version of FSL ... Sorry for any issues that may have contributed.

-chris
petty
BIAC Staff
USA
453 Posts
Posted - Oct 31 2011 : 2:06:28 PM
This appears memory related.

    qacct -j 2633222

Three things stick out:

    failed        100 : assumedly after job
    exit_status   137
    maxvmem       4.212G

This means that your job failed, and 137 typically means it was killed (either by you or by SGE). The maxvmem value means that you went over 4G, which is the default assigned by the grid. 4G is a hard limit, meaning if you exceed it, your job gets killed. So piece all of that together, and you need to request more RAM.
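For anyone hitting this later: the diagnosis above can be checked from the numbers alone. A shell exit status above 128 means the process died from a signal (status minus 128 gives the signal number), and 137 decodes to SIGKILL, which is what the grid sends when a job blows past its memory limit. A minimal sketch (the `h_vmem` resource name and the script name are assumptions; check the actual memory complex configured on your cluster with `qconf -sc`):

```shell
# Exit status 137 = 128 + signal number; 137 - 128 = 9, i.e. SIGKILL.
kill -l $((137 - 128))

# Requesting more RAM at submission time might then look like this
# (hypothetical script name; h_vmem assumed to be the memory complex):
# qsub -l h_vmem=8G my_analysis.sh
```

Any status in the 129-165 range can be decoded the same way; 139 (SIGSEGV), for example, points at a crash rather than the scheduler.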