Duke-UNC Brain Imaging and Analysis Center
 Cluster Jobs Failing Silently

Carpenter
New Member

16 Posts

Posted - Oct 30 2011 :  4:25:56 PM
Hi everyone,
Over the past several weeks I have had major problems with jobs failing silently mid-way through running. I don't think that this is a problem with my scripts because they will generally run just fine if I re-submit them. This has happened over a variety of script types (from scripts that run 3rd level FSL analyses to scripts that merely copy folders from one location to another), so I don't think it is a problem of my not requesting enough memory, but maybe I'm wrong?
Just now I ran a set of 3rd level analyses, all of which failed without any indication of error. Even the .out files created by the standard cluster scripts show the job starting, but never get to the step of echoing "----JOB [$JOB_NAME.$JOB_ID] STOP [`date`]----" and were never moved out of my home directory on the cluster (i.e. out of carpenter@hugin ~ and into my Analysis folder).
Has anyone else experienced this recently? Any ideas what could be causing this and how to fix it?
Many thanks,
Kim

clithero
Junior Member

37 Posts

Posted - Oct 30 2011 :  5:39:29 PM
Hi Kim (and everyone else),

Yes, I have experienced similar phenomena with several different FSL scripts. I have also had some jobs run through completely, but remain hanging on the cluster, despite having completed the FSL analyses.

John

petty
BIAC Staff

USA
453 Posts

Posted - Oct 30 2011 :  5:46:48 PM
Well, there have been some unexplained phenomena happening ... However, if this is directly related to FSL: I installed a beta option to the current FSL install. Apparently this also caused a FEAT version bump (which was unexpected), so some people had issues running their older templates after I added the option.

I was unaware of the version change ... So this afternoon around 3:30pm I reinstalled the older version and moved my beta version.

If you made a template this week, then you may need to re-make it with the non-beta version of FSL ... Sorry for any issues this may have contributed to.

-chris

petty
BIAC Staff

USA
453 Posts

Posted - Oct 31 2011 :  2:06:28 PM
This appears to be memory related.

qacct -j 2633222

3 things stick out:

failed 100 : assumedly after job
exit_status 137
maxvmem 4.212G


This means that your job failed, and 137 typically means it was killed (either by you or SGE): 137 is 128 + 9, i.e. the job received signal 9 (SIGKILL). The maxvmem value means that you went over 4G, which is the default assigned by the grid. 4G is a hard limit, meaning that if you exceed it, your job gets killed. Piece all of those together and you need to request more RAM.
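For anyone who wants to run the same check on their own jobs, here is a rough sketch of the reasoning as a shell helper. The function names `decode_exit_status` and `check_job` are made up for illustration, and the script assumes qacct prints simple "key value" lines with maxvmem carrying a K/M/G/T suffix, as in the excerpt above. The 4G default comes from this thread; pass a different number if you requested more.

```shell
#!/bin/sh
# Hypothetical helpers (not part of SGE or the BIAC cluster scripts).

# Decode a shell-style exit status: values above 128 mean the process
# was killed by signal (status - 128), so 137 -> signal 9 (SIGKILL).
decode_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    elif [ "$status" -eq 0 ]; then
        echo "exited normally"
    else
        echo "exited with error code $status"
    fi
}

# Read qacct-style "key value" lines on stdin and compare maxvmem
# against a memory limit in G (first argument, default 4).
check_job() {
    limit_g=${1:-4}
    awk -v limit="$limit_g" '
        $1 == "exit_status" { status = $2 }
        $1 == "maxvmem" {
            raw = $2
            gib = raw + 0                       # numeric prefix, e.g. 4.212 from "4.212G"
            unit = substr(raw, length(raw), 1)  # assumed suffix: K, M, G or T
            if (unit == "K")          gib = gib / (1024 * 1024)
            else if (unit == "M")     gib = gib / 1024
            else if (unit == "T")     gib = gib * 1024
            else if (unit ~ /[0-9]/)  gib = gib / (1024 * 1024 * 1024)  # bare bytes
        }
        END {
            printf "exit_status=%s maxvmem=%.3fG limit=%sG\n", status, gib, limit
            if (gib > limit) print "over memory limit"
            else             print "within memory limit"
        }
    '
}

# Example use on the cluster (job id taken from this thread):
#   qacct -j 2633222 | check_job 4
#   decode_exit_status 137
```

If the check reports "over memory limit", the job needs a larger memory request at submission time. On many SGE installations that is done with something like `qsub -l h_vmem=8G`, but the resource name is site-specific, so treat that flag as an assumption and confirm it against the cluster documentation.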