Duke-UNC Brain Imaging and Analysis Center
 Cluster Jobs Failing Silently

T O P I C    R E V I E W
Carpenter Posted - Oct 30 2011 : 4:25:56 PM
Hi everyone,
Over the past several weeks I have had major problems with jobs failing silently midway through running. I don't think this is a problem with my scripts, because they generally run just fine if I re-submit them. It has happened across a variety of script types (from scripts that run 3rd level FSL analyses to scripts that merely copy folders from one location to another), so I don't think it is a problem of my not requesting enough memory, but maybe I'm wrong?
Just now I ran a set of 3rd level analyses, all of which failed without any indication of error. The .out files created by the standard cluster scripts show the job starting, but they never reach the step of echoing "----JOB [$JOB_NAME.$JOB_ID] STOP [`date`]----", and the output files were never moved out of my home directory on the cluster (i.e. out of carpenter@hugin ~ and into my Analysis folder).
Has anyone else experienced this recently? Any ideas what could be causing this and how to fix it?
Many thanks,
Kim
3   L A T E S T    R E P L I E S    (Newest First)
petty Posted - Oct 31 2011 : 2:06:28 PM
This appears to be memory related.

qacct -j 2633222

3 things stick out:

failed 100 : assumedly after job
exit_status 137
maxvmem 4.212G


This means that your job failed, and exit status 137 typically means it was killed (either by you or by SGE). The maxvmem value shows that you went over 4G, which is the default memory allocation assigned by the grid. 4G is a hard limit, meaning that if you exceed it, your job gets killed. Piece all that together, and you need to request more RAM.
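For reference, the exit status encodes which signal ended the job: a status above 128 is 128 plus the signal number, so 137 corresponds to signal 9 (SIGKILL), which is what the grid engine sends when a job exceeds its memory limit. A quick sketch in the shell:

```shell
# An exit status above 128 means the process died from a signal:
# status = 128 + signal number. For status 137:
sig=$((137 - 128))   # 9
kill -l "$sig"       # prints the signal name: KILL (i.e. SIGKILL)
```

To avoid the kill, request more memory at submission time; on SGE this is typically done with a resource request such as `qsub -l h_vmem=8G yourscript.sh` (the exact resource name can vary with the grid's configuration, so check with the cluster admins which memory resource is enforced).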
petty Posted - Oct 30 2011 : 5:46:48 PM
Well, there have been some unexplained phenomena happening ... However, if this is directly related to FSL: I installed a beta option into the current FSL install. Apparently this also caused a Feat version bump (which was unexpected), so some people had issues running their older templates after I added the option.

I was unaware of the version change, so this afternoon around 3:30pm I reinstalled the older version and moved the beta version aside.

If you made a template this week, you may need to re-make it with the non-beta version of FSL ... Sorry for any issues this may have caused.

-chris
clithero Posted - Oct 30 2011 : 5:39:29 PM
Hi Kim (and everyone else),

Yes, I have experienced similar phenomena with several different FSL scripts. I have also had some jobs run through completely but then remain hanging on the cluster, despite having completed the FSL analyses.

John

BIAC Forums © 2000-2010 Brain Imaging and Analysis Center