Carpenter
New Member

16 Posts
Posted - Oct 30 2011 : 4:25:56 PM
Hi everyone,

Over the past several weeks I have had major problems with jobs failing silently mid-way through running. I don't think that this is a problem with my scripts, because they will generally run just fine if I re-submit them. This has happened over a variety of script types (from scripts that run 3rd level FSL analyses to scripts that merely copy folders from one location to another), so I don't think it is a problem of my not requesting enough memory, but maybe I'm wrong?

Just now I ran a set of 3rd level analyses, all of which failed without any indication of error. Even the .out files created by the standard cluster scripts show the job starting, but they never get to the step of echoing "----JOB [$JOB_NAME.$JOB_ID] STOP [`date`]----" and were never moved out of my home directory on the cluster (i.e. out of carpenter@hugin ~ and into my Analysis folder).

Has anyone else experienced this recently? Any ideas what could be causing this and how to fix it?

Many thanks,
Kim
clithero
Junior Member
37 Posts
Posted - Oct 30 2011 : 5:39:29 PM
Hi Kim (and everyone else),
Yes, I have experienced similar phenomena with several different FSL scripts. I have also had some jobs run through completely, but remain hanging on the cluster, despite having completed the FSL analyses.
John
petty
BIAC Staff
USA
453 Posts
Posted - Oct 30 2011 : 5:46:48 PM
Well, there have been some unexplained phenomena happening ... However, if this is directly related to FSL: I installed a beta option into the current FSL install. Apparently this also caused a Feat version bump (which was unexpected), so some people had issues running their older templates after I added the option.

I was unaware of the version change ... So this afternoon around 3:30pm I reinstalled the older version and moved my beta version aside.

If you made a template this week, then you may need to re-make it with the non-beta version of FSL ... Sorry for any issues that may have contributed.

-chris
petty
BIAC Staff
USA
453 Posts
Posted - Oct 31 2011 : 2:06:28 PM
This appears memory related.

    qacct -j 2633222

Three things stick out:

    failed        100 : assumedly after job
    exit_status   137
    maxvmem       4.212G

This means that your job failed, and 137 typically means it was killed (either by you or by SGE). The maxvmem value means that you went over 4G, which is the default assigned by the grid. 4G is a hard limit, meaning if you exceed it, your job gets killed. So piece all of that together, and you need to request more RAM.
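For anyone hitting this later: the diagnosis above can be checked from the numbers alone. A shell exit status above 128 means the process died from a signal (status minus 128 gives the signal number), and 137 decodes to SIGKILL, which is what the grid sends when a job blows past its memory limit. A minimal sketch (the `h_vmem` resource name and the script name are assumptions; check the actual memory complex configured on your cluster with `qconf -sc`):

```shell
# Exit status 137 = 128 + signal number; 137 - 128 = 9, i.e. SIGKILL.
kill -l $((137 - 128))

# Requesting more RAM at submission time might then look like this
# (hypothetical script name; h_vmem assumed to be the memory complex):
# qsub -l h_vmem=8G my_analysis.sh
```

Any status in the 129-165 range can be decoded the same way; 139 (SIGSEGV), for example, points at a crash rather than the scheduler.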