Duke-UNC Brain Imaging and Analysis Center
 Failing cluster jobs


T O P I C    R E V I E W
mullette-gillman Posted - Nov 05 2008 : 3:13:21 PM
Hi there,

Over the last week I have seen many FSL 4.1 jobs that fail once but run perfectly fine when I simply rerun them. This happens for both prestats and first-level analyses, and the failure rate can be as high as 20% for a given set of jobs (40+). Again, I just delete the failed output and rerun the script and it works fine; I do not believe this is a path error. I have spoken to one other person who has noticed the same thing happening over the last week.

For the first-level jobs that fail, the FSL log tells me it is a memory issue. For prestats jobs, the log gives me one or more of the following errors:
/usr/local/fsl-4.1.0-centos4_64/bin/slicetimer -i prefiltered_func_data_mcf --out=prefiltered_func_data_st -r 2.0 --odd
++ WARNING: nifti_read_buffer(prefiltered_func_data_mcf.nii.gz):
data bytes needed = 557056
data bytes input = 480347
number missing = 76709 (set to 0)
--------

The log files generated and emailed to me show the jobs concluding properly, but using only a maximum of 500 MB of RAM, while correctly completed jobs use between 1 and 1.3 GB. Jobs have failed on at least nodes 3 and 7, so I don't think the error is node-specific. The data is stored on Goldman, btw.

Again, if I just resubmit the job it works fine.

Any thoughts?

Thanks,
O'Dhaniel
8   L A T E S T    R E P L I E S    (Newest First)
petty Posted - Jul 15 2012 : 2:44:06 PM
++ WARNING: nifti_read_buffer(stats/res4d.nii.gz):
data bytes needed = 606208
data bytes input = 457970
number missing = 148238 (set to 0)


That's telling you that 148238 bytes are missing from the data file. According to the header it should contain the number of bytes listed in "data bytes needed", but only "data bytes input" bytes were actually there.

You need to go back to the step where this file was written to determine the write error, and likely re-run that step to get complete data.
rl100 Posted - Jul 15 2012 : 1:50:36 PM
I've also been getting this type of error on a small subset of jobs (~8-10 out of ~600) and on different types of files in the stats folders; two examples are below.

They were all relatively clustered together in terms of when they would have been submitted, so I wonder if it was a memory issue at a specific time?

++ WARNING: nifti_read_buffer(stats/res4d.nii.gz):
data bytes needed = 606208
data bytes input = 457970
number missing = 148238 (set to 0)

++ WARNING: nifti_read_buffer(stats/corrections.nii.gz):
data bytes needed = 606208
data bytes input = 323569
number missing = 282639 (set to 0)
dvsmith Posted - Nov 06 2008 : 3:47:42 PM
since no one else seems to have any ideas, try reporting it to the fsl forums after running fslerrorreport in the failed directory. i suspect they'll just blame it on memory or a corrupt data file.
mullette-gillman Posted - Nov 06 2008 : 2:38:01 PM
In the latest batch, I had one failure in prestats, and this is the error message I received:

/usr/local/fsl-4.1.0-centos4_64/bin/fslmaths prefiltered_func_data_st -Tmean mean_func
++ WARNING: nifti_read_buffer(prefiltered_func_data_st.nii.gz):
data bytes needed = 557056
data bytes input = 93254
number missing = 463802 (set to 0)
mullette-gillman Posted - Nov 06 2008 : 10:47:27 AM
Does anyone else have any ideas?
dvsmith Posted - Nov 05 2008 : 8:57:13 PM
well, this is pretty weird and it's hard to make any conclusions without other error messages. maybe someone else can respond.

i can tell you that whenever i've had failed jobs in the past, there was never any evidence of them even starting: the cluster treated it as a path error on attempt #1 and then worked on attempt #2 (i.e., the cluster didn't mount the experiment correctly the first time).

if i were you, i would run things twice and have the second loop run the job again only if it failed the first time.
mullette-gillman Posted - Nov 05 2008 : 4:18:09 PM
I have already deleted the other failed jobs (so that I could rerun them and maintain my naming formats). For all the prestats jobs that I looked into (the majority) I saw the failure at the same stage.

All 7 of the 48 failures in this latest batch succeeded on the first re-run of their scripts.
dvsmith Posted - Nov 05 2008 : 4:10:35 PM
does it always fail during slice-timing correction (i.e., the slicetimer program)? can you post other error messages? surely this is not the only one you're getting.

maybe updating fsl will help. they've put out one or two patches since 4.1.0 came out back in august.
