Duke-UNC Brain Imaging and Analysis Center
 Failing cluster jobs


T O P I C    R E V I E W
mullette-gillman Posted - Nov 05 2008 : 3:13:21 PM
Hi there,

Over the last week I have seen many FSL 4.1 jobs that fail once but run perfectly fine when I simply rerun them. This happens for both prestats and first-level analyses, and the failure rate can be as high as 20% for a given set of jobs (40+). Again, I just delete the failed output and rerun the script and it works fine; I do not believe this is a path error. I have spoken to one other person who has noticed the same thing happening over the last week.

For the first-level jobs that fail, the FSL log tells me it is a memory issue. For prestats jobs, the log gives me one or more of the following errors:
/usr/local/fsl-4.1.0-centos4_64/bin/slicetimer -i prefiltered_func_data_mcf --out=prefiltered_func_data_st -r 2.0 --odd
++ WARNING: nifti_read_buffer(prefiltered_func_data_mcf.nii.gz):
data bytes needed = 557056
data bytes input = 480347
number missing = 76709 (set to 0)
--------

The log files generated and emailed to me show the jobs concluding properly, but using only a maximum of 500 MB of RAM, while correctly completed jobs use between 1 and 1.3 GB. Jobs have failed on at least nodes 3 and 7, so I don't think the error is node-specific. The data is stored on Goldman, btw.

Again, if I just resubmit the job it works fine.

Any thoughts?

Thanks,
O'Dhaniel
8   L A T E S T    R E P L I E S    (Newest First)
petty Posted - Jul 15 2012 : 2:44:06 PM
++ WARNING: nifti_read_buffer(stats/res4d.nii.gz):
data bytes needed = 606208
data bytes input = 457970
number missing = 148238 (set to 0)


That's telling you that 148238 bytes are missing from the data file. According to the header it should contain the number of bytes listed in "data bytes needed", but only "data bytes input" bytes were actually there.

You need to go back to the step where this file was written to determine the write error, and likely re-run that step to get complete data.
rl100 Posted - Jul 15 2012 : 1:50:36 PM
I've also been getting this type of error on a small subset of jobs (~8-10 out of ~600) and on different types of files in the stats folders; two examples are below.

They were all relatively clustered together in terms of when they would have been submitted, so I wonder if it was a memory issue at a specific time?

++ WARNING: nifti_read_buffer(stats/res4d.nii.gz):
data bytes needed = 606208
data bytes input = 457970
number missing = 148238 (set to 0)

++ WARNING: nifti_read_buffer(stats/corrections.nii.gz):
data bytes needed = 606208
data bytes input = 323569
number missing = 282639 (set to 0)
dvsmith Posted - Nov 06 2008 : 3:47:42 PM
since no one else seems to have any ideas, try reporting it to the fsl forums after running fslerrorreport in the failed directory. i suspect they'll just blame it on memory or a corrupt data file.
mullette-gillman Posted - Nov 06 2008 : 2:38:01 PM
In the latest batch, I had one failure in prestats, and this is the error message I received:

/usr/local/fsl-4.1.0-centos4_64/bin/fslmaths prefiltered_func_data_st -Tmean mean_func
++ WARNING: nifti_read_buffer(prefiltered_func_data_st.nii.gz):
data bytes needed = 557056
data bytes input = 93254
number missing = 463802 (set to 0)
mullette-gillman Posted - Nov 06 2008 : 10:47:27 AM
Does anyone else have any ideas?
dvsmith Posted - Nov 05 2008 : 8:57:13 PM
well, this is pretty weird and it's hard to make any conclusions without other error messages. maybe someone else can respond.

i can tell you that whenever i've had failed jobs in the past, there was never any evidence of them even starting: the cluster treated it as a path error on attempt #1 and then worked on attempt #2 (i.e., the cluster didn't mount the experiment correctly the first time).

if i were you, i would run things twice and have the second loop run the job again only if it failed the first time.
mullette-gillman Posted - Nov 05 2008 : 4:18:09 PM
I have already deleted the other failed jobs (so that I could rerun them and maintain my naming formats). For all the prestats jobs that I looked into (the majority) I saw the failure at the same stage.

All 7 of the 48 failures in this latest batch succeeded on the first re-run of their scripts.
dvsmith Posted - Nov 05 2008 : 4:10:35 PM
does it always fail during slice-timing correction (i.e., the slicetimer program)? can you post other error messages? surely this is not the only one you're getting.

maybe updating fsl will help. they've put out one or two patches since 4.1.0 came out back in august.
