| Author |
Topic  |
|
|
mullette-gillman
Junior Member
 
USA
40 Posts |
Posted - Dec 06 2007 : 1:36:09 PM
|
I have heard that one of the cluster nodes was down at the beginning of this week. I had a number of jobs that inexplicably failed to run during that time, and they are now working. I had wondered whether it might be a cluster problem, but assumed it was my own buggy code.
What exactly happened to the cluster, and how can we tell if this happens again?
Thanks! O |
|
|
josh.bizzell
BIAC Staff
   
USA
118 Posts |
Posted - Dec 06 2007 : 2:03:19 PM
|
Twice this week (Tuesday and Wednesday evenings) node7 rebooted itself. I'm not really sure why this happened, however. At any rate, when the nodes are rebooted, all mounts to the data servers are lost and the mount manager needs to be restarted. This will cause the command "biacmount" to fail, and the email sent to you will have something like "Exit code: 32" at the very end (instead of "Exit code: 0", which means everything worked properly).
We are trying to figure out the best way to handle this in the future, but for the time being, if you get an exit code 32, you should just resubmit the job and check to see that it doesn't go to the "bad" node using the qstat command.
-Josh |
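The check described above can be sketched in Python. This is only an illustration: the thread quotes just the "Exit code: N" line, so the rest of the email layout is an assumption, and the function names are hypothetical.

```python
import re

# Exit code reported when biacmount fails after a node reboot (per the post above).
MOUNT_FAILURE_CODE = 32


def last_exit_code(email_text):
    """Return the last 'Exit code: N' value found in the text, or None if absent."""
    matches = re.findall(r"Exit code:\s*(\d+)", email_text)
    return int(matches[-1]) if matches else None


def should_resubmit(email_text):
    """True when the job email ends with the mount-failure exit code."""
    return last_exit_code(email_text) == MOUNT_FAILURE_CODE


if __name__ == "__main__":
    sample = "Job finished.\nExit code: 32\n"  # made-up email body for illustration
    print(should_resubmit(sample))
```

A job flagged this way would still need to be resubmitted by hand and checked with qstat to confirm it did not land on the rebooted node.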
 |
|
|
mullette-gillman
Junior Member
 
USA
40 Posts |
Posted - Dec 06 2007 : 5:28:14 PM
|
Thanks Josh,
All of my jobs that inexplicably failed did so on Tuesday and Wednesday nights (around 10:30pm), and they all had exit codes of 32. Thanks for explaining what happened, how to recognize a recurrence, and how to respond to it!
-O
|
 |
|
|
tankersley
BIAC Alum
   
USA
143 Posts |
Posted - Jan 07 2008 : 5:46:19 PM
|
What does "Eqw" in the status column of qstat indicate?
Dharol |
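The thread does not answer this, and it does not say which scheduler the cluster runs. Assuming it is Sun Grid Engine (an assumption, based only on the use of qstat and the "Eqw" state), the status column is a string of single-letter flags, and a small Python sketch can spell them out. The function name is hypothetical.

```python
# Per-letter job state flags as commonly documented for Grid Engine's qstat.
# Assumption: the cluster in this thread runs Grid Engine.
STATE_LETTERS = {
    "d": "deletion",
    "E": "error",
    "h": "hold",
    "q": "queued",
    "r": "running",
    "R": "restarted",
    "s": "suspended",
    "S": "suspended (queue)",
    "t": "transferring",
    "T": "threshold",
    "w": "waiting",
}


def decode_state(state):
    """Translate each letter of a qstat state string into a readable word."""
    return [STATE_LETTERS.get(ch, "unknown (" + ch + ")") for ch in state]


if __name__ == "__main__":
    print(decode_state("Eqw"))
```

Under that assumption, "Eqw" would describe a job that is still queued and waiting but has been flagged with an error, and `qstat -j <job id>` would be the place to look for the reason.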
 |
|
|
clithero
Junior Member
 
37 Posts |
Posted - Jan 17 2008 : 11:19:42 AM
|
I am running MELODIC on the cluster and sometimes the ICA finishes all the components, other times it only spits out 2 or 3. Rerunning seems to work sometimes. Could this be a memory issue? |
 |
|
|
francis.favorini
Forum Admin
    
USA
618 Posts |
Posted - Jan 17 2008 : 2:43:41 PM
|
Does MELODIC/ICA generate any useful log file or error code?
|
IT Director, Brain Imaging and Analysis Center |
 |
|
|
dvsmith
Advanced Member
    
USA
218 Posts |
Posted - Jan 17 2008 : 7:15:07 PM
|
I think this might be a problem with melodic and/or memory. I'm getting the error below and I'm also getting "segmentation fault" errors when I run into the problem that John is having.
-David
/bin/rm -rf prefiltered_func_data*
*** glibc detected *** /usr/local/fsl/bin/melodic: double free or corruption (fasttop): 0x0000000003d3d270 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3c89c6ea60]
/lib64/libc.so.6(cfree+0x8c)[0x3c89c7217c]
/usr/local/fsl/bin/melodic[0x511804]
/usr/local/fsl/bin/melodic[0x511a3f]
/usr/local/fsl/bin/melodic[0x511d28]
/usr/local/fsl/bin/melodic[0x513509]
/usr/local/fsl/bin/melodic[0x447f7d]
/usr/local/fsl/bin/melodic[0x44ce97]
/usr/local/fsl/bin/melodic[0x456a3d]
/usr/local/fsl/bin/melodic[0x45b093]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3c89c1da44]
/usr/local/fsl/bin/melodic(__gxx_personality_v0+0xba)[0x404a5a]
======= Memory map: ========
00400000-0063d000 r-xp 00000000 08:01 33915183 /usr/local/fsl-4.0.1-centos4_64/bin/melodic
0073d000-0075d000 rw-p 0023d000 08:01 33915183 /usr/local/fsl-4.0.1-centos4_64/bin/melodic
0075d000-0528e000 rw-p 0075d000 00:00 0 [heap]
3c88c00000-3c88c1a000 r-xp 00000000 08:01 18415962 /lib64/ld-2.5.so
3c88e19000-3c88e1a000 r--p 00019000 08:01 18415962 /lib64/ld-2.5.so
3c88e1a000-3c88e1b000 rw-p 0001a000 08:01 18415962 /lib64/ld-2.5.so
3c89c00000-3c89d44000 r-xp 00000000 08:01 18415963 /lib64/libc-2.5.so
3c89d44000-3c89f44000 ---p 00144000 08:01 18415963 /lib64/libc-2.5.so
3c89f44000-3c89f48000 r--p 00144000 08:01 18415963 /lib64/libc-2.5.so
3c89f48000-3c89f49000 rw-p 00148000 08:01 18415963 /lib64/libc-2.5.so
3c89f49000-3c89f4e000 rw-p 3c89f49000 00:00 0
3c8a000000-3c8a082000 r-xp 00000000 08:01 18415964 /lib64/libm-2.5.so
3c8a082000-3c8a281000 ---p 00082000 08:01 18415964 /lib64/libm-2.5.so
3c8a281000-3c8a282000 r--p 00081000 08:01 18415964 /lib64/libm-2.5.so
3c8a282000-3c8a283000 rw-p 00082000 08:01 18415964 /lib64/libm-2.5.so
3c8e400000-3c8e40d000 r-xp 00000000 08:01 18415968 /lib64/libgcc_s-4.1.1-20061011.so.1
3c8e40d000-3c8e60c000 ---p 0000d000 08:01 18415968 /lib64/libgcc_s-4.1.1-20061011.so.1
3c8e60c000-3c8e60d000 rw-p 0000c000 08:01 18415968 /lib64/libgcc_s-4.1.1-20061011.so.1
3c8ec00000-3c8ece7000 r-xp 00000000 08:01 25533528 /usr/lib64/libstdc++.so.6.0.8
3c8ece7000-3c8eee7000 ---p 000e7000 08:01 25533528 /usr/lib64/libstdc++.so.6.0.8
3c8eee7000-3c8eeed000 r--p 000e7000 08:01 25533528 /usr/lib64/libstdc++.so.6.0.8
3c8eeed000-3c8eef0000 rw-p 000ed000 08:01 25533528 /usr/lib64/libstdc++.so.6.0.8
3c8eef0000-3c8ef02000 rw-p 3c8eef0000 00:00 0
2aaaaaaab000-2aaaaaab0000 rw-p 2aaaaaaab000 00:00 0
2aaaaaaec000-2aaaaab68000 rw-p 2aaaaaaec000 00:00 0
2aaaac000000-2aaaac021000 rw-p 2aaaac000000 00:00 0
2aaaac021000-2aaab0000000 ---p 2aaaac021000 00:00 0
7fff65da3000-7fff65dba000 rw-p 7fff65da3000 00:00 0 [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vdso]
/bin/sh: line 1: 1847 Aborted /usr/local/fsl/bin/melodic -i filtered_func_data -o filtered_func_data.ica -v --nobet --bgthreshold=3 --tr=2.0 --report --guireport=../../report.html -d 0 --mmthresh=0.5 --Ostats
|
 |
|
|
dvsmith
Advanced Member
    
USA
218 Posts |
Posted - Jan 17 2008 : 7:55:04 PM
|
Just to clarify, the segmentation fault is this:
/bin/rm -rf prefiltered_func_data*
/bin/sh: line 1: 3707 Segmentation fault /usr/local/fsl/bin/melodic -i filtered_func_data -o filtered_func_data.ica -v --nobet --bgthreshold=3 --tr=2.0 --report --guireport=../../report.html -d 0 --mmthresh=0.5 --Ostats
Both errors arise when calling rm or sh. Both errors are also completely random (i.e., not reproducible).
Any ideas? I'm emailing the FSL list to see if they can provide any insight. |
 |
|
|
francis.favorini
Forum Admin
    
USA
618 Posts |
Posted - Jan 18 2008 : 4:31:36 PM
|
quote:
Originally posted by dvsmith
*** glibc detected *** /usr/local/fsl/bin/melodic: double free or corruption (fasttop): 0x0000000003d3d270 ***
This looks like a bug in MELODIC.
-FF
|
IT Director, Brain Imaging and Analysis Center |
 |
|
|
dvsmith
Advanced Member
    
USA
218 Posts |
Posted - Jan 18 2008 : 4:39:34 PM
|
Yeah, I agree. They haven't responded to my post yet. Neither error is reproducible; both occur at random.
-David |
 |
|
|
dvsmith
Advanced Member
    
USA
218 Posts |
Posted - Feb 04 2008 : 6:04:18 PM
|
Christian Beckmann's response:
quote: Hi,
Does not look like a straight-forward melodic bug to me - I suspect that either you have some dodgy RAM on your system somewhere. Can you test if the problem always is associated with the same cluster node? Also, you might want to try other memory intensive tools to see if you can get those to crash. Finally, you might want to compile things from scratch.
hth christian
I don't think this is specific to one node, but I'll test that if I re-run MELODIC. I've run FLAME1+2 (fairly memory intensive) without any trouble. I have not used other tools such as FreeSurfer or Bedpost, which I presume to be at least as memory intensive as FEAT and MELODIC. Has anyone had any memory issues with these or other FSL tools? Does everyone agree with Beckmann that this probably isn't a MELODIC bug?
Thanks, David
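One way to run the per-node check Beckmann suggests is to record which node each MELODIC run landed on, along with its exit code, and tally the failures. The sketch below is in Python and assumes such records have been collected by hand (for example from qstat and the job emails). The node names and exit codes in the example are made up.

```python
from collections import Counter


def failures_by_node(runs):
    """Count failed runs per node.

    `runs` is an iterable of (node, exit_code) pairs; any nonzero exit code
    counts as a failure. Returns {node: (failed, total)}.
    """
    totals = Counter()
    failed = Counter()
    for node, exit_code in runs:
        totals[node] += 1
        if exit_code != 0:
            failed[node] += 1
    return {node: (failed[node], totals[node]) for node in totals}


if __name__ == "__main__":
    # Hypothetical records. A shell reports 134 for an abort (128 + SIGABRT)
    # and 139 for a segmentation fault (128 + SIGSEGV).
    runs = [("nodeA", 0), ("nodeB", 134), ("nodeB", 139), ("nodeC", 0)]
    for node, (bad, total) in sorted(failures_by_node(runs).items()):
        print(f"{node}: {bad}/{total} failed")
```

If the failures cluster on one node, that would point toward hardware on that node; if they are spread evenly, a software cause becomes more plausible.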
|
 |
|