Duke-UNC Brain Imaging and Analysis Center
 Cluster issues

mullette-gillman
Junior Member

USA
40 Posts

Posted - Dec 06 2007 :  1:36:09 PM
I have heard that one of the cluster nodes was down at the beginning of this week. A number of my jobs inexplicably failed to run during that time but are now working. I had wondered if it might be a cluster problem, but assumed that it was my own buggy code.

What exactly happened to the cluster, and how can we tell if this happens again?

Thanks!
O

josh.bizzell
BIAC Staff

USA
118 Posts

Posted - Dec 06 2007 :  2:03:19 PM
Twice this week (Tuesday and Wednesday evenings) node7 rebooted itself. I'm not sure yet why this happened. In any case, when a node reboots, all of its mounts to the data servers are lost and the mount manager needs to be restarted. This causes the "biacmount" command to fail, and the email sent to you will end with something like "Exit code: 32" (instead of "Exit code: 0", which means everything worked properly).

We are trying to figure out the best way to handle this in the future, but for the time being, if you get an exit code 32, you should just resubmit the job and use the qstat command to check that it doesn't go to the "bad" node.
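
A minimal sketch of that check in Python, assuming the text of the notification email is available as a string and ends with a line such as "Exit code: 32". The function names are hypothetical and are not part of any BIAC tool; the sketch only reads the exit code and does not resubmit anything itself.

```python
import re

# Exit code reported when "biacmount" fails after a node reboot (as described above).
MOUNT_FAILURE_EXIT_CODE = 32


def parse_exit_code(email_body):
    """Return the last 'Exit code: N' value found in the text, or None if absent."""
    matches = re.findall(r"Exit code:\s*(-?\d+)", email_body)
    return int(matches[-1]) if matches else None


def needs_resubmit(email_body):
    """True when the job ended with the lost-mount exit code."""
    return parse_exit_code(email_body) == MOUNT_FAILURE_EXIT_CODE
```

Jobs for which needs_resubmit returns True are the ones to resubmit by hand and then watch with qstat.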

-Josh

mullette-gillman
Junior Member

USA
40 Posts

Posted - Dec 06 2007 :  5:28:14 PM
Thanks Josh,

All of my jobs that inexplicably didn't work (at the time) were on Tuesday and Wednesday nights (around 10:30pm). They all had exit codes of 32. Thanks for explaining what happened, how to recognize a recurrence, and how to respond to it!

-O


tankersley
BIAC Alum

USA
143 Posts

Posted - Jan 07 2008 :  5:46:19 PM
What does "Eqw" in the status column of qstat indicate?


Dharol

clithero
Junior Member

37 Posts

Posted - Jan 17 2008 :  11:19:42 AM
I am running MELODIC on the cluster. Sometimes the ICA finishes all the components; other times it only spits out 2 or 3. Rerunning seems to work sometimes. Could this be a memory issue?

francis.favorini
Forum Admin

USA
618 Posts

Posted - Jan 17 2008 :  2:43:41 PM
Does MELODIC/ICA generate any useful log file or error code?

IT Director, Brain Imaging and Analysis Center

dvsmith
Advanced Member

USA
218 Posts

Posted - Jan 17 2008 :  7:15:07 PM

I think this might be a problem with melodic and/or memory. I'm getting the error below and I'm also getting "segmentation fault" errors when I run into the problem that John is having.

-David




/bin/rm -rf prefiltered_func_data*
*** glibc detected *** /usr/local/fsl/bin/melodic: double free or corruption (fasttop): 0x0000000003d3d270 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3c89c6ea60]
/lib64/libc.so.6(cfree+0x8c)[0x3c89c7217c]
/usr/local/fsl/bin/melodic[0x511804]
/usr/local/fsl/bin/melodic[0x511a3f]
/usr/local/fsl/bin/melodic[0x511d28]
/usr/local/fsl/bin/melodic[0x513509]
/usr/local/fsl/bin/melodic[0x447f7d]
/usr/local/fsl/bin/melodic[0x44ce97]
/usr/local/fsl/bin/melodic[0x456a3d]
/usr/local/fsl/bin/melodic[0x45b093]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3c89c1da44]
/usr/local/fsl/bin/melodic(__gxx_personality_v0+0xba)[0x404a5a]
======= Memory map: ========
00400000-0063d000 r-xp 00000000 08:01 33915183 /usr/local/fsl-4.0.1-centos4_64/bin/melodic
0073d000-0075d000 rw-p 0023d000 08:01 33915183 /usr/local/fsl-4.0.1-centos4_64/bin/melodic
0075d000-0528e000 rw-p 0075d000 00:00 0 [heap]
3c88c00000-3c88c1a000 r-xp 00000000 08:01 18415962 /lib64/ld-2.5.so
3c88e19000-3c88e1a000 r--p 00019000 08:01 18415962 /lib64/ld-2.5.so
3c88e1a000-3c88e1b000 rw-p 0001a000 08:01 18415962 /lib64/ld-2.5.so
3c89c00000-3c89d44000 r-xp 00000000 08:01 18415963 /lib64/libc-2.5.so
3c89d44000-3c89f44000 ---p 00144000 08:01 18415963 /lib64/libc-2.5.so
3c89f44000-3c89f48000 r--p 00144000 08:01 18415963 /lib64/libc-2.5.so
3c89f48000-3c89f49000 rw-p 00148000 08:01 18415963 /lib64/libc-2.5.so
3c89f49000-3c89f4e000 rw-p 3c89f49000 00:00 0
3c8a000000-3c8a082000 r-xp 00000000 08:01 18415964 /lib64/libm-2.5.so
3c8a082000-3c8a281000 ---p 00082000 08:01 18415964 /lib64/libm-2.5.so
3c8a281000-3c8a282000 r--p 00081000 08:01 18415964 /lib64/libm-2.5.so
3c8a282000-3c8a283000 rw-p 00082000 08:01 18415964 /lib64/libm-2.5.so
3c8e400000-3c8e40d000 r-xp 00000000 08:01 18415968 /lib64/libgcc_s-4.1.1-20061011.so.1
3c8e40d000-3c8e60c000 ---p 0000d000 08:01 18415968 /lib64/libgcc_s-4.1.1-20061011.so.1
3c8e60c000-3c8e60d000 rw-p 0000c000 08:01 18415968 /lib64/libgcc_s-4.1.1-20061011.so.1
3c8ec00000-3c8ece7000 r-xp 00000000 08:01 25533528 /usr/lib64/libstdc++.so.6.0.8
3c8ece7000-3c8eee7000 ---p 000e7000 08:01 25533528 /usr/lib64/libstdc++.so.6.0.8
3c8eee7000-3c8eeed000 r--p 000e7000 08:01 25533528 /usr/lib64/libstdc++.so.6.0.8
3c8eeed000-3c8eef0000 rw-p 000ed000 08:01 25533528 /usr/lib64/libstdc++.so.6.0.8
3c8eef0000-3c8ef02000 rw-p 3c8eef0000 00:00 0
2aaaaaaab000-2aaaaaab0000 rw-p 2aaaaaaab000 00:00 0
2aaaaaaec000-2aaaaab68000 rw-p 2aaaaaaec000 00:00 0
2aaaac000000-2aaaac021000 rw-p 2aaaac000000 00:00 0
2aaaac021000-2aaab0000000 ---p 2aaaac021000 00:00 0
7fff65da3000-7fff65dba000 rw-p 7fff65da3000 00:00 0 [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vdso]
/bin/sh: line 1: 1847 Aborted /usr/local/fsl/bin/melodic -i filtered_func_data -o filtered_func_data.ica -v --nobet --bgthreshold=3 --tr=2.0 --report --guireport=../../report.html -d 0 --mmthresh=0.5 --Ostats

dvsmith
Advanced Member

USA
218 Posts

Posted - Jan 17 2008 :  7:55:04 PM
Just to clarify, the segmentation fault is this:

/bin/rm -rf prefiltered_func_data*
/bin/sh: line 1: 3707 Segmentation fault /usr/local/fsl/bin/melodic -i filtered_func_data -o filtered_func_data.ica -v --nobet --bgthreshold=3 --tr=2.0 --report --guireport=../../report.html -d 0 --mmthresh=0.5 --Ostats

Both errors arise when calling rm or sh. Both errors are also completely random (i.e., not reproducible).

Any ideas? I'm emailing the FSL list to see if they can provide any insight.
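
To keep track of which failure a given run hit, the two signatures above can be picked out of a saved log with a small Python sketch. It assumes the job's output was saved as text; the pattern names and the function are made up for illustration, and the patterns are taken only from the messages quoted in this thread.

```python
import re

# Crash signatures copied from the log excerpts in this thread.
CRASH_PATTERNS = {
    "double_free": re.compile(r"glibc detected.*double free or corruption"),
    "segfault": re.compile(r"Segmentation fault"),
    "aborted": re.compile(r"\bAborted\b"),
}


def classify_crashes(log_text):
    """Return the set of crash signature names that appear in the log text."""
    found = set()
    for line in log_text.splitlines():
        for name, pattern in CRASH_PATTERNS.items():
            if pattern.search(line):
                found.add(name)
    return found
```

An empty set means none of these messages appeared, which by itself does not prove the run produced all of its components.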

francis.favorini
Forum Admin

USA
618 Posts

Posted - Jan 18 2008 :  4:31:36 PM
quote:
Originally posted by dvsmith
*** glibc detected *** /usr/local/fsl/bin/melodic: double free or corruption (fasttop): 0x0000000003d3d270 ***


This looks like a bug in MELODIC.

-FF

IT Director, Brain Imaging and Analysis Center

dvsmith
Advanced Member

USA
218 Posts

Posted - Jan 18 2008 :  4:39:34 PM
Yeah, I agree. They haven't responded to my post yet. Neither error is reproducible; both happen at random.

-David

dvsmith
Advanced Member

USA
218 Posts

Posted - Feb 04 2008 :  6:04:18 PM
Christian Beckmann's response:

quote:
Hi,

Does not look like a straight-forward melodic bug to me - I suspect that either you have some dodgy RAM on your system somewhere. Can you test if the problem always is associated with the same cluster node? Also, you might want to try other memory intensive tools to see if you can get those to crash. Finally, you might want to compile things from scratch.

hth
christian




I don't think this is specific to one node, but I'll test this if I re-run MELODIC. I've run FLAME1+2 (fairly memory intensive) without any trouble. I have not used other tools like FreeSurfer or BEDPOST, which I presume to be at least as memory intensive as FEAT and MELODIC. Has anyone had any memory issues with these tools or other FSL tools? Does everyone agree with Beckmann that this probably isn't a MELODIC bug?
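
One way to run the per-node test Christian suggests is to tally failures by node over a batch of runs. A minimal Python sketch, assuming you have recorded for each run the node it ran on and whether MELODIC finished; the function name is hypothetical and no real results are shown here.

```python
from collections import Counter


def failures_by_node(runs):
    """Tally runs per node.

    runs: iterable of (node_name, succeeded) pairs.
    Returns {node_name: (failure_count, total_runs)}.
    """
    totals = Counter()
    failures = Counter()
    for node, succeeded in runs:
        totals[node] += 1
        if not succeeded:
            failures[node] += 1
    return {node: (failures[node], totals[node]) for node in totals}
```

If the failures pile up on one node, that points toward hardware on that node; if they are spread evenly, a node-specific RAM problem looks less likely.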

Thanks,
David