Duke-UNC Brain Imaging and Analysis Center
 Cluster issues

T O P I C    R E V I E W
mullette-gillman Posted - Dec 06 2007 : 1:36:09 PM
I have heard that one of the cluster nodes was down at the beginning of this week. I had a number of jobs that inexplicably failed to run during that time, and they are now working. I had wondered if it might be a cluster problem, but assumed it was my own buggy code.

What exactly happened to the cluster, and how can we tell if this happens again?

Thanks!
O
10   L A T E S T    R E P L I E S    (Newest First)
dvsmith Posted - Feb 04 2008 : 6:04:18 PM
Christian Beckmann's response:

quote:
Hi,

Does not look like a straight-forward melodic bug to me - I suspect that either you have some dodgy RAM on your system somewhere. Can you test if the problem always is associated with the same cluster node? Also, you might want to try other memory intensive tools to see if you can get those to crash. Finally, you might want to compile things from scratch.

hth
christian




I don't think this is specific to one node, but I'll test that if I re-run MELODIC. I've run FLAME1+2 (fairly memory intensive) without any trouble. I have not used other tools like FreeSurfer or Bedpost, which I presume to be at least as memory intensive as FEAT and MELODIC. Has anyone had any memory issues with these or other FSL tools? Does everyone agree with Beckmann that this probably isn't a MELODIC bug?
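
To check whether the crashes follow a particular node, one option is to record which node each run landed on and tally the failures. This is only a rough sketch in Python: the node names and the (node, succeeded) records are made up for illustration, and you would have to collect them yourself from your job emails or qstat output.

```python
from collections import Counter


def failure_counts_by_node(runs):
    """Tally failures per node.

    runs: iterable of (node_name, succeeded) pairs, collected by hand.
    Returns a dict mapping node_name -> (failures, total_runs).
    """
    totals = Counter()
    failures = Counter()
    for node, succeeded in runs:
        totals[node] += 1
        if not succeeded:
            failures[node] += 1
    # Counter returns 0 for nodes that never failed.
    return {node: (failures[node], totals[node]) for node in totals}


if __name__ == "__main__":
    # Hypothetical records, not real data from the cluster.
    runs = [("node3", True), ("node7", False), ("node7", False), ("node3", False)]
    for node, (failed, total) in sorted(failure_counts_by_node(runs).items()):
        print(f"{node}: {failed} of {total} runs failed")
```

If the failures are spread evenly across nodes, bad RAM on a single machine becomes less likely.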

Thanks,
David
dvsmith Posted - Jan 18 2008 : 4:39:34 PM
Yeah, I agree. They haven't responded to my post yet. Neither error is reproducible, and both happen at random.

-David
francis.favorini Posted - Jan 18 2008 : 4:31:36 PM
quote:
Originally posted by dvsmith
*** glibc detected *** /usr/local/fsl/bin/melodic: double free or corruption (fasttop): 0x0000000003d3d270 ***


This looks like a bug in MELODIC.

-FF
dvsmith Posted - Jan 17 2008 : 7:55:04 PM
Just to clarify, the segmentation fault is this:

/bin/rm -rf prefiltered_func_data*
/bin/sh: line 1: 3707 Segmentation fault /usr/local/fsl/bin/melodic -i filtered_func_data -o filtered_func_data.ica -v --nobet --bgthreshold=3 --tr=2.0 --report --guireport=../../report.html -d 0 --mmthresh=0.5 --Ostats

Both errors arise when calling rm or sh. Both errors are also completely random (i.e., not reproducible).

Any ideas? I'm emailing the FSL list to see if they can provide any insight.
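
For anyone wrapping melodic in a script, the two failures above can be told apart from the exit status alone. The sketch below assumes a POSIX system: Python's subprocess module reports a process killed by signal N as return code -N, while sh conventionally reports it as 128 + N (so a segmentation fault usually shows up as 139). The function name is made up for illustration.

```python
import signal


def describe_status(status):
    """Turn an exit status into a short description.

    Negative values follow the subprocess convention (-N means killed by
    signal N); values above 128 follow the shell convention (128 + N).
    """
    if status < 0:
        signum = -status
    elif status > 128:
        signum = status - 128
    else:
        return f"exited with code {status}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Not a signal number known on this platform.
        name = f"signal {signum}"
    return f"killed by {name}"


if __name__ == "__main__":
    print(describe_status(128 + signal.SIGSEGV))  # shell-style status for a segfault
    print(describe_status(-signal.SIGABRT))       # subprocess-style status for abort()
    print(describe_status(32))                    # an ordinary non-zero exit code
```

The glibc "double free or corruption" message ends in an abort, which is why that run is reported as "Aborted" rather than "Segmentation fault".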
dvsmith Posted - Jan 17 2008 : 7:15:07 PM

I think this might be a problem with melodic and/or memory. I'm getting the error below and I'm also getting "segmentation fault" errors when I run into the problem that John is having.

-David




/bin/rm -rf prefiltered_func_data*
*** glibc detected *** /usr/local/fsl/bin/melodic: double free or corruption (fasttop): 0x0000000003d3d270 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3c89c6ea60]
/lib64/libc.so.6(cfree+0x8c)[0x3c89c7217c]
/usr/local/fsl/bin/melodic[0x511804]
/usr/local/fsl/bin/melodic[0x511a3f]
/usr/local/fsl/bin/melodic[0x511d28]
/usr/local/fsl/bin/melodic[0x513509]
/usr/local/fsl/bin/melodic[0x447f7d]
/usr/local/fsl/bin/melodic[0x44ce97]
/usr/local/fsl/bin/melodic[0x456a3d]
/usr/local/fsl/bin/melodic[0x45b093]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3c89c1da44]
/usr/local/fsl/bin/melodic(__gxx_personality_v0+0xba)[0x404a5a]
======= Memory map: ========
00400000-0063d000 r-xp 00000000 08:01 33915183 /usr/local/fsl-4.0.1-centos4_64/bin/melodic
0073d000-0075d000 rw-p 0023d000 08:01 33915183 /usr/local/fsl-4.0.1-centos4_64/bin/melodic
0075d000-0528e000 rw-p 0075d000 00:00 0 [heap]
3c88c00000-3c88c1a000 r-xp 00000000 08:01 18415962 /lib64/ld-2.5.so
3c88e19000-3c88e1a000 r--p 00019000 08:01 18415962 /lib64/ld-2.5.so
3c88e1a000-3c88e1b000 rw-p 0001a000 08:01 18415962 /lib64/ld-2.5.so
3c89c00000-3c89d44000 r-xp 00000000 08:01 18415963 /lib64/libc-2.5.so
3c89d44000-3c89f44000 ---p 00144000 08:01 18415963 /lib64/libc-2.5.so
3c89f44000-3c89f48000 r--p 00144000 08:01 18415963 /lib64/libc-2.5.so
3c89f48000-3c89f49000 rw-p 00148000 08:01 18415963 /lib64/libc-2.5.so
3c89f49000-3c89f4e000 rw-p 3c89f49000 00:00 0
3c8a000000-3c8a082000 r-xp 00000000 08:01 18415964 /lib64/libm-2.5.so
3c8a082000-3c8a281000 ---p 00082000 08:01 18415964 /lib64/libm-2.5.so
3c8a281000-3c8a282000 r--p 00081000 08:01 18415964 /lib64/libm-2.5.so
3c8a282000-3c8a283000 rw-p 00082000 08:01 18415964 /lib64/libm-2.5.so
3c8e400000-3c8e40d000 r-xp 00000000 08:01 18415968 /lib64/libgcc_s-4.1.1-20061011.so.1
3c8e40d000-3c8e60c000 ---p 0000d000 08:01 18415968 /lib64/libgcc_s-4.1.1-20061011.so.1
3c8e60c000-3c8e60d000 rw-p 0000c000 08:01 18415968 /lib64/libgcc_s-4.1.1-20061011.so.1
3c8ec00000-3c8ece7000 r-xp 00000000 08:01 25533528 /usr/lib64/libstdc++.so.6.0.8
3c8ece7000-3c8eee7000 ---p 000e7000 08:01 25533528 /usr/lib64/libstdc++.so.6.0.8
3c8eee7000-3c8eeed000 r--p 000e7000 08:01 25533528 /usr/lib64/libstdc++.so.6.0.8
3c8eeed000-3c8eef0000 rw-p 000ed000 08:01 25533528 /usr/lib64/libstdc++.so.6.0.8
3c8eef0000-3c8ef02000 rw-p 3c8eef0000 00:00 0
2aaaaaaab000-2aaaaaab0000 rw-p 2aaaaaaab000 00:00 0
2aaaaaaec000-2aaaaab68000 rw-p 2aaaaaaec000 00:00 0
2aaaac000000-2aaaac021000 rw-p 2aaaac000000 00:00 0
2aaaac021000-2aaab0000000 ---p 2aaaac021000 00:00 0
7fff65da3000-7fff65dba000 rw-p 7fff65da3000 00:00 0 [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vdso]
/bin/sh: line 1: 1847 Aborted /usr/local/fsl/bin/melodic -i filtered_func_data -o filtered_func_data.ica -v --nobet --bgthreshold=3 --tr=2.0 --report --guireport=../../report.html -d 0 --mmthresh=0.5 --Ostats
francis.favorini Posted - Jan 17 2008 : 2:43:41 PM
Does MELODIC/ICA generate any useful log file or error code?
clithero Posted - Jan 17 2008 : 11:19:42 AM
I am running MELODIC on the cluster and sometimes the ICA finishes all the components, other times it only spits out 2 or 3. Rerunning seems to work sometimes. Could this be a memory issue?
tankersley Posted - Jan 07 2008 : 5:46:19 PM
What does "Eqw" in the status column of qstat indicate?


Dharol
mullette-gillman Posted - Dec 06 2007 : 5:28:14 PM
Thanks Josh,

All of my jobs that inexplicably didn't work (at the time) ran on Tuesday and Wednesday nights (around 10:30pm). They all had exit codes of 32. Thanks for explaining what happened, how to recognize a recurrence, and how to respond to it!

-O

josh.bizzell Posted - Dec 06 2007 : 2:03:19 PM
Twice this week (Tuesday and Wednesday evenings) node7 rebooted itself. I'm not really sure why this happened, however. At any rate, when the nodes are rebooted, all mounts to the data servers are lost and the mount manager needs to be restarted. This will cause the command "biacmount" to fail, and the email sent to you will have something like "Exit code: 32" at the very end (instead of "Exit code: 0", which means everything worked properly).

We are trying to figure out the best way to handle this in the future, but for the time being, if you get an exit code 32, you should just resubmit the job and check to see that it doesn't go to the "bad" node using the qstat command.
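
If you want to check a batch of job emails at once, something like the following could pick out the ones that need resubmitting. It is only a sketch: it assumes the email text contains a line of the form "Exit code: N" as described above, and the sample text and function names are made up.

```python
import re

# Assumption: the job email contains "Exit code: N" near the end.
EXIT_CODE_RE = re.compile(r"Exit code:\s*(\d+)")

# Exit code seen when "biacmount" fails after a node reboot (see above).
MOUNT_FAILURE_CODE = 32


def last_exit_code(email_text):
    """Return the last exit code mentioned in the text, or None if absent."""
    matches = EXIT_CODE_RE.findall(email_text)
    return int(matches[-1]) if matches else None


def should_resubmit(email_text):
    """True when the job ended with the mount-failure exit code."""
    return last_exit_code(email_text) == MOUNT_FAILURE_CODE


if __name__ == "__main__":
    sample = "Job finished.\nExit code: 32\n"  # made-up sample text
    print(should_resubmit(sample))
```

Any other non-zero code is left alone here, since it more likely points at a problem in the job itself than at lost mounts.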

-Josh
