dvsmith
Advanced Member
USA
218 Posts
Posted - Mar 07 2008 : 12:40:14 PM
Hi,
How much RAM does each node of the cluster have? Is this RAM shared across all the processors on the node? For example, if a node had 2 GB of RAM, would 4 jobs on it each get about 500 MB?
I'm trying to figure out why I've been getting memory errors running MELODIC on the cluster, so any information along these lines would be great.
Thanks, David
syam.gadde
BIAC Staff
USA
421 Posts
Posted - Mar 07 2008 : 12:49:26 PM
I believe they each have 8 GB of memory, shared among the 4 cores.
francis.favorini
Forum Admin
USA
618 Posts
Posted - Mar 07 2008 : 1:54:39 PM
That is correct. Also, unless you are running 64-bit code, you won't be able to address all 8 GB of memory. A 32-bit process can address at most 4 GB, and in practice usually less, due to how the OS lays out the process address space.
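One quick way to check which kind of binary you have is the standard file utility (a sketch; this assumes melodic is on your PATH, and the exact wording of the output varies by system):

file $(which melodic)
# a 64-bit build reports something like: ELF 64-bit LSB executable, x86-64
# a 32-bit build reports something like: ELF 32-bit LSB executable, Intel 80386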
-Francis
IT Director, Brain Imaging and Analysis Center
dvsmith
Advanced Member
USA
218 Posts
Posted - Mar 07 2008 : 3:05:23 PM
So, am I right in saying that 4 jobs running on 4 separate processors/cores would essentially be using 1 GB of RAM each? Sorry for such a basic question, but if I'm running jobs that like to have 2 GB of RAM each, it seems like running 4 of them in parallel on one node could be problematic. Is this completely off base?
Thanks, David
syam.gadde
BIAC Staff
USA
421 Posts
Posted - Mar 07 2008 : 4:12:30 PM
More likely each would comfortably get 1.5 GB or more per process.
Looking through the docs, you might be able to request more memory per process (thereby restricting the number of these processes that can run on a node). I'm not sure exactly how, but there is a reference to something like:
qsub -l mem=4096
to request 4 GB of memory.
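If that syntax doesn't work as written, the resource name depends on the scheduler; a hedged sketch (the job script name here is hypothetical):

qsub -l h_vmem=4G run_melodic.sh    # SGE-style hard virtual-memory limit
qsub -l mem=4gb run_melodic.sh      # PBS/Torque-style memory request

In principle, requesting 4 GB on an 8 GB node should keep the scheduler from placing more than two such jobs per node.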
dvsmith
Advanced Member
USA
218 Posts
Posted - Mar 07 2008 : 4:48:18 PM
Thanks, Syam.
I'll try this out; it should help with some of the issues, I hope. I'll investigate the issues with MELODIC a little further -- I need to make sure it wasn't just always crashing on node4. Christian Beckmann (the MELODIC guy at FMRIB) assured me it was something wrong with our system or a memory problem, but Francis thought it was a MELODIC issue (http://www.biac.duke.edu/forums/topic.asp?TOPIC_ID=1109). I think Francis is right, but I want to definitively rule out the memory explanation before writing to FSL again.
Let me throw one more situation out there regarding node4: what if several people had interactive jobs while regular qsub jobs were still being sent to node4? It seems as though memory issues could arise, since the people with interactive jobs would be consuming a portion of the available RAM. I've noticed lots of jobs crashing on node4, so could this be an explanation? If not, what's an alternative hypothesis for the occasional job failures on node4?
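If the scheduler is SGE-like, one hedged way to see what is sitting on node4 when a job dies (the grep context size is a guess):

qstat -f | grep -A 12 node4

That should list the queue instance for node4 along with the jobs, interactive or batch, currently running there.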
Thanks, David
mcarter
New Member
22 Posts
Posted - Mar 07 2008 : 6:50:26 PM
This seems like it might be even uglier when we are running some of the more memory-intensive DTI apps. Is there any way we can take a look at memory usage, see if we are capping out, and compare those times to failed jobs? The failure rate is getting high enough that some jobs are being run 5 or so times (not repeating what was already completed), so it would free up some computational time if we could get this worked out.
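One rough way to watch for that (a sketch, assuming a standard Linux node; the log path is made up):

while true; do
    ( date; free -m ) >> ~/node4_mem.log    # timestamped snapshot of node memory
    sleep 60
done

Matching the timestamps in that log against the failed jobs' output files would show whether the failures line up with low-memory periods.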
petty
BIAC Staff
USA
453 Posts
Posted - Mar 10 2008 : 12:04:42 PM
Do these failed jobs happen to be writing to Goldman?
Of everything I've run on the cluster (DTI/freesurfer/feat), the only things that have failed were due to a flaky connection to Goldman and a random node6/freesurfer problem.
The DTI stuff is pretty memory-intensive, and I've had no issues.
dvsmith
Advanced Member
USA
218 Posts
Posted - Mar 10 2008 : 6:49:28 PM
They are writing to Goldman; however, I am pretty confident it's not a connection issue. I used to see that all the time, and I would essentially have to do everything twice (I hardly ever see it anymore). These errors are completely different.
I think the potential problem McKell was pointing out concerns situations where 3 DTI jobs are already on a node, leaving one free processor with presumably a lot less memory available. In that situation, would another job be shorted memory if it were sent to the 4th processor on the node? I'm guessing not, if you've been able to have 4 DTI jobs sit on a node without any trouble, but this is something we were wondering about.
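One hedged way to check directly (assuming you can get a shell on the node in question, e.g. via an interactive session):

ps -eo rss,user,comm --sort=-rss | head    # biggest memory users (RSS in KB)
free -m                                    # memory left for an incoming job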
Thanks, David