Duke-UNC Brain Imaging and Analysis Center
 High Ram Jobs

TOPIC REVIEW
petty Posted - Apr 21 2011 : 09:54:52 AM
There have been some instances where users need a very large amount of RAM to run a particular job without causing issues for other jobs that may be running.

The compute nodes have about 30GB of available RAM, with 8 simultaneous jobs allowed. If all the jobs are typical and playing nice, that's sufficient for most tasks.

If you need much more, I've created another queue, which limits the node to 1 job, giving you the full 30GB. Please only use this if you definitely need it. There are only a limited number of slots, so if it's abused your job will just sit in the queue for a long time.

To submit to this queue, add "-q highram.q" to your job submission, i.e.:

qsub -q highram.q -v EXPERIMENT=Something.01 script.sh

All other jobs default to users.q.
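A couple of sketches of how this can look in practice. The `-l h_vmem=...` flag shows up in the qstat output later in this thread; the value here is just an example, and `qstat -g c` is the standard SGE way to list per-queue slot counts (assuming it's enabled on this grid):

```shell
# Submit to the high-RAM queue; the explicit h_vmem request is an
# example value, not a site requirement:
qsub -q highram.q -l h_vmem=30G -v EXPERIMENT=Something.01 script.sh

# Before submitting, check how many highram.q slots are used/available:
qstat -g c | grep highram
```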
15 LATEST REPLIES (Newest First)
dvsmith Posted - Oct 26 2011 : 11:12:30 AM
Yeah, it's high-res data, so there are going to be a lot of individual time series. All of the other files that work with fewer GB of RAM have exactly the same structure.
petty Posted - Oct 26 2011 : 10:03:19 AM
I don't think it has anything to do with that particular file... I think it's more to do with your number of timepoints × number of columns in the design × EVs.

In your case it's:
sizeTS=362
numTS=273884
EV=28

This step also saves the res4d data.
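As a back-of-envelope sketch of why those numbers get big: multiplying them out at 4 bytes per float, and assuming that product dominates the working set (an assumption, not FSL's documented allocation formula):

```shell
# Rough working-set estimate: timepoints x voxel time series x EVs,
# at 4 bytes per float. A guess at the dominant term, not FSL's
# actual allocation formula.
sizeTS=362       # timepoints
numTS=273884     # voxel time series
EV=28            # design columns / EVs
awk -v t="$sizeTS" -v v="$numTS" -v e="$EV" \
    'BEGIN { printf "%.1f GiB\n", t * v * e * 4 / (1024 ^ 3) }'
# → 10.3 GiB
```

That rough figure would line up with jobs failing at 8-12GB and succeeding at 16GB.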
dvsmith Posted - Oct 26 2011 : 01:40:17 AM
I got it to work with 16 GB of RAM... The output from maxvmem was 4.09 GB, so I'm guessing that FSL (or at least some FSL processes) behaves like Matlab...

The corrections.nii.gz file was 1.2 GB.
dvsmith Posted - Oct 25 2011 : 4:55:56 PM
OK, sorry to be the troublemaker here... I've done a little more poking around, and I'm still getting some random failures:
file:///Volumes/Huettel/HighRes.01/Analysis/FSL/1007/TrialByTrial/Smooth_0mm/run2/trial01.feat/report_log.html

Key error:
/usr/local/packages/fsl-4.1.8/bin/contrast_mgr stats design.con
Uncaught exception!

This appears to be related to memory issues according to these posts:
JISCmail - FSL Archive - Re: contrast manager error in FEAT (https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=FSL;c63ccf2a.0702)
JISCmail - FSL Archive - Re: Uncaught exception (https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=FSL;30d0af21.1010)

The corrections.nii.gz file here was 1.2 GB. I requested 8 GB for my first attempt and 12 GB for my second and third attempts (maxvmem ranged between 4.09G and 4.337G). I'll increase to 16 GB and see if that works.

Thanks!
David


dvsmith Posted - Oct 24 2011 : 10:58:16 PM
OK, will do. I'll try to keep better track of memory usage. I know these corrections.nii.gz files can get enormous when the model has a lot of autocorrelations (which I guess is prevalent in the trial-by-trial estimations I'm trying to run). I'll record the size of the file before I delete it on my next iteration of models. Maybe that will give an estimate of how much was actually used...
ls -lh corrections.nii.gz | awk '{print $5}'

Thanks!
petty Posted - Oct 24 2011 : 9:56:42 PM
Well, FSL could be doing something that SGE can't correctly keep track of.

The other times I've seen this, it has been with virtually anything involving Matlab... meaning the amount reported doesn't correctly reflect the amount used. I guess when you hit something like this, just use what has worked previously for a similar data size.
dvsmith Posted - Oct 24 2011 : 6:10:32 PM
It seems to say this for pretty much all of my jobs:
maxvmem 4.090G

This is definitely true of one job that failed (due to memory) when I requested 8 GB but succeeded with 12 GB. I'll keep a closer eye on it. I'm pretty sure I haven't seen a case of maxvmem being greater than 4.09, but I haven't looked exhaustively. That's kind of why I was curious about what's going on...

I'm pretty sure it's accepting my request for additional memory:
[smith@node40 Logs]$ qstat -j 2550323
==============================================================
job_number: 2550323
exec_file: job_scripts/2550323
submission_time: Mon Oct 24 16:03:27 2011
owner: smith
uid: 2132
group: users
gid: 100
sge_o_home: /home/smith
sge_o_log_name: smith
sge_o_path: /usr/kerberos/bin:/opt/gridengine/bin/lx24-amd64:/usr/local/bin:/bin:/usr/bin:/home/smith/bin
sge_o_shell: /bin/bash
sge_o_workdir: /home/smith/HighRes.01/GLM_stats/4713
sge_o_host: hugin
execution_time: Sat Sep 24 16:03:59 2011
account: sge
stderr_path_list: NONE:NONE:$HOME/$JOB_NAME.$JOB_ID.out
hard resource_list: h_vmem=12G
mail_list: david.v.smith@duke.edu
notify: FALSE
job_name: HiRes_6trial_52_1009_01_3.job
stdout_path_list: NONE:NONE:$HOME/$JOB_NAME.$JOB_ID.out
jobshare: 0
hard_queue_list: users.q
shell_list: NONE:/bin/sh
env_list: EXPERIMENT=HighRes.01
script_file: HiRes_6trial_52_1009_01_3.job
usage 1: cpu=02:04:23, mem=21551.85410 GBs, io=10.03784, vmem=2.912G, maxvmem=4.081G
scheduling info: queue instance "interact.q@node52.local" dropped because it is full
queue instance "interact.q@node17.local" dropped because it is full
queue instance "interact.q@node3.local" dropped because it is full
queue instance "interact.q@node4.local" dropped because it is full
queue instance "interact.q@node42.local" dropped because it is full
queue instance "interact.q@node5.local" dropped because it is full
queue instance "interact.q@node15.local" dropped because it is full
queue instance "interact.q@node31.local" dropped because it is full
queue instance "interact.q@node8.local" dropped because it is full
queue instance "interact.q@node44.local" dropped because it is full
queue instance "interact.q@node6.local" dropped because it is full
queue instance "interact.q@node28.local" dropped because it is full
queue instance "interact.q@node13.local" dropped because it is full
queue instance "interact.q@node50.local" dropped because it is full
queue instance "interact.q@node37.local" dropped because it is full
queue instance "interact.q@node41.local" dropped because it is full
queue instance "interact.q@node51.local" dropped because it is full
queue instance "interact.q@node40.local" dropped because it is full

petty Posted - Oct 24 2011 : 5:22:22 PM
All denseness aside...

maxvmem is the amount the job actually used, whether it was successful or not. There's no way for the grid to know how much you would have used... it only monitors current use. If you hit your designated threshold, the job is killed.



syam.gadde Posted - Oct 24 2011 : 4:40:37 PM
What I was saying (unclearly) was that if SGE is setting a system limit for the process, and it failed because it tried to use more than the prescribed amount of memory, SGE could possibly be correctly reporting the memory used by the process before it made the memory allocation call that failed.

So no, for a process that failed because it ran out of memory, the maxvmem value would not be an accurate reflection of the amount of memory your process would have used if it had completed successfully.

Do you have maxvmem values from the jobs that did succeed (i.e. where you specified a larger amount of memory, say 12GB)? I'd be interested to know whether that value was between 8GB and 12GB.
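For jobs that have already finished (whether they succeeded or died), SGE's accounting log keeps the peak figures, so something like this should pull them out. The job number is taken from the qstat listing earlier in the thread, and exact field names can vary by SGE version:

```shell
# Peak memory and exit status for a finished job, from SGE accounting:
qacct -j 2550323 | grep -E 'maxvmem|exit_status|failed'
```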
dvsmith Posted - Oct 24 2011 : 4:31:00 PM
Sorry for being dense -- does that mean what gets reported in the maxvmem field is a reflection of the max memory used by *completed* (and hence successful, exit_status=0) processes? When my jobs fail because of memory, they say they ran out of memory, and I don't see any indication that they were killed by the grid.

Unfortunately, I've been deleting the corrections.nii.gz files as they get created (but after they have served their purpose, of course). So, I can't tell how big they really are. I know that they can blow up to several times the size of the data, so that's why I get rid of them.
syam.gadde Posted - Oct 24 2011 : 07:40:39 AM
I don't know how SGE is enforcing the memory limits, but it is possible that it sets them as a limited system resource (using limit/ulimit), in which case the allocation of the large chunk of memory itself would fail, and SGE would never know what you actually asked for. In that case the maxvmem field would (correctly) be the maximum amount of memory used by the process, but the process failed before the oversized allocation ever returned.
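A small sketch of that behaviour on an ordinary Linux shell, assuming SGE uses something like `ulimit -v` (which caps virtual memory, in KiB, for the subshell only): the oversized buffer allocation fails inside the cap, so the process dies without ever "using" the excess memory.

```shell
# Cap the subshell at ~50 MB of virtual memory, then ask dd to
# allocate a 100 MB buffer; the allocation fails under the cap,
# so peak usage never reflects the amount that was requested.
( ulimit -v 50000
  dd if=/dev/zero of=/dev/null bs=100M count=1 2>/dev/null \
      && echo "allocation succeeded" \
      || echo "allocation failed under the cap" )
```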
dvsmith Posted - Oct 23 2011 : 8:46:26 PM
It was an FSL job, and it specifically failed when it was trying to save a big file. It said FILM did not complete due to memory. I gave it even more memory and it worked. Does that mean that my request for more was read incorrectly? Or is FILM/FSL not communicating with SGE to indicate that a job should be killed immediately?

Thanks!
petty Posted - Oct 23 2011 : 8:13:49 PM
No, because it only reports what is used. If you hit the limit you requested, SGE immediately kills the job.
dvsmith Posted - Oct 23 2011 : 7:31:04 PM
These aren't Matlab jobs. Does it correctly report the RAM usage of a failed process? In other words, if I have a process that requires 16 GB of RAM and I only give it 12 GB, will it still report 16 GB in the maxvmem field even though the process was unsuccessful?
petty Posted - Oct 23 2011 : 6:12:05 PM
That maxvmem is supposed to be the actual amount used... if it's peaking just above 4 and being killed, it would make me think your request for more wasn't correct.

However, I've found that Matlab jobs don't correctly report their RAM usage.

BIAC Forums © 2000-2010 Brain Imaging and Analysis Center