Duke-UNC Brain Imaging and Analysis Center
 High Ram Jobs

petty (BIAC Staff)
Posted - Apr 21 2011 : 09:54:52 AM
There have been some instances where users need a very large amount of RAM to run a particular job without causing issues for other jobs that may be running.

The compute nodes have about 30GB of available RAM, with 8 simultaneous jobs allowed per node. If all the jobs are typical and playing nice, that's sufficient for most tasks.

If you need much more, I've created another queue that limits the node to 1 job, giving you the full 30GB. Please only use this if you definitely need it. There are only a limited number of slots, so if it's abused your job will just sit in the queue for a long time.

To submit to this queue, add "-q highram.q" to your job submission, e.g.:

qsub -q highram.q -v EXPERIMENT=Something.01 script.sh

All other jobs default to users.q.

Edited by - petty on Apr 21 2011 09:55:49 AM

petty (BIAC Staff)
Posted - Sep 19 2011 : 11:40:54 AM
The highram.q directive is no longer valid.

I've made RAM a consumable resource, which allows much more efficient use of the nodes, since there won't be 4 nodes sitting idle most of the time.

If you believe your job needs a larger amount of RAM (4GB is our default), you can request it with the "-l h_vmem" directive. This will send your job only to a node with the requested amount of resources available, and will prevent you/others from going over the limits.

> qsub -l h_vmem=30G -v EXPERIMENT=Something.01 script.sh

The above example is equivalent to sending something to the previous highram.q. Please do not request additional resources unless it's necessary.
dvsmith (Advanced Member)
Posted - Oct 23 2011 : 4:18:09 PM
Is there any way to determine how much RAM was needed (not just used) for a particular job? I assume maxvmem is what actually got used by any completed (but not failed) process? Anyway, my maxvmem is peaking at 4.09 GB (I believe I requested 8 GB for this particular job), and it's crashing right around the point FILM outputs the corrections.nii.gz file, which I assume must be way bigger than 8 GB. I increased this to 12 GB and it seems to work for the most part now, but I don't want to request more than necessary.

[smith@hugin nperlabel_equal]$ qacct -j 2413437
==============================================================
qname users.q
hostname node11
group users
owner smith
project NONE
department defaultdepartment
jobname HiRes_6trial_31_1007_04_1.job
jobnumber 2413437
taskid undefined
account sge
priority -10
qsub_time Fri Oct 14 21:59:40 2011
start_time Fri Oct 14 21:59:42 2011
end_time Fri Oct 14 23:48:03 2011
granted_pe NONE
slots 1
failed 0
exit_status 0
ru_wallclock 6501
ru_utime 6430.700
ru_stime 14.814
ru_maxrss 6945840
ru_ixrss 0
ru_ismrss 0
ru_idrss 0
ru_isrss 0
ru_minflt 3275339
ru_majflt 0
ru_nswap 0
ru_inblock 3322244
ru_oublock 2538592
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 122426
ru_nivcsw 8166
cpu 6445.514
mem 16654.524
io 12.751
iow 0.000
maxvmem 4.090G
arid undefined
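To compare a job's observed peak against a planned -l h_vmem request, the maxvmem line can be pulled out of saved qacct output and converted to bytes. A minimal sketch (assumes SGE's usual K/M/G suffix convention; the function name is made up):

```shell
# parse_maxvmem FILE -- print maxvmem from saved `qacct -j <jobid>` output,
# converted to bytes (assumes SGE's K/M/G suffixes)
parse_maxvmem() {
    awk '/^maxvmem/ {
        v = $2
        unit = substr(v, length(v), 1)           # trailing K, M, or G, if any
        mult = 1
        if (unit == "K")      mult = 1024
        else if (unit == "M") mult = 1024 ^ 2
        else if (unit == "G") mult = 1024 ^ 3
        if (mult > 1) v = substr(v, 1, length(v) - 1)
        printf "%.0f\n", v * mult
    }' "$1"
}

# e.g.: qacct -j 2413437 > job.txt && parse_maxvmem job.txt
```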
petty (BIAC Staff)
Posted - Oct 23 2011 : 6:12:05 PM
That maxvmem is supposed to be the actual amount used... if it's peaking just above 4 GB and being killed, that would make me think your request for more wasn't correct.

However, I've found that Matlab jobs don't correctly report their RAM usage.
dvsmith (Advanced Member)
Posted - Oct 23 2011 : 7:31:04 PM
These aren't Matlab jobs. Does it correctly report the RAM usage of a failed process? In other words, if I have a process that requires 16 GB of RAM and I only give it 12 GB, will it still report 16 GB in the maxvmem field even though the process was unsuccessful?
petty (BIAC Staff)
Posted - Oct 23 2011 : 8:13:49 PM
No, because it only reports what was used. If you hit the limit you requested, then SGE immediately kills the job.
dvsmith (Advanced Member)
Posted - Oct 23 2011 : 8:46:26 PM
It was an FSL job, and it specifically failed when it was trying to save a big file. It said FILM did not complete due to memory. I gave it even more memory and it worked. Does that mean that my request for more was read incorrectly? Or is FILM/FSL not communicating with SGE to indicate that a job should be killed immediately?

Thanks!
syam.gadde (BIAC Staff)
Posted - Oct 24 2011 : 07:40:39 AM
I don't know how SGE enforces the memory limits, but it's possible that it sets them as a limited system resource (using limit/ulimit), in which case the allocation of the large chunk of memory itself would fail, and SGE would never know how much you actually asked for. In that case the maxvmem field would (correctly) be the maximum amount of memory the process used before the excessive memory allocation failed.
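That ulimit behavior can be reproduced outside SGE. A sketch (assumes a Linux shell with python3 on the PATH as a stand-in allocator): under a virtual-memory cap, the oversized allocation fails inside the process, so an accounting tool would only ever see the memory used up to that point, not the amount requested.

```shell
# Cap the subshell's virtual memory at 1 GiB (ulimit -v takes KiB),
# then try to allocate ~2 GiB; the allocation itself fails, so the
# process never actually holds the memory it asked for.
(
    ulimit -v $((1024 * 1024))
    python3 -c 'bytearray(2 * 1024 ** 3)' 2>/dev/null
)
echo "exit status under the cap: $?"    # non-zero: the allocation failed
```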
dvsmith (Advanced Member)
Posted - Oct 24 2011 : 4:31:00 PM
Sorry for being dense -- does that mean what gets reported in the maxvmem field is only a reflection of the max memory used by *completed* (and hence successful, exit_status=0) processes? When my jobs fail because of memory, they say they ran out of memory, and I don't see any indication that they were killed by the grid.

Unfortunately, I've been deleting the corrections.nii.gz files as they get created (but after they have served their purpose, of course), so I can't tell how big they really are. I know that they can blow up to several times the size of the data, which is why I get rid of them.
syam.gadde (BIAC Staff)
Posted - Oct 24 2011 : 4:40:37 PM
What I was saying (unclearly) was that if SGE sets a system limit for the process, and the process failed because it tried to use more than the prescribed amount of memory, SGE could well be correctly reporting the memory the process used before it made the memory allocation call that failed.

So no, for a process that failed because it ran out of memory, the maxvmem value would not be an accurate reflection of the amount of memory your process would have used if it had completed successfully.

Do you have maxvmem values from the jobs that did succeed (i.e., where you specified a larger amount of memory, say 12 GB)? I'd be interested to know if that value was between 8 GB and 12 GB.
petty (BIAC Staff)
Posted - Oct 24 2011 : 5:22:22 PM
All denseness aside...

The maxvmem is the amount the job actually used, whether it was successful or not. There's no way for the grid to know how much you would've used; it only monitors current use. If you hit your designated threshold, the job is killed.



dvsmith (Advanced Member)
Posted - Oct 24 2011 : 6:10:32 PM
It seems to say this for pretty much all of my jobs:
maxvmem 4.090G

This is definitely true of one job that failed (due to memory) when I requested 8 GB but succeeded with 12 GB. I'll keep a closer eye on it. I'm pretty sure I haven't seen a case of maxvmem being greater than 4.09 GB, but I haven't looked exhaustively. That's kind of why I was curious about what's going on...

I'm pretty sure it's accepting my request for additional memory:
[smith@node40 Logs]$ qstat -j 2550323
==============================================================
job_number: 2550323
exec_file: job_scripts/2550323
submission_time: Mon Oct 24 16:03:27 2011
owner: smith
uid: 2132
group: users
gid: 100
sge_o_home: /home/smith
sge_o_log_name: smith
sge_o_path: /usr/kerberos/bin:/opt/gridengine/bin/lx24-amd64:/usr/local/bin:/bin:/usr/bin:/home/smith/bin
sge_o_shell: /bin/bash
sge_o_workdir: /home/smith/HighRes.01/GLM_stats/4713
sge_o_host: hugin
execution_time: Sat Sep 24 16:03:59 2011
account: sge
stderr_path_list: NONE:NONE:$HOME/$JOB_NAME.$JOB_ID.out
hard resource_list: h_vmem=12G
mail_list: david.v.smith@duke.edu
notify: FALSE
job_name: HiRes_6trial_52_1009_01_3.job
stdout_path_list: NONE:NONE:$HOME/$JOB_NAME.$JOB_ID.out
jobshare: 0
hard_queue_list: users.q
shell_list: NONE:/bin/sh
env_list: EXPERIMENT=HighRes.01
script_file: HiRes_6trial_52_1009_01_3.job
usage 1: cpu=02:04:23, mem=21551.85410 GBs, io=10.03784, vmem=2.912G, maxvmem=4.081G
scheduling info: queue instance "interact.q@node52.local" dropped because it is full
queue instance "interact.q@node17.local" dropped because it is full
queue instance "interact.q@node3.local" dropped because it is full
queue instance "interact.q@node4.local" dropped because it is full
queue instance "interact.q@node42.local" dropped because it is full
queue instance "interact.q@node5.local" dropped because it is full
queue instance "interact.q@node15.local" dropped because it is full
queue instance "interact.q@node31.local" dropped because it is full
queue instance "interact.q@node8.local" dropped because it is full
queue instance "interact.q@node44.local" dropped because it is full
queue instance "interact.q@node6.local" dropped because it is full
queue instance "interact.q@node28.local" dropped because it is full
queue instance "interact.q@node13.local" dropped because it is full
queue instance "interact.q@node50.local" dropped because it is full
queue instance "interact.q@node37.local" dropped because it is full
queue instance "interact.q@node41.local" dropped because it is full
queue instance "interact.q@node51.local" dropped because it is full
queue instance "interact.q@node40.local" dropped because it is full

petty (BIAC Staff)
Posted - Oct 24 2011 : 9:56:42 PM
Well, FSL could be doing something that SGE can't correctly keep track of.

The other times I've seen this involve virtually anything with Matlab, meaning the amount reported doesn't correctly reflect the amount used. I guess when you hit something like this, just request what has worked previously for a similar data size.
dvsmith (Advanced Member)
Posted - Oct 24 2011 : 10:58:16 PM
OK, will do. I'll try to keep better track of memory usage. I know these corrections.nii.gz files can get enormous when the model has a lot of autocorrelations (which I guess is prevalent in the trial-to-trial estimations I'm trying to run). I'll just record the size of the file before I delete it on my next iteration of models. Maybe that will give an estimate of how much was actually used...
ls -lh corrections.nii.gz | awk '{print $5}'

Thanks!
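One way to make that bookkeeping automatic: a small helper that appends the file's size to a log before deleting it. A sketch (the helper and log-file names are made up; assumes GNU stat, as on the cluster nodes):

```shell
# log_size_and_remove FILE -- append FILE's size in bytes to a log,
# then delete it (helper and log names are hypothetical)
log_size_and_remove() {
    f="$1"
    [ -f "$f" ] || return 1
    printf '%s\t%s bytes\n' "$f" "$(stat -c %s "$f")" >> corrections_sizes.log
    rm -f "$f"
}

# e.g.: log_size_and_remove trial01.feat/stats/corrections.nii.gz
```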
dvsmith (Advanced Member)
Posted - Oct 25 2011 : 4:55:56 PM
OK, sorry to be the troublemaker here... I've done a little more poking around, and I'm still getting some random failures:
file:///Volumes/Huettel/HighRes.01/Analysis/FSL/1007/TrialByTrial/Smooth_0mm/run2/trial01.feat/report_log.html

Key error:
/usr/local/packages/fsl-4.1.8/bin/contrast_mgr stats design.con
Uncaught exception!

This appears to be related to memory issues according to these posts:
JISCmail - FSL Archive - Re: contrast manager error in FEAT (https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=FSL;c63ccf2a.0702)
JISCmail - FSL Archive - Re: Uncaught exception (https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=FSL;30d0af21.1010)

The file size for the corrections.nii.gz file here was 1.2 GB. I requested 8 GB for my first attempt and 12 GB for my second and third attempts (the maxvmem ranged between 4.09 GB and 4.337 GB). I'll increase to 16 GB and see if that works.

Thanks!
David


dvsmith (Advanced Member)
Posted - Oct 26 2011 : 01:40:17 AM
I got it to work with 16 GB of RAM... The reported maxvmem was 4.09 GB, so I'm guessing FSL (or at least some FSL processes) behaves like Matlab...

The corrections.nii.gz file was 1.2 GB.
BIAC Forums © 2000-2010 Brain Imaging and Analysis Center