petty
BIAC Staff
USA
453 Posts
Posted - Apr 21 2011 : 09:54:52 AM
There have been some instances where users need a very large amount of RAM to run a particular job without causing issues for other jobs that may be running.
The compute nodes have about 30GB of available RAM with 8 simultaneous jobs allowed. If all the jobs are typical and playing nice, that's sufficient for most tasks.
If you need much more, I've created another queue that limits the node to 1 job, therefore giving you the full 30GB. Please only use this if you definitely need it. There are only a limited number of slots, so if it's abused your job will just sit in the queue for a long time.
To submit to this queue, add "-q highram.q" to your job submission,
i.e.: qsub -q highram.q -v EXPERIMENT=Something.01 script.sh
All other jobs default to users.q.
Edited by - petty on Apr 21 2011 09:55:49 AM
petty
BIAC Staff
USA
453 Posts
Posted - Sep 19 2011 : 11:40:54 AM
The highram.q directive is no longer valid.
I've made RAM a consumable resource, which allows much more efficient usage of the nodes, since there won't be 4 nodes sitting idle most of the time.
If you believe your job needs a larger amount of RAM (4GB is our default), then you can request it with the "-l h_vmem" directive. This will only send your job to a node with the requested amount of resources and will prevent you/others from going over the limits.
> qsub -l h_vmem=30G -v EXPERIMENT=Something.01 script.sh
The above example would be equivalent to sending something to the previous highram.q. Please do not request additional resources unless it's necessary.
dvsmith
Advanced Member
USA
218 Posts
Posted - Oct 23 2011 : 4:18:09 PM
Is there any way to determine how much RAM was needed (not used) for a particular job? I assume maxvmem is what actually got used by any completed (but not failed) process? Anyway, my maxvmem is peaking at 4.09GB (I believe I requested 8 GBs for this particular job), and it's crashing right around the point where FILM outputs the corrections.nii.gz file, which I assume must be way bigger than 8 GBs. I increased the request to 12 GBs and it seems to work for the most part now, but I don't want to request more than necessary.
[smith@hugin nperlabel_equal]$ qacct -j 2413437
==============================================================
qname        users.q
hostname     node11
group        users
owner        smith
project      NONE
department   defaultdepartment
jobname      HiRes_6trial_31_1007_04_1.job
jobnumber    2413437
taskid       undefined
account      sge
priority     -10
qsub_time    Fri Oct 14 21:59:40 2011
start_time   Fri Oct 14 21:59:42 2011
end_time     Fri Oct 14 23:48:03 2011
granted_pe   NONE
slots        1
failed       0
exit_status  0
ru_wallclock 6501
ru_utime     6430.700
ru_stime     14.814
ru_maxrss    6945840
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    3275339
ru_majflt    0
ru_nswap     0
ru_inblock   3322244
ru_oublock   2538592
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     122426
ru_nivcsw    8166
cpu          6445.514
mem          16654.524
io           12.751
iow          0.000
maxvmem      4.090G
arid         undefined
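If you just want the peak-memory figure, you can pull it out of the accounting record with awk. A minimal sketch, using sample lines copied from the record above (in practice you would pipe `qacct -j <jobid>` straight into awk instead of a saved file):

```shell
# Save a copy of the accounting output (here, two sample lines from above)
cat > job.acct <<'EOF'
exit_status  0
maxvmem      4.090G
EOF

# Print just the peak virtual memory the job used
awk '/^maxvmem/ {print $2}' job.acct   # prints 4.090G
```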
petty
BIAC Staff
USA
453 Posts
Posted - Oct 23 2011 : 6:12:05 PM
That maxvmem is supposed to be the actual amount used... if it's peaking just above 4GB and being killed, that makes me think your request for more wasn't applied correctly.
However, I've found that Matlab jobs don't correctly report their RAM usage.
dvsmith
Advanced Member
USA
218 Posts
Posted - Oct 23 2011 : 7:31:04 PM
These aren't Matlab jobs. Does it correctly report the RAM usage of a failed process? In other words, if I have a process that requires 16 GBs of RAM, and I only give it 12 GBs, will it still report 16 GBs in the maxvmem field even though the process was unsuccessful?
petty
BIAC Staff
USA
453 Posts
Posted - Oct 23 2011 : 8:13:49 PM
No, because it only reports what is used. If you hit the limit you requested, then SGE immediately kills the job.
dvsmith
Advanced Member
USA
218 Posts
Posted - Oct 23 2011 : 8:46:26 PM
It was an FSL job, and it specifically failed when it was trying to save a big file. It said FILM did not complete due to memory. I gave it even more memory and it worked. Does that mean my request for more was read incorrectly? Or is FILM/FSL not communicating with SGE to indicate that a job should be killed immediately?
Thanks!
syam.gadde
BIAC Staff
USA
421 Posts
Posted - Oct 24 2011 : 07:40:39 AM
I don't know how SGE is enforcing the memory limits, but it is possible that it sets the limit as a restricted system resource (using limit/ulimit), in which case the allocation of the large chunk of memory itself would fail, and SGE would never know how much you actually asked for. In that case the maxvmem field would be (correctly) the maximum amount of memory used by the process, since it was killed before the excessive memory allocation ever returned.
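The ulimit mechanism described above can be sketched in a subshell. This is just an illustration of the mechanism, not what SGE itself runs, and the specific sizes are arbitrary: the string-doubling awk program is a stand-in for any process making a large allocation.

```shell
# Cap the subshell's virtual memory, then try to grow a string past the cap.
# The allocation fails *inside* the process, so any outside accounting of
# peak memory only ever sees what was used before the failed request.
(
  ulimit -v 204800   # ~200MB virtual memory cap (units are KB)
  awk 'BEGIN { s = "x"; while (length(s) < 500000000) s = s s }' 2>/dev/null
)
if [ $? -ne 0 ]; then
  echo "allocation failed under the limit"
fi
```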
dvsmith
Advanced Member
USA
218 Posts
Posted - Oct 24 2011 : 4:31:00 PM
Sorry for being dense -- does that mean what gets reported in the maxvmem field is a reflection of the max memory used by *completed* (and hence successful, exit_status=0) processes? When my jobs are failing because of memory, they say they ran out of memory, and I don't see any indication that they were killed by the grid.
Unfortunately, I've been deleting the corrections.nii.gz files as they get created (but after they have served their purpose, of course). So, I can't tell how big they really are. I know that they can blow up to several times the size of the data, so that's why I get rid of them.
syam.gadde
BIAC Staff
USA
421 Posts
Posted - Oct 24 2011 : 4:40:37 PM
What I was saying (unclearly) was that if SGE is setting a system limit for the process, and the process failed because it tried to use more than the prescribed amount of memory, SGE could well be correctly reporting the memory used by the process before it made the memory allocation call that failed.
So no, for a process that failed because it ran out of memory, the maxvmem value would not be an accurate reflection of the amount of memory your process would have used if it had completed successfully.
Do you have maxvmem values from the jobs that did succeed (i.e., where you specified a larger amount of memory, say 12GB)? I'd be interested to know whether that value was between 8GB and 12GB.
petty
BIAC Staff
USA
453 Posts
Posted - Oct 24 2011 : 5:22:22 PM
All denseness aside...
The maxvmem is the amount the job actually used, whether it was successful or not. There's no way for the grid to know how much you would've used... it only monitors current use. If you hit your designated threshold, the job is killed.
dvsmith
Advanced Member
USA
218 Posts
Posted - Oct 24 2011 : 6:10:32 PM
It seems to say this for pretty much all of my jobs: maxvmem 4.090G
This is definitely true of one job that failed (due to memory) when I requested 8 GBs but succeeded with 12 GBs. I'll keep a closer eye on it. I'm pretty sure I haven't seen a case of maxvmem being greater than 4.09, but I haven't looked exhaustively. That's kind of why I was curious about what's going on...
I'm pretty sure it's accepting my request for additional memory:

[smith@node40 Logs]$ qstat -j 2550323
==============================================================
job_number:                 2550323
exec_file:                  job_scripts/2550323
submission_time:            Mon Oct 24 16:03:27 2011
owner:                      smith
uid:                        2132
group:                      users
gid:                        100
sge_o_home:                 /home/smith
sge_o_log_name:             smith
sge_o_path:                 /usr/kerberos/bin:/opt/gridengine/bin/lx24-amd64:/usr/local/bin:/bin:/usr/bin:/home/smith/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/smith/HighRes.01/GLM_stats/4713
sge_o_host:                 hugin
execution_time:             Sat Sep 24 16:03:59 2011
account:                    sge
stderr_path_list:           NONE:NONE:$HOME/$JOB_NAME.$JOB_ID.out
hard resource_list:         h_vmem=12G
mail_list:                  david.v.smith@duke.edu
notify:                     FALSE
job_name:                   HiRes_6trial_52_1009_01_3.job
stdout_path_list:           NONE:NONE:$HOME/$JOB_NAME.$JOB_ID.out
jobshare:                   0
hard_queue_list:            users.q
shell_list:                 NONE:/bin/sh
env_list:                   EXPERIMENT=HighRes.01
script_file:                HiRes_6trial_52_1009_01_3.job
usage    1:                 cpu=02:04:23, mem=21551.85410 GBs, io=10.03784, vmem=2.912G, maxvmem=4.081G
scheduling info:            queue instance "interact.q@node52.local" dropped because it is full
                            queue instance "interact.q@node17.local" dropped because it is full
                            queue instance "interact.q@node3.local" dropped because it is full
                            queue instance "interact.q@node4.local" dropped because it is full
                            queue instance "interact.q@node42.local" dropped because it is full
                            queue instance "interact.q@node5.local" dropped because it is full
                            queue instance "interact.q@node15.local" dropped because it is full
                            queue instance "interact.q@node31.local" dropped because it is full
                            queue instance "interact.q@node8.local" dropped because it is full
                            queue instance "interact.q@node44.local" dropped because it is full
                            queue instance "interact.q@node6.local" dropped because it is full
                            queue instance "interact.q@node28.local" dropped because it is full
                            queue instance "interact.q@node13.local" dropped because it is full
                            queue instance "interact.q@node50.local" dropped because it is full
                            queue instance "interact.q@node37.local" dropped because it is full
                            queue instance "interact.q@node41.local" dropped because it is full
                            queue instance "interact.q@node51.local" dropped because it is full
                            queue instance "interact.q@node40.local" dropped because it is full
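A quick way to confirm the request made it into the job record is to grep the qstat output for the hard resource list. A sketch using two sample lines copied from the dump above (in practice, pipe `qstat -j <jobid>` straight into grep):

```shell
# Save sample lines from the qstat -j output
cat > job.qstat <<'EOF'
hard resource_list:         h_vmem=12G
usage    1:                 cpu=02:04:23, vmem=2.912G, maxvmem=4.081G
EOF

# Show the recorded h_vmem request
grep -o 'h_vmem=[^, ]*' job.qstat   # prints h_vmem=12G
```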
petty
BIAC Staff
USA
453 Posts
Posted - Oct 24 2011 : 9:56:42 PM
Well, FSL could be doing something that SGE can't correctly keep track of.
The other times I've seen this is with virtually anything involving Matlab... meaning the amount reported doesn't correctly reflect the amount used. I guess when you hit something like this, just use what has worked previously for a similar data size.
dvsmith
Advanced Member
USA
218 Posts
Posted - Oct 24 2011 : 10:58:16 PM
OK, will do. I'll try to keep better track of memory usage. I know these corrections.nii.gz files can get enormous when the model has a lot of autocorrelations (which I guess is prevalent in the trial-to-trial estimations I'm trying to run). I'll just record the size of the file before I delete it on my next iteration of models. Maybe that will give an estimate of how much was actually used:
ls -lh corrections.nii.gz | awk '{print $5}'
Thanks!
dvsmith
Advanced Member
USA
218 Posts
Posted - Oct 26 2011 : 01:40:17 AM
I got it to work with 16 GBs of RAM... The output from maxvmem was 4.09GB, so I'm guessing FSL (or at least some FSL processes) is behaving like Matlab...
The corrections.nii.gz file was 1.2GB.