| TOPIC REVIEW |
| petty |
Posted - Apr 21 2011 : 09:54:52 AM There have been some instances where users need a very large amount of RAM to run a particular job without causing issues for other jobs that may be running.
The compute nodes have about 30GB of available RAM with 8 simultaneous jobs allowed. If all the jobs are typical and playing nice, then that's sufficient for most tasks.
If you need much more, I've created another queue that limits the node to 1 job, giving you the full 30GB. Please only use this if you definitely need it. There are only a limited number of slots, so if it's abused your job will just sit in the queue for a long time.
To submit to this queue, add "-q highram.q" to your job submission,
e.g.: qsub -q highram.q -v EXPERIMENT=Something.01 script.sh
All other jobs default to users.q |
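If a job needs a specific amount of memory rather than a whole node, SGE can also cap memory per job via the standard h_vmem resource (the replies in this thread use it); the submissions below are illustrative, with the queue name and variable taken from the post above:

```shell
# Whole-node queue from the post (gives the job the full ~30GB node):
qsub -q highram.q -v EXPERIMENT=Something.01 script.sh

# Per-job cap on the default queue: SGE enforces h_vmem, and an
# allocation beyond the cap fails (or the job is killed).
qsub -l h_vmem=12G -v EXPERIMENT=Something.01 script.sh
```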
| 15 LATEST REPLIES (Newest First) |
| dvsmith |
Posted - Oct 26 2011 : 11:12:30 AM Yeah, it's high-res data, so there's going to be a lot of individual time series. All of the other files that are working with fewer GBs of RAM have exactly the same structure. |
| petty |
Posted - Oct 26 2011 : 10:03:19 AM I don't think it has anything to do with that particular file .. I think it's more to do with your number of timepoints X number of columns in the design X EVs
in your case it's: sizeTS=362 numTS=273884 EV=28
this step also saves the res4d data. |
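As a rough sanity check (my assumption: 4-byte floats, ignoring FILM's temporary buffers and the corrections output), the data matrix plus the design-related terms for these numbers already imply a non-trivial footprint, and the saved res4d/corrections volumes multiply it several times over:

```shell
# Back-of-the-envelope estimate for sizeTS=362, numTS=273884, EV=28,
# assuming 4-byte floats: data (sizeTS x numTS) plus one EV-sized
# term per voxel (numTS x EV).  Actual FILM peak usage is a multiple
# of this once autocorrelation corrections are held in memory.
echo "362 273884 28" | awk '{printf "%.1f GB\n", ($1*$2 + $2*$3) * 4 / 2^30}'
```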
| dvsmith |
Posted - Oct 26 2011 : 01:40:17 AM I got it to work with 16 GBs of RAM... The output from maxvmem was 4.09 GB, so I'm guessing FSL (or at least some FSL processes) is behaving like Matlab...
The corrections.nii.gz file was 1.2 GB. |
| dvsmith |
Posted - Oct 25 2011 : 4:55:56 PM OK, sorry to be the troublemaker here... I've done a little more poking around, and I'm still getting some random failures: file:///Volumes/Huettel/HighRes.01/Analysis/FSL/1007/TrialByTrial/Smooth_0mm/run2/trial01.feat/report_log.html
Key error: /usr/local/packages/fsl-4.1.8/bin/contrast_mgr stats design.con
Uncaught exception!
This appears to be related to memory issues according to these posts: JISCmail - FSL Archive - Re: contrast manager error in FEAT (https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=FSL;c63ccf2a.0702) JISCmail - FSL Archive - Re: Uncaught exception (https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=FSL;30d0af21.1010)
The file size for the corrections.nii.gz file here was 1.2 GBs. I requested 8 GBs for my first attempt and 12 GBs for my second and third attempts (the maxvmem ranged between 4.09G and 4.337G). I'll increase to 16 GBs and see if that works.
Thanks! David
|
| dvsmith |
Posted - Oct 24 2011 : 10:58:16 PM OK, will do. I'll try to keep better track of memory usage. I know these corrections.nii.gz files can get enormous when the model has a lot of autocorrelations (which I guess is prevalent in the trial-to-trial estimations I'm trying to run). I'll just record the size of the file before I delete it on my next iteration of models. Maybe that will give an estimate of how much was actually used... ls -lh corrections.nii.gz | awk '{print $5}' Thanks! |
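That one-liner could be folded into a small helper that logs the size before deleting, so sizes can later be compared against maxvmem; this is a sketch with hypothetical file and log names:

```shell
#!/bin/sh
# Hypothetical helper: append a file's human-readable size to a log,
# then delete the file.  Log name defaults to corrections_sizes.log.
log_and_remove() {
    f=$1
    log=${2:-corrections_sizes.log}
    [ -f "$f" ] || return 0
    printf '%s\t%s\n' "$f" "$(ls -lh "$f" | awk '{print $5}')" >> "$log"
    rm -f "$f"
}
# usage: log_and_remove run2/trial01.feat/stats/corrections.nii.gz
```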
| petty |
Posted - Oct 24 2011 : 9:56:42 PM Well, FSL could be doing something that SGE can't correctly keep track of.
The other times I've seen this is with virtually anything involving Matlab ... meaning the amount reported doesn't correctly reflect the amount used. I guess when you hit something like this, just use what has worked previously for a similar data size.
| dvsmith |
Posted - Oct 24 2011 : 6:10:32 PM It seems to say this for pretty much all of my jobs: maxvmem 4.090G
This is definitely true of one job that failed (due to memory) when I requested 8 GBs but succeeded with 12 GBs. I'll keep a closer eye on it. I'm pretty sure I haven't seen a case of maxvmem being greater than 4.09, but I haven't looked exhaustively. That's kind of why I was curious about what's going on...
I'm pretty sure it's accepting my request for additional memory: [smith@node40 Logs]$ qstat -j 2550323 ============================================================== job_number: 2550323 exec_file: job_scripts/2550323 submission_time: Mon Oct 24 16:03:27 2011 owner: smith uid: 2132 group: users gid: 100 sge_o_home: /home/smith sge_o_log_name: smith sge_o_path: /usr/kerberos/bin:/opt/gridengine/bin/lx24-amd64:/usr/local/bin:/bin:/usr/bin:/home/smith/bin sge_o_shell: /bin/bash sge_o_workdir: /home/smith/HighRes.01/GLM_stats/4713 sge_o_host: hugin execution_time: Sat Sep 24 16:03:59 2011 account: sge stderr_path_list: NONE:NONE:$HOME/$JOB_NAME.$JOB_ID.out hard resource_list: h_vmem=12G mail_list: david.v.smith@duke.edu notify: FALSE job_name: HiRes_6trial_52_1009_01_3.job stdout_path_list: NONE:NONE:$HOME/$JOB_NAME.$JOB_ID.out jobshare: 0 hard_queue_list: users.q shell_list: NONE:/bin/sh env_list: EXPERIMENT=HighRes.01 script_file: HiRes_6trial_52_1009_01_3.job usage 1: cpu=02:04:23, mem=21551.85410 GBs, io=10.03784, vmem=2.912G, maxvmem=4.081G scheduling info: queue instance "interact.q@node52.local" dropped because it is full queue instance "interact.q@node17.local" dropped because it is full queue instance "interact.q@node3.local" dropped because it is full queue instance "interact.q@node4.local" dropped because it is full queue instance "interact.q@node42.local" dropped because it is full queue instance "interact.q@node5.local" dropped because it is full queue instance "interact.q@node15.local" dropped because it is full queue instance "interact.q@node31.local" dropped because it is full queue instance "interact.q@node8.local" dropped because it is full queue instance "interact.q@node44.local" dropped because it is full queue instance "interact.q@node6.local" dropped because it is full queue instance "interact.q@node28.local" dropped because it is full queue instance "interact.q@node13.local" dropped because it is full queue instance "interact.q@node50.local" 
dropped because it is full queue instance "interact.q@node37.local" dropped because it is full queue instance "interact.q@node41.local" dropped because it is full queue instance "interact.q@node51.local" dropped because it is full queue instance "interact.q@node40.local" dropped because it is full
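The qstat output shows in-flight usage (vmem/maxvmem under "usage"); for jobs that have already finished, SGE's accounting database keeps the recorded peak. Assuming qacct is on your path, something like this would show whether a successful run ever went above the ~4 GB ceiling:

```shell
# Pull the recorded peak memory and exit information for a finished
# job (job number taken from the qstat output above):
qacct -j 2550323 | grep -E 'maxvmem|exit_status|failed'
```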
|
| petty |
Posted - Oct 24 2011 : 5:22:22 PM all denseness aside ...
the maxvmem is the amount the job actually used, whether it was successful or not. There's no way for the grid to know how much you would've used .. it only monitors current use. If you hit your designated threshold, the job is killed.
|
| syam.gadde |
Posted - Oct 24 2011 : 4:40:37 PM What I was saying (unclearly) was that if SGE is setting a system limit for the process, and it failed because it tried to use more than the prescribed amount of memory, SGE could possibly be correctly reporting the memory used by the process before it made the memory allocation call that failed.
So no, for a process that failed because it ran out of memory, the maxvmem value would not be an accurate reflection of the amount of memory your process would have used if it had completed successfully.
Do you have maxvmem values from the jobs that did succeed? (i.e. where you specified a larger amount of memory, say 12GB) I'd be interested to know if that value was between 8GB and 12GB. |
| dvsmith |
Posted - Oct 24 2011 : 4:31:00 PM Sorry for being dense -- does that mean what gets reported in the maxvmem field is a reflection of the max memory used by *completed* (and hence successful, exit_status=0) processes? When my jobs are failing because of memory, they say they ran out of memory, and I don't see any indication that they were killed by the grid.
Unfortunately, I've been deleting the corrections.nii.gz files as they get created (but after they have served their purpose, of course). So, I can't tell how big they really are. I know that they can blow up to several times the size of the data, so that's why I get rid of them.
|
| syam.gadde |
Posted - Oct 24 2011 : 07:40:39 AM I don't know how SGE is enforcing the memory limits, but it is possible that it is setting it as a limited system resource (using limit/ulimit), in which case the allocation of the large chunk of memory itself would fail, and SGE would never know what you actually asked for. In that case the maxvmem field would be (correctly) the maximum amount of memory used by the process, but the process failed before any excessive memory allocation could have succeeded. |
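A minimal sketch of that failure mode (my illustration, assuming Linux and that the cap is enforced via setrlimit/ulimit): under a 1 GiB virtual-memory limit, a 2 GiB allocation fails inside the process itself, so the recorded peak never reflects what was asked for.

```shell
#!/bin/sh
# Subshell capped at 1 GiB of virtual memory; the 2 GiB allocation
# fails inside the process, so its peak usage stays under the cap
# and the accounting never sees the amount that was requested.
(
    ulimit -v 1048576                          # cap in KiB = 1 GiB
    python3 -c 'bytearray(2 * 1024**3)' 2>/dev/null \
        || echo "allocation failed under the cap"
)
```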
| dvsmith |
Posted - Oct 23 2011 : 8:46:26 PM It was an FSL job, and it specifically failed when it was trying to save a big file. It said FILM did not complete due to memory. I gave it even more memory and it worked. Does that mean that my request for more was read incorrectly? Or is FILM/FSL not communicating with SGE to indicate that a job should be killed immediately?
Thanks!
|
| petty |
Posted - Oct 23 2011 : 8:13:49 PM No, because it only reports what is used. If you hit the limit you requested, then SGE immediately kills the job. |
| dvsmith |
Posted - Oct 23 2011 : 7:31:04 PM These aren't matlab jobs. Does it correctly report the RAM usage of a failed process? In other words, if I have a process that requires 16 GBs of RAM, and I only give it 12 GBs, will it still report 16 GBs in the maxvmem field even though the process was unsuccessful? |
| petty |
Posted - Oct 23 2011 : 6:12:05 PM that maxvmem is supposed to be the actual amount used ... if it's peaking just above 4 and being killed, it would make me think your request for more wasn't correct
however, I've found that Matlab jobs don't correctly report their RAM usage