dvsmith
Advanced Member
USA
218 Posts
Posted - Oct 28 2011 : 9:13:25 PM
Are some of the other nodes having issues now? I ask because I have some jobs that don't seem to be doing anything. None of my jobs should be taking more than about 45 minutes or so, but some have been hanging without error all day. When I check the job, I can't see anything wrong:
[smith@hugin nperlabel_equal]$ qstat -j 2580107
==============================================================
job_number:          2580107
exec_file:           job_scripts/2580107
submission_time:     Fri Oct 28 13:59:08 2011
owner:               smith
uid:                 2132
group:               users
gid:                 100
sge_o_home:          /home/smith
sge_o_log_name:      smith
sge_o_path:          /usr/kerberos/bin:/opt/gridengine/bin/lx24-amd64:/usr/local/bin:/bin:/usr/bin:/home/smith/bin
sge_o_shell:         /bin/bash
sge_o_workdir:       /home/smith/Imagene.02/neglect_mvpa/nperlabel_equal/18779
sge_o_host:          hugin
execution_time:      Wed Sep 28 13:59:19 2011
account:             sge
stderr_path_list:    NONE:NONE:$HOME/$JOB_NAME.$JOB_ID.out
hard resource_list:  h_vmem=3G
mail_list:           david.v.smith@duke.edu
notify:              FALSE
job_name:            vox_MVPA_perm_00005
stdout_path_list:    NONE:NONE:$HOME/$JOB_NAME.$JOB_ID.out
jobshare:            0
hard_queue_list:     users.q
shell_list:          NONE:/bin/sh
env_list:            EXPERIMENT=Imagene.02
script_file:         vox_MVPA_perm_00005
usage 1:             cpu=07:08:27, mem=19348.85555 GBs, io=0.09795, vmem=865.883M, maxvmem=1.778G
scheduling info:     queue instance "interact.q@node15.local" dropped because it is disabled
                     queue instance "users.q@node15" dropped because it is disabled
                     queue instance "interact.q@node17.local" dropped because it is full
                     queue instance "interact.q@node3.local" dropped because it is full
                     queue instance "interact.q@node4.local" dropped because it is full
                     queue instance "interact.q@node42.local" dropped because it is full
                     queue instance "interact.q@node38.local" dropped because it is full
                     queue instance "interact.q@node5.local" dropped because it is full
                     queue instance "interact.q@node46.local" dropped because it is full
                     queue instance "interact.q@node50.local" dropped because it is full
                     queue instance "users.q@node52.local" dropped because it is full
                     queue instance "users.q@node53.local" dropped because it is full
                     queue instance "users.q@node41" dropped because it is full
                     queue instance "users.q@node3" dropped because it is full
                     queue instance "users.q@node4" dropped because it is full
Here are some of the offending jobs/nodes:
2580107 0.25866 vox_MVPA_p smith r 10/28/2011 13:59:11 users.q@node30 1
2580113 0.25865 vox_MVPA_p smith r 10/28/2011 13:59:41 users.q@node44 1
2580138 0.25862 vox_MVPA_p smith r 10/28/2011 14:01:21 users.q@node50 1
2580161 0.25858 vox_MVPA_p smith r 10/28/2011 14:03:13 users.q@node58.local 1
2580163 0.25858 vox_MVPA_p smith r 10/28/2011 14:03:28 users.q@node41 1
2580187 0.25854 vox_MVPA_p smith r 10/28/2011 14:05:13 users.q@node6 1
2580289 0.25848 vox_MVPA_p smith r 10/28/2011 14:10:13 users.q@node54.local 1
2580294 0.25847 vox_MVPA_p smith r 10/28/2011 14:11:13 users.q@node56.local 1
2580300 0.25847 vox_MVPA_p smith r 10/28/2011 14:12:28 users.q@node19 1
2580356 0.25838 vox_MVPA_p smith r 10/28/2011 14:20:44 users.q@node59.local 1
2580425 0.25829 vox_MVPA_p smith r 10/28/2011 14:30:10 users.q@node32 1
2580461 0.25824 vox_MVPA_p smith r 10/28/2011 14:36:10 users.q@node27 1
2580485 0.25821 vox_MVPA_p smith r 10/28/2011 14:39:40 users.q@node24 1
2580491 0.25820 vox_MVPA_p smith r 10/28/2011 14:40:23 users.q@node23 1
2580502 0.25819 vox_MVPA_p smith r 10/28/2011 14:41:23 users.q@node41 1
2580514 0.25817 vox_MVPA_p smith r 10/28/2011 14:43:12 users.q@node17 1
2580518 0.25816 vox_MVPA_p smith r 10/28/2011 14:43:12 users.q@node50 1
2580524 0.25815 vox_MVPA_p smith r 10/28/2011 14:44:12 users.q@node7 1
2580538 0.25813 vox_MVPA_p smith r 10/28/2011 14:46:17 users.q@node16 1
2580570 0.25806 vox_MVPA_p smith r 10/28/2011 14:51:46 users.q@node9 1
2580663 0.25781 vox_MVPA_p smith r 10/28/2011 14:58:30 users.q@node39 1
Should I just kill these and piece together whatever happened later on once everything is done?
Thanks! David
syam.gadde
BIAC Staff
USA
421 Posts
Posted - Oct 28 2011 : 9:47:30 PM
I only looked at one of your jobs (node30) and it is merrily using 100% of a CPU, so it's doing something. Whether it is doing anything useful, I could not tell you. If it produces any real-time logs maybe that could give you a clue, or if it tends to write output or temporary files regularly you can look at the timestamps.
There is a tool called lsof that will tell you all the files that are open (by processes owned by you). You can play with that and see if it gives you any clues.
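If you want to poke at this yourself, here is a minimal sketch of the same idea using Linux's /proc filesystem (an alternative view of what `lsof -p <pid>` reports; the PID would come from ps or qstat on the node, and this example just inspects its own process):

```python
import os

def open_files(pid):
    """Return {fd: path} for each open file descriptor of a process.

    Reads /proc/<pid>/fd, which on Linux works for processes you own;
    roughly the file list that `lsof -p <pid>` shows.
    """
    fd_dir = "/proc/%d/fd" % pid
    files = {}
    for fd in os.listdir(fd_dir):
        try:
            files[int(fd)] = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            pass  # the fd was closed between listdir and readlink
    return files

# Example: inspect this process itself (/proc is Linux-only).
if os.path.exists("/proc"):
    for fd, path in sorted(open_files(os.getpid()).items()):
        print("%3d -> %s" % (fd, path))
```

Running that against a supposedly hung job would at least show whether it still holds the data files open.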
dvsmith
Advanced Member
USA
218 Posts
Posted - Oct 29 2011 : 02:03:20 AM
It's not doing anything useful for some reason... The logs are still sitting in my home directory and they're not even getting to the point of loading the data, but they are filled with the warnings that Chris has mentioned before... I used lsof on one of the presumably stuck jobs on node50, and I did see a lot of Python processes/files in use, so it was at least accessing the files it needs to access.
I'm going to kill most of the ones that I think are stuck, but I'll leave one or two in the hopes that someone can tell whether anything is wrong with the node or packages (or at least why they failed to do anything). They definitely shouldn't take more than 45 minutes, so sorry about that.
----JOB [vox_MVPA_perm_01024.2583605] START [Fri Oct 28 20:10:34 EDT 2011] on HOST [node46]----
/usr/lib64/python2.6/site-packages/mvpa/base/verbosity.py:19: DeprecationWarning: the sets module is deprecated
  from sets import Set
/usr/lib/python2.6/site-packages/scipy-0.9.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py:4088: RuntimeWarning: divide by zero encountered in power
  return np.power((1.0-x*x),c/2.0-1) / special.beta(0.5,c/2.0)
/usr/lib/python2.6/site-packages/numpy-1.6.1-py2.6-linux-x86_64.egg/numpy/lib/function_base.py:1881: RuntimeWarning: invalid value encountered in _cdf_single_call (vectorized)
  _res = array(self.ufunc(*newargs),copy=False,
Loaded lars 0.9-8
/usr/lib/python2.6/site-packages/matplotlib-1.0.1-py2.6-linux-x86_64.egg/matplotlib/numerix/__init__.py:18: DeprecationWarning:
**********************************************************
matplotlib.numerix and all its subpackages are deprecated. They will be removed soon. Please use numpy instead.
**********************************************************
  warnings.warn(msg, DeprecationWarning)
/usr/lib64/python2.6/site-packages/mvpa/misc/errorfx.py:99: RuntimeWarning: invalid value encountered in divide
  ([0], N.cumsum(t)/t.sum(dtype=N.float), [1]))
/usr/lib64/python2.6/site-packages/mvpa/misc/errorfx.py:103: RuntimeWarning: invalid value encountered in divide
  ([0], N.cumsum(~t)/(~t).sum(dtype=N.float), [1]))
/usr/lib/python2.6/site-packages/scipy-0.9.0-py2.6-linux-x86_64.egg/scipy/stats/stats.py:274: RuntimeWarning: invalid value encountered in double_scalars
  return np.mean(x,axis)/factor
/usr/lib64/python2.6/site-packages/mvpa/clfs/transerror.py:586: RuntimeWarning: invalid value encountered in divide
  stats['PPV'] = stats['TP'] / (1.0*stats["P'"])
/usr/lib64/python2.6/site-packages/mvpa/clfs/transerror.py:587: RuntimeWarning: invalid value encountered in divide
  stats['NPV'] = stats['TN'] / (1.0*stats["N'"])
/usr/lib64/python2.6/site-packages/mvpa/clfs/transerror.py:588: RuntimeWarning: invalid value encountered in divide
  stats['FDR'] = stats['FP'] / (1.0*stats["P'"])
dvsmith
Advanced Member
USA
218 Posts
Posted - Oct 29 2011 : 5:54:16 PM
OK, well even my successful jobs (those lasting < 45 minutes) produce the above warnings, so that's not it. I have no idea what's going on. It seems like a good *random* third of these jobs -- and only these jobs -- are getting hung up before they do anything.
2599824 1 vox_MVPA_perm_05023 smith 10/29/2011 11:55:46 users.q
2599844 1 vox_MVPA_perm_05028 smith 10/29/2011 11:57:16 users.q
2599928 1 vox_MVPA_perm_05049 smith 10/29/2011 12:00:01 users.q
2599976 1 vox_MVPA_perm_05061 smith 10/29/2011 12:02:01 users.q
2600047 1 vox_MVPA_perm_05079 smith 10/29/2011 12:05:01 users.q
2600060 1 runfree.sh vmb8 10/29/2011 12:05:46 users.q
2600061 1 runfree.sh vmb8 10/29/2011 12:06:01 users.q
2600196 1 vox_MVPA_perm_05115 smith 10/29/2011 12:11:01 users.q
2600280 1 vox_MVPA_perm_05136 smith 10/29/2011 12:14:31 users.q
2600308 1 vox_MVPA_perm_05143 smith 10/29/2011 12:15:46 users.q
2600319 1 vox_MVPA_perm_05146 smith 10/29/2011 12:16:16 users.q
2600327 1 vox_MVPA_perm_05148 smith 10/29/2011 12:16:31 users.q
2600436 1 vox_MVPA_perm_05175 smith 10/29/2011 12:21:01 users.q
2600569 1 vox_MVPA_perm_05208 smith 10/29/2011 12:26:31 users.q
2600648 1 Pr1 truong 10/29/2011 12:29:46 users.q
2600649 1 vox_MVPA_perm_05228 smith 10/29/2011 12:30:01 users.q
2600660 1 Pr2 truong 10/29/2011 12:30:16 users.q
2600662 1 Pr3 truong 10/29/2011 12:30:31 users.q
2600667 1 Pr4 truong 10/29/2011 12:30:31 users.q
2600672 1 Pr5 truong 10/29/2011 12:30:46 users.q
2600677 1 Pr6 truong 10/29/2011 12:30:46 users.q
2600738 1 vox_MVPA_perm_05249 smith 10/29/2011 12:33:31 users.q
2600795 1 vox_MVPA_perm_05263 smith 10/29/2011 12:35:46 users.q
2600824 1 vox_MVPA_perm_05270 smith 10/29/2011 12:37:01 users.q
2600832 1 vox_MVPA_perm_05272 smith 10/29/2011 12:37:16 users.q
2600840 1 vox_MVPA_perm_05274 smith 10/29/2011 12:37:31 users.q
2600961 1 runfree.sh vmb8 10/29/2011 12:42:46 users.q
2601003 1 vox_MVPA_perm_05315 smith 10/29/2011 12:44:31 users.q
2601404 1 vox_MVPA_perm_05400 smith 10/29/2011 12:59:01 users.q
2601441 1 vox_MVPA_perm_05409 smith 10/29/2011 13:00:46 users.q
2601469 1 vox_MVPA_perm_05416 smith 10/29/2011 13:02:16 users.q
2601697 1 vox_MVPA_perm_05473 smith 10/29/2011 13:11:46 users.q
2601772 1 vox_MVPA_perm_05492 smith 10/29/2011 13:15:24 users.q
2601913 1 vox_MVPA_perm_05527 smith 10/29/2011 13:20:54 users.q
2601933 1 vox_MVPA_perm_05532 smith 10/29/2011 13:21:39 users.q
2601953 1 vox_MVPA_perm_05537 smith 10/29/2011 13:22:24 users.q
2601995 1 vox_MVPA_perm_05547 smith 10/29/2011 13:23:54 users.q
2602200 1 vox_MVPA_perm_05597 smith 10/29/2011 13:32:59 users.q
2602332 1 vox_MVPA_perm_05630 smith 10/29/2011 13:37:59 users.q
2602345 1 vox_MVPA_perm_05633 smith 10/29/2011 13:38:59 users.q
2602352 1 vox_MVPA_perm_05635 smith 10/29/2011 13:39:14 users.q
2602452 1 vox_MVPA_perm_05660 smith 10/29/2011 13:43:29 users.q
2602592 1 vox_MVPA_perm_05695 smith 10/29/2011 13:49:14 users.q
2602690 1 vox_MVPA_perm_05719 smith 10/29/2011 13:52:29 users.q
2602742 1 vox_MVPA_perm_05732 smith 10/29/2011 13:54:14 users.q
2602853 1 vox_MVPA_perm_05760 smith 10/29/2011 13:58:59 users.q
2602922 1 vox_MVPA_perm_05777 smith 10/29/2011 14:01:44 users.q
2602958 1 vox_MVPA_perm_05786 smith 10/29/2011 14:03:14 users.q
2603098 1 vox_MVPA_perm_05821 smith 10/29/2011 14:09:14 users.q
2603126 1 vox_MVPA_perm_05828 smith 10/29/2011 14:10:14 users.q
2603214 1 vox_MVPA_perm_05850 smith 10/29/2011 14:13:59 users.q
2603230 1 vox_MVPA_perm_05854 smith 10/29/2011 14:14:44 users.q
2603250 1 vox_MVPA_perm_05859 smith 10/29/2011 14:15:29 users.q
2603337 1 vox_MVPA_perm_05881 smith 10/29/2011 14:19:14 users.q
2603346 1 vox_MVPA_perm_05883 smith 10/29/2011 14:19:29 users.q
2603350 1 vox_MVPA_perm_05884 smith 10/29/2011 14:19:44 users.q
2603435 1 vox_MVPA_perm_05905 smith 10/29/2011 14:23:14 users.q
2603484 1 vox_MVPA_perm_05917 smith 10/29/2011 14:26:02 users.q
2603704 1 vox_MVPA_perm_05972 smith 10/29/2011 14:35:17 users.q
2603724 1 vox_MVPA_perm_05977 smith 10/29/2011 14:36:02 users.q
2603745 1 vox_MVPA_perm_05982 smith 10/29/2011 14:36:47 users.q
2603778 1 vox_MVPA_perm_05990 smith 10/29/2011 14:38:17 users.q
2603812 1 vox_MVPA_perm_05998 smith 10/29/2011 14:39:32 users.q
2603918 1 vox_MVPA_perm_06024 smith 10/29/2011 14:43:47 users.q
2603979 1 vox_MVPA_perm_06038 smith 10/29/2011 14:46:17 users.q
2604040 1 vox_MVPA_perm_06051 smith 10/29/2011 14:48:17 users.q
2604223 1 vox_MVPA_perm_06087 smith 10/29/2011 14:54:32 users.q
2604278 1 vox_MVPA_perm_06100 smith 10/29/2011 14:56:32 users.q
2604378 1 vox_MVPA_perm_06125 smith 10/29/2011 15:00:47 users.q
2604414 1 vox_MVPA_perm_06134 smith 10/29/2011 15:02:17 users.q
2604490 1 vox_MVPA_perm_06153 smith 10/29/2011 15:05:32 users.q
2604610 1 vox_MVPA_perm_06183 smith 10/29/2011 15:10:32 users.q
2604682 1 vox_MVPA_perm_06201 smith 10/29/2011 15:13:32 users.q
2604770 1 vox_MVPA_perm_06223 smith 10/29/2011 15:17:02 users.q
2604850 1 vox_MVPA_perm_06243 smith 10/29/2011 15:20:32 users.q
2604858 1 vox_MVPA_perm_06245 smith 10/29/2011 15:20:47 users.q
2604927 1 vox_MVPA_perm_06262 smith 10/29/2011 15:23:47 users.q
2605079 1 vox_MVPA_perm_06299 smith 10/29/2011 15:29:47 users.q
2605402 1 vox_MVPA_perm_06379 smith 10/29/2011 15:43:17 users.q
2605546 1 vox_MVPA_perm_06415 smith 10/29/2011 15:49:17 users.q
2605599 1 vox_MVPA_perm_06428 smith 10/29/2011 15:51:32 users.q
2605631 1 vox_MVPA_perm_06436 smith 10/29/2011 15:52:47 users.q
2605691 1 vox_MVPA_perm_06451 smith 10/29/2011 15:55:17 users.q
2606015 1 vox_MVPA_perm_06532 smith 10/29/2011 16:08:47 users.q
2606059 1 vox_MVPA_perm_06543 smith 10/29/2011 16:10:32 users.q
2606115 1 vox_MVPA_perm_06557 smith 10/29/2011 16:13:02 users.q
2606155 1 vox_MVPA_perm_06567 smith 10/29/2011 16:14:47 users.q
2606520 1 vox_MVPA_perm_06658 smith 10/29/2011 16:29:58 users.q
2606568 1 vox_MVPA_perm_06670 smith 10/29/2011 16:31:58 users.q
2606672 1 vox_MVPA_perm_06696 smith 10/29/2011 16:36:13 users.q
2606688 1 vox_MVPA_perm_06700 smith 10/29/2011 16:36:58 users.q
2607104 1 vox_MVPA_perm_06804 smith 10/29/2011 16:54:28 users.q
2607144 1 vox_MVPA_perm_06814 smith 10/29/2011 16:55:58 users.q
2607364 1 vox_MVPA_perm_06869 smith 10/29/2011 17:05:13 users.q
2607392 1 vox_MVPA_perm_06876 smith 10/29/2011 17:06:28 users.q
2607580 1 vox_MVPA_perm_06923 smith 10/29/2011 17:14:13 users.q
2607592 1 vox_MVPA_perm_06926 smith 10/29/2011 17:14:43 users.q
petty
BIAC Staff
USA
453 Posts
Posted - Oct 29 2011 : 6:07:32 PM
are you saying that they aren't doing anything once they begin running?
dvsmith
Advanced Member
USA
218 Posts
Posted - Oct 29 2011 : 6:33:14 PM
Yeah (sorry for not being clear about that). I have a print command before the script really does anything intensive and it's never getting that far in these stuck jobs.
###packages###
import os, sys, glob, time
import numpy as N
#import pylab as P #needs display
from mvpa.suite import *
#from matplotlib.font_manager import FontProperties #needs display
###parameters and pathways###
num_sub = 140
exp_dir = 'MAINDIR' #to be replaced by sed
data_dir = os.path.join(exp_dir,'Data') #might need to change this, too
###naming conventions###
test_name = "neglect" #to be replaced by sed
dummyROI = "000_000_000"
###naming conventions###
perm = "SED_PERM_SED" #to be replaced by sed
#S/Attributes/permuted_labels/permuted_attr_00001.txt
###attributes###
#participants with and without Neglect#
attr_dir = os.path.join(exp_dir, 'Attributes','permuted_labels')
attr_file = "%s/permuted_attr_%s.txt" %(attr_dir, perm)
attr = SampleAttributes(attr_file)
mask_type = "old" #to be replaced by sed
#data_types = [ 'raw', 'normed' ]
data_type = "normed"
##Load MRI data##
if data_type == 'normed':
    wb_file = "%s/size_LesionData_CN.nii.gz" %(data_dir)
else:
    wb_file = "%s/LesionData_CN.nii.gz" %(data_dir)
msg = "used %s data" %(data_type)
print msg
I am pretty sure it's not systematically failing when it reads in my permuted design with SampleAttributes(attr_file), because the exact same permutation works for other jobs.
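One thing worth ruling out first (not something confirmed above, just a common gotcha): when SGE redirects stdout to the .out file, Python block-buffers it, so an early print can sit in the buffer even while the job runs on. A flushed, timestamped helper (a sketch; `log` is a made-up name) makes "never reached the print" distinguishable from "printed but still buffered":

```python
import sys
import time

def log(msg):
    """Write a timestamped progress line and flush it immediately.

    With stdout redirected to a file, Python buffers output in blocks,
    so an ordinary print may not appear in the .out file for a long
    time. Flushing after every message makes the log a reliable
    progress marker for diagnosing where a job is stuck.
    """
    sys.stdout.write("%s  %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), msg))
    sys.stdout.flush()

log("loading attributes")
# attr = SampleAttributes(attr_file)   # the real work would go here
log("attributes loaded")
```

If the timestamped lines stop appearing in the .out file, the last one printed brackets where the hang is.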
petty
BIAC Staff
USA
453 Posts
Posted - Oct 29 2011 : 6:41:29 PM
is each one of your jobs accessing the same file/directory/etc. at the same time?
Also, in that long list of jobs, how can you know other people's jobs are doing anything?
Edited by - petty on Oct 29 2011 6:43:47 PM

dvsmith
Advanced Member
USA
218 Posts
Posted - Oct 29 2011 : 6:50:49 PM
All of the attr files are in the same directory -- but it's unlikely that any two jobs are accessing the same file at the same time. In this analysis, there is one permuted_attr_%s.txt file for each job.
I was only talking about my jobs in that list.
petty
BIAC Staff
USA
453 Posts
Posted - Oct 29 2011 : 7:31:31 PM
is this one of those things where you have like a million files in the same directory?
the load on all the nodes is so high, maybe it's getting bogged down with file access
dvsmith
Advanced Member
USA
218 Posts
Posted - Oct 29 2011 : 7:44:35 PM
just 10,000 (one for each permutation i need to do)
load because of everyone's jobs or just my jobs? my jobs only use 3 GB of RAM and they're generally pretty quick (e.g., 30-45 minutes) except when they break like this. are they failing because of the specific node they land on (e.g., one with one of trong-kha's jobs)?
petty
BIAC Staff
USA
453 Posts
Posted - Oct 29 2011 : 9:06:53 PM
everyone's jobs. the nodes are completely booked to the max and all the resources have been reserved.
they aren't oversubscribed ... if you look, there are available slots on most nodes ... which means it's the memory that has all been allocated.
Edited by - petty on Oct 29 2011 9:08:48 PM
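To make the "slots free but memory exhausted" situation concrete (every number below is invented for illustration; real node sizes will differ), per-job h_vmem reservations can exhaust a node's memory well before its CPU slots:

```python
# Hypothetical node: all numbers are made up for illustration only.
slots_total = 8          # CPU slots on the node
mem_total_gb = 16.0      # RAM the scheduler can promise to jobs
h_vmem_gb = 3.0          # per-job reservation (as in `-l h_vmem=3G`)

# The scheduler admits a job only while both a free slot AND the full
# reserved memory are available, so reservations gate admission.
jobs = 0
while jobs < slots_total and (jobs + 1) * h_vmem_gb <= mem_total_gb:
    jobs += 1

print("jobs admitted: %d" % jobs)
print("slots left over: %d" % (slots_total - jobs))
print("memory still free to promise: %.1f GB" % (mem_total_gb - jobs * h_vmem_gb))
```

With these toy numbers only five 3 GB jobs fit, leaving three slots idle; that is the pattern petty describes, just scaled down.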
ark19
Junior Member
27 Posts
Posted - Nov 14 2011 : 5:08:20 PM
I'm seeing this error again:
mkdir: cannot create directory `/mnt/BIAC/munin.dhe.duke.edu/Hariri/DNS.01/Analysis/SPM/Processed/20110915_13694': No such file or directory
/opt/gridengine/hugin/spool/node3/job_scripts/2807590: line 130: /mnt/BIAC/munin.dhe.duke.edu/Hariri/DNS.01/Analysis/SPM/Processed/20110915_13694/spm_batch1_1.m: No such file or directory
Is it the same issue as before?
Thanks!
petty
BIAC Staff
USA
453 Posts
Posted - Nov 14 2011 : 6:02:38 PM
nope, that node is behaving normally and i was able to access your "Processed" folder as you on node3.
ark19
Junior Member
27 Posts
Posted - Nov 14 2011 : 6:53:52 PM
Ok, thanks. Any ideas about what could be happening here? This exact script ran perfectly 2 hours ago - now I get the above error messages, and further, an empty folder for this subject (20110915_13694) is created in /DNS.01/Analysis/SPM/Processed and somehow I cannot delete it:
[ark19@node53 Processed]$ ls -l
...
drwx------ 7 ark19 root 2048 Nov 12 18:23 20110914_13690
drwx------ 0 ark19 root    0 Nov 14 16:55 20110915_13694
drwx------ 7 ark19 root 2048 Nov 14 14:25 20110916_13702
...
[ark19@node53 Processed]$ rm 20110915_13694
rm: cannot remove `20110915_13694': No such file or directory
[ark19@node53 Processed]$ rm -rf 20110915_13694
[ark19@node53 Processed]$ ls -l
...
drwx------ 7 ark19 root 2048 Nov 12 18:23 20110914_13690
drwx------ 0 ark19 root    0 Nov 14 16:55 20110915_13694
drwx------ 7 ark19 root 2048 Nov 14 14:25 20110916_13702
...
petty
BIAC Staff
USA
453 Posts
Posted - Nov 14 2011 : 9:07:39 PM
the script ran for this same subject previously?
Also, that folder no longer exists.