Duke-UNC Brain Imaging and Analysis Center
 Cluster down?

clithero
Junior Member

37 Posts

Posted - May 25 2008 :  1:27:29 PM
Hi,
I am trying to run some FSL jobs (first-level FEAT and flirt) using scripts that had previously worked. The jobs seem to have either failed or started hanging. Killing the jobs on some nodes (node 9) seemed to work, but all the jobs I had running on nodes 8 and 6 still show as "stalled" in qstatall.
I would be grateful for some insight.
Thanks!
John

josh.bizzell
BIAC Staff

USA
118 Posts

Posted - May 27 2008 :  10:24:47 AM
Is this still an issue? If so, can you provide job IDs and the node the job was submitted to?

-Josh

dvsmith
Advanced Member

USA
218 Posts

Posted - Jul 20 2008 :  1:19:01 PM
I'm having the same problem that John was having. My jobs are stalled and wasting space on the cluster. Their current status is "dr". How can I get rid of them?

They're all on node5, so maybe there's just a problem with that node?

The job IDs are below:
422292
422294
422295
422297

Thanks,
David


petty
BIAC Staff

USA
453 Posts

Posted - Jul 20 2008 :  6:50:38 PM
David, you can kill your jobs with the "qdel" command.

e.g.: qdel 422292
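
For several job IDs at once, a small shell loop saves retyping the command. The sketch below is only illustrative: the function name delete_jobs and the DRY_RUN variable are made up for this example, and it assumes the standard Grid Engine qdel is on the PATH. With DRY_RUN=1 (the default here) it only prints the commands it would run.

```shell
#!/bin/sh
# delete_jobs: hypothetical helper, not part of the cluster tooling.
# Calls "qdel <id>" once for each job ID given as an argument.
# DRY_RUN=1 (the default) prints the commands instead of running them.
delete_jobs() {
    for job_id in "$@"; do
        # Skip anything that is not a plain number.
        case "$job_id" in
            ''|*[!0-9]*)
                echo "skipping non-numeric job ID: $job_id" >&2
                continue
                ;;
        esac
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "qdel $job_id"
        else
            qdel "$job_id"
        fi
    done
}

# Example with job IDs from this thread (dry run, nothing is deleted):
# delete_jobs 422292 422294 422295 422297
```

Setting DRY_RUN=0 before the call makes the function actually invoke qdel.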

dvsmith
Advanced Member

USA
218 Posts

Posted - Jul 20 2008 :  10:52:45 PM
No... that's what got them into the stalled state.

When I type
ps -u smith
I do not see any processes on the head node that seem to belong to the hung jobs, so I don't think I can kill them from there. I also can't ssh onto node5, which is where they are all stuck, to kill any aberrant processes there.
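
To see which jobs are stuck in a given state, the state column of the queue listing can be filtered. The sketch below assumes the default Grid Engine qstat layout, where the job ID is the first column and the state is the fifth; the site's qstatall wrapper may format things differently. The function name jobs_in_state is made up for this example.

```shell
#!/bin/sh
# jobs_in_state: hypothetical helper. Reads qstat-style text on stdin and
# prints the IDs of jobs whose state column equals the requested state.
# Assumes the default qstat layout: column 1 = job ID, column 5 = state.
jobs_in_state() {
    # Header and separator lines are skipped because their first field
    # is not a plain number.
    awk -v want="$1" '$1 ~ /^[0-9]+$/ && $5 == want { print $1 }'
}

# Example: list your own jobs that are stuck in the "dr" state.
# qstat -u "$USER" | jobs_in_state dr
```

If a job stays in "dr" because its node no longer responds, a normal qdel cannot finish. Grid Engine has a forced form, qdel -f, but whether ordinary users may use it depends on how the cluster is configured, so that is usually a job for the admins.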

dvsmith
Advanced Member

USA
218 Posts

Posted - Jul 21 2008 :  1:37:04 PM
So what was the deal with node5? My stalled-out jobs are gone, so I assume Josh or another cluster admin killed them?

josh.bizzell
BIAC Staff

USA
118 Posts

Posted - Jul 21 2008 :  1:39:37 PM
Node 5 was in a hung state, so it needed to be rebooted. That was done this morning, and the reboot killed all of your jobs.

I'm testing some things out, and once all of those tests pass, I'll put node 5 back in the users queue.

- Josh

josh.bizzell
BIAC Staff

USA
118 Posts

Posted - Jul 21 2008 :  4:49:32 PM
Node 5 has been tested and is up and running.

-Josh

clithero
Junior Member

37 Posts

Posted - Sep 04 2008 :  10:20:37 AM
I think a couple of jobs randomly stalled last night on node7. Any idea what happened?

Two are mine (both LIBSVM jobs): jobs 493918 and 493920. Vinod also had one, but I am not sure on which node. They are still showing in qstatall.

Thanks.

josh.bizzell
BIAC Staff

USA
118 Posts

Posted - Sep 04 2008 :  10:26:19 AM
It looks like node7 has crashed. You are correct about the jobs that are hung. You and Vinod will need to resubmit the jobs. I'll try to get node7 up and running as soon as possible.

Sorry for any inconvenience,
Josh

josh.bizzell
BIAC Staff

USA
118 Posts

Posted - Sep 05 2008 :  10:07:10 AM
Node7 should be up and running again. Please let us know if you experience any problems.

-Josh
BIAC Forums © 2000-2010 Brain Imaging and Analysis Center