Friday 27 July 2007

It's always a good feeling putting a timely end to a long-running problem. We've been plagued by the pnfs permission denied error for a little over a month, but hopefully (touch wood) we won't be seeing it again.

So what was the problem? It appears to have been a gridftp door NFS mount synchronisation problem, or something like that. Essentially, the NFS mounts that the gridftp doors read occasionally failed to update quickly enough, so tests like the SAM tests, which copy in a file and then immediately try to access it, occasionally barfed: the door checked to see if the file was there, but because its view of the mount hadn't been updated yet, it wasn't. A subsequent test might hit the door after it had synced, or hit an already-synced door, and thus pass.

I only tracked this down after writing my own transfer tests and finding that I only failed when copying out "fresh" files. When I mentioned this during the UKI meeting, Chris Brew pointed me towards a problem with my PNFS mounts and Greig helped point me towards what they should be (thanks dudes!). It turned out my mounts were missing the "noac" (no attribute caching) option; this is what Greig and Chris have on their mounts, and it's the recommended setting when multiple clients access one server. A quick remount of all my doors later and things seem miraculously all better: my homemade tests all pass and Lancaster's SAM tests are a field of calming green. Thanks to everyone for their help and ideas.
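
For reference, this is roughly what a door's pnfs mount looks like with the option in place. It's only a sketch: the server name and mount point below are placeholders, not our actual configuration.

    # hypothetical /etc/fstab entry for a gridftp door's pnfs mount;
    # "noac" turns off attribute caching so the door sees metadata changes straight away
    pnfs-server.example.ac.uk:/pnfs  /pnfs  nfs  rw,hard,intr,noac  0 0

    # applied by unmounting and remounting on each door
    umount /pnfs && mount /pnfs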

In other news, we're gearing up for the SL4 move, with an ETA of about a fortnight. We plan to lay the groundwork, then try to reinstall all the WNs in one day of downtime; we'll have to see if that's a feasible plan.

Have a good weekend all, I know I will!

Friday 20 July 2007

Replica manager failure solution? Permission denied

We're still facing down the same dcache problem. Consultation with the experts has directed us to the fact that the problem isn't (directly) a gPlazma one: the requests are being assigned to the wrong user (or no user) and suffering permission problems, even when the immediately preceding and following access requests of a similar nature, with the same proxy, succeed without a glitch. The almost complete lack of any sign of this problem in the logs is adding to the frustration. I've sent the developers every config file they've asked for and some they didn't, upped the logging levels and scoured logs till my eyes were almost bleeding, and I still haven't a clue how to fix this thing.

There's no real pattern, other than the fact that failures seem to happen more often during the working day (but then it's a small statistical sample), and there are no corresponding load spikes. We have a ticket open against us about this and the word quarantine got mentioned, which is never a good thing. It sometimes feels like we have a process running in our srm that rolls a die every now and again, and if it comes up as a 1 we fail our saving throw vs. SAM test failure. If we could just find a pattern or cause then we'd be in a much better position. All we can do is keep scouring, make the odd tweak, and see if something presents itself.

Friday 13 July 2007

Torque scripts

Added two scripts to the sysadmin repository:

https://www.sysadmin.hep.ac.uk/wiki/Nodes_Down_in_Torque

https://www.sysadmin.hep.ac.uk/wiki/Counting_Jobs
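
To give a rough idea of the sort of thing these do (the real scripts are on the wiki pages above), counting Torque jobs per state and listing dead worker nodes can be as simple as:

    #!/bin/sh
    # rough sketch only: count Torque jobs by state (R, Q, E, ...) from qstat output,
    # skipping the two header lines
    qstat | awk 'NR>2 {print $5}' | sort | uniq -c

    # list worker nodes that Torque currently considers down or offline
    pbsnodes -l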

Happy Friday 13th

Business as usual for Lancaster, with CA updates and procuring new hardware on our agenda this week. We're also being plagued by a(nother) weird intermittent dcache problem that's sullying what would be an otherwise flawless SAM test run. We've managed to track it down to some kind of authorisation/load problem after chatting to the dcache experts (which makes me groan after the fun and games of configuring gPlazma to work for us). At least we know which tree we need to bark up; hopefully the next week will see the death of this problem and the green 'ok's will reign supreme for us.

Pretty graphs via MonAMI


Paul in Glasgow has written ganglia scripts to display a rich set of graphs using Torque info collected by MonAMI. We've added these to the Lancaster ganglia page. Installation was straightforward, although we had some problems due to the ancient version of rrdtool installed on this SL3 box. Paul made some quick patches, things are now compatible with the older version, and the results are not too different from ScotGrid's SL4 example. Useful stuff!

Wednesday 11 July 2007

Site BDII glite 3.0.2 Update 27

Updating the glite-BDII meta rpm doesn't update the component rpms to the required versions. I opened a ticket:

http://gus.fzk.de/pages/ticket_details.php?ticket=24586
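
For anyone hitting the same thing, the mismatch is easy to see by comparing what the meta rpm asks for with what is actually installed. The commands below are just a sketch; the exact package names on a given node may differ.

    # which versions does the glite-BDII meta package require?
    rpm -q --requires glite-BDII

    # which bdii-related rpms are actually installed?
    rpm -qa | grep -i bdii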


Monday 9 July 2007

glite CE passing arguments to Torque

Started testing the glite CE passing arguments to the Torque server. Installed a gCE with two WNs; it should be inserted into a BDII at IC. The script that does the translation from Glue schema attributes to submission parameters isn't there yet, though I have the LSF equivalent. The script /opt/glite/bin/pbs_submit.sh works standalone, so I can start to look at it on my laptop, where I have a one-WN system.

The problem is not passing the argument but the number of ways a sysadmin can set up a configuration to do the same thing. For the memory parameters it's not a very big problem, but for the rest of the possible configurations it is. A discussion on a standard configuration is required. Snooping around, most of the sites that allow connections have the standard YAIM configuration, but possibly that's because quite a few things break if they don't.
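
To give a flavour of the kind of translation that's missing, something along these lines would need to go into pbs_submit.sh to turn a requested memory value into a Torque resource request. This is purely an illustrative sketch; the variable names are made up and are not taken from the real script.

    # hypothetical fragment: map a requested memory value (in MB), as derived
    # from the Glue schema attributes, onto a Torque/PBS resource request
    if [ -n "$requested_memory_mb" ]; then
        qsub_args="$qsub_args -l mem=${requested_memory_mb}mb"
    fi

    # the wrapper would then submit the job with something like:
    # qsub $qsub_args job_wrapper.sh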

Sheffield week 28

Now begins the job of trying to unpick what's gone wrong here at Sheffield.
So far the main problem seems to be with the CE.

On the plus side, the DPM patching went well.

Sheffield week 27

Arrrr, it's all gone wrong: things are very broken and there's not enough time to fix them.

Power outages