Monday 19 November 2007

Manchester black holes for atlas

Atlas jobs failing because of the following errors:
====================================================
All jobs fail because two bad nodes fail like this:
/opt/globus/bin/globus-gass-cache: line 4: globus_source: command not found
/opt/globus/bin/globus-gass-cache: line 6: /globus-gass-cache-util.pl: No such file or directory
/opt/globus/bin/globus-gass-cache: line 6: exec: /globus-gass-cache-util.pl: cannot execute: No such file or directory
submit-helper script running on host bohr1428 gave error: could not add entry in the local gass cache for stdout
===================================================
Problem caused by

${GLOBUS_LOCATION}/libexec/globus-script-initializer

being empty.
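
A quick way to spot nodes in this state, sketched assuming passwordless ssh from the head node and a flat file listing the worker nodes (the list path is made up):

# Flag any node whose globus-script-initializer is empty or missing.
for node in $(cat /root/wn-list.txt); do
  ssh "$node" 'test -s ${GLOBUS_LOCATION:-/opt/globus}/libexec/globus-script-initializer' \
    || echo "$node: empty or missing globus-script-initializer"
done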

Tuesday 13 November 2007

Some SAM tests don't respect downtime

Sheffield is shown as failing the CE-host-cert-valid test while in downtime; SAM tests should all behave the same way. This is on top of the already very confusing display of the results on alternating lines. I opened a ticket:

https://gus.fzk.de/ws/ticket_info.php?ticket=28983

Sunday 11 November 2007

Manchester sw repository reorganised

To simplify the maintenance of multiple releases and architectures, I reorganised the software (yum) repository in Manchester.

Whereas before we had to maintain a yum.conf for each release and architecture, now we just need to add links in the right place. I wrote a recipe on my favourite site:

http://www.sysadmin.hep.ac.uk/wiki/Yum
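
Roughly the idea, with illustrative paths rather than the actual Manchester layout:

# One repository tree per release/arch; a symlink per node class points a
# single fixed yum.conf baseurl at the right tree.
mkdir -p /var/www/html/yum/glite-3.0/sl3-i386
mkdir -p /var/www/html/yum/glite-3.1/sl4-i386
ln -sfn /var/www/html/yum/glite-3.1/sl4-i386 /var/www/html/yum/wn-current
# Adding a release is then just another directory and a link flip.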

This will also allow us to remove the complications introduced in the cfengine conf files to maintain multiple versions of yum.conf.

Friday 9 November 2007

1,000,000th job has passed by

This week the batch system ticked over its millionth job. The lucky user was biomed005, and no, it was nothing to do with rsa768. The 0th job was way back in August 2005, when we replaced the old batch system with torque. How many of these million were successful? I shudder to think, but I'm sure it's improving :-)

In other news, we're having big problems with our campus firewall: it blocks outgoing ports 80 and 443 to ensure that traffic passes through the university proxy server. Unfortunately, some web clients such as wget and curl make it impossible to use the proxy for these ports whilst bypassing it for all other ports. Atlas needs this with the new PANDA pilot job framework. We installed a squid proxy of our own (good idea Graeme), which allows for greater control. No luck with handling https traffic, though, so we really need to get a hole punched in the campus firewall. I'm confident the uni systems guys will oblige ;-)
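
For reference, these are the standard client-side knobs, which show the limitation: the no_proxy exception list is matched on host names, not destination ports, so there is no clean way to say "proxy ports 80/443 only" (the squid hostname here is made up):

export http_proxy=http://squid.example.man.ac.uk:3128
export https_proxy=http://squid.example.man.ac.uk:3128
export no_proxy=localhost,.man.ac.uk   # host-based exceptions only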

Sheffield in downtime

Sheffield has been put in downtime until Monday 12/11/2007 at 5 pm.


Reason: a power cut affecting much of central Sheffield. A substation exploded; no one is even allowed inside the physics building.

Matt is also back in the GOCDB now as site admin.

Saturday 3 November 2007

Manchester CEs and RGMA problems

We still don't know what happened to ce02: why the ops tests didn't work and my jobs hung forever while everyone else could run (Atlas claims 88 percent efficiency in the last 24 hours). Anyway, I updated ce02 manually (rpm -ihv) to the same set of rpms that are on ce01, and the problem I had, globus hanging, has disappeared. The ops tests are successful again and we got out of the Atlas blacklisting. I also fixed ce01, which yesterday picked up the wrong java version. I need to change a couple of things on the kickstart server so that these incidents don't happen again.
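
For the record, a minimal sketch of how to compare the rpm sets on the two CEs (assuming passwordless ssh between the hosts):

ssh ce01 'rpm -qa | sort' > /tmp/ce01.rpms
ssh ce02 'rpm -qa | sort' > /tmp/ce02.rpms
diff /tmp/ce01.rpms /tmp/ce02.rpms   # '<' lines are only on ce01, '>' only on ce02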

I also had to manually kill tomcat, which was not responding, and restart it on the MON box. Accounting published successfully after this.
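
Something along these lines, assuming the MON box runs tomcat as the tomcat5 service (the service name is my assumption):

# If a plain restart hangs on the wedged JVM, kill it first:
pkill -9 -f org.apache.catalina
service tomcat5 start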

Friday 2 November 2007

Sheffield accounting

From Matt:

/opt/glite/bin/apel-pbs-log-parser
is trying to contact the CE on port 2170, I think expecting the site BDII to be there.
I changed the <GIIS> value from ce_node to mon_node in /opt/glite/etc/glite-apel-pbs/parser-config-yaim.xml
and now things seem much improved.

However, I am getting this

Fri Nov 2 13:48:39 UTC 2007: apel-publisher - Record/s found: 8539
Fri Nov 2 13:48:39 UTC 2007: apel-publisher - Checking Archiver is Online
Fri Nov 2 13:49:40 UTC 2007: apel-publisher - Unable to retrieve any response while querying the GOC
Fri Nov 2 13:49:40 UTC 2007: apel-publisher - Archiver Not Responding: Please inform apel-support@listserv.cclrc.ac.uk
Fri Nov 2 13:49:40 UTC 2007: apel-publisher - WARNING - Received a 'null' result set while querying the 'LcgRecords' table using rgma, this probably means the GOC is currently off-line, will therefore cancel attempt to re-publish

when running /opt/glite/bin/apel-publisher on the MON box.

If the GOC machine is really off-line, I'll have to wait to publish the missing data for Sheffield.
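
For the record, Matt's parser fix amounts to something like the one-liner below; the <GIIS> element name is my guess, and ce_node/mon_node stand in for the real hostnames:

# Point the APEL parser at the MON box instead of the CE:
sed -i 's|<GIIS>ce_node</GIIS>|<GIIS>mon_node</GIIS>|' \
  /opt/glite/etc/glite-apel-pbs/parser-config-yaim.xml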

Manchester SL4

Not big news for other sites, but I have installed an SL4 UI in Manchester. It's still 32-bit, because the UIs at the Tier2 are old machines. However, I'd like to express my relief that once the missing third-party rpms were in place the installation went smoothly.

After some struggling with cfengine keys, which I was despairing of solving by the end of the evening, I also managed to install a 32-bit WN. At least cfengine doesn't give any more errors and runs happily.

Tackling dcache now and the new yaim structure.

Thursday 1 November 2007

Sheffield

A quiet night for Sheffield after the reimaged nodes were taken offline in PBS. Matt also increased the number of ssh connections allowed on the CE from 10 to 100, to reduce the timeouts between the WNs and the CE and cut the incidence of Maradona errors.
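
The sshd setting involved is presumably MaxStartups, whose default of 10 matches the old limit (my inference; the option wasn't named):

# Raise the limit on concurrent unauthenticated ssh connections on the CE.
echo 'MaxStartups 100' >> /etc/ssh/sshd_config
service sshd reload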