Thursday 20 December 2007

Lancaster's Winter Dcache Dramas

It's been a tough couple of months for Lancaster, with our SE giving us a number of problems.

Our first drama, at the start of the month, was caused by unforeseen complications with our upgrade to dcache 1.8. Knowing that we were low on the support list due to being only a Tier 2, but emboldened by the highly useful srm 2.2 workshop in Edinburgh and the good few years we've spent in the dcache trenches, we decided to take the plunge. We then faced a good few days of downtime beyond the one we had scheduled: first a number of bugs in the early versions of dcache 1.8 (fixed by upgrading to higher patch levels), then problems because changes in the gridftp handling highlighted inconsistencies between the users on our pnfs node and our gridftp door nodes. Due to a hack long ago, several VOs had different users.conf entries and therefore different UIDs on our door nodes and pnfs node. This never caused problems before, but after the upgrade the doors were passing their uids to the pnfs node, so new files and directories were created with the correct group (as the gids were consistent) but the wrong uid, causing permission troubles whenever a delete was called. This was a classic case of a problem that was hell to track down but, once figured out, thankfully easy to solve. Once we fixed that one it was green tests for a while.

Then dcache drama number two came along a week later: a massive postgres db failure on our pnfs node. The postgres database contains all the information that dcache uses to match the fairly anonymously named files on the pool nodes to entries in the pnfs namespace. Without it dcache has no idea which files are which, so with it bust the files are almost as good as lost, which is why it should be backed up regularly. We did this twice daily, or at least we thought we did: a cron problem meant that our backups hadn't been made for a while, and rolling back to the last one would mean losing a fair amount of data. So we spent three days performing arcane sql rituals to try and bring back the database, but it had corrupted itself too heavily and we had to roll back.

The cause of the database crash and corruption was a "wrap around" error. Postgres requires regular "vacuuming" to clean up after itself, otherwise it essentially starts writing over itself. This crash took us by surprise: not only do we have postgres looking after itself with optimised auto-vacuuming occurring regularly, but during the 1.8 upgrade I took the time to do a manual full vacuum, only a week before this failure. Also, postgres is designed to freeze when at risk of a wraparound error rather than overwrite itself, and this didn't happen. The first we heard of it, pnfs and postgres had stopped responding and there were wraparound error messages in the logs, with no warning of the impending disaster.
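For the record, a minimal sketch of the sort of check we now run by hand to keep an eye on transaction-id age (assuming the default postgres superuser and that all the pnfs/dcache databases live in the one cluster):

psql -U postgres -c "SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC;"
# if any age is creeping towards the wraparound limit, vacuum everything
vacuumdb -U postgres --all --analyze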

Luckily the data rollback seems not to have affected the VOs too much. We had one ticket from Atlas, who, after we explained our situation, handily cleaned up their file catalogues. The guys over at dcache hinted at a possible way of rebuilding the lost databases from the pnfs logs, although sadly this isn't simply a case of recreating pnfs-related sql entries, and they've been too busy with Tier 1 support to look into this further.

Since then we've fixed our backups and added a nagios test to ensure the backups are less than a day old. The biggest trouble here was that our reluctance to use an old backup meant we wasted over three days banging our heads trying to revive a dead database, rather than the few hours it would have taken to restore from backup and verify things were working. And it appears the experiments were more affected by us being in downtime than by the loss of easily replicable data. In the end I think I caused more trouble by going over the top on my data recovery attempts than if I had been gung ho and used the old backup once things looked a bit bleak for the remains of the postgres database. At least we've now set things up so the likelihood of it happening again is slim, but the circumstances behind the original database errors are still unknown, which leaves me a little worried.
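For anyone in the same boat, the gist of what we have now is along these lines (paths and file names are illustrative, not our actual scripts): a twice-daily dump plus a freshness check that nagios can run.

pg_dumpall -U postgres | gzip > /backup/pnfs-$(date +%F-%H).sql.gz
# nagios side: complain if the newest dump is more than a day old
find /backup -name 'pnfs-*.sql.gz' -mmin -1440 | grep -q . \
  || echo "CRITICAL: newest pnfs dump is over a day old"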

Have a good Winter Festival and Holiday Season everyone, but before you head off to your warm fires and cold beers check the age of your backups just in case...

Thursday 6 December 2007

Manchester various

- the core path was set to /tmp/core-various-param in sysctl.conf and was creating a lot of problems for dzero jobs. It was also creating problems for others, as the cores were filling /tmp and consequently maradona errors were looming. The path has been changed back to the default, and I also set the core size to 0 in limits.conf to prevent the same problem repeating itself, to a lesser degree, in /scratch (see the sketch after this list).

- dcache doors were open on the wrong nodes. node_config is the correct one, but it was copied before stopping the dcache-core service and now /etc/init.d/dcache-core stop doesn't have any effect. The doors also have a keep-alive script, so it is not enough to kill the java processes, one has to kill the parents as well.

- cfengine config files are being rewritten to make them less cryptic.
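Regarding the first item, roughly what we ended up with (exact values may differ from our production config):

# /etc/sysctl.conf - back to the default pattern, so cores land in the job's working directory
kernel.core_pattern = core
# /etc/security/limits.conf - stop the pool accounts dumping cores at all
*    soft    core    0
*    hard    core    0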

Monday 19 November 2007

Manchester black holes for atlas

Atlas jobs failing because of the following errors:
====================================================
All jobs fail because 2 bad nodes fail like
/opt/globus/bin/globus-gass-cache: line 4: globus_source: command not found
/opt/globus/bin/globus-gass-cache: line 6: /globus-gass-cache-util.pl: No such file or directory
/opt/globus/bin/globus-gass-cache: line 6: exec: /globus-gass-cache-util.pl: cannot execute: No such file or directory
/opt/globus/bin/globus-gass-cache: line 4: globus_source: command not found
/opt/globus/bin/globus-gass-cache: line 6: /globus-gass-cache-util.pl: No such file or directory
/opt/globus/bin/globus-gass-cache: line 6: exec: /globus-gass-cache-util.pl: cannot execute: No such file or directory
submit-helper script running on host bohr1428 gave error: could not add entry in the local gass cache for stdout
===================================================
Problem caused by

${GLOBUS_LOCATION}/libexec/globus-script-initializer

being empty.
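A quick, hedged way to spot this kind of black hole across the farm (the node list file is hypothetical; use whatever you keep your WN list in):

for wn in $(cat /root/wn-list.txt); do
    ssh $wn '[ -s /opt/globus/libexec/globus-script-initializer ] || hostname'
done
# any hostname printed has an empty (or missing) initializer and will eat jobs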

Tuesday 13 November 2007

Some SAM tests don't respect downtime

Sheffield is shown as failing the CE-host-cert-valid test while in downtime. SAM tests should all behave the same way. This is on top of the very confusing display of the results on alternate lines. I opened a ticket.

https://gus.fzk.de/ws/ticket_info.php?ticket=28983

Sunday 11 November 2007

Manchester sw repository reorganised

To simplify the maintenance of multiple releases and architectures I have reorganised the software (yum) repository in Manchester.

While before we had to maintain a yum.conf for each release and architecture, now we just need to add links in the right place. I wrote a recipe on my favourite site:

http://www.sysadmin.hep.ac.uk/wiki/Yum

This will also allow us to remove the complications introduced in the cfengine conf files to maintain multiple yum.conf versions.
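The gist of the layout (paths are illustrative; the full recipe is on the wiki page above): one tree per release and architecture, with a symlink that the clients point at, so a new release just means repointing the link.

cd /var/www/html/yum/glite-WN
ln -s 3.0.2-sl3-i386   current-sl3-i386
ln -s 3.1.0-sl4-x86_64 current-sl4-x86_64
# clients' yum config references .../glite-WN/current-<os>-<arch>/ and never needs changing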

Friday 9 November 2007

1,000,000th job has passed by

This week the batch system ticked over its millionth job. The lucky user was biomed005, and no, it was nothing to do with rsa768. The 0th job was way back in August 2005 when we replaced the old batch system with torque. How many of these million were successful? I shudder to think, but I'm sure it's improving :-)

In other news, we're having big problems with our campus firewall: it blocks outgoing ports 80 and 443 to ensure that traffic passes through the university proxy server. Unfortunately some web clients such as wget and curl make it impossible to use the proxy for these ports whilst bypassing the proxy for all other ports. Atlas needs this for the new PANDA pilot job framework. We installed a squid proxy of our own (good idea Graeme) which allows for greater control. No luck with handling https traffic, so we really need to get a hole punched in the campus firewall. I'm confident the uni systems guys will oblige ;-)
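Roughly how the worker nodes are pointed at the local squid (hostname, port and URL are illustrative); this covers the port 80 case, while https still goes direct, hence the need for the firewall hole:

export http_proxy=http://squid.hep.lancs.ac.uk:3128
export no_proxy=.lancs.ac.uk       # keep on-site traffic away from the proxy
wget http://panda.example.org/pilot.tar.gz   # wget and curl both honour these variables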

Sheffield in downtime

Sheffield has been put in downtime until Monday 12/11/2007 at 5 pm.


Reason: Power cut affecting much of central Sheffield. A substation exploded. We're not even allowed inside the physics building.

Matt is also back in the GOCDB now as site admin.

Saturday 3 November 2007

Manchester CEs and RGMA problems

We still don't know what happened to ce02, why the ops tests didn't work and why my jobs hung forever while anybody else could run (atlas claims 88 percent efficiency in the last 24 hours). Anyway, I updated ce02 manually (rpm -ihv) to the same set of rpms that are on ce01, and the problem I had, globus hanging, has disappeared. The ops tests are successful again and we got out of the atlas blacklisting. I also fixed ce01, which yesterday picked up the wrong java version. I need to change a couple of things on the kickstart server so that these incidents don't happen again.

I also had to manually kill tomcat, which was not responding, and restart it on the MON box. Accounting published successfully after this.

Friday 2 November 2007

Sheffield accounting

From Matt:

/opt/glite/bin/apel-pbs-log-parser
is trying to contact the ce on port 2170, I think expecting the site bdii to be there.
I changed ce_node to mon_node in the GIIS setting of /opt/glite/etc/glite-apel-pbs/parser-config-yaim.xml
and now things seem much improved.

However, I am getting this

Fri Nov 2 13:48:39 UTC 2007: apel-publisher - Record/s found: 8539
Fri Nov 2 13:48:39 UTC 2007: apel-publisher - Checking Archiver is Online
Fri Nov 2 13:49:40 UTC 2007: apel-publisher - Unable to retrieve any response while querying the GOC
Fri Nov 2 13:49:40 UTC 2007: apel-publisher - Archiver Not Responding: Please inform apel-support@listserv.cclrc.ac.uk
Fri Nov 2 13:49:40 UTC 2007: apel-publisher - WARNING - Received a 'null' result set while querying the 'LcgRecords' table using rgma, this probably means the GOC is currently off-line, will therefore cancel attempt to re-publish

running /opt/glite/bin/apel-publisher on the mon box.

If the GOC machine is really off-line, I'll have to wait to publish the missing data for Sheffield.

Manchester SL4

Not big news for other sites, but I have installed an SL4 UI in Manchester. It's still 32bit because the UIs at the Tier2 are old machines. However I'd like to express my relief that once the missing third-party rpms were in place the installation went smoothly.

After some struggling with cfengine keys, which I was despairing of solving by the end of the evening, I also managed to install a 32bit WN. At least cfengine doesn't give any more errors and runs happily.

Tackling dcache now and the new yaim structure.

Thursday 1 November 2007

Sheffield

A quiet night for Sheffield after the reimaged nodes were taken offline in PBS. Matt also increased the number of ssh connections allowed on the CE from 10 to 100 to reduce the timeouts between the WNs and the CE and reduce the incidence of Maradona errors.

Wednesday 31 October 2007

Manchester hat tricks

Manchester CE ce02 has been blacklisted by atlas since yesterday because it fails the ops tests, and therefore it is also failing the Steve Lloyd tests and has availability 0. However there is no apparent reason why these tests should fail. Besides, ce02 is doing some magic: there were 576 jobs running from 5 different VOs when I started writing this, among them atlas production jobs, and now, 12 hours later, there are 1128. I'm baffled.

Tuesday 30 October 2007

Regional VOs

vo.northgrid.ac.uk
vo.southgrid.ac.uk

have both been created, with no users in them yet. We probably need to enable them at sites to get more progress.

user certificates: p12 to pem

Since I was renewing my certificate, I added a small script (p12topem.sh) to the subversion repository to convert users' p12 certificates into pem format and set their unix permissions correctly. I linked it from here:

https://www.sysadmin.hep.ac.uk/wiki/CA_Certificates_Maintenance

It assumes the $HOME/.globus/user*.pem names. It doesn't therefore handle host certificates, but could easily be extended.
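The core of the conversion is just the two usual openssl calls plus the permissions (the script in the repository does a little more checking than this sketch; the p12 file name is illustrative):

openssl pkcs12 -in usercert.p12 -clcerts -nokeys -out $HOME/.globus/usercert.pem
openssl pkcs12 -in usercert.p12 -nocerts -out $HOME/.globus/userkey.pem
chmod 644 $HOME/.globus/usercert.pem
chmod 400 $HOME/.globus/userkey.pem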

Monday 29 October 2007

Links to monitoring pages update

I added three links to the FCR, one per experiment, with all the UK sites selected. Hopefully it will make it easier to find out who has been blacklisted.

http://www.gridpp.ac.uk/wiki/Links_Monitoring_pages


I also added a GridMap link and linked Steve's monitoring, both the generic dteam and the atlas versions, plus the quarterly summary plots.

Friday 19 October 2007

Sheffield latest

Trying to stabilize Sheffield cluster.
After the scheduled power outage the nodes didn't restart properly and some of the old jobs needed to be cleaned up. After that the cluster was ok apart from the BDII dropping out. We have applied the famous Kostas patch

https://savannah.cern.ch/bugs/?16625


which is getting into the release after 1.5 years. Hurray!!!

The stability of the BDII has improved and DPM seems stable. The SAM tests have been stable over the weekend, and today Steve's Atlas tests showed a 96% availability, which is a big improvement. However the cluster filled up this morning and the instability reappeared, a sign that there is still something to fix on the worker nodes and in the scheduling. We have added a reservation for ops and are looking at the WNs, some of which were re-imaged this morning.

Thursday 18 October 2007

Manchester availability

SAM tests, both ops and atlas, were failing due to dcache problems. Part of it was due to the fact that Judit has changed her DN and somehow the cron job to build the dcache kpwd file wasn't working. In addition to that, dcache02 had to be restarted (both core and pnfs); as usual it started to work again after that, without any apparent reason why it failed in the first place. gPlazma is not enabled yet.

That's mostly the reason for the drop in October.

Monday 15 October 2007

Availability update

Lancs site availability looks OK for the last month at 94%, which is 13% above the GridPP average, and this includes a couple of weekends lost to dCache problems. The record from July-September has been updated on Jeremy's page. We still get the occasional failed SAM submission; no idea what causes these, but they prevent the availability from reaching the high nineties.

  • The June-July instability was a dCache issue with the pnfs mount options; this only affected SAM tests, where files are created and immediately removed.
  • The mid-August problems were SL4 upgrade problems, caused by a few blackhole WNs. This was tracked down to the jpackage repository being down, which broke the auto-install of some WNs.
  • The mid-September problems were caused by adding a new dCache pool; it won't be brought online until the issue is understood.
Job slot occupancy looks ok, with non-HEP VOs like fusion and biomed helping to fill the slots left by moderate Atlas production.

Friday 12 October 2007

Sys Admin Requests wiki pages

YAIM has a new wiki page for sys admin requests. Maria has sent an announcement to LCG-ROLLOUT. For bookkeeping, I added a link and explanations to the sys admin wiki wishlist page, which also links to the ROC admins' management tools requests.

http://www.sysadmin.hep.ac.uk/wiki/Wishlist

Tuesday 9 October 2007

BDII doc page

After the trouble Sheffield went through with the BDII I started a BDII page on the sysadmin wiki.

http://www.sysadmin.hep.ac.uk/wiki/BDII

Monday 8 October 2007

Manchester RGMA fixed

Fixed RGMA in Manchester. It had, for still obscure reasons, wrong permissions on the host key files. Started an RGMA troubleshooting page on the sysadmin wiki:

http://www.sysadmin.hep.ac.uk/wiki/RGMA#RGMA

EGEE '07

EGEE conference. I've given a talk in the SA1-JRA1 session, which seems to have had a positive result and will hopefully have some follow up.

Talk can be found at

http://indico.cern.ch/materialDisplay.py?contribId=30&sessionId=49&materialId=slides&confId=18714


and is the sibling of the one we gave in Stockholm at the Ops workshop on problems with SA3 and within SA1.

http://indico.cern.ch/contributionDisplay.py?contribId=25&confId=12807

which had some follow up with SA3 that can be found here

https://savannah.cern.ch/task/?5267

It's alive!

Sheffield problems reviewed

During the last update to the DPM I ran into several problems:

1. The DPM update failed due to changes in the way passwords are stored in mysql
2. A misunderstanding with the new version of yaim that rolled out at the same time
3. Config errors with the sBDII
4. mds-vo-name
5. Too many roll outs in one go for me to have a clue which one broke things and where to start looking.

DPM update fails
I would like to thank Graeme for the great update instructions, they helped lots. The problems came when the update script used a different hashing method to the one used by mysql; the problem is described here http://<>. This took some finding. It also means every time we run the yaim config on the SE we have to go back and fix the passwords again, because yaim still uses the old hash, not the new one.
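For reference, the sort of fix that worked for us (the account name is illustrative and may differ at your site): either re-hash the DPM database user's password with the old scheme, or force the old hashing globally so yaim-set passwords keep working.

mysql> SET PASSWORD FOR 'dpmmgr'@'localhost' = OLD_PASSWORD('the_dpm_password');
# or in /etc/my.cnf under [mysqld]:
old_passwords=1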

Yaim update half way and config errors
This confused the hell out of me: one minute I'm using the yaim scripts to run updates, the next I have an updated version of yaim that I had to pass flags to, and this is where I guess I started to make the mistakes that led to me setting the SE up as an sBDII. After getting lost with the new yaim I told the wrong machine that it was an sBDII and never realised.

mds-vo-name
With the help of Henry, we found out that our information was wrong, i.e. we had

mds-vo-name=local, which is now mds-vo-name=resource

Once this was changed in the site-info.def and yaim was re-run on our mon box, which is also our sBDII, it all seemed to work.

Tuesday 25 September 2007

Sheffield

Hi all

Sorry it's been so quiet on the Sheffield front. I've been out of the country, and it's currently registration here.

What's the state of the LCG here? I feel like I'm chasing my tail, hence there will shortly be a batch of emails asking for help from TB-Support. I have noticed there have been several updates while I've been away, so I will apply them before getting back to the main problem of why our SE seems not to have an entry in the BDII.

Friday 14 September 2007

Latest on WN SL4/64 upgrade

I've created a gridpp wiki page which lists the cfengine config we're using to satisfy various VO requirements. Things have changed recently, with Atlas no longer requiring a 32 bit version of python to be installed; it's now included in the KITS release. We still have build problems with release 12.0.6, as used by Steve's tests, so I would be interested to see how others get on with that. The Atlas experts advise a move to the 13.0.X branch. Atlas production looks healthy again with plenty of queued jobs for the weekend, so hopefully smooth sailing from now on.

Advice to Atlas sites upgrading to SL4/64:
  • Expect failures when building code with release 12.0.6

Sunday 26 August 2007

Some new links about security

This article is an interesting example of how even someone with very little experience can still do some basic forensics.

http://blog.gnist.org/article.php?story=HollidayCracking

I added the link under the forensics section of the sys admin wiki

http://www.sysadmin.hep.ac.uk/wiki/Basic_Security#Forensic

Since I was at it I added a firewall section to

http://www.sysadmin.hep.ac.uk/wiki/Grid_Security#Firewall_configuration_and_Services_ports

Dcache Troubleshooting page

My tests on dcache started to fail for obscure reasons due to gsidcap doors misbehaving. I started a troubleshooting page for dcache:

http://www.sysadmin.hep.ac.uk/wiki/DCache_Troubleshooting

Thursday 23 August 2007

Another biomed user banned

Manchester has banned another biomed user for filling /tmp and causing trouble for other users. I opened a ticket.

https://gus.fzk.de/ws/ticket_info.php?ticket=26147

Wednesday 22 August 2007

WNs installed with SL4

Last Wednesday was a scheduled downtime for Lancaster in order to do the SL4 upgrade on the WNs, as well as some assorted spring cleaning of other services. We can safely say this was our worst upgrade experience so far: a single day turned into a three-day downtime. Fortunately for most, this was self-inflicted pain rather than middleware issues. The fabric stuff (PXE, kickstart) went fine; our main problem was getting consistent users.conf and groups.conf files for the YAIM configuration, especially with the pool sgm/prd accounts and the dns-style VO names such as supernemo.vo.eu-egee.org (see the sketch below). The latest YAIM 3.1 documentation provides a consistent description, but our CE still used the 3.0 version so a few tweaks were needed (YAIM 3.1 has since been released for glite 3.0). Another issue was due to our wise old CE (lcg-CE) having a lot of crust from previous installations, in particular some environment variables which affected the YAIM configuration such that the newer vo.d/ files were not considered. Finally, we needed to ensure the new sgm/prd pool groups were added to the torque ACLs, but YAIM does a fine job of this should you choose to use it along with the _GROUP_ENABLE variables.
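A hedged sketch of the sort of users.conf/groups.conf entries involved (UIDs, GIDs, account names and FQANs are all illustrative; check the YAIM 3.1 documentation for the authoritative format):

# users.conf: UID:LOGIN:GID(s):GROUP(s):VO:FLAG:
45001:sno001:4500:snemo:supernemo.vo.eu-egee.org::
45101:snosgm01:4510,4500:snemosgm,snemo:supernemo.vo.eu-egee.org:sgm:
# groups.conf: map the VOMS FQAN onto the account flag
"/supernemo.vo.eu-egee.org/ROLE=lcgadmin":::sgm:
"/supernemo.vo.eu-egee.org"::::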

Anyway, things look good again, with many biomed jobs, some atlas, dzero and hone, and even a moderate number of lhcb jobs, which are supposed to have issues with SL4.


On the whole the YAIM configuration went well, although the VO information at the CIC portal could still be improved with mapping requirements from VOMS groups to GIDs. LHCb provide a good example to other VOs, with explanations.

Monday 20 August 2007

Week 33 start of 34

The main goings-on have been the DPM update; this was hampered by password version problems in MySQL and was resolved with help from here: http://www.digitalpeer.com/id/mysql . More problems came with the change to the BDII: new firewall ports were opened and yet there was still no data coming out.

The BDii was going to be fixed today, however Sheffield has suffered several power cut over the last 24 hours. This has affected the hole of the LCG here, recovery work is ongoing.

Tuesday 14 August 2007

GOCDB3 permission denied

I can't edit NorthGrid sites anymore. I opened a ticket.

https://gus.fzk.de/pages/ticket_details.php?ticket=25846

I would be mildly curious to know if other people are experiencing the same or if I'm the only one.

Monday 13 August 2007

Manchester MON box overloaded again with a massive amount of CLOSE_WAIT connections.

https://gus.fzk.de/pages/ticket_details.php?ticket=25647

The problem seems to have been fixed, but it affected the accounting for 2 or 3 days.

lcg_utils bug closed

Ticket about lcg_util bugs has been answered and closed

https://gus.fzk.de/pages/ticket_details.php?ticket=25406&from=allt

The correct version of the rpms to install is

[aforti@niels003 aforti]$ rpm -qa GFAL-client lcg_util
lcg_util-1.5.1-1
GFAL-client-1.9.0-2

Update (2007/08/17): The problem was incorrect dependencies expressed in the meta rpms. Maarten opened a savannah bug.

https://savannah.cern.ch/bugs/?28738



Friday 10 August 2007

Updating Glue schema

This is sort of old news, as the request to update the BDII is one month old.

To update the Glue schema you need to update the BDII on the BDII machine and on the CE and SE (dcache and classic). A DPM SE uses a BDII instead of globus-mds now, so you should check the recipe for that.

The first problem I found was that

yum update glite-BDII

doesn't update the dependencies but only the meta-rpm. Apparently it works with apt-get but not with yum. So if you use yum you have 3 alternatives:

1) yum -y update and risk screwing up your machine
2) yum update and check each rpm
3) Look at the list of rpms here

http://glite.web.cern.ch/glite/packages/R3.0/deployment/glite-BDII/3.0.2-12/glite-BDII-3.0.2-12.html

and yum update only the rpms listed there.

Reconfiguring the BDII doesn't pose a threat so you can

cd
./scripts/configure_node BDII_site

On the CE and SE... you could upgrade the CE and SE and reconfigure the nodes. But I didn't want to do that, because you never know what might happen, and with the farm full of jobs and the SE being dcache I don't see the point in risking it for a schema upgrade. So what follows is a simple recipe to upgrade the glue schema on a CE and an SE other than DPM without reconfiguring the nodes.

service globus-mds stop
yum update glue-schema
cd /opt/glue/schema
ln -s openldap-2.0 ldap
service globus-mds start

To check that it worked:

ps -afx -o etime,args | grep slapd

if your BDII is not on the CE and you find slapd instances on ports 2171-2173, it means you are running a site BDII on your CE as well, and you should turn it off and remove it from the startup services.

The ldap link is needed because the schema path has changed, and unless you want to edit the configuration file (/opt/globus/etc/grid-info-slapd.conf) the easiest thing is to add a link.

Most of this is in this ticket

https://gus.fzk.de/pages/ticket_details.php?ticket=24586&from=allt

including where to find the new schema documentation.

Thursday 9 August 2007

Documentation for Manchester local users

Yesterday, at a meeting with Manchester users who have tried to use the grid, it turned out that what they missed most is a page collecting the links to information scattered around the world (a common disease). As a consequence we have started pages to collect information useful to local users of the grid.

https://www.gridpp.ac.uk/wiki/Manchester


The current links are of general usefulness. Users will add their own personal tips and tricks later.

Wednesday 8 August 2007

How to check accounting is working properly

Obviously, when you look at the accounting pages there is a graph at the bottom showing the running VOs, but that is not straightforward. Two other ways are:

The accounting enforcement page, showing sites that are not publishing and for how many days they haven't published:

http://www3.egee.cesga.es/acctenfor

which I linked from

https://www.gridpp.ac.uk/wiki/Links_Monitoring_pages#Accounting

or you could set up RSS feeds as suggested in the Apel FAQ.

I also created an Apel page with this information on the sysadmin wiki

http://www.sysadmin.hep.ac.uk/wiki/Apel

Monday 6 August 2007

Progress on SL4

As part of our planned upgrade to SL4 at Manchester, we've been looking at getting dcache running.
The biggest stumbling block is the lack of a glite-SE_dcache* profile; luckily it seems that all of the needed components apart from dcache-server are in the glite-WN profile. Even the GSIFtp door appears to work.

Friday 3 August 2007

Green fields of Lancaster

After sending the dcache problem the way of the dodo last week, we've been enjoying 100% SAM test passes over the past 7 days. It's nice to have to do next to nothing to fill in your weekly report. Not a very exciting week otherwise, odd jobs and maintenance here and there. Our CE has been very busy over the last week, which has caused occasional problems with the Steve Lloyd tests: we've had a few failures due to there being no job slots available, despite measures to prevent that. We'll see if we can improve things.

We're gearing up for the SL4 move. After Monday's very useful NorthGrid meeting at Sheffield we have a time frame for it: sometime during the week starting the 13th of August. We'll pin it down to an exact day at the start of the coming week. We've taken a worker offline as a guinea pig and will do hideous SL4 experimentations on it. The whole site will be in downtime from 9-5 on the day we do the move; with luck we won't need that long, but we intend to use the time to upgrade the whole site (no SL3 kernels will be left within our domain). Luckily for us Manchester have offered to go first in NorthGrid, so we'll have veterans of the SL4 upgrade nearby to call on for assistance.

Thursday 2 August 2007

lcg-utils bugs

https://gus.fzk.de/pages/ticket_details.php?ticket=25406

Laptop reinstalled

EVO didn't work on my laptop. I reinstalled it with the latest version of ubuntu and java 1.6.0. It works now. To my great disappointment, facebook aquarium still doesn't ;-)

Fixed Manchester accounting

https://www.ggus.org/pages/ticket_details.php?ticket=25215

Glue Schema 2.0 use cases

Sent two broadcasts to collect Glue Schema use cases for the new 2.0 version. Received only two replies.

https://savannah.cern.ch/task/index.php?5229

How to kill a hanging job?

There is a policy being discussed about this. See:

https://www.gridpp.ac.uk/pmb/docs/GridPP-PMB-113-Inefficient_Jobs_v1.0.doc

written by Graeme and Matt.

Part of the problem is that the user doesn't see any difference between a job that died and one that was killed by a system administrator. One of the requests is to have the job wrapper catch the signal the standard tools send, so that an appropriate message can be returned and possibly also some cleanup done. This last part is being discussed at the TCG.

https://savannah.cern.ch/task/index.php?5221

SE downtime

Tried to publish GlueSEStatus to fix the SE downtime problem

https://savannah.cern.ch/task/?5222

Connected to this is ggus ticket

https://www.ggus.org/pages/ticket_details.php?ticket=24586


which was originally opened to get a recipe for sites to upgrade the BDII in a painless way.

VO deployment

Sent comments for final report of VO deployment WG to Frederic Schaer.

I wrote a report about this over a year ago:

https://mmm.cern.ch/public/archive-list/p/project-eu-egee-tcg/Why%20it%20is%20a%20problem%20adding%20VOs.-770951728.EML?Cmd=open

The comments in my email to the TCG are still valid.

I think the time estimated to find the information for a VO is too short. It takes more than 30 mins, and normally people ask other sys admins. I have found the cic portal tool inadequate up to now. It would be better if the VOs themselves maintained a yaim snapshot on the cic portal that can be downloaded, rather than inventing a tool. In the UK that's the approach we chose in the end to avoid this problem.

http://www.gridpp.ac.uk/wiki/GridPP_approved_VOs

This is maintained by sysadmins and it is only site-info.def. groups.conf is not maintained by anyone, but it should be; at the moment sysadmins are simply replicating the default, and when a VO like LHCb or dzero deviates from that there is trouble.

2) YAIM didn't use to have a VO creation/deletion function for each service that could be run on its own. It reconfigures the whole service, which makes sys admins wary of adding a VO in production in case something goes wrong in other parts. From your report this seems to be still the case.

Dashboard updated

Dashboard updated with new security advisories link
https://www.gridpp.ac.uk/wiki/Northgrid-Dashboard#Security_Advisories

Sheffield July looking back

July had two periods of outage, the worst being at the start of the month just after a gLite 3.0 upgrade; it took a bit of time to find the problem and the solution.

Error message: /opt/glue/schema/ldap/Glue-CORE.schema: No such file or directory
ldap_bind: Can't contact LDAP server
Solution was found here: http://wiki.grid.cyfronet.pl/Pre-production/CYF-PPS-gLite3.0.2-UPDATE33

At the end of the month we had a strange error that was spotted quickly and turned out to be the result of a DNS server crash on the LCG here at Sheffield, which stopped the worker nodes' IPs resolving.

Sheffield hosted the monthly NorthGrid meeting, and all in all it was a good event.

Yesterday the LCG got its own dedicated 1gig link to YHman and beyond; we also now have our own firewall, which will make changes quicker and easier.

Fun at Manchester SL4, lcg_util and pbs

In the midst of getting a successful upgrade-to-SL4 profile working, we upgraded our release of lcg_util from 1.3.7-5 to 1.5.1-1. This proved to be unsuccessful: SAM test failures galore. After looking around for a solution on the internet I settled for rolling back to the previous version; thanks to the wonders of cfengine this didn't take long, and happily cfengine should be forcing that version onto all nodes.

This morning I came in to find we were again failing the SAM tests, this time with the ever-so-helpful
"Cannot plan: BrokerHelper: no compatible resources"


This pointed to a problem deep in the depths of the batch system. Looking at our queues (via showq), there were a lot of Idle jobs yet more than enough CPUs. The PBS logs revealed a new error message,
Cannot execute at specified host because of checkpoint or
stagein files
for two of the jobs. Eventually I managed to track it down to a node. Seeing as there wasn't any sign of the job file any more, and pbs was refusing to re-run the job on another node, I had to resort to the trusty `qdel`. After thinking about it for the barest of moments, all of the Idle jobs woke up and started running.
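Roughly the sequence, with an illustrative job id rather than the real one:

qstat -f 1234567 | grep -i 'job_state\|exec_host'   # find where the stuck job thinks it is
qrun 1234567      # refused with the checkpoint/stagein error
qdel -p 1234567   # purge it; moments later the Idle jobs started running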

Just for some gratuitous cross-linking, Steve Traylen appears to have provided a solution over at the ScotGrid blog.

Friday 27 July 2007

It's always a good feeling putting a timely end to a long-running problem. We've been plagued by the pnfs permission denied error for a little over a month, but hopefully (touch wood) we won't be seeing it again.

So what was the problem? It appears to have been a gridftp door nfs mount synchronisation problem, or something like that. Essentially, the nfs mounts that the gridftp doors read occasionally failed to update quickly enough, so tests like the SAM tests that copy in a file and then immediately try to access it occasionally barfed: the door checked to see if the file was there, but as its view hadn't been updated yet it wasn't. A subsequent test might hit the door after it had synced, or hit an already-synced door, and thus pass. I only tracked this down after writing my own transfer tests and finding I only failed when copying out "fresh" files. When I mentioned this during the UKI meeting, Chris Brew pointed me towards a problem with my PNFS mounts and Greig helped point me towards what they should be (thanks dudes!). I found that my mounts were missing the "noac" (no attribute caching) option; this is what Greig and Chris have on their mounts and is recommended for multiple clients accessing one server. So after a quick remount of all my doors things seem miraculously all better: my homemade tests all worked and Lancaster's SAM tests are a field of calming green. Thanks to everyone for their help and ideas.
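For anyone with the same symptoms, the relevant bit is just the mount options on each door; the server name, mount point and the other options below are illustrative, noac is the important one:

pnfsnode.lancs.ac.uk:/pnfs   /pnfs   nfs   rw,intr,hard,noac   0 0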

In other news, we're gearing up for the SL4 move; ETA for that is about a fortnight. We plan to lay the groundwork, then try to reinstall all the WNs in one day of downtime. We'll have to see if this is a feasible plan.

Have a good weekend all, I know I will!

Friday 20 July 2007

Replica manager failure solution? Permission denied

We're still facing down the same dcache problem. Consultation with the experts has directed us to the fact that the problem isn't (directly) a gPlazma one: the requests are being assigned to the wrong user (or no user) and suffering permission problems, even when immediately preceding and following access requests of a similar nature with the same proxy succeed without a glitch. The almost complete lack of any sign of this problem in the logs is increasing the frustration. I've sent the developers every config file they've asked for and some they didn't, upped the logging levels and scoured logs till my eyes were almost bleeding. And I haven't a clue how to fix this thing. There's no real pattern, other than the fact that failures seem to happen more often during the working day (but then it's a small statistical sample), and there are no corresponding load spikes. We have a ticket open against us about this and the word quarantine got mentioned, which is never a good thing. It sometimes feels like we have a process running in our srm that rolls a dice every now and again, and if it comes up as a 1 we fail our saving throw vs. SAM test failure. If we could just find a pattern or cause then we'd be in a much better position. All we can do is keep scouring, maybe apply the odd tweak, and see if something presents itself.

Friday 13 July 2007

Torque scripts

Added two scripts to the sysadmin repository:

https://www.sysadmin.hep.ac.uk/wiki/Nodes_Down_in_Torque

https://www.sysadmin.hep.ac.uk/wiki/Counting_Jobs

Happy Friday 13th

Business as usual for Lancaster, with CA updates and procuring new hardware on our agenda this week. We're also being plagued by a(nother) weird intermittent dcache problem that's sullying what would be an otherwise flawless SAM test run. We've managed to track it down to some kind of authorisation/load problem after chatting to the dcache experts (which makes me groan after the fun and games of configuring gPlazma to work for us). At least we know which tree we need to bark up; hopefully the next week will see the death of this problem and the green 'ok's will reign supreme for us.

Pretty graphs via MonAMI


Paul in Glasgow has written ganglia scripts to display a rich set of graphs using torque info collected by MonAMI. We've added these to the Lancaster ganglia page. Installation was straightforward, although we had some problems due to the ancient version of rrdtools installed on this SL3 box. Paul made some quick patches, things are now compatible with this older version, and the results are not too different compared to ScotGrid's SL4 example. Useful stuff!

Wednesday 11 July 2007

Site BDII glite 3.0.2 Update 27

Updating the glite-BDII meta rpm doesn't update the rpms to the required version. I opened a ticket.

http://gus.fzk.de/pages/ticket_details.php?ticket=24586


Monday 9 July 2007

glite CE passing arguments to Torque

Started testing the glite CE passing arguments to the Torque server. Installed a gCE with two WNs; it should be inserted in a BDII at IC. The script that does the translation from Glue schema attributes to submission parameters is not there; I have the LSF equivalent. The script /opt/glite/bin/pbs_submit.sh works standalone, so I could start to look at it on my laptop where I have a 1 WN system.

The problem is not passing the arguments but the number of ways a sys admin can set up a configuration to do the same thing. For the memory parameters it is not a very big problem, but for the rest of the possible configurations it is. A discussion on a standard configuration is required. Snooping around, most of the sites that allow connections have the standard YAIM configuration, but possibly that's because quite a few things break if they don't.
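Just to make concrete what the translation amounts to: in torque terms the end product is simply extra resource requests on the submission, something like the following (queue name and values are illustrative):

qsub -q atlas -l mem=1024mb -l vmem=2048mb job.sh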

Sheffield week 28

Now begins the untangling of what went wrong here at Sheffield.
So far the main problem seems to be with the CE.

On the plus side, the DPM patching went well.

Sheffield week 27

Arrrr, it's all gone wrong: things are very broken and there's not enough time to fix them.

Power outages

Friday 29 June 2007

Liverpool Weekly Update

Most of the work at Liverpool this week was the usual networking and ongoing hardware repairs. We should have some more nodes becoming available (approximately 50) over the next few weeks thanks in particular to the work Dave Muskett has been doing.

We've started to look at configuration deployment systems, i.e. cfengine (Colin's talk at HEPSysMan on this was helpful), Puppet and Quattor. We're presently evaluating Puppet on some of our non-LCG systems, and we look forward to discussing this subject at the next technical meeting.

And as mentioned last week, Paul Trepka is in the process of adding support for additional VOs. A couple of errors were encountered in this process yesterday resulting in failed SAM tests overnight, but these have (hopefully!) been rectified now.

Lancaster Weekly Update

This week has not been fun, having been dominated by a dcache configuration problem that's caused us to fail a week's worth of SAM tests. The gPlazma plugin, the voms module for dcache, had started complaining about not being able to determine which username to map an access request to from a given certificate proxy. This problem was made worse by dcache then not falling back to using the "old fashioned" kpwd file method of doing the mapping, so users were getting "access denied" type messages. Well, all users except some, who had no problem at all; these privileged few included my good self, so diagnosing this problem involved a lot of asking Brian if it was still broken.

After some masterful efforts from Greig Cowan and Owen Synge we finally got things back up and running again. Eventually we fixed things by:

Upgrading to Owen's latest version of glite-yaim (3.0.2.13-3), and installing his config script dcacheVoms2GPlasma. After some bashing and site-info tweaking this got the gPlazma config files looking a bit more usable.

Fiddling with the permissions so that the directories in pnfs were group writable (as now some users were being mapped to the sgm/prd vo accounts).

Upgrading to dcache-server-1.7.0-38.

Between all these steps we seem to have things working. We're still unsure why things broke in the first place, why gPlazma wouldn't fall back to the kpwd way of doing things or why it still worked for some and not for others. I'd like to try and get to the bottom of these things before I draw a line under this problem.

Thursday 28 June 2007

Sheffield week26

Wet wet and did I mention it rained here.

Not much to report to do with the cluster; it is still up and running, although we started failing Lloyd's tests yesterday afternoon. I will look into this when I get time.

The University's power is in a state of "At Risk" until midday Friday. As a result Sheffield might go offline without warning.

Monday 25 June 2007

Manchester weekly update

This week, after passing the Dell Diagnostic Engineer course, I've been diagnosing Dell hardware issues, and getting Dell to provide component replacements, or send an engineer. Finally they aren't treating us like a home user. I've also been sorting out issues between a recently installed SL4 node, kickstart leaving partitions intact, and cfengine.

Colin has been working on a new nagios plugin (and no doubt other things).

Friday 22 June 2007

Liverpool Weekly Update

This week's work at Liverpool was mostly a continuation of last week's - more networking as we bring the new firewall/router into operation, and more Nagios and Ganglia tweaking as we add more systems and services to the monitoring.

Plans to add a second 1Gbps link from our cluster room to Computing Services to create a combined 2Gbps link, along with an additional third 1Gbps link to a different building for resilience, have taken a step forward. A detailed proposal for this has now been agreed with Computing Services and funding approved.

Alessandra made a useful visit yesterday, providing help with adding additional VOs (which is being done today, all going well) and investigating problems with ATLAS software installation amongst other things.

Monday 18 June 2007

Flatline!


Last week was moderately annoying for the Lancaster CE, with hundreds of jobs immediately failing on WNs due to skewed clocks. The ntpd service was running correctly, so we were in the dark about the cause. After trying to re-sync manually with ntpdate it was apparent something was wrong with the university ntp server; it only responded to a fraction of requests. It turned out to be a problem with the server "ntp.lancs.ac.uk", which is an alias for these machines:
ntp.lancs.ac.uk has address 148.88.0.11
ntp.lancs.ac.uk has address 148.88.0.8
ntp.lancs.ac.uk has address 148.88.0.9
ntp.lancs.ac.uk has address 148.88.0.10

Only 148.88.0.11 is responding, so I raised a ticket with ISS and look forward to a fix. In the meantime the server has been changed to 148.88.0.11 in the ntp.conf file managed by cfengine, and it's been rolled out without a problem.
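For reference, the check amounted to querying each address behind the alias and seeing which ones answered (run from any node):

for ip in 148.88.0.8 148.88.0.9 148.88.0.10 148.88.0.11; do
    ntpdate -q $ip | tail -1      # -q queries only, it doesn't set the clock
done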

Just to stick the boot in, an unrelated issue caused our job slots to be completely vacated over the weekend and we've started to fail Steve's Atlas tests. This is due to a bad disk on a node which went read-only. I need to find the exact failure mode in order to make yet another WN health check; this one slipped past the existing checks. :-( I'm currently at Michigan State Uni (Go Spartans!) for the DZero workshop and the crippled wireless net makes debugging painful.

Friday 15 June 2007

Lancaster Weekly Update -the Sequel

A bit of an unexciting week. A lot of intermittent short (one test) replica manager failures might point to a small stability issue for our SE; however, SAM problems this week have prevented me from finding the details of each failure. The best I could do was, if we failed a test, poke our SRM to make sure that it was working. The trouble looking at the SAM records made this week's weekly report for Lancaster quite dull.

After last week's PNFS move, and postgres mysteriously behaving after being restarted a few times on Tuesday, the CPU load on our SRM admin node is now under control, which should greatly improve our performance and make timeouts caused by our end a thing of the past. Fingers crossed.

Another notable tweak this week was an increase in the number of pool partitions given to atlas; they now have exclusive access to 5/6 of our dcache. Our dcache is likely to grow in size in the near future as we reclaim a pool node that was being used for testing, which will increase our SRM by over 10TB; this 10TB will be split in the same way as the rest of the dcache.

My last job with the SRM (before we end up upgrading to the next dcache version, whenever that comes) is to deal with a replica infestation. During a test of the replica manager quite a while ago now, we ended up with a number of files replicated 3-4 times, and for some reason all the replicas were marked as precious, preventing them from being cleaned up via the usual mechanisms. Attempts to force the replica manager to clean up after itself have failed; even giving it weeks to do its job yielded no results. It looks like we might need a VERY carefully written script to clean things up and remove the few TB of "dead space" we have at the moment.

Liverpool Weekly Report

Recent work at Liverpool has included:
  • Monitoring improvements - I've configured Nagios and John Bland is rolling out Ganglia, both of which have already proved very useful. We're also continuing to work on improving environmental monitoring here, particularly as relates to detecting failures in the water-cooling system.
  • Significant hardware maintenance, including replacing two failed Dell Powerconnect 5224 switches in a couple of the water-cooled racks with new HP Procurve 1800s - more difficult than it should be due to the water-cooling design - and numerous node repairs.
  • Network topology improvements, including installation of a new firewall/router.

Most of this week was spent trying to identify the reason why Steve Lloyd's ATLAS tests were mostly being aborted and why large numbers of ATLAS production jobs were failing here, mostly with the EXECG_GETOUT_EMPTYOUT error. I eventually identified the main problem as being with the existing ssh configuration on our batch cluster, where a number of host keys for worker nodes were missing from the CE. This (along with a couple of other issues) has now been fixed, and hopefully we'll see a large improvement in site efficiency as a result.
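A sketch of the sort of fix involved (the node list file is hypothetical; use however you enumerate your WNs): gather the workers' host keys on the CE so ssh/scp-based output retrieval stops being refused.

for wn in $(cat /root/wn-list.txt); do
    ssh-keyscan -t rsa $wn
done >> /etc/ssh/ssh_known_hosts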

While investigating this, I also noticed a large number of defunct tar processes left over on multiple nodes by the atlasprd user, which had been there for up to 16 days. We're not sure what caused these processes to fail to exit, so any insights on that would be welcome.

Finally, Paul Trepka has been bringing up a new deployment system for the LCG racks - see him for details.

Sheffield week 24

I think I'm slowly getting my head round all this now {but don't test me ;)}

Technically there is not much new to report; some down workers have had new disks put in them. Plans are being made to upgrade the workers and to finish sorting out Andy's legacy.

The main problem is that the building where the machine room is housed is a no-access building site, and I have been warned about a power outage in July.

Wednesday 13 June 2007

Sheffield Update

I go away for a long weekend and we start failing SAM tests again. After a few emails from Greig and some time waiting for the next tests, we are now passing all the tests.

Our failings over the past few weeks seem to be down to one of two things: cert upgrades not automatically working on all machines, and me not knowing when and how to change the DN.

We have fixed the gridice information about disk sizes on the SE, as well as looking into adding more pools.

back to my day job

Friday 8 June 2007

Lancaster Weekly Update

A busy week for Lancaster on the SE front. We had the "PNFS move", where the PNFS services were moved from the admin node onto their own host. There were complications, mainly caused by the fact that the recipe I found was missing one or two key details that I overlooked when preparing for it.

I am going to wikify my fun and games, but essentially my problems can be summed up as:

Make sure that in node_config both the admin and pnfs nodes are marked down as "custom" for their node type. Keeping the admin node as "admin" causes it to want to run PNFS as well.

In the pnfs exports directory make sure the srm node is in there, and that on the srm node the pnfs directory from the pnfs node is mounted (similar to how it's mounted on the door nodes, although not quite the same; to be honest I'm not sure I have it right, but it seems to work).

Start things in the right order: the pnfs server on the PNFS node, the dcache-core services on the admin node, then the PNFSManager on the PNFS node. I found that after a restart of the admin node services I had to restart the PNFSManager. I'm not sure how to fix this to enable automatic startup of our dcache services in the correct order.

Make sure that postgres is running on the admin node; it won't produce an error on startup if postgres isn't up (as it would have done when pnfs was running on that node), but transfers will simply not work.

Don't do things with a potential to go wrong on a Friday afternoon if you can avoid it!

Since the move we have yet to see a significant performance increase, but then it has yet to be seriously challenged. We performed some more postgres housekeeping on the admin node after the move, which made it a lot happier. Since the move we have noticed occasional srm SFT failures with a "permission denied" type error, although checking things in the pnfs namespace we don't see any glaring ownership errors. I'm investigating it.

We have had some other site problems this week, caused by the time on several nodes being off by a good few minutes. It seems Lancaster's ntp server is unwell.

The room where we keep our pool nodes is suffering from heat issues. This always leaves us on edge, as our SE has had to be shut down before because of this, and the heat can make things flaky. Hopefully that machine room will get more cooling power, and soon.

Other site news from Peter:
A misconfigured VO_GEANT4_SW_DIR caused some WNs to have a full / partition, becoming blackholes. On top of this, a typo (extra quote) in the site-info.def caused lcg-env.sh to be messed up, failing jobs immediately. Fixed now, but it flags up how sensitive the system is to tweaks. Our most stable production month was when we implemented a no-tweak policy.

Manchester Weekly Update

So far this week, we've had duplicate tickets from GGUS about a failure with dcache01 (affecting the ce01 SAM tests); all transfers were stalling. I couldn't debug this as my certificate expired the day I returned from a week off and my dteam membership still hasn't been updated. Restarting the dcache headnode fixed this.
And this morning I discovered that a number of our worker nodes had full /scratch partitions. The problem has been tracked to a phenogrid user, and we're working with him to try to isolate the issue.

Thursday 24 May 2007

lcg-vomscerts updated

lcg-vomscerts rpm needs to be updated to version 4.5.0-1.
The configuration needs to be changed on the UIs; the YAIM site-info.def is not correct. The easiest recipe to change it (from Santanu on TB-SUPPORT) is

sed -c 's/C=CH\/O=CERN\/OU=GRID\/CN=host\//DC=ch\/DC=cern\/OU=computers\/CN=/g ' -i .old site-info.def

and then run the yaim function config_vomses
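If memory serves, on yaim 3.0 that can be done with something along these lines (check the exact path and syntax for your yaim version rather than taking this verbatim):

/opt/glite/yaim/scripts/run_function site-info.def config_vomses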

Thursday 10 May 2007

Manchester Tier2 dcache goes resilient II

Yesterday we completed the scheduled downtime, and now dcache02 is up and resilient. It's still chewing through the list of files and making copies of them; going by past experience it will probably finish somewhere around lunchtime tomorrow. It's so nice to know we're not in the dark ages of dcache-1.6.6 any more. Of course, there are still small niggles to iron out and we've yet to really throw a big load at it, but it's looking a lot, lot better.

Friday 4 May 2007

Manchester Tier2 dcache goes resilient

We're half way through the combined upgrade from dcache-1.6.6-vanilla to dcache-1.7.0-with-replica-manager. So far only one of the two head nodes has been upgraded, but so far so good; the other is scheduled for upgrade next week, and I appear to have scheduled the queue shutdown for 8am on bank-holiday Monday!

Tuesday 24 April 2007

Atlas Freedom of Choice for Resources procedure

http://www.gridpp.ac.uk/wiki/Atlas_FCR_Procedure
New CIC portal feature: it is possible to receive notifications of SAM test failures. The page to subscribe is

https://cic.gridops.org/index.php?section=rc&page=alertnotification

Tuesday 17 April 2007

- To enable RSS for this blog in firefox, bookmark http://northgrid-tech.blogspot.com/atom.xml as a live bookmark.