Thursday 15 October 2009

Squid cache for atlas on a 32bit machine

I installed the squid cache for atlas on an SL5 32-bit machine. The project does not provide 32-bit rpms. There is a default OS squid rpm, but it is apparently buggy and the request is to install a 2.7.STABLE7 version. So I got the source rpm from here:

https://twiki.cern.ch/twiki/bin/view/PDBService/SquidRPMsTier1andTier2

rpmbuild --rebuild frontier-squid-2.7.STABLE7-4.sl5.src.rpm

This will compile squid for your system and create a binary rpm in

/usr/src/redhat/RPMS/i386/frontier-squid-2.7.STABLE7-4.sl5.i386.rpm

rpm -ihv /usr/src/redhat/RPMS/i386/frontier-squid-2.7.STABLE7-4.sl5.i386.rpm

It installs everything under /home/squid - apparently the package is relocatable, but I don't mind the location so I left it.
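
If you did want it somewhere else, a relocatable rpm can normally be installed with a prefix override - something like this (untested here, since I kept the default location):

rpm -ihv --prefix /opt/squid /usr/src/redhat/RPMS/i386/frontier-squid-2.7.STABLE7-4.sl5.i386.rpm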

Edit /home/squid/etc/squid.conf

Not everything you find in the BNL instructions is necessary. Here is my list of changes

acl SUBNET-NAME src SUBNET-IPS
<---- there are different ways to express this; see the example below these settings
http_access allow SUBNET-NAME
hierarchy_stoplist cgi-bin ?

cache_mem 256 MB
maximum_object_size_in_memory 128 KB

cache_dir ufs /home/squid/var/cache 100000 16 256

maximum_object_size 1048576 KB

update_headers off

cache_log /home/squid/var/logs/cache.log
cache_store_log none

strip_query_terms off

refresh_pattern -i /cgi-bin/ 0 0% 0

cache_effective_user squid
<--- the default is 'nobody', which doesn't have access to /home/squid
icp_port 0
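
As an example of the ACL above, a plain subnet in CIDR notation would look like the lines below (name and addresses are made up - use your own). Once you have finished editing, it's worth letting squid check the syntax; I'm assuming the binary sits under /home/squid/sbin like the rest of the installation:

acl LOCAL-NET src 10.1.0.0/16
http_access allow LOCAL-NET

/home/squid/sbin/squid -k parse -f /home/squid/etc/squid.conf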


Edit /home/squid/sbin/fn-local-squid.sh
Add these two lines near the top of the script, so chkconfig can pick them up:

# chkconfig: - 99 21
# description: Squid cache startup script


then

ln -s /home/squid/sbin/fn-local-squid.sh /etc/init.d/squid
chkconfig --add squid
chkconfig squid on
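
To check the service has been registered and starts cleanly through the init scripts:

chkconfig --list squid
service squid start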


Write to Rod Walker to get your machine authorized on the GridKa Frontier server (until RAL is up, that's the server for Europe). If you can set up an alias for the machine, do it before writing to him.

To test the setup

wget http://frontier.cern.ch/dist/fnget.py
export http_proxy=http://YOUR-SQUID-CACHE:3128
python fnget.py --url=http://atlassq1-fzk.gridka.de:8021/fzk/Frontier --sql="SELECT TABLE_NAME FROM ALL_TABLES"


If you get many lines similar to those below

COMP200_F0027_TAGS_SEQ
COMP200_F0037_IOVS_SEQ
COMP200_F0020_IOVS_SEQ


your cache is working.
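
To double check the query really went through your cache rather than straight to the server, look for it in squid's access log (assuming the default log location under /home/squid/var/logs):

grep Frontier /home/squid/var/logs/access.log | tail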

Wednesday 19 August 2009

Manchester update

I fixed the site BDII problem, i.e. the site static information 'disappeared'. It didn't actually disappear: it was declared under mds-vo-name=resource instead of mds-vo-name=UKI-NORTHGRID-MAN-HEP, and therefore GStat couldn't find it. This was due to a conflict between the RGMA and site BDIIs: the RGMA BDII (which didn't exist in very old versions) needs to be declared in BDII_REGIONS in YAIM. I knew this, but completely forgot I had already fixed it when I reinstalled the machine a few months ago, so I spent a delightful afternoon parsing ldif files and ldap output, hacked the ldif, sort of fixed it, and then asked for a proper solution. So... here we go, I'm writing it down this time so I can google it myself. On the positive side, I have now upgraded both the site and top BDIIs, and the resource BDII on the CEs, to the latest version, so we now have shiny new attributes like Spec2006 & Co.
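
For the record (and for future googling), the fix boils down to listing the RGMA resource BDII in BDII_REGIONS in site-info.def and giving it a matching URL entry, something along these lines - region names and hostname here are illustrative, not our actual configuration:

BDII_REGIONS="CE SE MON RGMA"
BDII_RGMA_URL="ldap://mon.example.ac.uk:2170/mds-vo-name=resource,o=grid"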

I also upgraded the CEs, trying to fix the random instability problem that afflicts us. However, I upgraded online without reinstalling everything, and it makes me a bit nervous to think that some files that needed changing might not have been edited because they already existed. So I will completely reinstall the CEs, starting with ce01 today.

Tuesday 5 May 2009

Howto publish user DNs in accounting records

To publish user DN records in the accounting, you should add the following line to your site-info.def:

APEL_PUBLISH_USER_DN="yes"

and reconfigure your MON box. This will change the parser configuration file

/opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml

replacing this line

<JoinProcessor publishGlobalUserName="no">

with this one

<JoinProcessor publishGlobalUserName="yes">

This will affect only new records.
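
If you prefer not to wait for YAIM or want to script the change, the same edit is a one-line sed on the file above (just for illustration - reconfiguring with YAIM does it for you):

sed -i 's/publishGlobalUserName="no"/publishGlobalUserName="yes"/' /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml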

If you want to republish everything you need to replace in the same file

<Republish>missing</Republish>

with this line, using the appropriate dates

<Republish recordStart="2006-02-01" recordEnd="2006-04-25">gap</Republish>

and publish a chunk of data at a time. The documentation suggests one month at a time to avoid running out of memory. When you have finished, put back the line

<Republish>missing</Republish>
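
To make the monthly chunking concrete: republishing, say, the first two months of 2009 means one run with

<Republish recordStart="2009-01-01" recordEnd="2009-01-31">gap</Republish>

then, once that has completed, another run with

<Republish recordStart="2009-02-01" recordEnd="2009-02-28">gap</Republish>

and so on, before restoring the 'missing' line as above (the dates are purely illustrative).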

Wednesday 29 April 2009

NFS bug in older SL5 kernel

As mentioned previously ( http://northgrid-tech.blogspot.com/2009/03/replaced-nfs-servers.html ) we have recently upgraded our NFS servers and they now run on SL5. Shortly after going into production all LHCb jobs stalled at Manchester and we were blacklisted by the VO.

We were advised that it may be a lockd error, and asked to use the following python code to diagnose this:

-------------------------------------------------------------------
import fcntl
# try an exclusive, non-blocking lock on a file (run this in an
# NFS-mounted directory so the request goes through lockd)
fp = open("lock-test.txt", "a")
fcntl.lockf(fp.fileno(), fcntl.LOCK_EX|fcntl.LOCK_NB)
-------------------------------------------------------------------
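
Alongside the Python test, a couple of standard checks tell you whether the NFS lock manager is reachable at all (replace nfs-server with your server's hostname):

rpcinfo -p nfs-server | grep nlockmgr
service nfslock status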


The code did not give any errors, so we discounted this as the problem. Wind the clock on a fortnight (including a week's holiday over Easter) and we still hadn't found the problem, so I tried the above code again and, bingo, lockd was the problem. A quick search of the SL mailing list pointed me to this kernel bug:
https://bugzilla.redhat.com/show_bug.cgi?id=459083

A quick update of the kernel and a reboot, and the problem was fixed.
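
On SL5 that just meant pulling in the fixed kernel with yum and rebooting, nothing more exotic than:

yum update kernel
reboot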

Friday 3 April 2009

Fixed MPI installation

A few months ago we installed MPI using gLite packages and YAIM.

http://northgrid-tech.blogspot.com/2008/11/mpi-enabled.html

We never really tested it until now, though. We have found a few problems with YAIM:

YAIM creates an mpirun script that assumes ./ is in the PATH, so the job was landing on the WN but mpirun couldn't find the user's script/executable. I corrected it by prepending `pwd`/ to the script arguments at the end of the script, so that it runs `pwd`/$@ instead of $@ (sketch below). I added this using YAIM's post-function mechanism.
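
The change amounts to something like the following at the end of the generated mpirun wrapper (a sketch from memory, not the literal YAIM script - the surrounding variables differ):

# the wrapper used to end with something like:  mpiexec ... $@
# prepend the job's working directory so the executable is found
# even though ./ is not in the PATH on the WN:
exec mpiexec `pwd`/$@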

The if/else statement that is used to build MPIEXEC_PATH is written in a contorted way and needs to be corrected. For example:

1) MPI_MPIEXEC_PATH is used in the if, but YAIM doesn't write it to any system file that sets the environment variable, such as grid-env.sh where the other MPI_* variables are set.

2) In the else branch there is a hardcoded path, which is actually obtained by splitting the directory part off the mpiexec executable that MPI_MPICH_MPIEXEC points to.

3) YAIM doesn't rewrite mpirun once it has been written, so the hardcoded path can't be changed by reconfiguring the node without manually deleting mpirun first. This makes it difficult to update things or correct mistakes.

4) The existence of MPIEXEC_PATH is not checked, and it should be.

Anyway, eventually we managed to run MPI jobs, and we reported what we did to the new TMB MPI working group because another site was experiencing the same problems. Hopefully they will correct these issues. Special thanks go to Chris Glasman, who hunted down the initial problem with the path and patiently tested the changes we applied.

Wednesday 25 March 2009

New Storage and atlas space tokens

We have finally installed all the units. That gives us ~84TB of usable space: 42TB are dedicated to atlas space tokens, and the other 42TB are shared for now but will be moved into atlas space tokens when we see more usage.

We have also finally enabled all the space tokens requested by atlas. They are waiting to be inserted in Tiers of ATLAS, but below I report what we publish in the BDII.

ldapsearch -x -H ldap://site-bdii.tier2.hep.manchester.ac.uk:2170 -b o=grid '(GlueSALocalID=atlas*)' GlueSAStateAvailableSpace GlueSAStateUsedSpace| grep Glu
dn: GlueSALocalID=atlas,GlueSEUniqueID=dcache01.tier2.hep.manchester.ac.uk,mds
GlueSAStateAvailableSpace: 33411318000
GlueSAStateUsedSpace: 21533521683
dn: GlueSALocalID=atlas,GlueSEUniqueID=dcache02.tier2.hep.manchester.ac.uk,mds
GlueSAStateAvailableSpace: 48171274000
GlueSAStateUsedSpace: 4168774302
dn: GlueSALocalID=atlas:ATLASGROUPDISK:online,GlueSEUniqueID=bohr3223.tier2.he
GlueSAStateAvailableSpace: 1610612610
GlueSAStateUsedSpace: 125
dn: GlueSALocalID=atlas:ATLASPRODDISK:online,GlueSEUniqueID=bohr3223.tier2.hep
GlueSAStateAvailableSpace: 2154863252
GlueSAStateUsedSpace: 44160003
dn: GlueSALocalID=atlas:ATLASSCRATCHDISK:online,GlueSEUniqueID=bohr3223.tier2.
GlueSAStateAvailableSpace: 3298534820
GlueSAStateUsedSpace: 62
dn: GlueSALocalID=atlas:ATLASLOCALGROUPDISK:online,GlueSEUniqueID=bohr3223.tie
GlueSAStateAvailableSpace: 3298534883
GlueSAStateUsedSpace: 0
dn: GlueSALocalID=atlas:ATLASDATADISK:online,GlueSEUniqueID=bohr3223.tier2.hep
GlueSAStateAvailableSpace: 8052760941
GlueSAStateUsedSpace: 302738
dn: GlueSALocalID=atlas,GlueSEUniqueID=bohr3223.tier2.hep.manchester.ac.uk,mds
GlueSAStateAvailableSpace: 28580000000
GlueSAStateUsedSpace: 709076288
dn: GlueSALocalID=atlas:ATLASMCDISK:online,GlueSEUniqueID=bohr3223.tier2.hep.m
GlueSAStateAvailableSpace: 3298534758
GlueSAStateUsedSpace: 125
dn: GlueSALocalID=atlas:ATLASPRODDISK:online,GlueSEUniqueID=bohr3226.tier2.hep
GlueSAStateAvailableSpace: 2199023130
GlueSAStateUsedSpace: 125
dn: GlueSALocalID=atlas:ATLASLOCALGROUPDISK:online,GlueSEUniqueID=bohr3226.tie
GlueSAStateAvailableSpace: 3298534883
GlueSAStateUsedSpace: 0
dn: GlueSALocalID=atlas:ATLASGROUPDISK:online,GlueSEUniqueID=bohr3226.tier2.he
GlueSAStateAvailableSpace: 1610612610
GlueSAStateUsedSpace: 125
dn: GlueSALocalID=atlas:ATLASMCDISK:online,GlueSEUniqueID=bohr3226.tier2.hep.m
GlueSAStateAvailableSpace: 3298534758
GlueSAStateUsedSpace: 125
dn: GlueSALocalID=atlas:ATLASDATADISK:online,GlueSEUniqueID=bohr3226.tier2.hep
GlueSAStateAvailableSpace: 8053063554
GlueSAStateUsedSpace: 125
dn: GlueSALocalID=atlas:ATLASSCRATCHDISK:online,GlueSEUniqueID=bohr3226.tier2.
GlueSAStateAvailableSpace: 3298534820
GlueSAStateUsedSpace: 62
dn: GlueSALocalID=atlas,GlueSEUniqueID=bohr3226.tier2.hep.manchester.ac.uk,mds
GlueSAStateAvailableSpace: 35730390000
GlueSAStateUsedSpace: 1172758

Tuesday 24 March 2009

Replaced NFS servers

The NFS servers in Manchester have been replaced with two more powerful machines and two 1TB RAIDed SATA disks. This should hopefully put a stop to the space problems we have suffered in the past few months with both atlas and lhcb, and should also allow us to keep a few more releases than before.

We also now have nice nagios graphs to monitor the space, as well as cfengine alerts.

http://tinyurl.com/d5n7eo

Thursday 5 March 2009

Machine room update

After some sweet-talking we managed to get two extra air-con units installed in our old machine room. This room houses our 2005 cluster and our more recent CPU and storage purchased last year. The extra cooling was noticeable and allowed us to switch on a couple of racks which were otherwise offline.


In other news, the new data centre is coming along nicely and will be ready for handover in three to four months from now. If you're ever racing past Lancaster on the M6 you'll get a good view of the Borg mothership on the hill; the sleek black cladding is going up now...

Friday 13 February 2009

This week's DPM troubles at Lancaster.

We've had an interesting time this week in Lancaster: a tale of Gremlins, Greedy Daemons and Magical Faeries who come in the night and fix your DPM problems.

On Tuesday evening, when we'd all gone home for the night, the DPM srmv1 daemon (and to a lesser extent the srmv2.2 and dpm daemons) started gobbling up system resources, sending our headnode into a swapping frenzy. There are known memory leak problems in the DPM code, and we've been victims of them before, but in those instances we've always been saved by a swift restart of the affected services and the worst that happened was a sluggish DPM. This time the DPM services completely froze up, and around 7 pm we started failing tests.

So, coming into this disaster on Wednesday morning, we leaped into action. Restarting the services fixed the load on the headnode, but the DPM still wouldn't work. Checking the logs showed that all requests were being queued, apparently forever. The trail led to some error messages in mysqld.log:

090211 12:05:37 [ERROR] /usr/libexec/mysqld: Lock wait timeout exceeded;
try restarting transaction
090211 12:05:37 [ERROR] /usr/libexec/mysqld: Sort aborted

The oracle that is Google indicated that these kinds of errors are typical of a mysql server left in a bad state after suddenly losing the connection to a client without accounting for it. Various restarts, reboots and threats were used, but nothing would get the DPM working and we had to go into downtime.
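
For anyone chasing similar symptoms, it's worth looking first at what the mysql server itself thinks is going on - which connections are stuck and what InnoDB says about its locks. These are standard mysql commands, nothing DPM-specific (use whatever credentials your DPM database has):

mysql -u root -p -e "SHOW PROCESSLIST"
mysql -u root -p -e "SHOW INNODB STATUS\G"

On newer mysql versions the second one is spelt SHOW ENGINE INNODB STATUS.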

Rather than dive blindly into the bowels of the DPM's mysql backend, we got in contact with the DPM developers on the DPM support list. They were really quick to respond, and after receiving 40MB of (zipped!) log files from us they set to work developing a strategy to fix us. It appears that our mysql database had grown much larger than it should have, "bloating" with historical data, which contributed to it getting into a bad state and made the task of repairing the database harder - partly because we simply couldn't restore from backups, as these too would be "bloated".

After a while of bashing our heads, scouring logs and waiting for news from the DPM chaps, we decided to make use of the downtime and upgrade the RAM on our headnode to 4 GB (from 2), a task we had been saving for the scheduled downtime when we finally upgrade to the Holy Grail that is DPM 1.7.X. So we slapped in the RAM, brought the machine up cleanly, and left it.

A bit over an hour after it came back up from the upgrade, the headnode started working again. As if by magic. Nothing notable in the logs; it just started working again. The theory is that the added RAM allowed mysql to chug through a backlog of requests and start working again, but that's just speculation. The DPM chaps are still puzzling over what happened, and our databases are still bloated, but the crisis has passed (for now).

So there are two morals to this tale:
1) I wouldn't advise running a busy DPM headnode with less than 4GB of RAM; it leads to unpredictable behaviour.
2) If you get stuck in an Unscheduled Downtime you might as well make use of it to get some work done - you never know when something magical might happen!

Monday 9 February 2009

Jobmanager pbsqueue cache locked

Spent last week tracking down a problem where jobs were finishing in the batch system but the jobmanager wasn't recognizing this. This meant that jobs never 'completed', which had two major impacts: 1. Steve's test jobs all failed through timeouts, and 2. Atlas production stopped because it looked like the pilots never completed, so no further pilots were sent.

Some serious detective work was undertaken by Maarten and Andrey, and it turned out the pbsqueue cache wasn't being updated due to a stale lock file in the ~/.lcgjm/ directory. The lock files can be found with this on the CE:

find /home/*/.lcgjm/pbsqueue.cache.proc.{hold.localhost.*,locked} -mtime +7 -ls

We had 6 users affected (alas, our important ones!), all with lock files dated Dec 22. Apparently the lcgpbs script Helper.pm would produce these whenever hostname returned 'localhost'. And yes, on December 22 we had maintenance work with DHCP unavailable, and for a brief period the CE hostname was 'localhost'. Note this is the lcg-CE under gLite 3.1. Happy days are here again!
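
If you hit the same thing, once you're happy the lock files really are stale (nothing legitimately holding them and the jobmanager quiet), clearing them out is what lets the cache update again - the same find with a removal bolted on:

find /home/*/.lcgjm/pbsqueue.cache.proc.{hold.localhost.*,locked} -mtime +7 -exec rm -v {} \;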