Wednesday 17 December 2008

Manchester General Changes

* Enabled pilot users for Atlas and Lhcb. Currently Lhcb is running a lot of jobs and, although most are production, many come from their generic pilot users. Atlas instead seems to have almost disappeared.

* Enabled NGS VO and passed the first tests. Currently in the conformance test week.

* Enabled one shared queue and completely phased out the VO queues. This required a transition period to give some VOs time to clear the jobs from the old queues and/or to reconfigure their tools. It has greatly simplified the maintenance.

* Installed a top-level BDII and reconfigured the nodes to query the local top level BDII instead of the RAL one. This was actually quite easy and we should have done it earlier.

* Cleaned up old parts of cfengine that were causing the servers to be overloaded, to serve the nodes incorrectly and to fire off thousands of emails a day. Mostly this was due to an overlap in the way cfexecd was run, both as a cron job and as a daemon. However we also increased the TimeOut and SplayTime values and explicitly introduced the schedule parameter in cfagent.conf (see the sketch at the end of this list). Since then cfengine hasn't had any more problems.

* Increased usage of YAIM local/post functions to apply local overrides or minor corrections to the YAIM defaults (an example is at the end of this list). Compared to inserting the changes via cfengine, this method has the benefit of being integrated and predictable: when we run YAIM the changes are applied immediately and don't get overridden.

* New storage: our room is full and when the clusters are loaded we hit the power/cooling limit and risk drawing power from other rooms. Because of this the CC people don't want us to switch on new equipment without switching off some old kit, to maintain the balance in power consumption. So we have eventually bought 96 TB of raw space to keep us going. The kit arrived yesterday and needs to be installed in the rack we have, and power measurements need to be taken to avoid switching off more nodes than necessary. Luckily it will not be many anyway (taking the nominal value on the back of the new machines it would be 8 nodes, but with better power measurements it could be as few as 4) because the new machines consume much less than the DELL nodes, which are now 4 years old. However, buying new CPUs/storage cannot be done without switching off a significant fraction of the current CPUs before switching on the new kit, and it requires working in tight cooperation with the CCS people, which has now been agreed after a meeting I had last week with them and their management.
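
For reference, the cfagent.conf control section now contains settings along these lines (the values here are illustrative rather than our exact ones):

control:
   actionsequence = ( copy editfiles shellcommands )
   SplayTime      = ( 15 )
   schedule       = ( Min00_05 Min30_35 )

And a YAIM local/post override is just a shell function dropped into /opt/glite/yaim/functions/local/. A hypothetical example of a post hook, with a made-up local tweak (the file name has to match the function name):

# /opt/glite/yaim/functions/local/config_torque_server_post
function config_torque_server_post () {
  # run automatically right after config_torque_server when YAIM is invoked
  qmgr -c "set queue long max_user_run = 200"
  return 0
}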

Tuesday 16 December 2008

Phasing out VO queues

I've started to phase out the VO queues and to create shared queues. The plan is eventually to have 4 queues called, with a leap of imagination, short, medium, long and test, with the following characteristics:

test: 3h/4h; all local VOs
short: 6h/12h; ops, lhcbsgm, atlasgm
medium: 12h/24h; all VOs and roles except those in the short queue and production
long: 24h/48h; all VOs and roles except those that can access the short queue

Installing the queues, adding the group ACLs and publishing them is not difficult. YAIM (glite-yaim-core-4.0.4-1 and glite-yaim-lcg-ce-4.0.4-2 or higher) can do it for you. Otherwise it can be done by hand, which is still easy but harder to maintain (the risk of the changes being overridden is always high, and the files need to be kept in cfengine or CVS or similar).

The problem for me is that this scheme works only if the users select the correct ACLs and a queue with a suitable length for their jobs in their JDL. If they don't, the queue chosen by the WMS is effectively random, with a high probability of jobs failing because they end up in a queue that is too short or in a queue that doesn't have the right ACLs. So I'm not sure it's really a good idea, even if it is much easier to maintain and allows slightly more sophisticated setups.
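
For instance, a user who knows a job needs up to 20 hours of wall-clock time could steer the matchmaking with a requirement along these lines in the JDL (the values and the queue name pattern here are purely illustrative):

Requirements = other.GlueCEPolicyMaxWallClockTime >= 1200;

or, to pin a specific queue by name:

Requirements = RegExp("jobmanager-lcgpbs-long$", other.GlueCEUniqueID);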

Anyway, if you do it with YAIM all you have to do is add the queue to

QUEUES="my-new-queue other-queues"

add the right VO/FQAN to the new queue's _GROUP_ENABLE variable (remember to convert . and - into _):

MY_NEW_QUEUE_GROUP_ENABLE="atlas /atlas/ROLE=pilot other-vos-or-fqans"

The syntax of GROUP_ENABLE has to be the same as the one you have used in groups.conf (see the previous post http://northgrid-tech.blogspot.com/2008/12/groupsconf-syntax.html).

And finally add to site-info.def

FQANVOVIEWS=yes

to enable publishing of the ACL in the GIP.

Rerun YAIM on the CE as normal.

To check everything is ok on the CE

qmgr -c 'p q my-new-queue'

ldapsearch -x -H ldap://MY-CE.MY-DOMAIN:2170 -b GlueCEUniqueID=MY-CE.MY-DOMAIN:2119/jobmanager-lcgpbs-my-new-queue,Mds-Vo-name=resource,o=grid

Among other things, if correctly configured it should list the GlueCEAccessControlBaseRule entries for each VO and FQAN you have listed in _GROUP_ENABLE.
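
For the atlas VO and its pilot role, for example, the entries look like this:

GlueCEAccessControlBaseRule: VO:atlas
GlueCEAccessControlBaseRule: VOMS:/atlas/Role=pilot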

If a GlueCEAccessControlBaseRule: DENY:FQAN field appears, that is the ACL for the VOViews, not the access to the queue.

Thanks to Steve and Maria for pointing me to the right combination of YAIM packages and for confirming the randomness of the WMS matchmaking.

Monday 15 December 2008

groups.conf syntax

Elena asked about it a few days ago on TB-SUPPORT. Today I investigated a bit further and the result is that for glite-yaim-core versions >4.0.4-1:

* Even if it still works, the syntax with VO= and GROUP= is obsolete. The new syntax is much simpler as it uses directly the FQANs as reported in the VO cards (if they are maintained); see the example after this list.

* The syntax in /opt/glite/yaim/examples/groups.conf.example is correct and the files in that directory are kept up to date with the correct syntax, although the examples themselves might not be valid.

* Further information can be found either in

/opt/glite/examples/groups.conf.README

or

https://twiki.cern.ch/twiki/bin/view/LCG/YaimGuide400#Group_configuration_in_YAIM

which is worth reviewing periodically for changes.
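
To illustrate the difference, an old-style entry for the atlas software manager role looked like

"/VO=atlas/GROUP=/atlas/ROLE=lcgadmin":::sgm:

while with the new syntax you just use the FQANs themselves (these lines are from memory, so check groups.conf.example for the authoritative form):

"/atlas/ROLE=lcgadmin":::sgm:
"/atlas/ROLE=production":::prd:
"/atlas"::::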

Monday 1 December 2008

RFIO tuning for Atlas analysis jobs

A little info about the RFIO settings we've tested at Liverpool.

Atlas analysis jobs running on a site using DPM use POSIX access through the RFIO interface. ROOT (since v5.16 IIRC) has support for RFIO access and uses the buffered access mode READBUF. This allocates a static buffer for files read via RFIO on the client. By default this buffer is 128kB.

Initial tests with this default buffer size showed a low cpu efficiency and high bandwidth usage, with far more data transferred than the size of the files being accessed. The buffer size can be altered by putting a file on the client called /etc/shift.conf containing

RFIO IOBUFSIZE XXX

where XXX is the size in bytes. Altering this setting gave the following results

Buffer (MB), CPU (%), Data transferred (GB)
0.125, 60.0, 16.5
1.000, 23.0, 65.5
10.00, 13.5, 174.0
64.00, 62.1, 11.5
128.0, 74.7, 7.5

This was on a test data set with file sizes of ~1.5GB and using athena 14.2.10.

Using buffer sizes of 64MB or more gives gains in efficiency and required bandwidth. A 128MB buffer is a significant chunk of a worker node's RAM, but as the files are not being cached in the linux file cache the RAM usage is likely similar to accessing the file from local disk, and the gains are large.
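
For reference, a 64MB buffer corresponds to the following line in /etc/shift.conf (64*1024*1024 bytes):

RFIO IOBUFSIZE 67108864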

For comparison, the same test was run from a copy of the files on local disk. This gave a cpu efficiency of ~50%, but the event rate was ~8 times slower than when using RFIO.

My conclusions are that RFIO buffering is significantly more efficient than standard linux file caching. The default buffer size is insufficient, and increasing it by small amounts actually greatly reduces efficiency. Increasing the buffer to 64-128MB gives big gains without impacting available RAM too much.

My guess about why only a big buffer gives gains is the random access pattern of the analysis job on the file. Reading in a small chunk, e.g. 1MB, may buffer a whole event, but the next event is unlikely to be in that buffered 1MB, so another 1MB has to be read in for the next event. Similarly for 10MB: the amount read in each time is 10x as much, but with a less than 10x increase in the probability of the next event being in the buffer. When the buffer reaches 64MB the probability of an event being in the buffered area is high enough to offset the extra data being read in.

Another possibility is that the buffering only buffers the first x MB of the file, hence a bigger buffer means more of the file is in RAM and there's a higher probability of the event being in the buffer. Neither of these hypotheses has been investigated further yet.

Large block reads are also more efficient when reading in the data than lots of small random reads. The efficiency effectively becomes 100% if the buffer size is >= the dataset file size; the first reads pull in all of the file and all reads from then are from local RAM.

This makes no difference to the impact on the head node for e.g. SURL/TURL requests, only to the efficiency of the analysis job accessing the data from the pool nodes and to the required bandwidth (our local tests simply used the rfio:///dpm/... path directly). If there are enough jobs there will still be bottlenecks on the network, either at the switch or at the pool node. We have given all our pool nodes at least 3Gb/s connectivity to the LAN backbone.

The buffer size setting will give different efficiency gains for different file sizes (i.e. the smaller the file size, the better the efficiency); e.g. the first atlas analysis test had smaller file sizes than our tests and showed much higher efficiencies. The impact of the IOBUFSIZE setting on other VOs' analysis jobs that use RFIO hasn't been tested.

Friday 28 November 2008

Nagios Checker

Nagios checker is a firefox plugin that can replace nagios email alerts with colourful, blinking and possibly noisy icons at the bottom of your firefox window.

The icon expands into a list of problematic hosts when the cursor hovers over it and, with the right permissions, clicking on a host takes you to its nagios page.

https://addons.mozilla.org/en-US/firefox/addon/3607

To configure it to read NorthGrid nagios:

* Go to settings (right-click on the nagios icon on firefox)

* Click Add new

* In the General tab:
** Name: whatever you like
** WEB URL: https://niels004.tier2.hep.manchester.ac.uk/nagios
** Tick the nagios older than 2.0 box
** User name: your DN
** Status script URL: https://niels004.tier2.hep.manchester.ac.uk/nagios/cgi-bin/status.cgi
** Click Ok

If you want only your site machines

* Go to the filters tab and
** Tick the 'Hosts matching regular expressions' box
** Insert your domain name in the text box
** Tick the reverse expression box.
** Click ok

For the rest you can adjust it as you please; I removed the sounds and set a 3600 sec refresh interval.

Drawbacks if you add other nagios end-points:

* Settings are applied to all of them
* If the host names in your other nagios instances have a different domain name (or none at all) they don't get filtered.

Perhaps another method might be needed. Investigating.

Manchester and Lhcb

Manchester is now officially part of Lhcb and all their CPU hours will have full weight!! Yuppieee!! :)

Thursday 27 November 2008

Black holes detection

Finding black holes is always a pain... However, the pbs accounting records can be of help. A simple script that counts the number of jobs a node swallows makes some difference:

http://www.sysadmin.hep.ac.uk/svn/fabric-management/torque/jobs/black-holes-finder.sh

I post it just in case other people are interested.

An example of the output:

# black-holes-finder.sh

Using accounting file 20081127
[...]
bohr5029: 1330
bohr5030: 1803


Clearly the two nodes above have a problem.
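
For the curious, the core of the idea is just counting the 'E' (job end) records per execution host in the day's torque accounting log. A minimal sketch of that idea (assuming the default accounting directory and single-slot jobs; the real script at the URL above is the one to use):

# count finished jobs per worker node in today's accounting file
awk -F';' '$2 == "E" && match($4, /exec_host=[^ ]+/) {
    split(substr($4, RSTART + 10, RLENGTH - 10), h, "/")  # keep only the node name
    count[h[1]]++
}
END { for (n in count) print n": "count[n] }' \
  /var/spool/pbs/server_priv/accounting/$(date +%Y%m%d) | sort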

MPI enabled

Enabled MPI in Manchester using YAIM and the recipe from Stephen Childs that I found at the links below:

http://www.grid.ie/mpi/wiki/YaimConfig

http://www.grid.ie/mpi/wiki/SiteConfig

Caveats:

1) The documentation will probably move to the official YAIM pages

2) The location of the GIP files is now under /opt/glite, not /opt/lcg

3) The scripts DO interfere with the current setup on the WNs if run on their own, so you need to reconfigure the whole node (I made the mistake of running only MPI_WN). On the CE instead it's enough to run MPI_CE.

4) The MPI_SUBMIT_FILTER variable in site-info.def is not documented (yet). It enables the part of the scripts that rewrites the torque submit filter so that the correct number of CPUs is allocated (a sketch of the relevant site-info.def variables is after this list).

5) Yaim doesn't publish MPICH (yet?) so I had to add the following lines

GlueHostApplicationSoftwareRunTimeEnvironment: MPICH
GlueHostApplicationSoftwareRunTimeEnvironment: MPICH-1.2.7

to /opt/glite/etc/gip/ldif/static-file-Cluster.ldif manually.
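
For completeness, the site-info.def part of the recipe ends up looking roughly like this (the MPICH path and version here are made up; see the grid.ie pages above for the authoritative list of variables):

MPI_MPICH_ENABLE="yes"
MPI_MPICH_PATH="/opt/mpich-1.2.7p1/"
MPI_MPICH_VERSION="1.2.7p1"
MPI_SUBMIT_FILTER="yes"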

Tuesday 25 November 2008

Regional nagios update

I reinstalled the regional nagios with Nagios3 and it works now.

https://niels004.tier2.hep.manchester.ac.uk/nagios

As suggested by Steve I'm also trying the nagios checker plugin

https://addons.mozilla.org/en-US/firefox/addon/3607

instead of the email notifications, but I still have to configure things properly. At the moment firefox makes some noise every ~30 seconds, and there is also a visual alert in the bottom right corner of the firefox window with the number of services in a critical state, which expands to show the services when the cursor points at it. Really nice. :)

Thursday 20 November 2008

WMS talk

I gave a talk about the WMS for the benefit of the Manchester users. It might be of interest to other people.

The talk can be found here:

WMS Overview

Monday 10 November 2008

DPM "File Not Found"- but it's right there!

Lancaster's been having a bad run with atlas jobs the last few weeks. We've been failing jobs with error messages like:
"/dpm/lancs.ac.uk/home/atlas/atlasproddisk/panda/panda.60928294-7fd4-4007-af5e-4cdc3c8934d3_dis18552080/HITS.024853._00326.pool.root.2 : No such file or directory (error 2 on fal-pygrid-35.lancs.ac.uk)"

However, when we break out the great dpm tools and track this file down on disk, it's right where it should be, with correct permissions, size and age - not even attempting to hide. The log files show nothing directly exciting, although there are a lot of deletes going on. Ganglia comes up with something a little more interesting on the dpm head node - heavy use of swap memory and erratic, high CPU use. A restart of the dpm and dpns services seems to have calmed things somewhat this morning, but memory usage is climbing pretty fast:

http://fal-pygrid-17.lancs.ac.uk:8123/ganglia/?r=day&sg=&c=LCG-ServiceNodes&h=fal-pygrid-30.lancs.ac.uk

The next step is for me to go shopping for RAM; the head node is a sturdy box but only has 2GB, and an upgrade to 4 should give us more breathing space. But the real badness comes from the fact that all this swapping should decrease performance, not lead to the situation we have, where the dpm databases under high load seem to return false negatives to queries about files - telling us they don't exist when they're right there on disk where they should be.

Friday 7 November 2008

Regional nagios

I installed a regional nagios yesterday; it turned out to be quite easy, and the nagios group was quite helpful. I followed the tutorial given by Steve at EGEE08:

https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgYaimTutorial

I updated it as I went along instead of writing a parallel document.

Below is the URL of the test installation. It might get reinstalled a few times in the next few days to test other features.

https://niels004.tier2.hep.manchester.ac.uk/nagios

Wednesday 27 August 2008

DPM in Manchester

Manchester now has a fully working DPM with 6 TB. There are 2 space tokens, ATLASPRODDISK and ATLASDATADISK. The service has been added to the GOC and to the Information System, and the space tokens are published. The errors have been corrected and the system has been passing the SAM tests continuously since yesterday.

I added some information to the wiki

https://www.gridpp.ac.uk/wiki/Manchester_DPM#Atlas_Space_tokens
https://www.gridpp.ac.uk/wiki/Manchester_DPM#Errors_along_the_path
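
For the record, the tokens themselves are created with dpm-reservespace; a sketch of the kind of command involved (sizes and group mappings here are illustrative, the exact commands are on the wiki pages above):

dpm-reservespace --gspace 2T --lifetime Inf --group atlas/Role=production --token_desc ATLASPRODDISK
dpm-reservespace --gspace 4T --lifetime Inf --group atlas --token_desc ATLASDATADISK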

Monday 21 July 2008

DPM space tokens

Below are the first tests of setting up space tokens on the DPM testbed:

https://www.gridpp.ac.uk/wiki/Manchester_DPM

Tuesday 3 June 2008

Slaughtering ATLAS jobs?

These heavy ion jobs have generated lots of discussion in various forums. They are ATLAS heavy ion simulations (Pb-Pb) which are being killed in two ways: (1) by the site's batch system if the queue walltime limit is reached; (2) by the ATLAS pilot because the log file modification time hasn't changed in 24 hrs.

Either way, sites shouldn't worry if they see these; ATLAS production is aware. They're only single-event jobs and you might see massive memory usage too, > 2G/core. :-)

According to Steve, the new WN should allow jobs to gracefully handle batch kills with a suitable delay between SIGTERM and SIGKILL.

Friday 23 May 2008

buggy glite-yaim-core

glite-yaim-core versions >4.0.4-1 no longer recognise VO_$vo_VOMSES even if it is set correctly in the vo.d dir. I'm still wondering how the testing is performed; primary functionality, like completing without self-evident errors, seems to be overlooked. Anyway, looking on the bright side... the bug will be fixed in yaim-core 4.0.5-something. In the meantime I had to downgrade

glite-yaim-core to 4.0.3-13
glite-yaim-lcg-ce to 4.0.2-1
lcg-CE to 3.1.5-0

which was the combination previously working.

CE problems with lcas: an update

lcas/lcmaps have the debug level set to 5 by default. Apparently it can be changed by setting appropriate environment variables for the globus-lcas-lcmaps interface; the variables are actually foreseen in yaim. The errors can be generated by a mistyped DN when a non-VOMS proxy is used. This is a very easy way to generate a DoS attack.
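
If memory serves, the knobs in question are environment variables in the gatekeeper environment along these lines (the names should be treated as an assumption and double-checked against the lcas/lcmaps documentation):

export LCAS_DEBUG_LEVEL=1
export LCMAPS_DEBUG_LEVEL=1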

After downgrading yaim/lcg-CE I've reconfigured the CE and it seems to be working now. I haven't seen any of the debug messages so far.

Thursday 22 May 2008

globus-gatekeeper weirdness

globus-gatekeeper has started to spit out level 4 lcas/lcmaps messages out of nowhere at 4 o'clock in the morning. The log file reaches a few GB in size in a few hours and fills /var/log, breaking the CE. I contacted the Nikhef people for help but haven't received an answer yet. The documentation is not helpful.

Monday 19 May 2008

and again

We have upgraded to 1.8. At least we can look at the space manager while waiting for a solution to the cap on the number of jobs and to the replica manager not replicating.

Thursday 15 May 2008

Still dcache

With Vladimir's help, Sergey managed to start the replica manager by changing a java option in replica.batch. This is as far as it goes, because it still doesn't work, i.e. it doesn't produce replicas. We have just given Vladimir access to the testbed.

It seems Chris Brew has been having the same 'cannot run more than 200 jobs' problem since he upgraded. He sent an email to the dcache user forum. This makes me think that even if the replica manager might help, it will not cure the problem.

Tuesday 13 May 2008

Manchester dcache troubles (2)

The fnal developers are looking at the replica manager issue. The error lines found in the admindomain log also appear at fnal and don't seem to be a problem there. The search continues...

In the meantime we have doubled the memory of all dcache head nodes.

Monday 12 May 2008

The data centre in the sky

Sent adverts out this week, trying to shift our ancient DZero farm which was decommissioned a few years ago. It had a glorious past, with many production records set whilst generating DZero's RunII simulation data. Its dual 700MHz PentiumIII CPUs with 1G RAM can't cope with much these days, and it's certainly not worth the manpower keeping them online. Here is the advert if you're interested.

In other news, our MON box system disk spat its dummy over the weekend; this was one of the three gridpp machines, not bad going after 4 years.

Tuesday 6 May 2008

Manchester SL4 dcache troubles

Since the upgrade to SL4, Manchester is experiencing problems with dcache.

1) pnfs doesn't seem to take a load beyond 200 atlas jobs (it times out). Alessandra has been unable to replicate the problem production is seeing: even starting 200 clients at the same time on the same file production is using, all she could see was the transfer time increasing from 2 seconds to ~190 seconds, but no timeout. On Saturday when she looked at the dashboard she found 99.8% of ~1790 jobs successfully completed in the last 24 hours, which also seems to contradict the 200-jobs-at-a-time statistics and needs to be explained.

2) The replica manager doesn't work anymore, i.e. it doesn't even start, so no resilience is active. The error is a java InitSQL error that, according to the dcache developers, should be caused by a missing parameter. We sent them the requested configuration files and they couldn't find anything wrong with them. We have given Greig access to dcache and he couldn't see anything wrong either. A developer suggested moving to a newer version of dcache to solve the problem, which we had already tried, but the new version has a new problem: from the errors it seems that the schema has changed, but we didn't get a reply about this. In this instance the replica manager starts but cannot insert data in the database. The replica manager obviously helps cut transfer times in half because there is more than one pool node serving the files (I tested this on the SL3 dcache: 60 concurrent clients take at most 35 sec each instead of 70; if the number of clients increases the effect is smaller but still in the range of 30%). In any case we are talking about a handful of seconds, not in the timeout range as happens to production.

3) Finally, even if all these problems were solved, the Space Manager isn't compatible with Resilience, so pools with space tokens will not have the benefit of duplicates. Alessandra already asked 2 months ago what the policy was in case she had to choose. It was agreed that for these initial tests it wasn't a problem.

4) Another problem specific to Atlas is that although Manchester has 2 dcache instances they have insisted on using only 1 for quite some time. This has obviously affected production heavily. After a discussion at CERN they finally agreed to split and use both instances, but that hasn't happened yet.

5) This is minor but equally important for Manchester: VOs with DNS-style names are mishandled by the dcache YAIM. We will open a ticket.

We have applied all the optimizations suggested by the developers, even those not strictly necessary, and nothing has changed. The old dcache instance, without optimizations and with the replica manager working, is taking a load of 300-400 atlas user jobs. According to the local users who are using it for their local production, both reading from it and writing into it, they have an almost 100% success rate (last week 7 job failures out of 2000 jobs submitted).

Applied optimizations:

1) Split pnfs from the dcache head node: we can now run 200 production jobs (but then again, as already said, the old dcache can take 400 jobs and its head node isn't split).
2) Apply postgres optimizations: no results.
3) Apply kernel optimizations for networking from CERN: transfers of small files are 30% faster, but this could also be due to a less loaded cluster.

Most of the problems might come from the attempt to keep the old data, so we will try to install a new dcache instance without it. Although it is not a very sustainable choice, it might help to understand what the problem is.

Wednesday 9 April 2008

Athena release 14 - new dependency

Athena release 14 has a new dependency on the package 'libgfortran'. Sites with local Atlas users may want to check they have it. The runtime error message is rather difficult to decipher; the buildtime error, however, is explicit. I've added the package to the required packages twiki page.
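
A quick way to check a worker node (and install the package if it is missing), for example:

rpm -q libgfortran || yum install -y libgfortran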

Monday 4 February 2008

Power outage

A site-wide power outage occurred at Lancaster Uni this evening. The juice is now flowing, but some intervention is required tomorrow morning before we're back to normal operations.

I hate java!

#$%^&*@!!!!!!

Wednesday 30 January 2008

Liverpool update

From Mike's reply:

* We'll stay with dcache and are about to rebuild the whole SE (and the whole cluster, including a new multi-core CE) when we shut down for a week soon to install SL4. Everything is under test at present, and we are upgrading the rack software servers to 250GB RAID1 to cope with the 100GB size of the ATLAS code.

* We are still testing Puppet (on our non-LCG cluster) as our preferred solution. It looks fine, but we are not yet sure it will scale to many hundreds of nodes.