Friday 15 June 2007

Lancaster Weekly Update - the Sequel

A bit of an unexciting week. A lot of intermittent, short (single-test) replica manager failures might point to a small stability issue with our SE; however, SAM problems this week have prevented me from finding the details of each failure. The best I could do, whenever we failed a test, was to poke our SRM to make sure that it was working. The trouble looking at the SAM records also made this week's weekly report for Lancaster quite dull.

After last week's PNFS move, and postgres mysteriously behaving itself after being restarted a few times on Tuesday, the CPU load on our SRM admin node is now under control. This should greatly improve our performance and make timeouts caused by our end a thing of the past. Fingers crossed.

Another notable tweak this week was an increase in the number of pool partitions given to ATLAS: they now have exclusive access to 5/6 of our dCache. Our dCache is likely to grow in the near future as we reclaim a pool node that was being used for testing, which will increase our SRM capacity by over 10TB. This extra space will be split in the same way as the rest of the dCache.

My last job with the SRM (before we end up upgrading to the next dCache version, whenever that comes) is to deal with a replica infestation. During a test of the replica manager quite a while ago, we ended up with a number of files replicated 3-4 times, and for some reason all replicas were marked as precious, which prevents them from being cleaned up via the usual mechanisms. Attempts to force the replica manager to clean up after itself have failed; even giving it weeks to do its job yielded no results. It looks like we might need a VERY carefully written script to clean things up and remove the few TB of "dead space" we have at the moment.
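
For what it's worth, the careful half of such a script is probably the planning step: work out which copies are surplus and print them, without touching anything. Below is a minimal Python sketch of that step only. It assumes we can first dump a list of (pnfsid, pool) pairs from the pools; the input format, the function name `plan_cleanup` and the pool names are all made up for illustration, and it deliberately contains no dCache commands.

```python
from collections import defaultdict


def plan_cleanup(replicas, keep=1):
    """Return {pnfsid: [pools holding a surplus copy]}. Dry run: nothing is deleted.

    replicas: iterable of (pnfsid, pool) pairs (hypothetical dump format)
    keep:     number of copies of each file to leave in place
    """
    if keep < 1:
        raise ValueError("refusing to plan removal of every copy of a file")

    # Group the pools by file, ignoring duplicate lines in the dump.
    pools_by_file = defaultdict(set)
    for pnfsid, pool in replicas:
        pools_by_file[pnfsid].add(pool)

    plan = {}
    for pnfsid, pools in pools_by_file.items():
        ordered = sorted(pools)   # deterministic choice of which copies stay
        surplus = ordered[keep:]  # everything beyond the first `keep` copies
        if surplus:
            plan[pnfsid] = surplus
    return plan


if __name__ == "__main__":
    # Made-up example data: one file with three copies, one with a single copy.
    listing = [
        ("0001", "pool_a"),
        ("0001", "pool_b"),
        ("0001", "pool_c"),
        ("0002", "pool_a"),
    ]
    for pnfsid, pools in sorted(plan_cleanup(listing).items()):
        print(pnfsid, "surplus copies on:", ", ".join(pools))
```

The plan can then be checked by hand before anything is removed, and a file with only one copy never appears in it.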
