Monday 24 September 2012

Manchester network improvements in Graphs

As I posted here and here, we have upgraded the network infrastructure within the Manchester Tier2. Below are some of the measured benefits of this upgrade so far.

Here is the improvement in the outgoing traffic in the atlas sonar tests between Manchester and BNL after we upgraded the cisco blades and replaced the rack switches with the 10G ones:

Here instead is the throughput improvement after I enabled the 10Gbps interface on the perfsonar machine. The test case is Oxford, which also has the 10Gbps interfaces enabled:


And here are more general rates with different sites; the 10Gbps is evident with the sites that have enabled it:


The perfsonar tests have also helped to debug the poor atlas FTS rates the UK had with FZK (https://ggus.eu/ws/ticket_info.php?ticket=84008). Manchester (and Glasgow) had already tried to investigate the problem last year with iperf within atlas, without much success because the measurements were not taken systematically. This year the problem was finally pinned on the FZK firewall, and the improvement gained by bypassing it is shown below.


This is also reflected in the improved rates in the atlas sonar tests between Manchester and FZK, since the data servers' subnets now bypass the firewall too.


Finally, here is the increased throughput in data distribution to/from other sites as an atlas T2D. August rates were down due to a combination of problems with the storage, but there has been a growing trend since the rack switches and the cisco blades were upgraded.


Thursday 9 August 2012

10GBE Network cards installation in Manchester

This is a collection of recipes I used to install the 10GBE cards. As I said in a previous post, we chose to go 10GBASE-T, so we bought X520-T2 cards. They use the same chipset as the X520-DA2, so many things are in common.

The new DELL R610 and C6100 were delivered with the cards already installed, although, because the DA2 and T2 share the same chipset, the C6100 were delivered with the wrong connectors and we are now waiting for a replacement. For the old Viglen WNs and storage we bought additional cards that have to be inserted one by one.

I started the installation process from the R610s because a) they had the cards and b) the perfsonar machines are R610s. The aim is to use these cards as primaries and kickstart from them. By default pxe booting is not enabled, so one has to get bootutil from the Intel site. What one downloads is for some reason a Windows executable, but once it is unpacked there are directories for other operating systems. The easiest thing to do is what Andrew has done: zip the unpacked directory and run bootutil from the machine itself, without fussing around with USB sticks or boot disks. That said, it needs the kernel source to compile, so make sure you install the same kernel-devel version as the running kernel.

yum install kernel-devel(-running-kernel-version)
unzip APPS.zip
cd APPS/BootUtils/Linux_x86/
chmod 755 ./install
./install
./bootutil64e -BOOTENABLE=pxe -ALL
./bootutil64e  -UP=Combo -FILE=../BootIMG.FLB -ALL

The first bootutil command enables pxe booting; the second updates the firmware.
After this you can reboot and enter the BIOS to rearrange the order of the network devices to boot from. When this is done you can put the 10GBE interface MAC address in the DHCP configuration and reinstall from there.
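For reference, the DHCP side is just a normal host entry keyed on the 10GBE MAC; a minimal sketch (the host name, MAC and IP addresses here are made-up placeholders) looks like this:

host node001 {
    hardware ethernet 00:1b:21:aa:bb:cc;
    fixed-address 192.168.0.101;
    option host-name "node001";
    next-server 192.168.0.1;      # the pxe/tftp server
    filename "pxelinux.0";
}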

At kickstart time there are some problems with the machine changing the order of the cards; you can solve that using ipappend 2 and ksdevice=bootif in the pxelinux.cfg files, as suggested in the RH docs (see the sketch below). Thanks to Ewan for pointing that out.
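As a rough sketch, the relevant pxelinux.cfg entry looks something like this (the kernel, initrd and kickstart paths are hypothetical placeholders):

default install
label install
    kernel vmlinuz
    append initrd=initrd.img ks=http://install.example.com/ks/wn.cfg ksdevice=bootif
    ipappend 2

ipappend 2 makes pxelinux append a BOOTIF=<mac> argument for the interface the machine actually booted from, and ksdevice=bootif tells anaconda to use that same interface for the installation.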

Still the machine might not come back up with the interface working. There might be two problems here:

1) The X520-T2 interfaces take longer to wake up than their little 1GBE sisters, so it is necessary to insert a delay after the /sbin/ip command in the network scripts. To do this I didn't have to hack anything; I could just set

LINKDELAY=10

in the ifcfg-eth* configuration files and it worked.
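For context, a complete ifcfg file with the delay in place looks roughly like this (the static addressing and the values shown are illustrative assumptions, not our actual config):

# /etc/sysconfig/network-scripts/ifcfg-eth0 (sketch)
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.0.101
NETMASK=255.255.255.0
# wait 10 seconds after bringing the link up before the network scripts carry on
LINKDELAY=10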

2) It is not guaranteed that the 10GBE interface will come up as eth0. There are several ways to stop this from happening.

One is to make sure HWADDR in ifcfg-eth0 is assigned the MAC address of the card the administrator wants and not whatever the system decides. It can be done at kickstart time, but this might mean having a kickstart file for each machine, which we are trying to get away from.

Dan and Chris suggested this might be corrected with udev. The recipe they gave me was this:

cat /etc/udev/rules.d/70-persistent-net.rules
KERNEL=="eth*", ID=="0000:01:00.0", NAME="eth0"
KERNEL=="eth*", ID=="0000:01:00.1", NAME="eth1"
KERNEL=="eth*", ID=="0000:04:00.0", NAME="eth2"
KERNEL=="eth*", ID=="0000:04:00.1", NAME="eth3"


The recipe uses the pci device ID value, which is the same for the same machine types (R610, C6100...). You can get the ID values with lspci | grep Eth. Not essential, but if lspci returns something like Unknown device 151c (rev01) in the description, it just means the pci database is out of date; use update-pciids to refresh it. There are other recipes around if you don't like this one, but this one simplifies a lot the maintenance of the interface naming scheme.

The udev recipe doesn't work if HWADDR is set in the ifcfg-eth* files; if it is, you need to remove it to make udev work. A quick way to do this in every file is

sed -i -r '/^HWADDR.*$/d' ifcfg-eth*

run in the kickstart %post section, and then install the udev file; a sketch of how this can look is below.
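As a rough sketch, and assuming the udev rules file is fetched from a hypothetical local web server, the %post fragment could look like this:

%post
# strip any HWADDR lines the installer wrote, so udev is free to rename the interfaces
cd /etc/sysconfig/network-scripts
sed -i -r '/^HWADDR.*$/d' ifcfg-eth*
# install the udev naming rules (the URL is a placeholder for wherever you keep the file)
wget -q -O /etc/udev/rules.d/70-persistent-net.rules \
    http://install.example.com/files/70-persistent-net.rules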

10GBE cards might need different TCP tuning in /etc/sysctl.conf. For now I took the perfsonar machine settings, which are similar to something already discussed a long time ago:

# maximum socket receive/send buffer sizes (32MB)
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
# TCP autotuning buffer limits: min, default, max (16MB)
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 87380 16777216
# allow a longer queue of packets waiting to be processed by the kernel
net.core.netdev_max_backlog = 30000
# don't cache ssthresh and other metrics from previous connections
net.ipv4.tcp_no_metrics_save = 1
# use the H-TCP congestion control algorithm
net.ipv4.tcp_congestion_control = htcp
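
These can be loaded without a reboot by re-reading /etc/sysctl.conf:

sysctl -p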


The effects of moving to 10GBE can be seen very well in the perfsonar tests.

Friday 20 July 2012

Jobs with memory leaks containment

This week some sites suffered from extremely memory-hungry jobs using up to 16GB of memory and killing the nodes. These were most likely due to memory leaks. The user cancelled all of them before he was even contacted, but not before causing some annoyance.

We have had some discussion about how to fix this, and atlas so far has asked us not to limit on memory because their jobs briefly use more than what is officially requested. And this is true: most of their jobs do this, in fact. According to the logs the production jobs use up to ~3.5GB mem and slightly less than 5GB vmem. See the plot below for one random day (other days are similar).

To avoid killing everything while still putting up a barrier against memory leaks, what I'm going to do in Manchester is set a limit for mem of 4GB and a limit for vmem of 5GB.

If you are worried about memory leaks you might want to go through a similar check. If you are not monitoring memory consumption on a per-job basis you can parse your logs. For PBS I used this command to produce the plot above:

grep atlprd /var/spool/pbs/server_priv/accounting/20120716| awk '{ print $17, $19, $20}'| grep status=0|cut -f3,4 -d'='| sed 's/resources_used.vmem=//'|sort -n|sed 's/kb//g'
 

The numbers are already sorted in numerical order, so the last one is the highest (mem, vmem) a job has used that day. atlprd is the atlas production group, which you can replace with other groups. Atlas user jobs have broadly similar usage up to a point, and then every day you might find a handful of crazy numbers like 85GB vmem and 40GB mem. These are the jobs we aim to kill.

I thought the batch system was the simplest way because it is only two commands in PBS, but after a lot of reading and a week of testing it turns out it is not possible to over-allocate memory without affecting the scheduling and ending up with fewer jobs on the nodes. This is what I found out:

There are various memory parameters that can be set in PBS:

(p)vmem: virtual memory. PBS doesn't interpret vmem as the almost unlimited address space; if you set this value it will interpret it, for scheduling purposes, as the memory+swap available. It might be different in later versions, but that's what happens in torque 2.3.6.
(p)mem: physical memory: that's your RAM.

When there is a p in front, the limit applies per process rather than per job.

If you set them what happens is as follows:

ALL: if a job arrives without memory settings, the batch system will assign these limits as the memory allocated to the job, not just as a limit the job must not exceed.
ALL: if a job arrives with memory resource settings that exceed the limits, it will be rejected.
(p)vmem, pmem: if a job exceeds these settings at run time it will be killed, as these parameters set limits at the OS level.
mem: if a job exceeds this limit at run time it will not get killed; this is apparently due to a change in the libraries.

To check how the different parameters affect the jobs, you can submit this csh command directly to PBS and play with the parameters:

echo 'csh -c limit' | qsub -l vmem=5000000kb,pmem=1GB,mem=2GB,nodes=1:ppn=2

If you want to set these parameters you have to do the following:

qmgr
qmgr: set queue long resources_max.vmem = 5gb
qmgr: set queue long resources_max.mem = 4gb
qmgr: set queue long resources_max.pmem = 4gb

These settings will affect the whole queue, so if you are worried about other VOs you might want to check what sort of memory usage they have, although I think only CMS might have similar usage; I know for sure LHCb uses less. And, as said above, this will affect the scheduling.

Update 02/08/2012

RAL and Nikhef use a maui parameter to correct the over-allocation problem:

NODEMEMOVERCOMMITFACTOR         1.5

This will cause maui to allocate up to 1.5 times the memory actually available on the nodes. So if a machine has 2GB of memory, a 1.5 factor allows 3GB to be allocated. The same applies to the other memory parameters described above. The factor can of course be tailored to your site.

On the atlas side there is a memory parameter that can be set in panda. It sets a ulimit on vmem on a per-process basis in the panda wrapper. It didn't seem to have an effect on the memory seen by the batch system, but that might be because forked processes are double-counted by PBS, which opens a whole different can of worms.

Thursday 5 April 2012

The Big Upgrade in pictures

New Cisco blades, engines and power supplies

DELL boxes, among which the new switches
Aerial view of the old cabling
Frontal view of the mess
Cables unplugged from the cisco
Old Cisco blades with the services racks still connected
New cat6a cisco cabling, aerial view: nice and tidy
Frontal view of the new cisco blades and cabling: nice and tidy
Old and new rack switches front view
Old and new rack switches rear view
Emptying and reorganising the racks
Empty racks ready to be filled with new machines
Old DELLs cemetery
Old cables cemetery. All the cat5e cables going under the floor from the racks to the cisco, half of the cables from the rack switches to the machines, and all the patch cables in front of the cisco shown above have gone.
All the racks but two now have the new switches, but the machines are still connected with cat5e cables. Upgrading the network cards will be done in Phase two, one rack at a time, to minimize service disruption.

The downtime lasted 6 days. Everybody who was involved did a great job, and the choice of 10GBASE-T was a good one because port auto-negotiation is allowing us to run at 3 different speeds on the same switches: the PDUs at 100Mbps, the old WNs and storage at 1Gbps, and the connection with the cisco at 10Gbps. We also kept one of the old cisco blades for connections that don't require 10Gbps, such as the out-of-band management cables; in addition, two racks of servers that will be upgraded at a later stage are still connected to the cisco at 1Gbps. And we finished perfectly in time for the start of data taking (and Easter). :)

Saturday 31 March 2012

So long and thanks for all the fish


In 2010 we had already decommissioned half of the original, mythical 2000-CPU (1800 for us) EM64T Dell cluster that allowed us to be the 4th of the top 10 countries in EGEE in 2007.

This year we are decommissioning the last 430 machines that served us so well for 6 years and 2 months. So... so long and thanks for all the fish.

Saturday 14 January 2012

DPM database file systems synchronization

The synchronisation of the DPM database with the data servers' file systems has been a long-standing issue. Last week we had a crash that made it more imperative to check all the files, and I eventually wrote a bash script that makes use of the GridPP DPM admin tools. I don't think this should be the final version, but I'm quicker with bash than with python and therefore started with that. Hopefully later in the year I'll have more time to write a cleaner version in python, based on this one, that can be inserted in the admin tools. It does the following:

1) Create a list of files that are in the DB but not on disk.
2) Create a list of files that are on disk but not in the DB.
3) Create a list of SURLs, from the list of files in the DB but not on disk, to declare lost (this is mostly for atlas but could be used by LFC administrators for other VOs).
4) If not in dry-run mode, proceed to delete the orphan files and the orphan entries in the DB.
5) Print stats of how many files were in either list.

Although I put in a few protections, this script should be run with care, and unless in dry-run mode it shouldn't be run automatically AT ALL. In dry-run mode, however, it will tell you how many files are lost, which is a good metric to monitor regularly as well as when there is a big crash.

If you want to run it, it has to run on the data servers, where there is access to the file system. As it is now it requires a modified version of /opt/lcg/etc/DPMINFO that points to the head node rather than localhost, because one of the admin tools used does a direct mysql query. For the same reason it also requires the dpminfo user to have mysql select privileges from the data servers. This is the part that would really benefit from a rewrite in python, perhaps with proper API use as the other tool does. I also had to heavily parse the output of the tools, which weren't created exactly for this purpose; this too could be avoided in a python script. There are no options, but all the variables that could be options to customize the script with your local settings (head node, fs mount point, dry_run) are easily found at the top.
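For the mysql privileges, a grant along these lines on the head node should be enough; the DPM database names (cns_db, dpm_db), the disk server host name and the password are assumptions here, so check them against your own setup:

# run on the head node, once per disk server; host name and password are placeholders
mysql -u root -p <<'EOF'
GRANT SELECT ON cns_db.* TO 'dpminfo'@'diskserver01.example.com' IDENTIFIED BY 'secret';
GRANT SELECT ON dpm_db.* TO 'dpminfo'@'diskserver01.example.com' IDENTIFIED BY 'secret';
FLUSH PRIVILEGES;
EOF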

Creating the lists takes very little time, no more than 3 minutes on my system, but it depends mostly on how busy your head node is.

If you want to do a cleanup instead, the time is proportional to how many files have been lost and can take several hours, since it does one DB operation per file. The time to delete the orphan files also depends on how many there are and how big they are, but it should take less than the DB cleanup.

The script is here: http://www.sysadmin.hep.ac.uk/svn/fabric-management/dpm/dpm-synchronise-disk-db.sh