Monday 9 February 2009

Jobmanager pbsqueue cache locked

Spent last week tracking down a problem where jobs were finishing in the batch system but the jobmanager wasn't noticing. As a result, jobs never 'completed', which had two major impacts: 1. Steve's test jobs all failed through timeouts, and 2. ATLAS production stopped, because the pilots appeared never to complete and so no further pilots were sent.

Some serious detective work was undertaken by Maarten and Andrey, and it turned out the pbsqueue cache wasn't being updated because of a stale lock file in the affected users' ~/.lcgjm/ directories. The lock files can be found with this on the CE:

find /home/*/.lcgjm/pbsqueue.cache.proc.{hold.localhost.*,locked} -mtime +7 -ls
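For anyone who wants to try the check against a different home root or age threshold, here is a rough sketch of the same search wrapped in a shell function. The function name find_stale_locks and its two arguments are my own illustration, not part of lcg-CE, and it assumes a find that supports -mindepth/-maxdepth (GNU or BSD). It only lists files and deletes nothing.

```shell
# Hypothetical helper (not part of lcg-CE): list lcgjm lock files that are
# older than a given number of days under a given home root.
#   $1 = home root (default /home)
#   $2 = minimum age in days (default 7)
find_stale_locks() {
    home_root="${1:-/home}"
    min_age_days="${2:-7}"
    # Depth 3 below the root is <root>/<user>/.lcgjm/<file>.
    # The two -name patterns mirror the ones in the find command above.
    find "$home_root" -mindepth 3 -maxdepth 3 -type f \
        -path '*/.lcgjm/*' \
        \( -name 'pbsqueue.cache.proc.hold.localhost.*' \
           -o -name 'pbsqueue.cache.proc.locked' \) \
        -mtime +"$min_age_days" -print
}
```

Calling find_stale_locks with no arguments looks under /home for lock files more than 7 days old; find_stale_locks /home 1 would catch anything more than a day old.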

We had 6 users affected (alas, our important ones!), all with lock files dated Dec 22. Apparently the lcgpbs script Helper.pm would produce these whenever hostname returned 'localhost'. And yes, on December 22 we had maintenance work during which DHCP was unavailable, and for a brief period the CE hostname was 'localhost'. Note this is an lcg-CE under glite-3.1. Happy days are here again!
