Friday 10 June 2011

DPM Optimization

My quest to optimize DPM continues. Bottlenecks are like Russian dolls and hide behind each other. After optimizing the data servers increasing the block device read ahead; enabling lacp on network channel bonding and multiplying the atlas hotdisk files there is still a problem with mysql on the head node which causes time outs.

When atlas ramps up there is often a increase of connection in TIME_WAIT. I observed >2600 at times. The mysql database becomes completely unresponsive and causes the time outs. Restarting the database causes the connections to finally close and the database to resume normal activity. Although a restart might alleviate the problem as usual it's not a cure. So I went on a quest. What follows might not alleviate my specific problem, I haven't tested in production yet, but it certainly helps with another: DB reload.

Sam already wrote some performance tuning tips here: Performance and Tuning most notably the setting of innodb_buffer_pool_size. After a discussion on the DPM user forum and some testing this is what I'd add:

I set "DPM REQCLEAN 3m" when I upgraded to DPM 1.7.4 and this, after a reload, has reduced Manchester DB file size from 17GB to 7.6GB. Dumping the db took 7m34s. I then reloaded it with different combinations of suggested my.cnf innodb parameters and the effects of some of them are dramatic.

The default parameters should definitely be avoided. Reloading a database with the default parameters takes several hours. Last time it took 17-18 hours, this time I interrupted after 4.

With a combination of the parameters suggested by Maarten the time is drastically reduced. In particular the most effective have been setting innodb_buffer_pool_size and innodb_log_file_size. Below are the results of the upload tests I made in decreasing order of time. I then followed Jean Philippe suggestion to drop the requests tables. Dropping the tables took several minutes and it was slightly faster with a single db file. After I dropped the tables and the indexes ibdata1 size dropped to 1.2GB and using combination 4 below it took 1m7s to dump and 5m7s to reload. With one file per table configuration reloading was slightly faster but after I dropped the requests tables there was no difference and it is also balanced by the fact that deletion seems slower and the effects are probably more visible when the database is bigger so these small tests don't give any compelling reason in favour nor against for now.

This are steps that help reducing the time it takes to reload the database:

1) Enable REQCLEAN in shift.conf (I set it to 3 months to comply with security requirements.)
2) set innodb_buffer_pool_size in my.cnf (I set it at 10% of the machine memory and I couldn't see much difference eventually when I set it to 22.5% but in production it might be another story with repeated queries for the same input files)
3) set innodb_log_file_size in my.cnf (didn't give much thought to this, Maarten value of 50MB seemed good enough. Binary log files need to be removed to enable this and the database restarted but check the docs this might not be a valid strategy if you make heavier use of the binary logs.)
4) set innodb_flush_log_at_trx_commit = 2 in my.cnf (although this parameter seems less effective during reload it might be useful in production 2 is slightly safer than 0).
5) Use the script Jean-Philippe gave me to drop the requests tables before an upgrade.

Hopefully they will help stop also the time outs.

Tests:

COMBINATION 1

innodb_buffer_pool_size = 400MB
# innodb_log_file_size = 50MB
innodb_flush_log_at_trx_commit = 2
# innodb_file_per_table

real 167m30.226s
user 1m41.860s
sys 0m9.987s

============================
COMBINATION 2
innodb_buffer_pool_size = 900MB
# innodb_log_file_size = 50MB
# innodb_flush_log_at_trx_commit = 2
# innodb_file_per_table

real 155m2.996s
user 1m40.843s
sys 0m9.935s

===========================
COMBINATION 3
innodb_buffer_pool_size = 900MB
innodb_log_file_size = 50MB
# innodb_flush_log_at_trx_commit = 2
# innodb_file_per_table

real 49m2.683s
user 1m39.137s
sys 0m9.902s
===========================
COMBINATION 4
innodb_buffer_pool_size = 400MB
innodb_log_file_size = 50MB
innodb_flush_log_at_trx_commit = 2 <-- test also with 0 instead of 2 but it didn't change the time it took and 2 is slightly safer
# innodb_file_per_table

real 48m32.398s
user 1m40.638s
sys 0m9.733s
===========================
COMBINATION 5
innodb_buffer_pool_size = 900MB
innodb_log_file_size = 50MB
innodb_flush_log_at_trx_commit = 2
innodb_file_per_table

real 47m25.109s
user 1m39.230s
sys 0m9.985s
===========================
COMBINATION 6
innodb_buffer_pool_size = 400MB
innodb_log_file_size = 50MB
innodb_flush_log_at_trx_commit = 2
innodb_file_per_table

real 46m46.850s
user 1m40.378s
sys 0m9.950s
===========================

No comments: