We had to take the secondary DB server "out of action" today because the MySQL relay log was corrupted during an unclean reboot, and there was no feasible way to fix it quickly enough.
After the server was taken offline, a database dump was made from a development database server (which contains data identical to the master server's, but cannot be used in production) and transferred to the secondary production server. Unfortunately, the problems with the HDD caused the server to lock up when trying to restore data from the dump, so the HDD replacement that was scheduled for later had to be done immediately.
As I'm typing this, the HDD is being replaced by the techs at our ISP. After the HDD swap, the RAID array needs to be rebuilt on the server, so it'll be a while before the server is online, and even longer before the database dump is restored and the server is ready to handle queries again.
Unfortunately, a single DB server cannot cope that well with the demands of all our sites. Even though the queries are mostly SELECTs, an odd UPDATE or INSERT can lock a table for long enough to cause, for example, forum thread views to take a long time to show up or to time out altogether. The number of queries is reasonably low thanks to heavy caching (both query caching and file caching). The database server currently handles around 200 queries per second.
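To illustrate why one slow write can stall many reads, here is a rough Python sketch of the locking effect described above. It is a toy model, not a measurement of our servers: it assumes a write holds an exclusive lock on the whole table and that every SELECT arriving during that window simply waits until the lock is released. The function names, timings and the timeout value are all made up for the example.

```python
def read_wait_times(write_locks, read_arrivals):
    """Return how long each read waits, in milliseconds.

    write_locks: list of (start_ms, duration_ms) exclusive table locks,
        assumed not to overlap each other (a simplifying assumption).
    read_arrivals: list of arrival times (ms) for SELECT queries.
    """
    waits = []
    for arrival in read_arrivals:
        wait = 0
        for start, duration in write_locks:
            end = start + duration
            # A read arriving while the lock is held waits until it ends.
            if start <= arrival < end:
                wait = end - arrival
                break
        waits.append(wait)
    return waits


def count_timeouts(waits, timeout_ms):
    """Count reads whose wait exceeds the (hypothetical) timeout."""
    return sum(1 for w in waits if w > timeout_ms)


# Hypothetical numbers: one UPDATE holds the lock from 1000 ms to 3000 ms.
locks = [(1000, 2000)]
reads = [500, 1500, 2900, 3000]
waits = read_wait_times(locks, reads)
print(waits)  # [0, 1500, 100, 0]
print(count_timeouts(waits, timeout_ms=1000))  # 1
```

With a second server taking a share of the SELECTs, fewer reads land on the locked table at any given moment, which is why losing the secondary is noticeable even at a modest query rate.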
Update
The server is now back online and uncompressing the dump. Instead of a single drive in the array, the cable and the PERC 4 RAID adapter on the server were swapped. That's because the errors given by the server related to the entire array, not a single disk.
Update 2
The database restore is now complete, and the secondary server should be back in action shortly.

