Saturday, July 7, 2012

Clonezilla, PostgreSQL and General Locking

This morning I got a support call, nothing amazing and perhaps expected.

One of my clients were reporting very slow behavior on adding data to the application. After jumping into my vehicle, deflecting incoming traffic, I saw that the situation was critical indeed.

The puzzle began with a piece of information revealed in the pg_stat_activity table, several queries were queuing up for no reason. Locks were held. Screams were yelled.

This particular client runs a mission critical system on a straighforward Apache/PHP/PostgreSQL 9.0 stack, and this system merrily did it's work until a recent clone and restore using Clonezilla. All was well when after system boot and several tests, until a few days later...

Symptoms for the problem were:

  1. Simple queries taking forever to complete (some of them 5 hours).
  2. Transactional locks were held due to these queries.
  3. Piling up of new queries.

To make a long battle opera short, the eventual desparate solution was to VACUUM the tables that caused the delay.

The dust eventually settled, and all was well.

My recommendation for anybody doing a bare-metal clone and restore of a system running a database, always perform a VACUUM/ANALYZE or equivalent on the database after a full system restore. I thus theorise that clone/restore does not place the data files in exactly the same position.