There are probably easier (or harder) ways to do this, but my back was up against a wall yesterday: a very important virtual machine was in a very bad state after a series of hardware issues with its host. It was basically one of those perfect storms of bad backup, bad host and bad VM.
Apparently, backups for this machine had been failing silently, in a way that gave us no clue anything was wrong, and the host (VMware ESXi 5.0) was building new snapshots of the drive over and over again every time Veeam tried to take a backup.
Worse, every time you tried to do a VMware-level operation with the machine, it complained about the disks with something like “Error caused by file /vmfs/volumes/########-########-####-############/VM-Name/VM-Name-0000001.vmdk” and failed out. Little more could be gleaned from SSHing into the host and checking dmesg, but it was plain the disk was being weird in a software way, not a hardware way. Luckily, the virtual machine itself could read the whole disk just fine, and it still ran just fine. So I was stuck with flaky hardware and no way to move the VM off of it.
But I was able to recover the VM by throwing this Hail Mary pass. Fair warning, this will probably take a lot of downtime. But it’s better than losing that very important VM altogether.
I’m sure there are better or worse tools to use than the Ubuntu 12.04 server ISO that I had handy, but this worked just fine for my purposes. Feel free to suggest others. I know HJ Hornbeck is more partial to ddrescue than vanilla dd, but I don’t need any of those bells and whistles myself.
– Add identically sized drive(s) to VM
– Set the VM to boot into the BIOS setup screen on next boot
– Set the CD drive to Client Device mode (or, if you have patience, upload the Ubuntu 12.04 server ISO to the datastore)
– Using console, mount ISO
– Set boot sequence to boot from CD first
– Save bios and boot from CD
– Pick recovery mode
– Enter your way through to where it wants to mount a root filesystem
– Pick “launch shell in installer environment”
– dmesg | grep sd — should show you your identical drives, one with partition, one without.
– dd if=/dev/sda of=/dev/sdb bs=4k conv=sync,noerror &
– The ampersand puts that task in the background so you keep the shell. To see progress, find the PID of the dd process you just launched via ps, then:
– kill -SIGUSR1 ####
– Number of records * 4096 = number of bytes it’s done so far (4096 because of bs=4k). This is the closest thing to an actual progress report I have been able to get.
– When it’s done, it’ll spit out the number of records again without you having sent a USR1 signal.
– Shut down the machine
– Take note of the SCSI connection, then remove the old drive (don’t delete it in case you need to recover or this didn’t work)
– Change the new drive’s SCSI port to what the old drive’s was
– Set the VM to boot into the BIOS setup screen again
– Change boot order back to usual
– Try booting the machine — it should work now
– Try migration, backing up, etc. — it should work now
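For what it’s worth, the dd-and-poll steps in the list above could be wrapped up as something like the sketch below. The function names clone_with_progress and records_to_bytes and the 30-second default interval are made up for illustration, not part of any tool, and it assumes a dd that prints its record counts when it receives SIGUSR1 (GNU dd does; other builds may not).

```shell
#!/bin/sh
# Sketch only: function names and the default interval are illustrative assumptions.

# Convert a dd record count into bytes (records * block size, 4096 for bs=4k).
records_to_bytes() {
    records=$1
    block_size=${2:-4096}
    echo $((records * block_size))
}

# Copy SRC to DST with the same dd options as in the steps above, and send
# SIGUSR1 every INTERVAL seconds so dd prints its record counts to stderr.
# Assumes a dd that reports on SIGUSR1 (GNU dd does).
clone_with_progress() {
    src=$1
    dst=$2
    interval=${3:-30}

    dd if="$src" of="$dst" bs=4k conv=sync,noerror &
    dd_pid=$!

    # Sleep before each signal so dd has time to install its handler.
    while kill -0 "$dd_pid" 2>/dev/null; do
        sleep "$interval"
        kill -USR1 "$dd_pid" 2>/dev/null || true
    done

    # Collect dd's exit status; dd prints its final totals on its own.
    wait "$dd_pid"
}
```

In the rescue shell that would be something like clone_with_progress /dev/sda /dev/sdb 30, and records_to_bytes turns whatever record count dd reports into bytes. Double-check which device is which before running it, since swapping them overwrites the good disk.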
I’m mostly adding this to the blog because, well, it’s all based on public knowledge, so why write out this procedure and only keep it at work?