Yep, Solid State Drives (SSDs) go bad; ask me how I know.
It was Wednesday morning and, before I began my work day, I decided to spin up a new Ubuntu Desktop VM on which I was going to run a variety of tests. I accessed my Proxmox hypervisor, created the VM, pointed it at the ISO image, and off it went. Or not.
Yeah, shortly after the VM booted into its setup mode, the entire Proxmox node said “buh-bye”. Pings kept working, but everything else was, like, gone. So, after staring at the console and repeatedly mashing the keyboard with no response, I was left with no alternative but a hard reset. It came back up and acted normal, so I went about my work day while it continued to chug along.
When I broke for lunch, I attempted the VM setup again and was met with the same result. I also noticed that the SMART ATA error count on my boot drive (the SSD) was increasing. With that, I set about getting myself up to speed on what data I should be saving from the node to facilitate recovery to a new drive.
Long story short, by the next day I had the node back up and running on a new boot drive (a 3TB spinner in place of the 1TB SSD). The only thing I “lost” (other than my time) was the boot partition for my HandBrake VM, which for whatever reason resided on that boot drive instead of the ZFS pools everything else lives on.
To recover, I first had to determine which files to preserve from the node (pretty much the /etc/pve directory). Then, once the system was running again on a fresh install, I had to learn how to import existing ZFS pools into a system they had never been attached to (the new install knew nothing about them). Once I did that, and rebooted a couple of times, I was able to start the VMs like nothing had happened (which was pretty satisfying, not gonna lie).
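For anyone facing the same pool-adoption step, here is a minimal sketch of the idea, not a record of exactly what I typed. It assumes the stock `zpool import` command: with no arguments it lists pools available for import, with a pool name it imports that pool, and `-f` forces the import when the pool was last in use by a different system. The pool name `tank` and both helper functions are hypothetical.

```python
import subprocess


def zpool_import_command(pool=None, force=False):
    """Build a `zpool import` command line.

    With no pool name the command only lists importable pools.
    """
    cmd = ["zpool", "import"]
    if force:
        # -f: import even though the pool was last used by another system
        cmd.append("-f")
    if pool:
        cmd.append(pool)
    return cmd


def run(cmd):
    """Run a command as-is and return its stdout; raises on a non-zero exit."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

On the fresh install you would call `run(zpool_import_command())` as root to see what is out there, then `run(zpool_import_command("tank"))` with your own pool name, reaching for `force=True` only if ZFS refuses because the old install never exported the pool.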
After the dust settled, I took the “bad” SSD and tethered it to another system I had running (hot swap cages are awesome, btw!) so I could pull off the ISOs and other data I wanted to save. It was during the file copy that I discovered the Ubuntu Desktop ISO sat on the drive's bad blocks, so the meltdowns were being triggered whenever the system attempted to read them.
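If you would rather find out which files sit on bad blocks before a copy falls over, something like the following could help. It is a rough sketch under a few assumptions: the function name is made up, a failed read is assumed to surface as an `OSError`, and on a truly sick drive a read can stall for a long time before it errors out. Data already sitting in the page cache may also read back fine even though the underlying blocks are bad.

```python
import os


def find_unreadable_chunks(path, chunk_size=1024 * 1024):
    """Return the byte offsets of chunks that raise an I/O error when read."""
    bad_offsets = []
    size = os.path.getsize(path)
    # buffering=0 so each read() goes to the OS at the offset we asked for
    with open(path, "rb", buffering=0) as f:
        offset = 0
        while offset < size:
            try:
                f.seek(offset)
                f.read(chunk_size)
            except OSError:
                # Remember the failing chunk and keep scanning past it
                bad_offsets.append(offset)
            offset += chunk_size
    return bad_offsets
```

An empty list means every chunk was readable; anything else tells you roughly where in the file the trouble starts.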
Now I’m working on a script that will add the Proxmox node files to my backup strategy so that I always have them off-box, just in case.
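The script is still a work in progress, but the core of it could look something like this sketch. It simply tars up a directory into a timestamped archive; the function name, the archive prefix, and the destination path in the usage note are all placeholders I made up for illustration, and getting the archive off-box (NFS mount, rsync, whatever) is left out.

```python
import tarfile
import time
from pathlib import Path


def backup_node_config(source_dir, dest_dir, prefix="pve-config"):
    """Archive source_dir into a timestamped .tar.gz inside dest_dir."""
    source = Path(source_dir)
    dest = Path(dest_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"{source} is not a directory")
    dest.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{prefix}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store entries under the directory's own name, e.g. "pve/..."
        tar.add(source, arcname=source.name)
    return archive
```

Run as root from cron, a call like `backup_node_config("/etc/pve", "/mnt/backup/proxmox")` would drop a fresh archive on each run, where `/mnt/backup/proxmox` stands in for wherever your off-box storage is mounted.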