My home lab is turning me into a shut-in
It was 14-Dec and my wife had arranged a short one-night getaway to Khao Yai. I figured it would be a good chance to test all the measures I'd put in place to make sure my home lab can withstand reasonable failures. All 3 of the HP servers, the Unifi Dream Machine SE router and my switches were now hooked up to the UPS, and I'd hooked up my 3BB fiber modem as well so that the internet connection could ride out a power failure. UPS load is about 300W, and at that load the UPS should be able to last through roughly 45-50 minutes of power outage.
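As a rough sanity check on those UPS numbers, here is a minimal sketch of the arithmetic. It assumes a simple linear model (energy = power x time), which real batteries don't follow exactly, so treat the results as a ballpark only. The function names are made up for illustration, and the 150W load in the usage note is a hypothetical figure, not a measurement from my setup.

```python
def implied_usable_wh(load_w: float, runtime_min: float) -> float:
    """Energy (Wh) the UPS has to deliver to carry load_w watts for runtime_min minutes."""
    return load_w * runtime_min / 60.0


def estimated_runtime_min(usable_wh: float, load_w: float) -> float:
    """Runtime (minutes) at a given load, assuming the same usable energy (linear model)."""
    return usable_wh / load_w * 60.0


# Figures from the post: about 300W of load, 45-50 minutes of expected runtime.
low_wh = implied_usable_wh(300, 45)
high_wh = implied_usable_wh(300, 50)
print(f"Implied usable energy: {low_wh:.0f}-{high_wh:.0f} Wh")
```

Under the same assumption, `estimated_runtime_min(low_wh, 150)` gives a feel for how much longer the UPS might last if half the load were shed during an outage.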
Fast forward to this morning, 15-Dec: I did a quick check on my Proxmox cluster and found that at around 8pm last night, pve01 had hung and gone unresponsive. Recovery was successful, as I could see that the Proxmox HA configuration for the 3 VMs I had running on pve01 had kicked in and migrated them to pve02 and pve03.
Uptime for my VMs shows the same as well.
When I got back home just now, I simply restarted pve01 and it came back up normally. The data was back in sync within about 5-10 minutes, once the Ceph cluster had resynchronized all the OSDs.
The reason for pve01 going unresponsive? It's an old issue, with the BIOS complaining that PCIe Slot 0 went unresponsive or something. The thing is that I already moved my NVMe adapter from Slot 0 to another slot (meaning Slot 0 is unused) and this is still happening. Ugh. Irritating.
So what do I do now? I was only away for a day and this already happened (the cost of buying second-hand machines from Facebook Marketplace). While pve01 was down, everything kept running without data loss, but if either pve02 or pve03 had run into problems as well, I would have been out of redundancy. It was fine for me to come back the next day and restart pve01 in this case, but what if I were away for a week or more? It's clear I need a way to remotely restart my machines.
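To make the "out of redundancy" point concrete, here is a small sketch of the majority-quorum arithmetic behind a 3-node cluster. It assumes one vote per node, which is the usual default but something I haven't customized or verified here; the function names are just for illustration.

```python
def quorum_votes(total_nodes: int) -> int:
    """Votes needed for a strict majority, assuming one vote per node."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, nodes_up: int) -> bool:
    """True if the nodes still running form a majority of the cluster."""
    return nodes_up >= quorum_votes(total_nodes)


# My cluster has 3 nodes: pve01, pve02 and pve03.
for nodes_up in (3, 2, 1):
    print(f"{nodes_up} of 3 nodes up -> quorum: {has_quorum(3, nodes_up)}")
```

With pve01 down, the remaining 2 nodes still form a majority. Lose one more and the last node standing is on its own, which is exactly the situation I don't want to be in while I'm away.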
Unfortunately, my HP Z6 G4 and HP Z840 servers do not have built-in IPMI. But they do have Intel AMT, and that gives me an option for a poor man's IPMI. Why do I say "poor man's IPMI"? That's mainly 'cos my servers' Xeon CPUs don't have integrated graphics, and it seems that Intel AMT's remote desktop capabilities rely on the integrated graphics that some Intel CPUs have.
But still, having the ability to remotely restart the machines is better than nothing. To enable AMT's management ports, the following steps are needed:
Once rebooted, the AMT interface will be accessible on port 16992. You'll see this login prompt:
Log in using "admin" and the new password you just set.
Go to the "Remote Control" menu item and you'll be able to restart the machine:
Tested this out and it works fine. My only doubt is that after the PCIe slot failure I experienced previously, the BIOS would always pop up an alert on the next reboot, and I would need to press Enter to bypass it and proceed to normal boot-up. I'm not sure whether, if I used AMT to restart the machine, I would end up stuck on this alert page with no way to get past it, since I would have neither physical access nor remote desktop capabilities. Will need to test this out the next time it happens.
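Before relying on this while I'm away, it seems worth checking that the AMT web interface is actually reachable on every node. Here is a minimal sketch that just tries a TCP connection to port 16992. The function names and the IP addresses in the usage note are placeholders I made up for illustration. Note that an open port only tells me the AMT interface is listening, not that a remote restart will get past that BIOS alert.

```python
import socket

AMT_HTTP_PORT = 16992  # the AMT web interface port mentioned above


def amt_port_open(host: str, port: int = AMT_HTTP_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host and timeouts.
        return False


def check_all(hosts: dict, port: int = AMT_HTTP_PORT, timeout: float = 3.0) -> dict:
    """Map each node name to whether its AMT port accepted a connection."""
    return {name: amt_port_open(addr, port, timeout) for name, addr in hosts.items()}
```

Usage would look something like `check_all({"pve01": "192.168.1.11", "pve02": "192.168.1.12", "pve03": "192.168.1.13"})`, with the real AMT addresses substituted in. Run from outside the house over whatever remote access is in place, it also confirms the path to the AMT interfaces works end to end.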
Running a home lab where you have other team members accessing remotely is a lot harder than it seems. I'm starting to get nervous every time I leave my house and it's slowly turning me into a hikikomori.
Head of Data Science at AIA Thailand. Hiring data scientists & software engineers
2 months ago: I gave up HA and just run my Proxmox single-node for a couple of years now. Most (all?) downtime had been from blackouts. I kind of want ceph, but am content with the separate TrueNAS and RAID-Z2 pools. I left the hypervisor alone to mess with homeassistant for about 2 years. Now I also left that alone and just PC game on my free time :)
Cloud & Technology Strategy | Financial Services
2 months ago: Makes me curious the kind of applications you are running in your home lab.