Proxmox 8.0 Hang on Boot - A Fix

Today's post is going to be a quick one, but hopefully having another source for the fix for this issue on the web will help someone who's in the same dire straits I was in! (Please note - I'm using the Proxmox "no subscription" repository; if you're an Enterprise customer, YMMV)

Here's the scenario - upgrading my servers from Proxmox 7 to Proxmox 8, and Debian 11 (Bullseye) to Debian 12 (Bookworm). I run the pve7to8 script and everything checks out, so I begin the upgrade. Some 750+ packages later and I'm ready to reboot. I'm working remotely, so I start my continuous ping, issue the reboot command and wait. And wait. And.....oh crap. There are few worse feelings in the world than the server not coming back up after a reboot when you're remote. However, I do have the consolation that this is at least a backup server and not one of our Production machines.

After driving into the office and hooking up a monitor, here's what I've got: on boot, I get past the grub menu screen and about halfway through the boot process and the system just hangs. Nothing happening, no error messages, nothing. So I reboot the server and choose the recovery option from the menu. Now I can see that the server is hanging up somewhere around network initialization in the boot process. I begin going through the troubleshooting steps - is it maybe a hardware issue that coincidentally struck at the same time? Spoiler alert - NO!

One positive - I now know WAY more about ZFS troubleshooting than I ever wanted to know. BIG props to Nick Chevsky for the systemrescue-zfs Github repo!

I'll cut out the 12 hours of troubleshooting that I went through and skip straight to the fix. Fortunately for me, I stumbled across the following link on the Proxmox site - https://forum.proxmox.com/threads/proxmox-ve-8-0-released.129320/post-567264. The fix for the boot hang is to remove the script /etc/network/if-up.d/ntpsec-ntpdate.
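As a quick check, here is a minimal sketch that reports whether a given root filesystem still contains that script. The function name has_ntpdate_hook and its optional root-directory argument are my own invention for illustration, not something Proxmox or Debian ships. Pass / on a running system, or the temporary mount point when you're working from a rescue environment.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the ntpsec-ntpdate if-up.d hook exists
# under the given root directory (defaults to the running system's /).
has_ntpdate_hook() {
    root="${1:-/}"
    # ${root%/} strips one trailing slash so "/" and "/mnt/zfsroot" both work.
    [ -e "${root%/}/etc/network/if-up.d/ntpsec-ntpdate" ]
}

if has_ntpdate_hook /; then
    echo "ntpsec-ntpdate hook is present; remove it before rebooting"
else
    echo "ntpsec-ntpdate hook not found"
fi
```

The check only looks for the file; it does not remove anything.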

To make a long story short:

  1. If your Proxmox root file system is ZFS, download and boot using the systemrescue-zfs ISO.
  2. Using instructions from this link on the Ubuntu forums, you'll need to do something similar to the following steps (YMMV if you configured Proxmox differently on your server):
    1. zpool import rpool - you may receive a warning that the pool was previously in use on a different system. If so, use the command zpool import -f rpool to force the import. Note - if the zpool command doesn't find your pool, run zpool import with no arguments to list the pools available for import, or zpool import -a to search for and import every pool found on connected storage.
    2. Use the command zfs get all | grep mountpoint to locate the root filesystem mount within the ZFS pool. On my system, it was rpool/ROOT/pve-1.
    3. Here's the tricky part - when you import the ZFS pool, it attempts to mount every mount point found within the pool, but for me, it did NOT mount over the System Rescue root filesystem, which is OK. We're going to temporarily change the mount point, remove the NTP script, then change it back, like so:
      1. mkdir /mnt/zfsroot
      2. zfs unmount rpool/ROOT/pve-1
      3. zfs set mountpoint=/mnt/zfsroot rpool/ROOT/pve-1
      4. zfs mount rpool/ROOT/pve-1
      5. Now that your root filesystem is mounted, remove the offending file:
        1. rm /mnt/zfsroot/etc/network/if-up.d/ntpsec-ntpdate
      6. Now that the offending script has been removed, we need to umount the root filesystem and change our mount point back to the original location:
        1. cd /
        2. umount /mnt/zfsroot
        3. zfs set mountpoint=/ rpool/ROOT/pve-1
  3. With that done, you can remove the USB thumb drive and boot your Proxmox server normally. However, there is one final step you'll need to do:
      1. When your server boots, the ZFS pool is going to complain about having been used on another system and you'll be dumped to an (initramfs) prompt. Fortunately, the error tells you exactly what you need to do: The pool can be imported, use 'zpool import -f' to import the pool. Manually import the pool and exit.
        1. zpool import -f rpool
        2. exit
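The rescue-environment steps above can be pulled together into one small script. This is a sketch under stated assumptions, not a tested recovery tool: the pool and dataset names are the ones from my system, the run wrapper and the DRY_RUN variable are hypothetical conveniences I've added, and by default the script only prints the commands so you can review them first.

```shell
#!/bin/sh
# Assumed values from the walkthrough; override them for your own layout.
POOL="${POOL:-rpool}"
DATASET="${DATASET:-rpool/ROOT/pve-1}"
TMP_MOUNT="${TMP_MOUNT:-/mnt/zfsroot}"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = print only, 0 = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

remove_ntpdate_hook() {
    run zpool import -f "$POOL"        # -f: pool was last used by the Proxmox install
    run mkdir -p "$TMP_MOUNT"
    run zfs unmount "$DATASET"         # may report "not mounted"; that is harmless here
    run zfs set "mountpoint=$TMP_MOUNT" "$DATASET"
    run zfs mount "$DATASET"
    run rm "$TMP_MOUNT/etc/network/if-up.d/ntpsec-ntpdate"
    run zfs unmount "$DATASET"
    run zfs set mountpoint=/ "$DATASET"  # restore the original mount point
}

remove_ntpdate_hook
```

Run it as-is to see the eight commands, then set DRY_RUN=0 once the names match your system. The final import at the (initramfs) prompt still has to be done by hand on the next boot.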

Your server should now be booted completely. The same approach also applies to systems with an LVM root filesystem. In the case of LVM, I would recommend downloading the Proxmox 8.0 installer ISO and selecting Recovery from the Advanced menu. Once booted, use lvdisplay to view your logical volumes, mount the root volume, then remove the /etc/network/if-up.d/ntpsec-ntpdate script as in sub-step 3 of step 2 above.
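For the LVM case, a rough equivalent might look like the sketch below. I'm assuming the default Proxmox volume layout, where the root logical volume is /dev/pve/root; confirm that with lvdisplay before trusting it. The mount point /mnt/lvmroot, the run wrapper, and the DRY_RUN variable are hypothetical names of mine, and the script only prints commands unless DRY_RUN=0.

```shell
#!/bin/sh
# Assumptions: default Proxmox LVM naming and an arbitrary mount point.
ROOT_LV="${ROOT_LV:-/dev/pve/root}"
LVM_MOUNT="${LVM_MOUNT:-/mnt/lvmroot}"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = print only, 0 = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

remove_ntpdate_hook_lvm() {
    run vgchange -ay                   # activate all volume groups
    run lvdisplay                      # confirm the name of the root volume
    run mkdir -p "$LVM_MOUNT"
    run mount "$ROOT_LV" "$LVM_MOUNT"
    run rm "$LVM_MOUNT/etc/network/if-up.d/ntpsec-ntpdate"
    run umount "$LVM_MOUNT"
}

remove_ntpdate_hook_lvm
```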

One final note - after figuring out the steps above, I was able to proactively remove the ntpsec Debian package (apt remove --purge ntpsec) and replace it with chrony (see this link for more info on chrony) on my Production servers. However, on first boot, one of my prod machines gave me a rescue prompt, as there was a wayward entry in /etc/fstab for an external hard drive where the path had changed after the Debian update (one more argument for mounting your disks by UUID!).
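To catch that kind of fstab entry before a reboot does, something like the following sketch can help. The function name non_uuid_fstab_entries is hypothetical, and the list of device prefixes it flags (sd, hd, vd, nvme) is my assumption about which names are likely to shift; LVM paths such as /dev/pve/root are deliberately left alone because they are stable.

```shell
#!/bin/sh
# Hypothetical helper: list fstab entries whose device field is a kernel
# device name such as /dev/sdb1, which can change between boots or upgrades.
# Prints "<device> <mount point>" for each match.
non_uuid_fstab_entries() {
    fstab="${1:-/etc/fstab}"
    [ -r "$fstab" ] || return 0   # nothing to check if the file is unreadable
    awk '$1 !~ /^#/ && $1 ~ /^\/dev\/(sd|hd|vd|nvme)/ { print $1, $2 }' "$fstab"
}

non_uuid_fstab_entries /etc/fstab
```

For any entry it prints, blkid on that device shows the UUID you can put in /etc/fstab as UUID=... in place of the device path.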

I would strongly recommend that, after you've fixed any issues and booted successfully, you reboot one more time just to make sure your server comes up on its own without assistance. I know it's a pain to shut down all the VMs, but it will save you the heartache of a second failed boot process down the road.

Final note - I'm a HUGE fan of Proxmox and I've been running Proxmox hosts for many years. I would be happy to help in any way I can; if you run into a Proxmox issue you can't seem to get past, send me an email - matt@thesoloadmin.com.