ZFS on Linux NFS shares lost at reboot

On several occasions NFS has inexplicably stopped working with my ZFS shares under ZFS on Linux on Ubuntu Server 14.04 x64. It seemed to happen after every reboot. When I checked the sharenfs property on each of the ZFS file systems, I found they were set properly, but clients were still getting access denied when trying to mount the share. Other times, even when the clients were able to remount the share, they wouldn't do so unless I manually triggered it.

After a bit of Googling I found this bug report, which indicated that this problem isn't going to go away. Not exactly a great thing to find out after migrating back to Ubuntu from OpenIndiana due to hardware incompatibilities, especially when everything with ZFS was dead easy under OI. Now that I get what the problem is, here's a short script to fix things automatically from the client side.

The script is called from a cron job which runs every minute. It checks whether each of its NFS mount points shows up in the mount list. If one doesn't, it places a file on the file server (over SSH) indicating that the share is unmountable, places a similar one locally, and then waits a bit.
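The mount check can be made a little stricter than a plain grep, which also matches substrings (a check for /Data would succeed if only /Data/Final were mounted). Here is a minimal sketch, not what I actually run; the function name is_mounted and its optional second argument are my own invention for illustration. It compares the second field of each line in /proc/mounts exactly.

```shell
# is_mounted MOUNT_POINT [MOUNTS_FILE]
# Succeeds only if MOUNT_POINT appears as the exact mount-point field
# (second column) of MOUNTS_FILE, which defaults to /proc/mounts.
# Note: /proc/mounts escapes spaces in paths as \040, so a mount point
# containing spaces would need the same escaping in the argument.
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}
```

In the client script this would be used as `if ! is_mounted /MOUNT/POINT; then ...`.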

While it waits, the server runs a similarly scheduled cron job which checks for notification files to see if any of the file systems are unshared, then reshares them. Once the FS(es) are reshared, it deletes the notification file and exits.

After the server's had time to see the unmount notification, the client remounts the shares. After everything's remounted, the local unmounted file is deleted and the script exits.

Unfortunately the timing of things isn't exact, so it can take a couple of minutes for this to work. The script places a uniquely named file for each share on the server so that a separate monitoring script can track how many times each share is unmounted by which client. Each client has a similar script to count the same items, as well as how long it took for the shares to mount again.
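The counting and timing can be very simple. The following is only a sketch of the idea, not my actual monitoring script: the function names and the one-line-per-remount log format ("share seconds") are assumptions made up for this example.

```shell
# log_remount SHARE START_EPOCH END_EPOCH LOG_FILE
# Appends one line per remount: "<share> <seconds it took to remount>".
log_remount() {
    printf '%s %s\n' "$1" "$(( $3 - $2 ))" >> "$4"
}

# count_remounts SHARE LOG_FILE
# Prints how many remounts have been logged for SHARE (0 if none).
count_remounts() {
    awk -v share="$1" '$1 == share { n++ } END { print n + 0 }' "$2"
}
```

A client script could record `start=$(date +%s)` when it notices the missing mount and call `log_remount SHARE "$start" "$(date +%s)" LOGFILE` once the share is back.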

To get this working, you'll need to copy an SSH key from the client to the server, and make sure that whatever user is running this cron job has write access to the notification folder on the server (/Public in the scripts here). Since I'm just running this at home I'm not thinking about privilege separation. If I were running this at work, I'd have this script run in two parts: the first would check whether things were mounted and create the notification files in an insecure location, and the second would run as root and do the mounting and file removal.

On the client side do the following:

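#!/bin/bash  
# If the share is missing from the mount list, notify the server and remount.  
if ! grep -qs '/MOUNT/POINT' /proc/mounts; then  
        # Tell the server that this share is unmountable.  
        ssh USER@SERVER touch /Public/unmounted.SHARE.SERVER  
        filename=$(date +%Y-%m-%d-%H-%M-%S.%2N)  
        echo "Touching local unmounted $filename"  
        touch "/scripts/results/unmounted-$filename"  
        # Give the server's cron job time to reshare.  
        sleep 30  
        # Keep trying until the share shows up again.  
        while [ -e "/scripts/results/unmounted-$filename" ]; do  
                echo mounting  
                /bin/mount -a  
                echo "checking if mounted"  
                if grep -qs '/MOUNT/POINT' /proc/mounts; then  
                        echo mounted  
                        rm "/scripts/results/unmounted-$filename"  
                fi  
        done  
fi  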

On the server, again since it's a home job, the script isn't very careful about things. Were I using this at work, I'd qualify each of the unmounted files, reshare only the shares that were not mounting, and then clean up the corresponding file only. But that's too complicated for home, for now. I'll probably eventually set it up so that it's awk'd and done properly, but this works well enough for me right now.

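#!/bin/bash  
# Reshare if any client has left an unmounted.* notification file.  
if compgen -G "/Public/unmounted.*" > /dev/null; then  
        # Repeat for each NFS share you have.  
        /sbin/zfs set sharenfs=OPTIONS ZFS/FILE/SYSTEM  
        rm -f /Public/unmounted.*  
fi  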
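
For the more careful per-share version mentioned above, something along these lines might do. This is an untested-in-production sketch under a few assumptions: the notification files are named unmounted.SHARE.CLIENT with no dots in SHARE, and the share labels and dataset names in dataset_for_share (data, media, tank/data, tank/media) are hypothetical placeholders you'd replace with your own. OPTIONS is the same placeholder as in the script above.

```shell
# share_from_flag PATH
# Prints the SHARE component of .../unmounted.SHARE.CLIENT.
share_from_flag() {
    name=${1##*/}                 # strip directory -> unmounted.SHARE.CLIENT
    name=${name#unmounted.}       # strip prefix    -> SHARE.CLIENT
    printf '%s\n' "${name%%.*}"   # strip client    -> SHARE
}

# dataset_for_share SHARE
# Hypothetical lookup from share label to ZFS dataset; fails if unknown.
dataset_for_share() {
    case $1 in
        data)  echo tank/data ;;
        media) echo tank/media ;;
        *)     return 1 ;;
    esac
}

# reshare_flagged FLAG_DIR
# Reshares only the datasets that have a notification file, and removes
# each file only after its dataset was reshared successfully.
reshare_flagged() {
    for flag in "$1"/unmounted.*; do
        [ -e "$flag" ] || continue                          # glob matched nothing
        share=$(share_from_flag "$flag")
        dataset=$(dataset_for_share "$share") || continue   # unknown share: leave the file
        /sbin/zfs set sharenfs=OPTIONS "$dataset" && rm -f "$flag"
    done
}
```

The server's cron job would then just call `reshare_flagged /Public`. Unknown share labels are left in place on purpose, so a typo on a client shows up as a file that never goes away.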