Installation steps for running Arch Linux with root on ZFS using UEFI and systemd-boot. All steps are run as root.
Requires an Arch Linux image with ZFS built-in (see the references on Arch Linux ZFS images at the end).
If using KVM, add a Serial number for each virtual disk and reboot the VM. The disks should now be available in /dev/disk/by-id as virtio-<Serial>.
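For reference, a minimal sketch of the relevant libvirt disk definition (edit with virsh edit; the image path and the serial value disk0 are placeholders):
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/disk0.qcow2'/>
  <target dev='vda' bus='virtio'/>
  <serial>disk0</serial>
</disk>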
- Set a bigger font if needed:
setfont latarcyrheb-sun32
- To set up Wi-Fi, add the ESSID and passphrase:
wpa_passphrase ESSID PASSPHRASE > /etc/wpa_supplicant/wpa_supplicant.conf
- Start wpa_supplicant and get an IP address:
wpa_supplicant -B -c /etc/wpa_supplicant/wpa_supplicant.conf -i <wifi interface>
dhclient <wifi interface>
- Wipe disks, create boot, swap and ZFS partitions:
sgdisk --zap-all /dev/disk/by-id/<disk0>
wipefs -a /dev/disk/by-id/<disk0>
sgdisk -n1:0:+512M -t1:ef00 /dev/disk/by-id/<disk0>
sgdisk -n2:0:+8G -t2:8200 /dev/disk/by-id/<disk0>
sgdisk -n3:0:+210G -t3:bf00 /dev/disk/by-id/<disk0>
sgdisk --zap-all /dev/disk/by-id/<disk1>
wipefs -a /dev/disk/by-id/<disk1>
sgdisk -n1:0:+512M -t1:ef00 /dev/disk/by-id/<disk1>
sgdisk -n2:0:+8G -t2:8200 /dev/disk/by-id/<disk1>
sgdisk -n3:0:+210G -t3:bf00 /dev/disk/by-id/<disk1>
Always use 1 MiB aligned partitions; check with (e.g., for the boot partition on the first disk):
parted /dev/disk/by-id/<disk0> align-check optimal 1
You can also use a zvol for swap, but be aware of the known deadlock issues with swap on zvols before doing so.
- Format the boot and swap partitions:
mkfs.vfat /dev/disk/by-id/<disk0>-part1
mkfs.vfat /dev/disk/by-id/<disk1>-part1
mkswap /dev/disk/by-id/<disk0>-part2
mkswap /dev/disk/by-id/<disk1>-part2
- Enable swap:
swapon /dev/disk/by-id/<disk0>-part2 /dev/disk/by-id/<disk1>-part2
- On choosing ashift
You should specify an ashift when the autodetected value is too low for what you actually need, either today (the disk lies about its sector size) or in the future (replacement disks will be Advanced Format). That looks like sound advice to me. If in doubt, clarify before going any further.
Disk block size check:
cat /sys/class/block/<disk>/queue/{phys,log}ical_block_size
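For example, a 512e drive reporting a 4096-byte physical and a 512-byte logical sector size (the device name below is an example) should get ashift=12, since 2^12 = 4096:
cat /sys/class/block/nvme0n1/queue/{phys,log}ical_block_size
4096
512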
- Create the pool (this example is a RAID 1 equivalent on all-flash drives):
zpool create \
    -o ashift=12 \
    -o autotrim=on \
    -o autoexpand=on \
    -o autoreplace=on \
    -O atime=off \
    -O acltype=posixacl \
    -O canmount=off -O compression=zstd \
    -O dnodesize=auto -O normalization=formD \
    -O xattr=sa -O devices=off -O mountpoint=none \
    -R /mnt \
  rpool \
    mirror /dev/disk/by-id/<disk0>-part3 /dev/disk/by-id/<disk1>-part3
If the pool is larger than 10 disks, you should identify them by-path or by-vdev instead.
Check ashift with:
zdb -C | grep ashift
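For the pool created above, the output should contain (once per top-level vdev):
            ashift: 12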
- Create datasets:
zfs create -o canmount=off -o mountpoint=none rpool/ROOT
zfs create -o mountpoint=/ -o canmount=noauto rpool/ROOT/default
zfs create -o mountpoint=none rpool/DATA
zfs create -o mountpoint=/home rpool/DATA/home
A more involved filesystem/dataset structure is possible if snapshots and system rollbacks are desired. As those setups usually involve GRUB2 and/or ZFSBootMenu, I don't find that the benefits outweigh the simplicity of systemd-boot.
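To double-check the layout before proceeding:
zfs list -o name,canmount,mountpoint -r rpool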
- Create swap on a zvol (not needed if you created swap partitions, as above):
zfs create -V 16G -b $(getconf PAGESIZE) -o compression=zle -o logbias=throughput -o sync=always -o primarycache=metadata -o secondarycache=none -o com.sun:auto-snapshot=false rpool/swap
mkswap /dev/zvol/rpool/swap
- Unmount all
zfs umount -a
rm -rf /mnt/*
- Export, then import pool:
zpool export rpool
zpool import -d /dev/disk/by-id -R /mnt rpool -N
- Mount root, then the other datasets:
zfs mount rpool/ROOT/default
zfs mount -a
- Mount boot partition:
mkdir /mnt/boot
mount /dev/disk/by-id/<disk0>-part1 /mnt/boot
- Generate fstab:
mkdir /mnt/etc
genfstab -U /mnt >> /mnt/etc/fstab
- Add swap (not needed if you created swap partitions, like above):
echo "/dev/zvol/rpool/swap    none       swap  discard                    0  0" >> /mnt/etc/fstab
- Install the base system:
pacstrap /mnt base base-devel linux linux-firmware vim
If it fails, add the GPG keys (see below).
- Change root into the new system:
arch-chroot /mnt
- Remove all lines in /etc/fstab, leaving only the entries for swap and boot; for boot, change fmask and dmask to 0077.
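After trimming, /etc/fstab should look roughly like this (UUIDs are placeholders and the vfat option list is abbreviated; keep whatever genfstab produced apart from the mask changes):
UUID=XXXX-XXXX                             /boot  vfat  rw,relatime,fmask=0077,dmask=0077  0 2
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  none   swap  defaults                           0 0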
- Add the ZFS repository in /etc/pacman.conf:
[archzfs]
Server = https://archzfs.com/$repo/x86_64
I no longer use this method; I built my own pipeline for dealing with kernel and ZFS upgrades.
- Add GPG keys:
curl -O https://archzfs.com/archzfs.gpg
pacman-key -a archzfs.gpg
pacman-key --lsign-key DDF7DB817396A49B2A2723F7403BD972F75D9D76
pacman -Syy
- Configure time zone (change accordingly):
ln -sf /usr/share/zoneinfo/Region/City /etc/localtime
hwclock --systohc
- Generate locale (change accordingly):
sed -i 's/#\(en_US\.UTF-8\)/\1/' /etc/locale.gen
locale-gen
echo "LANG=en_US.UTF-8" > /etc/locale.conf
- Configure vconsole, hostname, hosts:
echo -e "KEYMAP=us\nFONT=ter-v16n" > /etc/vconsole.conf
echo al-zfs > /etc/hostname
echo -e "127.0.0.1 localhost\n::1 localhost" >> /etc/hosts
- Set the root password:
passwd
- Install ZFS, the microcode, and other packages:
pacman -Syu archzfs-linux amd-ucode networkmanager sudo openssh terminus-font
Choose the default option (all) for the archzfs group.
For Intel processors install intel-ucode instead of amd-ucode.
- Generate host id:
zgenhostid $(hostid)
- Create cache file:
zpool set cachefile=/etc/zfs/zpool.cache rpool
- Configure the initial ramdisk in /etc/mkinitcpio.conf by removing fsck and adding zfs after keyboard:
HOOKS=(base udev autodetect modconf kms keyboard keymap consolefont block zfs filesystems)
Regenerate the initramfs:
mkinitcpio -p linux
- Enable ZFS services:
systemctl enable zfs.target
systemctl enable zfs-import-cache.service
systemctl enable zfs-mount.service
systemctl enable zfs-import.target
- Install the bootloader:
bootctl --path=/boot install
- Add an EFI boot manager update hook in /etc/pacman.d/hooks/100-systemd-boot.hook:
[Trigger]
Type = Package
Operation = Upgrade
Target = systemd
[Action]
Description = update systemd-boot
When = PostTransaction
Exec = /usr/bin/bootctl update
- Replace content of /boot/loader/loader.conf with:
default arch
timeout 0
# bigger boot menu on a 4K laptop display
#console-mode 1
- Create a /boot/loader/entries/arch.conf containing:
title Arch Linux
linux /vmlinuz-linux
initrd /amd-ucode.img
initrd /initramfs-linux.img
options zfs=rpool/ROOT/default rw
If using an Intel processor, replace /amd-ucode.img with /intel-ucode.img.
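Optionally, verify that systemd-boot sees the new entry:
bootctl list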
- Exit and unmount all:
exit
zfs umount -a
umount -R /mnt
- Export pool:
zpool export rpool
- Reboot
A minimal Arch Linux system with root on ZFS should now be configured.
Create a dataset for the user's home directory and add the user (names and IDs are examples):
zfs create -o mountpoint=/home/user rpool/DATA/home/user
groupadd -g 1234 group
useradd -g group -u 1234 -d /home/user -s /bin/bash user
cp /etc/skel/.bash* /home/user
chown -R user:group /home/user && chmod 700 /home/user
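Finally, set the new user's password:
passwd user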
Example of a pool with dedicated special and log (SLOG) vdevs. Drives: 2 x Micron 7450 MAX 400GB NVMe, 2 x Micron 7450 PRO 1920GB NVMe, 2 x Samsung 870 EVO 4TB SATA SSD.
Change sector size on Micron drives:
nvme format -l 1 /dev/disk/by-id/nvme-Micron_7450_MTFDKBA400TFS-disk0 --force
nvme format -l 1 /dev/disk/by-id/nvme-Micron_7450_MTFDKBA400TFS-disk1 --force
nvme format -l 1 /dev/disk/by-id/nvme-Micron_7450_MTFDKBG1T9TFR-disk0 --force
nvme format -l 1 /dev/disk/by-id/nvme-Micron_7450_MTFDKBG1T9TFR-disk1 --force
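The -l 1 index assumes the 4K LBA format is the second entry in each drive's supported format list; this can be confirmed beforehand with, e.g.:
nvme id-ns -H /dev/disk/by-id/nvme-Micron_7450_MTFDKBA400TFS-disk0 | grep "LBA Format"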
Partition the SLOG drives:
sgdisk -n1:0:+16G -t1:bf01 /dev/disk/by-id/nvme-Micron_7450_MTFDKBA400TFS-disk0
sgdisk -n1:0:+16G -t1:bf01 /dev/disk/by-id/nvme-Micron_7450_MTFDKBA400TFS-disk1
Create pool:
zpool create \
    -o ashift=12 \
    -o autotrim=on \
    -o autoexpand=on \
    -o autoreplace=on \
    -O atime=off \
    -O acltype=nfsv4 -O aclinherit=passthrough-x -O aclmode=passthrough \
    -O canmount=off -O compression=zstd \
    -O dnodesize=auto \
    -O xattr=sa -O devices=off -O mountpoint=none \
  forge \
    mirror /dev/disk/by-id/ata-Samsung_SSD_870_EVO-disk0 /dev/disk/by-id/ata-Samsung_SSD_870_EVO-disk1 \
    special mirror /dev/disk/by-id/nvme-Micron_7450_MTFDKBG1T9TFR-disk0 /dev/disk/by-id/nvme-Micron_7450_MTFDKBG1T9TFR-disk1 \
    log mirror /dev/disk/by-id/nvme-Micron_7450_MTFDKBA400TFS-disk0-part1 /dev/disk/by-id/nvme-Micron_7450_MTFDKBA400TFS-disk1-part1
Keep small files on the special mirror:
zfs set special_small_blocks=16K forge
Estimate the required size before settling on a special_small_blocks value. In this case, at around 50% pool occupancy, the metadata plus the files <=16K occupy roughly 15x less than the formatted drive capacity; however, I plan to keep some virtual machine disks on the special vdev as well, aiming for a final overall occupancy of 40-50%. If that does not work out (free space too low), I will reduce special_small_blocks to 8K. The target maximum capacity is 90-95% for the special vdev and 80% for the main vdev, given the drive classes.
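One way to sanity-check the estimate once representative data is on the pool is the block-size histogram from zdb (it walks the entire pool, so expect it to take a while):
zdb -Lbbbs forge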
Create a dataset for virtual machines:
zfs create -o mountpoint=none -o canmount=off -o compression=zstd-fast-3 forge/VM
zfs set logbias=latency forge/VM
zfs set sync=standard forge/VM
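A VM disk under that dataset would then be a zvol; for example (size, volblocksize and name are placeholders):
zfs create -V 100G -o volblocksize=16K forge/VM/vm1-disk0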
Share a dataset over NFS. Set Domain in /etc/idmapd.conf on the server and clients.
zfs set sharenfs=rw=@<network>/24 tank0/DATA/path/name
systemctl enable nfs-server.service zfs-share.service --now
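On a client, the share can then be mounted as usual (server name and mount point are placeholders; the exported path is the dataset's mountpoint):
showmount -e server
mount -t nfs server:/<dataset mountpoint> /mnt/name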
The following steps will replace all the disks in a mirrored vdev with larger ones.
Before upgrade:
zpool status lpool
  pool: lpool
 state: ONLINE
  scan: scrub repaired 0B in 00:02:18 with 0 errors on Tue Oct 17 15:52:42 2023
config:
        NAME                                                      STATE     READ WRITE CKSUM
        lpool                                                     ONLINE       0     0     0
          mirror-0                                                ONLINE       0     0     0
            nvme-Samsung_SSD_970_EVO_500GB_S466NB0K683727V-part1  ONLINE       0     0     0
            nvme-Samsung_SSD_970_EVO_500GB_S466NB0K621937D-part1  ONLINE       0     0     0
errors: No known data errors
zpool list lpool             
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
lpool   464G   218G   246G        -         -     9%    47%  1.00x    ONLINE  -
The autoexpand property on the pool should be set to on.
zpool get autoexpand lpool
NAME   PROPERTY    VALUE   SOURCE
lpool  autoexpand  off     default
If it's not, set it:
zpool set autoexpand=on lpool
Power off the machine, physically replace one of the drives using the same slot or cable, then power it back on. After replacing the drive, zpool status shows the following:
zpool status lpool
  pool: lpool
 state: DEGRADED 
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:02:18 with 0 errors on Tue Oct 17 15:52:42 2023
config:
        NAME                                                STATE     READ WRITE CKSUM
        lpool                                               DEGRADED     0     0     0
          mirror-0                                          DEGRADED     0     0     0
            nvme-Samsung_SSD_970_EVO_500GB_S466NB0K683727V  ONLINE       0     0     0
            3003823869835917342                             UNAVAIL      0     0     0  was /dev/disk/by-id/nvme-Samsung_SSD_970_EVO_500GB_S466NB0K621937D-part1
errors: No known data errors
Replace the drive in the pool (here, nvme-Samsung_SSD_970_EVO_500GB_S466NB0K621937D is the old, physically removed drive):
zpool replace lpool /dev/disk/by-id/nvme-Samsung_SSD_970_EVO_500GB_S466NB0K621937D /dev/disk/by-id/nvme-Samsung_SSD_990_PRO_2TB_S6Z2NF0W836871V
The new drive is now resilvering:
zpool status lpool
  pool: lpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Oct 29 14:09:00 2023
        144G / 218G scanned at 20.6G/s, 0B / 218G issued
        0B resilvered, 0.00% done, no estimated completion time
config:
        NAME                                                STATE     READ WRITE CKSUM
        lpool                                               DEGRADED     0     0     0
          mirror-0                                          DEGRADED     0     0     0
            nvme-Samsung_SSD_970_EVO_500GB_S466NB0K683727V  ONLINE       0     0     0
            replacing-1                                     DEGRADED     0     0     0
              3003823869835917342                           UNAVAIL      0     0     0  was /dev/disk/by-id/nvme-Samsung_SSD_970_EVO_500GB_S466NB0K621937D-part1
              nvme-Samsung_SSD_990_PRO_2TB_S6Z2NF0W836871V  ONLINE       0     0     0
errors: No known data errors
Wait for resilver to finish:
zpool status lpool
  pool: lpool
 state: ONLINE
  scan: resilvered 220G in 00:01:58 with 0 errors on Sun Oct 29 14:10:58 2023
config:
        NAME                                                STATE     READ WRITE CKSUM
        lpool                                               ONLINE       0     0     0
          mirror-0                                          ONLINE       0     0     0
            nvme-Samsung_SSD_970_EVO_500GB_S466NB0K683727V  ONLINE       0     0     0
            nvme-Samsung_SSD_990_PRO_2TB_S6Z2NF0W836871V    ONLINE       0     0     0
errors: No known data errors
Scrub the pool
zpool scrub lpool
and wait until it's finished:
zpool status lpool
  pool: lpool
 state: ONLINE
  scan: scrub repaired 0B in 00:02:04 with 0 errors on Sun Oct 29 14:16:24 2023
config:
        NAME                                                STATE     READ WRITE CKSUM
        lpool                                               ONLINE       0     0     0
          mirror-0                                          ONLINE       0     0     0
            nvme-Samsung_SSD_970_EVO_500GB_S466NB0K683727V  ONLINE       0     0     0
            nvme-Samsung_SSD_990_PRO_2TB_S6Z2NF0W836871V    ONLINE       0     0     0
errors: No known data errors
Physically replace the second disk using the same slot or cable; after powering on the machine, the pool is again in a degraded state:
zpool status lpool
  pool: lpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:02:04 with 0 errors on Sun Oct 29 14:16:24 2023
config:
        NAME                                              STATE     READ WRITE CKSUM
        lpool                                             DEGRADED     0     0     0
          mirror-0                                        DEGRADED     0     0     0
            4843306544541531925                           UNAVAIL      0     0     0  was /dev/disk/by-id/nvme-Samsung_SSD_970_EVO_500GB_S466NB0K683727V-part1
            nvme-Samsung_SSD_990_PRO_2TB_S6Z2NF0W836871V  ONLINE       0     0     0
errors: No known data errors
Replace the second drive in the pool:
zpool replace lpool /dev/disk/by-id/nvme-Samsung_SSD_970_EVO_500GB_S466NB0K683727V /dev/disk/by-id/nvme-Samsung_SSD_990_PRO_2TB_S6Z2NF0W932278L
The second drive is now resilvering:
zpool status lpool
  pool: lpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Oct 29 14:28:24 2023
        218G / 218G scanned, 23.7G / 218G issued at 1.58G/s
        24.1G resilvered, 10.87% done, 00:02:03 to go
config:
        NAME                                                STATE     READ WRITE CKSUM
        lpool                                               DEGRADED     0     0     0
          mirror-0                                          DEGRADED     0     0     0
            replacing-0                                     DEGRADED     0     0     0
              4843306544541531925                           UNAVAIL      0     0     0  was /dev/disk/by-id/nvme-Samsung_SSD_970_EVO_500GB_S466NB0K683727V-part1
              nvme-Samsung_SSD_990_PRO_2TB_S6Z2NF0W932278L  ONLINE       0     0     0  (resilvering)
            nvme-Samsung_SSD_990_PRO_2TB_S6Z2NF0W836871V    ONLINE       0     0     0
errors: No known data errors
Wait for resilvering to finish:
zpool status lpool
  pool: lpool
 state: ONLINE
  scan: resilvered 220G in 00:02:30 with 0 errors on Sun Oct 29 14:30:54 2023
config:
        NAME                                              STATE     READ WRITE CKSUM
        lpool                                             ONLINE       0     0     0
          mirror-0                                        ONLINE       0     0     0
            nvme-Samsung_SSD_990_PRO_2TB_S6Z2NF0W932278L  ONLINE       0     0     0
            nvme-Samsung_SSD_990_PRO_2TB_S6Z2NF0W836871V  ONLINE       0     0     0
errors: No known data errors
Scrub pool again:
zpool scrub lpool && zpool status lpool
  pool: lpool
 state: ONLINE
  scan: scrub in progress since Sun Oct 29 14:31:51 2023
        218G / 218G scanned, 15.1G / 218G issued at 1.51G/s
        0B repaired, 6.91% done, 00:02:14 to go
config:
        NAME                                              STATE     READ WRITE CKSUM
        lpool                                             ONLINE       0     0     0
          mirror-0                                        ONLINE       0     0     0
            nvme-Samsung_SSD_990_PRO_2TB_S6Z2NF0W932278L  ONLINE       0     0     0
            nvme-Samsung_SSD_990_PRO_2TB_S6Z2NF0W836871V  ONLINE       0     0     0
errors: No known data errors
Check pool:
zpool status lpool
  pool: lpool
 state: ONLINE
  scan: scrub repaired 0B in 00:02:21 with 0 errors on Sun Oct 29 14:34:12 2023
config:
        NAME                                              STATE     READ WRITE CKSUM
        lpool                                             ONLINE       0     0     0
          mirror-0                                        ONLINE       0     0     0
            nvme-Samsung_SSD_990_PRO_2TB_S6Z2NF0W932278L  ONLINE       0     0     0
            nvme-Samsung_SSD_990_PRO_2TB_S6Z2NF0W836871V  ONLINE       0     0     0
errors: No known data errors
New pool size:
zpool list lpool  
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
lpool  1.82T   218G  1.60T        -         -     2%    11%  1.00x    ONLINE  -
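If the extra capacity does not show up at this point (for example because autoexpand was still off when the disks were swapped), it can usually be claimed per device with zpool online -e:
zpool online -e lpool nvme-Samsung_SSD_990_PRO_2TB_S6Z2NF0W932278L
zpool online -e lpool nvme-Samsung_SSD_990_PRO_2TB_S6Z2NF0W836871V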
- https://wiki.archlinux.org/index.php/Install_Arch_Linux_on_ZFS
- https://wiki.archlinux.org/index.php/ZFS
- https://ramsdenj.com/2016/06/23/arch-linux-on-zfs-part-2-installation.html
- https://github.com/reconquest/archiso-zfs
- https://zedenv.readthedocs.io/en/latest/setup.html
- https://docs.oracle.com/cd/E37838_01/html/E60980/index.html
- https://ramsdenj.com/2020/03/18/zectl-zfs-boot-environment-manager-for-linux.html
- https://superuser.com/questions/1310927/what-is-the-absolute-minimum-size-a-uefi-partition-can-be
- https://systemd.io/BOOT_LOADER_SPECIFICATION/
- OpenZFS Admin Documentation
- zfs(8)
- zpool(8)
- https://jrs-s.net/category/open-source/zfs/
- https://github.com/ewwhite/zfs-ha/wiki
- http://nex7.blogspot.com/2013/03/readme1st.html
- Arch Linux ZFS images
- Create an Arch Linux ISO with ZFS builtin
- ArchZFS