@taslabs-net
Last active October 6, 2025 02:15
Thunderbolt4 mesh network

PVE 9 BETA TB4 + Ceph Guide

Updated as of: 2025-01-03 - Network architecture corrections applied

GitHub https://github.com/taslabs-net/PVE9_TB4

For the best reading experience, visit the live documentation at: https://tb4.git.taslabs.net/


Network Architecture (UPDATED)

Cluster Management Network: 10.11.11.0/24 (vmbr0)

  • Primary cluster communication and SSH access
  • n2: 10.11.11.12
  • n3: 10.11.11.13
  • n4: 10.11.11.14

VM Network and Backup Cluster Network: 10.1.1.0/24 (vmbr1)

  • VM traffic and backup cluster communication
  • n2: 10.1.1.12
  • n3: 10.1.1.13
  • n4: 10.1.1.14

TB4 Mesh Network: 10.100.0.0/24 (en05/en06)

  • High-speed TB4 interfaces for Ceph cluster_network
  • Isolated from client I/O traffic
  • Provides optimal performance for Ceph OSD communication

SSH Key Setup (UPDATED)

Critical: Before proceeding with any configuration, you must set up SSH key authentication for passwordless access to all nodes.

Step 1: Generate SSH Key (if you don't have one)

# Generate a new SSH key (if needed):
ssh-keygen -t ed25519 -C "cluster-ssh-key" -f ~/.ssh/cluster_key
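
Optional: if you also want the short node names (n2, n3, n4) used later in this guide to work from your admin machine, you can define SSH aliases. This is a minimal sketch; the host names, IPs, and key path are assumptions to adjust to your environment:

# Add host aliases to ~/.ssh/config on the machine you run the commands from:
cat >> ~/.ssh/config << 'EOF'
Host n2
    HostName 10.11.11.12
    User root
    IdentityFile ~/.ssh/cluster_key
Host n3
    HostName 10.11.11.13
    User root
    IdentityFile ~/.ssh/cluster_key
Host n4
    HostName 10.11.11.14
    User root
    IdentityFile ~/.ssh/cluster_key
EOF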

Step 2: Accept Host Keys (First Time Only)

IMPORTANT: Before running the deployment commands, you must SSH into each node once to accept the host key:

# Accept host keys for all nodes (type 'yes' when prompted):
ssh [email protected] "echo 'Host key accepted for n2'"
ssh [email protected] "echo 'Host key accepted for n3'"
ssh [email protected] "echo 'Host key accepted for n4'"

Note: This step is required because the first SSH connection to each host requires accepting the host key. Without this, the automated deployment commands will fail.
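
Alternative: instead of logging in to each node once, the host keys can be pre-accepted non-interactively with ssh-keyscan (one possible approach; verify the fingerprints out of band if that matters in your environment):

# Pre-populate known_hosts for all nodes:
for node in 10.11.11.12 10.11.11.13 10.11.11.14; do
  ssh-keyscan -H $node >> ~/.ssh/known_hosts
done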

Step 3: Deploy SSH Key to All Nodes

Deploy your public key to each node's authorized_keys:

# Deploy to n2 (10.11.11.12):
ssh [email protected] "mkdir -p ~/.ssh && echo 'ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMGHoypdiKhldYlNUvW27uzutzewJ+X08Rlg/m7vmmtW cluster-ssh-key' >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"

# Deploy to n3 (10.11.11.13):
ssh [email protected] "mkdir -p ~/.ssh && echo 'ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMGHoypdiKhldYlNUvW27uzutzewJ+X08Rlg/m7vmmtW cluster-ssh-key' >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"

# Deploy to n4 (10.11.11.14):
ssh [email protected] "mkdir -p ~/.ssh && echo 'ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMGHoypdiKhldYlNUvW27uzutzewJ+X08Rlg/m7vmmtW cluster-ssh-key' >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"

Step 4: Test SSH Key Authentication

# Test passwordless SSH access to all nodes:
for node in 10.11.11.12 10.11.11.13 10.11.11.14; do
  echo "Testing SSH access to $node..."
  ssh root@$node "echo 'SSH key authentication working on $node'"
done

Expected result: All nodes should respond without prompting for a password.

TB4 Hardware Detection (UPDATED)

Step 1: Prepare All Nodes

Critical: Perform these steps on ALL mesh nodes (n2, n3, n4).

Load TB4 kernel modules:

# Execute on each node:
for node in 10.11.11.12 10.11.11.13 10.11.11.14; do
  ssh root@$node "echo 'thunderbolt' >> /etc/modules"
  ssh root@$node "echo 'thunderbolt-net' >> /etc/modules"
  ssh root@$node "modprobe thunderbolt && modprobe thunderbolt-net"
done

Verify modules loaded:

for node in 10.11.11.12 10.11.11.13 10.11.11.14; do
  echo "=== TB4 modules on $node ==="
  ssh root@$node "lsmod | grep thunderbolt"
done

Expected output: Both thunderbolt and thunderbolt_net modules present.

Step 2: Identify TB4 Hardware

Find TB4 controllers and interfaces:

for node in 10.11.11.12 10.11.11.13 10.11.11.14; do
  echo "=== TB4 hardware on $node ==="
  ssh root@$node "lspci | grep -i thunderbolt"
  ssh root@$node "ip link show | grep -E '(en0[5-9]|thunderbolt)'"
done

Expected: TB4 PCI controllers detected, TB4 network interfaces visible.

Step 3: Create Systemd Link Files

Critical: Create interface renaming rules based on PCI paths for consistent naming.

For all nodes (n2, n3, n4):

# Create systemd link files for TB4 interface renaming:
for node in 10.11.11.12 10.11.11.13 10.11.11.14; do
  ssh root@$node "cat > /etc/systemd/network/00-thunderbolt0.link << 'EOF'
[Match]
Path=pci-0000:00:0d.2
Driver=thunderbolt-net

[Link]
MACAddressPolicy=none
Name=en05
EOF"

  ssh root@$node "cat > /etc/systemd/network/00-thunderbolt1.link << 'EOF'
[Match]
Path=pci-0000:00:0d.3
Driver=thunderbolt-net

[Link]
MACAddressPolicy=none
Name=en06
EOF"
done

Note: Adjust PCI paths if different on your hardware (check with lspci | grep -i thunderbolt)
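
To confirm which PCI path and driver the [Match] section should reference, you can query udev for each TB4 network device (a sketch; before the rename takes effect, the interfaces may still be called thunderbolt0/thunderbolt1 instead of en05/en06):

# Print the interface name and PCI path of every thunderbolt-net device:
for node in 10.11.11.12 10.11.11.13 10.11.11.14; do
  echo "=== TB4 PCI paths on $node ==="
  ssh root@$node "for dev in /sys/class/net/*; do
    udevadm info \$dev 2>/dev/null | grep -q 'ID_NET_DRIVER=thunderbolt-net' &&
      udevadm info \$dev | grep -E 'INTERFACE=|ID_PATH='
  done"
done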

Verification: After creating the link files, reboot and verify:

for node in 10.11.11.12 10.11.11.13 10.11.11.14; do
  echo "=== Interface names on $node ==="
  ssh root@$node "ip link show | grep -E '(en05|en06)'"
done

Expected: Both en05 and en06 interfaces should be present and properly named.

TB4 Network Configuration (UPDATED)

Step 4: Configure Network Interfaces

CRITICAL: TB4 interfaces MUST be defined BEFORE the source /etc/network/interfaces.d/* line to prevent conflicts with SDN configuration.

Manual configuration required for each node:

Edit /etc/network/interfaces on each node and insert the following BEFORE the source /etc/network/interfaces.d/* line:

# Add at the TOP of the file, right after the header comments:
iface en05 inet manual #do not edit in GUI
iface en06 inet manual #do not edit in GUI

Then add the full interface definitions BEFORE the source line:

# n2 configuration:
auto en05
iface en05 inet static
    address 10.100.0.2/30
    mtu 65520

auto en06
iface en06 inet static
    address 10.100.0.5/30
    mtu 65520

# n3 configuration:
auto en05
iface en05 inet static
    address 10.100.0.6/30
    mtu 65520

auto en06
iface en06 inet static
    address 10.100.0.9/30
    mtu 65520

# n4 configuration:
auto en05
iface en05 inet static
    address 10.100.0.10/30
    mtu 65520

auto en06
iface en06 inet static
    address 10.100.0.14/30
    mtu 65520

IMPORTANT:

  • The auto keyword is CRITICAL - without it, interfaces won't come up automatically at boot
  • These static IP addresses are REQUIRED for Ceph's cluster_network
  • Without the IPs, OSDs will fail to start with "Cannot assign requested address" errors
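
After editing the file on a node, the change can be applied and checked without a reboot (minimal sketch using ifupdown2, which Proxmox VE ships by default):

# Apply the updated /etc/network/interfaces and confirm address + MTU:
ifreload -a
ip -br addr show en05
ip -br addr show en06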

Step 5: Enable systemd-networkd

Required for systemd link files to work:

# Enable and start systemd-networkd on all nodes:
for node in 10.11.11.12 10.11.11.13 10.11.11.14; do
  ssh root@$node "systemctl enable systemd-networkd && systemctl start systemd-networkd"
done

Step 6: Create Udev Rules and Scripts

Automation for reliable interface bringup on cable insertion:

Create udev rules:

for node in 10.11.11.12 10.11.11.13 10.11.11.14; do
  ssh root@$node "cat > /etc/udev/rules.d/10-tb-en.rules << 'EOF'
ACTION==\"move\", SUBSYSTEM==\"net\", KERNEL==\"en05\", RUN+=\"/usr/local/bin/pve-en05.sh\"
ACTION==\"move\", SUBSYSTEM==\"net\", KERNEL==\"en06\", RUN+=\"/usr/local/bin/pve-en06.sh\"
EOF"
done

Create interface bringup scripts:

# Create en05 bringup script for all nodes:
for node in 10.11.11.12 10.11.11.13 10.11.11.14; do
  ssh root@$node "cat > /usr/local/bin/pve-en05.sh << 'EOF'
#!/bin/bash
LOGFILE=\"/tmp/udev-debug.log\"
echo \"\$(date): en05 bringup triggered\" >> \"\$LOGFILE\"
for i in {1..5}; do
    {
        ip link set en05 up mtu 65520
        echo \"\$(date): en05 up successful on attempt \$i\" >> \"\$LOGFILE\"
        break
    } || {
        echo \"\$(date): Attempt \$i failed, retrying in 3 seconds...\" >> \"\$LOGFILE\"
        sleep 3
    }
done
EOF"
  ssh root@$node "chmod +x /usr/local/bin/pve-en05.sh"
done

# Create en06 bringup script for all nodes:
for node in 10.11.11.12 10.11.11.13 10.11.11.14; do
  ssh root@$node "cat > /usr/local/bin/pve-en06.sh << 'EOF'
#!/bin/bash
LOGFILE=\"/tmp/udev-debug.log\"
echo \"\$(date): en06 bringup triggered\" >> \"\$LOGFILE\"
for i in {1..5}; do
    {
        ip link set en06 up mtu 65520
        echo \"\$(date): en06 up successful on attempt \$i\" >> \"\$LOGFILE\"
        break
    } || {
        echo \"\$(date): Attempt \$i failed, retrying in 3 seconds...\" >> \"\$LOGFILE\"
        sleep 3
    }
done
EOF"
  ssh root@$node "chmod +x /usr/local/bin/pve-en06.sh"
done

Step 7: Verify Network Configuration

Test TB4 network connectivity:

# Test connectivity between nodes:
for node in 10.11.11.12 10.11.11.13 10.11.11.14; do
  echo "=== Testing TB4 connectivity from $node ==="
  ssh root@$node "ping -c 2 10.100.0.2 && ping -c 2 10.100.0.6 && ping -c 2 10.100.0.10"
done

Expected: All ping tests should succeed, confirming TB4 mesh connectivity.

Verify interface status:

for node in 10.11.11.12 10.11.11.13 10.11.11.14; do
  echo "=== TB4 interface status on $node ==="
  ssh root@$node "ip addr show en05 en06"
done

Expected: Both interfaces should show UP state with correct IP addresses.
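
To confirm that jumbo frames actually pass end to end (not just that the MTU is configured), a don't-fragment ping with a large payload can be used. 65492 bytes of ICMP data plus 8 bytes of ICMP header and 20 bytes of IP header equals the 65520 MTU (addresses follow the n2/n3 example configuration above):

# From n2, send maximum-size unfragmented packets to n3's en05 address:
ping -c 3 -M do -s 65492 10.100.0.6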

Key Updates Made

  1. SSH Access Network: Changed from 10.1.1.x to 10.11.11.x (cluster management network)
  2. Network Architecture: Added clear explanation of the three network segments
  3. All SSH Commands: Updated to use correct cluster management network
  4. Verification Steps: Enhanced with better testing and troubleshooting

Network Summary

  • 10.11.11.0/24 = Cluster Management Network (vmbr0) - SSH access and cluster communication
  • 10.1.1.0/24 = VM Network and Backup Cluster Network (vmbr1) - VM traffic
  • 10.100.0.0/24 = TB4 Mesh Network (en05/en06) - Ceph cluster_network for optimal performance

This updated version ensures all commands use the proper network architecture for your cluster setup.


For the complete guide with all phases, troubleshooting, and the best reading experience, visit: https://tb4.git.taslabs.net/

Complete (ish) Thunderbolt 4 + Ceph Guide: Setup for Proxmox VE 9 BETA STABLE

Acknowledgments

This builds upon excellent foundational work by @scyto.

Key contributions from @scyto's work:

  • TB4 hardware detection and kernel module strategies
  • Systemd networking and udev automation techniques
  • MTU optimization and performance tuning approaches

Overview:

This guide provides a step-by-step, lightly tested process for building a high-performance Thunderbolt 4 + Ceph cluster on Proxmox VE 9 beta.

Lab Results:

  • TB4 Mesh Performance: Sub-millisecond latency, 65520 MTU, full mesh connectivity
  • Ceph Performance: 1,300+ MB/s write, 1,760+ MB/s read with optimizations
  • Reliability: 0% packet loss, automatic failover, persistent configuration
  • Integration: Full Proxmox GUI visibility and management

Hardware Environment:

  • Nodes: 3x systems with dual TB4 ports (tested on MS01 mini-PCs)
  • Memory: 64GB RAM per node (optimal for high-performance Ceph)
  • CPU: 13th Gen Intel (or equivalent high-performance processors)
  • Storage: NVMe drives for Ceph OSDs
  • Network: TB4 mesh (10.100.0.0/24) + management (10.11.12.0/24)

Software Stack:

  • Proxmox VE: 9.0 beta with native SDN OpenFabric support
  • Ceph: Reef with BlueStore, LZ4 compression, 2:1 replication (size=2, min_size=1)
  • OpenFabric: IPv4-only mesh routing for simplicity and performance

Prerequisites: What You Need

Physical Requirements

  • 3 nodes minimum: Each with dual TB4 ports (tested with MS01 mini-PCs)
  • TB4 cables: Quality TB4 cables for mesh connectivity
  • Ring topology: Physical connections n2→n3→n4→n2 (or similar mesh pattern)
  • Management network: Standard Ethernet for initial setup and management

Software Requirements

  • Proxmox VE 9.0 beta (test repository)
  • SSH root access to all nodes
  • Basic Linux networking knowledge
  • Patience: TB4 mesh setup requires careful attention to detail!

Network Planning

  • Management network: 10.11.12.0/24 (adjust to your environment)
  • TB4 cluster network: 10.100.0.0/24 (for Ceph cluster traffic)
  • Router IDs: 10.100.0.12 (n2), 10.100.0.13 (n3), 10.100.0.14 (n4)

Phase 1: Thunderbolt Foundation Setup

Step 1: Prepare All Nodes

Critical: Perform these steps on ALL mesh nodes (n2, n3, n4).

Load TB4 kernel modules:

# Execute on each node:
for node in n2 n3 n4; do
  ssh $node "echo 'thunderbolt' >> /etc/modules"
  ssh $node "echo 'thunderbolt-net' >> /etc/modules"  
  ssh $node "modprobe thunderbolt && modprobe thunderbolt-net"
done

Verify modules loaded:

for node in n2 n3 n4; do
  echo "=== TB4 modules on $node ==="
  ssh $node "lsmod | grep thunderbolt"
done

Expected output: Both thunderbolt and thunderbolt_net modules present.

Step 2: Identify TB4 Hardware

Find TB4 controllers and interfaces:

for node in n2 n3 n4; do
  echo "=== TB4 hardware on $node ==="
  ssh $node "lspci | grep -i thunderbolt"
  ssh $node "ip link show | grep -E '(en0[5-9]|thunderbolt)'"
done

Expected: TB4 PCI controllers detected, TB4 network interfaces visible.

Step 3: Create Systemd Link Files

Critical: Create interface renaming rules based on PCI paths for consistent naming.

For all nodes (n2, n3, n4):

# Create systemd link files for TB4 interface renaming:
for node in n2 n3 n4; do
  ssh $node "cat > /etc/systemd/network/00-thunderbolt0.link << 'EOF'
[Match]
Path=pci-0000:00:0d.2
Driver=thunderbolt-net

[Link]
MACAddressPolicy=none
Name=en05
EOF"

  ssh $node "cat > /etc/systemd/network/00-thunderbolt1.link << 'EOF'
[Match]
Path=pci-0000:00:0d.3
Driver=thunderbolt-net

[Link]
MACAddressPolicy=none
Name=en06
EOF"
done

Note: Adjust PCI paths if different on your hardware (check with lspci | grep -i thunderbolt)

Step 4: Configure Network Interfaces

Add TB4 interfaces to network configuration with optimal settings:

# Configure TB4 interfaces on all nodes:
for node in n2 n3 n4; do
  ssh $node "cat >> /etc/network/interfaces << 'EOF'

auto en05
iface en05 inet manual
    mtu 65520

auto en06
iface en06 inet manual
    mtu 65520
EOF"
done

Step 5: Enable systemd-networkd

Required for systemd link files to work:

# Enable and start systemd-networkd on all nodes:
for node in n2 n3 n4; do
  ssh $node "systemctl enable systemd-networkd && systemctl start systemd-networkd"
done

Step 6: Create Udev Rules and Scripts

Automation for reliable interface bringup on cable insertion:

Create udev rules:

for node in n2 n3 n4; do
  ssh $node "cat > /etc/udev/rules.d/10-tb-en.rules << 'EOF'
ACTION==\"add|move\", SUBSYSTEM==\"net\", KERNEL==\"en05\", RUN+=\"/usr/local/bin/pve-en05.sh\"
ACTION==\"add|move\", SUBSYSTEM==\"net\", KERNEL==\"en06\", RUN+=\"/usr/local/bin/pve-en06.sh\"
EOF"
done

Create interface bringup scripts:

# Create en05 bringup script for all nodes:
for node in n2 n3 n4; do
  ssh $node "cat > /usr/local/bin/pve-en05.sh << 'EOF'
#!/bin/bash
LOGFILE=\"/tmp/udev-debug.log\"
echo \"\$(date): en05 bringup triggered\" >> \"\$LOGFILE\"
for i in {1..5}; do
    {
        ip link set en05 up mtu 65520
        echo \"\$(date): en05 up successful on attempt \$i\" >> \"\$LOGFILE\"
        break
    } || {
        echo \"\$(date): Attempt \$i failed, retrying in 3 seconds...\" >> \"\$LOGFILE\"
        sleep 3
    }
done
EOF"
  ssh $node "chmod +x /usr/local/bin/pve-en05.sh"
done

# Create en06 bringup script for all nodes:
for node in n2 n3 n4; do
  ssh $node "cat > /usr/local/bin/pve-en06.sh << 'EOF'
#!/bin/bash
LOGFILE=\"/tmp/udev-debug.log\"
echo \"\$(date): en06 bringup triggered\" >> \"\$LOGFILE\"
for i in {1..5}; do
    {
        ip link set en06 up mtu 65520
        echo \"\$(date): en06 up successful on attempt \$i\" >> \"\$LOGFILE\"
        break
    } || {
        echo \"\$(date): Attempt \$i failed, retrying in 3 seconds...\" >> \"\$LOGFILE\"
        sleep 3
    }
done
EOF"
  ssh $node "chmod +x /usr/local/bin/pve-en06.sh"
done

Step 7: Update Initramfs and Reboot

Apply all TB4 configuration changes:

# Update initramfs on all nodes:
for node in n2 n3 n4; do
  ssh $node "update-initramfs -u -k all"
done

# Reboot all nodes to apply changes:
echo "Rebooting all nodes - wait for them to come back online..."
for node in n2 n3 n4; do
  ssh $node "reboot"
done

# Wait and verify after reboot:
echo "Waiting 60 seconds for nodes to reboot..."
sleep 60

# Verify TB4 interfaces after reboot:
for node in n2 n3 n4; do
  echo "=== TB4 interfaces on $node after reboot ==="
  ssh $node "ip link show | grep -E '(en05|en06)'"
done

Expected result: TB4 interfaces should be named en05 and en06 with proper MTU settings.

Step 8: Enable IPv4 Forwarding

Essential: TB4 mesh requires IPv4 forwarding for OpenFabric routing.

# Configure IPv4 forwarding on all nodes:
for node in n2 n3 n4; do
  ssh $node "echo 'net.ipv4.ip_forward=1' >> /etc/sysctl.conf"
  ssh $node "sysctl -p"
done

Verify forwarding enabled:

for node in n2 n3 n4; do
  echo "=== IPv4 forwarding on $node ==="
  ssh $node "sysctl net.ipv4.ip_forward"
done

Expected: net.ipv4.ip_forward = 1 on all nodes.

Step 9: Create Systemd Service for Boot Reliability

Ensure TB4 interfaces come up automatically on boot:

Create systemd service:

for node in n2 n3 n4; do
  ssh $node "cat > /etc/systemd/system/thunderbolt-interfaces.service << 'EOF'
[Unit]
Description=Configure Thunderbolt Network Interfaces
After=network.target thunderbolt.service
Wants=network.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/bin/thunderbolt-startup.sh

[Install]
WantedBy=multi-user.target
EOF"
done

Create startup script:

for node in n2 n3 n4; do
  ssh $node "cat > /usr/local/bin/thunderbolt-startup.sh << 'EOF'
#!/bin/bash
# Thunderbolt interface startup script
LOGFILE=\"/var/log/thunderbolt-startup.log\"

echo \"\$(date): Starting Thunderbolt interface configuration\" >> \"\$LOGFILE\"

# Wait up to 30 seconds for interfaces to appear
for i in {1..30}; do
    if ip link show en05 &>/dev/null && ip link show en06 &>/dev/null; then
        echo \"\$(date): Thunderbolt interfaces found\" >> \"\$LOGFILE\"
        break
    fi
    echo \"\$(date): Waiting for Thunderbolt interfaces... (\$i/30)\" >> \"\$LOGFILE\"
    sleep 1
done

# Configure interfaces if they exist
if ip link show en05 &>/dev/null; then
    /usr/local/bin/pve-en05.sh
    echo \"\$(date): en05 configured\" >> \"\$LOGFILE\"
fi

if ip link show en06 &>/dev/null; then
    /usr/local/bin/pve-en06.sh
    echo \"\$(date): en06 configured\" >> \"\$LOGFILE\"
fi

echo \"\$(date): Thunderbolt configuration completed\" >> \"\$LOGFILE\"
EOF"
  ssh $node "chmod +x /usr/local/bin/thunderbolt-startup.sh"
done

Enable the service:

for node in n2 n3 n4; do
  ssh $node "systemctl daemon-reload"
  ssh $node "systemctl enable thunderbolt-interfaces.service"
done

Note: This service ensures TB4 interfaces come up even if udev rules fail to trigger on boot.
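
After the next reboot you can confirm the service ran and inspect what the startup script logged (the log path matches the script above):

# Check the boot-time service and its log on each node:
for node in n2 n3 n4; do
  echo "=== thunderbolt-interfaces on $node ==="
  ssh $node "systemctl status thunderbolt-interfaces.service --no-pager | grep Active"
  ssh $node "tail -n 5 /var/log/thunderbolt-startup.log"
done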

Phase 2: Proxmox SDN Configuration

Step 4: Create OpenFabric Fabric in GUI

Location: Datacenter → SDN → Fabrics

  1. Click: "Add Fabric" → "OpenFabric"

  2. Configure in the dialog:

    • Name: tb4
    • IPv4 Prefix: 10.100.0.0/24
    • IPv6 Prefix: (leave empty for IPv4-only)
    • Hello Interval: 3 (default)
    • CSNP Interval: 10 (default)
  3. Click: "OK"

Expected result: You should see a fabric named tb4 with Protocol OpenFabric and IPv4 10.100.0.0/24


Step 5: Add Nodes to Fabric

Still in: Datacenter → SDN → Fabrics → (select tb4 fabric)

  1. Click: "Add Node"

  2. Configure for n2:

    • Node: n2
    • IPv4: 10.100.0.12
    • IPv6: (leave empty)
    • Interfaces: Select en05 and en06 from the interface list
  3. Click: "OK"

  4. Repeat for n3: IPv4: 10.100.0.13, interfaces: en05, en06

  5. Repeat for n4: IPv4: 10.100.0.14, interfaces: en05, en06

Expected result: You should see all 3 nodes listed under the fabric with their IPv4 addresses and interfaces (en05, en06 for each)

Important: You need to manually configure /30 point-to-point addresses on the en05 and en06 interfaces to create mesh connectivity. Example addressing scheme:

  • n2: en05: 10.100.0.1/30, en06: 10.100.0.5/30
  • n3: en05: 10.100.0.9/30, en06: 10.100.0.13/30
  • n4: en05: 10.100.0.17/30, en06: 10.100.0.21/30

These /30 subnets allow each interface to connect to exactly one other interface in the mesh topology. Configure these addresses in the Proxmox network interface settings for each node.


Step 6: Apply SDN Configuration

Critical: This activates the mesh - nothing works until you apply!

In GUI: Datacenter → SDN → "Apply" (button in top toolbar)

Expected result: Status table shows all nodes with "OK" status like this:

SDN     Node    Status
localnet... n3   OK
localnet... n1   OK  
localnet... n4   OK
localnet... n2   OK

Step 7: Start FRR Service

Critical: OpenFabric routing requires FRR (Free Range Routing) to be running.

# Start and enable FRR on all mesh nodes:
for node in n2 n3 n4; do
  ssh $node "systemctl start frr && systemctl enable frr"
done

Verify FRR is running:

for node in n2 n3 n4; do
  echo "=== FRR status on $node ==="
  ssh $node "systemctl status frr | grep Active"
done

Expected output:

=== FRR status on n2 ===
     Active: active (running) since Mon 2025-01-27 20:15:23 EST; 2h ago
=== FRR status on n3 ===
     Active: active (running) since Mon 2025-01-27 20:15:25 EST; 2h ago
=== FRR status on n4 ===
     Active: active (running) since Mon 2025-01-27 20:15:27 EST; 2h ago

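
For a closer look at the OpenFabric adjacencies and learned routes, FRR's vtysh can be queried directly (a sketch; available subcommands vary slightly by FRR version):

# Inspect OpenFabric state from any mesh node:
ssh n2 "vtysh -c 'show openfabric neighbor'"
ssh n2 "vtysh -c 'show openfabric topology'"

Expected: each node should list its directly connected mesh peers as adjacencies.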

Phase 3: Mesh Verification and Testing

Step 8: Verify Interface Configuration

Check TB4 interfaces are up with correct settings:

for node in n2 n3 n4; do
  echo "=== TB4 interfaces on $node ==="
  ssh $node "ip addr show | grep -E '(en05|en06|10\.100\.0\.)'"
done

Expected output example (n2):

=== TB4 interfaces on n2 ===
    inet 10.100.0.12/32 scope global dummy_tb4
11: en05: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 1000
    inet 10.100.0.1/30 scope global en05
12: en06: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 1000
    inet 10.100.0.5/30 scope global en06

What this shows:

  • Router ID address: 10.100.0.12/32 on dummy_tb4 interface
  • TB4 interfaces UP: en05 and en06 with state UP
  • Jumbo frames: mtu 65520 on both interfaces
  • Point-to-point addresses: /30 subnets for mesh connectivity

Step 9: Test OpenFabric Mesh Connectivity

Critical test: Verify full mesh communication works.

# Test router ID connectivity (should be sub-millisecond):
for target in 10.100.0.12 10.100.0.13 10.100.0.14; do
  echo "=== Testing connectivity to $target ==="
  ping -c 3 $target
done

Expected output:

=== Testing connectivity to 10.100.0.12 ===
PING 10.100.0.12 (10.100.0.12) 56(84) bytes of data.
64 bytes from 10.100.0.12: icmp_seq=1 ttl=64 time=0.618 ms
64 bytes from 10.100.0.12: icmp_seq=2 ttl=64 time=0.582 ms
64 bytes from 10.100.0.12: icmp_seq=3 ttl=64 time=0.595 ms
--- 10.100.0.12 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms

=== Testing connectivity to 10.100.0.13 ===
PING 10.100.0.13 (10.100.0.13) 56(84) bytes of data.
64 bytes from 10.100.0.13: icmp_seq=1 ttl=64 time=0.634 ms
64 bytes from 10.100.0.13: icmp_seq=2 ttl=64 time=0.611 ms
64 bytes from 10.100.0.13: icmp_seq=3 ttl=64 time=0.598 ms
--- 10.100.0.13 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2004ms

What to look for:

  • All pings succeed: 3 received, 0% packet loss
  • Sub-millisecond latency: time=0.6xx ms (typical ~0.6ms)
  • No timeouts or errors: Should see response for every packet

If connectivity fails: TB4 interfaces may need manual bring-up after reboot:

# Bring up TB4 interfaces manually:
for node in n2 n3 n4; do
  ssh $node "ip link set en05 up mtu 65520"
  ssh $node "ip link set en06 up mtu 65520"
  ssh $node "ifreload -a"
done

Step 10: Verify Mesh Performance

Test mesh latency and basic throughput:

# Test latency between router IDs:
for node in n2 n3 n4; do
  echo "=== Latency test from $node ==="
  ssh $node "ping -c 5 -i 0.2 10.100.0.12 | tail -1"
  ssh $node "ping -c 5 -i 0.2 10.100.0.13 | tail -1"
  ssh $node "ping -c 5 -i 0.2 10.100.0.14 | tail -1"
done

Expected: Round-trip times under 1ms consistently.
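
For raw throughput (in addition to latency), an iperf3 run across the TB4 addresses gives a quick baseline; iperf3 may need to be installed first (apt install iperf3). Values here are illustrative, using n3's router ID as the target:

# Start a temporary iperf3 server on n3:
ssh n3 "iperf3 -s -D"

# From n2, test throughput over the mesh with 4 parallel streams:
ssh n2 "iperf3 -c 10.100.0.13 -t 10 -P 4"

# Stop the server afterwards:
ssh n3 "pkill iperf3"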

Phase 4: High-Performance Ceph Integration

Step 11: Install Ceph on All Mesh Nodes

Install Ceph packages on all mesh nodes:

# Initialize Ceph on mesh nodes:
for node in n2 n3 n4; do
  echo "=== Installing Ceph on $node ==="
  ssh $node "pveceph install --repository test"
done

Step 12: Create Ceph Directory Structure

Essential: Proper directory structure and ownership:

# Create base Ceph directories with correct ownership:
for node in n2 n3 n4; do
  ssh $node "mkdir -p /var/lib/ceph && chown ceph:ceph /var/lib/ceph"
  ssh $node "mkdir -p /etc/ceph && chown ceph:ceph /etc/ceph"
done

Step 13: Create First Monitor and Manager

CLI Approach:

# Create initial monitor on n2:
ssh n2 "pveceph mon create"

Expected output:

Monitor daemon started successfully on node n2.
Created new cluster with fsid: 12345678-1234-5678-9abc-123456789abc

GUI Approach:

  • Location: n2 node → Ceph → Monitor → "Create"
  • Result: Should show green "Monitor created successfully" message

Verify monitor creation:

ssh n2 "ceph -s"

Expected output:

  cluster:
    id:     12345678-1234-5678-9abc-123456789abc
    health: HEALTH_OK
 
  services:
    mon: 1 daemons, quorum n2 (age 2m)
    mgr: n2(active, since 1m)
    osd: 0 osds: 0 up, 0 in
 
  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     

Step 14: Configure Network Settings

Set public and cluster networks for optimal TB4 performance:

# Configure Ceph networks:
ssh n2 "ceph config set global public_network 10.11.12.0/24"
ssh n2 "ceph config set global cluster_network 10.100.0.0/24"

# Configure monitor networks:
ssh n2 "ceph config set mon public_network 10.11.12.0/24"
ssh n2 "ceph config set mon cluster_network 10.100.0.0/24"

Step 15: Create Additional Monitors

Create 3-monitor quorum on mesh nodes:

CLI Approach:

# Create monitor on n3:
ssh n3 "pveceph mon create"

# Create monitor on n4:
ssh n4 "pveceph mon create"

Expected output (for each):

Monitor daemon started successfully on node n3.
Monitor daemon started successfully on node n4.

GUI Approach:

  • n3: n3 node → Ceph → Monitor → "Create"
  • n4: n4 node → Ceph → Monitor → "Create"
  • Result: Green success messages on both nodes

Verify 3-monitor quorum:

ssh n2 "ceph quorum_status"

Expected output:

{
    "election_epoch": 3,
    "quorum": [
        0,
        1,
        2
    ],
    "quorum_names": [
        "n2",
        "n3",
        "n4"
    ],
    "quorum_leader_name": "n2",
    "quorum_age": 127,
    "monmap": {
        "epoch": 3,
        "fsid": "12345678-1234-5678-9abc-123456789abc",
        "modified": "2025-01-27T20:15:42.123456Z",
        "created": "2025-01-27T20:10:15.789012Z",
        "min_mon_release_name": "reef",
        "mons": [
            {
                "rank": 0,
                "name": "n2",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.11.12.12:3300"
                        }
                    ]
                }
            }
        ]
    }
}

What to verify:

  • 3 monitors in quorum: "quorum_names": ["n2", "n3", "n4"]
  • All nodes listed: Should see all 3 mesh nodes
  • Leader elected: "quorum_leader_name" should show one of the nodes

Step 16: Create OSDs (2 per Node)

Create high-performance OSDs on NVMe drives:

CLI Approach:

# Create OSDs on n2:
ssh n2 "pveceph osd create /dev/nvme0n1"
ssh n2 "pveceph osd create /dev/nvme1n1"

# Create OSDs on n3:
ssh n3 "pveceph osd create /dev/nvme0n1"
ssh n3 "pveceph osd create /dev/nvme1n1"

# Create OSDs on n4:
ssh n4 "pveceph osd create /dev/nvme0n1"
ssh n4 "pveceph osd create /dev/nvme1n1"

Expected output (for each OSD):

Creating OSD on /dev/nvme0n1
OSD.0 created successfully.
OSD daemon started.

GUI Approach:

  • Location: Each node → Ceph → OSD → "Create: OSD"
  • Select: Choose /dev/nvme0n1 and /dev/nvme1n1 from device list
  • Advanced: Leave DB/WAL settings as default (co-located)
  • Result: Green "OSD created successfully" messages

Verify all OSDs are up:

ssh n2 "ceph osd tree"

Expected output:

ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF 
-1       5.45776 root default                          
-3       1.81959     host n2                           
 0   ssd 0.90979         osd.0     up  1.00000 1.00000 
 1   ssd 0.90979         osd.1     up  1.00000 1.00000 
-5       1.81959     host n3                           
 2   ssd 0.90979         osd.2     up  1.00000 1.00000 
 3   ssd 0.90979         osd.3     up  1.00000 1.00000 
-7       1.81959     host n4                           
 4   ssd 0.90979         osd.4     up  1.00000 1.00000 
 5   ssd 0.90979         osd.5     up  1.00000 1.00000 

What to verify:

  • 6 OSDs total: 2 per mesh node (osd.0-5)
  • All 'up' status: Every OSD shows up in STATUS column
  • Weight 1.00000: All OSDs have full weight (not being rebalanced out)
  • Hosts organized: Each node (n2, n3, n4) shows as separate host with 2 OSDs

Phase 5: High-Performance Optimizations

Step 17: Memory Optimizations (64GB RAM Nodes)

Configure optimal memory usage for high-performance hardware:

# Set OSD memory target to 8GB per OSD (ideal for 64GB nodes):
ssh n2 "ceph config set osd osd_memory_target 8589934592"

# Set BlueStore cache sizes for NVMe performance:
ssh n2 "ceph config set osd bluestore_cache_size_ssd 4294967296"

# Set memory allocation optimizations:
ssh n2 "ceph config set osd osd_memory_cache_min 1073741824"
ssh n2 "ceph config set osd osd_memory_cache_resize_interval 1"

Step 18: CPU and Threading Optimizations (13th Gen Intel)

Optimize for high-performance CPUs:

# Set CPU threading optimizations:
ssh n2 "ceph config set osd osd_op_num_threads_per_shard 2"
ssh n2 "ceph config set osd osd_op_num_shards 8"

# Set BlueStore threading for NVMe:
ssh n2 "ceph config set osd bluestore_sync_submit_transaction false"
ssh n2 "ceph config set osd bluestore_throttle_bytes 268435456"
ssh n2 "ceph config set osd bluestore_throttle_deferred_bytes 134217728"

# Set CPU-specific optimizations:
ssh n2 "ceph config set osd osd_client_message_cap 1000"
ssh n2 "ceph config set osd osd_client_message_size_cap 1073741824"

Step 19: Network Optimizations for TB4 Mesh

Optimize network settings for TB4 high-performance cluster communication:

# Set network optimizations for TB4 mesh (65520 MTU, sub-ms latency):
ssh n2 "ceph config set global ms_tcp_nodelay true"
ssh n2 "ceph config set global ms_tcp_rcvbuf 134217728"
ssh n2 "ceph config set global ms_tcp_prefetch_max_size 65536"

# Set cluster network optimizations for 10.100.0.0/24 TB4 mesh:
ssh n2 "ceph config set global ms_cluster_mode crc"
ssh n2 "ceph config set global ms_async_op_threads 8"
ssh n2 "ceph config set global ms_dispatch_throttle_bytes 1073741824"

# Set heartbeat optimizations for fast TB4 network:
ssh n2 "ceph config set osd osd_heartbeat_interval 6"
ssh n2 "ceph config set osd osd_heartbeat_grace 20"

Step 20: BlueStore and NVMe Optimizations

Configure BlueStore for maximum NVMe and TB4 performance:

# Set BlueStore optimizations for NVMe drives:
ssh n2 "ceph config set osd bluestore_compression_algorithm lz4"
ssh n2 "ceph config set osd bluestore_compression_mode aggressive"
ssh n2 "ceph config set osd bluestore_compression_required_ratio 0.7"

# Set NVMe-specific optimizations:
ssh n2 "ceph config set osd bluestore_cache_trim_interval 200"

# Set WAL and DB optimizations for NVMe:
ssh n2 "ceph config set osd bluestore_block_db_size 5368709120"
ssh n2 "ceph config set osd bluestore_block_wal_size 1073741824"

Step 21: Scrubbing and Maintenance Optimizations

Configure scrubbing for high-performance environment:

# Set scrubbing optimizations:
ssh n2 "ceph config set osd osd_scrub_during_recovery false"
ssh n2 "ceph config set osd osd_scrub_begin_hour 2"
ssh n2 "ceph config set osd osd_scrub_end_hour 6"

# Set deep scrub optimizations:
ssh n2 "ceph config set osd osd_deep_scrub_interval 1209600"
ssh n2 "ceph config set osd osd_scrub_max_interval 1209600"
ssh n2 "ceph config set osd osd_scrub_min_interval 86400"

# Set recovery optimizations for TB4 mesh:
ssh n2 "ceph config set osd osd_recovery_max_active 8"
ssh n2 "ceph config set osd osd_max_backfills 4"
ssh n2 "ceph config set osd osd_recovery_op_priority 1"

Phase 6: Storage Pool Creation and Configuration

Step 22: Create High-Performance Storage Pool

Create optimized storage pool with 2:1 replication ratio:

# Create pool with optimal PG count for 6 OSDs (256 PGs = ~85 PGs per OSD):
ssh n2 "ceph osd pool create cephtb4 256 256"

# Set 2:1 replication ratio (size=2, min_size=1) for test lab:
ssh n2 "ceph osd pool set cephtb4 size 2"
ssh n2 "ceph osd pool set cephtb4 min_size 1"

# Enable RBD application for Proxmox integration:
ssh n2 "ceph osd pool application enable cephtb4 rbd"

Step 23: Verify Cluster Health

Check that cluster is healthy and ready:

ssh n2 "ceph -s"

Expected results:

  • Health: HEALTH_OK (or HEALTH_WARN with minor warnings)
  • OSDs: 6 osds: 6 up, 6 in
  • PGs: All PGs active+clean
  • Pools: cephtb4 pool created and ready

Phase 7: Performance Testing and Validation

Step 24: Test Optimized Cluster Performance

Run comprehensive performance testing to validate optimizations:

# Test write performance with optimized cluster:
ssh n2 "rados -p cephtb4 bench 10 write --no-cleanup -b 4M -t 16"

# Test read performance:
ssh n2 "rados -p cephtb4 bench 10 rand -t 16"

# Clean up test data:
ssh n2 "rados -p cephtb4 cleanup"

Results

Write Performance:

  • Average Bandwidth: 1,294 MB/s
  • Peak Bandwidth: 2,076 MB/s
  • Average IOPS: 323
  • Average Latency: ~48ms

Read Performance:

  • Average Bandwidth: 1,762 MB/s
  • Peak Bandwidth: 2,448 MB/s
  • Average IOPS: 440
  • Average Latency: ~36ms

Step 25: Verify Configuration Database

Check that all optimizations are active in Proxmox GUI:

  1. Navigate: Ceph → Configuration Database
  2. Verify: All optimization settings visible and applied
  3. Check: No configuration errors or warnings

Key optimizations to verify:

  • osd_memory_target: 8589934592 (8GB per OSD)
  • bluestore_cache_size_ssd: 4294967296 (4GB cache)
  • bluestore_compression_algorithm: lz4
  • cluster_network: 10.100.0.0/24 (TB4 mesh)
  • public_network: 10.11.12.0/24

Troubleshooting Common Issues

TB4 Mesh Issues

Problem: TB4 interfaces not coming up after reboot
Root Cause: Udev rules may not trigger on boot, scripts may be corrupted

Quick Fix: Manually bring up interfaces:

# Solution: Manually bring up interfaces and reapply SDN config:
for node in n2 n3 n4; do
  ssh $node "ip link set en05 up mtu 65520"
  ssh $node "ip link set en06 up mtu 65520"
  ssh $node "ifreload -a"
done

Permanent Fix: Check systemd service and scripts:

# Verify systemd service is enabled:
for node in n2 n3 n4; do
  ssh $node "systemctl status thunderbolt-interfaces.service"
done

# Check if scripts are corrupted (should be ~13 lines, not 31073):
for node in n2 n3 n4; do
  ssh $node "wc -l /usr/local/bin/pve-en*.sh"
done

# Check for shebang errors:
for node in n2 n3 n4; do
  ssh $node "head -1 /usr/local/bin/*.sh | grep -E 'thunderbolt|pve-en'"
done
# If you see #\!/bin/bash (with backslash), fix it:
for node in n2 n3 n4; do
  ssh $node "sed -i '1s/#\\\\!/#!/' /usr/local/bin/thunderbolt-startup.sh"
  ssh $node "sed -i '1s/#\\\\!/#!/' /usr/local/bin/pve-en05.sh"
  ssh $node "sed -i '1s/#\\\\!/#!/' /usr/local/bin/pve-en06.sh"
done

Problem: Systemd service fails with "Exec format error"

  • Root Cause: Corrupted shebang line in scripts (#\!/bin/bash, with a stray backslash, instead of #!/bin/bash)
  • Diagnosis: Check systemctl status thunderbolt-interfaces for exec format errors
  • Solution: Fix shebang lines as shown above, then restart service

Problem: Mesh connectivity fails between some nodes

# Check interface status:
for node in n2 n3 n4; do
  echo "=== $node TB4 status ==="
  ssh $node "ip addr show | grep -E '(en05|en06|10\.100\.0\.)'"
done

# Verify FRR routing service:
for node in n2 n3 n4; do
  ssh $node "systemctl status frr"
done

Ceph Issues

Problem: OSDs going down after creation

  • Root Cause: Usually network connectivity issues (TB4 mesh not working)
  • Solution: Fix TB4 mesh first, then restart OSD services:
# Restart OSD services after fixing mesh:
for node in n2 n3 n4; do
  ssh $node "systemctl restart ceph-osd@*.service"
done

Problem: Ceph cluster shows OSDs down after reboot

  • Symptoms: ceph status shows OSDs down, heartbeat failures in logs
  • Root Cause: TB4 interfaces (Ceph private network) not coming up
  • Solution:
# 1. Bring up TB4 interfaces on all nodes:
for node in n2 n3 n4; do
  ssh $node "/usr/local/bin/pve-en05.sh"
  ssh $node "/usr/local/bin/pve-en06.sh"
done

# 2. Wait for interfaces to stabilize:
sleep 10

# 3. Restart Ceph OSDs:
for node in n2 n3 n4; do
  ssh $node "systemctl restart ceph-osd@*.service"
done

# 4. Monitor recovery:
ssh n2 "watch ceph -s"

Problem: Inactive PGs or slow performance

# Check cluster status:
ssh n2 "ceph -s"

# Verify optimizations are applied:
ssh n2 "ceph config dump | grep -E '(memory_target|cache_size|compression)'"

# Check network binding:
ssh n2 "ceph config get osd cluster_network"
ssh n2 "ceph config get osd public_network"

Problem: Proxmox GUI doesn't show OSDs

  • Root Cause: Usually config database synchronization issues
  • Solution: Restart Ceph monitor services and check GUI again

System-Level Performance Optimizations (Optional)

Additional OS-Level Tuning

For even better performance on high-end hardware:

# Apply on all mesh nodes:
for node in n2 n3 n4; do
  ssh $node "
    # Network tuning:
    echo 'net.core.rmem_max = 268435456' >> /etc/sysctl.conf
    echo 'net.core.wmem_max = 268435456' >> /etc/sysctl.conf
    echo 'net.core.netdev_max_backlog = 30000' >> /etc/sysctl.conf
    
    # Memory tuning:
    echo 'vm.swappiness = 1' >> /etc/sysctl.conf
    echo 'vm.min_free_kbytes = 4194304' >> /etc/sysctl.conf
    
    # Apply settings:
    sysctl -p
  "
done

Changelog

July 30, 2025

  • Added troubleshooting for "Exec format error" caused by corrupted shebang lines
  • Fixed script examples to ensure proper shebang format (#!/bin/bash)
  • Added diagnostic commands for detecting shebang corruption

July 28, 2025

  • Initial complete guide created
  • Integrated TB4 mesh networking with Ceph storage
  • Added systemd service for boot reliability
  • Comprehensive troubleshooting section
@jhhoffma3

Do I need to have the cluster already configured before implementing this guide or should the SDN fabric work independently?

The reason I ask is because I can get everything working up until Phase 2, Step 4. I can create the OpenFabric on n2, but I can't add anymore nodes as I am presented with the error "All available nodes are already part of the fabric".

Do I need to create the same fabric on each node? I tried this but still couldn't recognize the other nodes (they were not joined to a cluster).

@ChadYoshikawa

Do I need to have the cluster already configured before implementing this guide or should the SDN fabric work independently?

The reason I ask is because I can get everything working up until Phase 2, Step 4. I can create the OpenFabric on n2, but I can't add anymore nodes as I am presented with the error "All available nodes are already part of the fabric".

Do I need to create the same fabric on each node? I tried this but still couldn't recognize the other nodes (they were not joined to a cluster).

I just ran into this -- yes, the guide assumes a cluster with the nodes already added. Without that setup, the other nodes do not show up.

@jhhoffma3

Do I need to have the cluster already configured before implementing this guide or should the SDN fabric work independently?
The reason I ask is because I can get everything working up until Phase 2, Step 4. I can create the OpenFabric on n2, but I can't add anymore nodes as I am presented with the error "All available nodes are already part of the fabric".
Do I need to create the same fabric on each node? I tried this but still couldn't recognize the other nodes (they were not joined to a cluster).

I just ran into this -- yes, the guide assumes a cluster with the nodes already added. Without that setup, the other nodes do not show up.

Thanks, that got me past that step and through the rest of the setup. Now I'm stuck in the same spot I was with scyto's guide. Once the devices are setup in the cluster and I swap the migration network over to the TB network I get the following error.

could not get migration ip: multiple, different, IP address configured for network '10.100.0.0/24'
TASK ERROR: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Prox01' -o 'UserKnownHostsFile=/etc/pve/nodes/Prox01/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' [email protected] pvecm mtunnel -migration_network 10.100.0.0/24 -get_migration_ip' failed: exit code 255 <

Not sure how to fix this. I believe it has something to do with the fact that the nodes (n2-n4) have IP's assigned to them, but each adapter also has static ip's assigned to them as well...maybe?

Really want to get this running so I can move everything to my MS-01 cluster and off my NUC cluster.

@Yearly1825

Yearly1825 commented Sep 26, 2025

I followed this guide and made the scripting changes mentioned in the comments. I also removed the /30 IP addresses from the SDN section in the GUI since I didn't understand what those would do, and I didn't notice any negative effects. TB interfaces reliably come up and iperf3 performance matches expectations.

I am having terrible Ceph performance though: 180 MB/s average on a Samsung 990 NVMe on 3x MS-01s with the 13th-gen Core i9s... I have verified the traffic is in fact using the Thunderbolt interfaces and not any of the 2.5G interfaces.

I am not sure what could be causing this. I tested another cluster that is still on Proxmox 8 and used scyto's guide, and the performance without tuning is at about 1100 MB/s.

Anyone have any ideas?

EDIT: OK, the issue was that putting the Ceph public network on the slower network actually degrades performance for this small cluster. After moving both the public and cluster networks to the 10.100.x.x network I saw speeds closer to what this gist outlines. See this post for additional details: https://forum.proxmox.com/threads/low-ceph-performance-on-3-node-proxmox-9-cluster-with-sata-ssds.170091/

@Yearly1825

Yearly1825 commented Sep 26, 2025

I forked this gist and modified some of the instructions to include what I changed for PVE 9.0.10. I welcome comments if this helps others.

https://gist.github.com/Yearly1825/1e2798cbe4fb0e0d0574551da7dab0a0

Main Changes:

  • Fixed script error handling: Replaced problematic || syntax with proper if/then/else statements in interface bringup scripts (pve-en05.sh and pve-en06.sh) for more reliable error handling and retry logic

  • Optimized Ceph networking: Changed configuration to use Thunderbolt network (10.100.0.0/24) for both public and cluster networks instead of split networks, resolving performance degradation issues discovered in production (ref: Proxmox forum thread #170091)

  • Simplified network topology: Removed confusing and unnecessary /30 point-to-point subnet configuration - OpenFabric handles mesh routing automatically without manual subnet assignments

  • Improved /etc/network/interfaces instructions: Changed from appending to file to properly inserting configuration above the source directive to prevent conflicts

  • Enhanced script reliability: Added proper bash conditionals for more predictable behavior during interface initialization

Hopefully this can help others and thanks to taslabs-net for the work.

@taslabs-net
Author

taslabs-net commented Sep 26, 2025

I purposefully don't use the same Ceph public and private network on purpose. I need other nodes in my cluster to have access to the Ceph resources.

When I set up each node, I immediately create and transfer SSH key to each and then make a local alias for ssh {n2,n3,n4} etc. Since I'm doing the exact same thing on 3 machines, it's easier to just put into scripts and send the same command 3 times. This has worked for me start to finish a few times. But now it's just stable and works.

They are NOT a proxmox Cluster beforehand, though you could. Doesn't matter either way.

Good luck everyone!

@Yearly1825

I purposefully don't use the same Ceph public and private network on purpose. I need other nodes in my cluster to have access to the Ceph resources.

When I set up each node, I immediately create and transfer SSH key to each and then make a local alias for ssh {n2,n3,n4} etc. Since I'm doing the exact same thing on 3 machines, it's easier to just put into scripts and send the same command 3 times. This has worked for me start to finish a few times. But now it's just stable and works.

They are NOT a proxmox Cluster beforehand, though you could. Doesn't matter either way.

Good luck everyone!

Yup, definitely different use cases. I use tmux and synchronize panes to run all commands on multiple machines at once. But unlike you I did not need to have Ceph available for other VMs, though that is a good point. I was only using the 2.5Gb ports at the moment and I'm sure if I added the SFP+ ports for the public network my speed would have improved.

Again thanks for your gist, definitely couldn't have done this without it.

@taslabs-net
Author

I was only using the 2.5Gb ports at the moment and I'm sure if I added the SFP+ ports for the public network my speed would have improved.

That is very true. I am using the 10g's lacp to my switches for the public network.

@jhhoffma3

I forked this gist and modified some of the instructions to include what I changed for PVE 9.0.10. I welcome comments if this helps others.

https://gist.github.com/Yearly1825/1e2798cbe4fb0e0d0574551da7dab0a0

Main Changes:

  • Fixed script error handling: Replaced problematic || syntax with proper if/then/else statements in interface bringup scripts (pve-en05.sh and pve-en06.sh) for more reliable error handling and retry logic
  • Optimized Ceph networking: Changed configuration to use Thunderbolt network (10.100.0.0/24) for both public and cluster networks instead of split networks, resolving performance degradation issues discovered in production (ref: Proxmox forum thread #170091)
  • Simplified network topology: Removed confusing and unnecessary /30 point-to-point subnet configuration - OpenFabric handles mesh routing automatically without manual subnet assignments
  • Improved /etc/network/interfaces instructions: Changed from appending to file to properly inserting configuration above the source directive to prevent conflicts
  • Enhanced script reliability: Added proper bash conditionals for more predictable behavior during interface initialization

Hopefully this can help others and thanks to taslabs-net for the work.

Yes, thanks to all. These suggestions and Taslabs guide def got me where I needed to be and after removing the /30's on the adapters, I was able to migrate an LXC over the TB4 network. Now to migrate everything else...there goes my weekend.

@jhhoffma3

jhhoffma3 commented Oct 2, 2025

I forked this gist and modified some of the instructions to include what I changed for PVE 9.0.10. I welcome comments if this helps others.
https://gist.github.com/Yearly1825/1e2798cbe4fb0e0d0574551da7dab0a0
Main Changes:

  • Fixed script error handling: Replaced problematic || syntax with proper if/then/else statements in interface bringup scripts (pve-en05.sh and pve-en06.sh) for more reliable error handling and retry logic
  • Optimized Ceph networking: Changed configuration to use Thunderbolt network (10.100.0.0/24) for both public and cluster networks instead of split networks, resolving performance degradation issues discovered in production (ref: Proxmox forum thread #170091)
  • Simplified network topology: Removed confusing and unnecessary /30 point-to-point subnet configuration - OpenFabric handles mesh routing automatically without manual subnet assignments
  • Improved /etc/network/interfaces instructions: Changed from appending to file to properly inserting configuration above the source directive to prevent conflicts
  • Enhanced script reliability: Added proper bash conditionals for more predictable behavior during interface initialization

Hopefully this can help others and thanks to taslabs-net for the work.

Yes, thanks to all. These suggestions and Taslabs guide def got me where I needed to be and after removing the /30's on the adapters, I was able to migrate an LXC over the TB4 network. Now to migrate everything else...there goes my weekend.

NOT SO FAST....I migrated one LXC and after rebooting the nodes, migrations no longer work but no errors are generated in UI (only message in popup is "no content", which stalls indefinitely). The only thing I was able to do to make the tb4 network operational for migrations is to go into the SDN settings and hit "Apply" again. After that process completed, migrations work among all nodes in the cluster again.

Any thoughts on this issue and/or how to ensure that SDN settings are reapplied properly after reboot?

UPDATE: Still looking for help on this? Anyone?
