
Enterprise Install

Refer to the following sections to troubleshoot errors encountered when installing an Enterprise Cluster.

Scenario - VerteX Management Appliance Fails to Upgrade due to Stuck LINSTOR Satellite Pods

When attempting to upgrade the VerteX Management Appliance, the linstor-satellite.* and linstor-csi-node.* pods may become stuck, which causes the upgrade process to stall. This is because the linstor-satellite.* pods may be using an incorrect Distributed Replicated Block Device (DRBD) image for the drbd-module-loader container.

To resolve this issue, you can check whether the pods are using an incorrect image and update them if necessary.

Debug Steps

  1. Log in to the Local UI of the leader node of your VerteX management cluster. By default, Local UI is accessible at https://<node-ip>:5080. Replace <node-ip> with the IP address of the leader node.

  2. From the left main menu, click Cluster.

  3. Under Environment, download the Admin Kubeconfig File by clicking on the <cluster-name>.kubeconfig hyperlink.

  4. Open a terminal session in an environment that has network access to the VerteX management cluster.

  5. Issue the following command to set the KUBECONFIG environment variable to the path of the kubeconfig file you downloaded in step 3.

    export KUBECONFIG=<path-to-kubeconfig-file>
  6. Use the following command to check the status of these pods in the piraeus-system namespace.

    kubectl get pods --namespace piraeus-system

    In the output, look for the status of the linstor-satellite.* pods.

    Example output
    NAME                                                            READY   STATUS                  RESTARTS         AGE
    ha-controller-2886l                                             1/1     Running                 0                25h
    ha-controller-nnvqt                                             1/1     Running                 1 (29m ago)      25h
    ha-controller-qhc26                                             1/1     Running                 1 (36m ago)      25h
    linstor-controller-69b8ff6479-ccpzx                             1/1     Running                 0                31m
    linstor-csi-controller-78c8bc4d55-5gk2b                         7/7     Running                 4 (28m ago)      38m
    linstor-csi-node-dp8lm                                          0/3     Error                   0                25h
    linstor-csi-node-r2hfv                                          0/3     Error                   0                25h
    linstor-csi-node-tpt6h                                          3/3     Running                 0                25h
    linstor-satellite.edge-1d3f3842cb0fdcef14b65cb510b5974f-5vkml   2/2     Running                 0                25h
    linstor-satellite.edge-53583842350d90345a1f7251033cb228-8s7js   0/2     Init:CrashLoopBackOff   10 (2m54s ago)   26m
    linstor-satellite.edge-c0913842383ebd183d13d1458bb762c5-78q97   0/2     Init:CrashLoopBackOff   11 (3m46s ago)   33m
    piraeusoperator-piraeus-controller-manager-6f8988d598-b2v57     1/1     Running                 1 (28m ago)      25h
  7. If any of the linstor-satellite.* pods are not in a Running state, use the following command to describe the pods. Replace <pod-name> with the name of the LINSTOR satellite pod you want to inspect.

    kubectl describe pod <pod-name> --namespace piraeus-system

    Look for events indicating that the pod is attempting to use the drbd9-jammy:v9.2.13 image for the drbd-module-loader container, such as the following example.

    Example output
    ...
    Events:
      Type     Reason     Age                   From               Message
      ----     ------     ----                  ----               -------
      Normal   Scheduled  34m                   default-scheduler  Successfully assigned piraeus-system/linstor-satellite.edge-c0913842383ebd183d13d1458bb762c5-78q97 to edge-c0913842383ebd183d13d1458bb762c5
      Warning  BackOff    26m (x6 over 31m)     kubelet            Back-off restarting failed container drbd-module-loader in pod linstor-satellite.edge-c0913842383ebd183d13d1458bb762c5-78q97_piraeus-system(71ea7db5-cc2c-4585-b1f7-fcc19bf14891)
      Normal   Pulled     26m (x5 over 34m)     kubelet            Container image "us-docker.pkg.dev/palette-images-fips/packs/piraeus-operator/2.8.1/drbd9-jammy:v9.2.13" already present on machine
      Normal   Created    26m (x5 over 34m)     kubelet            Created container: drbd-module-loader
      Normal   Started    26m (x5 over 34m)     kubelet            Started container drbd-module-loader
      Normal   Pulled     5m58s (x7 over 25m)   kubelet            Container image "us-docker.pkg.dev/palette-images-fips/packs/piraeus-operator/2.8.1/drbd9-jammy:v9.2.13" already present on machine
      Normal   Created    5m58s (x7 over 25m)   kubelet            Created container: drbd-module-loader
      Normal   Started    5m58s (x7 over 25m)   kubelet            Started container drbd-module-loader
      Warning  BackOff    3m41s (x53 over 23m)  kubelet            Back-off restarting failed container drbd-module-loader in pod linstor-satellite.edge-c0913842383ebd183d13d1458bb762c5-78q97_piraeus-system(71ea7db5-cc2c-4585-b1f7-fcc19bf14891)
  8. If any of the linstor-satellite.* pods are using the drbd9-jammy:v9.2.13 image, issue the following command to create a manifest that corrects the image reference for the drbd-module-loader container.

    kubectl apply --filename - <<EOF
    apiVersion: piraeus.io/v1
    kind: LinstorSatelliteConfiguration
    metadata:
      name: custom-loader-image
      namespace: piraeus-system
    spec:
      podTemplate:
        spec:
          initContainers:
            - name: drbd-module-loader
              image: us-docker.pkg.dev/palette-images-fips/packs/piraeus-operator/2.8.1/dbrd-loader:v2.8.1
              imagePullPolicy: IfNotPresent
    EOF
    Expected output
    linstorsatelliteconfiguration.piraeus.io/custom-loader-image created
  9. Wait for the linstor-satellite.* pods to be recreated with the new image. See the sketch after these steps for one way to watch the rollout and confirm the corrected image.

  10. Verify that the drbd-module-loader container is using the new image by describing the linstor-satellite.* pods. Replace <pod-name> with the name of the pod you want to check. You may need to issue kubectl get pods --namespace piraeus-system first as the pod names will have changed.

    kubectl describe pods <pod-name> --namespace piraeus-system

    Look for events indicating that the drbd-module-loader container is now using the dbrd-loader:v2.8.1 image.

    Example output
    ...
    Events:
      Type    Reason     Age    From               Message
      ----    ------     ----   ----               -------
      Normal  Scheduled  4m44s  default-scheduler  Successfully assigned piraeus-system/linstor-satellite.edge-c0913842383ebd183d13d1458bb762c5-wfd4q to edge-c0913842383ebd183d13d1458bb762c5
      Normal  Pulled     4m45s  kubelet            Container image "us-docker.pkg.dev/palette-images-fips/packs/piraeus-operator/2.8.1/dbrd-loader:v2.8.1" already present on machine
      Normal  Created    4m45s  kubelet            Created container: drbd-module-loader
      Normal  Started    4m44s  kubelet            Started container drbd-module-loader
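
To follow the rollout referenced in steps 9 and 10, you can optionally watch the pods in the piraeus-system namespace and, once they are recreated, list the init container image each pod uses. The following is a minimal sketch; the jsonpath expression assumes the drbd-module-loader container is defined as an init container, as in the manifest above.

    # Watch the pods until the new linstor-satellite.* replicas reach a Running state.
    kubectl get pods --namespace piraeus-system --watch

    # Print each pod name with the images of its init containers to confirm the corrected image.
    kubectl get pods --namespace piraeus-system \
      --output jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.initContainers[*].image}{"\n"}{end}'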

The VerteX Management Appliance upgrade process will then continue. You can monitor the upgrade progress in Local UI.

Scenario - Palette/VerteX Management Appliance Installation Stalled due to piraeus-operator Pack in Error State

During the installation of the Palette or VerteX Management Appliance, the piraeus-operator pack can enter an error state in Local UI. This can be caused by stalled creation of Kubernetes secrets in the piraeus-system namespace and can prevent the installation from completing successfully.

To resolve, you can manually delete any secrets in the piraeus-system namespace that have a pending-install status label. This will allow the piraeus-operator pack to complete its deployment and the Palette/VerteX Management Appliance installation to proceed.

Debug Steps

  1. Log in to the Local UI of the leader node of your Palette/VerteX management cluster. By default, Local UI is accessible at https://<node-ip>:5080. Replace <node-ip> with the IP address of the leader node.

  2. From the left main menu, click Cluster.

  3. Download the Admin Kubeconfig File by clicking on the <cluster-name>.kubeconfig hyperlink.

  4. Open a terminal session in an environment that has network access to the Palette/VerteX management cluster.

  5. Issue the following command to set the KUBECONFIG environment variable to the path of the kubeconfig file you downloaded in step 3.

    export KUBECONFIG=<path-to-kubeconfig-file>
  6. Use kubectl to list all secrets in the piraeus-system namespace.

    kubectl get secrets --namespace piraeus-system
    Example output
    NAME                                                      TYPE                 DATA   AGE
    sh.helm.release.v1.piraeusoperator-linstor-gui.v1         helm.sh/release.v1   1      1h
    sh.helm.release.v1.piraeusoperator-linstor-gui.v2         helm.sh/release.v1   1      1h
    sh.helm.release.v1.piraeusoperator-piraeus-cluster.v1     helm.sh/release.v1   1      1h
    sh.helm.release.v1.piraeusoperator-piraeus-dashboard.v1   helm.sh/release.v1   1      1h
    sh.helm.release.v1.piraeusoperator-piraeus.v1             helm.sh/release.v1   1      1h
    sh.helm.release.v1.piraeusoperator-piraeus.v2             helm.sh/release.v1   1      1h
  7. Use the following command to check each secret for a pending-install status label. Replace <secret-name> with the name of the secret you want to check.

    kubectl describe secrets <secret-name> --namespace piraeus-system
    Example output
    Name:         sh.helm.release.v1.piraeusoperator-piraeus-cluster.v1
    Namespace:    piraeus-system
    Labels:       modifiedAt=0123456789
                  name=piraeusoperator-piraeus-cluster
                  owner=helm
                  status=pending-install
                  version=1
    Annotations:  <none>

    Type:  helm.sh/release.v1

    Data
    ====
    release:  7156 bytes
    tip

    You can also try using the following command to filter for secrets with a pending-install status label.

    kubectl describe secrets --namespace piraeus-system --selector status=pending-install
  8. If you find any secrets with this label, delete them using the following command. Replace <secret-name> with the name of the secret you want to delete. See the sketch after these steps for an optional way to delete all matching secrets in one pass.

    kubectl delete secrets <secret-name> --namespace piraeus-system
  9. After deleting any secrets with a pending-install status label, wait for the piraeus-operator pack to enter a Running status in Local UI. The installation of Palette/VerteX Management Appliance should then proceed successfully.
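
As noted in step 8, if several secrets carry the pending-install label, you can optionally remove them in a single pass with a label selector and then confirm that the workloads in the piraeus-system namespace settle. This is a minimal sketch of the same cleanup, assuming the status=pending-install label matches only the stalled Helm release secrets.

    # Delete every secret in piraeus-system that carries the pending-install status label.
    kubectl delete secrets --namespace piraeus-system --selector status=pending-install

    # Confirm the piraeus-operator workloads come up cleanly afterwards.
    kubectl get pods --namespace piraeus-system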

Scenario - Unexpected Logouts in Tenant Console After Palette/VerteX Management Appliance Installation

After installing self-hosted Palette/Palette VerteX using the Palette Management Appliance or VerteX Management Appliance, you may experience unexpected logouts when using the tenant console. This can be caused by a time skew on your Palette/VerteX management cluster nodes, which leads to authentication issues.

To verify the system time, open a terminal session on each node in your Palette/VerteX management cluster and issue the following command to check the system time.

timedatectl

An output similar to the following will be displayed. A time skew is indicated when the Local time and Universal time values differ across the nodes.

               Local time: Fri 2025-07-11 09:41:42 UTC
           Universal time: Fri 2025-07-11 09:41:42 UTC
                 RTC time: Fri 2025-07-11 09:41:42
                Time zone: UTC (UTC, +0000)
System clock synchronized: no
              NTP service: active
          RTC in local TZ: no

You may also notice errors in the auth-* pod logs in the hubble-system namespace similar to the following.

Example command to extract logs from auth pod
kubectl logs --namespace hubble-system auth-5f95c77cb-49jtv
Example output
auth-5f95c77cb-49jtv Jul  7 17:22:46.378 ERROR [hubble_token.go:426/hucontext.getClaimsFromToken] [Unable to parse the token 'abcd...1234' due to Token used before issued]
auth-5f95c77cb-49jtv Jul  7 17:22:46.378 ERROR [auth_service.go:282/service.(*AuthService).Logout] [provided token 'xxxxx' is not valid Token used before issued]
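
The pod name shown above is an example. To find the auth-* pod names in your own cluster before extracting logs, you can list them first.

kubectl get pods --namespace hubble-system | grep auth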

These errors indicate that the system time on your Palette/VerteX management cluster nodes is not synchronized with a Network Time Protocol (NTP) server. To resolve this issue, you can configure an NTP server in the Palette/VerteX management cluster settings.

Debug Steps

  1. Log in to Local UI of the leader node of your Palette/VerteX management cluster.

  2. On the left main menu, click Cluster.

  3. Click Actions in the top-right corner and select Cluster Settings from the drop-down menu.

  4. In the Network Time Protocol (NTP) (Optional) field, enter the NTP server that you want to use for your Palette/VerteX management cluster. For example, you can use pool.ntp.org or any other NTP server that is accessible from your Palette/VerteX management cluster nodes.

  5. Click Save Changes to apply the changes. After the nodes synchronize with the configured NTP server, you can verify the fix as shown in the sketch below.
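
Once the nodes have had a few minutes to synchronize, rerun timedatectl on each node and confirm that System clock synchronized now reports yes.

timedatectl | grep --ignore-case "synchronized"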

Scenario - IP Pool Exhausted During Airgapped Upgrade

When upgrading a self-hosted airgapped cluster to version 4.6.32, the IPAM controller may report an Exhausted IP Pools error despite having available IP addresses. This is due to a race condition in CAPV version 1.12.0, which may lead to an orphaned IP claim when its associated VMware vSphere machine is deleted during the control plane rollout. When this occurs, the IP claim and IP address are not cleaned up, keeping the IP reserved and exhausting the IP pool. To complete the upgrade, you must manually release the orphaned claim holding the IP address.

Debug Steps

  1. Open a terminal session in an environment that has network access to the cluster. Refer to Access Cluster with CLI for additional guidance.

  2. Issue the following command to list the IP addresses of the current nodes in the cluster.

    info

    The airgap support VM is not listed, only the nodes in the cluster.

    kubectl get nodes \
    --output jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'
    Example output
    10.10.227.13
    10.10.227.11
    10.10.227.14
  3. List all IP claims in the spectro-mgmt namespace. The base spectro-mgmt-cluster claim belongs to the airgap support VM.

    kubectl get ipclaim --namespace spectro-mgmt
    Example output
    NAMESPACE      NAME                                    AGE
    spectro-mgmt   spectro-mgmt-cluster                    29h
    spectro-mgmt   spectro-mgmt-cluster-cp-43978-dw858-0   14h
    spectro-mgmt   spectro-mgmt-cluster-cp-43978-p2bpg-0   29h
    spectro-mgmt   spectro-mgmt-cluster-cp-dt44d-0         14h
    spectro-mgmt   spectro-mgmt-cluster-cp-qx4vw-0         6m
  4. Map each claim to its allocated IP.

    kubectl get ipclaim --namespace spectro-mgmt \
    --output jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.address.name}{"\n"}{end}'

    Compare the IP addresses of the nodes in the cluster to the IP addresses in the claim list, ignoring the spectro-mgmt-cluster claim of the airgap support VM. The IP that appears in the claim list but does not appear in the node list belongs to the orphaned claim. In the example below, the orphaned claim is spectro-mgmt-cluster-cp-qx4vw-0, which is tied to the IP address 10.10.227.12 (spectro-mgmt-cluster-cluster1-10-10-227-12). See the sketch after these steps for one way to script this comparison.

    Example output
    spectro-mgmt-cluster                    spectro-mgmt-cluster-cluster1-10-10-227-10
    spectro-mgmt-cluster-cp-43978-dw858-0   spectro-mgmt-cluster-cluster1-10-10-227-14
    spectro-mgmt-cluster-cp-43978-p2bpg-0   spectro-mgmt-cluster-cluster1-10-10-227-13
    spectro-mgmt-cluster-cp-dt44d-0         spectro-mgmt-cluster-cluster1-10-10-227-11
    spectro-mgmt-cluster-cp-qx4vw-0         spectro-mgmt-cluster-cluster1-10-10-227-12
  5. Delete the orphaned claim.

    kubectl delete ipclaim --namespace spectro-mgmt <claim-name>
    Example command
    kubectl delete ipclaim --namespace spectro-mgmt spectro-mgmt-cluster-cp-qx4vw-0
  6. Re-run the upgrade. For guidance, refer to the applicable upgrade guide for your airgapped instance of Palette or VerteX.
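
As referenced in step 4, the comparison between node IPs and claimed IPs can also be scripted. The following is a minimal sketch that assumes the IPAddress names end with the allocated IP in dash-separated form, as in the example output above; any claimed IP it prints that does not belong to the airgap support VM points to the orphaned claim.

    # Collect the InternalIP of every node currently in the cluster.
    kubectl get nodes \
      --output jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' \
      | sort > /tmp/node-ips.txt

    # Derive each claimed IP from the trailing a-b-c-d portion of the IPAddress name.
    kubectl get ipclaim --namespace spectro-mgmt \
      --output jsonpath='{range .items[*]}{.status.address.name}{"\n"}{end}' \
      | grep --only-matching --extended-regexp '[0-9]+-[0-9]+-[0-9]+-[0-9]+$' \
      | tr '-' '.' | sort > /tmp/claimed-ips.txt

    # Print IPs that are claimed but not attached to any node.
    comm -13 /tmp/node-ips.txt /tmp/claimed-ips.txt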

Scenario - Self-Linking Error

When installing an Enterprise Cluster, you may encounter an error stating that the enterprise cluster is unable to self-link. Self-linking is the process of Palette or VerteX becoming aware of the Kubernetes cluster it is installed on. This error may occur if the self-hosted pack registry specified in the installation is missing the Certificate Authority (CA). This issue can be resolved by adding the CA to the pack registry.

Debug Steps

  1. Log in to the pack registry server that you specified in the Palette or VerteX installation.

  2. Download the CA certificate from the pack registry server. Different OCI registries have different methods for downloading the CA certificate. For Harbor, check out the Download the Harbor Certificate guide.

  3. Log in to the system console. Refer to Access Palette system console or Access Vertex system console for additional guidance.

  4. From the left navigation menu, select Administration and click on the Pack Registries tab.

  5. Click on the three-dot Menu icon for the pack registry that you specified in the installation and select Edit.

  6. Click on the Upload file button and upload the CA certificate that you downloaded in step 2.

  7. Check the box Insecure Skip TLS Verify and click on Confirm.

A pack registry configuration screen.

After a few moments, a system profile will be created and Palette or VerteX will be able to self-link successfully. If you continue to encounter issues, contact our support team by emailing support@spectrocloud.com so that we can provide you with further guidance.

Scenario - Enterprise Backup Stuck

In the scenario where an enterprise backup is stuck, a restart of the management pod may resolve the issue. Use the following steps to restart the management pod.

Debug Steps

  1. Open a terminal session in an environment that has network access to the Kubernetes cluster. Refer to the Access Cluster with CLI guide for additional guidance.

  2. Identify the mgmt pod in the hubble-system namespace. Use the following command to list all pods in the hubble-system namespace and filter for the mgmt pod.

    kubectl get pods --namespace hubble-system | grep mgmt
    mgmt-f7f97f4fd-lds69                   1/1     Running   0             45m
  3. Restart the mgmt pod by deleting it. Use the following command to delete the mgmt pod. Replace <mgmt-pod-name> with the actual name of the mgmt pod that you identified in step 2. The deployment automatically creates a replacement pod; see the sketch after these steps to confirm that it reaches a Running state.

    kubectl delete pod <mgmt-pod-name> --namespace hubble-system
    pod "mgmt-f7f97f4fd-lds69" deleted

Scenario - Non-Unique vSphere CNS Mapping

In Palette and VerteX releases 4.4.8 and earlier, Persistent Volume Claim (PVC) metadata does not use a unique identifier for self-hosted Palette clusters. This causes incorrect Cloud Native Storage (CNS) mappings in vSphere, potentially leading to issues during node operations and upgrades.

This issue is resolved in Palette and VerteX releases starting with 4.4.14. However, upgrading to 4.4.14 will not automatically resolve this issue. If you have self-hosted instances of Palette in your vSphere environment older than 4.4.14, you should execute the following utility script manually to make the CNS mapping unique for the associated PVC.

Debug Steps

  1. Ensure your machine has network access to your self-hosted Palette instance with kubectl. Alternatively, establish an SSH connection to a machine where you can access your self-hosted Palette instance with kubectl.

  2. Log in to your self-hosted Palette instance System Console.

  3. In the Main Menu, click Enterprise Cluster.

  4. In the cluster details page, scroll down to the Kubernetes Config File field and download the kubeconfig file.

  5. Issue the following command to download the utility script.

    curl --output csi-helper https://software.spectrocloud.com/tools/csi-helper/csi-helper
  6. Adjust the permission of the script.

    chmod +x csi-helper
  7. Issue the following command to execute the utility script. Replace the placeholder with the path to your kubeconfig file.

    ./csi-helper --kubeconfig=<PATH_TO_KUBECONFIG>
  8. Issue the following command to verify that the script has updated the cluster ID.

    kubectl describe configmap vsphere-cloud-config --namespace=kube-system

    If the update is successful, the cluster ID in the ConfigMap will have a unique ID assigned instead of spectro-mgmt/spectro-mgmt-cluster. See the sketch after the example output for a way to extract the cluster ID directly.

    Name:         vsphere-cloud-config
    Namespace:    kube-system
    Labels:       component=cloud-controller-manager
                  vsphere-cpi-infra=config
    Annotations:  cluster.spectrocloud.com/last-applied-hash: 17721994478134573986

    Data
    ====
    vsphere.conf:
    ----
    [Global]
    cluster-id = "896d25b9-bfac-414f-bb6f-52fd469d3a6c/spectro-mgmt-cluster"

    [VirtualCenter "vcenter.spectrocloud.dev"]
    insecure-flag = "true"
    user = "example@vsphere.local"
    password = "************"

    [Labels]
    zone = "k8s-zone"
    region = "k8s-region"


    BinaryData
    ====

    Events: <none>
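
As mentioned in step 8, you can also pull just the cluster-id line instead of reading the full ConfigMap. This is an optional convenience; the escaped key in the jsonpath expression refers to the vsphere.conf entry shown above.

    kubectl get configmap vsphere-cloud-config --namespace kube-system \
      --output jsonpath='{.data.vsphere\.conf}' | grep cluster-id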

Scenario - "Too Many Open Files" in Cluster

When viewing logs for Enterprise or Private Cloud Gateway clusters, you may encounter a "too many open files" error, which prevents logs from tailing after a certain point. To resolve this issue, you must increase the maximum number of file descriptors for each node on your cluster.

Debug Steps

Repeat the following process for each node in your cluster. See the sketch after these steps for one way to apply the change to all nodes in a loop.

  1. Log in to a node in your cluster.

    ssh -i <key-name> spectro@<hostname>
  2. Switch to a root shell using the command that best fits your system and preferences.

    sudo --login
  3. Increase the maximum number of file descriptors that the kernel can allocate system-wide.

    echo "fs.file-max = 1000000" > /etc/sysctl.d/99-maxfiles.conf
  4. Apply the updated sysctl settings. The increased limit is returned.

    sysctl -p /etc/sysctl.d/99-maxfiles.conf
    fs.file-max = 1000000
  5. Restart the kubelet and containerd services.

    systemctl restart kubelet containerd
  6. Confirm that the change was applied.

    sysctl fs.file-max
    fs.file-max = 1000000
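
As noted above, the change must be made on every node. The loop below is one way to apply it from a workstation with SSH access; the node IPs and key path are placeholders following the example in step 1, and the sketch assumes the spectro user can run sudo without a password prompt. Adjust the values for your environment.

    # Hypothetical node list and key path; replace with your own values.
    NODES="10.10.100.11 10.10.100.12 10.10.100.13"
    KEY="<key-name>"

    for node in $NODES; do
      ssh -i "$KEY" "spectro@$node" '
        echo "fs.file-max = 1000000" | sudo tee /etc/sysctl.d/99-maxfiles.conf
        sudo sysctl -p /etc/sysctl.d/99-maxfiles.conf
        sudo systemctl restart kubelet containerd
      '
    done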

Scenario - MAAS and VMware vSphere Clusters Fail Image Resolution in Non-Airgap Environments

In Palette or VerteX non-airgap installations with versions 4.2.13 to 4.5.22, MAAS and VMware vSphere clusters may fail to provision due to image resolution errors. These environments have incorrectly configured default image endpoints. To resolve this issue, you must manually set these endpoints.

Debug Steps

  1. Open a terminal with connectivity to your self-hosted environment.

  2. Execute the following command to save the base URL of your Palette instance API to the BASE_URL environment variable. Add your correct URL in place of REPLACE_ME.

    export BASE_URL="REPLACE_ME"
  3. Use the following command to log in to the Palette System API by using the /v1/auth/syslogin endpoint. Ensure you replace the credentials below with your system console credentials.

    curl --location "$BASE_URL/v1/auth/syslogin" \
    --header 'Content-Type: application/json' \
    --data '{
      "password": "**********",
      "username": "**********"
    }'

    The output displays the authorization token.

    {
    "Authorization": "**********.",
    "IsPasswordReset": true
    }
  4. Copy the authorization token to your clipboard and assign it to an environment variable. Replace the placeholder below with the value from the output. Alternatively, see the sketch after these steps for a way to capture the token in one command.

    export TOKEN=**********
  5. Execute the following command to set the MAAS image endpoint to https://maasgoldenimage.s3.amazonaws.com. Replace the caCert value below with the Certificate Authority (CA) certificate for your self-hosted environment.

    curl --request PUT "$BASE_URL/v1/system/config/maas/image" \
    --header "Authorization: $TOKEN" \
    --header 'Content-Type: application/json' \
    --data '{
      "spec": {
        "imagesHostEndpoint": "https://maasgoldenimage.s3.amazonaws.com",
        "insecureSkipVerify": false,
        "caCert": "**********"
      }
    }'
  6. Execute the following command to set the VMware vSphere image endpoint to https://vmwaregoldenimage.s3.amazonaws.com. Replace the caCert value below with the Certificate Authority (CA) certificate for your self-hosted environment.

    curl --request PUT "$BASE_URL/v1/system/config/vsphere/image" \
    --header "Authorization: $TOKEN" \
    --header 'Content-Type: application/json' \
    --data '{
      "spec": {
        "imagesHostEndpoint": "https://vmwaregoldenimage.s3.amazonaws.com",
        "insecureSkipVerify": false,
        "caCert": "**********"
      }
    }'
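
As an optional alternative to copying the token manually in step 4, you can capture it directly from the login response. This sketch assumes the jq utility is installed and reuses the same /v1/auth/syslogin endpoint shown in step 3.

    export TOKEN=$(curl --silent --location "$BASE_URL/v1/auth/syslogin" \
      --header 'Content-Type: application/json' \
      --data '{ "password": "**********", "username": "**********" }' \
      | jq --raw-output '.Authorization')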

MAAS and VMware vSphere clusters will now be successfully provisioned on your self-hosted Palette environment.