Why does my worker node show a NetworkUnavailable error?
Applies to: Virtual Private Cloud, Classic infrastructure, Satellite
When you update your master or worker nodes, your worker nodes enter a Node network unavailable state.
Your worker nodes might enter a NetworkUnavailable or Node network unavailable state whenever the calico-node pod has been shut down. This might happen during a Calico patch update, but shouldn't impact your application availability.
When Calico is updated, the node.kubernetes.io/network-unavailable:NoSchedule taint is added to your worker node and the Node network unavailable condition becomes True. Both of these conditions are cleared
when Calico restarts, which typically takes only a few seconds.
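To observe this yourself, you can inspect the node's conditions and taints with kubectl. The node name below is a placeholder taken from the example alerts in this topic; substitute one of your own worker node names.

```shell
# Check whether the NetworkUnavailable condition is True on the node
# (replace 10.184.XXX.XXX with your worker node name).
kubectl get node 10.184.XXX.XXX \
  -o jsonpath='{.status.conditions[?(@.type=="NetworkUnavailable")].status}'

# List the taints on the node. During a Calico restart, you might see
# node.kubernetes.io/network-unavailable:NoSchedule listed here.
kubectl describe node 10.184.XXX.XXX | grep -A 3 'Taints'
```

If the condition returns False and no network-unavailable taint is listed, Calico has already restarted and the node is back to normal.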
While this happens, you might see an error message similar to the following.
[Kubernetes] Node network unavailable is Triggered on kubernetes.node.name = 10.184.XXX.XXX
[Kubernetes] Node network unavailable is Triggered on kubernetes.node.name = 10.184.XXX.XXX
[Kubernetes] Node network unavailable is Triggered on kubernetes.node.name = 10.184.XXX.XXX
Sometimes, the restart might take longer. In nearly all cases, the restart is fast enough to avoid any worker node network issues. However, if a Calico restart is delayed, network interruptions can occur. For these cases, the node network unavailable taint and condition are designed to keep new apps from being scheduled on the node until Calico and the node recover. Calico updates are rolled out in a controlled manner to minimize overall application impact if a node problem occurs.
Monitor the Node network unavailable state with IBM Cloud Monitoring
By using monitoring services such as IBM Cloud Monitoring, you can configure alerts for when a worker node goes into a Node network unavailable state, and count each time this happens. You can also configure thresholds and tune your alerts to allow for worker nodes that are briefly in a Node network unavailable state during routine Calico patches.
When you set up IBM Cloud Monitoring alerts, take the following scenarios into consideration.
- A Node network unavailable alert might indicate a real problem when a calico-node pod fails to achieve a Running state and its container restart count continues to increase.
- A worker node remains in a Node network unavailable state for a long time.
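To investigate either scenario, you can check the calico-node pods directly. The namespace and label selector below are the defaults for Calico daemon set pods; adjust them if your cluster version runs Calico in a different namespace, such as kube-system.

```shell
# List the calico-node pods with their status and container restart counts.
kubectl get pods -n calico-system -l k8s-app=calico-node -o wide

# Inspect a pod that is not in a Running state
# (the pod name here is a placeholder; copy it from the output above).
kubectl describe pod -n calico-system calico-node-xxxxx
```

A restart count that keeps climbing, or a pod stuck outside the Running state, points to a problem beyond a routine Calico patch.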
After a worker update or replace, the calico-node pod sometimes gets stuck in a state where it is unable to start on a Red Hat OpenShift VPC cluster. This is not an issue on IKS or Classic clusters. This can occur when you have the sysdig-admission-controller-webhook installed and you update or replace a worker. This happens because:
- The VPN client pod gets moved to the new worker as it is starting.
- calico-node on the new worker starts up, but gets stuck because it makes an apiserver call that times out after 2 seconds.
- The apiserver call triggers the webhook, which fails because the VPN client pod is still trying to start on the new node. The VPN client pod cannot start successfully because calico-node hasn't started up yet.
In summary, the calico-node pod startup depends on the webhook working; the webhook depends on the VPN client pod; and the VPN client pod depends on calico-node starting up. The system is stuck in a circular dependency.
If you are able to gather logs from a successfully deployed calico-node pod, you might see an error similar to the following.
2022-09-08 07:13:19.719 [WARNING][9] startup/utils.go 228: Failed to set NetworkUnavailable; will retry error=Patch "https://172.21.0.1:443/api/v1/nodes/10.242.64.17/status?timeout=2s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Workarounds for calico-node
You can use one of the following methods to work around the issue and get the calico-node pod running again.
- Remove the sysdig-admission-controller-webhook from the system.
- Modify the sysdig-admission-controller-webhook and change the timeout to be less than 2 seconds.
- Modify the sysdig-admission-controller-webhook to scope it to the appropriate namespaces, and avoid system-critical namespaces such as calico-system.
- Cordon the new node, but don't drain it. Delete the VPN pod and wait for it to start on another worker. Then, uncordon the node.
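As a sketch, the last workaround might look like the following. The node name, VPN pod name, and namespace are placeholders; look up the actual values in your cluster before you run the commands.

```shell
# Mark the new worker unschedulable, but leave its existing pods alone.
kubectl cordon 10.184.XXX.XXX

# Find the VPN client pod, then delete it so that it is rescheduled
# on another worker (adjust the namespace and pod name for your cluster).
kubectl get pods --all-namespaces | grep vpn
kubectl delete pod -n kube-system vpn-xxxxxxxxxx-xxxxx

# After the VPN pod is Running on another worker and calico-node has
# started on the new node, allow scheduling on the node again.
kubectl uncordon 10.184.XXX.XXX
```

Cordoning instead of draining is the key step: it prevents the VPN client pod from landing back on the new node, which breaks the circular dependency without evicting workloads that are already running.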
After performing any of the previous workarounds, the calico-node pod can start successfully.