[Troubleshooting notes] In an AKS cluster, pods in a separate subnet cannot connect to the Internet
Issue:
In the AKS cluster, pods in a separate subnet (172.16.3.0/24) cannot connect to the Internet.
Environment:
cloudprovider: AKS
version: 1.17.11
NetworkPolicy: Calico
NetworkPlugin: Azure
nodepools:
- agentpool | 172.16.2.0/24
- test | 172.16.3.0/24
Troubleshooting steps:
- Log in to the node where the pod is running and run a test such as ‘curl -v 216.58.193.78’, which connects to google.com. The request succeeds.
- Deploy a pod such as nginx in the test nodepool.
- Run the same test from inside the pod. It fails with a timeout.
- Since connectivity at the node level is good, I suspect the packet is dropped somewhere inside the OS. I capture a network trace on the node: the packet from the pod reaches the node’s eth0 successfully but does not appear to be sent out.
- In this cluster’s network, when a pod sends a request to an address outside the cluster CIDR, the packet first reaches the node and is then SNATed by iptables. So it is possible that the packet is dropped by iptables.
- To see how iptables handles the packet, I add the rule below to trace the request.
iptables -t raw -A PREROUTING --dest 216.58.193.78 -j TRACE
- Then I rerun the curl command in the pod.
- The trace is logged in kern.log. Filtering on the source and destination addresses makes the relevant entries easy to find. Each pass of a packet through iptables starts with a “raw:PREROUTING” entry, so the entry just before the next “raw:PREROUTING” is the last rule the packet hit. In kern.log, that entry is “nat:POSTROUTING:policy:4”.
- In a TRACE entry, “policy” means the packet reached the end of the chain and the chain’s default policy was applied, so there seems to be no matching SNAT rule for the packet in the nat “POSTROUTING” chain.
- Usually a request from a pod is SNATed by a rule under ‘cali-POSTROUTING’, but in this case nothing matches, which might be the root cause.
- The ‘cali-POSTROUTING’ chain contains several jump rules. In the normal scenario the packet matches the last rule and gets SNATed. The IP set referenced by that rule contains only ‘172.16.2.0/24’, which is why the packet doesn’t match this rule.
- As this IP set is populated by Calico, I check the IPPools in the cluster. It turns out that no IPPool is configured for the 2nd nodepool. After creating another IPPool with the CIDR of the 2nd nodepool, SNAT works and the pod can connect to the Internet.
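The mismatch found in the last two steps can be illustrated with a small shell sketch. This is not the command sequence used during the investigation. ip_to_int, in_cidr and render_ippool are hypothetical helper names, the pod address 172.16.3.5 and the pool name are made up for illustration, and the manifest fields (projectcalico.org/v3 with natOutgoing: true) are an assumption about what a minimal pool could look like, so compare them with the existing pool before applying anything.

```shell
#!/bin/sh
# Sketch only: helper names, sample addresses and manifest fields are assumptions.

# ip_to_int IP: print a dotted-quad IPv4 address as an integer.
ip_to_int() {
  _old_ifs=$IFS
  IFS=.
  set -- $1            # split the address on dots into $1..$4
  IFS=$_old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# in_cidr IP CIDR: return 0 if IP lies inside CIDR, non-zero otherwise.
in_cidr() {
  _net=${2%/*}         # network part, e.g. 172.16.2.0
  _bits=${2#*/}        # prefix length, e.g. 24
  _mask=$(( (4294967295 << (32 - $_bits)) & 4294967295 ))
  _ip_int=$(ip_to_int "$1")
  _net_int=$(ip_to_int "$_net")
  [ $(( $_ip_int & $_mask )) -eq $(( $_net_int & $_mask )) ]
}

# render_ippool NAME CIDR: print an IPPool manifest on stdout.
# The field set is an assumed minimum, not the manifest from the original fix.
render_ippool() {
  cat <<EOF
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: $1
spec:
  cidr: $2
  natOutgoing: true
EOF
}
```

With these helpers, `in_cidr 172.16.3.5 172.16.2.0/24` returns non-zero, which mirrors why a pod in the test nodepool is not covered by an IP set that holds only 172.16.2.0/24, while `in_cidr 172.16.3.5 172.16.3.0/24` returns zero. Assuming calicoctl is configured for the cluster, `render_ippool test-nodepool 172.16.3.0/24 | calicoctl apply -f -` would submit the rendered pool.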
Tips:
- Enable iptables tracing with the rules below
iptables -t raw -A PREROUTING --dest <destination IP> -j TRACE
iptables -t raw -A OUTPUT --dest <destination IP> -j TRACE
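Trace rules stay in place and keep logging until they are deleted, so it helps to add and remove them as a pair. The following is a minimal sketch, assuming the node writes TRACE output to /var/log/kern.log as in the notes above. run, trace_rules and show_trace are hypothetical helper names, and DRY_RUN is a made-up switch that prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch only: helper names and the DRY_RUN switch are assumptions.

# run CMD...: execute the command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# trace_rules ACTION DEST: ACTION is -A to add the rules or -D to delete them.
trace_rules() {
  run iptables -t raw "$1" PREROUTING --dest "$2" -j TRACE
  run iptables -t raw "$1" OUTPUT --dest "$2" -j TRACE
}

# show_trace DEST [LOGFILE]: print TRACE entries for the given destination.
# The trailing space after the address avoids matching longer addresses
# that merely start with the same digits.
show_trace() {
  grep -F 'TRACE:' "${2:-/var/log/kern.log}" | grep -F "DST=$1 "
}
```

A typical session, run as root on the node, would be `trace_rules -A 216.58.193.78`, then the curl test from the pod, then `show_trace 216.58.193.78`, and finally `trace_rules -D 216.58.193.78` to remove the rules again.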