Rizaldi's Personal Website

Lessons learned making things work between AKS and Azure WAF

Over the last few months, for work, I have been experimenting with Azure Kubernetes Service (AKS) and with securing the services hosted on it using Azure WAF. In short, I need to host services in AKS, and I also need Azure WAF as a security layer in front of those services. This post is about the things I learned during that experiment.

Context

From Azure WAF docs:

Web Application Firewall (WAF) provides centralized protection of your web applications from common exploits and vulnerabilities. Web applications are increasingly targeted by malicious attacks that exploit commonly known vulnerabilities. SQL injection and cross-site scripting are among the most common attacks.

Azure WAF is more or less the same as AWS WAF or other WAF products out there.

From the same docs page, I found out that Azure WAF can be deployed with Azure Application Gateway, Azure Front Door, and Azure Content Delivery Network from Microsoft. Azure Front Door is itself a type of CDN, so I focused on the first option, Azure Application Gateway.

The problem with Application Gateway Ingress Controller (AGIC)

Azure provides an Ingress controller, the Application Gateway Ingress Controller (AGIC), for services running inside a Kubernetes cluster. It uses Azure Application Gateway as the gateway into the cluster.

The way it works is that the cluster admin creates an Ingress object with azure-application-gateway as .spec.ingressClassName. AGIC, installed inside the cluster, detects this and reads from the Ingress specification which services should be accessible through that Ingress. The controller then syncs the App Gateway's rules, listeners, and backends to match the Ingress specification.
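As a rough illustration of the Ingress object described above, here is a minimal Python sketch that builds such a manifest as a dict and prints it as JSON. The names my-app and my-app.example.com and the port 80 are placeholders I made up, not values from my cluster; only the ingressClassName comes from the setup described here.

```python
import json


def build_agic_ingress(name, host, service_name, service_port):
    """Return an Ingress manifest (as a dict) meant to be picked up by AGIC."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name},
        "spec": {
            # This class name is what tells AGIC to handle the Ingress.
            "ingressClassName": "azure-application-gateway",
            "rules": [
                {
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "path": "/",
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {
                                        "name": service_name,
                                        "port": {"number": service_port},
                                    }
                                },
                            }
                        ]
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical names and port, for illustration only.
    manifest = build_agic_ingress("my-app", "my-app.example.com", "my-app", 80)
    print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output could be piped into kubectl apply -f - to try it out.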

At first this seemed to work just as I expected. The problems arise when there is a new deployment. The Application Gateway backends record the IP addresses of the pods behind a service. When there is a new deployment, those IP addresses change, since the pods from the old deployment are replaced by new ones. From my observation, the changes are not applied as seamlessly as they should be.

To test this, I ran curl against my service's health check in a loop, with a 1-second wait. Then I created a new deployment and applied it. In my experiment there was a period of about 15 seconds where curl returned 502. While running curl, I also monitored the Endpoints object of the Service object. The changes there were applied almost instantly. I suspect the problem arose because, when AGIC synced the changes to the Application Gateway's backend pools, they did not take effect instantly.
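The same kind of probe can be sketched in Python instead of a curl loop. This is not the exact script I used; it is a minimal sketch that polls a URL once per second and then computes the longest stretch of non-2xx responses. The health check URL in the trailing comment is a made-up placeholder.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=1.0):
    """Return the HTTP status code for url, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. a 502 answered by the gateway
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure, ...


def watch(url, duration=60.0, interval=1.0):
    """Probe url every `interval` seconds; return (timestamp, status) samples."""
    samples = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        samples.append((time.monotonic(), probe(url)))
        time.sleep(interval)
    return samples


def longest_outage(samples):
    """Longest time in seconds from a first failing sample to the next 2xx.

    If the run of failures never recovers, it is measured up to the last
    sample instead.
    """
    longest = 0.0
    start = None
    for ts, status in samples:
        ok = 200 <= status < 300
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            longest = max(longest, ts - start)
            start = None
    if start is not None:
        longest = max(longest, samples[-1][0] - start)
    return longest


# Example usage (hypothetical URL), run while applying a new deployment:
#   samples = watch("http://my-app.example.com/healthz", duration=120)
#   print(longest_outage(samples))
```

Because the samples are taken roughly one second apart, the result is only accurate to about a second, which is enough to tell a blip from a multi-second outage.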

Since I was stuck with App Gateway, I decided to experiment with Azure Front Door. After reading the documentation, I understood that the way to use Azure Front Door with AKS is to use my cluster's load balancer as one of the origins of an Azure Front Door instance.

If you google AKS Azure Front Door, the first link is this documentation in Azure Architecture Center. It's quite complicated. After reading it a couple of times, and understanding only a fraction of it, I found that there is something called a Private Link service in Azure.

From Azure docs:

Azure Private Link enables you to access Azure PaaS Services (for example, Azure Storage and SQL Database) and Azure hosted customer-owned/partner services over a private endpoint in your virtual network.

Maybe I can use Private Link to reach my cluster's load balancer and then make the load balancer accessible only through the Private Link? That would make it more secure, right? Then I found out that Private Link for Azure Front Door is not yet available in the Southeast Asia region.

App Gateway and NodePort-type Service object

So I planned to use Contour as my cluster's Ingress controller. By default Contour runs as a Service of type LoadBalancer, but it turns out I can opt to run it as type NodePort. By doing this, I assumed I could create a backend pool of type Virtual Machine Scale Set in my Application Gateway. That means I can define the backend as the cluster nodes, with the port taken from the definition of the Service used by Contour.

When I tried it this way, it worked for a while. But every so often, I think once every week or two, AKS just decides to replace my nodes with new ones. It completely replaces all the nodes, usually in the morning. With this setup, when that happened the backend pool somehow forgot where it pointed to. I checked the names of the old VM scale set and the new one, and they were the same. Granted, the nodes inside the scale set were different, but the name was the same.

The workaround I can think of is to somehow detect when a node replacement happens and, when it does, make sure the backend in the Application Gateway points to the VM scale set again. But there must be a better way, I thought to myself.

IP Address Filtering and Front Door identifier

It turns out there is a whole page dedicated to securing traffic to Azure Front Door origins. For an AKS load balancer there are two options: use the X-Azure-FDID request header to filter requests and/or use IP address filtering.

The former is easier to implement with Contour. Using Contour's HTTPProxy I can use a HeaderMatchCondition to check whether an incoming request has the expected value for the X-Azure-FDID header. If it doesn't, the Ingress controller immediately returns 403. If it does, the request is passed on to the expected service.
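To make the header check concrete, here is a minimal Python sketch that builds an HTTPProxy manifest with a header match condition on X-Azure-FDID. The proxy name, FQDN, service name, port, and the all-zero Front Door ID are placeholders I made up. Note that in this sketch a request without the matching header simply does not match the route; what the client sees in that case depends on the rest of the Contour configuration.

```python
import json


def build_fdid_httpproxy(name, fqdn, front_door_id, service_name, service_port):
    """Return an HTTPProxy manifest that routes only requests carrying
    the expected X-Azure-FDID header value."""
    return {
        "apiVersion": "projectcontour.io/v1",
        "kind": "HTTPProxy",
        "metadata": {"name": name},
        "spec": {
            "virtualhost": {"fqdn": fqdn},
            "routes": [
                {
                    # All conditions must hold for the route to match.
                    "conditions": [
                        {"prefix": "/"},
                        {
                            "header": {
                                "name": "X-Azure-FDID",
                                "exact": front_door_id,
                            }
                        },
                    ],
                    "services": [{"name": service_name, "port": service_port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    proxy = build_fdid_httpproxy(
        "my-app",
        "my-app.example.com",
        "00000000-0000-0000-0000-000000000000",
        "my-app",
        80,
    )
    print(json.dumps(proxy, indent=2))
```

The Front Door ID is the identifier of my own Front Door instance, so other Front Door instances that happen to point at the same load balancer would not pass this check.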

The second option is a bit more challenging. The IP addresses of the Azure Front Door edge nodes can change over time. Azure provides a download link for a file containing all Azure IP ranges and service tags. Using the data in this file, I can regularly update the IP filtering in Contour's HTTPProxy or Ingress objects to filter out unwanted requests.
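As a sketch of the first half of that job, the snippet below pulls the Front Door address prefixes out of an already downloaded service tags file and checks whether a client IP falls inside them. It assumes the file's usual layout (a top-level values list whose entries have a name and properties.addressPrefixes) and the AzureFrontDoor.Backend tag name; the file path and the sample prefixes in the comment are made up. Turning the prefixes into Contour configuration is left out here.

```python
import ipaddress
import json


def load_service_tags(path):
    """Load the downloaded Azure IP ranges and service tags JSON file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def front_door_prefixes(service_tags, tag_name="AzureFrontDoor.Backend"):
    """Return the address prefixes of the given service tag as network objects."""
    for entry in service_tags.get("values", []):
        if entry.get("name") == tag_name:
            prefixes = entry["properties"]["addressPrefixes"]
            return [ipaddress.ip_network(p, strict=False) for p in prefixes]
    return []


def is_from_front_door(client_ip, networks):
    """True if client_ip lies inside any of the given networks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks if net.version == addr.version)


# Example usage with a hypothetical local file name:
#   tags = load_service_tags("ServiceTags_Public.json")
#   nets = front_door_prefixes(tags)
#   print(is_from_front_door("203.0.113.7", nets))
```

Running this on a schedule and comparing the result with the previous run would show when the ranges have changed and the filtering needs an update.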

Conclusion

To use Azure WAF to secure my AKS cluster, I need to put either Azure App Gateway or Azure Front Door in front of it. Using the App Gateway Ingress Controller provided by Azure, I found that the service can be down for a noticeable period (about 15 seconds in my test) when there is a new deployment, because changes to the App Gateway's backend pools are not instantaneous. Using Contour as a NodePort service behind App Gateway is also not workable at this point, since the backend pool lost its target whenever AKS replaced the nodes. Using Front Door, I can filter on the Front Door identifier in the request header and on IP addresses to keep unwanted requests out of my cluster.