8 Reasons why Azure Container Instances Suck

tl;dr: ACI is a very immature and unstable product, and I doubt that many customers are using it in production. Just stay away from it.

Azure Container Instances

Microsoft Azure is one of the three major global cloud providers, competing against AWS and Google Cloud. It is often convenient to use Azure for companies that are already in the Microsoft eco-system, maybe using Office 365, Active Directory, Dynamics CRM, Navision, or Power BI. Needless to say, Microsoft products work well with each other, and in that case the path of least resistance would be to stick with Azure, and use their products.

For anything else, Azure has always somehow managed to disappoint. It seems they invest much more in advertising their half-baked products than in maintaining them. Case in point: to this date, the out-of-the-box Ubuntu image available to install on a basic vanilla VM on Azure is still 18.04 LTS, when Ubuntu 20.04 LTS has been out for almost a year, and has been available on AWS, Google Cloud, and also on the cheaper cloud providers like Digital Ocean and Hetzner Cloud for a long while now. Even worse, while they falsely advertise Azure as the “best destination to run PostgreSQL”, they are still running on PostgreSQL 11, while AWS RDS and Google Cloud are both on PostgreSQL 13. Azure does offer a newer “Flexible Server” option, still in preview, which offers PostgreSQL 12. However, it doesn’t have all the features available, and is missing extensions such as TimescaleDB. This is ironic, considering how they supposedly partnered with Timescale and made a whole fuss about it, but then left it stuck on version 1.3.2 (there were 12 updates on the 1.x release up to 1.7.5 since then, and a newer 2.x version released in December 2020).

But I digress (although I included the above for a reason which will become clearer below), and you are here because you are probably interested in Azure Container Instances (ACI). Maybe you are considering using them in development or even in production, or maybe you are already facing the same issues and Google pointed you to this article.

What are Azure Container Instances?

Azure Container Instances (ACI) are Microsoft Azure’s solution to run a docker container without having to provision a Virtual Machine to host it. It is simpler and cheaper than Azure Kubernetes Service (AKS) to operate, which makes it attractive if you want to start small and simple.

Once you create your docker image and push it to your Azure Container Registry (ACR), you can deploy a container instance with a CLI command like:

az container create \
  --resource-group $AZURE_RESOURCE_GROUP \
  --name $ACI_CONTAINER_NAME \
  --image $IMAGE_NAME \
  --registry-username $ACR_USERNAME \
  --registry-password $ACR_PASSWORD \
  --restart-policy Never \
  --environment-variables $ENV_VARS \
  --secure-environment-variables $SECURE_ENV_VARS

This command creates a new Container Instance of the Docker image $IMAGE_NAME and provisions it in the specified $AZURE_RESOURCE_GROUP with the specified $ACI_CONTAINER_NAME. You can also pass runtime environment variables with --environment-variables and --secure-environment-variables (the latter do not get logged or shown anywhere, so they are ideal for passwords and other secrets you want the container instance to use).

This is all fine and dandy, and relatively easy to integrate with modern CI/CD tools like Bitbucket or GitLab. The simplicity of it all is very attractive, and it makes it really easy to deploy a cloud native application. But this is where the fun stops.

ACIs can be publicly accessible or private inside a VNET, but not both.

For obvious security reasons, private resources should not be accessible publicly over the internet. The common approach to do this on Azure is to create a VNET. If you decide to install your own PostgreSQL database on a VM with a managed disk attached (maybe because you want to use the latest PostgreSQL 13 instead of the outdated one offered as a managed database by Azure), the portal itself will suggest putting it inside a VNET so that you do not expose its port publicly. There could be various other services you would want to run privately, so a VNET makes sense.

At some point, one of the services inside the VNET needs to be exposed publicly. Maybe it is the web application exposing the front-end user interface, or maybe it is a REST API. If it is running on a normal VM, you just add a public IP to the VM. It even gets a Fully Qualified Domain Name (FQDN) on one of Azure’s domains, so you can avoid dealing with IP addresses directly.

Of course you would assume that such basic functionality is available for ACI. Well it is… sort of… because you can have an ACI with a public IP and FQDN. But the moment the ACI is public, it can’t be inside a VNET. The moment it is inside a VNET, you can’t give it a DNS name to expose it publicly. There has been a request to remove this restriction for over 2 years.
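To make the either/or concrete, here is a minimal sketch of the two deployment styles as shell functions. The function names, the positional arguments and port 80 are my own placeholders, and the sketch reflects the restriction as I hit it at the time of writing, so treat it as an illustration rather than a reference.

```shell
#!/bin/sh
# Sketch only: function names, argument order and port 80 are illustrative.

# Public container group: gets a public IP and an Azure-assigned FQDN,
# but cannot be placed inside a VNET.
# Usage: create_public_aci <resource-group> <name> <image> <dns-label>
create_public_aci() {
    az container create \
        --resource-group "$1" --name "$2" --image "$3" \
        --ip-address Public \
        --dns-name-label "$4" \
        --ports 80
}

# Private container group: joins a subnet of an existing VNET,
# but cannot be given a DNS name label.
# Usage: create_private_aci <resource-group> <name> <image> <vnet> <subnet>
create_private_aci() {
    az container create \
        --resource-group "$1" --name "$2" --image "$3" \
        --vnet "$4" --subnet "$5" \
        --ports 80
}
```

You pick one function or the other per container group; there is no combination of these flags that gave me both a DNS label and a VNET.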

One possible solution is to use an Application Gateway, but as you will see below, it comes with its own issues.

ACI Resource IDs are not available to Backend Pools

In order for the Application Gateway, exposing the public IP address, to route requests to an ACI inside a VNET, you need to specify a Backend Pool that specifies the private resources that should serve any requests hitting the gateway. If you were running a normal VM, you could just specify its resource ID and that’s about it.

When using container instances, their resource ID is not available to Backend Pools. You can only use the IP address directly. So what you have to do is find the IP address of your container and add it to the Application Gateway’s backend pool. This works fine, as long as the IP address of the container instance does not change. Again, this issue was logged 2 years ago and is still waiting for a resolution, without any response.

Even the Azure documentation acknowledges this limitation.

The IP address of an ACI cannot be fixed

A workaround to this would have been possible if only there was an option to specify the IP address of an ACI inside the VNET. After all, the IP addresses are in your own private range, typically in the 10.0.x.x range. As an experiment I put my ACI in its own subnet of the smallest range possible (3 IP addresses), hoping that it sticks to the same IP address, but it didn’t.

So when your ACI gets restarted for some reason (more on this below), it can come back with a different IP address. Your Application Gateway backend pool then points at a stale address, and you need to update it manually.
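The manual update can at least be scripted. The following is a rough sketch, assuming the gateway and its backend pool already exist; the function name and all four arguments are placeholders of mine. It reads the container group's current IP address and overwrites the pool's targets with it.

```shell
#!/bin/sh
# Sketch only: re-point an Application Gateway backend pool at the ACI's
# current private IP. All names passed in are placeholders.
# Usage: sync_backend_pool <resource-group> <aci-name> <gateway-name> <pool-name>
sync_backend_pool() {
    local resource_group="$1"
    local aci_name="$2"
    local gateway_name="$3"
    local pool_name="$4"
    local ip

    # Ask Azure for the IP address currently assigned to the container group.
    ip="$(az container show \
        --resource-group "$resource_group" \
        --name "$aci_name" \
        --query ipAddress.ip \
        --output tsv)" || return 1

    if [ -z "$ip" ]; then
        echo "No IP address found for $aci_name" >&2
        return 1
    fi

    # Replace the pool's targets with that single IP address.
    az network application-gateway address-pool update \
        --resource-group "$resource_group" \
        --gateway-name "$gateway_name" \
        --name "$pool_name" \
        --servers "$ip" \
        --output none || return 1

    # Print the IP so the caller can log it.
    echo "$ip"
}
```

This would have to run after every deployment or restart, so it is a band-aid: between the restart and the next run of the script, the gateway is still routing to the old address.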

The only solution I found so far is to somehow use Azure Init containers. Admittedly I have not explored this in enough depth yet, because it has already been a painful few weeks to find and try to get around the limitations listed here, and adding more complexity is starting to defeat the whole purpose of using simple ACIs and not going for a more sophisticated AKS setup.

ACI does not work well with Private DNS Zones

When deploying your ACI inside a VNET you probably want it to use the resources inside the VNET (that’s the whole point). Of course, you wouldn’t want to talk to the other resources directly through their IP address, because their IP address might change, especially if they are ACIs themselves (see above). When deploying normal VMs, these automatically get registered to your Private DNS zone, and you can communicate with the other resources using their name. Again, you would think that such basic functionality would work seamlessly with a container instance, but it does not.

ACIs do not register themselves with their resource ID to the Private DNS Zone. They will not be reachable by name. I found mixed information about this online, but from my experience this does not work consistently. Either it does not work at all, or what comes up is some auto-generated name like vm000001, which of course has nothing to do with the real resource name.

The even more frustrating part is that even the resources inside the VNET, which are properly auto-registered with the Private DNS Zone, like a normal VM, are not reachable from the ACI. So two normal VMs inside the VNET can reach each other by their resource name, but a container instance inside the same VNET can’t. This is probably somehow solvable with a more advanced DNS configuration, probably involving operating your own DNS server instead of the Azure one, but the prospects of a clean solution for this are not too promising either.

Inconsistent ACI Deployment Behaviour

The usual practice of updating docker images is to set the :latest tag to the image you want your containers to use. This way, when the container pulls the :latest image, it will get the updated code.

According to the documentation and other online sources, running the CLI command az container create is enough to make a container instance that is already running pull the :latest image again from the Azure Container Registry and restart. This usually works fine, however I noticed several instances where the container was not restarted at all, and I had to manually restart it. The result from az container create (running from a Bitbucket pipeline) was clearly successful:

Successful result of `az container create`

But the container instance was not restarted. So, to use this in a proper CI/CD setup, one will probably need to restart the container to make sure it picks up the new version of the image.
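A pipeline step along these lines is one way to do that. It is a sketch under my own assumptions: the function name is made up, and any extra flags (registry credentials, environment variables and so on) are simply passed through to az container create.

```shell
#!/bin/sh
# Sketch only: deploy, then force a restart so the new image is picked up.
# Usage: deploy_and_restart <resource-group> <aci-name> <image> [extra create flags...]
deploy_and_restart() {
    local resource_group="$1"
    local aci_name="$2"
    local image="$3"
    shift 3   # remaining arguments go straight to `az container create`

    az container create \
        --resource-group "$resource_group" \
        --name "$aci_name" \
        --image "$image" \
        "$@" \
        --output none || return 1

    # Do not rely on create having restarted the container: restart it explicitly.
    az container restart \
        --resource-group "$resource_group" \
        --name "$aci_name"
}
```

The downside is that on the occasions where az container create did restart the container, it now gets restarted twice. I would rather pay for that than silently keep running an old image.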

Missing ACI Events

This has more to do with managing your ACIs once you get them up and running. Azure Portal provides you information about your container, such as the memory usage, CPU usage, bandwidth etc., together with the stdout logs from your container and the events related to your container (when it was started, killed, etc.)

However, these events often are not in sync with the container’s real status. Whenever the container is terminated you should see a Killed event, while when the container is started you should see a Started event.

In this case, the container was terminated on 11th March at around 1am:

Container Instance Terminated on Thursday 11th March 2021 at around 1am

But the Events show no information about this (and as you can imagine no reason why it was terminated).

Container Events in descending timestamp order (latest first)

When the container gets restarted, the logs of the previous instance disappear, so you have no idea why your application died. For this reason, I set the --restart-policy to Never temporarily, until I understand what is going on. In the meantime Azure Support suggested adding an Azure Log Analytics service (adding more costs) so that logs of previous instances are still visible.
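Another cheap precaution is to copy out whatever Azure does know before anything restarts the container. The sketch below is my own helper, with a made-up function name and output file names. It saves the current state, the events and the application logs of the first container in the group to a directory.

```shell
#!/bin/sh
# Sketch only: save state, events and logs of a container group to disk.
# Usage: snapshot_aci <resource-group> <aci-name> <output-dir>
snapshot_aci() {
    local resource_group="$1"
    local aci_name="$2"
    local out_dir="$3"

    mkdir -p "$out_dir" || return 1

    # Current state of the first container (state, exit code, timestamps).
    az container show \
        --resource-group "$resource_group" \
        --name "$aci_name" \
        --query "containers[0].instanceView.currentState" \
        --output json > "$out_dir/state.json" || return 1

    # Events recorded for the first container (Pulling, Started, Killed, ...).
    az container show \
        --resource-group "$resource_group" \
        --name "$aci_name" \
        --query "containers[0].instanceView.events" \
        --output json > "$out_dir/events.json" || return 1

    # Application output of the current instance.
    az container logs \
        --resource-group "$resource_group" \
        --name "$aci_name" > "$out_dir/app.log"
}
```

Run from a scheduled job or at the start of a deployment, this at least keeps the last known state around. It cannot recover events that Azure never recorded in the first place.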

ACIs lack proper logging.

The only logs you have available are those of the application inside the container. There is nothing that tells you anything about the container itself. This information is not even available to support. When investigating why my container was getting killed for no reason (more on this below), Azure Support had little more information than I had.

After following their suggestion and adding Log Analytics, I was hoping to get more information, but still, I only saw my application’s logs. It was just happily waiting for connections (it is a dummy REST API doing nothing) until it was just obliterated from existence without giving any reason.

ACIs are unstable in some regions.

There are various reasons why choosing the right region is important, not just for latency and proximity to your users, but also because of data residency requirements, especially regarding Data Privacy regulations like GDPR.

When I deployed my ACI in West Europe I was observing some weird behaviour. My container would run happily for a few hours, and then get killed. Sometimes this Killed event would appear in the Container Events, and sometimes (like in the above case) it would not, and I would only determine when it happened from the performance metrics graphs. I couldn’t know why it happened, and of course the first response from support was that most probably my application was the culprit.

So I created a dummy application, a simple REST API, and the instance still got terminated, sometimes after a few hours, sometimes after a couple of days. Support suggested that I deploy on the East US region, and lo and behold, the instance has been running without fail for over 15 days. In the meantime Support have been chasing the Product team for weeks, trying to understand what is going on. It seems that other people are experiencing the same problem too. It seems that it doesn’t make a difference what the application actually is either. In this case it was a .NET application that experienced the same problem.

The last information I have from Azure Support is that there is some problem on their side in that region, and the reason is “pod shim crashing”, which seems to point to some underlying Kubernetes issue. The fact that this has been going for so long without resolution, and that other customers were affected too, doesn’t give me much confidence that ACI is actually production ready.

Conclusion

It comes as no surprise that in my opinion, ACI is not ready for production use. Your experience might be different, but in my case the experience has been just horrible. I wasted so much time searching documentation and chasing Support for answers, and I still have not resolved all the issues.

If you need to stick with Azure, and want to have a cloud native solution for your Docker containers that is production-ready, you will probably have to look into a more complex setup with AKS. If you don’t, you’re probably better off using AWS Fargate or Google Kubernetes Engine (formerly Google Container Engine).

AI Technologist, Software Architect, Coder