Azure Kubernetes Service (AKS)
Arpio replicates Azure Kubernetes Service clusters and their node pools, including cluster configuration, networking, supported add-ons, and the Kubernetes resources running inside the cluster — enabling containerized workloads to be restored in recovery environments.
Managed Clusters
Arpio replicates AKS Managed Clusters (Microsoft.ContainerService/managedClusters) with their network profile, identity configuration, API server access profile, add-on profiles, workload auto-scaler configuration, security profile, ingress profile, and service mesh profile. Both publicly accessible and private clusters are supported, as is API server VNet integration.
The following attributes are translated during replication:
| Attribute | Translation Method |
|---|---|
| API Server Subnet | Reference translated to recovery subnet |
| Disk Encryption Set | Reference translated to recovery disk encryption set |
| Load Balancer Outbound Public IPs | References translated to recovery public IP addresses |
| Load Balancer Outbound IP Prefixes | References translated to recovery public IP prefixes |
| Azure Key Vault KMS | Reference translated to recovery Key Vault (used for etcd encryption) |
| Microsoft Defender Workspace | Reference translated to recovery Log Analytics workspace |
| Istio Service Mesh CA Key Vault | Reference translated to recovery Key Vault |
| Web App Routing DNS Zones | References translated to recovery DNS zones or private DNS zones |
| Pod Identity User-Assigned Identities | References translated to recovery user-assigned identities (deprecated AKS feature) |
| OMS Agent Log Analytics Workspace | Reference translated to recovery Log Analytics workspace |
| AGIC Application Gateway | Reference translated to recovery Application Gateway |
| Windows Admin Password Secret | The Key Vault secret URL supplied via the arpio-config:admin-password-secret tag is translated to the recovery Key Vault secret |
The following resources are automatically selected into recovery points when a Managed Cluster is selected:
- The cluster's agent pools
- The cluster's virtual network and subnets
- Subnet referenced by the private cluster API server (when VNet integration is enabled)
- Disk encryption set referenced for OS/data disk encryption
- Public IP addresses and IP prefixes referenced as load balancer outbound IPs
- Key Vault referenced as the etcd encryption KMS key
- Log Analytics workspaces referenced by the `omsagent` add-on and by Microsoft Defender for Containers
- Application Gateway referenced by the AGIC add-on (or the AKS-managed Application Gateway, if AGIC manages it itself)
- DNS zones and private DNS zones referenced by the Web App Routing ingress profile
- Key Vault referenced as the Istio service mesh certificate authority
- User-assigned identities referenced by the (deprecated) pod identity profile
- Key Vault secret referenced by the `arpio-config:admin-password-secret` tag, if set
- Cluster-scoped Kubernetes resources and all resources in the `kube-system` namespace (see Kubernetes Resources)
- Disks referenced by Persistent Volumes in the cluster's Kubernetes resources
The following AKS add-ons are supported and replicated:
| Add-on | Notes |
|---|---|
| `azureKeyvaultSecretsProvider` | CSI driver for mounting Key Vault secrets into pods |
| `azurepolicy` | Azure Policy / Gatekeeper. Cluster-resident Policy/Gatekeeper objects are excluded from replication (see Kubernetes Resources) |
| `ingressApplicationGateway` | AGIC. When the Application Gateway was created and managed by AKS, it is promoted to a BYO Application Gateway in recovery (see Managed Resources Promoted to BYO) |
| `omsagent` | Container Insights / Log Analytics integration |
In addition, the KEDA workload auto-scaler (workloadAutoScalerProfile.keda) is supported.
If a cluster has any other add-on or workload auto-scaler enabled, Arpio surfaces a recovery-point issue identifying the unsupported feature. The cluster will still be replicated, but the unsupported feature may not function correctly after recovery and may need to be reconfigured manually.
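The check described above can be pictured with a minimal Python sketch. It is an illustration, not Arpio's implementation: the function name `unsupported_features` is hypothetical, the input is assumed to be the `properties` object of a managed cluster as returned by Azure Resource Manager, and add-on names are compared exactly as spelled in the table above.

```python
# Add-ons and workload auto-scalers listed as supported on this page.
SUPPORTED_ADDONS = {
    "azureKeyvaultSecretsProvider",
    "azurepolicy",
    "ingressApplicationGateway",
    "omsagent",
}
SUPPORTED_AUTOSCALERS = {"keda"}


def unsupported_features(cluster_properties: dict) -> list[str]:
    """Return enabled add-ons and auto-scalers that are outside the supported sets."""
    found = []
    # addonProfiles maps an add-on name to a profile with an "enabled" flag.
    for name, profile in (cluster_properties.get("addonProfiles") or {}).items():
        if profile.get("enabled") and name not in SUPPORTED_ADDONS:
            found.append(f"addonProfiles.{name}")
    # workloadAutoScalerProfile maps an auto-scaler name to a similar profile.
    scalers = cluster_properties.get("workloadAutoScalerProfile") or {}
    for name, profile in scalers.items():
        if profile.get("enabled") and name not in SUPPORTED_AUTOSCALERS:
            found.append(f"workloadAutoScalerProfile.{name}")
    return sorted(found)
```

Running a function like this against a source cluster before selecting it gives an early indication of which features may need manual reconfiguration after recovery.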
Windows node pools require a local administrator password that Azure treats as write-only — it cannot be read back from the cluster after creation. To replicate the password into recovery, store it in Azure Key Vault and reference the secret URL on the cluster using the arpio-config:admin-password-secret tag. Arpio translates the secret URL to its recovery-environment Key Vault secret and provides the value to AKS when the recovery cluster is created.
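As an illustration of the tag value, the following Python sketch builds the tag after a basic shape check on the secret URL. The helper name `admin_password_tag` and the vault and secret names in the usage note are hypothetical. The sketch assumes the standard Key Vault secret URL layout (`https://<vault>.vault.azure.net/secrets/<name>`, optionally followed by a version); whether Arpio accepts a versioned URL is not stated here.

```python
from urllib.parse import urlparse

ADMIN_PASSWORD_TAG = "arpio-config:admin-password-secret"


def admin_password_tag(secret_url: str) -> dict[str, str]:
    """Build the cluster tag after a basic shape check on the Key Vault secret URL."""
    parsed = urlparse(secret_url)
    parts = [p for p in parsed.path.split("/") if p]
    # Expected path: /secrets/<name> or /secrets/<name>/<version>
    if parsed.scheme != "https" or len(parts) not in (2, 3) or parts[0] != "secrets":
        raise ValueError(f"not a Key Vault secret URL: {secret_url!r}")
    return {ADMIN_PASSWORD_TAG: secret_url}
```

For example, `admin_password_tag("https://example-vault.vault.azure.net/secrets/aks-windows-admin")` returns a one-entry dictionary that can be merged into the cluster's existing tags.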
Agent Pools
Arpio replicates AKS agent pools (Microsoft.ContainerService/agentPools) along with their parent cluster. Agent pools are also discoverable as standalone resources but are normally selected automatically when their cluster is selected.
The following attributes are translated during replication:
| Attribute | Translation Method |
|---|---|
| VNet Subnet | Reference translated to recovery subnet (or to the cluster's promoted BYO subnet if the source cluster used an AKS-managed VNet) |
| Pod Subnet | Reference translated to recovery subnet |
| Application Security Groups | References translated to recovery application security groups |
| Proximity Placement Group | Reference translated to recovery proximity placement group |
| Capacity Reservation Group | Reference translated to recovery capacity reservation group |
| Dedicated Host Group | Reference translated to recovery host group |
| Node Public IP Prefix | Reference translated to recovery public IP prefix |
| Snapshot Source | Reference translated to recovery agent pool snapshot |
The following resources are automatically selected into recovery points when an Agent Pool is selected:
- The parent Managed Cluster
- Subnets referenced as the node subnet (`vnetSubnetID`) and pod subnet (`podSubnetID`)
- Application security groups attached to the pool's nodes
- Proximity placement group, capacity reservation group, and dedicated host group referenced by the pool
- Public IP prefix used for node public IPs
- Agent pool snapshot referenced as the pool's creation source
The pool's node count and node image version are managed by Azure and the cluster autoscaler; the recovered pool starts at its configured initial size and on the current AKS node image.
Kubernetes Resources
In addition to the AKS control-plane resources above, Arpio replicates the Kubernetes objects running inside the cluster, such as deployments, services, config maps, and secrets, so that they can be restored in the recovered cluster. Internal Kubernetes attributes (image references, Azure resource IDs and ARNs embedded in manifests, secrets, and so on) are translated to point at their recovery-environment counterparts. References from in-cluster resources to Azure resources (for example, secrets pulled from Key Vault by the CSI Secrets Store driver) are detected and the referenced Azure resources are automatically included in the recovery point.
Kubernetes Resource Selection
When a Managed Cluster is selected into a recovery point, Arpio replicates a curated subset of the in-cluster Kubernetes resources rather than the entire cluster.
- By default, cluster-scoped resources and all resources in the `kube-system` namespace are included.
- By namespace, additional namespaces can be selected in the Arpio console. All resources in a selected namespace are included in the recovery point.
- By tag, tag rules can be built against virtual tags on Kubernetes resources to carve out broader or narrower selections. The supported tags are:
| Tag | Selects |
|---|---|
| `k8s:kind` | Resources of a specific Kubernetes kind |
| `k8s:namespace` | Resources in a specific namespace |
| `k8s:label` | Resources carrying a specific Kubernetes label |
| `k8s:name` | Resources matching a specific name |
Dependencies between Kubernetes resources are taken into account when building the recovery point — selecting a resource automatically pulls in the related cluster objects required to run it (for example, a Deployment's referenced ConfigMaps, Secrets, and ServiceAccount).
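The three selection paths (default, by namespace, by tag) can be sketched in Python as follows. This is a simplified model under stated assumptions, not Arpio's rule engine: the functions `virtual_tags` and `is_selected` are hypothetical, tag rules are modelled as plain callables over a dictionary of virtual tags, the representation of `k8s:label` as a label dictionary is an assumption, and dependency resolution is not modelled.

```python
DEFAULT_NAMESPACE = "kube-system"


def virtual_tags(manifest: dict) -> dict:
    """Derive the four virtual tags from a Kubernetes manifest (assumed shape)."""
    meta = manifest.get("metadata", {})
    return {
        "k8s:kind": manifest.get("kind"),
        "k8s:namespace": meta.get("namespace"),  # None for cluster-scoped objects
        "k8s:name": meta.get("name"),
        "k8s:label": dict(meta.get("labels") or {}),
    }


def is_selected(manifest: dict, namespaces=frozenset(), tag_rules=()) -> bool:
    """Apply default, namespace, and tag-rule selection in that order."""
    tags = virtual_tags(manifest)
    namespace = tags["k8s:namespace"]
    # Default: cluster-scoped resources and everything in kube-system.
    if namespace is None or namespace == DEFAULT_NAMESPACE:
        return True
    # By namespace: every resource in an explicitly selected namespace.
    if namespace in namespaces:
        return True
    # By tag: any rule matching the resource's virtual tags.
    return any(rule(tags) for rule in tag_rules)
```

A rule such as `lambda t: t["k8s:label"].get("tier") == "frontend"` would then select labelled resources in namespaces that were not selected as a whole.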
Managed Resources Promoted to Bring-Your-Own (BYO)
When AKS creates infrastructure on the cluster's behalf — its virtual network, or its AGIC Application Gateway — those resources live in the cluster's "infra" resource group (typically MC_*) and are owned by AKS. Arpio recreates these as customer-owned bring-your-own (BYO) resources in the recovery environment so they can be wired into the recovery network and (where applicable) routed through a network sandbox firewall. The cluster is then created against the BYO resources.
If the source cluster was created with an AKS-managed VNet (named aks-vnet-* in the MC_* resource group), Arpio replicates it as a BYO Virtual Network in the recovery environment, located in the cluster's own resource group. The cluster's agent pool vnetSubnetID references and any apiServerAccessProfile.subnetId for VNet-integrated clusters are rewritten to point at the new BYO subnet (named aks-subnet).
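The subnet rewrite can be illustrated with a short Python sketch. It assumes the standard Azure subnet resource ID layout and uses the naming patterns given above (`MC_*` resource group, `aks-vnet-*` VNet, `aks-subnet` subnet). The function names and the parameters `recovery_subscription`, `cluster_resource_group`, and `byo_vnet_name` are hypothetical, and the sketch covers only the promotion case, not the ordinary translation of customer-owned subnets.

```python
import re

# Standard Azure resource ID layout for a subnet.
SUBNET_ID = re.compile(
    r"^/subscriptions/(?P<sub>[^/]+)/resourceGroups/(?P<rg>[^/]+)"
    r"/providers/Microsoft\.Network/virtualNetworks/(?P<vnet>[^/]+)"
    r"/subnets/(?P<subnet>[^/]+)$",
    re.IGNORECASE,
)


def is_aks_managed_subnet(subnet_id: str) -> bool:
    """True when the subnet sits in an aks-vnet-* VNet inside an MC_* resource group."""
    m = SUBNET_ID.match(subnet_id)
    return (
        bool(m)
        and m["rg"].upper().startswith("MC_")
        and m["vnet"].lower().startswith("aks-vnet-")
    )


def promote_subnet_id(subnet_id, recovery_subscription, cluster_resource_group, byo_vnet_name):
    """Point an AKS-managed subnet reference at the BYO subnet; leave others unchanged."""
    if not is_aks_managed_subnet(subnet_id):
        return subnet_id
    return (
        f"/subscriptions/{recovery_subscription}"
        f"/resourceGroups/{cluster_resource_group}"
        f"/providers/Microsoft.Network/virtualNetworks/{byo_vnet_name}"
        "/subnets/aks-subnet"
    )
```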
To allow the cluster's control-plane managed identity to join nodes to the BYO subnet and reconcile VMSS NICs when LoadBalancer services are created, Arpio also creates a Network Contributor role assignment for the cluster's control-plane identity on the BYO VNet's resource group during recovery.
If the AGIC add-on is enabled and the Application Gateway was created and managed by AKS, Arpio replicates the gateway as a BYO Application Gateway in the recovery environment (in the cluster's own resource group), and the AGIC add-on configuration is updated to reference it explicitly via applicationGatewayId.
To allow the AGIC managed identity to manage the BYO gateway, join its subnet, and read its public IP, Arpio creates a Network Contributor role assignment for the AGIC identity on the BYO gateway's resource group during recovery.
Network Sandbox
When a recovery test is run with the Advanced Network Sandbox enabled, Arpio places an Azure Firewall between the recovery cluster and the internet so the test environment can be exercised without giving it production network reachability. Several AKS-specific adjustments are made to the cluster so it remains functional behind the firewall.
The sandbox firewall needs to be wired into the cluster's network — which requires user-defined routes on the cluster's subnets and an Application Gateway whose subnet Arpio controls. AKS does not allow either on infrastructure that AKS itself manages. As described in Managed Resources Promoted to BYO, Arpio recreates AKS-managed VNets and AGIC Application Gateways as customer-owned resources in the recovery environment so it can attach UDRs to the cluster's subnets and route AGIC ingress through the firewall. This promotion happens for all AKS recoveries (not only sandboxed ones) to keep failover-test and real-failover topologies consistent.
AKS supports several outbound types (loadBalancer, managedNATGateway, userAssignedNATGateway, userDefinedRouting, none, etc.). To force all egress through the sandbox firewall, the cluster's networkProfile.outboundType must be userDefinedRouting.
If the source cluster's outboundType is not already userDefinedRouting or none, Arpio rewrites it to userDefinedRouting in the failover-test copy of the cluster. This change is applied only to the test copy — real failovers preserve the source outboundType.
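The rule reduces to a small decision, sketched here in Python. The function name `recovery_outbound_type` and the `sandboxed_test` flag are hypothetical, and the sketch assumes the rewrite applies only to test copies created with the Advanced Network Sandbox enabled.

```python
# Outbound types that are left untouched in a sandboxed test copy.
PRESERVED_IN_SANDBOX = ("userDefinedRouting", "none")


def recovery_outbound_type(source_outbound_type: str, sandboxed_test: bool) -> str:
    """Return the outboundType to use for the recovered cluster."""
    if sandboxed_test and source_outbound_type not in PRESERVED_IN_SANDBOX:
        # Force egress through the sandbox firewall via user-defined routes.
        return "userDefinedRouting"
    # Real failovers, and sources already on UDR or none, keep the source value.
    return source_outbound_type
```

For instance, a source cluster using `loadBalancer` is recovered with `userDefinedRouting` in a sandboxed test and with `loadBalancer` in a real failover.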
For background on the AKS UDR egress model and the Microsoft endpoints that must be reachable from a firewalled cluster, see Microsoft's Customize cluster egress with a user-defined routing table in AKS.
Putting a firewall in front of the cluster also blocks ingress to Service type=LoadBalancer resources that AKS provisions for in-cluster services. To keep these services reachable during a sandboxed test, Arpio allocates an additional public IP per load balancer frontend on the sandbox firewall and adds DNAT rules to the firewall that translate traffic arriving at the firewall IP to the underlying load balancer frontend IP.
To reach a load-balanced service during a sandboxed test, send traffic to the firewall's public IP for that frontend (visible in the Arpio console alongside the recovery resources) rather than directly to the load balancer's own public IP.
AGIC Application Gateways follow Microsoft's parallel firewall + AGW pattern instead of DNAT — see the generic Network Sandbox page for details.
Because the sandbox firewall denies all egress by default, an AKS cluster behind the firewall cannot function until the destinations it needs to reach — image registries, package mirrors, the AKS control plane, etc. — are explicitly allowed. Add the following domains to the sandbox allowlist when enabling Network Sandbox for a recovery test that includes an AKS cluster:
| Domain | Purpose |
|---|---|
| `acs-mirror.azureedge.net` | AKS base image and binary mirror |
| `packages.aks.azure.com` | AKS package repository |
| `packages.microsoft.com` | Microsoft package repository |
| `.blob.storage.azure.net` | Azure Blob storage (Microsoft endpoints) |
| `.blob.core.windows.net` | Azure Blob storage |
| `mcr.microsoft.com` | Microsoft Container Registry |
| `.data.mcr.microsoft.com` | Microsoft Container Registry data endpoint |
| `dc.services.visualstudio.com` | Application Insights telemetry |
| `.hcp.<region>.azmk8s.io` | Managed Kubernetes API server endpoint (region-specific) |
| `login.microsoftonline.com` | Microsoft Entra ID (required for AGIC to authenticate to ARM) |
| `management.azure.com` | Azure Resource Manager (required for AGIC to manage the Application Gateway) |
!!! important "Adjust the azmk8s.io domain for your recovery region"
    The `.hcp.<region>.azmk8s.io` entry must use the recovery environment's region: for example, `.hcp.westus2.azmk8s.io` for a recovery environment in West US 2, or `.hcp.eastus.azmk8s.io` for East US. Without this entry, nodes in the recovery cluster cannot reach the managed control plane and the cluster will not come up.
This list is the minimum required for a basic AKS cluster (with AGIC) to start and join the control plane. Workloads running inside the cluster will likely need additional allowlist entries — for example, to reach Azure SQL, Service Bus, Event Hubs, third-party APIs, or other external dependencies. AKS add-ons and workload features may also require additional egress; Microsoft's Outbound network and FQDN rules for AKS clusters documents the complete set of endpoints per add-on and feature.
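As a convenience, the minimum list can be assembled per recovery region with a few lines of Python. This is a sketch only: the function `aks_sandbox_allowlist` and its `extra_domains` parameter are hypothetical, and the region normalization assumes that a display name such as "West US 2" maps to the short name `westus2` by lowercasing and removing spaces.

```python
# Region-independent entries from the table above.
BASE_ALLOWLIST = [
    "acs-mirror.azureedge.net",
    "packages.aks.azure.com",
    "packages.microsoft.com",
    ".blob.storage.azure.net",
    ".blob.core.windows.net",
    "mcr.microsoft.com",
    ".data.mcr.microsoft.com",
    "dc.services.visualstudio.com",
    "login.microsoftonline.com",
    "management.azure.com",
]


def aks_sandbox_allowlist(recovery_region: str, extra_domains=()) -> list[str]:
    """Return the minimum AKS allowlist plus any workload-specific domains."""
    region = recovery_region.strip().lower().replace(" ", "")
    if not region:
        raise ValueError("recovery_region must not be empty")
    domains = BASE_ALLOWLIST + [f".hcp.{region}.azmk8s.io"] + list(extra_domains)
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(domains))
```

Workload dependencies (databases, message brokers, third-party APIs) would be passed through `extra_domains`.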
See Enabling Selective Outbound Access for how to enter the allowlist when starting a sandboxed recovery test.