Making Every CPU Count: The Engineering Journey Behind Choreo’s Scale-to-Zero

- Lakmal Warusawithana
- Vice President & Distinguished Engineer - WSO2 LLC

How we architected a responsive, HTTP-aware scale-to-zero mechanism to improve resource efficiency across cloud native workloads in Choreo.
Cloud platforms promise elasticity, but achieving true efficiency, especially for idle or sporadically used services, requires more than just horizontal scaling. At Choreo, we set out to rethink how workloads behave when they're not actively serving requests. The result: a robust scale-to-zero mechanism that minimizes idle consumption without sacrificing responsiveness. In this blog, we'll walk through the architectural decisions, queuing model, autoscaler internals, and the lessons learned while building a production-grade scale-to-zero capability for HTTP-based services.
Choosing the Right Tool
A fundamental principle in software engineering is not reinventing the wheel. When we set out to build scale-to-zero in Choreo, we began by exploring whether an existing open source tool could meet our needs. But selecting the right tool is far from straightforward, especially in the rich and ever-expanding CNCF landscape, where multiple projects often appear to solve the same problem on the surface.
Each tool comes with its own architecture, assumptions, and trade-offs. Many don't fit directly into your environment or use case without significant effort. Choosing the right one requires deep evaluation: hands-on experimentation, architectural reviews, understanding extensibility points, and assessing how well it integrates with your platform's operational model and scalability goals.
After evaluating several options, we selected Kubernetes Event-Driven Autoscaling (KEDA). KEDA is a lightweight, Kubernetes-native component that enables event-driven scaling of any container based on a wide range of external triggers, not just CPU or memory. It integrates seamlessly with Kubernetes' Horizontal Pod Autoscaler (HPA) and supports custom scalers, making it a great fit for our HTTP request-based scaling model.
While KEDA itself is mature and battle-tested for general event-driven autoscaling, there was one caveat: its HTTP add-on, the component responsible for HTTP-based scaling, was still in alpha (now in beta), with limited contributions and effectively in maintenance mode. Although the HTTP add-on worked in standalone scenarios, its architecture had not been designed for enterprise-grade complexity or production-hardening requirements.
Despite this, we decided to take on the challenge. Our goal was clear: enable workload autoscaling based on actual HTTP request behavior. So we dug into the internals of the HTTP add-on to evaluate how we could adapt and extend it to meet Choreo's needs. What followed was a deep technical effort to align the HTTP scaling mechanics with the realities of a complex, multi-tenant, production environment, while preserving responsiveness and reliability.
Understanding the KEDA HTTP Add-On Architecture
Before diving into how we integrated KEDA into Choreo, it's important to understand how the KEDA HTTP add-on is designed to work.
The HTTP add-on extends KEDA's capabilities by enabling autoscaling based on incoming HTTP requests, something not natively supported by Kubernetes' Horizontal Pod Autoscaler (HPA), which typically relies on metrics like CPU or memory. The HTTP add-on introduces a proxy-based architecture that intercepts requests and scales workloads on demand, making it well-suited for implementing scale-to-zero for HTTP services.

Figure 1: KEDA HTTP add-on architecture (source: https://github.com/kedacore/http-add-on/blob/main/docs/design.md)
At a high level, the architecture consists of the following key components:
- HTTP Interceptor Proxy: This component sits in front of the user's service and intercepts incoming HTTP traffic. When a request arrives and no service replicas are running, it temporarily queues the request and signals the autoscaler to scale up the target deployment.
- Scaler and Operator: The scaler is responsible for triggering the scale-up process by interacting with the KEDA operator and the Kubernetes HPA. It monitors pending request queues and manages the lifecycle of the workload replicas based on real-time demand.
- Queue: The HTTP add-on introduces an in-memory queue that temporarily holds incoming requests until the application is ready to handle them. This allows the system to absorb cold-start latency without dropping traffic.
- Target Service: Once the application has scaled up and is ready to serve, the queued requests are forwarded to the actual service endpoint.
While this architecture works in standalone scenarios, it introduces several challenges in enterprise settings, particularly in multi-tenant platforms like Choreo. The proxy pattern assumes a single ingress path and global scaling logic, which doesn't align well with Choreo's cell-based model, where isolation, network boundaries, and routing control are tightly enforced. This architectural mismatch is what made the integration both interesting and non-trivial.
Why KEDA Integration Wasn't Plug-and-Play
Integrating KEDA into Choreo's architecture was far from straightforward. Choreo is built on a cell-based architecture, which serves as the foundation of its design. In this model, all components belonging to a project, such as services, scheduled tasks, and web applications, are deployed within a dedicated Kubernetes namespace, referred to as a cell. Components within a cell can communicate with each other directly using their Kubernetes service names.

Figure 2: Choreo's cell-based architecture
As shown in Figure 2, a typical Choreo cell is designed with strict boundaries and controlled ingress. External ingress traffic is forced through a centralized API gateway, ensuring governance, observability, and security. Web application traffic is routed through a dedicated ingress gateway. To maintain strict network isolation, all other cross-boundary traffic is denied by default using Cilium network policies.
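For illustration, the deny-by-default posture with an explicit gateway allowance could be expressed with a Cilium policy along the following lines. This is a minimal sketch rather than Choreo's actual policy; the namespace and label values are hypothetical.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: cell-ingress-policy              # hypothetical name
  namespace: cell-foo                    # the cell's namespace (hypothetical)
spec:
  # Selecting every endpoint in the cell puts them all into default-deny:
  # only the ingress sources listed below are permitted.
  endpointSelector: {}
  ingress:
    # Allow traffic from workloads inside the same cell.
    - fromEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: cell-foo
    # Allow traffic from the centralized API and ingress gateways.
    - fromEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: gateway-system   # hypothetical namespace
```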
Our First Approach: Using iptables to Route Traffic Through the KEDA Interceptor
In Choreo, not all services are required to use scale-to-zero. Instead, users have the flexibility to enable or disable scale-to-zero on a per-service basis, depending on their specific use cases. As a result, some services within a cell are configured to scale-to-zero using KEDA, while others continue to operate using the standard Kubernetes Horizontal Pod Autoscaler (HPA) with a fixed minimum replica count.
This mixed-mode model introduced a key challenge: how to correctly route traffic based on whether the destination service uses scale-to-zero. Services with scale-to-zero enabled must route all HTTP traffic through the KEDA HTTP Interceptor, which buffers incoming requests and initiates workload scaling.
In line with Choreo's cell-based architecture, HTTP traffic targeting a user workload can originate from two sources:
- From the API or ingress gateway, if the service is exposed externally.
- From another user workload within the same cell, if the service is called internally using its Kubernetes service name.
One of the most important user experience goals we had was to ensure that this traffic redirection happened automatically, without requiring Choreo users to modify their code or change how they make service-to-service calls. Users should be able to call other workloads using standard Kubernetes service names, without being aware of the underlying routing logic.
To support this:
- For external traffic via the API gateway, the gateway is configured to forward all outgoing API requests to the KEDA HTTP Interceptor if the target service has scale-to-zero enabled.
- For internal service-to-service traffic within the same cell, we injected init containers into all user workload deployments, regardless of whether they actually call scale-to-zero services. These init containers configure iptables rules during pod startup to redirect all outgoing HTTP traffic to the KEDA HTTP Interceptor.

Figure 3: Init containers with iptables traffic forwarding
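To make the mechanism concrete, the excerpt below sketches what such an init container could look like. It is illustrative only: the container name, image, interceptor address, and port are placeholders, not Choreo's actual implementation.

```yaml
# Excerpt from a user workload Deployment pod spec (illustrative sketch)
initContainers:
  - name: redirect-to-interceptor              # hypothetical name
    image: example.com/iptables-init:latest    # placeholder: any image that ships iptables
    securityContext:
      capabilities:
        add: ["NET_ADMIN"]                     # required to modify the pod's iptables rules
    env:
      - name: INTERCEPTOR_IP
        value: "10.96.0.50"                    # placeholder; in practice the Interceptor's ClusterIP
    command:
      - sh
      - -c
      - |
        # Redirect outbound HTTP traffic (port 8080 here) to the KEDA HTTP Interceptor,
        # excluding loopback so in-pod calls are left untouched.
        iptables -t nat -A OUTPUT -p tcp --dport 8080 ! -d 127.0.0.0/8 \
          -j DNAT --to-destination "${INTERCEPTOR_IP}:8080"
```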
This approach ensured that all HTTP traffic, both internal and external, was consistently routed through the interceptor without requiring any changes from the developer. However, it introduced several significant limitations:
- Every user application required a corresponding HTTPScaledObject, even if scale-to-zero was disabled. Without one, the Interceptor could not resolve the destination of incoming requests, leading to failures.
- All user workloads had to be redeployed with the init container, which increased operational complexity and deployment overhead.
- Most critically, this design routed all HTTP traffic in the cluster through the KEDA HTTP Interceptor, regardless of whether autoscaling was needed. This raised performance concerns and potential bottlenecks, as the interceptor now had to handle all HTTP traffic—including requests to services that never needed to scale to zero.
These challenges led us to explore a more efficient and scalable approach, better suited to Choreo's architecture and performance goals.
Our Refined Approach: DNS Redirection with ExternalName for Seamless Interceptor Routing
To reduce the complexity of our initial implementation, we revisited the core assumptions around traffic redirection and developed a more elegant solution—one that eliminates the need for init containers, avoids modifying application deployments, and maintains Choreo's seamless developer experience.
In this refined approach, the only change required is at the Kubernetes service level, based on whether a component is configured for scale-to-zero. Application code, deployment manifests, and runtime behavior remain completely untouched.
Let's walk through a typical example.

Figure 4: DNS redirection with ExternalName
Consider a scenario where the app-foo component communicates with the app-bar component.
To support this, we define two Kubernetes services for app-bar:
1. app-bar: The public-facing service name used by other components (e.g., app-foo) to invoke the service. This is defined as an ExternalName service that redirects traffic to the KEDA HTTP Interceptor.
2. app-bar-local: A standard Kubernetes ClusterIP service pointing directly to the app-bar deployment.

With this setup, when app-foo calls app-bar using its standard service name, Kubernetes resolves the app-bar service via DNS to the KEDA HTTP Interceptor, thanks to the ExternalName mapping. The original request headers are preserved, and the redirection is entirely transparent to the caller.
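A minimal sketch of the two Service definitions is shown below. The Interceptor's DNS name assumes the KEDA HTTP add-on's default Helm chart naming, and the namespace and labels are illustrative rather than Choreo's actual values.

```yaml
# Public-facing name: DNS-level redirection to the KEDA HTTP Interceptor.
apiVersion: v1
kind: Service
metadata:
  name: app-bar
  namespace: cell-foo              # the cell's namespace (illustrative)
spec:
  type: ExternalName
  # Default interceptor proxy DNS name from the add-on's Helm chart (assumption).
  externalName: keda-add-ons-http-interceptor-proxy.keda.svc.cluster.local
---
# Internal name: routes directly to the app-bar pods once they are running.
apiVersion: v1
kind: Service
metadata:
  name: app-bar-local
  namespace: cell-foo
spec:
  type: ClusterIP
  selector:
    app: app-bar                   # illustrative pod label
  ports:
    - port: 8080
      targetPort: 8080
```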
Port Consistency and Traffic Routing
This routing mechanism works correctly only when both the app-bar component and the KEDA Interceptor proxy operate on the same port, typically 8080. However, in cases where the target service runs on a different port (e.g., 8087), the ExternalName service incorrectly attempts to route to that specific port on the Interceptor—which is not intended.
To address this, we standardized on a predefined range of supported ports for scale-to-zero applications, 8080–8089 and 7070–7079, giving us 20 available ports. We then configure the KEDA Interceptor's Kubernetes service to map every port in that range to the Interceptor's listening port.
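A minimal sketch of such a Service follows; the name and selector assume the add-on's Helm chart defaults, and only a few of the 20 port entries are shown.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: keda-add-ons-http-interceptor-proxy   # default chart name (assumption)
  namespace: keda
spec:
  selector:
    app.kubernetes.io/component: interceptor  # illustrative label
  ports:
    # One entry per supported application port; every entry targets the
    # Interceptor's single listening port 8080. The full list covers
    # 8080-8089 and 7070-7079; only a few entries are shown here.
    - name: port-8080
      port: 8080
      targetPort: 8080
    - name: port-8087
      port: 8087
      targetPort: 8080
    - name: port-7070
      port: 7070
      targetPort: 8080
```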
With this mapping in place, regardless of the exposed service port, all traffic within the defined range is redirected to port 8080 on the Interceptor.
HTTPScaledObject Configuration
The corresponding HTTPScaledObject specifies app-bar-local as the scaleTargetRef, allowing KEDA to direct requests to the correct backend service once it has scaled up.
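The sketch below shows one way such an HTTPScaledObject can look. Exact field names differ between add-on versions, and the host, replica, and timeout values here are illustrative rather than Choreo's configuration.

```yaml
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: app-bar
  namespace: cell-foo                        # illustrative
spec:
  hosts:
    - app-bar.cell-foo.svc.cluster.local     # host the Interceptor routes on (illustrative)
  scaleTargetRef:
    name: app-bar                            # Deployment to scale
    kind: Deployment
    apiVersion: apps/v1
    service: app-bar-local                   # backend Service requests are forwarded to
    port: 8080
  replicas:
    min: 0                                   # allow scale-to-zero
    max: 5                                   # illustrative ceiling
  scaledownPeriod: 300                       # idle seconds before scaling back to zero (illustrative)
```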
If no replicas are running, the Interceptor queues the request, KEDA triggers a scale-up, and the traffic is forwarded only after the service becomes available.
Benefits of This Approach
This refined design greatly simplified our integration while improving reliability and maintainability:
- No init containers
- No custom iptables rules per workload
- No changes to user deployment specs
- Easily switchable between scale-to-zero and standard HPA by updating the service type
By isolating the routing logic entirely to the Kubernetes service layer, we preserved Choreo's core architecture, minimized operational overhead, and made scale-to-zero completely transparent to platform users.
Operationalizing Scale-to-Zero: Improvements and Production Readiness
We rolled out our scale-to-zero implementation to production in March 2024, and since then, the system has been running reliably with no major issues encountered. As we monitored its performance and gathered feedback, we identified and implemented several improvements to enhance user experience and reliability.
Improved First-Time User Experience After Deployment
In the initial release, newly deployed services started at zero replicas by default. This meant that when a user deployed an app and immediately tried to access it, the service had to scale up from zero, incurring a delay equal to the cold-start time (e.g., container startup). While this behavior was technically correct, it wasn't aligned with user expectations—especially since users typically want to access the service immediately after deployment.
To address this, we introduced a warm-up behavior:
- Upon deployment, services are initially scaled to the minimum replica count (e.g., 1 or 2).
- After a configurable idle timeout, the service is allowed to scale back down to zero if unused.
This change significantly improved the first-touch experience without compromising the resource savings of scale-to-zero in the long run.
Handling Transient 503s During Cold Starts
We also observed an edge case for certain applications—especially those written in Java—where the Kubernetes readiness probe passed, but the application was not yet ready to fully handle incoming requests. As a result, the first request released by the KEDA Interceptor could occasionally result in an HTTP 503.
To mitigate this, we implemented an automatic 503 retry mechanism in the interceptor for the first request after scale-up. This ensures:
- Better reliability for users
- Successful delivery of the first request, even during a tight startup window
Enhanced Observability with OpenTelemetry Support
Another key improvement we made post-initial release was integrating OpenTelemetry into the KEDA HTTP Interceptor to provide better visibility into the behavior and performance of scale-to-zero services.
By embedding OpenTelemetry support directly into the Interceptor, we've made scale-to-zero fully observable, giving both developers and platform engineers clear insights into autoscaling behavior, request flows, and performance characteristics—just like any always-on service. This also brings scale-to-zero workloads into parity with standard services in terms of observability, ensuring consistent operational maturity across the board.
Looking Ahead: WebSocket and eBPF Support
Our initial implementation focused exclusively on HTTP services, and WebSocket support was not included. We are currently working on extending the KEDA Interceptor to support WebSocket-based workloads, enabling scale-to-zero for more real-time and bi-directional communication scenarios.
Additionally, we have completed a proof of concept for replacing the user-space HTTP Interceptor with an eBPF-based Interceptor. Early results are promising, showing noticeable improvements in network latency and throughput. We believe this shift will:
- Significantly improve performance for high-throughput services
- Reduce overhead at the data plane level
- Simplify request routing without compromising visibility
These enhancements are actively being developed and will be covered in future WSO2 engineering blog posts.
Summary: Cost Efficiency at Scale
By implementing scale-to-zero in Choreo, we've been able to drastically reduce infrastructure costs, especially for workloads with intermittent or unpredictable traffic. This cost optimization directly benefits our users by enabling them to run more workloads with minimal baseline consumption, improving overall platform efficiency.
At the same time, we've maintained a strong focus on user experience, ensuring that services remain responsive, reliable, and simple to manage—with no need for custom configuration or changes to application code.
Scale-to-zero has proven to be a foundational capability for enabling elastic, cost-effective platforms—and we're excited to continue evolving it further.