Astro Hacker News - Why Your Load Balancer Still Sends Traffic to Dead Backends

dastbe |next [-]

kind of right, kind of wrong

* for client-side load balancing, it's entirely possible to move active healthchecking into a dedicated service and have its results be vended along with discovery. In fact, more managed server-side load balancers are also moving healthchecking out of band so they can scale the forwarding plane independently of probes.

* for server-side load balancing, it's entirely possible to shard forwarders to avoid SPOFs, typically by creating isolated increments and then using shuffle sharding by caller/callee to minimize overlap between workloads. I think Alibaba's canalmesh whitepaper covers such an approach.
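A minimal sketch of the shuffle-sharding idea mentioned above, assuming a flat pool of forwarders and a shard chosen per caller. The names (`shuffle_shard`, `fwd-N`, the caller names) are hypothetical and not taken from any particular system:

```python
import hashlib
import random


def shuffle_shard(caller: str, forwarders: list[str], shard_size: int) -> list[str]:
    """Pick a stable pseudo-random subset of forwarders for one caller."""
    # Seed a private RNG from a hash of the caller name so the same caller
    # always maps to the same shard, independent of process or host.
    seed = int.from_bytes(hashlib.sha256(caller.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return sorted(rng.sample(forwarders, shard_size))


forwarders = [f"fwd-{i}" for i in range(16)]
shard_a = shuffle_shard("checkout", forwarders, 4)
shard_b = shuffle_shard("search", forwarders, 4)

# Two callers usually share only part of their shards, so a caller that
# poisons its own forwarders takes out only a fraction of another's capacity.
overlap = set(shard_a) & set(shard_b)
```

With 16 forwarders and shards of 4 there are 1820 possible shards, so two workloads landing on exactly the same set is unlikely; that is the property that limits the blast radius.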

As for scale, I think for almost everybody it's completely overblown to go with a p2p model. I think a reasonable estimate for a centralized proxy fleet is about 1% of infrastructure costs. If you want to save that, you need to have a team that can build/maintain your centralized proxy's capabilities in all the languages/frameworks your company uses, and you likely need to build the proxy anyway for the long tail. Whereas you can fund a much smaller team to focus on e2e ownership of your forwarding plane.

Add on top that you need a safe deployment strategy for updating the critical logic in all of these combinations, and continuous deployment to ensure your fixes roll out to the fleet in a timely fashion. This is itself a hard scaling problem.

singhsanjay12 |root |parent [-]

For client-side LB, wouldn't moving the active healthcheck out into a dedicated service create more reliability issues, with one more service to worry about? Are there any examples of this approach being used in the industry?

donavanm |root |parent |next [-]

IME you end up with both; something like discrete client, LB, and controller. You can’t rely on any one component to “turn itself off.” For example, a client or LB can easily get into a “wedged” state where it’s unable to take itself out of consideration for traffic. I’ve had silly incidents caused by BGP routes staying up, memory errors/pressure preventing new health check results from being parsed, file systems going read-only, SKB pressure interfering with pipes, and of course the classic difference between a dedicated health check endpoint and actual traffic. In all of those cases the failure prevents the client or LB from removing itself from the traffic path.

An external controller is able to safely remove traffic from one of the other failed components. In addition the client can still do local traffic analysis, or use in band signaling, to identify anomalous end points and remove itself or them from the traffic path.

Good active probes are actually a pretty meaningful traffic load. It was a HUGE problem for flat virtual network models like Heroku's a decade ago. This is exacerbated when you have more clients and more endpoints.

As a reference, this distributed model is what AWS moved to 15 years ago. And if you look at any of the high-throughput cloud services or CDNs, they’ll have a similar model.

dastbe |root |parent [-]

one thing to add for passive healthchecking and clientside loadbalancing is that throughput and dilution of signal really matters.

there are obviously plenty of low/sparse call volume services where passive healthchecks would take forever to get signal, or signal is so infrequently collected it's meaningless. and even with decent RPS, say 1M RPS distributed between 1000 caller replicas and 1000 callee replicas, that means that any one caller-callee pair is only seeing 1 RPS. Depending on your noise threshold, a centralized active healthcheck can respond much faster.

There are some ways to improve signal in the latter case using subsetting and aggregating/reporting controllers, but that all comes with added complexity.
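To put rough numbers on the dilution point above, here is a back-of-the-envelope sketch. It assumes traffic is spread uniformly, and the 20-sample requirement and the subset size of 50 are arbitrary illustrative figures, not recommendations:

```python
def per_pair_rps(total_rps: float, callers: int, callees_per_caller: int) -> float:
    """Average request rate one caller sends to one callee, assuming uniform spreading."""
    return total_rps / (callers * callees_per_caller)


def seconds_to_collect(samples_needed: int, pair_rps: float) -> float:
    """Time for one caller to observe `samples_needed` requests to a single callee."""
    return samples_needed / pair_rps


# 1M RPS, 1000 callers, every caller talks to all 1000 callees.
full_mesh = per_pair_rps(1_000_000, 1000, 1000)
print(full_mesh)                          # 1.0
print(seconds_to_collect(20, full_mesh))  # 20.0

# Same traffic, but each caller is subsetted to 50 callees.
subsetted = per_pair_rps(1_000_000, 1000, 50)
print(subsetted)                          # 20.0
print(seconds_to_collect(20, subsetted))  # 1.0
```

Under these assumptions subsetting cuts the time to gather the same number of samples from 20 seconds to 1 second, which is the kind of improvement the subsetting remark refers to.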

dastbe |root |parent |previous [-]

From a dataplane perspective, it does mean your healthchecks are running from a different location than your proxy. So there are risks where routability is impacted for proxy -> dest but not for healthchecker -> dest.

For general reliability, you can create partitions of checkers and use quorum across partitions to determine what the health state is for a given dest. This also enables centralized monitoring to detect systemic issues with bad healthcheck configuration changes (i.e. are healthchecks failing because the service is unhealthy or because of a bad healthchecker?)
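A rough sketch of what quorum across checker partitions could look like. The partition names, the quorum size, and the "fails everything" heuristic for spotting a bad checker are all assumptions for illustration, not a description of any specific system:

```python
from collections import Counter


def quorum_health(votes: dict[str, bool], quorum: int) -> str:
    """Combine per-partition verdicts for one destination.

    `votes` maps a checker partition name to True if that partition
    currently sees the destination as healthy.
    """
    counts = Counter(votes.values())
    if counts[True] >= quorum:
        return "healthy"
    if counts[False] >= quorum:
        return "unhealthy"
    return "unknown"  # no quorum either way: keep the previous state or alert


def suspect_partitions(reports: dict[str, dict[str, bool]]) -> list[str]:
    """Flag partitions that report every destination as unhealthy.

    A partition failing everything is more likely a bad checker (or a bad
    healthcheck config push) than a fleet-wide outage seen by it alone.
    """
    return sorted(
        partition
        for partition, dests in reports.items()
        if dests and not any(dests.values())
    )


votes = {"checker-a": True, "checker-b": True, "checker-c": False}
print(quorum_health(votes, quorum=2))  # healthy
```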

In industry, I personally know AWS has one or two health-check-as-a-service systems that they are using internally for LBs and DNS. Uber runs its own health-check-as-a-service system which it integrates with its managed proxy fleet as well as p2p discovery. IIRC Meta also has a system like this for at least some things? But maybe I'm misremembering.

dotwaffle |next |previous [-]

I've never quite understood why there couldn't be a standardised "reverse" HTTP connection, from server to load balancer, over which connections are balanced. Standardised so that some kind of health signalling could be present for easy/safe draining of connections.

singhsanjay12 |root |parent |next [-]

The idea is attractive (especially for draining), but once you try to map arbitrary inbound client connections onto backend-initiated "reverse" pipes, you end up needing standardized semantics for multiplexing, backpressure, failure recovery, identity propagation, and streaming! So, you're no longer just standardizing "reverse HTTP", you’re standardizing a full proxy transport + control plane. In practice, the ecosystem standardized draining/health via readiness + LB control-plane APIs and (for HTTP/2/3) graceful shutdown signals, which solves the draining problem without flipping the fundamental accept/connect roles.

bastawhiz |root |parent |next |previous [-]

Whether the load balancer connects to the server or the reverse, nothing changes. A modern H2 connection is pretty much just that: one persistent connection between the load balancer and server. Who initiates it doesn't change much.

The connection being active doesn't tell you that the server is healthy (it could hang, for instance, and you wouldn't know until the connection times out or a health check fails). Either way, you still have to send health checks, and either way you can't know between health checks that the server hasn't failed. Ultimately this has to work for every failure mode where the server can't respond to requests, and in any given state, you don't know what capabilities the server has.

snowhale |root |parent |previous [-]

[dead]

igor47 |next |previous [-]

Back in the day, I thought about this problem domain a lot! I even wrote and open-sourced a service discovery framework called SmartStack, an early precursor to later approaches like Envoy, described here: https://medium.com/airbnb-engineering/smartstack-service-dis...

This was a client side framework, in the OP's parlance. What's missing in OP is the insight that the server-side load balancer can also fail -- what will load balance the load balancers? We performed registration based on health checks from a sidecar, and then we also did client side checks which we called connectivity checks. Multiple client instances can disagree about the state of the world because network partitions actually can result in different states of the world for different clients.

Finally, you do also still need circuit breakers. Health checks are generally pretty broad, and when a single endpoint in a service begins having high latency, you don't want to bring down the entire client service with all capacity stuck making requests to that one endpoint. This specific example is probably more relevant to the old days of thread and process pools than to modern evented/async frameworks, but the broader point still applies.
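For readers who haven't built one, a minimal per-endpoint circuit breaker might look like the sketch below. It is not SmartStack's implementation; the thresholds, the class name, and the endpoint path are made up for illustration, and a production breaker would also limit how many trial requests run while half-open:

```python
import time
from collections import defaultdict


class CircuitBreaker:
    """Closed -> open after N consecutive failures -> half-open after a timeout."""

    def __init__(self, failure_threshold=5, reset_timeout=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Open: only let traffic through again once the timeout has elapsed.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Trip the breaker, or re-open it if a half-open trial failed.
            self.opened_at = self.clock()


# One breaker per endpoint, so a slow endpoint does not starve the others.
breakers = defaultdict(CircuitBreaker)
if breakers["/reports/export"].allow_request():
    pass  # issue the call, then record_success() or record_failure()
```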

singhsanjay12 |root |parent [-]

> when a single endpoint in a service begins having high latency

Yes, have seen this first hand. Tracking the latency per endpoint in a sliding window helped in some way, but it created other problems for low qps services.

singhsanjay12 |next |previous [-]

I wrote this after seeing cases where instances were technically “up” but clearly not serving traffic correctly.

The article explores how client-side and server-side load balancing differ in failure detection speed, consistency, and operational complexity.

I’d love input from people who’ve operated service meshes, Envoy/HAProxy setups, or large distributed fleets — particularly around edge cases and scaling tradeoffs.

jeffbee |root |parent |next [-]

I don't think you really need sub-millisecond detection to get sub-millisecond service latency. You mainly need to send backup requests, where appropriate, to backup channels, when the main request didn't respond promptly, and your program needs to be ready for the high probability that the original request wins this race anyway. It's more than fine that Client A and Client B have differing opinions about the health of the channel to Server C at a given time, because there really isn't any such thing as the atomic health of Server C anyway. The health of the channel consists of the client, the server, and the network, and the health of AC may or may not impact the channel BC. It's risky to let clients advertise their opinions about backend health to other clients, because that leads to the event where a bad client shoots down a server, or many servers, for every client.
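A small asyncio sketch of the backup-request pattern described above. The delay value and the fake backends are placeholders, and this only makes sense for requests that are safe to send twice:

```python
import asyncio


async def hedged_call(primary, backup, hedge_delay: float):
    """Start `primary`; if it has not finished within `hedge_delay` seconds,
    also start `backup` and return whichever finishes first."""
    first = asyncio.ensure_future(primary())
    done, _ = await asyncio.wait({first}, timeout=hedge_delay)
    if done:
        # The common case: the original request won without any backup.
        return first.result()

    second = asyncio.ensure_future(backup())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # the loser is abandoned
    return done.pop().result()


async def fake_backend(name: str, latency: float) -> str:
    await asyncio.sleep(latency)
    return name


async def demo() -> str:
    # The primary is slow here, so the backup channel answers first.
    return await hedged_call(
        lambda: fake_backend("primary", 0.5),
        lambda: fake_backend("backup", 0.01),
        hedge_delay=0.05,
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Note that no shared notion of "Server C is unhealthy" is needed: each call just races its own channels, which fits the point that health is a property of the client-server-network path rather than of the server alone.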

owenthejumper |root |parent |next |previous [-]

Modern LBs, like HAProxy, support both active & passive health checks (and others, like agent checks where the app itself can adjust the load balancing behavior). This means that your "client scenario" covering passive checks can be done server side too.

Also, in HAProxy (that's the one I know), server side health checks can be in millisecond intervals. I can't remember the minimum, I think it's 100ms, so theoretically you could fail a server within 200-300ms, instead of the 15 seconds in your post.

bastawhiz |root |parent [-]

> theoretically you could fail a server within 200-300ms, instead of 15seconds in your post.

You need to be careful here, though, because the server might just be a little sluggish. If it's doing something like garbage collection, your responses might take a couple hundred milliseconds temporarily. A blip of latency could take your server out of rotation. That increases load on your other servers and could cause a cascading failure.

If you don't need sub-second reactions to failures, don't worry too much about it.
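One common way to keep fast checks from ejecting a server over a single blip is to require several consecutive failures first (HAProxy's `fall` setting works along these lines). A toy sketch, with the 100 ms interval and threshold of 3 chosen only for illustration:

```python
def should_eject(results: list[bool], fall: int) -> bool:
    """`results` holds probe outcomes, newest last (True = probe passed).

    Eject only after `fall` consecutive failed probes.
    """
    return len(results) >= fall and not any(results[-fall:])


def detection_time_ms(interval_ms: int, fall: int) -> int:
    """Rough upper bound on time from failure to ejection, ignoring probe timeouts."""
    return interval_ms * fall


# A short GC pause fails two probes, then the server answers again.
print(should_eject([True, False, False, True], fall=3))   # False
# A real outage fails three probes in a row.
print(should_eject([True, False, False, False], fall=3))  # True
print(detection_time_ms(100, 3))                          # 300
```

Raising `fall` trades detection speed for tolerance of pauses, which is exactly the tension described above.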

Noumenon72 |root |parent |next |previous [-]

Thanks for writing something that's accessible to someone who's only used Nginx server-side load balancing and didn't know client-side load balancing existed at higher scale.

firefoxd |root |parent |previous [-]

Hi author, a tangent:

    <meta name="viewport" content="width=device-width, initial-scale=1" />

For us who need to zoom in on mobile devices.

singhsanjay12 |root |parent [-]

Ok, do you mind briefly describing what issues you saw on mobile?

willi59549879 |root |parent [-]

Zoom on mobile is not possible. So all the graphs are tiny and not readable.

gbuk2013 |next |previous [-]

I have to say I am not a fan of doing this on the client side.

API gateways (which is what server side load-balancer can be abstracted as) serve as important control points for service traffic, for example for auth, monitoring and observability, application firewall, rate limiting etc.

In my general experience code running on the client side is less reliable due to permutations of browsers, flaky networks, challenges with observability.

That said, client side already has one type of load balancing - DNS - but that doesn’t address the availability challenge.

AuthAuth |next |previous [-]

It seems like passive is the best option here but can someone explain why one real request must fail? So the load balancer is monitoring for failed requests. If it receives one can it not forward the initial request again?

jayd16 |root |parent |next [-]

Not every request is idempotent, and it's not known when or why a request has failed. GETs are ok (in theory) but you can't retry a POST without risk of side effects.
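As a rough illustration of that rule, a proxy-side retry might be gated on the HTTP method like this. The `send` callable and the backend names are hypothetical stand-ins; the method list follows the idempotent methods defined in RFC 9110:

```python
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"}


def forward_with_retry(method: str, send, backends: list[str]):
    """Try backends in order, but only move past the first one when the
    method is idempotent.

    `send(backend)` performs the request and raises ConnectionError on failure.
    """
    if not backends:
        raise ValueError("no backends available")

    retryable = method.upper() in IDEMPOTENT_METHODS
    candidates = backends if retryable else backends[:1]

    last_error = None
    for backend in candidates:
        try:
            return send(backend)
        except ConnectionError as exc:
            # We cannot tell whether the backend acted on the request before
            # failing, so replaying is only safe for idempotent methods.
            last_error = exc
    raise last_error
```

A failed POST is surfaced to the caller after one attempt, which is why at least one real request still fails even with passive checks.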

bdangubic |root |parent |next [-]

I am a contractor and have been fixing shit for a large part of my career. Non-idempotent POSTs are just about always at the top of the list of shit to fix immediately. To this day (30 years in) I do not understand how someone can design a system where POSTs are not idempotent… I mean I know why, the vast majority of people in our industry are just not good at what they do but still…

SoftTalker |root |parent [-]

Yep. I worked in corporate back-office IT way before the web era. It was a requirement that every batch job be re-runnable idempotently. So if it failed, you'd identify the bad data, excise it, rerun the job, and deal with the bad record in the morning.

OptionOfT |root |parent |previous [-]

There were some issues with replaying certain GETs back in the day:

https://news.ycombinator.com/item?id=16964907

cormacrelf |root |parent |previous [-]

For GET /, sure, and some mature load balancers can do this. For POST /upload_video, no. You'd have to store all in-flight requests, either in-memory or on disk, in case you need to replay the entire thing with a different backend. Not a very good tradeoff.

itmitica |next |previous [-]

“We won’t fix the simple, visible server-side problem, so we’ll distribute a harder version of it into every client.”

umairnadeem123 |previous [-]

[dead]

singhsanjay12 |root |parent [-]

Agree - sliding window error rates plus client-side circuit breakers (with half-open probes and ramp-up) work really well in practice, and the recovery-speed point is especially important.

The only nuance I was trying to call out is what happens at very large scale. These mechanisms operate per client instance, so each client needs a few failures before it trips its breaker and then runs its own probes and ramp-up. That's perfectly reasonable locally, but when you have hundreds or thousands of clients, the aggregate "learning traffic" can still be noticeable. Each client might only send a little bad traffic before reacting, but multiplied across the fleet it can still add up. Similarly, recovery can still produce smaller synchronized ramps as many clients independently notice improvement around the same time.

So I tend to think of client-side circuit breakers as necessary but not always sufficient at scale. They're great for fast local containment and tail-latency protection, but they work best when paired with some shared signal (LB, mesh control plane, or similar) that can dampen the aggregate effect and smooth recovery globally.