Such amazing knowledge in the Amazon Builders' Library. Just seen "Avoiding fallback in distributed systems": https://aws.amazon.com/builders-library/avoiding-fallback-in-distributed-systems/?did=ba_card&trk=ba_card Super useful info, even if you're designing an on-prem active/passive service. How should you think about retry/failover and fallback?
Also "Static stability using Availability Zones". How to spread apps across multiple data centers (AZs in AWS speak): when do you use active/active or active/passive, and how do you build multiple active/active services that rely on each other? https://aws.amazon.com/builders-library/static-stability-using-availability-zones/?did=ba_card&trk=ba_card
Then, failures always happen, so how do you manage services that can't talk to each other? "Timeouts, retries, and backoff with jitter" from @MarcJBrooker shows how we think about it and why we are cautious with retries. https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/?did=ba_card&trk=ba_card
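To make the retry idea concrete, here is a minimal Python sketch of capped exponential backoff with "full jitter" (each wait is a random value between zero and the current backoff ceiling). The names `backoff_delays` and `call_with_retries`, and the default `base`, `cap`, and `attempts` values, are illustrative assumptions, not taken from the article or from any AWS SDK.

```python
import random
import time


def backoff_delays(base=0.1, cap=5.0, attempts=5, rng=random.random):
    """Yield one jittered delay per attempt.

    The ceiling doubles each attempt (base * 2**attempt) but never exceeds
    cap; the actual delay is a random fraction of that ceiling.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def call_with_retries(operation, attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call operation(); on failure, wait a jittered delay and try again.

    Assumption for this sketch: every exception is treated as retryable.
    A real client would retry only errors known to be transient.
    """
    delays = list(backoff_delays(base, cap, attempts - 1))
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            sleep(delays[attempt])
```

The hard limit on `attempts` is the "cautious" part: retries add load to a dependency that may already be struggling, and the jitter spreads the retries from many clients out over time so they don't all arrive at once.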
A great way to connect separate services is with a queue for messaging. What happens when queues back up? "Avoiding insurmountable queue backlogs" from @dyanacek explains how we use prioritization, sidelining, and backpressure to minimize queue backlogs. https://aws.amazon.com/builders-library/avoiding-insurmountable-queue-backlogs/?did=ba_card&trk=ba_card
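A rough Python sketch of two of those ideas, under my own simplifying assumptions: backpressure is modelled as a bounded in-memory queue that refuses new messages when full, and sidelining as moving messages older than `max_age` into a separate list so fresh work is handled first. The `enqueue` and `drain` helpers and the `(enqueued_at, payload)` message shape are hypothetical, not from the article.

```python
import queue


def enqueue(q, message):
    """Backpressure: refuse new work when the bounded queue is full.

    Returns True if the message was accepted, False if the caller should
    slow down or retry later.
    """
    try:
        q.put_nowait(message)
        return True
    except queue.Full:
        return False


def drain(q, sideline, now, max_age):
    """Empty the queue, returning fresh payloads and sidelining stale ones.

    Each message is an (enqueued_at, payload) tuple. Messages older than
    max_age are appended to the sideline list to be processed later, so
    they do not delay newer messages.
    """
    fresh = []
    while True:
        try:
            enqueued_at, payload = q.get_nowait()
        except queue.Empty:
            break
        if now - enqueued_at > max_age:
            sideline.append((enqueued_at, payload))
        else:
            fresh.append(payload)
    return fresh
```

For example, with `q = queue.Queue(maxsize=2)`, a third `enqueue` call returns `False` until something is drained, which pushes the slowdown back onto the producer instead of letting the backlog grow without limit.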
Then @dyanacek has more with "Using load shedding to avoid overload". Avoid taking on more work when you know you can't cope. https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/?did=ba_card&trk=ba_card
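One simple way to picture load shedding is a cap on in-flight requests: once the cap is reached, new requests are rejected quickly instead of piling up. The sketch below is only an illustration under that assumption; the `LoadShedder` class, the `handle` function, and the 503-style return value are made-up names, not the mechanism the article describes.

```python
import threading


class LoadShedder:
    """Track in-flight requests and refuse new ones above a fixed limit."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Reserve a slot; return False if the server is already at capacity."""
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Free a slot once a request has finished."""
        with self._lock:
            self._in_flight -= 1


def handle(shedder, work):
    """Run work() if there is capacity, otherwise reject immediately."""
    if not shedder.try_acquire():
        return 503, "overloaded, try again later"
    try:
        return 200, work()
    finally:
        shedder.release()
```

Rejecting early is cheap, so the requests that are accepted still finish within their timeouts instead of everything slowing down together.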
You can imagine AWS rolls out a heck of a lot of software changes globally to MASSIVE distributed systems. How do we roll back when things go wrong? "Ensuring rollback safety during deployments" explains how we do it: https://aws.amazon.com/builders-library/ensuring-rollback-safety-during-deployments/?did=ba_card&trk=ba_card
Pair this with "Going faster with continuous delivery": how we reduce the risk that a defect will reach customers and how we drastically up the speed of execution. https://aws.amazon.com/builders-library/going-faster-with-continuous-delivery/
and diving even deeper into our deployment process with "Automating safe, hands-off deployments". @clare_liguori explains how we do staggered wave deployments to internal service "cells" across 24 Regions & 76 Availability Zones. https://aws.amazon.com/builders-library/automating-safe-hands-off-deployments/