Thursday, June 2, 2016

DevOps, Availability, and Risk

Excellent points from an episode of Arrested DevOps entitled "Who Owns Your Availability?" (TL;DR: you do!) https://www.arresteddevops.com/availability/

My thoughts:

- Technical risk can produce business risk, from "your hundred employees can't do anything for an hour" up to "your database is gone, therefore the company is gone." There's also the "feature X doesn't work for user class Y" kind of risk. Which do you as a business prioritize: customers paying you, you delivering their stuff, your admins/phone people delighting your customers, or your developers fixing bugs?

From the show (Charity Majors, Pete Cheslock, ADO crew). Anything in quotes is my foggy recollection, not a verbatim quote:

- cache ("vendor") your dependencies

If you can't deploy to production because GitHub or a third-party package server in China is down, things are not good.  Likewise, if your server is connecting to China even though all your packages are local, perhaps it's time for a security check. (If you don't know what servers your server is talking to, that's another risk.)
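
A minimal sketch of checking that everything you pin is actually in your local cache before a deploy. The layout is an assumption for illustration: a hypothetical vendor directory holding one `<name>-<version>.tar.gz` file per pinned dependency. Real package managers have their own vendoring mechanisms.

```python
from pathlib import Path


def missing_from_vendor(pinned, vendor_dir):
    """Return the pinned 'name==version' entries that have no archive in vendor_dir.

    Assumes a hypothetical layout: vendor_dir/<name>-<version>.tar.gz
    """
    missing = []
    for entry in pinned:
        name, _, version = entry.partition("==")
        archive = Path(vendor_dir) / f"{name}-{version}.tar.gz"
        if not archive.is_file():
            missing.append(entry)
    return missing
```

If this returns anything, the deploy still depends on somebody else's server being up.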

- what is your Risk Profile? What is considered acceptable risk?

When your company is just starting out, it's probably fine to rely on the internet always being available. Not being able to deploy for an hour or a day might be okay. Spending resources on growing your company might be a good tradeoff vs security and availability.

- your dependencies are cached. What about deps of deps of deps?
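
One way to answer the "deps of deps of deps" question is to walk the dependency graph and compare the result against what you have cached. This is only a sketch: the `direct_deps` mapping and the package names are made up, and in practice the graph would come from your package manager's metadata.

```python
def transitive_deps(package, direct_deps):
    """Return every package reachable from `package`, excluding `package` itself.

    direct_deps maps a package name to the list of packages it depends on directly.
    """
    seen = set()
    stack = list(direct_deps.get(package, []))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue  # already visited; also guards against dependency cycles
        seen.add(dep)
        stack.extend(direct_deps.get(dep, []))
    seen.discard(package)  # a cycle may lead back to the starting package
    return seen


# Hypothetical graph: app -> web -> http -> tls, and app -> beer
graph = {"app": ["web", "beer"], "web": ["http"], "http": ["tls"], "beer": []}
cached = {"web", "beer"}
uncached = transitive_deps("app", graph) - cached
```

Here `uncached` ends up holding `http` and `tls`: the deps of deps that nobody remembered to cache.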

- "Packerize the base"

If your system has a baked, reliable base, with a little bit of changes on top, then it's easier to track down and fix things that break.

One mechanism is "baking" all your random dependencies into a Docker layer.  Another is a network volume, Amazon S3 for example (deb-s3).  It can still go down, but when it's up you get everything in one place, and it'll be there for you even if the original host is unhappy for whatever reason. One person mentioned she had more problems with GitHub's reliability than with her own.

Another failure mode: the known-good version is broken. Your business depends on the "beer-1.0" package. It's been working fine for months.  A developer gets drunk and uploads a broken package, but uses the same version number -- "beer-1.0" is now broken.  You can no longer make changes to your business!  Since you own your availability, it's your problem.
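
One common guard against the "same version number, different contents" problem is to record a checksum of the artifact while it is known to work, and refuse anything that doesn't match later. A minimal sketch, assuming you store the expected SHA-256 hex digest somewhere you control; the function names are mine, not from the show:

```python
import hashlib


def sha256_of(path):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_known_good(path, expected_sha256):
    """True only if the file's contents match the digest recorded earlier."""
    return sha256_of(path) == expected_sha256.lower()
```

The version string can lie; the digest of the bytes you tested can't be silently swapped out from under you.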

- "If you treat your devs like children, they'll act like children. They'll become subject matter experts on doing things the wrong way. We as devops can be spirit guides, career counselors for leveling up your skills." Developers own the code and its availability. Give them pagers and wake them up when the site has problems.

- The site should have "circuit breakers": a site in "continuous partial failure" is better than one that is just down for everyone, full stop.
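
For anyone who hasn't met the circuit breaker pattern, here is a minimal sketch. The class, its parameter names, and the thresholds are my own illustration, not anything from the show: after a few consecutive failures the breaker "opens" and serves a degraded fallback without calling the broken dependency, then lets one trial call through after a cool-down.

```python
import time


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    def call(self, func, fallback):
        """Run func(); on failure, or while the breaker is open, return fallback()."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()         # open: skip the broken dependency
            self.opened_at = None         # cool-down over: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0                 # success closes the breaker fully
        return result
```

Usage would look something like `breaker.call(fetch_recommendations, lambda: [])`, where `fetch_recommendations` is a hypothetical flaky call: that feature goes dark for a while, but the rest of the page still renders.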


I dig the Arrested DevOps podcast, and listen to it often. Thanks!


