Friday, June 17, 2016

Quality DevOps: installing and verifying Network Time Protocol (NTP)

I lurve Ansible. It lets me install or update software on one or 100 instances, easily. The entire system becomes a set of scripts to run and run and run again until I get things exactly the way I want them.

In today's devops ecosystem, where "infrastructure is code", how do we test our infrastructure?

Ansible gives us one way to do this. When we install or update a service, we can run a service-specific command to make doubly sure that things are working as expected. If something's not quite right, Ansible will abort and we can figure out what went kablooey.

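Save the following as "ntp.yml" and run it with ansible-playbook -vvi myhost ntp.yml. Note that -i takes an inventory; to pass a bare hostname instead of an inventory file, add a trailing comma (-i myhost,).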

Thanks to phillipuniverse!

# ntp.yml -- install NTP time sync daemon
# Adapted from https://gist.github.com/phillipuniverse/7721288#file-ntp_playbook-yml
#
# USAGE: ansible-playbook -vvi myhost ntp.yml
#
---

- hosts: all
  become: yes
  gather_facts: no

  tasks:

    - name: Install NTP
      apt: package=ntp state=present update_cache=yes
      tags: ntp

    - name: Make sure NTP is started up
      service: name=ntp state=started enabled=yes
      tags: ntp

    - name: verify NTP synchronized
      command: timedatectl status
      register: ntp_result
      # Read-only check: never report "changed"; fail if the clock is not in sync.
      changed_when: false
      failed_when: "'synchronized: yes' not in ntp_result.stdout"
      tags: ntp


  # No task above notifies this handler; it is here for tasks that change NTP configuration.
  handlers:
    - name: restart ntp
      service: name=ntp state=restarted
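
Right after a fresh install the clock may not have synchronized yet, so the verify task can fail on the first run. Here is a minimal sketch of one way to handle that, using Ansible's until/retries/delay task keywords. The retry count and delay are arbitrary values picked for illustration; they are not from the original gist.

```yaml
# ntp_verify.yml -- retrying variant of the verify step (a sketch)
---

- hosts: all
  become: yes
  gather_facts: no

  tasks:

    - name: verify NTP synchronized (with retries)
      command: timedatectl status
      register: ntp_result
      # Re-run the command until timedatectl reports a synchronized clock.
      until: "'synchronized: yes' in ntp_result.stdout"
      # Assumed values: up to 10 retries, 15 seconds apart.
      retries: 10
      delay: 15
      # A read-only check should never report "changed".
      changed_when: false
      tags: ntp
```

If the clock still isn't synchronized after the last retry, the task fails and Ansible aborts, same as before.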

Thursday, June 2, 2016

DevOps, Availability, and Risk

Excellent points from an episode of Arrested DevOps entitled "Who Owns Your Availability?" (TLDR: you do!) https://www.arresteddevops.com/availability/

My thoughts:

- Technical risk can produce business risk. It ranges from "your hundred employees can't do anything for an hour" up to "your database is gone, therefore the company is gone", with narrower cases like "feature X doesn't work for user class Y" in between. Which do you as a business prioritize: consumers paying you, you delivering their stuff, your admins/phone people delighting your customers, or your developers fixing bugs?

From the show (Charity Majors, Pete Cheslock, ADO crew). Anything in quotes is my foggy recollection, not the exact wording:

- cache ("vendor") your dependencies

If you can't deploy to production because GitHub or a third-party package server in China is down, things are not good. Likewise, if your server is connecting to China and all your packages are local, perhaps it's time for a security check. (If you don't know what servers your server is talking to, that's another risk.)

- what is your Risk Profile? What is considered acceptable risk?

When your company is just starting, it's probably fine to rely on the internet always being available. Not being able to deploy for an hour or a day might be okay. Spending resources on growing your company might be a good tradeoff vs security and availability.

- your dependencies are cached. What about deps of deps of deps?

- "Packerize the base"

If your system has a baked, reliable base, with a few changes on top, then it's easier to track down and fix things that break.

One mechanism is "baking" all your random dependencies into a Docker layer. Another is a network store that you control -- an apt repository on Amazon S3, for example (deb-s3). It can still go down, but if it's up you get everything in one place. It'll be there for you even if the original host is not happy for whatever reason. One person mentioned she had more problems with GitHub's reliability than her own.

Another failure mode: the known-good version is broken. Your business depends on the "beer-1.0" package. It's been working fine for months. A developer gets drunk and uploads a broken package, but uses the same version number -- "beer-1.0" is now broken. You can no longer make changes to your business! Since you own your availability, it's your problem.
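
One guard against a silently re-uploaded "beer-1.0" is to pin a checksum alongside the version, so a changed file under the same name fails loudly instead of getting installed. Here is a sketch using Ansible's get_url module and its checksum parameter. The URL, destination path, and digest are placeholders I made up for illustration, not real values.

```yaml
# fetch_beer.yml -- download a package only if it matches a pinned checksum (a sketch)
---

- hosts: all
  become: yes
  gather_facts: no

  tasks:

    - name: Download beer-1.0, refusing anything that is not the known-good file
      get_url:
        # Hypothetical package server; substitute your own mirror.
        url: https://packages.example.com/beer-1.0.deb
        dest: /tmp/beer-1.0.deb
        # Placeholder: record the real sha256 when you first vet the package.
        checksum: "sha256:<known-good sha256 hex digest>"
```

If the downloaded file's digest doesn't match, the task fails and the play stops before the broken package goes anywhere.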

- "if you treat your devs like children, they'll act like children. They'll become subject matter experts on doing things the wrong way. We as devops can be spirit guides, career counselors for your leveling up skills." Developers own the code, the availability. Give them pagers and wake them up when the site has problems.

- The site should have "circuit breakers": if the site is in "continuous partial failure", that's better than just being down for everyone, full stop.


I dig the Arrested DevOps podcast, and listen to it often. Thanks!