Reading List: DevOps (Build / Deploy Automation, Monitoring, Logging, Instrumentation)

General Info
Docker Containers
Docker – data-mgmt-specific
Docker Swarm-specific
Docker on Windows: Linux-containers-specific
Docker on Windows: Windows-containers-specific
Docker Security / Image size optimization
Docker Monitoring
Docker – build-pipeline-specific
Kubernetes container cluster-manager
Nomad Container and generic app Mgmt.
Rancher Container Mgmt.
Mesos DCOS / Mesosphere / Marathon Framework
Key-Value Stores and Service Discovery

Config-mgmt and versioning
Change Management
Logging and Instrumentation
QA-integration
General Build
Jenkins
Continuous Deployment / Delivery
Lean DevOps
Database-specific DevOps
Linux Administration / Security
Azure-specific
AWS-specific

General Info

  • Project Execution Methodologies – The Change (infographic) – Waterfall, Agile, and DevOps shown as color-coded lines
  • Subbu: Don’t Build Private Clouds – typical phases of the data center -> cloud journey: 1) private cloud; 2) move stateless monoliths; 3) deal with stateful monoliths; 4) transform to cloud-native
  • Logicalis: How DevOps accelerates innovation (infographic) – process, people & tools, culture, overall benefits
  • Puppet: 2015 State of DevOps Report – covers the impact of lean management and continuous delivery on culture and performance; application architecture and developer productivity; how IT managers can help their teams win; burnout; and methodology. Key findings: high-performing IT organizations deploy 30x more frequently with 200x shorter lead times (debunking the myth that we need to choose between speed and reliability); they have 60x fewer failed deployments and recover (MTTR) 168x faster (failures are unavoidable, but how quickly you detect and recover from failure can mean the difference between leading the market and struggling to catch up with the competition); lean management and continuous delivery practices create the conditions for delivering value faster, sustainably; high performance is achievable whether your apps are greenfield, brownfield or legacy (continuous delivery can be applied to any system, provided it is architected correctly: most testing can be done without an integrated environment, and it can be deployed/released independently of the other applications/microservices it depends on. We also found that high performers are more likely to use a microservices architecture…); deployment pain can tell you a lot about your IT performance. Throughput measures: deployment frequency, deployment lead time. Stability measure: mean time to recover (MTTR). Why culture matters: pathological, bureaucratic, generative
  • slalom: 5 ways to incorporate DevOps into your software delivery process – 1) enable entire team to work together “Breaking down silos and bringing people together is the MOST IMPORTANT part of DevOps”; embracing agile is a major tenet in DevOps culture. Agile works aggressively toward bringing your teams together by restructuring work and introducing feedback along the way; 2) automate everything! “treat[ing] your server configuration like developers treat code.” Extract out environmentally-specific application properties into configuration files stored in source control applied using a configuration management system. That is the key to automation, and the cornerstone of DevOps; The only difference between dev and production should really boil down to a set of connection strings and environment variables; 3) Everyone is responsible for production; if you don’t task developers with production duties, they won’t write production-optimized code; 4) Get obsessed with tests, then automate them, too; automated tests have to be written not only for your code coverage, but for your infrastructure scripts as well; 5) Become comfortable deploying frequently to production
  • Alex King: A 10,000ft View of DevOps at Gogo (38min vid) – tooling, not developers, should do the mundane work; change management; Foremast templating tool for Spinnaker; canary deployments in prod instead of separate dev/stage/prod
  • Sasha Rosenbaum: Single Person of Failure (19min vid) – slideshare – imagine buying a server that has an uptime of 16 hrs a day, with interruptions! Humans are not highly available; antipattern #1: “you shall not pass to my production server”; “even when systems are automated there are still humans who manage them”; “why is there a single admin? Situation often evolves organically from a small team”; Solutions: role-based access, use service accounts not personal accounts for services; make sure person on call has necessary access; trust your people; antipattern #2: “be aware of the single expert”; a quote we’ve all heard: “this will take me 8 hrs to explain vs. 15mins to fix”; can you afford losing this knowledge? – delegate to juniors; new hires haven’t yet caught the “this is how it’s always been” virus; you are emotionally invested in your code; Solutions: documentation, comments, tests, automation; antipattern #3: “I cannot afford to take vacation!” Job security? Research shows that working longer hours does not increase productivity; solution: Game days – intentionally breaking infrastructure in simulated-production or even actual production (off-hours)
  • Jeff Sussna: Why DevOps Really Is About Culture – trend away from exhaustive specs and command & control work assignments, toward empowering decision-making; away from snowflake servers toward standardized template configurations
  • DeGrandis: Devops: A Software Revolution in the Making?
  • Andrew Phillips: No Quick Fix for DevOps – emphasis on dev and ops collaborating / working together vs. just being resources
  • Agile Sysadmin: Kanban for Sysadmin
  • LessThanDot: Applying Kanban to IT Processes (Part 2): Help Desk / Support Scenario
  • Matthew Skelton: What Team Structure is Right for DevOps to Flourish? – describes both patterns and anti-patterns
  • 26thCentury: Test Automation
  • Twitter: #ConwaysLaw
  • Twitter: #DevOps
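
The throughput and stability measures named in the State of DevOps entry above (deployment frequency, deployment lead time, MTTR) can be computed from simple deployment and incident records. The following is a minimal Python sketch, not the report's methodology: the `Deployment` and `Incident` record shapes and all function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Deployment:
    committed_at: datetime  # hypothetical field: when the change was committed
    deployed_at: datetime   # hypothetical field: when it reached production


@dataclass
class Incident:
    started_at: datetime    # hypothetical field: when the failure began
    resolved_at: datetime   # hypothetical field: when service was restored


def deployment_frequency(deploys, period_days):
    """Throughput: deployments per day over the observed period."""
    return len(deploys) / period_days


def mean_lead_time(deploys):
    """Throughput: average time from commit to production."""
    seconds = mean(
        (d.deployed_at - d.committed_at).total_seconds() for d in deploys
    )
    return timedelta(seconds=seconds)


def mean_time_to_recover(incidents):
    """Stability: average time from failure start to restoration."""
    seconds = mean(
        (i.resolved_at - i.started_at).total_seconds() for i in incidents
    )
    return timedelta(seconds=seconds)
```

Tracking these three numbers over time, rather than as one-off snapshots, is what makes the comparison between throughput and stability meaningful.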

Docker – general

Docker – Swarm-specific

Docker – data-mgmt-specific

Docker on Windows: Linux-containers-specific

Docker on Windows: Windows-containers-specific

Docker Security / Image size optimization

Docker Monitoring

  • Brian Christner: How to setup Docker Monitoring – cAdvisor, InfluxDB, Grafana
  • Sysdig – container-optimized monitoring and troubleshooting tool; see this Swarm GitHub comment from Alexandre Beslic of Docker recommending Sysdig for self-healing in Swarm; Sysdig also has Rancher’s endorsement in the form of strong integration
  • Docker: Nathan LeClaire: Realtime Cluster Monitoring with Docker Swarm and Riemann – push-based monitoring; compatible with Graphite, InfluxDB, Librato et al.; Riemann server, health reporting, dashboard UI
  • Joyent: Simplifying service discovery in Docker with Containerbuddy – “Discovery services like Consul provide a means of performing health checks from outside our container, but that means packaging the tooling we need into the Consul container. If we need to change the health check, then we end up re-deploying both our application and Consul, which unnecessarily couples the two…. Containerbuddy to the rescue!… Containerbuddy registers the application with Consul on start and periodically sends TTL health checks to Consul; should the application fail then Consul will not receive the health check and once the TTL expires will no longer consider the application node healthy. Meanwhile, Containerbuddy runs background workers that poll Consul, checking for changes in dependent/upstream service, and calling an external executable on change.”
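
The TTL health-check pattern described in the Containerbuddy entry above (the application side sends periodic heartbeats, and the registry treats the node as unhealthy once the TTL lapses) can be sketched in a few lines of Python. This is an in-memory illustration only: `TTLRegistry` and its methods are hypothetical names, and it does not use or imitate the real Consul or Containerbuddy APIs.

```python
import time


class TTLRegistry:
    """In-memory stand-in for a discovery service that tracks TTL checks."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the example can be driven by a fake clock
        self._checks = {}     # service name -> (ttl_seconds, deadline)

    def register(self, name, ttl_seconds):
        """Register a service and start its first TTL window."""
        self._checks[name] = (ttl_seconds, self._clock() + ttl_seconds)

    def heartbeat(self, name):
        """Called periodically by the application side to renew the TTL."""
        ttl_seconds, _ = self._checks[name]
        self._checks[name] = (ttl_seconds, self._clock() + ttl_seconds)

    def is_healthy(self, name):
        """A service is healthy only while its TTL has not expired."""
        check = self._checks.get(name)
        return check is not None and self._clock() < check[1]
```

The point of the design is that the registry never has to reach into the container: silence from the application is itself the failure signal.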

Docker – build-pipeline-specific

Kubernetes container cluster manager

  • Eric Brewer: GCPNext keynote on Kubernetes and config (1hr vid)
  • KubeCon “Cloud-Scale Kubernetes at eBay” (18min vid) – case study of how Kubernetes is being used at eBay; shows how inflexible static provisioning is compared with a pool of resources managed by Mesos; eBay is a pro-open-source company whose first choice is always to use, or use and adapt, an open source tool; Kubernetes lets you declare your intent and simply “Run”, vs. a traditional Provision -> Deploy -> Monitor -> Remediate cycle; for network routing, they are planning to use BGP all the way to the host containers; local storage leases help with databases like Cassandra
  • Kubernetes 101 – Kubectl CLI and Pods – Kubectl CLI; Pod management, volumes, volume types, multiple containers
  • Kubernetes 201 – Labels, Replication Controllers, Services and Health Checking – Labels, Replication Controllers, Services, Health Checking
  • Google: Borg, Omega, and Kubernetes: Lessons learned from three container-management systems over a decade – long but worthwhile article about Google’s evolution of container technologies; “the container has become the sole runnable entity supported by the Google infrastructure”; “Building management APIs around containers rather than machines shifts the “primary key” of the data center from machine to application”; “The design of Kubernetes as a combination of microservices and small control loops is an example of control through choreography—achieving a desired emergent behavior by combining the effects of separate, autonomous entities that collaborate”; things to avoid (e.g., having the container system manage port numbers; Kubernetes instead allocates an IP address per pod); use labels to group containers rather than numbering them; some open, hard problems: configuration (maintain a clean separation between computation and data, use a declarative form like JSON or YAML) and dependency management
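
Two ideas from the Borg/Omega/Kubernetes entry above, grouping containers by labels and converging on declared intent through small control loops, can be illustrated with a toy reconciliation pass. This is a sketch under stated assumptions: pods are plain dictionaries, and `matches` and `reconcile` are hypothetical helpers, not part of any Kubernetes client library.

```python
def matches(labels, selector):
    """True when every key/value pair in the selector appears in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def reconcile(pods, selector, desired_replicas):
    """One pass of a control loop.

    Compares the observed pods selected by label against the declared
    replica count and returns the actions needed to converge.
    """
    selected = [pod for pod in pods if matches(pod["labels"], selector)]
    surplus = len(selected) - desired_replicas
    if surplus > 0:
        # Too many replicas: delete the extras.
        return [("delete", pod["name"]) for pod in selected[:surplus]]
    if surplus < 0:
        # Too few replicas: request new ones (names assigned elsewhere).
        return [("create", None)] * (-surplus)
    return []  # observed state already matches declared intent
```

Because the loop only compares observed state with desired state, it can be re-run at any time and will produce no actions once the two agree.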

Nomad Container and generic app Mgmt.

Rancher Container Mgmt.

Mesos DCOS / Mesosphere / Marathon Container Mgmt.

  • Adrian Mouat: Swarm v. Fleet v. Kubernetes v. Mesos – “Swarm has the advantage (and disadvantage) of using the standard Docker interface. Whilst this makes it very simple to use Swarm and to integrate it into existing workflows, it may also make it more difficult to support the more complex scheduling that may be defined in custom interfaces; Kubernetes is an opinionated orchestration tool that comes with service discovery and replication baked-in. It may require some re-designing of existing applications, but used correctly will result in a fault-tolerant and scalable system; Mesos is a low-level, battle-hardened scheduler that supports several frameworks for container orchestration including Marathon, Kubernetes, and Swarm”

Key-Value Stores and Service Discovery

Config-mgmt and versioning

Change Management

  • Tarique Smith: Will The DevOps Movement Be The Death Of Change Management – “Historically, testing and staging environments (where the heavy lifting of testing takes place) have been underfunded. As a result, they have not always been at the same maintenance level or configuration as the target production environment(s). This mismatch often caused testing results to be suspect, contributing to a lack of trust between development and operations. As a result, testing–a pillar of the waterfall methodology–has traditionally only been able to ensure certain aspects of an application’s testing complement”; “By automating testing and deployment through continuous integration and continuous deployment (CICD), change management stakeholders can shift their attention from traditional concerns, such as separation of tasks and back-out plans. This leaves them with the ability to focus on curation and review of the DevOps/CICD process and standards in use by the development teams requesting or deploying a change”

Logging and Instrumentation

  • Peter Bourgon: Logging v. Instrumentation – services should only log actionable information; logs read by humans should be sparse, ideally silent if nothing is going wrong; logs read by machines should be well-defined, ideally with a versioned schema; avoid multiple production log levels (info, warn, error) and especially runtime level configuration; an exception is debug logging, which is useful during development and problem diagnosis; it’s the responsibility of your operating environment or infrastructure to route process or container stdout/stderr to the appropriate destination; in contrast to logging, services should instrument every meaningful number available for capture
  • Aditya Mukerjee: Don’t Read your Logs – “reading individual log lines is (almost) always a sign that there are gaps in your system’s monitoring tools”; Antipatterns: Logs as Metrics, Logs as Debugger Tracing, Logs as Error Reporting, Logs as Durable Records; “Don’t Stop Logging, But Stop Reading Log Lines”
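
The split between logging and instrumentation described in the entries above can be sketched as follows: every request is counted, but only the actionable failure case produces a log line, and that line is machine-readable with an explicit schema version. This is a minimal illustration, not code from either article; `SCHEMA_VERSION`, `metrics`, `log_event`, and `handle_request` are hypothetical names.

```python
import json
import sys
from collections import Counter

SCHEMA_VERSION = 1   # hypothetical version number for the log record schema

metrics = Counter()  # in-process counters; a real service would export these


def log_event(event, stream=sys.stdout, **fields):
    """Emit one machine-readable log line carrying its schema version."""
    record = {"schema": SCHEMA_VERSION, "event": event, **fields}
    stream.write(json.dumps(record, sort_keys=True) + "\n")


def handle_request(path, ok, stream=sys.stdout):
    """Instrument every request; log only the actionable failure case."""
    metrics["requests_total"] += 1
    if not ok:
        metrics["request_errors_total"] += 1
        log_event("request_failed", stream=stream, path=path)
```

Writing to stdout and leaving routing to the operating environment matches the division of responsibility the first entry recommends.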

QA-integration

General Build

Jenkins

Continuous Deployment / Delivery

  • Thoughtworks: It’s not CI, it’s just CI theatre – advocates for trunk-based development; Jez Humble’s definition of CI: “CI developers integrate all their work into trunk (also known as mainline or master) on a regular basis (at least daily)”; symptoms of ‘CI theatre’ include: long-lived branches, poor test coverage, allowing red builds for long periods; ‘continuous isolation’ – the practice of running CI against feature branches; “frequency reduces difficulty”; trunk-based development “brings the pain forward rather than storing it up for merges, code reviews or delaying releases”
  • Humble / Farley: Continuous Delivery: Anatomy of the Deployment Pipeline – freely-available chapter from their Continuous Delivery book
  • Jeff Sussna: Why We Should Continuously Break Everything – reduce overall cost of failure by continuously refactoring at all levels of the IT stack, including devops automation
  • Jeff Sussna: Microservices, Have You Met… DevOps? – with the promise of increased agility and improved quality, microservice architectures represent a shift from complicated systems, where stability is paramount, to complex interconnected systems, where resilience matters most; each microservice must design for failure by treating its dependencies as it would any third-party service
  • Enabling Microservices @Orbitz (38min vid) – Key quote: “Get from code to production with as little people involvement as possible”; Where they started: a Conway’s-law siloed org, with dev using one tool for deployments and ops using another, and infrequent, large, inefficient multi-month release cycles; Where they arrived: fully automated build pipeline from time of code review; multiple releases per day; automated environment supervision avoids downtime; Key technologies used: Docker for repeatable applications; Chef for repeatable infrastructure; Jenkins for repeatable releases; Other technologies used: Consul as service registry; ElasticSearch / LogStash (LogStash forwards logs from each localhost); Graphite for metrics; Mesos / Marathon for launching and supervision; HAProxy for front-end proxy in conjunction with Bamboo; Ansible for deploys (through Marathon); Chef for VM provisioning; Atlassian Stash (now Bitbucket) Git repo; Other notes: following The Twelve-Factor App pattern – configs simple (sometimes as few as one specified param), stateless services where possible; all Jenkins slaves are Docker containers running on Mesos; easy to create pipelines – they just pass app name and version; smoke / acceptance testing may be done in any target environment as part of the build pipeline
  • DeGrandis: Getting to one button deploy using Kanban
  • Phillips / Kawaguchi: Orchestrating your Delivery Pipelines with Jenkins – tips on sharing build artifacts through the pipeline, approval via Promoted Builds Plugin
  • Cloudbees: Another Look at the Jenkins Promoted Builds Plugin – example of dev -> QA / release-mgmt handoff
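
The Twelve-Factor App configuration style mentioned in the Orbitz entry above keeps environment-specific values out of the code and reads them from the environment at startup. A minimal Python sketch, assuming hypothetical variable names (`DATABASE_URL`, `PORT`) and a hypothetical `Settings` type chosen for illustration:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str  # connection string, differs per environment
    port: int          # listening port, optional with a default


def load_settings(env=os.environ):
    """Read environment-specific values; everything else lives in code."""
    return Settings(
        database_url=env["DATABASE_URL"],   # required: raises KeyError if missing
        port=int(env.get("PORT", "8080")),  # optional: falls back to 8080
    )
```

Passing the environment in as a mapping keeps the loader easy to test, and failing fast on a missing required value surfaces misconfiguration at deploy time rather than at first use.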

Lean DevOps

Database-specific DevOps

Linux Administration / Security

Azure-specific

AWS-specific
