Reading List: DevOps (Build / Deploy Automation, Monitoring, Logging, Instrumentation)

General Info
Enablement
SRE
IAC
Config-mgmt and versioning
Change Management
Strategy and Planning
Serverless
Chaos Engineering
AWS Well-Architected Framework
Docker – general
Docker – data-mgmt-specific
Docker on Windows: Windows-containers-specific
Docker Security / Image size optimization
Docker Monitoring
Docker – build-pipeline-specific
Kubernetes
AWS ECS
Key-Value Stores and Service Discovery
Logging and Instrumentation
Observability, Monitoring & Alerting
QA-integration
General Build
Jenkins
Continuous Deployment / Delivery
Canaries
Lean DevOps
Database-specific DevOps
Azure-specific
AWS-specific

General Info

  • Microsoft’s DevOps Journey (130 slide deck)
  • Nirmal Mehta:  strong belief, loosely held: Bringing Empathy to IT (45min vid) – Pareto inefficient Nash Equilibrium; Docker – empathy as code
  • Project Execution Methodologies – The Change (infographic) – Waterfall, Agile, DevOps as color-coded lines
  • Subbu: Don’t Build Private Clouds – dc -> cloud journey typical phases: 1) private cloud; 2) move stateless monoliths; 3) Deal with stateful monoliths; 4) transform to cloud-native
  • Logicalis: How DevOps accelerates innovation (infographic) – process, people & tools, culture, overall benefits
  • Puppet: 2015 State of DevOps Report – impact of lean mgmt. & continuous delivery on culture & performance; application arch. & dev. productivity; how IT mgrs. can help their teams win; burnout; methodology. High-performing IT organizations deploy 30x more frequently with 200x shorter lead times (debunked the myth that we need to choose between speed and reliability); they have 60x fewer failed deployments and recover (MTTR) 168x faster (failures are unavoidable, but how quickly you detect and recover from failure can mean the difference between leading the market and struggling to catch up with the competition); lean management and continuous delivery practices create the conditions for delivering value faster, sustainably; high performance is achievable whether your apps are greenfield, brownfield or legacy (continuous delivery can be applied to any system, provided it is architected correctly – can do most testing without an integrated environment, deploy/release independently of other applications/microservices it depends on; “We also found that high performers are more likely to use a microservices architecture…”); deployment pain can tell you a lot about your IT performance. Throughput measures: deployment freq., deployment lead time. Stability measures: mean time to recover (MTTR). Why culture matters: pathological, bureaucratic, generative
  • slalom: 5 ways to incorporate DevOps into your software delivery process – 1) enable entire team to work together “Breaking down silos and bringing people together is the MOST IMPORTANT part of DevOps”; embracing agile is a major tenet in DevOps culture. Agile works aggressively toward bringing your teams together by restructuring work and introducing feedback along the way; 2) automate everything! “treat[ing] your server configuration like developers treat code.” Extract out environmentally-specific application properties into configuration files stored in source control applied using a configuration management system. That is the key to automation, and the cornerstone of DevOps; The only difference between dev and production should really boil down to a set of connection strings and environment variables; 3) Everyone is responsible for production; if you don’t task developers with production duties, they won’t write production-optimized code; 4) Get obsessed with tests, then automate them, too; automated tests have to be written not only for your code coverage, but for your infrastructure scripts as well; 5) Become comfortable deploying frequently to production
  • Alex King: A 10,000ft View of DevOps at Gogo (38min vid) – tooling should do mundane stuff, not developers; change management; foremast templating tool for spinnaker; canary deployments in prod instead of separate dev/stage/prod
  • Sasha Rosenbaum: Single Person of Failure (19min vid) – slideshare – imagine buying a server that has an uptime of 16 hrs a day, with interruptions! Humans are not highly available; antipattern #1: “you shall not pass to my production server”; “even when systems are automated there are still humans who manage them”; “why is there a single admin? Situation often evolves organically from a small team”; solutions: role-based access; use service accounts, not personal accounts, for services; make sure the person on call has the necessary access; trust your people; antipattern #2: “be aware of the single expert”; a quote we’ve all heard: “this will take me 8 hrs to explain vs. 15 mins to fix”; can you afford to lose this knowledge? – delegate to juniors; new hires haven’t yet caught the “this is how it’s always been” virus; you are emotionally invested in your code; solutions: documentation, comments, tests, automation; antipattern #3: “I cannot afford to take vacation!” Job security? Research shows that working longer hours does not increase productivity; solution: game days – intentionally breaking infrastructure in simulated production or even actual production (off-hours)
  • Jeff Sussna: Why DevOps Really Is About Culture – trend away from exhaustive specs and command & control work assignments, toward empowering decision-making; away from snow flake servers toward standardized template configurations
  • DeGrandis: Devops: A Software Revolution in the Making?
  • Andrew Phillips: No Quick Fix for DevOps – emphasis on dev and ops collaborating / working together vs. just being resources
  • Agile Sysadmin: Kanban for Sysadmin
  • LessThanDot: Applying Kanban to IT Processes (Part 2): Help Desk / Support Scenario
  • Matthew Skelton: What Team Structure is Right for DevOps to Flourish? – describes both patterns and anti-patterns
  • 26thCentury: Test Automation
  • Twitter: #ConwaysLaw
  • Twitter: #DevOps
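The slalom point above – that the only difference between dev and production should boil down to connection strings and environment variables – is twelve-factor-style config. A minimal sketch in Python (the variable names and defaults are illustrative, not from the article):

```python
import os

def get_config():
    """Read environment-specific settings from env vars, with dev defaults.

    APP_DB_URL and APP_LOG_LEVEL are made-up names; the point is that the
    same code runs everywhere and only the environment differs.
    """
    return {
        "db_url": os.environ.get("APP_DB_URL", "postgresql://localhost/dev"),
        "log_level": os.environ.get("APP_LOG_LEVEL", "DEBUG"),
    }
```

In production a configuration-management system (or the container runtime) injects the real values; the application code never branches on environment name.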

Enablement

  • Viktor Farcic: How To Shift Left Infrastructure Management Using Crossplane Compositions (28min vid) – you want to enable teams to serve themselves, not require them to create tickets for you every time something needs to be deployed or a resource created; crossplane.io is an unopinionated API allowing SRE / ops / platform teams to create opinionated frameworks to be consumed by dev / feature teams

SRE

  • Google: SRE vs. DevOps: competing standards or close friends? – DevOps practices and SRE implementation are very similar; “SREs work with stakeholders to decide on Service Level Indicators (SLIs) and Service Level Objectives (SLOs)”; “…SLA is a promise by a service provider, to a service consumer, about the availability of a service and the ramifications of failing to deliver the agreed-upon level of service”; “When SREs find tasks that can be automated, they work to engineer a solution to prevent that toil in the future… Google aims to ensure that at least 50% of each SRE’s time is spent doing engineering projects…”
  • Charity Majors: DevOps vs SRE: Delayed Coverage of the Dumbest War – compares google’s Site Reliability Engineering book with Effective DevOps, calls SRE “Ops with mgmt support”; “When error budgets are depleted, the focus shifts from feature development to improving reliability”
  • Charity Majors: Operational Best Practices: #Serverless – summary of Serverlessness, NoOps, the Tooth Fairy (37min vid) – “engineering cycles are going to be the scarcest resources anyone has”; “in the glorious serverless future, more responsibility for operational quality needs to come from application developers”; “operations is the constellation of your org’s technical skills, practices, and cultural values around designing, building and maintaining systems, shipping software, and solving problems with technology”; “labor can be outsourced. Caring can’t”; “ask every candidate interview questions about operations and debugging”; “never promote someone to sr. software engineer if they are net negative for operations”
  • Google: Narayan Desai: Beyond Goldilocks Reliability (39min vid) – Goldilocks Reliability: Define some SLIs, Choose “Just Right”; Assumptions: Just Right makes sense, There is one answer, We know the questions to ask, The answers don’t change; Problems: No model of reliability, Each Goldilocks metric provides a narrow window into behavior; The Trouble with Thresholds: 2-bucket dichotomy usually means lower fidelity and effectively loss of data, We have no basis to judge quality, No support for deeper insights; Reliability: Availability, Performance, Correctness; Stationarity Exposes Reliability Phenomena; Tantalizing Capabilities: Proactive reliability interventions, Data-driven prioritization of reliability investments
  • Treynor, Dahlin, Rau, Bey: The Calculus of Service Availability – this article expands upon the topic of SLOs to focus on service dependencies. Specifically, we look at how the availability of critical dependencies informs the availability of a service, and how to design in order to mitigate and minimize critical dependencies
  • Google: Implementing SLOs
  • Google: Example Error Budget Policy – goals, non-goals, miss-policy, etc.
  • Google: SRE home: SRE book online etc.
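The error-budget idea running through the entries above reduces to simple arithmetic: an availability SLO implies a fixed allowance of downtime per window, and the budget burns down as incidents consume it. A sketch of that arithmetic (the 30-day window is a common but illustrative choice):

```python
def error_budget_minutes(slo: float, window_minutes: int = 30 * 24 * 60) -> float:
    """Allowed downtime implied by an availability SLO over a window.

    e.g. a 99.9% SLO over ~30 days allows 0.1% of 43,200 min = 43.2 min.
    """
    return (1.0 - slo) * window_minutes

def budget_remaining(slo: float, downtime_minutes: float,
                     window_minutes: int = 30 * 24 * 60) -> float:
    """Fraction of the error budget still unspent (negative = budget blown).

    Per the Google policy doc above, a depleted budget shifts focus from
    feature work to reliability work.
    """
    budget = error_budget_minutes(slo, window_minutes)
    return (budget - downtime_minutes) / budget
```

For example, 21.6 minutes of downtime against a 99.9% SLO leaves half the monthly budget.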

IAC

Config-mgmt and versioning

Change Management

  • Tarique Smith: Will The DevOps Movement Be The Death Of Change Management – “Historically, testing and staging environments (where the heavy lifting of testing takes place) have been underfunded. As a result, they have not always been at the same maintenance level or configuration as the target production environment(s). This mismatch often caused testing results to be suspect, contributing to a lack of trust between development and operations. As a result, testing–a pillar of the waterfall methodology–has traditionally only been able to ensure certain aspects of an application’s testing complement”; “By automating testing and deployment through continuous integration and continuous deployment (CICD), change management stakeholders can shift their attention from traditional concerns, such as separation of tasks and back-out plans. This leaves them with the ability to focus on curation and review of the DevOps/CICD process and standards in use by the development teams requesting or deploying a change”

Strategy and Planning

Serverless

Chaos Engineering

AWS Well-Architected Framework

Docker – general

Docker – data-mgmt-specific

Docker on Windows: Windows-containers-specific

Docker Security / Image size optimization

Docker Monitoring

  • Brian Christner: How to setup Docker Monitoring – cAdvisor, InfluxDB, Grafana
  • Sysdig – Container-optimized monitoring and troubleshooting tool – see this swarm github comment from Alexandre Beslic of Docker recommending Sysdig to be used for self-healing for Swarm; Sysdig has Rancher endorsement in the form of strong integration
  • Docker: Nathan LeClaire: Realtime Cluster Monitoring with Docker Swarm and Riemann – push-based monitoring; compat. with Graphite, InfluxDB, Librato et al.; riemann server, health reporting, dashboard UI
  • Joyent: Simplifying service discovery in Docker with Containerbuddy – “Discovery services like Consul provide a means of performing health checks from outside our container, but that means packaging the tooling we need into the Consul container. If we need to change the health check, then we end up re-deploying both our application and Consul, which unnecessarily couples the two…. Containerbuddy to the rescue!… Containerbuddy registers the application with Consul on start and periodically sends TTL health checks to Consul; should the application fail then Consul will not receive the health check and once the TTL expires will no longer consider the application node healthy. Meanwhile, Containerbuddy runs background workers that poll Consul, checking for changes in dependent/upstream service, and calling an external executable on change.”
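The Containerbuddy TTL pattern described above is essentially a sidecar loop: pass the check while the app is healthy, go silent when it isn't, and let Consul's TTL expiry mark the node unhealthy. A rough sketch against Consul's agent HTTP API (the check id and health probe are hypothetical; the endpoint path follows Consul's documented agent API):

```python
import time
import urllib.request

CONSUL = "http://localhost:8500"
CHECK_ID = "my-app-ttl"  # hypothetical: a check registered in Consul with a TTL

def _send_pass():
    # PUT /v1/agent/check/pass/<check_id> resets the TTL clock in Consul.
    req = urllib.request.Request(
        f"{CONSUL}/v1/agent/check/pass/{CHECK_ID}", method="PUT")
    urllib.request.urlopen(req)

def app_is_healthy() -> bool:
    """Placeholder for a real probe, e.g. hitting the app's /health endpoint."""
    return True

def heartbeat_once(send=_send_pass) -> bool:
    """Pass the TTL check only while the app is healthy.

    If we stop sending, the TTL expires and Consul stops considering the
    node healthy -- the failure semantics the Joyent post describes.
    """
    if not app_is_healthy():
        return False
    send()
    return True

def heartbeat_loop(interval_s: float = 5.0) -> None:
    while True:
        heartbeat_once()
        time.sleep(interval_s)
```

The `send` parameter is injectable only so the loop's logic can be exercised without a running Consul agent.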

Docker – build-pipeline-specific

Kubernetes

  • Brian Grant: What is Kubernetes: An architectural view (slides)
  • Eric Brewer: GCPNext keynote on Kubernetes and config (1hr vid)
  • KubeCon: “Cloud-Scale Kubernetes at eBay” (18min vid) – case study of how Kubernetes is being used at eBay; shows how inflexible static provisioning is vs. the pool of resources managed by Mesos; eBay is a pro-open-source company – their first choice is always to use or use-and-adapt an open source tool; Kubernetes lets you declare your intent and simply “Run”, vs. a traditional Provision -> Deploy -> Monitor -> Remediate cycle; for network routing, they are planning to use BGP all the way to the host containers; local storage leases help with dbs like Cassandra
  • Kubernetes 101 – Kubectl CLI and Pods – Kubectl CLI; Pod management, volumes, volume types, multiple containers
  • Kubernetes 201 – Labels, Replication Controllers, Services and Health Checking – Labels, Replication Controllers, Services, Health Checking
  • Google: Borg, Omega, and Kubernetes Lessons learned from three container-management systems over a decade – long but worth-it article about google’s evolution of container technologies; “the container has become the sole runnable entity supported by the Google infrastructure”; “Building management APIs around containers rather than machines shifts the “primary key” of the data center from machine to application”; “The design of Kubernetes as a combination of microservices and small control loops is an example of control through choreography—achieving a desired emergent behavior by combining the effects of separate, autonomous entities that collaborate”; things to avoid (e.g., have Container system manage port nbrs. – Kubernetes instead allocates a “service vip” per pod); use labels to group containers rather than numbering them; some open, hard problems: Configuration – maintain a clean separation between computation and data, use declarative form like JSON or YAML; dependency management

AWS ECS

Key-Value Stores and Service Discovery

Logging and Instrumentation

  • Peter Bourgon: Logging v. Instrumentation – services should only log actionable information; Logs read by humans should be sparse, ideally silent if nothing is going wrong. Logs read by machines should be well-defined, ideally with a versioned schema; Avoid multiple production log levels (info, warn, error) and especially runtime level configuration; An exception is debug logging, which is useful during development and problem diagnosis; It’s the responsibility of your operating environment or infrastructure to route process or container stdout/stderr to the appropriate destination;  In contrast to logging, services should instrument every meaningful number available for capture
  • Aditya Mukerjee: Don’t Read your Logs – “reading individual log lines is (almost) always a sign that there are gaps in your system’s monitoring tools”; Antipatterns: Logs as Metrics, Logs as Debugger Tracing, Logs as Error Reporting, Logs as Durable Records; “Don’t Stop Logging, But Stop Reading Log Lines”
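Bourgon's split above – sparse, schema-versioned, machine-readable logs on one side; a counter for every meaningful number on the other – can be sketched minimally (the in-process `Counter` stands in for a real metrics client such as StatsD or Prometheus):

```python
import json
import sys
from collections import Counter

METRICS = Counter()  # stand-in for a real metrics client

def log_event(event: str, **fields) -> str:
    """Emit a machine-readable, schema-versioned log line.

    Per Bourgon: logs read by machines should be well-defined, ideally with
    a versioned schema, and routing stdout/stderr onward is the job of the
    operating environment, not the service.
    """
    record = {"schema": 1, "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    print(line, file=sys.stderr)
    return line

def instrument(name: str, value: int = 1) -> None:
    """Count every meaningful number; dashboards and alerts read metrics,
    not log lines (Mukerjee's 'stop reading log lines')."""
    METRICS[name] += value
```

A service would call `instrument("payments.failed")` on every failure and `log_event(...)` only with actionable context.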

Observability, Monitoring & Alerting

  • Cory Watson: No-Nonsense Observability Improvement (31min vid, slides) – “The Normal Zone” includes monitoring for anticipated behaviors; the “Weird Zone” is about observability of unanticipated behaviors; observability will be one of your most expensive projects; traditional incident measures incl. MTTD, MTTR can be lame – instead look to Nora Jones’ cyclic approach: difficulties in understanding, system-specific failure rates, surprises, lack of ownership, near misses; automation – need to avoid the human being “out of the loop”; invest in risk and need; understand the use case
  • Cindy Sridharan: Monitoring and Observability – “Why call it monitoring? That’s not sexy enough anymore.”; “Observability, because rebranding Ops as DevOps wasn’t bad enough, now they’re devopsifying monitoring too”; ““whitebox monitoring”, which refers to a category of monitoring based on the information derived from the internals of systems.”; “Monitoring is for symptom based Alerting”; “Signals that are collected, but not exposed in any prebaked dashboard nor used by any alert, are candidates for removal.”; “…in essence “observability” captures what “monitoring” doesn’t (and ideally, shouldn’t).”; “…unlike “monitoring” which is known failure centric, “observability” doesn’t necessarily have to be closely tied to an outage or a user complaint. It can be used as a way to better understand system performance and behavior, even during the what can be perceived as “normal” operation of a system.”; “The process of examining the evidence (observations) at hand and being able to deduce still requires a good understanding of the system, the domain as well as a good sense of intuition. No amount of “observability” or “monitoring” tooling can ever be a substitute to good engineering intuition and instincts.”
  • Charity Majors: Observability — A 3-Year Retrospective – “Monitoring tools are effective for systems with a stable set of known-unknowns, and relatively few unknown-unknowns. For a system with predominantly unknown-unknowns, monitoring tools were all but useless.”; “Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.“; “A system is observable to the extent that you can understand new internal system states without having to guess, pattern-match, or ship new code to understand that state.”; “The unknown-unknowns now rapidly outpace monitoring dashboards capability to explain them to the humans responsible for continuous uptime, reliability and acceptable performance.”; “vendors had latched on to “distributed tracing, metrics, and logs” as “three pillars of observability.” Ben Sigelman neatly debunked this, saying: it makes no sense because those are just three data types. You may achieve [observability] with all three, or none — what matters is what you do with the data, not the data itself.
  • Verica: MTTR is a Misleading Metric—Now What? – “The second problem with MTTx metrics is they are trying to simplify something that is inherently complex. They tell us little about what an incident is really like for the organization, which can vary wildly in terms of the number of people and teams involved, the level of stress, what is needed technically and organizationally to fix it, and what the team learned as a result. MTTx (along with other data like severity, impact, count, and so on) are what John Allspaw calls “shallow” incident data.”; “Vanessa Huerta Granda, a Solutions Engineer at Jeli, has an excellent post detailing a process of using MTTR and incident count metrics as a way to “set the direction of the analysis we do around our entire incident universe.””; “If quantitative metrics are inescapable, we suggest focusing on Service Level Objectives (SLOs) and cost of coordination data.”
  • Andrews / Lê-Quôc: Collecting Metrics Using StatsD, a Standard for Real-Time Monitoring – “StatsD is a standard and, by extension, a set of tools that can be used to send, collect, and aggregate custom metrics from any application.”
  • Alex King: DevOps Meets Observability – Come Meet the Pyramid of Happiness! – Tier 1: Generation (tracing, logging, metrics); Tier 2: Ingestion and Monitoring; Tier 3: Alerting
  • Charity Majors on Observability and Understanding the Operational Ramifications of a System – “Engineers are now talking about observability instead of monitoring, about unknown-unknowns instead of known-unknowns”; “It will always be the engineer’s responsibility to understand the operational ramifications and failure models of what we’re building, auto-remediate the ones we can, fail gracefully where we can’t, and shift as much operational load to the providers whose core competency it is”; “Don’t attempt to “monitor everything”. You can’t. Engineers often waste so much time doing this that they lose track of the critical path, and their important alerts drown in fluff and cruft”; “In the chaotic future we’re all hurtling toward, you actually have to have the discipline to have radically fewer paging alerts – not more”; “… the health of the system no longer matters.  We’ve entered an era where what matters is the health of each individual event, or each individual user’s experience, or each shopping cart’s experience (or other high cardinality dimensions).  With distributed systems you don’t care about the health of the system, you care about the health of the event or the slice”
  • Michael Kopp: Why Averages Suck and Percentiles are Great – looking at 50th percentile (median) and other percentiles like 90th give a lot more insight into how prevalent outliers are
  • Adrian Cockcroft: Who monitors the monitoring systems? – “… it would be good to compare the common metrics across different monitoring systems to analyze how much variance there is. This could be done by looking for simple differences, or using a statistical technique called gauge repeatability and reproducibility”
  • Mark McDonnell: Observability and Monitoring Best Practices – types; channels; contexts; know your graphs; choosing between a metric and a log
  • Cory Watson: Observability Crash Course – best-of-breed write-up links
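The StatsD wire format referenced in the Andrews / Lê-Quôc entry is just a plain-text datagram, <name>:<value>|<type> (optionally with a sample rate), fired over UDP. A minimal sketch (127.0.0.1:8125 is the conventional default for a local StatsD daemon):

```python
import socket

def statsd_packet(name: str, value: float, mtype: str = "c",
                  rate: float = 1.0) -> str:
    """Build a StatsD datagram: <name>:<value>|<type>[|@<sample rate>].

    Common types: c = counter, g = gauge, ms = timer.
    """
    pkt = f"{name}:{value}|{mtype}"
    if rate < 1.0:
        pkt += f"|@{rate}"
    return pkt

def send_metric(name: str, value: float, mtype: str = "c",
                host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; losing a packet loses one sample, never
    blocks the application -- the design choice that made StatsD popular."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(statsd_packet(name, value, mtype).encode("ascii"),
                    (host, port))
    finally:
        sock.close()
```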
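Kopp's percentiles-over-averages point above is easy to demonstrate: a handful of slow outliers barely move the mean but show up plainly at p99. A nearest-rank percentile sketch (the latency data in the usage note is invented for illustration):

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least pct%
    of the data is at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]
```

With 98 requests at 100 ms plus outliers at 2000 ms and 3000 ms, the mean is 148 ms (looks fine), the median is 100 ms, and p99 is 2000 ms – the percentile is what surfaces the outliers.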

QA-integration

General Build

Jenkins

Continuous Deployment / Delivery

  • Paul Hammant: Practices correlated with trunk-based development – from blog post Trunk Supporting Practices (which includes detail links for all categories) and earlier lean enterprise ‘deployment g-forces’ diagram – release frequency vs. examples, branching model, release prep, source repo org, in-house code sharing, use of flags or toggles, change that “takes a while”, continuous integration infrastructure & strategy, QA activities, automated QA, shared integration testing environment(s) (for devs, not QA), per-dev envs, pre-prod envs (via IaC), code review (continuous review), db rollbacks, db changes, app config per env, talent retention, developer activity change with proximity to release, methodology (kanban or “flow-centric” agile), definition of “the build”, bots make decisions for humans
  • Thoughtworks: What’s the difference between CI and CD (infographic) – manual vs. automatic step just before deploy-to-prod
  • Thoughtworks: Architecting for Continuous Delivery – The trouble with monolithic codebases and approaches to break it down; Designing the test suite for optimal feedback; Setting up a deployment pipeline as the backbone of CD; Extract components from monolith; excellent CD diagrams
  • Thoughtworks: It’s not CI, it’s just CI theatre – advocates for trunk-based dev; jez humble’s definition of CI: “CI developers integrate all their work into trunk (also known as mainline or master) on a regular basis (at least daily)”; symptoms of ‘CI theatre’ include: long-lived branches, poor test coverage, allowing red builds for long periods; ‘continuous isolation’ – the practice of running CI against feature branches; “frequency reduces difficulty”; trunk-based dev “brings the pain forward rather than storing it up for merges, code reviews or delaying releases”
  • Martin Fowler: ContinuousIntegrationCertification – three things that truly comprise CI: commit and push to master at least once daily; each commit causes an auto build and test; any failure is fixed within 10min
  • Msft: Release Flow: How We Do Branching on the VSTS Team – compares ‘release flow’ with ‘github flow’ (also trunk-based) and ‘git flow’ (not trunk-based)
  • Christiaan Verwijs: Want to be Agile? Drop your DTAP-pipeline – DTAP Dev -> Test -> Acceptance -> Prod is an anti-pattern; agile alternative
  • Humble / Farley: Continuous Delivery: Anatomy of the Deployment Pipeline – freely-available chapter from their Continuous Delivery book
  • Jeff Sussna: Why We Should Continuously Break Everything – reduce overall cost of failure by continuously refactoring at all levels of the IT stack, including devops automation
  • Jeff Sussna: Microservices, Have You Met… DevOps? – with the promise of increased agility and improved quality, microservice architectures represent a shift from complicated systems, where stability is paramount, to complex interconnected systems, where resilience matters most; each microservice must design for failure by treating its dependencies as it would any third-party service
  • Enabling Microservices @Orbitz (38min vid) – key quote: “Get from code to production with as little people involvement as possible”; where they started: a Conway’s-law silo’d org, with dev using one tool for deployments and ops using another; infrequent, large, inefficient multi-month release cycles; where they arrived: fully automated build pipeline from time of code review; multiple releases per day; automated environment supervision avoids downtime; key technologies used: Docker for repeatable applications, Chef for repeatable infrastructure, Jenkins for repeatable releases; other technologies used: Consul as service registry; ElasticSearch / Logstash (Logstash forwards logs from each localhost); Graphite for metrics; Mesos Marathon for launching and supervision; HAProxy for front-end proxy in conjunction with Bamboo; Ansible for deploy (through Marathon); Chef for VM provisioning; Atlassian Stash (now Bitbucket) Git repo; other notes: following The Twelve-Factor App pattern – configs simple (sometimes as few as one specified param), stateless services where possible; all Jenkins slaves are Docker containers running on Mesos; easy to create pipelines – they just pass app name and version; smoke / acceptance testing may be done in any target environment as part of the build pipeline
  • DeGrandis: Getting to one button deploy using Kanban
  • Philips / Kawaguchi: Orchestrating your Delivery Pipelines with Jenkins – tips on sharing build artifacts through the pipeline, approval via Promoted Builds Plugin
  • Cloudbees: Another Look at the Jenkins Promoted Builds Plugin – example of dev -> QA / release-mgmt handoff
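Several entries above (Hammant's trunk-based practices, the ‘CI theatre’ piece) lean on flags/toggles to keep incomplete work safely on trunk. A minimal in-process toggle sketch (the flag name and store are illustrative; real systems back this with config management or a flag service):

```python
class FeatureFlags:
    """Minimal toggle store; the calling pattern is what matters, not the
    backing store."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so half-finished code on trunk
        # stays dark until someone deliberately enables it.
        return self._flags.get(name, False)

def checkout_flow(flags: FeatureFlags) -> str:
    """New code path ships to trunk behind a toggle; release and deploy
    are decoupled."""
    if flags.is_enabled("new_checkout"):
        return "new"
    return "legacy"
```

This is what lets trunk-based teams merge daily without branching: the merge is always safe, and the flag flip is the release.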

Canaries

Lean DevOps

Database-specific DevOps

Azure-specific

AWS-specific
