General Info
Enablement
SRE
IAC
Config-mgmt and versioning
Change Management
Strategy and Planning
Serverless
Chaos Engineering
AWS Well-Architected Framework
Docker Containers
Docker – data-mgmt-specific
Docker on Windows: Windows-containers-specific
Docker Security / Image size optimization
Docker Monitoring
Docker – build-pipeline-specific
Kubernetes
AWS ECS
Key-Value Stores and Service Discovery
Logging and Instrumentation
Observability, Monitoring & Alerting
QA-integration
General Build
Jenkins
Continuous Deployment / Delivery
Canaries
Lean DevOps
Database-specific DevOps
Azure-specific
AWS-specific
- Microsoft’s DevOps Journey (130 slide deck)
- Nirmal Mehta: strong belief, loosely held: Bringing Empathy to IT (45min vid) – Pareto inefficient Nash Equilibrium; Docker – empathy as code
- Project Execution Methodologies – The Change (infographic) Waterfall, Agile, Devops color-coded lines
- Subbu: Don’t Build Private Clouds – dc -> cloud journey typical phases: 1) private cloud; 2) move stateless monoliths; 3) Deal with stateful monoliths; 4) transform to cloud-native
- Logicalis: How DevOps accelerates innovation (infographic) – process, people & tools, culture, overall benefits
- Puppet: 2015 State of DevOps Report – impact of lean mgmt. & continuous delivery on culture & performance; application arch. & dev. productivity; how IT mgrs. can help their teams win; burnout; methodology. High-performing IT organizations deploy 30x more frequently with 200x shorter lead times (debunked the myth that we need to choose between speed and reliability); they have 60x fewer failed deployments and recover (MTTR) 168x faster (Failures are unavoidable, but how quickly you detect and recover from failure can mean the difference between leading the market and struggling to catch up with the competition); Lean management and continuous delivery practices create the conditions for delivering value faster, sustainably; High performance is achievable whether your apps are greenfield, brownfield or legacy (Continuous delivery can be applied to any system, provided it is architected correctly (can do most testing without an integrated environment, deploy/release independently of other applications/microservices it depends on). We also found that high performers are more likely to use a microservices architecture…); Deployment pain can tell you a lot about your IT performance. Throughput measures: deployment freq., deployment lead time. Stability measures: mean time to recover (MTTR). Why culture matters: pathological, bureaucratic, generative
- slalom: 5 ways to incorporate DevOps into your software delivery process – 1) enable entire team to work together “Breaking down silos and bringing people together is the MOST IMPORTANT part of DevOps”; embracing agile is a major tenet in DevOps culture. Agile works aggressively toward bringing your teams together by restructuring work and introducing feedback along the way; 2) automate everything! “treat[ing] your server configuration like developers treat code.” Extract out environmentally-specific application properties into configuration files stored in source control applied using a configuration management system. That is the key to automation, and the cornerstone of DevOps; The only difference between dev and production should really boil down to a set of connection strings and environment variables; 3) Everyone is responsible for production; if you don’t task developers with production duties, they won’t write production-optimized code; 4) Get obsessed with tests, then automate them, too; automated tests have to be written not only for your code coverage, but for your infrastructure scripts as well; 5) Become comfortable deploying frequently to production
- Alex King: A 10,000ft View of DevOps at Gogo (38min vid) – tooling should do mundane stuff, not developers; change management; foremast templating tool for spinnaker; canary deployments in prod instead of separate dev/stage/prod
- Sasha Rosenbaum: Single Person of Failure (19min vid) – slideshare – imagine buying a server that has an uptime of 16 hrs a day, with interruptions! Humans are not highly available; antipattern #1: “you shall not pass to my production server”; “even when systems are automated there are still humans who manage them”; “why is there a single admin? Situation often evolves organically from a small team”; Solutions: role-based access, use service accounts not personal accounts for services; make sure person on call has necessary access; trust your people; antipattern #2: “be aware of the single expert”; a quote we’ve all heard: “this will take me 8 hrs to explain vs. 15mins to fix”; can you afford losing this knowledge? – delegate to juniors; new hires haven’t yet caught the “this is how it’s always been” virus; you are emotionally invested in your code; Solutions: documentation, comments, tests, automation; antipattern #3: “I cannot afford to take vacation!” Job security? Research shows that working longer hours does not increase productivity; solution: Game days – intentionally breaking infrastructure in simulated-production or even actual production (off-hours)
- Jeff Sussna: Why DevOps Really Is About Culture – trend away from exhaustive specs and command & control work assignments, toward empowering decision-making; away from snowflake servers toward standardized template configurations
- DeGrandis: Devops: A Software Revolution in the Making?
- Andrew Phillips: No Quick Fix for DevOps – emphasis on dev and ops collaborating / working together vs. just being resources
- Agile Sysadmin: Kanban for Sysadmin
- LessThanDot: Applying Kanban to IT Processes (Part 2): Help Desk / Support Scenario
- Matthew Skelton: What Team Structure is Right for DevOps to Flourish? – describes both patterns and anti-patterns
- 26thCentury: Test Automation
- Twitter: #ConwaysLaw
- Twitter: #DevOps
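The throughput and stability measures named in the State of DevOps entry above (deployment frequency, deployment lead time, MTTR) can be computed from plain timestamps. The following is a minimal sketch, not the report's methodology: the record layout, the sample data, and the function names are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical sample data: (commit time, deploy time) per deployment.
deployments = [
    (datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 15)),
    (datetime(2024, 1, 2, 10), datetime(2024, 1, 2, 12)),
    (datetime(2024, 1, 4, 8), datetime(2024, 1, 4, 18)),
]

# Hypothetical sample data: (failure detected, service restored) per incident.
incidents = [
    (datetime(2024, 1, 2, 13), datetime(2024, 1, 2, 13, 30)),
    (datetime(2024, 1, 4, 19), datetime(2024, 1, 4, 20, 30)),
]


def deployment_frequency(deploys, period_days):
    """Throughput: deployments per day over the observed period."""
    return len(deploys) / period_days


def median_lead_time(deploys):
    """Throughput: median time from commit to running in production."""
    return median(deployed - committed for committed, deployed in deploys)


def mean_time_to_recover(incs):
    """Stability: average time from detection to restoration."""
    total = sum((restored - detected for detected, restored in incs), timedelta())
    return total / len(incs)
```

With the sample data, `median_lead_time(deployments)` is six hours and `mean_time_to_recover(incidents)` is one hour. The median is used for lead time so that a single slow release does not dominate the figure.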
- Viktor Farcic: How To Shift Left Infrastructure Management Using Crossplane Compositions (28min vid) – You want to enable teams to serve themselves, not require them to create tickets for you every time something needs to be deployed or a resource created; crossplane.io is an unopinionated api allowing sre / ops / platform teams to create opinionated frameworks to be consumed by dev / feature teams
- Google: SRE vs. DevOps: competing standards or close friends? – DevOps practices and SRE implementation are very similar; “SREs work with stakeholders to decide on Service Level Indicators (SLIs) and Service Level Objectives (SLOs)”; “…SLA is a promise by a service provider, to a service consumer, about the availability of a service and the ramifications of failing to deliver the agreed-upon level of service”; “When SREs find tasks that can be automated, they work to engineer a solution to prevent that toil in the future… Google aims to ensure that at least 50% of each SRE’s time is spent doing engineering projects…”
- Charity Majors: DevOps vs SRE: Delayed Coverage of the Dumbest War – compares google’s Site Reliability Engineering book with Effective DevOps, calls SRE “Ops with mgmt support”; “When error budgets are depleted, the focus shifts from feature development to improving reliability”
- Charity Majors: Operational Best Practices: #Serverless – summary of Serverlessness, NoOps, the Tooth Fairy (37min vid) – “engineering cycles are going to be the scarcest resources anyone has”; “in the glorious serverless future, more responsibility for operational quality needs to come from application developers”; “operations is the constellation of your org’s technical skills, practices, and cultural values around designing, building and maintaining systems, shipping software, and solving problems with technology”; “labor can be outsourced. Caring can’t”; “ask every candidate interview questions about operations and debugging”; “never promote someone to sr. software engineer if they are net negative for operations”
- Google: Narayan Desai: Beyond Goldilocks Reliability (39m vid) – Goldilocks Reliability: Define some SLIs, Choose “Just Right”; Assumptions: Just Right makes sense, There is one answer, We know the questions to ask, The answers don’t change; Problems: No model of reliability, Each Goldilocks metric provides a narrow window into behavior; The Trouble with Thresholds: 2-bucket dichotomy usually means lower fidelity and effectively loss of data, We have no basis to judge quality, No support for deeper insights; Reliability: Availability, Performance, Correctness; Stationarity Exposes Reliability Phenomena; Tantalizing Capabilities: Proactive reliability interventions, Data-driven prioritization of reliability investments
- Treynor, Dahlin, Rau, Bey: The Calculus of Service Availability – this article expands upon the topic of SLOs to focus on service dependencies. Specifically, we look at how the availability of critical dependencies informs the availability of a service, and how to design in order to mitigate and minimize critical dependencies
- Google: Implementing SLOs
- Google: Example Error Budget Policy – goals, non-goals, miss-policy, etc.
- Google: SRE home: SRE book online etc.
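The SLO, error-budget, and critical-dependency ideas in the entries above reduce to a little arithmetic. The sketch below is an illustration under stated assumptions, not Google's policy or tooling: the function names are hypothetical, the budget is counted in events (requests) over one window, and the dependency calculation assumes every dependency is critical and fails independently.

```python
def error_budget(slo_target, total_events):
    """Number of bad events the SLO tolerates over the window."""
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target, total_events, bad_events):
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget to spend")
    return (budget - bad_events) / budget


def serial_availability(availabilities):
    """Best-case availability of a service whose listed dependencies are
    all critical and fail independently: the product of the values."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result
```

For example, a 99.9% SLO over 1,000,000 requests tolerates about 1,000 failures, and 250 failures leave roughly 75% of the budget. Three critical dependencies at 99.9% each cap the service at about 99.7%, which is why reducing critical dependencies matters.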
- Weave.works: GitOps – Operations by Pull Request – “By using Git as our source of truth, we can operate almost everything. For example, version control, history, peer review, and rollback happen through Git without needing to poke around with tools like kubectl”
- Thoughtworks: Using Pipelines to Manage Environments with Infrastructure as Code – three patterns described: Put all of the environments into a single stack; Define each environment in a separate stack; Create a single stack definition and promote it through a pipeline
- Thoughtworks: Infrastructure as Code: The Automation Fear Spiral – “Infrastructure teams need to break this spiral to use automation successfully”
- CloudBees: To Terraform Or Not To Terraform: Configuration Management In AWS (And Other Cloud Computing Providers) – advocates for terraform as a non-aws-locked-in solution vs CloudFormation, trade-offs vs. ansible / packer
- Thoughtworks: Infrastructure as Code: From the Iron Age to the Cloud Age – treat the configuration of systems the same way that software source code is treated; definitions used to create and update system configurations should be externalizable in a format that can be stored in off the shelf version control systems; It should be possible to validate definitions at various levels of granularity, so you can apply a variation of the test pyramid… this offers the benefits of fast feedback and correction of changes, and is the foundation for Continuous Integration and building a Continuous Delivery pipeline
- Martin Fowler: SnowflakeServer – use config mgmt tool “recipes” to update servers rather than manual installs and config file editing which lead to brittle server environments
- Kief Morris: ImmutableServer – deployed server never modified, instead replaced with new instance
- Quora: Configuration Management: Chef / Puppet / Ansible / Saltstack / Docker compared
- jacksoncage: Use Salt to manage and deploy Docker containers
- Mark Seemann: Semantic Versioning with Continuous Deployment – argues for semantic / explicit versioning for w.x.y, and for .z as well (vs. having a build server assign .z)
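To make the explicit-versioning idea in the last entry concrete, here is a minimal sketch of choosing the next version from the kind of change rather than letting a build server assign it. It is not code from the article: it assumes the common three-part MAJOR.MINOR.PATCH form, and the `Version` type and the change labels are hypothetical.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text):
        """Parse a plain 'MAJOR.MINOR.PATCH' string (no pre-release tags)."""
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)

    def bump(self, change):
        """Return the next version for a 'breaking', 'feature' or 'fix' change."""
        if change == "breaking":
            return Version(self.major + 1, 0, 0)
        if change == "feature":
            return Version(self.major, self.minor + 1, 0)
        if change == "fix":
            return Version(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown change type: {change!r}")

    def __str__(self):
        return f"{self.major}.{self.minor}.{self.patch}"
```

Because `Version` is a tuple, versions compare numerically part by part, so `Version.parse("1.10.0") > Version.parse("1.9.5")` holds where a string comparison would get it wrong.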
- Tarique Smith: Will The DevOps Movement Be The Death Of Change Management – “Historically, testing and staging environments (where the heavy lifting of testing takes place) have been underfunded. As a result, they have not always been at the same maintenance level or configuration as the target production environment(s). This mismatch often caused testing results to be suspect, contributing to a lack of trust between development and operations. As a result, testing–a pillar of the waterfall methodology–has traditionally only been able to ensure certain aspects of an application’s testing complement”; “By automating testing and deployment through continuous integration and continuous deployment (CICD), change management stakeholders can shift their attention from traditional concerns, such as separation of tasks and back-out plans. This leaves them with the ability to focus on curation and review of the DevOps/CICD process and standards in use by the development teams requesting or deploying a change”
- James Shore: Where do you want your complexity – Monolith vs. Microservices, Monorepo vs. repo-per-service
- Thoughtworks: Treat DevOps Stories like User Stories
- Mike Roberts: Serverless Architectures – vendor-agnostic description of serverless, often referencing AWS Lambda
- Tom McLaughlin: Serverless Ops: What do we do when the server goes away? – ops role in x-functional team rather than part of dedicated devops team; skillsets: systems eng.; platform / tooling understanding; ppl skills
- Adrian Colyer: Serverless computing: economic and architectural impact – the serverless impact on system designs
- Alex Ellis: OpenFaaS
- @openfaas
- Chris Ward: Embracing the Chaos of Chaos Engineering – Form a hypothesis, Communicate to your team, Run experiments, Analyze the results, Increase the scope, Automate experiments
- Jeff Hodges: Notes on Distributed Systems for Young Bloods – implement backpressure; metrics; use percentiles not averages; feature flags for infrastructure too
- principlesofchaos.org
AWS Well-Architected Framework
- AWS: AWS Well-Architected home – AWS answers, AWS solutions, case studies, cloud security
- AWS: AWS Well-Architected Framework (76pgs)
- AWS: AWS Well-Architected Framework: Operational Excellence Pillar (whitepaper)
- AWS: AWS Well-Architected Framework: Security Pillar (whitepaper)
- AWS: AWS Well-Architected Framework: Reliability Pillar (whitepaper)
- AWS: AWS Well-Architected Framework: Performance Efficiency Pillar (whitepaper)
- AWS: AWS Well-Architected Framework: Cost Optimization Pillar (whitepaper)
- Dee Kumar, Brandon Royal: Modernize Traditional Apps with Docker (28min webinar vid)
- OCI: Open Container Runtime Specification – Container Principles; Roadmap; Implementations (runc); Filesystem bundle; Runtime and Lifecycle; Configuration
- Mike Coleman: Containers are not VMs – apartments vs. houses metaphor
- Docker: Solomon Hykes: Introduction to Docker – the seminal Oct 2013 talk at Twitter, laying out the vision; ambassador pattern for mocking 3rd-party services
- @graphaelli: EC2 Metadata service mocking with Docker
- Raman Gupta: Why Docker Data Containers are Good
- Ian Lewis: Creating Smaller Docker Images
- Will Sargent: Docker Cheat Sheet
- Tom Linford: Integration Tests with Docker – mock services for 3rd-party services; docker-compose for each test case including error cases when service down; expose endpoints for E2E testing; Quick enough to run often and simple enough to make adding new cases easy
- Marc Campbell: The misunderstood Docker tag: latest – since docker latest means latest non-tagged, it is best avoided, use explicit tags instead
- Adrian Cockcroft: State of the Art in MicroServices (38min video)
- Docker: online tutorial (takes about 5min)
- Shopify: Docker at Shopify: How we built containers that power over 100,000 online shops – why containerize? the 100 rule; containerizing your app; process hierarchy
- Shopify: Docker at Shopify: From This-Looks-Fun to Production (35min vid)
- Docker: Docker containers blog posts
- Docker: Mesos blog posts
- Ian Eyberg: Life in a Post-Container World and Why Linux Will Play a Diminished Role – given that most servers are serving as app servers, not multi-user servers, trend is toward managing containers, not servers, unikernels are promising as a way to host containers without VM overhead
- Bridget Kromhout: Docker in Production: Reality, not Hype
- Evan Machnic: Production Deployment with Docker – rails scenario
- Sirupsen: Why Docker is Not Yet Succeeding Widely in Production
- Akshay Karle: Operating System Containers vs. Application Containers – OS Containers (e.g., LXC) are similar to VMs, suitable for hosting multiple processes and services; Application Containers (e.g., Docker) host a single process or service
- Reddit: Docker articles
- PaaS Magazine: Docker articles
- Docker: Webinar / demo archive
- Docker: Community Forums
- ClusterHQ: Introducing Flocker 0.1, a lightweight volume & container manager for Docker
- ClusterHQ: Tutorial: PipelineDB Persistence with Flocker and Docker Swarm
- Blockbridge: Elastic Storage for Container Applications – Storage As A Container (STaaC)
Docker on Windows: Windows-containers-specific
- Alex Ellis: Docker Windows Containers blog posts
- Msft: Getting started with containers and Docker on Windows Server 2016 Technical Preview 5 – utilizes install-ContainerHost.ps1
- Msft: Windows Container Images – general info
- Msft: Container Host Deployment – Windows Server – describes update-ContainerHost.ps1
- Msft: Windows Containers Sample Dockerfiles – includes dockerfiles for base images including iis, dotnet35; also samples
- Stefan Scherer: Dockerfiles for Windows – docker-compose, Consul, golang, Jenkins Swarm
- Buc Rogers: Dockerfiles for Windows – ASP.NET, SQL Server, PostgreSQL, Python, Ruby, Swarm
- Docker Labs: Tutorial: Run Swarm on a mix of Linux and Windows Nodes – use swarm label to specify target Linux or Windows host
- Msft: Mark Russinovich: Containers: Docker, Windows and Trends – details on windows service containers, hyper-v containers
- Scott Hanselman: Brainstorming development workflows with Docker, Kitematic, VirtualBox, Azure, ASP.NET, and Visual Studio
- Msft: Getting started with Nano Server
- Docker: Build, Ship, Run with Docker and Microsoft
- Msft: Containers Virtualization Docs, tools and image source
- Msft: ASP.NET Core samples – ASP.NET core runs on both Linux and Windows containers
- Msft: Windows Containers forum
Docker Security / Image size optimization
- Abby Fuller: Creating Effective Images (43min vid) – minimal images, minimizing attack surface, high level best practices for windows container images
- Aleksa Sarai: Docker Internals and Implementing Rebase – experimental attempt to implement git-like rebase for docker layers; good info on image format
- Docker: Security home – “Introduction to Container Security” white paper (incl. discussion of seccomp (secure computing mode) sandboxing mechanism)
- Docker: Engine security – intrinsic kernel security features; daemon attack surface; loopholes in Container config profile (AppArmor, SELinux, GRSEC cited)
- CIS Security Benchmark against Docker 1.6
- Docker Bench for Security tool – automates many of the checks in the CIS Security Benchmark above, on a given Docker Engine host, analyzes any running Containers on that host as well.
- Diogo & Ryaz: Secure Substrate: Least Privilege Container Deployment (37min vid)
- nccgroup: Understanding and Hardening Linux Containers
- Docker Saigon: Docker Caveats: What You Should Know About Running Docker In Production
- Jessie Frazelle: AppArmor profile generator for Docker Containers
- LinuxCon / ContainerCon 2015: Jerome Petazzoni: Containers, Docker, and Security: State Of The Union – immutable containers via --read-only; docker diff allows easy audit of changes; image provenance: must trust upstream, must trust registry host, must trust transport; Notary; Defense in depth = VM + Containers
- Docker Security Cheat Sheet (based on Adrian Mouat talk below) – types of security threats and how to avoid them
- GOTO London 2015: Adrian Mouat: Docker Security (33min vid) – boxing round-by-round winners: round 1: isolation guarantees: VM; round 2: attack surface: Container; round 3: controls: Container; round 4: auditing: Container; round 5: track record: VM; use VMs to segregate groups of Containers; Security paradigms: Defense in depth; least privilege; set container fs (or volumes) to read-only via “--read-only” with volume-mount; memory limits via “-m”; “RUN find / -perm +6000 -type f -exec chmod a-s {} \; \ || true”; turn off inter-container comm. via daemon arg “--icc=false”; specifying “--iptables” allows only linked containers to communicate; shared secrets: use secured key value stores (e.g., vault, keywiz) instead of env. vars since the latter too visible
- Adrian Mouat: Docker Security: Using Containers Safely in Production (free ebook, Dan Walsh foreword and references) – Only run container images from trusted parties; Container applications should drop privileges or run without privileges whenever possible; Make sure the kernel is always updated with the latest security fixes; the security kernel is critical; Make sure you have support teams watching for security flaws in the kernel; Use a good quality supported host system for running the containers, with regular security updates; Do not disable security features of the host operating system; Examine your container images for security flaws and make sure the provider fixes them in a timely manner; Defense in depth; Least privilege; segregate sensitive-info Containers using VM; Image provenance; Limit container networking; Defang binaries; Limit memory; Limit cpu; Limit restarts; Limit filesystems; Apply resource limits (ulimits); Run a hardened kernel; Linux Security Modules (LSMs); Auditing; Incident response
- Container Camp 2015: Diogo Monica: Docker Content Trust (26min vid) – TUF; Notary; TOFUs; Docker Content Trust; --disable-content-trust=false *or* DOCKER_CONTENT_TRUST=1; Demo of Docker Content Trust
- Docker: Intro to Docker Security 2016-03-24 (41min vid) – seccomp demo starts at 26:00; Integrated security: Who (Access); What (Content); Where (Platform); “today there is no reason to deploy an application directly on bare metal or directly on a VM when you have such a powerful sandbox protecting all of your applications”; areas: cgroups, Namespaces, User Namespaces, LSMs, Seccomp, Best Practices and Tools; User namespaces feature supports non-root user, also supports container root not being host root
- Dockercon 2015: Diogo Monica / Nathan McCauley: Least-privilege Microservices (1hr vid and slides) – profiles: front-end server: access to a lot of downstream services, most exposed; back-end server: io-intensive, limited network access; Worker host: cpu-intensive, wide range of workloads; use strace to determine syscalls used
- Eric Windisch: docker-pull – pull docker images from inside of a container, exporting for use with ‘docker load’ – see also this discussion thread on security of docker load
- docker-slim – Optimize and secure your Docker containers – uses static and dynamic analysis to reduce image size; autogenerates seccomp profiles
- Microcontainers: Iron.io’s New Hack to Shrink Docker Containers – use scratch or Alpine as base image – see iron.io example Dockerfiles for base, gcc, node, go, java, python, ruby, scala
- Atlassian: Smaller Java images with Alpine Linux – both open JDK and oracle JDK covered; image reduction technique by using docker save / docker load
- Jacob Kaplan-Moss: A Reading List for InfoSec Engineers – oriented towards providers of Software-, Platform-, and Infrastructure-as-a-Service
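Several of the hardening tips in this section (read-only root filesystem, memory limits, dropped privileges) map onto `docker run` flags. The sketch below only assembles an argument list and never invokes Docker. The helper name, the default memory limit, and the image name are hypothetical, and `--cap-drop ALL` is this sketch's choice for "least privilege" rather than something the talks above prescribe.

```python
import shlex


def hardened_run_args(image, memory="256m", writable_volume=None):
    """Build a `docker run` argument list with a few least-privilege defaults."""
    args = [
        "docker", "run", "--rm",
        "--read-only",         # root filesystem is mounted read-only
        "--memory", memory,    # long form of -m: cap container memory
        "--cap-drop", "ALL",   # drop all Linux capabilities
    ]
    if writable_volume:
        # A read-only container still needs somewhere to write its data.
        args += ["--volume", writable_volume]
    args.append(image)
    return args


# Hypothetical image name, pinned to an explicit tag rather than "latest".
command = hardened_run_args("example/app:1.4.2", writable_volume="appdata:/var/lib/app")
command_line = shlex.join(command)
```

Returning a list rather than a string keeps the result safe to hand to `subprocess.run` without shell quoting problems; `shlex.join` is used only to render it for display.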
- Brian Christner: How to setup Docker Monitoring – cAdvisor, InfluxDB, Grafana
- Sysdig – Container-optimized monitoring and troubleshooting tool – see this swarm github comment from Alexandre Beslic of Docker recommending Sysdig to be used for self-healing for Swarm; Sysdig has Rancher endorsement in the form of strong integration
- Docker: Nathan LeClaire: Realtime Cluster Monitoring with Docker Swarm and Riemann – push-based monitoring; compat. with Graphite, InfluxDB, Librato et al.; riemann server, health reporting, dashboard UI
- Joyent: Simplifying service discovery in Docker with Containerbuddy – “Discovery services like Consul provide a means of performing health checks from outside our container, but that means packaging the tooling we need into the Consul container. If we need to change the health check, then we end up re-deploying both our application and Consul, which unnecessarily couples the two…. Containerbuddy to the rescue!… Containerbuddy registers the application with Consul on start and periodically sends TTL health checks to Consul; should the application fail then Consul will not receive the health check and once the TTL expires will no longer consider the application node healthy. Meanwhile, Containerbuddy runs background workers that poll Consul, checking for changes in dependent/upstream service, and calling an external executable on change.”
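The TTL health-check pattern described in the Containerbuddy entry can be sketched without any discovery service at all. This is an in-memory stand-in, assuming a hypothetical `TTLRegistry` class; it is not Consul's or Containerbuddy's API. The application sends heartbeats, and the registry treats a service as unhealthy once a heartbeat's TTL lapses.

```python
import time


class TTLRegistry:
    """In-memory stand-in for a discovery service that uses TTL checks."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so expiry can be tested
        self._deadlines = {}     # service name -> time the last heartbeat expires

    def heartbeat(self, service, ttl_seconds):
        """Called periodically by (or on behalf of) the application."""
        self._deadlines[service] = self._clock() + ttl_seconds

    def is_healthy(self, service):
        """A service is healthy only while its latest heartbeat is unexpired."""
        deadline = self._deadlines.get(service)
        return deadline is not None and self._clock() < deadline
```

The point of the design is that health is pushed from inside the container: if the application stops sending heartbeats for any reason, it drops out of the healthy set without the registry needing to know how to probe it.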
Docker – build-pipeline-specific
- JFrog: Taking Docker to Production with Confidence – leveraging virtual repository feature, defined here: JFrog: Push the Limits of Virtual Repositories
- Brian Grant: What is Kubernetes: An architectural view (slides)
- Eric Brewer: GCPNext keynote on Kubernetes and config (1hr vid)
- KubeCon “Cloud-Scale Kubernetes at eBay” (18min vid) – case study of how Kubernetes is being used at eBay; shows how inflexible static provisioning is, vs the pool of resources managed by Mesos; eBay is a pro-open-source company, their first choice is always to use or use-and-adapt an open source tool; Kubernetes lets you declare your intent and simply “Run”, vs a traditional Provision->Deploy->Monitor->Remediate cycle; For network routing, they are planning to use BGP all the way to the host containers; Local storage leases help with dbs like Cassandra
- Kubernetes 101 – Kubectl CLI and Pods – Kubectl CLI; Pod management, volumes, volume types, multiple containers
- Kubernetes 201 – Labels, Replication Controllers, Services and Health Checking – Labels, Replication Controllers, Services, Health Checking
- Google: Borg, Omega, and Kubernetes: Lessons learned from three container-management systems over a decade – long but worth-it article about google’s evolution of container technologies; “the container has become the sole runnable entity supported by the Google infrastructure”; “Building management APIs around containers rather than machines shifts the “primary key” of the data center from machine to application”; “The design of Kubernetes as a combination of microservices and small control loops is an example of control through choreography—achieving a desired emergent behavior by combining the effects of separate, autonomous entities that collaborate”; things to avoid (e.g., have Container system manage port nbrs. – Kubernetes instead allocates an IP address per pod); use labels to group containers rather than numbering them; some open, hard problems: Configuration – maintain a clean separation between computation and data, use declarative form like JSON or YAML; dependency management
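The advice to group containers by labels rather than by numbering them can be illustrated with a small selector. This is a sketch of equality-based matching only, assuming a hypothetical `pods` mapping and `select` helper; it does not call the Kubernetes API.

```python
# Hypothetical inventory: pod name -> labels attached to that pod.
pods = {
    "web-7f9c": {"app": "web", "tier": "frontend", "track": "stable"},
    "web-b21d": {"app": "web", "tier": "frontend", "track": "canary"},
    "db-0": {"app": "db", "tier": "backend", "track": "stable"},
}


def select(inventory, selector):
    """Return names of pods whose labels match every key/value in the selector."""
    return [
        name
        for name, labels in inventory.items()
        if all(labels.get(key) == value for key, value in selector.items())
    ]


stable_web = select(pods, {"app": "web", "track": "stable"})
frontends = select(pods, {"tier": "frontend"})
```

Because membership is computed from labels, a pod can belong to several overlapping groups at once (here "frontend" and "canary"), which a numbering scheme such as web-1, web-2 cannot express.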
- awsgeek: Amazon Elastic Container Service (infographic)
Key-Value Stores and Service Discovery
- Amazon: Service Discovery via Consul with Amazon ECS
- Mammatus: Microservice Service Discovery with Consul
- Jeff Lindsay: Understanding Modern Service Discovery with Docker
- Jeff Lindsay: Consul Service Discovery with Docker
- Hashicorp: Twelve-Factor Applications with Consul – describes use of envconsul to propagate Consul key/value store values to env vars
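The idea behind the last entry, copying key/value store entries into environment variables before starting a process, can be sketched as follows. This is not envconsul's implementation: the key/value data is a plain dictionary standing in for a store, and the naming rule (strip the prefix, replace "/" with "_", upper-case) is a convention invented for this sketch.

```python
import os
import subprocess
import sys


def env_from_kv(kv, prefix):
    """Map keys under `prefix` to environment-variable names and values."""
    env = {}
    for key, value in kv.items():
        if key.startswith(prefix):
            name = key[len(prefix):].strip("/").replace("/", "_").upper()
            if name:
                env[name] = value
    return env


def run_with_config(command, kv, prefix):
    """Run `command` with the selected key/value entries added to its environment."""
    env = {**os.environ, **env_from_kv(kv, prefix)}
    return subprocess.run(command, env=env, capture_output=True, text=True, check=True)


def demo():
    """Start a child Python process that reads its port from the environment."""
    kv = {"app/web/port": "8080", "app/web/db/host": "10.0.0.5", "app/other/x": "1"}
    child = [sys.executable, "-c", "import os; print(os.environ['PORT'])"]
    return run_with_config(child, kv, "app/web/").stdout.strip()
```

The child process reads only its environment, which is the twelve-factor property the article is after: the application needs no client library for the configuration store.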
- Peter Bourgon: Logging v. Instrumentation – services should only log actionable information; Logs read by humans should be sparse, ideally silent if nothing is going wrong. Logs read by machines should be well-defined, ideally with a versioned schema; Avoid multiple production log levels (info, warn, error) and especially runtime level configuration; An exception is debug logging, which is useful during development and problem diagnosis; It’s the responsibility of your operating environment or infrastructure to route process or container stdout/stderr to the appropriate destination; In contrast to logging, services should instrument every meaningful number available for capture
- Aditya Mukerjee: Don’t Read your Logs – “reading individual log lines is (almost) always a sign that there are gaps in your system’s monitoring tools”; Antipatterns: Logs as Metrics, Logs as Debugger Tracing, Logs as Error Reporting, Logs as Durable Records; “Don’t Stop Logging, But Stop Reading Log Lines”
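The split drawn in the two entries above, sparse machine-readable logs on one side and counters for everything measurable on the other, might look like this in a service. It is a minimal sketch: the schema version field, the event names, and the in-process `Counter` are assumptions standing in for a real log pipeline and metrics client.

```python
import json
from collections import Counter

SCHEMA_VERSION = 1   # bump when the shape of log records changes
metrics = Counter()  # stand-in for a real metrics client


def log_event(event, **fields):
    """Emit one JSON record on stdout; routing it is the platform's job."""
    record = {"schema": SCHEMA_VERSION, "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line


def handle_request(ok):
    """Count every request; log only the actionable case."""
    metrics["requests_total"] += 1
    if not ok:
        metrics["requests_failed"] += 1
        log_event("request_failed", reason="upstream_timeout")
```

A successful request produces no log line at all, only a counter increment, so the log stays quiet when nothing is wrong while the request rate remains visible through the metrics.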
Observability, Monitoring & Alerting
- Cory Watson: No-Nonsense Observability Improvement (31min vid slides) – “The Normal Zone” includes monitoring for Anticipated behaviors; The “Weird Zone” is about Observability of Unanticipated behaviors; Observability will be one of your most expensive projects; Incident Measures++ traditional ones incl. MTTD, MTTR can be lame – instead look to Nora Jones Cyclic Approach: Difficulties in Understanding, System-specific failure rates, Surprises, Lack of ownership, Near misses; Automation – need to avoid human being “out of the loop”; Invest in risk and need; Understand the use case
- Cindy Sridharan: Monitoring and Observability – “Why call it monitoring? That’s not sexy enough anymore.”; “Observability, because rebranding Ops as DevOps wasn’t bad enough, now they’re devopsifying monitoring too”; ““whitebox monitoring”, which refers to a category of monitoring based on the information derived from the internals of systems.”; “Monitoring is for symptom based Alerting”; “Signals that are collected, but not exposed in any prebaked dashboard nor used by any alert, are candidates for removal.”; “…in essence “observability” captures what “monitoring” doesn’t (and ideally, shouldn’t).”; “…unlike “monitoring” which is known failure centric, “observability” doesn’t necessarily have to be closely tied to an outage or a user complaint. It can be used as a way to better understand system performance and behavior, even during the what can be perceived as “normal” operation of a system.”; “The process of examining the evidence (observations) at hand and being able to deduce still requires a good understanding of the system, the domain as well as a good sense of intuition. No amount of “observability” or “monitoring” tooling can ever be a substitute to good engineering intuition and instincts.”
- Charity Majors: Observability — A 3-Year Retrospective – “Monitoring tools are effective for systems with a stable set of known-unknowns, and relatively few unknown-unknowns. For a system with predominantly unknown-unknowns, monitoring tools were all but useless.”; “Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.”; “A system is observable to the extent that you can understand new internal system states without having to guess, pattern-match, or ship new code to understand that state.”; “The unknown-unknowns now rapidly outpace monitoring dashboards capability to explain them to the humans responsible for continuous uptime, reliability and acceptable performance.”; “vendors had latched on to “distributed tracing, metrics, and logs” as “three pillars of observability.” Ben Sigelman neatly debunked this, saying: it makes no sense because those are just three data types. You may achieve [observability] with all three, or none — what matters is what you do with the data, not the data itself.”
- Verica: MTTR is a Misleading Metric—Now What? – “The second problem with MTTx metrics is they are trying to simplify something that is inherently complex. They tell us little about what an incident is really like for the organization, which can vary wildly in terms of the number of people and teams involved, the level of stress, what is needed technically and organizationally to fix it, and what the team learned as a result. MTTx (along with other data like severity, impact, count, and so on) are what John Allspaw calls “shallow” incident data.”; “Vanessa Huerta Granda, a Solutions Engineer at Jeli, has an excellent post detailing a process of using MTTR and incident count metrics as a way to “set the direction of the analysis we do around our entire incident universe.””; “If quantitative metrics are inescapable, we suggest focusing on Service Level Objectives (SLOs) and cost of coordination data.”
- Andrews / Lê-Quôc: Collecting Metrics Using StatsD, a Standard for Real-Time Monitoring – “StatsD is a standard and, by extension, a set of tools that can be used to send, collect, and aggregate custom metrics from any application.”
- Alex King: DevOps Meets Observability – Come Meet the Pyramid of Happiness! – Tier 1: Generation (tracing, logging, metrics); Tier 2: Ingestion and Monitoring; Tier 3: Alerting
- Charity Majors on Observability and Understanding the Operational Ramifications of a System – “Engineers are now talking about observability instead of monitoring, about unknown-unknowns instead of known-unknowns”; “It will always be the engineer’s responsibility to understand the operational ramifications and failure models of what we’re building, auto-remediate the ones we can, fail gracefully where we can’t, and shift as much operational load to the providers whose core competency it is”; “Don’t attempt to “monitor everything”. You can’t. Engineers often waste so much time doing this that they lose track of the critical path, and their important alerts drown in fluff and cruft”; “In the chaotic future we’re all hurtling toward, you actually have to have the discipline to have radically fewer paging alerts – not more”; “… the health of the system no longer matters. We’ve entered an era where what matters is the health of each individual event, or each individual user’s experience, or each shopping cart’s experience (or other high cardinality dimensions). With distributed systems you don’t care about the health of the system, you care about the health of the event or the slice”
- Michael Kopp: Why Averages Suck and Percentiles are Great – looking at the 50th percentile (median) and other percentiles such as the 90th gives far more insight into how prevalent outliers are than an average does
- Adrian Cockcroft: Who monitors the monitoring systems? – “… it would be good to compare the common metrics across different monitoring systems to analyze how much variance there is. This could be done by looking for simple differences, or using a statistical technique called gauge repeatability and reproducibility”
- Mark McDonnell: Observability and Monitoring Best Practices – types; channels; contexts; know your graphs; choosing between a metric and a log
- Cory Watson: Observability Crash Course (best of breed write-up links)
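The point of the Kopp entry above (averages hide outliers, percentiles expose them) can be illustrated with a small sketch. The latency values are made up for illustration, and the nearest-rank `percentile` helper is an assumption of this sketch rather than anything taken from the linked post; monitoring products may interpolate differently.

```python
import statistics


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100.

    Returns the smallest sample such that at least pct% of all
    samples are less than or equal to it.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding float rounding issues.
    rank = -(-pct * len(ordered) // 100)
    return ordered[rank - 1]


# Hypothetical response times in milliseconds: 17 fast requests, 3 slow ones.
latencies_ms = [100] * 17 + [1500, 2000, 4000]

mean_ms = statistics.mean(latencies_ms)   # pulled upward by the three outliers
p50_ms = percentile(latencies_ms, 50)     # what a typical request experiences
p90_ms = percentile(latencies_ms, 90)     # where the slow tail starts to show
p99_ms = percentile(latencies_ms, 99)     # the worst request in this sample
```

With this sample the mean lands well above what a typical request sees and well below the slow tail, so it describes no real request, while the median and the upper percentiles together show both the common case and how bad the outliers are.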
- Geek Guide: Slow Down to Speed Up – Continuous QA in DevOps – DevOps Maturity Model
- Jeff Sussna: Continuous Quality: What DevOps means for QA – “Tester -> Quality Advocate”, “DevOps is about spanning boundaries”, “Ultimate Definition of Quality… continuously deliver”
- How we build code at Netflix – Culture, Cloud and Microservices; Build using Nebula; Integrate triggering Spinnaker; Bake; Deploy
- Jenkinsfile-runner + Configuration-as-code (discussion forum thread), about the jenkinsfile-runner project
- Doug Campbell: Jenkins Jobs as Code with Groovy DSL – describes Gogo’s runway convention for using Job DSL
- Marcel Birkner: Using Jenkins Job DSL for Job Lifecycle Management – see also the Scriptler plugin; Access GitLab REST API; Access Oracle / MySQL Db; Docker example
- Dennis Schulte: Continuous Delivery for Microservices with Jenkins and the Job DSL Plugin
- Daniel Reuter: Generated Jenkins Jobs and automatic Branch Merging for Feature Branches – case study for Job DSL Plugin
- Netflix: Public Continuous Integration Builds for our OSS Projects (using Job DSL Plugin)
Continuous Deployment / Delivery
- Paul Hammant: Practices correlated with trunk-based development – from blog post Trunk Supporting Practices (which includes detail links for all categories) and earlier lean enterprise ‘deployment g-forces’ diagram – Release Frequency vs. examples, branching model, release prep, source repo org, in-house code sharing, use of flags or toggles, change that “takes a while”, continuous integration infrastructure & strategy, QA activities, automated QA, shared integration testing environment(s) (for devs not QA), per-dev envs, pre-prod envs (via IaC), code review (continuous review), db rollbacks, db changes, app config per env, talent retention, developer activity change with proximity to release, methodology (kanban or “flow-centric” agile), definition of “the build”, bots make decisions for humans
- Thoughtworks: What’s the difference between CI and CD (infographic) – manual vs. automatic step just before deploy-to-prod
- Thoughtworks: Architecting for Continuous Delivery – The trouble with monolithic codebases and approaches to break it down; Designing the test suite for optimal feedback; Setting up a deployment pipeline as the backbone of CD; Extract components from monolith; excellent CD diagrams
- Thoughtworks: It’s not CI, it’s just CI theatre – advocates for trunk-based dev; Jez Humble’s definition of CI: “CI developers integrate all their work into trunk (also known as mainline or master) on a regular basis (at least daily)”; symptoms of ‘CI theatre’ include: long-lived branches, poor test coverage, allowing red builds for long periods; ‘continuous isolation’ – the practice of running CI against feature branches; “frequency reduces difficulty”; trunk-based dev “brings the pain forward rather than storing it up for merges, code reviews or delaying releases”
- Martin Fowler: ContinuousIntegrationCertification – three things that truly comprise CI: commit and push to master at least once daily; each commit causes an auto build and test; any failure is fixed within 10min
- Msft: Release Flow: How We Do Branching on the VSTS Team – compares ‘release flow’ with ‘github flow’ (also trunk-based) and ‘git flow’ (not trunk-based)
- Christiaan Verwijs: Want to be Agile? Drop your DTAP-pipeline – argues that DTAP (Dev -> Test -> Acceptance -> Prod) is an anti-pattern; proposes an agile alternative
- Humble / Farley: Continuous Delivery: Anatomy of the Deployment Pipeline – freely-available chapter from their Continuous Delivery book
- Jeff Sussna: Why We Should Continuously Break Everything – reduce overall cost of failure by continuously refactoring at all levels of the IT stack, including devops automation
- Jeff Sussna: Microservices, Have You Met… DevOps? – with the promise of increased agility and improved quality, microservice architectures represent a shift from complicated systems, where stability is paramount, to complex interconnected systems, where resilience matters most; each microservice must design for failure by treating its dependencies as it would any third-party service
- Enabling Microservices @Orbitz (38min vid) – Key quote: “Get from code to production with as little people involvement as possible”; Where they started: A Conway’s law silo’d org, with dev using one tool for deployments and Ops using another; infrequent, large, inefficient multi-month release cycles; Where they arrived: Fully automated build pipeline from time of code review; Multiple releases per day; Automated environment supervision avoids downtime; Key technologies used: Docker for repeatable applications; Chef for repeatable infrastructure; Jenkins for repeatable releases; Other technologies used: Consul as service registry; ElasticSearch / LogStash (logstash forwards logs from each localhost); Graphite for metrics; Mesos Marathon for launching and supervision; HAProxy for front-end proxy in conjunction with Bamboo; Ansible for deploy (through Marathon); Chef for VM provisioning; Atlassian Stash (now Bitbucket) Git repo; Other notes: Following The Twelve-Factor App pattern – configs simple (sometimes as few as one specified param), stateless services where possible; All Jenkins slaves are docker containers running on Mesos; Easy to create pipelines – they just pass app name and version; Smoke / Acceptance testing may be done in any target environment as part of build pipeline
- DeGrandis: Getting to one button deploy using Kanban
- Philips / Kawaguchi: Orchestrating your Delivery Pipelines with Jenkins – tips on sharing build artifacts through the pipeline, approval via Promoted Builds Plugin
- Cloudbees: Another Look at the Jenkins Promoted Builds Plugin – example of dev -> QA / release-mgmt handoff
- Itay Shakury: Deployment Strategies Defined – compares canary, blue / green and other approaches, describes how canary may be used together with blue / green
- Adrian Colyer: The evolution of continuous experimentation in software product development – based on msft paper of same name; Experimentation Evolution Model chart with Crawl -> Walk -> Run -> Fly columns
- Adrian Colyer: Peeking at A/B tests: continuous monitoring without pain – Early stopping with Bayesian testing
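As a rough illustration of the canary idea compared in the Shakury entry above, the sketch below sends a fixed percentage of users to a canary release and keeps each user on the same side across requests. The function names, the `"canary"` / `"stable"` labels and the hashing scheme are assumptions of this sketch, not taken from the linked post; in practice the split is usually configured in a load balancer or service mesh rather than in application code.

```python
import hashlib


def bucket_for(user_id: str, buckets: int = 100) -> int:
    """Map a user id to a stable bucket in the range 0..buckets-1."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % buckets


def route(user_id: str, canary_percent: int) -> str:
    """Return "canary" for roughly canary_percent% of users, else "stable".

    The decision depends only on the user id, so a given user stays on
    the same release while the percentage is unchanged, and users already
    on the canary remain there as the percentage is raised.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return "canary" if bucket_for(user_id) < canary_percent else "stable"
```

Raising `canary_percent` step by step (for example 1, 10, 50, 100) widens the rollout gradually, and setting it back to 0 returns every user to the stable release.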
- Continuous Delivery blog (Jez Humble et al.)
- Evan Bottcher: Projects are Evil and must be Destroyed – advocates for the product paradigm vs. the project paradigm to optimize for reliability, monitorability, deployability and maintainability
- Spotify: Nhan Ngo: Visualizations of Continuous Delivery diagram based on Continuous Delivery book
- Netflix: Adrian Cockcroft: Ops, DevOps and PaaS (NoOps) at Netflix – describes how Netflix evolved to a cloud-based deployment environment and self-service APIs as part of that, describes their scenario where “Chef was overkill” – good discussion in the comments including response to Allspaw post below
- Etsy: John Allspaw: Response to Adrian Cockcroft “Ops, DevOps and PaaS (NoOps) at Netflix” – pushes back on the NoOps concept from the Cockcroft post above
- Netflix: Adrian Cockcroft: Patterns for Continuous Delivery, High Availability, DevOps & Cloud Native Open Source with NetflixOSS – includes AWS cost optimization info
- Jez Humble et al.: Lean Enterprise (book) – 21min book “deep dive” podcast by author
- Redgate: Using Migration Scripts in Database Deployments – differentiates between when automated vs manual migration scripts are needed
- Redgate: Why Put Your Database into Source Control? – benefits including auditing, integration-with-app-code, automation
- Redgate: Agile Database Development – including three tenets to db deployments: robust, fast, specific
- Redgate: Database Deployment Challenges – compares create vs backup/restore vs script, upgrade considerations, automated advantages, critique of shared-live db model
- Redgate: An Incremental Database Development and Deployment Framework – db version #s, one-click deploys, rolling back approaches, repeatability
- Redgate: Automating SQL Server Database Deployments: A Worked Example
- Redgate: Automating SQL Server Database Deployments: Scripting Details – script do’s and don’ts
- Redgate: The unnecessary evil of the shared development database
- Redgate: DBAs vs Developers: A Sad Tale of Unnecessary Conflict – advantage of cross-functional teams over silos
- Redgate: How Mature is Your Database Change Management Process? – characteristics of: baseline -> beginner -> intermediate -> advanced
- Redgate: Automating deployments with the SQL Compare command line – cautions in automating db deployment, multi-db targeting
- Redgate: Database Build and Release with Jenkins – uses Jenkins Promotions plugin, describes scripts to stop / start replication during deployment
- Redgate: 7 Steps to Build a Database Deployment Pipeline with Red Gate and TeamCity, Webinar Recording (1hr video)
- Redgate: Cleaning up SQL Server Deployment Scripts – includes replication considerations, auto-parsing to check for dangerous auto-promote scripts
- Redgate: Database Lifecycle Management (DLM) blog
- Martin Fowler, Pramod Sadalage: Evolutionary Database Design – general guidance on db refactorings, automation goals, dev:dba ratio
- Scott Ambler: Database Refactoring
- Paul Stovell: How to deploy a database – lays out 5 fundamental goals of db deployment strategy including source control, testability, CI
- Troy Hunt: Automated database releases with TeamCity and Red Gate – example of build server integration
- DbMaestro: The Secrets of Database Change Deployment Automation – considerations for safe database deployment automation, user comments on other approach tradeoffs including deltas
- DbMaestro: The Definitive Guide to Database Version Control – compares 4 approaches: 1) dev-alter scripts; 2) changelog-tracking; 3) Compare & sync; 4) Db-enforced chg-mgmt (DbMaestro)
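The changelog-tracking approach named in the DbMaestro entry above can be sketched as follows: the database records which versioned change scripts have already run, and a deploy applies only the pending ones, in order. This is a minimal illustration using SQLite from the Python standard library; the `schema_changelog` table, the `MIGRATIONS` list and the `migrate` function are hypothetical names chosen for this sketch and are not drawn from any of the linked tools.

```python
import sqlite3

# Hypothetical ordered change scripts: (version, SQL statement).
MIGRATIONS = [
    (1, "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE customer ADD COLUMN email TEXT"),
]


def applied_versions(conn):
    """Return the set of versions already recorded in the changelog table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_changelog ("
        "version INTEGER PRIMARY KEY, "
        "applied_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return {row[0] for row in conn.execute("SELECT version FROM schema_changelog")}


def migrate(conn, migrations):
    """Apply pending migrations in version order; return the versions applied."""
    done = applied_versions(conn)
    newly_applied = []
    for version, sql in sorted(migrations):
        if version in done:
            continue  # already applied on an earlier deploy
        with conn:  # the connection context manager commits on success
            conn.execute(sql)
            conn.execute(
                "INSERT INTO schema_changelog (version) VALUES (?)", (version,)
            )
        newly_applied.append(version)
    return newly_applied
```

Running `migrate` twice against the same database applies nothing the second time, which is the repeatability property several of the entries above ask for. Rollback scripts and checks for edited, already-applied scripts are left out of this sketch.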
- Azure Resource Manager QuickStart Templates – all currently available Azure Resource Manager templates contributed by the community
- Rik Hepworth: Optimising IaaS deployments in Azure Resource Templates
- Rik Hepworth: IaaS Environment Resource Templates for Demos
- Rik Hepworth: Azure Resource Templates blog posts
- Charity Majors: AWS, Networking, Environments, and You – managing multiple environments in AWS: Single account; multi-account; single-account multi-VPCs