Agile – 26thcentury

KubeCon 2019 insights

Attended KubeCon 2019 in San Diego last week. Great conference, liked the 30min prezo format, got more out of this one than any I’ve attended. And, late November is a great time to be in southern CA, though it’s almost disorienting to see the sun every day.

Key Takeaways

Teams should be structured to minimize cognitive load in order to maximize effectiveness: “minimize cognitive load for others”, “use small, long-lived teams as the standard” – from the Team Topologies book authors
DevEx (Developer Experience) term (new to me) used by the Team Topologies book authors – though this is easily confused with devx.com, I like it. “DevEx must be optimized to maximize feature flow”
CD platform Spinnaker has really taken off with wide adoption and a healthy ecosystem – Spinnaker Summit, now in its third year, is now 500+ attendees over three days, just prior to KubeCon / CloudNativeCon. Armory offers a hosted Spinnaker solution. A couple of my ex-co-workers have presented at Spinnaker Summit and are active members of the community: Joel Vasallo, Steven Basgall
Serverless Frameworks can mean different things, ranging from something as simple as scale-to-zero to functions-as-a-service. Examples of the latter are OpenFaaS (which supports k8s, Swarm, ECS / Fargate) and the vendor-specific AWS Lambda, Google Cloud Functions, Azure Functions; of the former, Knative (more of a serverless building-blocks technology, which appears destined to be commonly paired with k8s)
Platform as a Product leads to a reliable platform
Observability is largely about being smarter re unanticipated failure events (the “weird zone”) and refining craft in using the monitoring / alerting tooling one already has
Security requires ongoing effort, “treat vulnerabilities like earthquakes” (they’re going to happen, maximize resiliency / recovery)
Kubernetes (k8s) ecosystem is strong and getting stronger, many using it in production, GKE currently more widely used than EKS, but EKS is certainly working for some. “Kubernetes is the OS for the cloud”
Service Mesh delivers the most benefit when most or all applications are on a service-mesh-ready deployment technology (incl. k8s, AWS ECS / Fargate). There are competing service mesh solutions (incl. Istio, Linkerd, the AWS-specific App Mesh, a la carte Envoy, and others), with perhaps no clear leader yet (though Knative may drive Istio adoption). “Service mesh does for service-to-service communication what Kubernetes has done for orchestration”

Favorite talks / workshops

The Elephant in the Kubernetes Room: Team Interactions at Scale – Manuel Pais, Independent (co-author of “Team Topologies”) ** slides ** Team Topologies book ** teamtopologies.com ** How Airbnb Simplified the Kubernetes Workflow for 1000+ Engineers

I’m a third of the way into the book, best I’ve read since Lean Enterprise. Really well-written, just the right balance of theory / case-study, and the diagrams are well chosen.

DevEx == Dev Experience – simplify, simplify, simplify
Platform as a Product – leads to a Reliable Platform
Cognitive Load – minimize it within any team, and optimize DevEx by minimizing it for Dev team customers (by providing easy-to-use abstractions)
Platform Team: Platforms fit for purpose, optimized for DevEx
Primary 3 team types: Stream-aligned (feature delivery), Enabling (DevOps embedding / support for stream-aligned), Platform
Platform should make it easier to do the right thing, encouraging dev teams to use the platform and not diverge and be on their own
Kubernetes should be a hidden impl. detail with the Platform providing abstractions for good DevEx

OpenFaaS Cloud + Linkerd: A Secure, Multi-Tenant Serverless Platform – Charles Pretzer, Buoyant & Alex Ellis, OpenFaaS, LTD ** slides

OpenFaaS is “Serverless 2.0” (Any code, Anywhere), vs. “Serverless 1.0” vendor-specific platforms incl. AWS Lambda, Google Cloud Functions, Azure Functions
Anywhere means k8s, Swarm, ECS / EKS, Datacenter, Local
Two options: OpenFaaS, and OpenFaaS Cloud, which bundles Git-based CI / CD, Runtime secrets, OAuth, Linkerd-based Service Mesh
Simplicity (short stack.yml + handler.js for a simple node.js example, vs. 6 distinct longer configs for k8s). Note that a Dockerfile is optional with OpenFaaS, but is supported
Linkerd (bundled with “OpenFaaS Cloud”): Only service mesh currently in CNCF (incubating): Actionable metrics, Deep runtime diagnostics, CLI-debugging, 60sec install, lightweight. Traffic-splitting, mTLS, Dashboard showing routing paths.

No-Nonsense Observability Improvement – Cory Watson, SignalFx ** slides

“The Normal Zone” includes monitoring for Anticipated behaviors
The “Weird Zone” is about Observability of Unanticipated behaviors
Observability will be one of your most expensive projects
Incident Measures++ – traditional ones incl. MTTD, MTTR can be lame – instead look to Nora Jones’ Cyclic Approach: Difficulties in Understanding, System-specific failure rates, Surprises, Lack of ownership, Near misses
Automation – need to avoid human being “out of the loop”
Invest in risk and need
Understand the use cases

Making an Internal Kubernetes Offering Generally Available – James Wen, Spotify ** slides

“Take complexity for your developers” (more complex devops tooling, better abstractions can be worth it for a better DevEx)
Between the extremes “Complete Team Autonomy” and “Centralized Ops” is their happy medium: Ops (embedded) in teams, Core-Infra Org, Golden Path
“Establish trust through monitoring”
Metrics incl.: Status of backups
“If you don’t have restored backups, you don’t actually have backups”

Doing Things Prometheus Can’t Do with Prometheus – Tim Simmons, DigitalOcean ** slides

Metrics need to be Actionable, Contextual
Learn existing tools deeply – more valuable than shiny new ‘observability’ tools
Jeff Smith: “Maintenance is Revenue Protection”
Anomaly detection can be easily done with custom code, don’t always need a product with that feature
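
To make the “custom code” point concrete, here is a minimal sketch in plain Python (not PromQL, and not taken from the talk) that flags a sample when it sits more than a few standard deviations away from the preceding window. The function name, window size, threshold and sample data are all assumptions for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        # z-score of the current sample against the trailing window
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical per-minute latencies (ms) with one obvious spike at index 7.
latencies = [100, 102, 98, 101, 99, 100, 103, 250, 101, 99]
print(find_anomalies(latencies))  # [7]
```

Note that the samples right after the spike are not flagged, because the spike itself inflates the window's standard deviation; a real setup would want to tune the window and threshold against its own data.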

How Yelp Moved Security From the App to the Mesh with Envoy and OPA – Daniel Popescu, Yelp & Ben Plotnick, Cruise ** slides

OPA (Open Policy Agent) case study
OPA incl. unit-testability
OPA decision logs published to log collector (Splunk)
AuthN, AuthZ via Envoy sidecar
For projects like this, start from the use cases, be mindful of scope creep

Design Decisions for Communication Systems – Eric Anderson, Google ** slides

Excellent historical context / pros & cons of Messaging mechanisms incl. gRPC, REST, Unix socket, TCP socket, older ones incl. DCOM / CORBA

Weaveworks EKS AppMesh Gitops workshop

Pretty well-constructed workshop targeting EKS & AppMesh with a GitOps workflow, using Flux as a k8s operator to promote container images
If you were at the conference, you got a nice Cuttlefish shirt with proof of completing this, pictured at the link above

Process Pattern: Blocks and Blocked Swimlanes

Identifying and mitigating blockers is one of the most important practices in keeping projects on-schedule and maximizing throughput.

I’ve observed three different practices used for JIRA projects in the past to track blocked work:

Flag – this provides a strong visual indication on the board, but requires the flag be manually added / removed
Blocked workflow state – clear board visibility (mapped to column), but requires the state be manually entered / exited
Priority field set to Blocker – clear board visibility (same as for any priority field value, may optionally be swimlane’d), but requires field be manually set to Blocker / back to other priority

The approach I like to use is to rely on the JIRA Blocks / Is-blocked-by link type, to show both blocked and blocking tickets, each in their own swimlane at the top of the board. “Blocking” tickets are important to track when your project is blocking another.

These filters require the popular Adaptavist ScriptRunner plugin be installed on the JIRA server.

Example JQL Filters for Blocked and Blocks

JQL Filters

The intent of the generic JQL filters described here is to support adding “Blocked” and “Blocks” swimlanes to any JIRA project’s board.

AllProj-BlockedDirect: Returns all open tickets having a direct “is blocked by” link to at least one open ticket

issueFunction in linkedIssuesOf("hasLinks='blocks' AND resolution is empty", "blocks") AND resolution is empty

AllProj-BlockedIndirectParent: Returns all open tickets having an indirect (parent of blocked sub-task) “is blocked by” link to at least one open ticket

issueFunction in parentsOf("filter=AllProj-BlockedDirect") AND resolution is empty

AllProj-BlockedIndirectSubtask: Returns all open tickets having an indirect (sub-task of blocked-parent) “is blocked by” link to at least one open ticket

issueFunction in subtasksOf("filter=AllProj-BlockedDirect") AND resolution is empty

AllProj-BlocksDirect: Returns all open tickets having a direct “blocks” link to at least one open ticket

issueFunction in linkedIssuesOf("hasLinks='is blocked by' AND resolution is empty", "is blocked by") AND resolution is empty
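
To show what the four filters are meant to return, here is a small in-memory model in Python. It is only a sketch of the intended semantics, not how JIRA or ScriptRunner evaluate the queries; the `Ticket` class, the function names and the ticket keys are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Ticket:
    key: str
    resolved: bool = False
    parent: Optional[str] = None  # parent ticket key, set for sub-tasks
    blocked_by: List[str] = field(default_factory=list)  # "is blocked by" links


def blocked_direct(tickets: Dict[str, Ticket]) -> Set[str]:
    """Open tickets with an "is blocked by" link to at least one open ticket."""
    return {
        t.key for t in tickets.values()
        if not t.resolved
        and any(not tickets[b].resolved for b in t.blocked_by)
    }


def blocks_direct(tickets: Dict[str, Ticket]) -> Set[str]:
    """Open tickets with a "blocks" link to at least one open ticket."""
    return {
        b for t in tickets.values() if not t.resolved
        for b in t.blocked_by if not tickets[b].resolved
    }


def blocked_indirect_parent(tickets: Dict[str, Ticket]) -> Set[str]:
    """Open parents of directly blocked sub-tasks."""
    parents = {tickets[k].parent for k in blocked_direct(tickets)}
    return {p for p in parents if p is not None and not tickets[p].resolved}


def blocked_indirect_subtask(tickets: Dict[str, Ticket]) -> Set[str]:
    """Open sub-tasks of directly blocked parents."""
    direct = blocked_direct(tickets)
    return {
        t.key for t in tickets.values()
        if not t.resolved and t.parent in direct
    }


# Hypothetical data: project A is blocked by project B.
tickets = {t.key: t for t in [
    Ticket("B-1"),
    Ticket("B-2"),
    Ticket("B-3", resolved=True),
    Ticket("A-1", blocked_by=["B-1"]),
    Ticket("A-2", parent="A-1"),
    Ticket("A-3"),
    Ticket("A-4", parent="A-3", blocked_by=["B-2"]),
    Ticket("A-5", blocked_by=["B-3"]),  # blocker already resolved
]}

print(sorted(blocked_direct(tickets)))            # ['A-1', 'A-4']
print(sorted(blocked_indirect_parent(tickets)))   # ['A-3']
print(sorted(blocked_indirect_subtask(tickets)))  # ['A-2']
print(sorted(blocks_direct(tickets)))             # ['B-1', 'B-2']
```

A-5 does not show up anywhere because its only blocker is resolved, which is what the “resolution is empty” clauses on both sides of each filter are for.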

Add Blocked and Blocks Swimlanes to board

The JIRA board swimlanes are defined as follows:

Blocks: Show all open project tickets directly blocking open ticket(s) on any other project
Blocked: Show all open project tickets directly or indirectly (parent-of or sub-task-of) blocked by open ticket(s) on any other project


Notes

To run any of the above JIRA filters standalone for a project (outside the board), simply add a “project=” clause and save them to a project-specific filter name
Be sure to set the saved JQL Filters permissions to “Everyone” or another permission level giving everyone access who needs it
It sometimes makes sense to omit ‘AllProj-BlockedIndirectSubtask’ from the above ‘Blocked’ board filter, to avoid showing too many sub-tasks as blocked
With ScriptRunner Cloud, it is sometimes necessary to do the following for a change to be picked up by the filter and by the swimlane, for each of the 4 filters:
- Refresh the query in the ScriptRunner query editor
- Select “Sync Filter”
- For a Jira board already viewed in a browser: Select “board settings” then “back to board”
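
As a rough illustration of the “project=” note above, the sketch below builds a project-scoped JQL string that refers to the saved generic filters by name using JQL's `filter = "..."` clause. The helper names and the project key `ABC` are hypothetical, and the filter names should match whatever the saved filters are actually called.

```python
def filters_jql(filter_names):
    """OR together references to saved filters by name."""
    return " OR ".join(f'filter = "{name}"' for name in filter_names)


def project_scoped_jql(project_key, filter_names):
    """Restrict one or more saved generic filters to a single project."""
    return f'project = "{project_key}" AND ({filters_jql(filter_names)})'


# Hypothetical project key; the result can be pasted into the JQL editor and
# saved under a project-specific filter name.
print(project_scoped_jql("ABC", ["AllProj-BlocksDirect"]))
print(project_scoped_jql("ABC", ["AllProj-BlockedDirect",
                                 "AllProj-BlockedIndirectParent"]))
```

The second call leaves out the sub-task filter, which is one way to keep a ‘Blocked’ view from filling up with sub-tasks.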

Process Pattern: Sub-task lexicon

One of the challenges I often encounter with JIRA boards is the limited screen real estate available for the ticket summary – often, the visible first few words are not enough to know what the ticket is without clicking on it for detail.

When using sub-tasks, this is often compounded by a sub-task summary which is redundant with the parent ticket, further reducing the effectiveness of the board.

To help improve usability of the board for sub-tasks, I’ve had good success using the “sub-task lexicon” process pattern on a number of projects, described below.

Example Sub-task lexicon prefixes

The following table of example prefixes presumes the following simple workflow states: Backlog -> Selected -> In-Progress -> QA -> Done

A smallish number of prefixes works best to keep it simple and consistent over time – when in DOubt use “DO”:

Prefix	Description	Typ. Workflow States	Examples	Comment
DO	a taskish sub-task	Selected -> In-Progress -> Done	DO Test Env. config.	Do a non-code/qa/spec activity; often requires no AC or QA verify
FIX	a bug	Selected -> In-Progress -> QA	FIX Error when saving	Code the fix; unit-test the fix
IMPL	a story	Selected -> In-Progress -> QA	IMPL User REST resource	Code the story; unit-test the story
QA	E Test, Verify	QA -> Done	QA Error when saving; QA User REST resource	“E Test” = Exploratory Test; Verify ensures AC are met
SPEC	Description detail, AC	Backlog -> Selected	SPEC User REST resource

AC = Acceptance Criteria

Benefits

Improved readability on the board
Standardized prefixes encourage consistency
Encourages the team to think in a routinized way about typical sub-task workflow “profiles” like:
- Feature: SPEC -> IMPL -> QA
- Bug: FIX -> QA
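
The profiles above are regular enough to script. Here is a minimal Python sketch, assuming the prefixes from the table and the two profiles just listed; the function names are hypothetical and nothing here talks to JIRA itself.

```python
# Prefixes from the lexicon table, and the two profiles listed above.
KNOWN_PREFIXES = {"DO", "FIX", "IMPL", "QA", "SPEC"}
PROFILES = {
    "Feature": ["SPEC", "IMPL", "QA"],
    "Bug": ["FIX", "QA"],
}


def subtask_summaries(profile, parent_summary):
    """Generate prefixed sub-task summaries for a parent ticket."""
    return [f"{prefix} {parent_summary}" for prefix in PROFILES[profile]]


def has_known_prefix(summary):
    """Check that a sub-task summary starts with one of the agreed prefixes."""
    first_word = summary.partition(" ")[0]
    return first_word in KNOWN_PREFIXES


print(subtask_summaries("Feature", "User REST resource"))
# ['SPEC User REST resource', 'IMPL User REST resource', 'QA User REST resource']
print(subtask_summaries("Bug", "Error when saving"))
# ['FIX Error when saving', 'QA Error when saving']
print(has_known_prefix("DO Test Env. config."))   # True
print(has_known_prefix("Test Env. config."))      # False
```

A check like `has_known_prefix` could be run over a board export now and then to catch summaries that have drifted away from the lexicon.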

Other notes

Some teams use sub-tasks for Code Review (CR), others rely on external tools’ code review workflow such as GitHub / GitLab
Some teams create the typical sub-tasks up-front (typical of scrum’s sprint-planning), others create them JIT (works well with kanban or scrum-ban)

Whale riding Docker in a sea of Microservices

make development more consistent and deployment more reliable

Saw a couple of interesting talks on Docker / Microservices last week – “State of the Art in Microservices”, the DockerCon Europe 2014 keynote, by Adrian Cockcroft; and “Docker in Production – Reality, Not Hype”, at the March-2015 DevOps-Chicago meetup, by Bridget Kromhout (links below).

Adrian’s Microservices talk was interesting in that it was not limited to the purely technical realm of Microservices and Docker, but also described the organizational culture and structure needed to make it work:

Breaking Down the SILOs – a traditional “Monolithic Delivery” team must interface with each of 8 autonomous silo groups in his example, often using ticket-driven, bottleneck-prone workflow, vs. having two cross-functional “Microservices” teams (Product Team, Platform Team) which each span formerly-silo’d areas of expertise – making the point that introducing these DevOps-oriented cross-functional teams is a Re-Org
Microservice-based production updates may be made independently of other service updates, facilitating continuous delivery by each Microservice team and the reduced-bottleneck, high-throughput that results from it – contrasted with Monolithic Delivery deployments, which work well only with a small number of developers and single language in use
Docker containers facilitate the above by isolating configurations for each Microservice in their own containers, which are resource-light and start in seconds (and might live for only minutes), vs. a traditional VM-based approach which is more resource-hungry, starts in minutes and is typically up for weeks
Microservice Definition: Loosely coupled service oriented architecture with bounded contexts – this is the most succinct definition I’ve seen, contrasted with the broader SOA term, which can describe either a loosely or a tightly coupled architecture (the latter often in the form of RPC-like WSDL / SOAP implementations) – loose coupling is essential for the independent production updates mentioned above, with bounded contexts (how much a service has to know about other services) an indication of loose coupling. A common example of a tightly coupled system is a centralized database schema, with the database being the “contract” between two or more components
AWS Lambda is an interesting service that scales on demand with 100ms granularity (currently in preview)
Example Microservice architectures shown for: Netflix, Twitter, Gilt, Hailo
Opportunity identified – of Docker Hub as an enterprise app store for components
Book Recommendation – Lean Enterprise: Adopting Continuous Delivery, DevOps and Lean Startup at Scale

Bridget’s talk about how DramaFever uses Docker in production (since late 2013) described some of the benefits of using Docker:

Development more consistent – when developers share docker containers for their environment, it both reduces friction during development and eases deployment handoff to shared-dev, QA, staging, production environments. Another side benefit is a production container can be easily and quickly pulled by a developer to a local environment to troubleshoot. In their case they went from a 17min Vagrant-based developer setup (which also differed from production in its configuration) to a < 1min Docker-based one
Deployment more repeatable – scaling via provision-on-demand may be done more confidently and in a more automated fashion knowing that the containers are correct. They take the exact image from the QA environment and promote it to Staging then Prod

… and some technical details / challenges:

Docker containers in the build pipeline – Docker base images (main app, MySQL emulation of AWS-RDS) are built weekly, and Microservice-specific builds of Docker containers are dictated by the Dockerfiles in Git source control – she heavily emphasized the importance of a fully-automated, build-server-driven build and deployment pipeline (Jenkins in their case), with no laptops in the build pipeline
Monitoring beyond the high-level offered by AWS CloudWatch implemented via Graphite, Sentry
Fig (now named “compose”) used to help containers find each other
“Our Own Private Registry” – they found it worked best to run local registries on each workstation rather than a centralized private registry
“Getting the Logs out” – host filesystem may be mounted from within the Docker container, to facilitate log export
“Containerize all the things” – they use Docker for most things, but have found Chef more effective for some of the infrastructure pieces such as Graphite. As she put it, you need to decide “what do you bake in your images vs. configure on the host after the fact”
“About those Race Conditions” – they use the Jenkins “Naginator” plugin to automatically re-run jobs which fail with certain messages such as “Cannot destroy container”
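
The retry-on-known-failure idea from the last item can be sketched in a few lines of Python. This is not how the Naginator plugin is implemented or configured, just an illustration of the pattern; the function names, the attempt limit and the fake job are assumptions.

```python
import re

# Failure messages considered flaky enough to justify a re-run.
RETRYABLE = [re.compile(r"Cannot destroy container")]


def run_with_retries(job, max_attempts=3, retryable=RETRYABLE):
    """Run job() until it succeeds, re-running only on known-flaky failures.

    job() must return (ok, log). Returns the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        ok, log = job()
        if ok:
            return attempt
        flaky = any(pattern.search(log) for pattern in retryable)
        if attempt == max_attempts or not flaky:
            raise RuntimeError(f"job failed on attempt {attempt}: {log}")


def make_flaky_job(failures):
    """Hypothetical job that fails `failures` times with a flaky message."""
    state = {"left": failures}

    def job():
        if state["left"] > 0:
            state["left"] -= 1
            return False, "Error: Cannot destroy container abc123"
        return True, "build ok"

    return job


print(run_with_retries(make_flaky_job(2)))  # 3
```

Failures with any other message are raised straight away, so a genuinely broken build is not masked by the retries.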

I’m looking forward to leveraging Docker to help optimize the deployment process for my current project, which will become even more important as we move toward a more Microservice-based architecture.

References:

Adrian Cockcroft: State of the Art in MicroServices (38min video)
Bridget Kromhout: Docker in Production – Reality, Not Hype (sysadvent article) – related slides with a few added tech details