So, what’s been going on? I’ve been a bit lax in blogging here of late, which I hope to fix in the near future. So what’s the news?
Well, news item number one is that I’m about to move on from Domain Group, where I’ve been Windows DevOps Wonk for the last three years, and head to Octopus Deploy, where I’ll be doing some really exciting Cloud Architecture work and generally trying to reach as many people as possible with the good news of Modern Windows Ops.
In early June, I’ll be leaving Domain and heading for the awesome @OctopusDeploy where I’ll be Cloud Architecting lots of exciting things.
— Jason Brown (@cloudyopspoet) May 16, 2017
So that’s the first thing out of the way. What else has been going on?
Oh yeah, there’s that whole WannaCry thing which went by. At Domain we were entirely unaffected. Why?
Well, most of us are running Windows 10 on our laptops, which was immune to the specific vulnerability. That was a major factor. But I don’t manage the client OS fleet. I manage the servers.
Good solid firewall practice was a major factor. SMB would never be open to the internet, and we have periodic security scanning that checks our Cloud environments against an exhaustive set of rules. We absolutely don’t allow SMB shares on our fleet – that was common practice at one time, but rapidly deemed anticloud because it does nothing except enable bad deployment practice.
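To give a flavour of what one rule in that kind of scan might look like, here’s a minimal sketch in Python. It is not our actual tooling. It assumes the security groups have already been fetched as dictionaries shaped like the EC2 `DescribeSecurityGroups` response (`IpPermissions`, `FromPort`, `ToPort`, `IpRanges`), and the function names and sample group IDs are made up for illustration.

```python
# Illustrative only: flag security groups that open SMB to the whole internet.
# Input is assumed to look like the "SecurityGroups" list returned by the
# EC2 DescribeSecurityGroups API. Function names here are hypothetical.

SMB_PORTS = {139, 445}               # NetBIOS session service and SMB over TCP
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}   # "anyone on the internet"


def rule_exposes_smb(permission):
    """Return True if one ingress permission opens an SMB port to the world."""
    protocol = permission.get("IpProtocol")
    if protocol == "-1":
        # "-1" means all protocols and all ports
        covers_smb = True
    elif protocol == "tcp":
        low = permission.get("FromPort", 0)
        high = permission.get("ToPort", 65535)
        covers_smb = any(low <= port <= high for port in SMB_PORTS)
    else:
        covers_smb = False
    if not covers_smb:
        return False

    cidrs = [r.get("CidrIp") for r in permission.get("IpRanges", [])]
    cidrs += [r.get("CidrIpv6") for r in permission.get("Ipv6Ranges", [])]
    return any(cidr in OPEN_CIDRS for cidr in cidrs)


def find_exposed_groups(security_groups):
    """Return the GroupId of every group with at least one offending rule."""
    return [
        group["GroupId"]
        for group in security_groups
        if any(rule_exposes_smb(p) for p in group.get("IpPermissions", []))
    ]


# Made-up sample data: one web-only group, one group that leaks SMB.
sample_groups = [
    {"GroupId": "sg-web", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    {"GroupId": "sg-oops", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 445, "ToPort": 445,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
]

print(find_exposed_groups(sample_groups))
```

A real scanner would cover far more than this one rule, but the shape is the same: pull the current state, compare it against what you allow, and shout about anything that doesn’t match.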
However, an interesting wrinkle on the subsequent community debate: At Domain, we turn off Windows Update on our Robot Army Windows Server fleet.
“WHAAAAAT?” you say. “WHYYYY?”
There’s a specific reason we do this on these instances. We found quite early on that Windows Update would occasionally fire off on instances and push CPU usage to 100%, triggering scale-up events. In some cases, we’d end up with alerts and minor outages as a result of this behaviour. It also skewed the metrics we collect on resource usage by causing spikes at weird times, and was known to delay deployments.
So we made a considered, reasoned decision to disable Windows Update on the autoscaling fleet. That’s a few hundred boxes right there.
As threat mitigation, we don’t allow RDP access to those boxes, we run Windows Server Core Edition with a minimal set of features enabled, and we closely monitor and correct changes to things like Security Group rules, service state and installed Windows features. All boxes are disposable at a moment’s notice, and we renew our AMIs on a regular basis – sometimes several times a week. Those images are fully patched and tested – with a suite of Pester tests – at creation time. We also maintain a small number of more “traditional” servers, which do have updates enabled, but none of these run customer-facing workloads.
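The “monitor and correct” part boils down to comparing what a box actually has against a known-good baseline. Here’s a rough sketch of that idea in Python, again not our real tooling: the `find_drift` name is hypothetical, and the baseline and observed feature lists are invented for the example.

```python
# Illustrative only: detect drift between a baseline set of installed
# Windows features and what was observed on an instance.
# The feature lists below are invented for the example.

def find_drift(baseline, observed):
    """Return features that appeared or disappeared relative to the baseline."""
    return {
        "unexpected": sorted(observed - baseline),  # installed but not allowed
        "missing": sorted(baseline - observed),     # allowed but gone
    }


baseline_features = {"Web-Server", "NET-Framework-45-Core"}
observed_features = {"Web-Server", "NET-Framework-45-Core", "FS-SMB1"}

drift = find_drift(baseline_features, observed_features)
print(drift)
```

On a disposable fleet, the “correct” step can be as blunt as terminating the drifted instance and letting the autoscaling group replace it from the current image.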
Make no mistake, Troy Hunt is absolutely right that no client OS should have updates disabled. But a modern server workload may have a case for it, as long as other measures are taken to protect the OS from threats. Your mileage may vary. Treat all advice with caution. I am not a doctor.
Last, here’s a new bit from Microsoft Virtual Academy (published 15 May 2017) which I think did a decent job of explaining modern DevOps practices to the curious or confused. The video and I certainly differ on some specific points of dogma, but the big picture is good – automate, tighten your feedback loops, throw away the old stuff, treat servers as software objects, move fast, apply laziness early on, build often, deploy often etc. Worth a look even if you’re a grizzled veteran, I’d say.