“Who Should Write the Terraform?”
Software Engineer at Spacelift[0] here - a CI/CD specialized for Infra as Code (including Terraform).
A pattern we're seeing more and more is Platform Engineering teams doing the bulk of the work, including all the fundamentals, guidelines, guardrails, and conventions, while Software Engineers only use those, or write their own simple service-specific Terraform Stacks that nonetheless make extensive use of modules developed by the former.
This does also seem like the sweet spot to me, where most of the Terraform code (and especially the advanced Terraform bits) is handled by a team that's specialized for it. If you don't have a Platform Engineering team, or one that is playing its role (even if it's called DevOps or Ops or SRE) in even a medium company, you'll probably start having as many approaches to your infrastructure as there are teams, complexity will explode, and implementation/verification of compliance requirements will be a chore. Just a few people responsible for handling this will yield huge benefits.
And yes, I can wholeheartedly recommend Spacelift if you're trying to scale Terraform usage across people and teams - and not just because I work there.
Disclaimer: Opinions are my own.
[0]: https://spacelift.io
ITT people arguing for embedding infrastructure engineers into product teams.
Ayyyy, dios mio.
a) If you need to embed, then actually, you need to embed InfoSec, UX, IT, Customer Success, Product, Compliance, etc. etc. for exactly the same reasons. In today's labor-constrained economy, good luck finding qualified people for every role on every team! And if one of them leaves, who ensured that they documented everything for the next guy? Or that you'll find someone to fill the role quickly? If you have a 30 person company, fine, no big deal. 150+ and it starts to become a serious problem.
b) Particularly for infrastructure, you will shoot yourself in the foot on your production cloud bill. If you share no infrastructure with other teams, you get none of the efficiencies that come from sharing it. Conway's Law will burn your runway. If you're 100% serverless then this doesn't really apply, but if you're spinning up eight different Kubernetes clusters for eight different teams then you probably need to collaborate a bit better.
Product teams need to own their product top to bottom. Platform teams need to make that easy for them, because modern stacks are huge, it's not possible to staff a single team with all the necessary experts, and all that expertise is a genuine necessity. The lines are drawn in different places in different companies depending on available labor and technical requirements.
Re this segment:
> "There were endless complaints about the time taken to get ‘central IT’ to do their bidding, and frequent demands for more autonomy and freedom. A cloud project was initiated by the centralised DBA team to enable that autonomy. [...] Cue howls of despair from the development teams that they need a centralised DBA service"
The author makes it sound like users didn't know what they wanted. This is not true -- I have seen this play out in practice, and what the author omits is that _it was a different set of people_ who were complaining before and after.
If a dev team has at least 2 engineers who are happy working with infrastructure, then the team will benefit from autonomy. If there is no one like that on the dev team, they will cry in despair.
I've started to believe that product engineers should manage their own infrastructure. I think the key ingredient is _isolation_, so that it's not that they have to figure out how to fit their service into the unholy single production account with 15,000 running instances, it's that they get to start fresh from a basic template and then move from there. Most services, when isolated from the other microservices, are just not _that_ complicated.
For what it's worth this is how AWS operates, and I think it's the mindset with which they build products. You certainly _can_ go your own way and run something like k8s on top of it and build a mini-cloud in the cloud, but it's incredibly expensive.
It's a mistake I've made repeatedly -- "Oh, I'll just add this little abstraction to make it easier for developers!" But now the poor developer has to understand both the tools I built on top of _and_ whatever I was thinking at the time, and inevitably it's an under-resourced area.
Now, at a certain scale for core services, sure, you'll end up with infrastructure specialized folks. But I'm unconvinced that the place you want to start is, "Okay, I need a new service, better go talk to the beleaguered central team that never has quite enough time for anyone."
A bit of ranting...
As for me
1. HashiCorp is forcing enterprise upsells whenever possible, even if it hurts Adoption Rates and overall Development Experience.
2. Existing TF design issues are ignored, which causes people state management trouble that is irrelevant for TFE. So, yet again, why fix something that will end up in upsells?
3. MPL requires fixes to be made available if someone really does fix something, but it's near impossible to contribute any major design improvements to Terraform.
4. Existing Provider issues are neglected, and accepting working PRs takes around 3-4 weeks...
5. Some Providers (helm) are neglected in favour of the New Product Release (Waypoint provider), and there is a Forced Obsolescence Factor along with Forced Adoption.
Deficient Relationship Marketing is the Key Factor in deciding who Will actually write Terraform (maybe not even HashiCorp), Who will Wrap Terraform and Into What (terragrunt, terraspace, pulumi, crossplane etc or some custom gitops SaaS), and Who will Support the target providers when HashiCorp solutions magically turn into abandonware due to upsells.
If you're interested in more of the author's thoughts on DevOps, Kubernetes or writing, check out an interview I did with him recently: https://kubernetespodcast.com/episode/185-writing/
> The development team didn’t want to – or couldn’t – do the Ops work
Most devs I've spoken to are in this camp: they don't want to do any Ops work at all. They want a 9-5 job without evenings or weekends wasted by services failing. No on-call rota and all that jazz, just writing code, that's all.
I could be wrong, but 20 years of experience tells me that company size has a lot to do with this.
Tiny organisms like amoeba can be simple. But as organism size increases, so too does complexity. They eventually need a nervous system, circulatory system, extra sensors, a more powerful brain to process sensory information and handle movement, motion tracking for hunting. Suddenly, packs of these animals will hunt together, so they'll evolve communication: signals, sounds, language...
Well, if you're a 4-person start-up sitting in the same room, decisions can be made quickly, you don't need departments, managers. But as you grow you need to be extremely careful that you build a nervous system, circulatory system, sensors ... "management brain".
The biggest failures in ops aren't "who does X?". It's about creating right-sized teams that own functions that are important enough to have specific owners. With further growth, certain functions get more complex, and suddenly you might need dedicated network, database & security teams. And if it gets huge, then you probably need multiple copies of those specific functions embedded inside large subsections of the organisation. And they all need to communicate effectively with each other. It's a constant dance. You can't make a single rule and just stick rigidly to it. You need to keep tabs on complexity, workload, morale, lead times. You need to be ready to refactor your teams.
When I hear stories like "it was taking 8 weeks to get a DB provisioned" I think "if that company makes it to IPO and the CTO gets a few $100M, there's absolutely no justice in the world".
There is good stuff in this article, though I wish more writers would hire editors to help trim these articles (I always hire an editor when I write something this long). I think this is the heart of it, though you have to go pretty far into the article to get to this bit:
"What’s the point of this long historical digression? Well, it’s to explain that, with a few exceptions, the division between Dev and Ops, and between centralisation and distribution of responsibility has never been resolved. And the reasons why the industry seems to see-saw are the same reasons why the answer to the original question is never simple."
It is true that the answer is context dependent. I consult with several startups, I give different answers to different CTOs, depending on what stage their organization is at, and how much they will actually need devops in the future (I recently consulted for Paireyewear.com, a company that relies on Shopify to provide the public facing store through which they sell. As such, they will never need much in terms of devops. Instead I brought in Chris Clarke, one of the best devops talents I know, and he consults with them part-time, and that is as much devops talent as they need.)
What a long-winded article to say "it depends". I liked the history though.
It got me thinking, here at amazon, we deliver "infrastructure as code" using the Cloud Development Kit: https://aws.amazon.com/cdk/
We expect engineers (not devops) to define their infrastructure in TypeScript and configure it through code. That code gets turned into CloudFormation templates and stands up the whole cloud system for the API you're building.
I think this is a great hybrid approach. Knowing what you want is different than knowing all the intricate details of defining, say an API gateway. But the CDK lets me stand up an API Gateway and configure it with a swagger and security policy and be done. This lowers the barrier for devs to do devops work, and lets teams own and move fast when making changes.
The author calls out a few reasons why DevOps fails for organizations, all of which I agree with - however, there is one that I've never completely understood: regulatory reasons for keeping Ops centralized.
I work in healthcare which I guess should fall under this rule - but in practice I haven't really seen that impeding DevOps. Teams that have the capabilities to build the full stack get handed a subscription to a cloud provider and they go off and do so. They still fill out and track change logs, audit changes and seek approvals - but after that's done, it's still the team who presses "the button".
Anybody in a regulated industry where you've hit hard walls that prevent you and your team from going full on DevOps? If so, what rules were quoted that stopped you?
The thought leadership seems to be to get Dev and Ops to work directly together and avoid handoffs by creating a totally separate department called DevOps and having them do all their handoffs with dev and ops. You can call them platform engineering so nobody figures it out, though.
> But despite a lot of effort, the vast majority of organisations couldn’t make this ideal work in practice, even if they tried.
This matches my on-the-ground experience. The teams who lived the dream of DevOps were teams which built their software as cloud native (instead of later trying to migrate to the cloud). This is purely because the PaaS tooling let them efficiently be both Devs and Admins.
When you involve many teams instead of just a smallish group of devs, you have momentum to deal with. Plus, specialization - some of these ops people just don't like coding, or at least not the kind of coding you need to be doing to be effective DevOps engineers.
Indeed this leads to SRE - just because "Buying it" is usually easier than "Building it".
At my old employer, "Ops" and "Security" still had a role in managing fundamental components of the system. In AWS terms, that means deploying AWS accounts in the organization, automation for best practices and compliance detection, IAM roles, VPCs, etc.
Security and network teams also built custom Terraform modules that acted as guardrails, which the dev teams were forced to use. You didn't use aws_s3_bucket, you used custom_aws_s3_bucket that mandated certain fields and prerequisites. This was the compromise struck to allow devs otherwise to go ham in their own AWS accounts and self-manage their deploys, databases, and so on.
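To make the guardrail idea concrete, here is a rough sketch in Python rather than Terraform. Everything in it is an assumption for illustration: the `GuardedBucket` class, the `owner_team` field, and the encryption and public-access rules are hypothetical stand-ins, since the actual fields that custom_aws_s3_bucket mandated aren't listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardedBucket:
    """Hypothetical stand-in for a wrapper module like custom_aws_s3_bucket."""

    name: str
    owner_team: str          # assumed mandatory field: no default, so callers must supply it
    encrypted: bool = True   # assumed secure default
    public: bool = False     # assumed secure default

    def __post_init__(self):
        # The wrapper refuses configurations the central team does not allow.
        if not self.owner_team:
            raise ValueError("owner_team is required")
        if not self.encrypted:
            raise ValueError("unencrypted buckets are not allowed")
        if self.public:
            raise ValueError("public buckets are not allowed")

    def render(self) -> dict:
        """Return the settings that would be handed to the underlying resource."""
        return {
            "bucket": self.name,
            "encrypted": self.encrypted,
            "public": self.public,
            "tags": {"owner": self.owner_team},
        }


# A compliant request goes through; a non-compliant one fails early.
ok = GuardedBucket(name="logs", owner_team="payments")
print(ok.render())

try:
    GuardedBucket(name="site", owner_team="web", public=True)
except ValueError as err:
    print(f"rejected: {err}")
```

The point of the pattern is that the dev team only ever sees the wrapper's narrow interface, so the mandated fields and defaults travel with every bucket without anyone having to remember them.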
At my company, I do. As a backend engineer. Guided by the devops team.
It is a nightmare of arcane copy pasting.
Terraform is an overengineered mess, a complex enemy I need to beat to deploy my simple changes.
What is a Platform? An excuse for an executive to adopt a trend and pass the buck to the next exec after he gets promoted for accomplishing his initiative (but before it's apparent that it was all a sham).
How we got here? Business doesn't want to pay for a well designed enterprise and the organization is shitty, so hire people who aren't very good at tech to build an unnecessarily complicated engineering organization that [after they waste millions poorly building cloud tech without prior experience, realize is] still a cost center and tell them to chase fads.
Factor One: Non-Negotiable Standards. Tell everyone they have to do the same thing, even if it makes no sense for what they're building or supporting.
Factor Two: Engineer Capability. Make sure you put unrealistic deadlines in the hands of amateur engineers and then turn up the scope creep.
Factor Three: Management Capability. Make sure your management can always blame somebody else for why your ridiculous initiative and poorly managed company didn't achieve its goals by its stated deadline. Market timing and "I didn't have enough resources" are good stand-bys.
Factor Four: Platform Team Capability. Pay a million in salaries to some middling full-time engineers, put them in a silo, make them build really basic tech from scratch that 50 different managed service companies sell for pennies. Don't Scrum with the teams that will be forced to use it. Make sure everyone is required to use the platform, even when it's not actually ready to go live, so that building any kind of product at all is mostly infeasible, incredibly slow, and painful.
Factor Five: Time to Market. Do everything you can to avoid value chain analysis, training employees on standard practices, unified communications, or getting stakeholders to work with you on initiatives. When your competitor lands a feature a year earlier than you planned, blame the consultants/contractors you never listened to.
Who should write the terraform? An overworked systems engineer in a siloed team. Definitely not someone working on the product. This way they can write 5 layers of unnecessary module abstractions, be unaware of how non-functional the module is from not actually running it on the product [and watching it fail 6 ways from sunday], and still not provide what the business needs.
Developers should be able to do the work.
"we aren't living in 2016 anymore, and the cloud moves fast. Platform teams are expensive and hard to do, offer a mediocre service at best, destroy velocity, and create bad incentives." [1]
[1]: https://twitter.com/iamvlaaaaaaad/status/1534489514818686976...
I've had some thoughts around this issue more recently after moving from DevOps -> Software Engineering.
I love the idea of cross functional teams, but from what I have seen of the most recent implementation of it that I'm working in, there are as always, issues of definition around what a cross functional team actually is and should be.
IMO grabbing a bunch of backend SEs and making them handle their own DevOps is a joke of Academy Awards host level proportions. The shit I see as an ex DevOps dude is horrific. The notion that a bunch of people who've never done the role can somehow figure it out without specific training doesn't work, from my experience.
A cross functional team should actually be cross functional, where you have an engineer, whose specialty is the work you intend for them to complete within that team. Otherwise we're just being overburdened with extra shit that we frankly will never get the time to actually complete in a meaningful way, and it just generates more and more technical debt.
This misses the point a bit. Even if app teams write terraform, there is no way a security constrained company will let them deploy it without running a security check (OPA, Checkov).
So, either way, a large organization is going to punt that terraform/cfn/cdk template down a pipeline with a bunch of automated compliance reviews. Whether the App team or Ops team wrote it.
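As a rough illustration of what such an automated review step does, here is a minimal Python sketch. It is not OPA or Checkov, just a hand-rolled stand-in. It assumes the plan has been exported as JSON (the `resource_changes`, `change.actions` and `change.after` keys follow the `terraform show -json` plan format), and the rule itself, that every aws_s3_bucket needs an `owner` tag, is a hypothetical policy invented for the example.

```python
REQUIRED_TAG = "owner"  # hypothetical policy: every S3 bucket must carry this tag


def find_violations(plan: dict) -> list:
    """Return one message per planned aws_s3_bucket that lacks the required tag."""
    violations = []
    for rc in plan.get("resource_changes", []):
        after = rc.get("change", {}).get("after")
        if after is None:
            # Resource is being deleted; nothing to check.
            continue
        if rc.get("type") != "aws_s3_bucket":
            continue
        tags = after.get("tags") or {}
        if REQUIRED_TAG not in tags:
            violations.append(f"{rc.get('address')}: missing required tag '{REQUIRED_TAG}'")
    return violations


# Hand-written sample shaped like a plan export; not real output.
SAMPLE_PLAN = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
         "change": {"actions": ["create"], "after": {"bucket": "logs", "tags": {}}}},
        {"address": "aws_s3_bucket.assets", "type": "aws_s3_bucket",
         "change": {"actions": ["update"], "after": {"bucket": "assets", "tags": {"owner": "web"}}}},
        {"address": "aws_s3_bucket.old", "type": "aws_s3_bucket",
         "change": {"actions": ["delete"], "after": None}},
        {"address": "aws_instance.app", "type": "aws_instance",
         "change": {"actions": ["create"], "after": {"tags": {}}}},
    ]
}

for message in find_violations(SAMPLE_PLAN):
    print(message)
```

A pipeline would fail the run when the list is non-empty, and the check runs the same way regardless of whether the App team or the Ops team wrote the template.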
My experience being on a team that owned its infrastructure was that it wasn't really a terrible experience per se, but there was so much time between stories that required infra changes that the context decay was massive. We always managed, but it would take a lot of time to rebuild context and remember where everything was and generally how Terraform worked.
I've been in a team where Platform/Infrastructure Engineers handled everything Terraform and it was great. You just described what you wanted to them and they did it. Developers never touched a .tf file.
Then I moved to a team where Ops wrote Terraform but also expected developers to contribute. They pitched this as "Developers should be able to make small changes". Turned out we had very different understandings of the definition of "small".
I'm currently in a team with no Ops and developers are fully responsible for managing infra all the way to production. The Terraform implementation is an absolute mess. There is, however, an understanding that it needs fixing and Ops support has been promised.
My answer to "Who should write Terraform" is it's the Platform Engineers. A developer can maybe optionally pitch in if they feel confident enough but ultimately Platform Engineers should own the platform.
Wow that was excellent, very thorough but also easy to read and with minimal fluff
My personal take is that DevOps doesn't work (for me, and probably many others) because it amounts to context-switching (recently featured on HN: https://news.ycombinator.com/item?id=32390499). By being responsible for both Dev and Ops, my time (and my brain) gets split 50/50 into two entirely different sets of:
- Concerns
- Languages
- Tools
- Mindsets
This is both super draining, and counter-productive, for me. If Ops can be made so simple (by a platform team or otherwise) that it doesn't amount to a whole separate headspace, then great, I'll manage instances myself. But as long as it's a whole separate domain, trying to have one foot on each side of the fence is just not going to be workable.
The answer to "Who Should?" anything in my organization is "me". I go from writing ruby, to terraform, to javascript, html, css, to bash scripts, SQL, etc. Oh, and I have to manage people, and do code reviews, and support, and meet with clients...
Help... me...
Anyway, I've got the members of my dev team writing terraform for their changes now too. It's working, more or less. They are excited to do it because it pads their resumes, because it's new. But we continue to increase demands on devs, they need to get paid for their trouble or the responsibilities must be diffused.
On "how we got here" you use "bulleted list" rather than "numbered list". This is important as "If you rearrange the items in a bulleted list, the list's meaning does not change. "
Credit to https://developers.google.com/tech-writing/one/lists-and-tab... which pointed this out and has stuck in my craw ever since.
Good article, I enjoyed it. I agree with the premise that every company is different and they need to adopt what works for them. Ownership alone is a huge issue I've run into in the past.
The answer to this question is that you should never be writing Terraform/CDK from scratch, you are wasting time.
1. Scaffold your infrastructure with simple point & click in web console.
2. Generate terraform/CDK code by scanning your AWS account with typically available tools.
3. Edit and update said Infrastructure as Code as needed, swapping out the parameters with the vectors you need to change according to CI/CD
The whole "i want to write infrastructure as code from day 1" is not only stupid, it's a waste of resources.
> The development team didn’t want to – or couldn’t – do the Ops work
But this is just because companies wanted to put the "ops" work onto the shoulders of developers. What should be done is to hire one (or more) specific "ops/platform" engineers per team. Such engineers are the gateway for the team for all platform-related stuff. I'm not talking about SREs here. I think SREs are more about making the products as performant and efficient as possible (while platform engineers per team are more about setting up infrastructure). Sure, both roles (in addition to the SWE role) do their job best if they are working together in the same team.
What I see nowadays in small and mid-size companies is either:
1. There is a "platform" team. They own infrastructure repositories, but they let product teams make PRs to such repos (e.g., the platform team usually creates some kind of guidelines for managing infrastructure, like "How to create a staging mongo db"). The "platform" team is in charge of reviewing such PRs and merging them. Now, there are certain aspects of the infrastructure that only the "platform" team can actually work on (because the product teams either don't care about it or don't know about it). This doesn't work because the "platform" team becomes a bottleneck when the number of product teams starts to grow (the #platform Slack channel becomes a nightmare with dozens of requests per day. Many platform engineers end up burned out because they see themselves as "customer service" for developers)
2. Developers pushing "product features" and at the same time they do "infrastructure" stuff. Companies usually call this "you build it, you run it". In reality it's just cheap management (companies don't want to hire infrastructure engineers and they think the developers are excited to learn "docker/k8s/aws/gcp/terraform", so let them have fun). This is ultimately a nightmare for many developers because they end up burned out ("I want to work on product features! I don't want to fix GitLab pipelines").
I think the original idea of DevOps is totally valid. Just don't force your SWEs to work on infrastructure stuff. Instead, hire one or more infrastructure engineers for every product team you have. This way SWEs (dev) and infrastructure engineers (ops) can work close together and push stuff faster. Obviously almost no company is doing this because it is more expensive than the alternatives stated above.
Would you let your SWEs design your frontpage? No. They obviously have a voice in the process of designing the frontpage, but ultimately the ones that should design it are your Product Designers (obviously for this to work, both your SWEs and your PDs should work in the same team).
I've thought for a while that sysadmin, operator, "devops engineer" and sre were all more or less the same job, but always felt like saying it would be silly of me.
In the future, I'll just link to this piece.
If you're at a company that doesn't have a Platform team, but that still struggles with wanting centralized guardrails and best practices, and a consistent set of patterns across services and teams, the answer to this question might be a developer experience platform like what we're building at Coherence (I'm a cofounder). In this case, someone else writes the terraform, and you just tell us how to map it onto your code. This lets us give you nice things like a dashboard to manage deployments, cloud IDEs, branch preview environments, etc. while still giving your dev/devops folks total control and visibility, since it runs in your own cloud...
Would love anyone interested to give it a spin at withcoherence.com and please feel free to ping hn@withcoherence.com with any feedback or issues!
At Terrateam[0] we specialize in Terraform automation with GitHub Pull Requests and GitHub Actions.
We talk to a lot of Terraform users from a lot of different companies. The most popular way of doing things is having your SRE/DevOps team write the bulk of the Terraform modules for your organization. Other members of engineering then consume these modules to create resources for their platform/application/etc. This code can either live in a Terraform monorepo or inside an application-specific repo. We've seen many approaches.
Scaling Terraform inside your organization is incredibly convenient with Terrateam as we leverage many pieces of GitHub.
[0]: https://terrateam.io
With the history lesson part I find this article totally omits the actual technology change where we stopped hand cranking individual servers and started treating infrastructure as code…
Usually the answer in my circles is "the DevOps team".
No one
Nobody should write terraform
Devs writing Terraform means ops doesn't move fast enough. Fix ops instead of forcing the teams to roll their own.