Infrastructure as Code with Terraform: Real-World Patterns and Pitfalls
Stop treating Terraform like a script and start treating it like software. From state management at scale to the testing revolution, here is how we build resilient infrastructure in 2026.

The 3 AM State Lock
You are at the terminal at 3 AM. A terraform apply just stalled halfway through a change to your production VPC, and now your primary database is stuck in a 'modifying' state with no clear path back. The CI/CD pipeline is red, the site is down, and your state file is locked. This isn't a failure of the tool; it's a failure of architecture. In my fifteen years of building production systems, I've seen more outages caused by 'Infrastructure as Code' than by manual errors, simply because we treat our HCL (HashiCorp Configuration Language) like a set of scripts rather than a mission-critical software application.
Why Terraform Architecture Matters in 2026
In 2026, we aren't just provisioning virtual machines. We are managing multi-cloud service meshes, ephemeral development environments, and policy-as-code sidecars. Recent Terraform releases have significantly improved the developer experience, with the native test framework (1.6), provider mocking for tests (1.7), and provider-defined functions (1.8), but the fundamental trap remains: the 'Big Ball of State.' When you put your VPC, your RDS cluster, and your Kubernetes workloads in the same state file, you create a monolithic blast radius. Every plan refreshes every resource in that state, so one minor change to a security group means re-reading, and potentially touching, your entire stack. To survive at scale, you must decouple.
Section 1: The Micro-State Pattern and Layered Architecture
The most critical pattern I've implemented across high-growth startups is the 'Micro-State' pattern. Instead of one massive root module with a single state file, we split infrastructure into logical layers, each with its own lifecycle and state file.
- Core Layer: Global networking, IAM roles, and DNS. Changes once a quarter.
- Platform Layer: EKS clusters, RDS instances, and shared Redis caches. Changes once a week.
- Application Layer: Task definitions, ingress rules, and app-specific S3 buckets. Changes multiple times a day.
By using the terraform_remote_state data source, the Application layer can read the VPC ID from the Core layer's outputs without managing the VPC itself. Because the VPC is not in the Application layer's state, no plan in that layer can propose changing or deleting it. Give the Application pipeline read-only access to the Core state and you limit the blast radius: if an engineer fat-fingers an ingress rule, the damage stays inside the Application layer.
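A minimal sketch of that hand-off, assuming the Core layer keeps its state in an S3 backend and declares an output named vpc_id. The bucket name, state key, region, and security group below are hypothetical placeholders, not values from any real setup.

```hcl
# Application layer: a read-only view of the Core layer's outputs.
data "terraform_remote_state" "core" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state" # hypothetical bucket
    key    = "core/terraform.tfstate"  # hypothetical state key
    region = "eu-west-1"               # hypothetical region
  }
}

# The VPC is referenced by ID only; it is not managed in this state.
resource "aws_security_group" "app" {
  name   = "app-ingress"
  vpc_id = data.terraform_remote_state.core.outputs.vpc_id
}
```

The data source needs read access to the whole Core state snapshot, so keep sensitive values out of that layer where you can.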
Section 2: Testing Infrastructure with Native Tooling
For years, testing Terraform meant using 'Terratest' and writing Go code. While powerful, the barrier to entry was too high for many platform teams. In 2026, we lean heavily on the native terraform test framework. It allows us to perform unit tests and integration tests using HCL itself. This is no longer optional; if your module doesn't have a .tftest.hcl file, it shouldn't be in production.
Example: Validating a VPC Module
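Here is an example of a modern test suite. It assumes the module under test declares aws_vpc.main, a counted aws_subnet.private, and the vpc_cidr and az_count variables. Both runs use command = plan, so the checks that the VPC CIDR matches requirements and that one private subnet is planned per requested availability zone run before a single real resource is created. The AWS provider still needs credentials at plan time unless you replace it with a mock_provider block (Terraform 1.7+).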
```hcl
# main.tftest.hcl

variables {
  vpc_cidr = "10.0.0.0/16"
  az_count = 3
}

run "validate_vpc_logic" {
  command = plan

  assert {
    condition     = aws_vpc.main.cidr_block == var.vpc_cidr
    error_message = "VPC CIDR block does not match the input variable."
  }

  assert {
    condition     = length(aws_subnet.private) == var.az_count
    error_message = "The number of private subnets created does not match az_count."
  }
}

run "verify_tags" {
  command = plan

  assert {
    condition     = aws_vpc.main.tags["ManagedBy"] == "Terraform"
    error_message = "Resources must have the ManagedBy tag set to Terraform."
  }
}
```
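For context, a module that would satisfy these assertions might look like the sketch below. It is an assumption about the module under test, not the actual module behind this article: the /20 subnet sizing and the availability zone lookup are illustrative choices.

```hcl
# main.tf (hypothetical module under test)

variable "vpc_cidr" {
  type = string
}

variable "az_count" {
  type = number
}

# Look up the zones available in the configured region.
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "main" {
  cidr_block = var.vpc_cidr

  tags = {
    ManagedBy = "Terraform"
  }
}

# One private subnet per requested availability zone.
resource "aws_subnet" "private" {
  count = var.az_count

  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 4, count.index) # /16 -> /20 slices
  availability_zone = data.aws_availability_zones.available.names[count.index]
}
```

Running terraform test from the module directory picks up any .tftest.hcl files next to this configuration.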
Section 3: Safe Refactoring with 'moved' Blocks
One of the biggest pitfalls in the early days of Terraform was renaming a resource. Terraform would see the new name, plan a destroy of the old address, and then a create at the new one. For a database, this is catastrophic. Since Terraform 1.1, the moved block lets us refactor our code, such as moving a resource into a module, without impacting the underlying infrastructure. It is a declarative way to tell Terraform that an existing object now lives at a new address in state.
Example: Moving a Resource into a Module
Imagine you started with a standalone AWS instance and now want to wrap it in a compute module. Instead of running terraform state mv (which is manual and error-prone), you use this:
```hcl
# refactor.tf
moved {
  from = aws_instance.web_server
  to   = module.compute_cluster.aws_instance.web_server[0]
}

# The module invocation
module "compute_cluster" {
  source = "./modules/compute"
  # ... variables
}
```
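When you run terraform plan, Terraform detects the moved block and reports that the resource has moved instead of planning a destroy and a create. The next apply rewrites the address in state and touches no real infrastructure. This is how you pay down technical debt without scheduling downtime.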
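The plain rename case from the start of this section follows the same shape. A sketch, assuming a database originally declared as aws_db_instance.db that you now want to call aws_db_instance.primary; both names and all argument values are hypothetical.

```hcl
# Old address -> new address. Keep this block in place until every
# environment that uses the configuration has applied it.
moved {
  from = aws_db_instance.db
  to   = aws_db_instance.primary
}

# The resource block itself is unchanged apart from its label.
resource "aws_db_instance" "primary" {
  identifier                  = "orders-db"
  engine                      = "postgres"
  instance_class              = "db.t4g.micro"
  allocated_storage           = 20
  username                    = "app"
  manage_master_user_password = true
}
```

If the arguments match what is already deployed, the plan should show only the move and no changes to the database.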
Section 4: Security and the OIDC Revolution
Stop using static IAM user keys in your CI/CD pipelines. It's 2026; if you have a credentials file on a runner, you're doing it wrong. We now use OIDC (OpenID Connect) for short-lived, identity-based access. Whether you're using GitHub Actions, GitLab CI, or HCP Terraform (formerly Terraform Cloud), the runner presents a signed identity token and exchanges it for temporary credentials by assuming a role, so there is no long-lived secret to leak or rotate.
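On AWS with GitHub Actions, the trust relationship might be sketched as follows. This assumes the GitHub OIDC provider is already registered in the account; the role name and the repository in the sub condition are hypothetical and should be narrowed to your own repository and branch.

```hcl
# Assumes the GitHub OIDC identity provider already exists in this account.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

# Trust policy: only tokens issued for one repository and branch may assume the role.
data "aws_iam_policy_document" "ci_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/example-infra:ref:refs/heads/main"] # hypothetical repo
    }
  }
}

resource "aws_iam_role" "terraform_ci" {
  name               = "terraform-ci" # hypothetical role name
  assume_role_policy = data.aws_iam_policy_document.ci_trust.json
}
```

The role still needs permissions policies attached for whatever the pipeline manages; the trust policy only controls who may assume it.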
Pro Tip: Combine OIDC with 'Policy as Code' using OPA (Open Policy Agent). We evaluate every plan against a policy that rejects any security group allowing 0.0.0.0/0 on port 22. If one does, the pipeline fails before anything is applied.
Real-World Gotchas: What the Docs Don't Tell You
- The Circular Dependency Trap: This often happens between ECS services and Load Balancers. The LB needs the Service for the Target Group, and the Service needs the LB for the listener. Break the cycle by defining the aws_lb_listener_rule as a separate resource from the aws_lb_listener.
- Count vs. For_Each: Never use count for resources that might be removed from the middle of a list (like a list of users). With count, Terraform identifies each instance by its index. If you delete user #2 in a list of 5, #3 shifts to #2, #4 to #3, and so on, forcing a change or recreate of every subsequent resource. Always use for_each with a unique map key.
- Provider Bloat: Every provider you add increases the time it takes to run terraform init and plan. Pin your provider versions using ~> to allow minor updates but prevent breaking major changes, and commit .terraform.lock.hcl so every run resolves the same provider build. I've seen builds fail because a cloud provider released a breaking change to their API and Terraform pulled the latest provider version automatically.
- The Ghost in the State: Sometimes, manual changes in the console ('ClickOps') create drift that Terraform can't easily fix. Before you attempt a major apply, use terraform plan -refresh-only to review the drift and terraform apply -refresh-only to record it in state, neither of which changes any infrastructure.
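Two of these gotchas translate directly into configuration. The sketch below pins the AWS provider with ~> and keys IAM users by name with for_each; the user names and version numbers are illustrative assumptions.

```hcl
terraform {
  required_version = ">= 1.7.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # any 5.x release, never 6.0
    }
  }
}

variable "users" {
  type    = set(string)
  default = ["alice", "bob", "carol"] # hypothetical user names
}

# Each instance is addressed by name, e.g. aws_iam_user.team["bob"],
# so removing one entry never shifts the others.
resource "aws_iam_user" "team" {
  for_each = var.users

  name = each.key
}
```

Removing "bob" from the set plans a single destroy for aws_iam_user.team["bob"] and leaves the other users untouched.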
Takeaway: Audit Your Blast Radius Today
The most impactful thing you can do right now is to audit your state files. As a rule of thumb, if terraform state list returns more than 100 resources, or if a terraform plan takes longer than 3 minutes, your state is too large. Choose one logical component, like your database or your networking layer, and migrate it to its own state file. Note that moved blocks only work within a single state, so crossing state files means adopting the resources in the new configuration with import blocks and releasing them from the old one with removed blocks set to destroy = false (or with terraform state rm). Your future self, standing over a terminal at 3 AM, will thank you.

Tags: Terraform, DevOps, Infrastructure as Code, SRE, Automation
Author: Ugur Kaval