DevOps engineers write a lot of YAML. A lot of Bash. A lot of Dockerfiles that fail at layer 11 of 12. A lot of GitHub Actions workflows that work perfectly in dev and break inexplicably in CI.
Claude Code changes this. Not because it writes perfect infrastructure code on the first try — it doesn't. But because it understands why your Docker layer cache is busted, why your GitHub Actions matrix job is timing out, and why your Terraform plan is drifting. It reads your configs, reasons through the problem, and gives you a fix — not a generic StackOverflow answer.
This guide covers how DevOps engineers actually use Claude Code in production. Real prompts. Real workflow patterns. CLAUDE.md setup for infrastructure repos. And the mistakes that cost you time.
1. GitHub Actions: Writing & Debugging Workflows
GitHub Actions is where Claude Code delivers the most immediate ROI for DevOps work. Workflow YAML is verbose, unforgiving, and full of gotchas — expression syntax, context variables, concurrency groups, job outputs — that take time to get right even with the docs open.
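Two of those gotchas, concurrency groups and job outputs, are easier to see in a small example. What follows is a minimal sketch rather than a recommended pipeline; the job names, the step id `meta`, and the output name `tag` are hypothetical.

```yaml
# Minimal sketch: a concurrency group plus a job output consumed by a later job.
# Job names (version, deploy), the step id (meta), and the output name (tag)
# are hypothetical.
name: concurrency-and-outputs
on:
  push:
    branches: [main]

# Cancel an in-progress run when a newer commit arrives on the same ref.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  version:
    runs-on: ubuntu-24.04
    outputs:
      # A job output has to be mapped explicitly from a step output.
      tag: ${{ steps.meta.outputs.tag }}
    steps:
      - id: meta
        # Step outputs are written to the file named by $GITHUB_OUTPUT.
        run: echo "tag=${GITHUB_SHA::7}" >> "$GITHUB_OUTPUT"

  deploy:
    needs: version
    runs-on: ubuntu-24.04
    steps:
      # The needs context only exposes outputs of jobs listed under needs.
      - run: echo "Deploying ${{ needs.version.outputs.tag }}"
```

Forgetting the `outputs` mapping on the producing job is a frequent cause of an empty value in the consuming job, which makes it a good thing to ask Claude to check first.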
Creating Workflows From Scratch
The key is context. Don't ask Claude to "write a CI pipeline." Describe your stack, your deployment target, and your constraints.
A prompt that spells out those details gets you most of the way to a production-ready workflow because it gives Claude the exact constraints it needs to make good decisions: cache strategy, AWS auth method, registry URI, deployment target.
Debugging Workflow Failures
This is where Claude Code earns its keep. Paste the failing workflow and the error log, then ask Claude to diagnose the root cause.
A common failure: Error: unable to get local issuer certificate in a Node.js job that works fine locally. Claude will typically trace this to a certificate authority the runner does not trust, most often a corporate TLS-inspecting proxy or a private CA. It will then suggest setting NODE_EXTRA_CA_CERTS in the job environment and explain why the error only manifests in the runner environment.
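As a rough illustration of that fix, the job below writes a CA bundle to disk and points NODE_EXTRA_CA_CERTS at it. This is a sketch under assumptions: the job name, the secret name `CORP_CA_PEM`, and the file path are hypothetical, and your organization may distribute its CA bundle differently.

```yaml
# Sketch: make Node.js trust an additional CA inside a CI job.
# The secret name (CORP_CA_PEM) and the file path are hypothetical placeholders.
jobs:
  test:
    runs-on: ubuntu-24.04
    env:
      # Node.js reads this PEM file at process start and adds its
      # certificates to the built-in trust store.
      NODE_EXTRA_CA_CERTS: /tmp/corp-ca.pem
    steps:
      # Write the bundle first so every later Node.js process can load it.
      - name: Write CA bundle
        run: printf '%s' "$CORP_CA_PEM" > "$NODE_EXTRA_CA_CERTS"
        env:
          CORP_CA_PEM: ${{ secrets.CORP_CA_PEM }}
      - uses: actions/checkout@v4
      - run: npm ci
```

The variable is set at the job level so that every step inherits it; the bundle is written before any Node.js process starts because Node.js only reads the file at launch.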
Matrix Builds & Conditional Logic
2. Docker: Optimizing Builds & Fixing Failures
Dockerfiles are deceptively simple to write and surprisingly hard to optimize. Cache busting, layer ordering, multi-stage builds, and base image choices all compound. Claude Code excels here because it can read your entire Dockerfile and reason about the full build graph.
Dockerfile Review & Optimization
Debugging Build Failures
Docker build failures deep in a dependency installation are painful. The error message rarely tells you the real cause.
Docker Compose for Development Environments
Always give Claude your exact base image tag (e.g., node:20.14-alpine3.19) rather than a generic version. It changes the recommendations significantly — especially for security hardening and available package managers.
3. Kubernetes: Config Generation & Debugging
Kubernetes manifests are notoriously verbose and error-prone. A missing field in a resource limits spec, a wrong selector label, a probe that doesn't match the container port: these mistakes often pass review and only surface at runtime. Claude Code can catch many of them before kubectl apply.
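To make those three failure points concrete, here is a minimal sketch of a Deployment and Service in which the labels, ports, and resource settings line up. The names, image, port number, and health endpoint are all placeholders, not values from any real cluster.

```yaml
# Sketch: the fields that most often drift apart, kept consistent.
# Names, image, port, and the /healthz path are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api                # must match the pod template labels below
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.0.0
          ports:
            - name: http
              containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: http      # named port, so it cannot drift from containerPort
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
---
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api                  # a typo here is accepted but leaves the Service with no endpoints
  ports:
    - port: 80
      targetPort: http
```

Referring to the port by name in both the probe and the Service is one way to remove a class of mismatch entirely; the Service selector still has to be checked by eye or by a reviewer such as Claude.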
Generating Production-Ready Manifests
Debugging CrashLoopBackOff and Other Failures
RBAC Configuration
4. Terraform & Infrastructure as Code
Terraform is where context matters most. Claude Code needs to understand your provider version, existing state, and naming conventions before it can generate useful modules.
Module Generation
Debugging Plan Drift
Never paste real AWS account IDs, ARNs, or secrets into Claude Code prompts. Use placeholder values (e.g., 123456789012, REDACTED) when sharing configs for review. Claude doesn't need real values to diagnose or generate.
5. CLAUDE.md for Infrastructure Repos
The CLAUDE.md file is your infrastructure team's Claude Code onboarding document. It tells Claude about your stack, conventions, and constraints — so every engineer gets consistent, project-aware assistance instead of generic output.
Here's a battle-tested CLAUDE.md template for infrastructure repos:
# CLAUDE.md — Platform Team Infrastructure Repo
## What This Repo Is
Terraform modules + Kubernetes manifests + CI/CD pipelines for [Company] production infrastructure.
Do NOT treat this as a generic project. Every suggestion must align with the conventions below.
## Provider Versions
- Terraform: 1.8.x
- AWS Provider: ~> 5.0
- Kubernetes Provider: ~> 2.30
## AWS Configuration
- Primary region: us-east-1
- Secondary region: us-west-2
- Account IDs: [use variables, never hardcode]
- Auth: OIDC for CI/CD, SSO profiles for local dev
## Resource Naming Convention
{env}-{team}-{service}-{resource-type}
Examples: prod-platform-api-ecs, staging-platform-worker-rds
## Required Tags on All Resources
{
  Environment = var.environment
  Team        = var.team
  ManagedBy   = "terraform"
  Repository  = "github.com/company/infra"
}
## Security Rules (non-negotiable)
- No resources with 0.0.0.0/0 inbound except ALB on 443
- All ECS tasks run as a non-root user; also set readonlyRootFilesystem = true where possible
- S3 buckets: block_public_acls, block_public_policy, ignore_public_acls, restrict_public_buckets = true
- SSM Parameter Store for secrets, not plaintext env vars
- Enable CloudTrail logging for all new accounts
## Kubernetes Conventions
- Cluster version: 1.30
- Ingress controller: nginx
- Service mesh: none (use Network Policies)
- Resource requests are REQUIRED — never omit them
- Always set both requests and limits
## GitHub Actions
- Use OIDC for AWS auth (no static credentials ever)
- Cache: actions/cache@v4
- Docker builds: docker/build-push-action@v6
- Pinned runner versions: ubuntu-24.04
## Files to Read Before Editing
- modules/ecs-service/main.tf — existing ECS module pattern
- .github/workflows/deploy.yml — current deployment flow
- environments/prod/main.tf — production environment config
Put this at the root of your infra repo. Engineers on your team who open Claude Code against this repo will immediately get context-aware suggestions — no manual re-explaining every session.
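For a sense of what output shaped by the template's GitHub Actions section might look like, here is a sketch of a build-and-push workflow that uses OIDC instead of static credentials. It is an illustration under assumptions, not the output of any real session: the repository variable `AWS_ROLE_ARN` and the image name `my-service` are hypothetical.

```yaml
# Sketch: OIDC-based AWS auth with a pinned runner and build-push-action@v6.
# vars.AWS_ROLE_ARN and the image name my-service are hypothetical.
name: build-and-push
on:
  push:
    branches: [main]

permissions:
  id-token: write   # lets the job request an OIDC token
  contents: read

jobs:
  build:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      # Exchange the OIDC token for short-lived AWS credentials.
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.AWS_ROLE_ARN }}
          aws-region: us-east-1
      - id: ecr
        uses: aws-actions/amazon-ecr-login@v2
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ${{ steps.ecr.outputs.registry }}/my-service:${{ github.sha }}
```

The role ARN comes from a repository variable rather than being written inline, which keeps the account ID out of the workflow file in line with the template's "never hardcode" rule.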
6. 30 Copy-Paste DevOps Prompts
These are production-tested prompts organized by task. Each assumes Claude Code has access to your repo files — which it does when you run it from your project root.
GitHub Actions (8 prompts)
- Pipeline creation: "Write a GitHub Actions workflow that runs tests on PR and deploys to [env] on push to main. Read the existing Makefile for the test and build commands."
- Cache optimization: "Our CI pipeline takes 14 minutes. Review the workflow and identify every cache opportunity. Implement caching for dependencies, Docker layers, and build artifacts."
- Secrets scanning: "Add a secrets scanning step to this workflow using trufflesecurity/trufflehog before any build steps. Fail the job if secrets are detected."
- Slack notifications: "Add Slack notifications to this workflow: success message on main deploy, failure alert on any job failure. Use the existing SLACK_WEBHOOK_URL secret."
- Environment protection: "Add manual approval gates to the production deploy job in this workflow. Only allow approval from the 'platform-leads' GitHub team."
- Reusable workflow: "Convert our deploy job into a reusable workflow that other repos can call. It should accept image URI, environment, and cluster name as inputs."
- Cost reduction: "Our GitHub Actions bill doubled this month. Review this workflow and identify jobs that can be skipped with path filters or moved to self-hosted runners backed by spot instances."
- Timeout debug: "This job times out at 60 minutes but the logs stop updating at 45m. Diagnose what's likely hanging and add timeout-minutes to each step to isolate it."
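The timeout prompt above asks for per-step limits, which can be sketched as follows. The job name and the Makefile targets are hypothetical, and the minute values are arbitrary starting points to tune against your own logs.

```yaml
# Sketch: per-step timeouts so a hung step fails fast and identifies itself.
# The job name and Makefile targets are hypothetical; the limits are arbitrary.
jobs:
  build:
    runs-on: ubuntu-24.04
    timeout-minutes: 60        # overall ceiling for the job
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: make deps
        timeout-minutes: 10
      - name: Run tests
        run: make test
        timeout-minutes: 20
      - name: Build image
        run: make image
        timeout-minutes: 15
```

With limits in place, the step that exceeds its budget is the one reported as failed, which narrows the search far more than a single job-level timeout does.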
Docker (8 prompts)
- Layer audit: "Analyze this Dockerfile and show me exactly which layers will be cache-busted when a developer changes a single source file. Reorder layers to maximize cache hits."
- Security hardening: "Harden this Dockerfile: run as non-root, use minimal base image, remove unnecessary tools, set COPY --chown, and add HEALTHCHECK."
- Build argument handling: "Refactor this Dockerfile to accept BUILD_ENV and APP_VERSION as build arguments, bake the version into the binary, but NOT expose credentials as build args."
- Multi-architecture: "Modify this Dockerfile to build for both linux/amd64 and linux/arm64. Identify any dependencies that may not have arm64 builds."
- Attack surface audit: "Scan this Dockerfile for the top 5 ways an attacker could exploit the running container. Suggest mitigations for each."
- Build time reduction: "Our Docker build takes 12 minutes. Analyze the Dockerfile and suggest changes to get it under 4 minutes including layer caching strategy."
- .dockerignore: "Generate a .dockerignore for this project. Read the directory structure and exclude everything that doesn't need to be in the build context."
- Base image update: "Our base image is node:16-alpine which is EOL. Migrate to node:20-alpine. Identify any compatibility issues in the Dockerfile and the package.json engines field."
Kubernetes (7 prompts)
- Resource right-sizing: "Our pods are OOMKilled 2-3 times per week. Here are the last 30 days of memory metrics. Recommend new resource requests and limits."
- Network Policy: "Create a NetworkPolicy that allows this deployment to receive traffic only from the ingress controller and make outbound calls only to the database service. Deny everything else."
- PodDisruptionBudget: "We have a 5-replica deployment. Create a PodDisruptionBudget that allows rolling updates but prevents more than 1 pod from being unavailable during node drains."
- Helm chart creation: "Convert these 4 Kubernetes manifest files into a Helm chart with a values.yaml that exposes the commonly changed fields."
- Resource quota: "Set up ResourceQuota and LimitRange for the staging namespace. Staging should never use more than 8 CPU cores and 16Gi memory total."
- Init containers: "Our app requires the database to be accepting connections before starting. Add an init container that polls the DB healthcheck endpoint every 5 seconds before the main container starts."
- Canary deployment: "Implement a canary deployment pattern for this service using two Deployments, a Service for each, and weighted traffic splitting via ingress annotations."
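Two of the prompts above, PodDisruptionBudget and Network Policy, map to short manifests. The sketch below shows roughly what to expect back; the names, namespaces, labels, and ports are assumptions for illustration, including the `ingress-nginx` namespace and the `app: postgres` label.

```yaml
# Sketch: a PDB for a 5-replica Deployment and a restrictive NetworkPolicy.
# Names, labels, ports, and the ingress-nginx namespace are hypothetical.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  maxUnavailable: 1            # at most one pod evicted at a time during node drains
  selector:
    matchLabels:
      app: api
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-restrict
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes: [Ingress, Egress]
  ingress:
    # Allow traffic only from pods in the ingress controller's namespace.
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
  egress:
    # Allow outbound traffic only to the database pods.
    # Note: DNS lookups are also blocked unless a rule for the cluster DNS is added.
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
```

A PodDisruptionBudget limits voluntary evictions such as node drains; the pace of a rolling update is controlled separately by the Deployment's update strategy, so it is worth asking Claude to review both together.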
Terraform (7 prompts)
- Module refactoring: "This Terraform code has 400 lines of repeated resource blocks. Identify the pattern and refactor into a module with for_each."
- State import: "We have 3 manually created S3 buckets that aren't in Terraform state. Write the import blocks and resource configs to bring them under management without recreating them."
- Cost estimation: "Review these new Terraform resources and estimate the monthly AWS cost. Flag any resources that might have unexpected data transfer charges."
- Policy as code: "Write OPA/Rego policies that enforce: no public S3 buckets, all EC2 instances require a Name tag, no security groups with 0.0.0.0/0 inbound on port 22."
- Dependency graph: "This Terraform plan has a cycle error. Here's the error and the relevant resource blocks. Identify the cycle and suggest how to break it."
- Version upgrade: "We're upgrading from Terraform 1.5 to 1.8. Review our code for any deprecated syntax or provider changes we need to handle."
- Sensitive outputs: "Audit all outputs.tf files in this repo. Flag any outputs that expose sensitive values, then either mark them with sensitive = true or remove them entirely."
7. Five Anti-Patterns That Cost DevOps Engineers Time
1. Pasting Errors Without Context
Don't: "My pipeline is broken, here's the error."
Do: Paste the full workflow YAML, the error log (not just the last 5 lines), and any relevant env/secret configuration. Claude needs to see the full picture to diagnose correctly.
2. Asking for Generic Infrastructure
Claude will generate a working Kubernetes deployment without knowing your cluster version, ingress controller, or naming conventions. That deployment will work — but it won't fit your environment and you'll spend an hour adapting it. Specify your exact constraints upfront.
3. No CLAUDE.md in Infrastructure Repos
Infrastructure repos have more conventions and constraints than application code. Without a CLAUDE.md, Claude generates technically correct configs that violate your security policies, naming conventions, or tagging requirements. This is the highest-ROI file you can create for your infra repo.
A CLAUDE.md that encodes your security non-negotiables (no public S3, OIDC auth only, required tags) is worth more than 100 individual prompts. Claude will apply those rules to every resource it generates — without you having to remind it every time.
4. Letting Claude Write Terraform Without Specifying Provider Versions
The AWS provider changed significantly between 4.x and 5.x. Kubernetes provider 2.x has different resource structures than 1.x. Always specify your provider versions in the prompt — or better, put them in CLAUDE.md — to avoid generating configs that won't apply against your pinned versions.
5. Accepting First-Draft Security Groups
Claude will generate functional security groups. "Functional" and "secure" are not the same thing. Always follow up any infrastructure generation with: "Review the security groups in this config. Are any of them wider than they need to be? Apply the principle of least privilege." Make this a habit.
Your Next Step
The single highest-ROI action: create a CLAUDE.md at your infra repo root using the template above. Customize it for your actual provider versions, naming conventions, and security non-negotiables. Every engineer on your team who opens Claude Code against that repo will get context-aware output immediately — no re-explaining every session.
If you want this for every team in your engineering organization — platform, backend, frontend, QA — with role-specific CLAUDE.md templates, a full onboarding playbook, and best practices for team deployment at scale, the Claude Code Team Playbook has it packaged and ready to use.
Claude Code Team Playbook — $49
Complete team onboarding playbook for Claude Code. CLAUDE.md templates for every role (DevOps, backend, frontend, PM, QA), workflow guides, rollout checklist, and prompt libraries. One-time purchase. Instant download.