DevOps engineers write a lot of YAML. A lot of Bash. A lot of Dockerfiles that fail at layer 11 of 12. A lot of GitHub Actions workflows that work perfectly in dev and break inexplicably in CI.

Claude Code changes this. Not because it writes perfect infrastructure code on the first try; it doesn't. But because it can work out why your Docker layer cache is busted, why your GitHub Actions matrix job is timing out, and why your Terraform plan shows drift. It reads your configs, reasons through the problem, and gives you a fix specific to your setup, not a generic StackOverflow answer.

This guide covers how DevOps engineers actually use Claude Code in production. Real prompts. Real workflow patterns. CLAUDE.md setup for infrastructure repos. And the mistakes that cost you time.

- 73% of DevOps time is YAML & config debugging
- Faster pipeline debugging with AI assistance
- ~60% of Docker build failures are preventable

1. GitHub Actions: Writing & Debugging Workflows

GitHub Actions is where Claude Code delivers the most immediate ROI for DevOps work. Workflow YAML is verbose, unforgiving, and full of gotchas — expression syntax, context variables, concurrency groups, job outputs — that take time to get right even with the docs open.

Creating Workflows From Scratch

The key is context. Don't ask Claude to "write a CI pipeline." Describe your stack, your deployment target, and your constraints.

Prompt
Write a GitHub Actions workflow for a Node.js 20 app that:
- Runs on push to main and on PRs targeting main
- Installs with npm ci and caches the npm cache directory (~/.npm) keyed on the package-lock.json hash (npm ci deletes node_modules, so caching that folder does not help)
- Runs lint (npm run lint), test (npm test), and build (npm run build) in parallel jobs
- On push to main only: builds and pushes a Docker image to ECR, then deploys to ECS
- Uses OIDC authentication for AWS (no static credentials)
- ECR repo: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app
- ECS cluster: prod-cluster, service: my-app-service

Read the existing Dockerfile at the root before writing the workflow.

That single prompt gets you most of the way to a production-ready workflow because it specifies the exact constraints Claude needs to make good decisions: cache strategy, AWS auth method, registry URI, deployment target.
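For orientation, here is a minimal sketch of the kind of workflow that prompt tends to produce. It is not the one true output: the IAM role name (github-deploy) is hypothetical, the 12-digit account ID is a placeholder, and the final step assumes your ECS task definition points at the :latest tag.

```yaml
# .github/workflows/ci.yml (illustrative sketch, adapt before use)
name: ci
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

permissions:
  contents: read

jobs:
  checks:
    runs-on: ubuntu-24.04
    strategy:
      fail-fast: false
      matrix:
        task: [lint, test, build]   # one parallel job per npm script
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm                # caches ~/.npm, keyed on package-lock.json
      - run: npm ci
      - run: npm run ${{ matrix.task }}

  deploy:
    needs: checks
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-24.04
    permissions:
      id-token: write               # lets the job request an OIDC token
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          # Hypothetical role; it must trust GitHub's OIDC provider.
          role-to-assume: arn:aws:iam::123456789012:role/github-deploy
          aws-region: us-east-1
      - uses: aws-actions/amazon-ecr-login@v2
        id: ecr
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: |
            ${{ steps.ecr.outputs.registry }}/my-app:${{ github.sha }}
            ${{ steps.ecr.outputs.registry }}/my-app:latest
      - name: Roll the ECS service
        # Assumes the task definition references the :latest tag.
        run: |
          aws ecs update-service \
            --cluster prod-cluster \
            --service my-app-service \
            --force-new-deployment
```

If your task definition pins an image digest or a per-commit tag, ask Claude to register a new task definition revision instead of forcing a redeploy.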

Debugging Workflow Failures

This is where Claude Code earns its keep. Paste the failing workflow + the error log and ask Claude to diagnose:

Prompt
This GitHub Actions workflow is failing with the error below. Diagnose the root cause and provide a fix. Don't just work around it; explain why it's failing.

[paste workflow YAML]

[paste error log]

✅ Real-World Example

A common failure: Error: unable to get local issuer certificate in a Node.js job that works fine locally. On a self-hosted runner behind a TLS-inspecting corporate proxy, Claude will typically identify the missing corporate root CA as the cause, suggest setting NODE_EXTRA_CA_CERTS in the job environment, and explain why the error only shows up on the runner.
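A sketch of what that fix can look like in the workflow. The runner labels and the certificate path are assumptions; use wherever your runner image actually stores the corporate root CA.

```yaml
jobs:
  test:
    runs-on: [self-hosted, linux]
    env:
      # Hypothetical path to the corporate root CA bundle on the runner.
      NODE_EXTRA_CA_CERTS: /etc/ssl/certs/corp-root-ca.pem
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test
```

Setting the variable at job level means every Node.js process in the job trusts the extra CA, not just a single step.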

Matrix Builds & Conditional Logic

Prompt
Add a matrix build to this workflow that tests against Node 18, 20, and 22, and against Ubuntu and macOS runners. The matrix should exclude the macOS + Node 18 combination (too slow). Add a condition that only runs the macOS jobs on push to main, not on PRs.
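One way the result can look, as a sketch rather than the definitive answer. The runner labels (ubuntu-latest, macos-latest) are assumptions. The macOS restriction is done by computing the os list, because a job-level if cannot read the matrix context.

```yaml
jobs:
  test:
    strategy:
      fail-fast: false
      matrix:
        # macOS is only added to the list on pushes to main.
        os: ${{ fromJSON(github.event_name == 'push' && github.ref == 'refs/heads/main' && '["ubuntu-latest","macos-latest"]' || '["ubuntu-latest"]') }}
        node: [18, 20, 22]
        exclude:
          - os: macos-latest
            node: 18
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
          cache: npm
      - run: npm ci
      - run: npm test
```

On a PR this expands to three Ubuntu jobs. On a push to main it expands to five: three Ubuntu jobs plus macOS on Node 20 and 22.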

2. Docker: Optimizing Builds & Fixing Failures

Dockerfiles are deceptively simple to write and surprisingly hard to optimize. Cache busting, layer ordering, multi-stage builds, and base image choices all compound. Claude Code excels here because it can read your entire Dockerfile and reason about the full build graph.

Dockerfile Review & Optimization

Prompt
Review this Dockerfile for a Python FastAPI app. I want you to:
1. Identify any layer cache busting issues
2. Convert it to a multi-stage build (builder + final)
3. Ensure the final image runs as a non-root user
4. Minimize the final image size
5. Add an appropriate health check

[paste Dockerfile]

Current image size is 2.1GB. Target is under 500MB.

Debugging Build Failures

Docker build failures deep in a dependency installation are painful. The error message rarely tells you the real cause.

Prompt
My Docker build is failing at step 11/14 during pip install. The error is:

ERROR: Cannot install -r requirements.txt because these package versions have conflicting dependencies.

Here's the full build log:
[paste log]

Here's my requirements.txt:
[paste file]

Diagnose the conflict, tell me which packages are incompatible, and suggest a fix that doesn't require pinning every dependency to an exact version.

Docker Compose for Development Environments

Prompt
Create a docker-compose.yml for local development of this stack:
- Node.js API (Dockerfile in ./api/)
- Python worker (Dockerfile in ./worker/)
- PostgreSQL 15
- Redis 7
- LocalStack (S3 and SQS only)

Requirements:
- API and worker should hot-reload on code changes (mount source)
- Services should start in the correct order
- Expose only the API port (3000) to the host
- Use a .env file for secrets
- Include a health check for each service
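A trimmed sketch of the pattern that prompt should produce, covering only the API, PostgreSQL, and Redis; the worker and LocalStack would follow the same shape. It assumes the API image uses /app as its working directory and that .env defines POSTGRES_USER, POSTGRES_PASSWORD, and POSTGRES_DB.

```yaml
# docker-compose.yml (illustrative sketch)
services:
  api:
    build: ./api
    env_file: .env
    ports:
      - "3000:3000"            # the only port published to the host
    volumes:
      - ./api:/app             # mount source for hot reload (assumes WORKDIR /app)
      - /app/node_modules      # keep the dependencies installed in the image
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy

  postgres:
    image: postgres:15
    env_file: .env
    healthcheck:
      # $$ defers variable expansion to the shell inside the container.
      test: ["CMD-SHELL", "pg_isready -U $$POSTGRES_USER -d $$POSTGRES_DB"]
      interval: 5s
      timeout: 3s
      retries: 10

  redis:
    image: redis:7
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 10
```

The depends_on conditions are what give you "start in the correct order": the API container is not started until both health checks pass.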
💡 DevOps Tip

Always give Claude your exact base image tag (e.g., node:20.14-alpine3.19) rather than a generic version. It changes the recommendations significantly — especially for security hardening and available package managers.

3. Kubernetes: Config Generation & Debugging

Kubernetes manifests are notoriously verbose and error-prone. A missing field in a resource limits spec, a wrong selector label, a probe that doesn't match the container port: these mistakes often pass validation and only surface at runtime. Claude Code can catch many of them before kubectl apply.

Generating Production-Ready Manifests

Prompt
Generate a complete Kubernetes deployment for a Node.js API with these requirements:
- Deployment with 3 replicas, rolling update strategy
- Resource requests: 256m CPU, 256Mi memory. Limits: 500m CPU, 512Mi memory
- Liveness probe: HTTP GET /health on port 3000, initialDelay 30s
- Readiness probe: HTTP GET /ready on port 3000, initialDelay 10s
- HorizontalPodAutoscaler: min 2, max 10, target CPU 70%
- Service (ClusterIP) + Ingress (nginx) with TLS termination
- ConfigMap for non-secret env vars, SecretRef for DB credentials
- PodDisruptionBudget: minAvailable 2

Namespace: production
App labels: app=my-api, team=platform, env=production
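To show how those requirements map onto manifests, here is a sketch of the Deployment and HorizontalPodAutoscaler only. The image reference and the ConfigMap and Secret names (my-api-config, my-api-db) are placeholders, not names from a real cluster.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
  namespace: production
  labels: {app: my-api, team: platform, env: production}
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels: {app: my-api, team: platform, env: production}
    spec:
      containers:
        - name: api
          image: registry.example.com/my-api:1.0.0   # placeholder image
          ports:
            - containerPort: 3000
          envFrom:
            - configMapRef:
                name: my-api-config                  # hypothetical ConfigMap
            - secretRef:
                name: my-api-db                      # hypothetical Secret
          resources:
            requests: {cpu: 256m, memory: 256Mi}
            limits: {cpu: 500m, memory: 512Mi}
          livenessProbe:
            httpGet: {path: /health, port: 3000}
            initialDelaySeconds: 30
          readinessProbe:
            httpGet: {path: /ready, port: 3000}
            initialDelaySeconds: 10
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

One interaction worth asking Claude about: with an HPA minimum of 2 and a PodDisruptionBudget of minAvailable 2, a node drain cannot evict any pod while the service is scaled down to 2 replicas.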

Debugging CrashLoopBackOff and Other Failures

Prompt
My pod is in CrashLoopBackOff. Here's the output of:
- kubectl describe pod [pod-name]
- kubectl logs [pod-name] --previous
- The deployment YAML

[paste all three]

Tell me the most likely root cause and what to check next. If it's a configuration issue, show me the fix.

RBAC Configuration

Prompt
Create RBAC manifests for a CI/CD service account that needs to:
- Deploy to the "staging" namespace (create/update/delete Deployments, Services, ConfigMaps)
- Read secrets in "staging" namespace
- NOT have access to any other namespace
- NOT be able to modify RBAC itself

Follow the principle of least privilege. Include ServiceAccount, Role, and RoleBinding.
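A sketch of manifests that would satisfy that prompt. The name ci-deployer is hypothetical. Using a namespaced Role and RoleBinding (rather than the cluster-wide variants) is what keeps access inside staging, and leaving out the rbac.authorization.k8s.io API group is what prevents the account from editing its own permissions.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-deployer            # hypothetical name
  namespace: staging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ci-deployer
  namespace: staging
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list"]     # read-only
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-deployer
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: ci-deployer
    namespace: staging
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ci-deployer
```

If the pipeline only needs specific secrets, a follow-up prompt can tighten the secrets rule with resourceNames and drop list.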

4. Terraform & Infrastructure as Code

Terraform is where context matters most. Claude Code needs to understand your provider version, existing state, and naming conventions before it can generate useful modules.

Module Generation

Prompt
Create a reusable Terraform module for an AWS ECS service. It should accept:
- var.name (string): service name
- var.image (string): container image URI
- var.cpu, var.memory: Fargate task sizing
- var.environment (map of strings): env vars
- var.secrets (map of strings): SSM Parameter Store paths

The module should create: ECS Task Definition, ECS Service, IAM role with least-privilege policies, CloudWatch log group, and security group with no inbound.

We use Terraform 1.8, AWS provider ~> 5.0, and all resources should have a standard tags variable.

Debugging Plan Drift

Prompt
My terraform plan is showing unexpected changes to this resource even though I haven't changed the config.

Here's the plan output:
[paste plan diff]

Here's the current resource config:
[paste .tf block]

Explain why Terraform thinks this needs to change and whether it's safe to apply.

⚠️ Important

Never paste real AWS account IDs, ARNs, or secrets into Claude Code prompts. Use placeholder values (e.g., 123456789012, REDACTED) when sharing configs for review. Claude doesn't need real values to diagnose or generate.

5. CLAUDE.md for Infrastructure Repos

The CLAUDE.md file is your infrastructure team's Claude Code onboarding document. It tells Claude about your stack, conventions, and constraints — so every engineer gets consistent, project-aware assistance instead of generic output.

Here's a battle-tested CLAUDE.md template for infrastructure repos:

# CLAUDE.md — Platform Team Infrastructure Repo

## What This Repo Is
Terraform modules + Kubernetes manifests + CI/CD pipelines for [Company] production infrastructure.
Do NOT treat this as a generic project. Every suggestion must align with the conventions below.

## Provider Versions
- Terraform: 1.8.x
- AWS Provider: ~> 5.0
- Kubernetes Provider: ~> 2.30

## AWS Configuration
- Primary region: us-east-1
- Secondary region: us-west-2
- Account IDs: [use variables, never hardcode]
- Auth: OIDC for CI/CD, SSO profiles for local dev

## Resource Naming Convention
{env}-{team}-{service}-{resource-type}
Examples: prod-platform-api-ecs, staging-platform-worker-rds

## Required Tags on All Resources
{
  Environment = var.environment
  Team        = var.team
  ManagedBy   = "terraform"
  Repository  = "github.com/company/infra"
}

## Security Rules (non-negotiable)
- No resources with 0.0.0.0/0 inbound except ALB on 443
- All ECS tasks run as non-root; set readonlyRootFilesystem = true where possible
- S3 buckets: block_public_acls, block_public_policy, ignore_public_acls, restrict_public_buckets = true
- SSM Parameter Store for secrets, not env vars
- Enable CloudTrail logging for all new accounts

## Kubernetes Conventions
- Cluster version: 1.30
- Ingress controller: nginx
- Service mesh: none (use Network Policies)
- Resource requests are REQUIRED — never omit them
- Always set both requests and limits

## GitHub Actions
- Use OIDC for AWS auth (no static credentials ever)
- Cache: actions/cache@v4
- Docker builds: docker/build-push-action@v6
- Pinned runner versions: ubuntu-24.04

## Files to Read Before Editing
- modules/ecs-service/main.tf — existing ECS module pattern
- .github/workflows/deploy.yml — current deployment flow
- environments/prod/main.tf — production environment config

Put this at the root of your infra repo. Engineers on your team who open Claude Code against this repo will immediately get context-aware suggestions — no manual re-explaining every session.

6. 30 Copy-Paste DevOps Prompts

These are production-tested prompts organized by task. Each assumes Claude Code has access to your repo files — which it does when you run it from your project root.

GitHub Actions (8 prompts)

  1. Pipeline creation: "Write a GitHub Actions workflow that runs tests on PR and deploys to [env] on push to main. Read the existing Makefile for the test and build commands."
  2. Cache optimization: "Our CI pipeline takes 14 minutes. Review the workflow and identify every cache opportunity. Implement caching for dependencies, Docker layers, and build artifacts."
  3. Secrets scanning: "Add a secrets scanning step to this workflow using trufflesecurity/trufflehog before any build steps. Fail the job if secrets are detected."
  4. Slack notifications: "Add Slack notifications to this workflow: success message on main deploy, failure alert on any job failure. Use the existing SLACK_WEBHOOK_URL secret."
  5. Environment protection: "Run the production deploy job in this workflow against a protected 'production' environment, and list the required-reviewer settings I need to configure so that only the 'platform-leads' GitHub team can approve."
  6. Reusable workflow: "Convert our deploy job into a reusable workflow that other repos can call. It should accept image URI, environment, and cluster name as inputs."
  7. Cost reduction: "Our GitHub Actions bill doubled this month. Review this workflow and identify jobs that can be skipped with path filters or moved to self-hosted runners on spot instances."
  8. Timeout debug: "This job times out at 60 minutes but the logs stop updating at 45m. Diagnose what's likely hanging and add timeout-minutes to each step to isolate it."
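As an illustration of the timeout-debug prompt, per-step timeouts can look like the sketch below. The job name and the test:integration script are hypothetical, and the minute values are arbitrary starting points.

```yaml
jobs:
  integration:
    runs-on: ubuntu-24.04
    timeout-minutes: 60            # overall ceiling for the job
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: npm ci
        timeout-minutes: 10
      - name: Integration tests
        run: npm run test:integration   # hypothetical script name
        timeout-minutes: 30
```

With a limit on each step, the run fails at the step that hangs instead of at the job ceiling, which tells you where to look.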

Docker (8 prompts)

  1. Layer audit: "Analyze this Dockerfile and show me exactly which layers will be cache-busted when a developer changes a single source file. Reorder layers to maximize cache hits."
  2. Security hardening: "Harden this Dockerfile: run as non-root, use minimal base image, remove unnecessary tools, set COPY --chown, and add HEALTHCHECK."
  3. Build argument handling: "Refactor this Dockerfile to accept BUILD_ENV and APP_VERSION as build arguments, bake the version into the binary, but NOT expose credentials as build args."
  4. Multi-architecture: "Modify this Dockerfile to build for both linux/amd64 and linux/arm64. Identify any dependencies that may not have arm64 builds."
  5. Attack surface audit: "Scan this Dockerfile for the top 5 ways an attacker could exploit the running container. Suggest mitigations for each."
  6. Build time reduction: "Our Docker build takes 12 minutes. Analyze the Dockerfile and suggest changes to get it under 4 minutes including layer caching strategy."
  7. .dockerignore: "Generate a .dockerignore for this project. Read the directory structure and exclude everything that doesn't need to be in the build context."
  8. Base image update: "Our base image is node:16-alpine which is EOL. Migrate to node:20-alpine. Identify any compatibility issues in the Dockerfile and the package.json engines field."

Kubernetes (7 prompts)

  1. Resource right-sizing: "Our pods are OOMKilled 2-3 times per week. Here are the last 30 days of memory metrics. Recommend new resource requests and limits."
  2. Network Policy: "Create a NetworkPolicy that allows this deployment to receive traffic only from the ingress controller and make outbound calls only to the database service. Deny everything else."
  3. PodDisruptionBudget: "We have a 5-replica deployment. Create a PodDisruptionBudget that allows rolling updates but prevents more than 1 pod from being unavailable during node drains."
  4. Helm chart creation: "Convert these 4 Kubernetes manifest files into a Helm chart with a values.yaml that exposes the commonly changed fields."
  5. Resource quota: "Set up ResourceQuota and LimitRange for the staging namespace. Staging should never use more than 8 CPU cores and 16Gi memory total."
  6. Init containers: "Our app requires the database to be accepting connections before starting. Add an init container that polls the DB healthcheck endpoint every 5 seconds before the main container starts."
  7. Canary deployment: "Implement a canary deployment pattern for this service using two Deployments, two Services, and weighted traffic splitting via nginx ingress canary annotations."
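To make the Network Policy and PodDisruptionBudget prompts concrete, here is a sketch of both. It assumes the ingress controller runs in a namespace named ingress-nginx, the database pods carry the label app=postgres and listen on 5432, and the API listens on 3000; all of these are illustrative.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: my-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: my-api
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx   # assumed namespace
      ports:
        - protocol: TCP
          port: 3000
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres                                # hypothetical DB label
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector: {}    # DNS lookups, any namespace, port 53 only
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-api
  namespace: production
spec:
  maxUnavailable: 1                # at most one pod evicted at a time during drains
  selector:
    matchLabels:
      app: my-api
```

The DNS rule is easy to forget: once an egress policy selects a pod, name resolution is blocked unless you allow it explicitly.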

Terraform (7 prompts)

  1. Module refactoring: "This Terraform code has 400 lines of repeated resource blocks. Identify the pattern and refactor into a module with for_each."
  2. State import: "We have 3 manually created S3 buckets that aren't in Terraform state. Write the import blocks and resource configs to bring them under management without recreating them."
  3. Cost estimation: "Review these new Terraform resources and estimate the monthly AWS cost. Flag any resources that might have unexpected data transfer charges."
  4. Policy as code: "Write OPA/Rego policies that enforce: no public S3 buckets, all EC2 instances require a Name tag, no security groups with 0.0.0.0/0 inbound on port 22."
  5. Dependency graph: "This Terraform plan has a cycle error. Here's the error and the relevant resource blocks. Identify the cycle and suggest how to break it."
  6. Version upgrade: "We're upgrading from Terraform 1.5 to 1.8. Review our code for any deprecated syntax or provider changes we need to handle."
  7. Sensitive outputs: "Audit all outputs.tf files in this repo. Flag any outputs that expose sensitive values, then either mark them with sensitive = true or remove them entirely."

7. Five Anti-Patterns That Cost DevOps Engineers Time

1. Pasting Errors Without Context

Don't: "My pipeline is broken, here's the error."
Do: Paste the full workflow YAML, the error log (not just the last 5 lines), and any relevant env/secret configuration. Claude needs to see the full picture to diagnose correctly.

2. Asking for Generic Infrastructure

Claude will generate a working Kubernetes deployment without knowing your cluster version, ingress controller, or naming conventions. That deployment will work — but it won't fit your environment and you'll spend an hour adapting it. Specify your exact constraints upfront.

3. No CLAUDE.md in Infrastructure Repos

Infrastructure repos have more conventions and constraints than application code. Without a CLAUDE.md, Claude generates technically correct configs that violate your security policies, naming conventions, or tagging requirements. This is the highest-ROI file you can create for your infra repo.

🔑 Key Insight

A CLAUDE.md that encodes your security non-negotiables (no public S3, OIDC auth only, required tags) is worth more than 100 individual prompts. Claude will apply those rules to every resource it generates — without you having to remind it every time.

4. Letting Claude Write Terraform Without Specifying Provider Versions

The AWS provider changed significantly between 4.x and 5.x. Kubernetes provider 2.x has different resource structures than 1.x. Always specify your provider versions in the prompt — or better, put them in CLAUDE.md — to avoid generating configs that won't apply against your pinned versions.

5. Accepting First-Draft Security Groups

Claude will generate functional security groups. "Functional" and "secure" are not the same thing. Always follow up any infrastructure generation with: "Review the security groups in this config. Are any of them wider than they need to be? Apply the principle of least privilege." Make this a habit.

Your Next Step

The single highest-ROI action: create a CLAUDE.md at your infra repo root using the template above. Customize it for your actual provider versions, naming conventions, and security non-negotiables. Every engineer on your team who opens Claude Code against that repo will get context-aware output immediately — no re-explaining every session.

If you want this for every team in your engineering organization — platform, backend, frontend, QA — with role-specific CLAUDE.md templates, a full onboarding playbook, and best practices for team deployment at scale, the Claude Code Team Playbook has it packaged and ready to use.

Claude Code Team Playbook — $49

Complete team onboarding playbook for Claude Code. CLAUDE.md templates for every role (DevOps, backend, frontend, PM, QA), workflow guides, rollout checklist, and prompt libraries. One-time purchase. Instant download.

Patrick: AI operator and author of Ask Patrick. I build and test Claude Code workflows full-time so you don't have to figure it out yourself. Platform engineer background; I write a lot of YAML.