100 AI Prompts for DevOps & SRE — Complete Guide
DevOps and Site Reliability Engineering require mastering an enormous toolchain — from CI/CD pipelines to Kubernetes manifests to incident runbooks. AI assistants can help you write and debug infrastructure code, design resilient systems, and communicate clearly during incidents. These 100 prompts cover the full DevOps lifecycle.
CI/CD & Automation
Prompts to build and improve continuous integration and deployment pipelines.
Write a GitHub Actions workflow
Beginner: Create CI/CD pipelines
Write a GitHub Actions workflow for a [language/framework] application that runs on every pull request and push to main. Include: dependency caching, linting, unit tests, build, Docker image build and push to [registry], and deployment to [environment]. Use environment secrets for credentials, and list the branch protection rules to configure in the repository settings.
Build a multi-stage deployment pipeline
Intermediate: Design production deployment pipelines
Design a multi-stage deployment pipeline for [application] going through dev → staging → production. Each stage should run: automated tests, security scanning (SAST, dependency audit), smoke tests after deployment, and require manual approval for production. Include rollback mechanism and notification to Slack.
Set up GitOps with ArgoCD
Intermediate: Implement GitOps workflows
Set up a GitOps workflow using ArgoCD to deploy [application] to Kubernetes. Define the ArgoCD Application manifest, configure automated sync with self-heal, set up the app-of-apps pattern for multiple environments, add Argo CD Image Updater for automatic image tag updates, and configure RBAC for team access.
Write a deployment script
Beginner: Automate deployment scripts
Write a bash deployment script for [application] that performs: pre-deployment health check, blue-green deployment with nginx upstream switch, post-deployment smoke test (curl endpoints), automatic rollback if smoke test fails, and sends a deployment notification to [Slack/PagerDuty]. Include set -euo pipefail.
Implement canary deployments
Advanced: Roll out changes safely with canary
Design and implement a canary deployment strategy for [service] on Kubernetes using [Argo Rollouts/Flagger/manual approach]. Define: traffic split percentages and progression schedule, metrics to analyze (error rate, latency p99), automatic rollback conditions, and how to manually promote or abort a canary.
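The automatic rollback condition in the prompt above can be pictured as a small decision function. The following is a minimal sketch, not how Argo Rollouts or Flagger implement analysis: the `CanaryMetrics` type, the `canary_decision` helper, and the default thresholds (1% error rate, 20% latency regression) are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CanaryMetrics:
    error_rate: float      # fraction of failed requests, e.g. 0.01 for 1%
    latency_p99_ms: float  # 99th percentile latency in milliseconds


def canary_decision(canary: CanaryMetrics, baseline: CanaryMetrics,
                    max_error_rate: float = 0.01,
                    max_latency_ratio: float = 1.2) -> str:
    """Return "rollback" if the canary breaches a threshold, else "promote".

    Thresholds are illustrative assumptions, not tool defaults.
    """
    # Absolute ceiling on the canary's error rate.
    if canary.error_rate > max_error_rate:
        return "rollback"
    # Relative check: canary p99 must stay within a ratio of the baseline p99.
    if canary.latency_p99_ms > baseline.latency_p99_ms * max_latency_ratio:
        return "rollback"
    return "promote"
```

Comparing the canary against the live baseline, rather than against a fixed latency number, keeps the check meaningful when overall traffic conditions change during the rollout.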
Optimize CI pipeline speed
Intermediate: Speed up slow CI pipelines
My CI pipeline for [application] takes [X minutes]. Here is the current workflow config: [paste config]. Identify the top bottlenecks and provide specific optimizations: dependency caching strategies, test parallelization, Docker layer caching, conditional step execution, and build artifact reuse. Target: under [Y minutes].
Set up semantic versioning automation
Intermediate: Automate semantic versioning
Implement automated semantic versioning for a [language] project using Conventional Commits (feat, fix, and BREAKING CHANGE). Set up: commit message linting with commitlint, automated CHANGELOG generation with semantic-release, automatic version bump in package.json/pyproject.toml, git tag creation, and GitHub Release publishing.
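To make the mapping from commit types to version bumps concrete, here is a minimal sketch of the rule that tools in this space apply. It is an illustration only, not semantic-release's implementation: the `next_version` helper and its regular expression are hypothetical, and it ignores pre-release tags and special handling of 0.x versions.

```python
import re

# Conventional Commits header: type, optional (scope), optional "!", then ": "
HEADER = re.compile(r"^(?P<type>\w+)(\([^)]*\))?(?P<bang>!)?: ")


def next_version(current: str, commit_messages: list[str]) -> str:
    """Return the next MAJOR.MINOR.PATCH version implied by the commits."""
    major, minor, patch = (int(part) for part in current.split("."))
    bump = 0  # 0 = no release, 1 = patch, 2 = minor, 3 = major
    for message in commit_messages:
        match = HEADER.match(message)
        if not match:
            continue  # commits without a conforming header are ignored here
        if match.group("bang") or "BREAKING CHANGE:" in message:
            bump = max(bump, 3)
        elif match.group("type") == "feat":
            bump = max(bump, 2)
        elif match.group("type") == "fix":
            bump = max(bump, 1)
    if bump == 3:
        return f"{major + 1}.0.0"
    if bump == 2:
        return f"{major}.{minor + 1}.0"
    if bump == 1:
        return f"{major}.{minor}.{patch + 1}"
    return current
```

For example, `next_version("1.2.3", ["feat(api): add endpoint", "fix: typo"])` yields `"1.3.0"`, because the highest-ranked change in the batch decides the bump.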
Write a Makefile for a project
Beginner: Create project Makefiles
Write a Makefile for a [language/framework] project with targets for: installing dependencies, running tests, building Docker image, running locally with docker-compose, linting, formatting, database migrations, and generating documentation. Include a help target that lists all targets with descriptions.
Implement pipeline as code testing
Advanced: Test CI/CD pipelines
Write tests for a [Jenkins/GitHub Actions/GitLab CI] pipeline using [pipeline testing framework]. Test that: the correct stages run for different branch types, environment variables are properly set, secrets are not logged, build artifacts are produced, and deployment only occurs when tests pass.
Build a release management process
Intermediate: Formalize release processes
Design a release management process for a team of [N] engineers releasing [service] [frequency]. Include: release candidate creation process, version freeze policy, QA handoff checklist, release notes template, production deployment checklist, post-deployment monitoring window, and hotfix process.
Infrastructure as Code
Prompts to write and improve Terraform, Ansible, and cloud infrastructure code.
Write Terraform for a web application
Intermediate: Provision cloud infrastructure
Write Terraform code to provision a production-grade web application infrastructure on [AWS/GCP/Azure]. Include: [VPC with public/private subnets, load balancer, auto-scaling group / ECS / Kubernetes cluster], RDS database with read replica, S3 for static assets, CloudFront CDN, and IAM roles with least-privilege policies. If the target is not AWS, substitute the equivalent managed services on the chosen cloud.
Create reusable Terraform modules
Intermediate: Build reusable IaC modules
Create a reusable Terraform module for [resource type: e.g., 'ECS service with ALB']. The module should accept variables for [list configurable parameters], expose outputs for [list outputs], include input validation, follow naming conventions with [prefix] variable, and include README documentation with usage example.
Write an Ansible playbook
Beginner: Automate server configuration
Write an Ansible playbook to configure [server type] servers running [OS]. The playbook should: install and configure [list software], set up systemd services, configure [firewall rules], deploy application from [source], and run a post-configuration smoke test. Use roles for organization and support both Ubuntu and RHEL.
Audit Terraform for security issues
Intermediate: Security audit IaC code
Review the following Terraform code for security issues: [paste Terraform]. Check for: overly permissive IAM policies, public S3 buckets, security groups open to 0.0.0.0/0 on sensitive ports, unencrypted storage, missing VPC flow logs, and hardcoded credentials. Provide fixes for each finding.
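One of the checks named above, security groups open to 0.0.0.0/0 on sensitive ports, is simple enough to sketch directly. The example below assumes ingress rules have already been extracted into plain dictionaries using the `from_port`, `to_port`, and `cidr_blocks` attribute names from the AWS provider; the `name` key, the `open_sensitive_rules` helper, and the port list are hypothetical and chosen for illustration.

```python
# Illustrative set of ports: SSH, RDP, MySQL, PostgreSQL.
SENSITIVE_PORTS = {22, 3389, 3306, 5432}


def open_sensitive_rules(ingress_rules: list[dict]) -> list[tuple[str, list[int]]]:
    """Return (rule name, exposed ports) for rules open to the whole internet."""
    findings = []
    for rule in ingress_rules:
        if "0.0.0.0/0" not in rule["cidr_blocks"]:
            continue  # rule is restricted to specific networks
        # A rule covers a port range, so test each sensitive port against it.
        exposed = sorted(
            port for port in SENSITIVE_PORTS
            if rule["from_port"] <= port <= rule["to_port"]
        )
        if exposed:
            findings.append((rule["name"], exposed))
    return findings
```

Checking the whole port range matters: a rule that opens ports 0 through 65535 exposes SSH just as much as one that names port 22 explicitly.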
Design Terraform state management
Advanced: Manage Terraform state at scale
Design a Terraform state management strategy for a team of [N] engineers managing [number] environments across [number] AWS accounts. Include: remote state backend configuration (S3 + DynamoDB locking), state file organization (per-environment, per-service), workspace vs separate backends tradeoffs, and state migration procedure.
Write a Helm chart
Intermediate: Package apps as Helm charts
Write a Helm chart for [application name]. Include: Deployment with configurable replicas and resource limits, Service, Ingress with TLS, ConfigMap and Secret management, HorizontalPodAutoscaler, PodDisruptionBudget, and sensible default values.yaml. The chart should support multiple environments via values overrides.
Implement cost optimization with Terraform
Advanced: Reduce cloud infrastructure costs
Review my Terraform infrastructure for [AWS/GCP] and suggest cost optimization opportunities. Current monthly cost is [$X]. Identify: oversized instances, idle resources, savings plan/reserved instance opportunities, storage optimization, data transfer cost reductions, and unused resources. Provide Terraform changes for each optimization.
Set up Terraform CI/CD
Intermediate: Automate Terraform deployments
Set up a CI/CD pipeline for Terraform code in [GitHub/GitLab]. Include: terraform fmt and validate checks, tfsec security scanning, cost estimation with Infracost, plan output as PR comment, automated apply on merge to main (with manual approval for production), and state lock handling.
Write a CloudFormation template
Intermediate: Write CloudFormation templates
Write a CloudFormation template to deploy [infrastructure component]. Use nested stacks for modularity, implement cross-stack references with exports, add resource tagging with mandatory tags (Environment, Team, CostCenter), include stack policies to prevent accidental deletion, and document each resource with Metadata.
Implement drift detection
Advanced: Detect infrastructure drift
Design an infrastructure drift detection system for Terraform-managed resources. Set up scheduled terraform plan runs, parse plan output to detect unapproved changes, alert via [PagerDuty/Slack] on drift detected, generate a drift report, and define the remediation workflow (manual review vs automatic revert).
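The "parse plan output" step could look roughly like the sketch below. It assumes the plan was saved and rendered with `terraform show -json`, whose output contains a `resource_changes` list with an `address` and `change.actions` per resource. The `drifted_resources` helper is a hypothetical name, and treating every action other than `no-op` or `read` as drift is a simplifying assumption.

```python
import json


def drifted_resources(plan_json: str) -> list[tuple[str, list[str]]]:
    """Return (resource address, planned actions) for every pending change."""
    plan = json.loads(plan_json)
    drift = []
    for resource in plan.get("resource_changes", []):
        actions = resource["change"]["actions"]
        # "no-op" means config and real state agree; "read" is a data source refresh.
        if actions not in (["no-op"], ["read"]):
            drift.append((resource["address"], actions))
    return drift
```

A scheduled job could run this on each plan and send the resulting list to the alerting channel only when it is non-empty.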
Kubernetes & Containers
Prompts for managing containers, Kubernetes clusters, and workloads.
Write a Kubernetes deployment manifest
Beginner: Deploy apps to Kubernetes
Write a production-ready Kubernetes Deployment manifest for [application]. Include: proper resource requests and limits (CPU: [X]m/[Y]m, Memory: [X]Mi/[Y]Mi), liveness and readiness probes, anti-affinity rules to spread across nodes, security context (non-root, read-only filesystem), and environment variables from ConfigMap and Secret.
Optimize a Dockerfile
Beginner: Optimize Docker images
Optimize the following Dockerfile for [application]: [paste Dockerfile]. Minimize image size using multi-stage builds, maximize layer cache efficiency, run as non-root user, remove unnecessary packages, use specific base image versions, and add HEALTHCHECK. Explain each optimization and the size reduction achieved.
Debug a CrashLoopBackOff
Beginner: Debug Kubernetes pod crashes
My Kubernetes pod is in CrashLoopBackOff state. Here is the output of kubectl describe pod [pod-name]: [paste output]. Here are the recent logs: [paste logs]. Diagnose the root cause, explain what each relevant line in the output means, and provide step-by-step remediation instructions.
Design Kubernetes resource limits strategy
Intermediate: Right-size Kubernetes resources
Design a resource requests and limits strategy for a Kubernetes cluster running [describe workloads]. Define QoS class targets per workload type (Guaranteed/Burstable/BestEffort), LimitRange defaults per namespace, ResourceQuota per team, and VPA vs HPA recommendation per workload type. Provide the manifests.
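Because the QoS class is derived from requests and limits rather than set directly, it helps to see the rule spelled out. The sketch below is a simplified model of how Kubernetes assigns the class: the `qos_class` helper is hypothetical, containers are represented as plain dictionaries, and quantities are compared as strings, so `"500m"` and `"0.5"` would wrongly count as different.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' "requests" and "limits" mappings."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit when a limit is set.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"   # nothing requested or limited anywhere
    return "Guaranteed" if guaranteed else "Burstable"
```

The practical consequence: a single container without a memory limit is enough to move the whole pod from Guaranteed to Burstable.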
Implement Kubernetes network policies
Intermediate: Secure Kubernetes networking
Write Kubernetes NetworkPolicy manifests to implement zero-trust networking for a [microservices application]. Allow only necessary communication: [describe service communication requirements]. Add a default deny-all policy per namespace, allow DNS resolution, and permit monitoring scraping from the prometheus namespace.
Set up Kubernetes monitoring
Intermediate: Monitor Kubernetes clusters
Set up a Kubernetes monitoring stack using kube-prometheus-stack (Prometheus + Grafana + Alertmanager). Define: custom recording rules for [key metrics], alerting rules with appropriate severity levels and runbook URLs, Grafana dashboard for cluster overview and per-namespace resource usage, and PagerDuty integration for critical alerts.
Design a multi-cluster strategy
Advanced: Design multi-cluster architectures
Design a multi-cluster Kubernetes strategy for [company] needing [describe requirements: HA, multi-region, environments]. Compare: single cluster with namespaces vs multiple clusters. Define cluster topology, workload placement policy, cross-cluster service discovery, unified observability, and GitOps management with [ArgoCD/Flux].
Implement pod security
Advanced: Enforce Kubernetes pod security
Implement Pod Security Standards for a Kubernetes cluster. Configure the Pod Security Admission controller with the restricted policy for production namespaces and the baseline policy for staging, set up exemptions for privileged system workloads, and create OPA Gatekeeper policies for custom constraints such as required labels and an image registry allowlist.
Write a Kubernetes operator
Advanced: Build Kubernetes operators
Outline and write the skeleton of a Kubernetes operator in Go using kubebuilder for managing [custom resource: e.g., 'DatabaseCluster']. Define the CRD schema, reconcile loop logic, status conditions, finalizer for cleanup, events for state changes, and metrics exposed for the operator itself.
Perform a Kubernetes security audit
Intermediate: Audit Kubernetes security
Perform a security audit of the following Kubernetes cluster configuration: [paste relevant manifests or describe configuration]. Check for: privileged containers, host network/PID/IPC usage, missing security contexts, overly permissive RBAC roles, exposed Kubernetes API, etcd encryption, and audit log configuration.
Observability & Incident Response
Prompts for logging, monitoring, alerting, and handling incidents effectively.
Design a logging strategy
Beginner: Design centralized logging
Design a centralized logging strategy for [application] deployed on [infrastructure]. Define: log levels and when to use each, structured log format (JSON fields to include), log aggregation stack ([ELK/Loki/Datadog]), retention policy, log-based alerting rules, and how to correlate logs across microservices using trace IDs.
Write SLOs and SLIs
Intermediate: Define service reliability targets
Define SLOs and SLIs for [service name] based on its user-facing functions: [describe functions]. For each SLO, specify: the SLI metric and measurement method, target percentage over [28-day rolling window], error budget in minutes/requests, alerting policy (burn rate alerts), and what actions to take when the error budget is exhausted.
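The error budget and burn rate mentioned in this prompt follow from simple arithmetic, sketched below. The function names are hypothetical, and the time-based budget assumes an availability-style SLI measured in minutes of the rolling window.

```python
def error_budget_minutes(slo_target: float, window_days: int = 28) -> float:
    """Minutes of allowed unavailability in the window for a given SLO target."""
    return (1 - slo_target) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 uses it up exactly at window end."""
    return observed_error_ratio / (1 - slo_target)


def days_to_exhaustion(observed_error_ratio: float, slo_target: float,
                       window_days: int = 28) -> float:
    """Days until a full budget is gone if the current burn rate persists."""
    return window_days / burn_rate(observed_error_ratio, slo_target)
```

A 99.9% target over 28 days leaves roughly 40 minutes of budget. With 1% of requests failing against that target, the burn rate is about 10, so a full budget would last about 2.8 days, which is why burn rate alerts page long before the window ends.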
Write an incident runbook
Intermediate: Create incident runbooks
Write an incident response runbook for [service name] for the alert: '[alert name]'. Include: alert description and impact, immediate triage steps (commands to run), diagnostic decision tree, remediation procedures for each likely root cause, escalation path, customer communication template, and post-incident checklist.
Debug a production issue
Beginner: Debug production incidents
Help me debug a production issue: [describe symptoms, error messages, when it started]. Here are the relevant metrics and logs: [paste data]. Walk me through a structured debugging process: hypothesis formation, verification commands to run, and how to isolate the root cause without making the incident worse.
Write a post-mortem
Intermediate: Write incident post-mortems
Write a blameless post-mortem for the following incident: [describe what happened, impact, duration, timeline]. Include: incident summary, timeline, root cause analysis (5 Whys), contributing factors, impact assessment, what went well, what went wrong, and action items with owners and due dates. Follow the Google SRE postmortem format.
Set up distributed tracing
Intermediate: Implement distributed tracing
Set up distributed tracing for a microservices application using [Jaeger/Zipkin/Tempo + OpenTelemetry]. Instrument [Node.js/Python/Go] services with OpenTelemetry SDK, propagate trace context across HTTP and message queue boundaries, configure sampling strategy for production, and set up trace-based alerting for latency anomalies.
Design alerting rules
Intermediate: Configure meaningful alerts
Design alerting rules for a [service type] using Prometheus/Alertmanager. Define rules for: error rate (>1% for 5 min = warning, >5% = critical), latency p99 (>500ms = warning, >2s = critical), resource saturation, and business metrics. Include alert labels for routing, severity, and runbook URL. Avoid alert fatigue.
Build a chaos engineering experiment
Advanced: Test system resilience
Design a chaos engineering experiment for [service/system] to test [specific resilience hypothesis: e.g., 'the system handles database failover within 30 seconds']. Define: steady state hypothesis and metrics, chaos injection method using [Chaos Monkey/Litmus/toxiproxy], blast radius containment, success criteria, and rollback procedure.
Implement on-call rotation
Intermediate: Design sustainable on-call processes
Design an on-call rotation policy for a team of [N] engineers supporting [service] with SLA of [response time]. Define: rotation schedule, escalation chain, on-call responsibilities, alert acknowledgment SLA, handoff process, compensation policy, and how to reduce on-call burden (reduce alert noise, improve runbooks, automate toil).
Create a capacity planning report
Advanced: Plan infrastructure capacity
Create a capacity planning analysis for [service] based on the following current metrics: [paste CPU, memory, storage, network utilization]. Project resource needs for [3/6/12] months assuming [X%] monthly growth. Identify when each resource will hit critical thresholds, and recommend scaling actions with cost estimates.
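The "when will each resource hit its threshold" question reduces to a compound growth calculation. Here is a minimal sketch, assuming utilization grows by a fixed percentage each month; the `months_until_threshold` helper is a hypothetical name, and real workloads rarely grow this smoothly.

```python
import math
from typing import Optional


def months_until_threshold(current: float, threshold: float,
                           monthly_growth: float) -> Optional[int]:
    """Whole months until `current` reaches `threshold` at compound growth.

    `monthly_growth` is a fraction, e.g. 0.10 for 10% per month.
    Returns None when there is no growth, since the threshold is never reached.
    """
    if current >= threshold:
        return 0
    if monthly_growth <= 0:
        return None
    # Solve current * (1 + g)^n >= threshold for the smallest integer n.
    return math.ceil(math.log(threshold / current) / math.log(1 + monthly_growth))
```

For instance, CPU at 50% utilization with a critical threshold of 80% and 10% monthly growth crosses the line in month 5, which gives a concrete deadline for the scaling action.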
Security & Compliance
Prompts to harden infrastructure and meet compliance requirements.
Harden a Linux server
Intermediate: Harden Linux servers
Write an Ansible playbook to harden an Ubuntu 22.04 server following CIS Benchmark Level 1. Include: SSH hardening (key-only auth, disable root), firewall configuration (ufw), fail2ban setup, automatic security updates, kernel parameter hardening via sysctl, filesystem permissions audit, and auditd configuration.
Design IAM least-privilege policy
Beginner: Apply least-privilege IAM
Design an IAM policy for [AWS/GCP/Azure] for a service that needs to: [list what the service needs to do]. Write the policy with minimum required permissions, use resource-level restrictions where possible, add conditions (MFA required, IP restrictions, time of day), and explain each permission granted and why it is needed.
Set up AWS Security Hub
Intermediate: Centralize security findings
Set up AWS Security Hub with [CIS AWS Foundations / PCI DSS / NIST 800-53] standard enabled across [N] accounts using AWS Organizations. Configure: automatic finding aggregation to security account, EventBridge rules to create Jira tickets for HIGH/CRITICAL findings, suppression rules for accepted risks, and weekly compliance report.
Implement secret scanning
Beginner: Prevent credential leaks
Set up secret scanning in a [GitHub/GitLab] repository to prevent credentials from being committed. Configure: [git-secrets/gitleaks/detect-secrets] as pre-commit hook, CI pipeline secret scanning step, GitHub secret scanning alerts, baseline for existing false positives, and a response process for when a real secret is detected.
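At their core, the scanners named above match text against known credential patterns. The sketch below shows the idea for a single pattern, long-term AWS access key IDs, which begin with `AKIA` followed by 16 uppercase letters or digits. The `find_secrets` helper is hypothetical and far less thorough than gitleaks or detect-secrets, which ship many patterns plus entropy checks.

```python
import re

# Long-term AWS access key IDs: "AKIA" followed by 16 uppercase alphanumerics.
AWS_ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line number, matched key ID) for every suspected key in text."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in AWS_ACCESS_KEY_ID.finditer(line):
            findings.append((line_number, match.group(0)))
    return findings
```

Run against staged changes in a pre-commit hook, a non-empty result would block the commit; the placeholder key `AKIAIOSFODNN7EXAMPLE` from AWS documentation is a typical entry for the false-positive baseline.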
Design a compliance-as-code framework
Advanced: Automate compliance checks
Design a compliance-as-code framework for [SOC 2 / ISO 27001 / HIPAA] using [Open Policy Agent / AWS Config / Chef InSpec]. Map [5 key controls] to automated policy checks, define evidence collection automation, integrate checks into CI/CD pipeline, and generate audit-ready compliance reports automatically.
Write a network security audit
Intermediate: Audit network security
Audit the following network architecture for security issues: [describe VPC/network topology with subnets, security groups, NACLs]. Check for: overly permissive inbound rules, missing egress restrictions, public-facing resources that should be private, lack of network segmentation, unencrypted inter-service communication, and missing VPC flow logs.
Implement zero-trust networking
Advanced: Implement zero-trust security
Design a zero-trust network architecture for [application] currently using perimeter-based security. Define: identity-based access control using [service mesh / mTLS], micro-segmentation approach, device trust verification, just-in-time access for privileged operations, and migration path from current architecture.
Set up vulnerability scanning pipeline
Intermediate: Automate vulnerability scanning
Set up an end-to-end vulnerability scanning pipeline for container images and infrastructure. Include: Trivy for container scanning in CI, Amazon Inspector for running workloads, and tfsec/checkov for IaC. Define severity thresholds that block deployment, aggregate findings in [platform], and set remediation SLAs by severity.
Design a backup and recovery strategy
Intermediate: Plan backup and recovery
Design a backup and recovery strategy for [infrastructure components: databases, file storage, config] meeting RPO of [X hours] and RTO of [Y hours]. Define: backup frequency and retention per data type, cross-region replication for disaster recovery, backup encryption, access controls, regular restore testing schedule, and cost estimate.
Respond to a security incident
Advanced: Respond to security incidents
Write a security incident response playbook for [incident type: e.g., 'compromised AWS access keys']. Cover: immediate containment actions (step-by-step commands), forensic evidence collection, scope assessment, communication to stakeholders, remediation steps, regulatory notification requirements, and hardening actions to prevent recurrence.
Pro Tips
Include your current tool versions
DevOps tooling changes rapidly. Always specify exact versions (for example, Kubernetes 1.29, Terraform 1.7, Helm 3.14) to get accurate syntax and avoid deprecated APIs. A prompt without versions often returns outdated configuration that fails validation or relies on APIs your version has already removed.
Ask for rollback plans alongside deployment plans
Whenever you ask for a deployment or migration plan, add 'include a complete rollback procedure'. Production engineers know that rollback must be as well-planned as the deployment itself. This habit prevents many serious incidents.
Paste error messages verbatim
When debugging, paste the complete error message, stack trace, or kubectl output verbatim — never paraphrase it. AI assistants pattern-match on exact error text, and summarizing often loses the specific details needed for accurate diagnosis.
Request idempotent scripts
Always ask for 'idempotent scripts that are safe to run multiple times'. This single requirement forces the AI to add proper checks (if exists, skip), making automation scripts significantly safer for production use.
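The "if exists, skip" check is the heart of idempotency. As a minimal sketch of what to expect from a well-behaved script, the hypothetical `ensure_line` helper below adds a configuration line only when it is missing and reports whether it changed anything.

```python
from pathlib import Path


def ensure_line(path: str, line: str) -> bool:
    """Make sure `line` is present in the file; return True only if it was added."""
    file = Path(path)
    existing = file.read_text().splitlines() if file.exists() else []
    if line in existing:
        return False  # already in the desired state, so do nothing
    # Rewrite the file with the new line appended and a trailing newline.
    file.write_text("\n".join(existing + [line]) + "\n")
    return True
```

Running it twice with the same arguments changes the file once and then reports no change, so a retry after a partial failure cannot duplicate settings. Returning a changed flag mirrors how tools such as Ansible report task results.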
Ask for runbook documentation with every script
Add 'generate a runbook section explaining when to use this script, what it does, required permissions, expected output, and how to verify it succeeded'. This documentation is invaluable during 3 a.m. incidents when the original author is unavailable.