100 AI Prompts for DevOps & SRE — Complete Guide

DevOps and Site Reliability Engineering require mastering an enormous toolchain — from CI/CD pipelines to Kubernetes manifests to incident runbooks. AI assistants can help you write and debug infrastructure code, design resilient systems, and communicate clearly during incidents. These 100 prompts cover the full DevOps lifecycle.

CI/CD & Automation

Prompts to build and improve continuous integration and deployment pipelines.

Write a GitHub Actions workflow

Beginner

Create CI/CD pipelines

Write a GitHub Actions workflow for a [language/framework] application that runs on every pull request and push to main. Include: dependency caching, linting, unit tests, build, Docker image build and push to [registry], and deployment to [environment]. Use environment secrets for credentials, and describe the branch protection rules to enable in the repository settings.

Build a multi-stage deployment pipeline

Intermediate

Design production deployment pipelines

Design a multi-stage deployment pipeline for [application] going through dev → staging → production. Each stage should run: automated tests, security scanning (SAST, dependency audit), smoke tests after deployment, and require manual approval for production. Include rollback mechanism and notification to Slack.

Set up GitOps with ArgoCD

Intermediate

Implement GitOps workflows

Set up a GitOps workflow using ArgoCD to deploy a [application] to Kubernetes. Define the ArgoCD Application manifest, configure automated sync with self-heal, set up app-of-apps pattern for multiple environments, implement image updater for automatic image tag updates, and configure RBAC for team access.

Write a deployment script

Beginner

Automate deployment scripts

Write a bash deployment script for [application] that performs: pre-deployment health check, blue-green deployment with nginx upstream switch, post-deployment smoke test (curl endpoints), automatic rollback if smoke test fails, and sends a deployment notification to [Slack/PagerDuty]. Include set -euo pipefail.

Implement canary deployments

Advanced

Roll out changes safely with canary

Design and implement a canary deployment strategy for [service] on Kubernetes using [Argo Rollouts/Flagger/manual approach]. Define: traffic split percentages and progression schedule, metrics to analyze (error rate, latency p99), automatic rollback conditions, and how to manually promote or abort a canary.
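
To show what "progression schedule" and "automatic rollback conditions" can look like once they are written down, here is a minimal Python sketch of the decision made at each analysis interval. The step list, the thresholds, and the names `CanaryMetrics` and `next_action` are all illustrative assumptions; this is not the API of Argo Rollouts or Flagger, which express the same idea declaratively.

```python
from dataclasses import dataclass

# Hypothetical traffic progression: percent of requests sent to the canary.
STEPS = [5, 25, 50, 100]


@dataclass
class CanaryMetrics:
    error_rate: float      # fraction of failed requests, e.g. 0.01 means 1%
    latency_p99_ms: float  # 99th percentile latency in milliseconds


def next_action(current_weight: int, metrics: CanaryMetrics,
                max_error_rate: float = 0.01,
                max_p99_ms: float = 500.0) -> tuple[str, int]:
    """Return (action, new_weight) for one analysis interval."""
    # Any breached threshold aborts and sends all traffic back to stable.
    if metrics.error_rate > max_error_rate or metrics.latency_p99_ms > max_p99_ms:
        return ("rollback", 0)
    # Healthy: move to the next configured step, or finish at 100%.
    later_steps = [step for step in STEPS if step > current_weight]
    if not later_steps:
        return ("promoted", 100)
    return ("advance", later_steps[0])
```

For example, `next_action(5, CanaryMetrics(0.002, 310.0))` advances the canary to 25% of traffic, while an error rate of 3% at any step returns `("rollback", 0)`. Including your own thresholds and step sizes in the prompt gives the AI concrete values to encode.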

Optimize CI pipeline speed

Intermediate

Speed up slow CI pipelines

My CI pipeline for [application] takes [X minutes]. Here is the current workflow config: [paste config]. Identify the top bottlenecks and provide specific optimizations: dependency caching strategies, test parallelization, Docker layer caching, conditional step execution, and build artifact reuse. Target: under [Y minutes].

Set up semantic versioning automation

Intermediate

Automate semantic versioning

Implement automated semantic versioning for a [language] project using conventional commits (feat/fix/breaking change). Set up: commit message linting with commitlint, automated CHANGELOG generation with semantic-release, automatic version bump in package.json/pyproject.toml, git tag creation, and GitHub Release publishing.
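
The rule behind the automation is small enough to sketch by hand, which helps when reviewing what the AI proposes. The Python below is a simplified illustration, assuming only `feat`, `fix`, and breaking changes affect the version; real tools such as semantic-release have configurable rules, and the function names here are hypothetical.

```python
import re

# Matches a Conventional Commits header such as "feat(api)!: add v2 endpoint".
HEADER = re.compile(r"^(?P<type>\w+)(\([^)]*\))?(?P<bang>!)?: .+")
BREAKING_FOOTERS = ("BREAKING CHANGE:", "BREAKING-CHANGE:")


def bump_level(message: str) -> str:
    """Classify one commit message as 'major', 'minor', 'patch', or 'none'."""
    lines = message.splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        return "none"
    # A "!" after the type/scope, or a BREAKING CHANGE footer, means major.
    has_footer = any(line.startswith(BREAKING_FOOTERS) for line in lines[1:])
    if match.group("bang") or has_footer:
        return "major"
    return {"feat": "minor", "fix": "patch"}.get(match.group("type"), "none")


def next_version(current: str, messages: list[str]) -> str:
    """Apply the highest bump found in `messages` to a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in current.split("."))
    levels = {bump_level(message) for message in messages}
    if "major" in levels:
        return f"{major + 1}.0.0"
    if "minor" in levels:
        return f"{major}.{minor + 1}.0"
    if "patch" in levels:
        return f"{major}.{minor}.{patch + 1}"
    return current
```

With this sketch, a release containing `feat(api): add export` and `fix: typo` moves 1.4.2 to 1.5.0, and a single `feat!: drop v1 endpoints` moves it to 2.0.0.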

Write a Makefile for a project

Beginner

Create project Makefiles

Write a Makefile for a [language/framework] project with targets for: installing dependencies, running tests, building Docker image, running locally with docker-compose, linting, formatting, database migrations, and generating documentation. Include a help target that lists all targets with descriptions.

Implement pipeline-as-code testing

Advanced

Test CI/CD pipelines

Write tests for a [Jenkins/GitHub Actions/GitLab CI] pipeline using [pipeline testing framework]. Test that: the correct stages run for different branch types, environment variables are properly set, secrets are not logged, build artifacts are produced, and deployment only occurs when tests pass.

Build a release management process

Intermediate

Formalize release processes

Design a release management process for a team of [N] engineers releasing [service] [frequency]. Include: release candidate creation process, version freeze policy, QA handoff checklist, release notes template, production deployment checklist, post-deployment monitoring window, and hotfix process.

Infrastructure as Code

Prompts to write and improve Terraform, Ansible, and cloud infrastructure code.

Write Terraform for a web application

Intermediate

Provision cloud infrastructure

Write Terraform code to provision a production-grade web application infrastructure on [AWS/GCP/Azure]. Include: [VPC with public/private subnets, load balancer, auto-scaling group / ECS / Kubernetes cluster], a managed relational database with a read replica (e.g., RDS), object storage for static assets (e.g., S3), a CDN (e.g., CloudFront), and IAM roles with least-privilege policies.

Create reusable Terraform modules

Intermediate

Build reusable IaC modules

Create a reusable Terraform module for [resource type: e.g., 'ECS service with ALB']. The module should accept variables for [list configurable parameters], expose outputs for [list outputs], include input validation, follow naming conventions with [prefix] variable, and include README documentation with usage example.

Write an Ansible playbook

Beginner

Automate server configuration

Write an Ansible playbook to configure [server type] servers running [OS]. The playbook should: install and configure [list software], set up systemd services, configure [firewall rules], deploy application from [source], and run a post-configuration smoke test. Use roles for organization and support both Ubuntu and RHEL.

Audit Terraform for security issues

Intermediate

Audit IaC for security issues

Review the following Terraform code for security issues: [paste Terraform]. Check for: overly permissive IAM policies, public S3 buckets, security groups open to 0.0.0.0/0 on sensitive ports, unencrypted storage, missing VPC flow logs, and hardcoded credentials. Provide fixes for each finding.

Design Terraform state management

Advanced

Manage Terraform state at scale

Design a Terraform state management strategy for a team of [N] engineers managing [number] environments across [number] AWS accounts. Include: remote state backend configuration (S3 + DynamoDB locking), state file organization (per-environment, per-service), workspace vs separate backends tradeoffs, and state migration procedure.

Write a Helm chart

Intermediate

Package apps as Helm charts

Write a Helm chart for [application name]. Include: Deployment with configurable replicas and resource limits, Service, Ingress with TLS, ConfigMap and Secret management, HorizontalPodAutoscaler, PodDisruptionBudget, and sensible default values.yaml. The chart should support multiple environments via values overrides.

Implement cost optimization with Terraform

Advanced

Reduce cloud infrastructure costs

Review my Terraform infrastructure for [AWS/GCP] and suggest cost optimization opportunities. Current monthly cost is [$X]. Identify: oversized instances, idle resources, savings plan/reserved instance opportunities, storage optimization, data transfer cost reductions, and unused resources. Provide Terraform changes for each optimization.

Set up Terraform CI/CD

Intermediate

Automate Terraform deployments

Set up a CI/CD pipeline for Terraform code in [GitHub/GitLab]. Include: terraform fmt and validate checks, tfsec security scanning, cost estimation with Infracost, plan output as PR comment, automated apply on merge to main (with manual approval for production), and state lock handling.

Write a CloudFormation template

Intermediate

Write CloudFormation templates

Write a CloudFormation template to deploy [infrastructure component]. Use nested stacks for modularity, implement cross-stack references with exports, add resource tagging with mandatory tags (Environment, Team, CostCenter), include stack policies to prevent accidental deletion, and document each resource with Metadata.

Implement drift detection

Advanced

Detect infrastructure drift

Design an infrastructure drift detection system for Terraform-managed resources. Set up scheduled terraform plan runs, parse plan output to detect unapproved changes, alert via [PagerDuty/Slack] on drift detected, generate a drift report, and define the remediation workflow (manual review vs automatic revert).
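
The "parse plan output" step is the part most worth seeing in code. The Python sketch below assumes the scheduled job saves a plan and converts it with `terraform show -json <planfile>`, then reads the `resource_changes` list from that JSON. The function name and the sample addresses in the usage note are illustrative, and routing the result to PagerDuty or Slack is left out.

```python
import json

# Actions that do not indicate a pending change to managed infrastructure.
IGNORED_ACTIONS = (["no-op"], ["read"])


def drifted_resources(plan_json: str) -> dict[str, list[str]]:
    """Map resource address -> planned actions, skipping unchanged entries.

    Expects the JSON text printed by `terraform show -json <planfile>`.
    """
    plan = json.loads(plan_json)
    drift = {}
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change["change"]["actions"]
        if actions not in IGNORED_ACTIONS:
            drift[resource_change["address"]] = actions
    return drift
```

An empty result means the plan found nothing to change. A result such as `{"aws_instance.web": ["delete", "create"]}` (a made-up address) is what the alert and the drift report would be built from. For a simple yes/no check, `terraform plan -detailed-exitcode` exits with code 2 when changes are present.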

Kubernetes & Containers

Prompts for managing containers, Kubernetes clusters, and workloads.

Write a Kubernetes deployment manifest

Beginner

Deploy apps to Kubernetes

Write a production-ready Kubernetes Deployment manifest for [application]. Include: proper resource requests and limits (CPU: [X]m/[Y]m, Memory: [X]Mi/[Y]Mi), liveness and readiness probes, anti-affinity rules to spread across nodes, security context (non-root, read-only filesystem), and environment variables from ConfigMap and Secret.

Optimize a Dockerfile

Beginner

Optimize Docker images

Optimize the following Dockerfile for [application]: [paste Dockerfile]. Minimize image size using multi-stage builds, maximize layer cache efficiency, run as non-root user, remove unnecessary packages, use specific base image versions, and add HEALTHCHECK. Explain each optimization and the size reduction achieved.

Debug a CrashLoopBackOff

Beginner

Debug Kubernetes pod crashes

My Kubernetes pod is in CrashLoopBackOff state. Here is the output of kubectl describe pod [pod-name]: [paste output]. Here are the recent logs: [paste logs]. Diagnose the root cause, explain what each relevant line in the output means, and provide step-by-step remediation instructions.

Design Kubernetes resource limits strategy

Intermediate

Right-size Kubernetes resources

Design a resource requests and limits strategy for a Kubernetes cluster running [describe workloads]. Define QoS class targets per workload type (Guaranteed/Burstable/BestEffort), LimitRange defaults per namespace, ResourceQuota per team, and VPA vs HPA recommendation per workload type. Provide the manifests.

Implement Kubernetes network policies

Intermediate

Secure Kubernetes networking

Write Kubernetes NetworkPolicy manifests to implement zero-trust networking for a [microservices application]. Allow only necessary communication: [describe service communication requirements]. Add a default deny-all policy per namespace, allow DNS resolution, and permit monitoring scraping from the prometheus namespace.

Set up Kubernetes monitoring

Intermediate

Monitor Kubernetes clusters

Set up a Kubernetes monitoring stack using kube-prometheus-stack (Prometheus + Grafana + Alertmanager). Define: custom recording rules for [key metrics], alerting rules with appropriate severity levels and runbook URLs, Grafana dashboard for cluster overview and per-namespace resource usage, and PagerDuty integration for critical alerts.

Design a multi-cluster strategy

Advanced

Design multi-cluster architectures

Design a multi-cluster Kubernetes strategy for [company] needing [describe requirements: HA, multi-region, environments]. Compare: single cluster with namespaces vs multiple clusters. Define cluster topology, workload placement policy, cross-cluster service discovery, unified observability, and GitOps management with [ArgoCD/Flux].

Implement pod security

Advanced

Enforce Kubernetes pod security

Implement Pod Security Standards for a Kubernetes cluster. Configure the Pod Security Admission controller with the restricted level for production namespaces and the baseline level for staging, set up exemptions for privileged system workloads, and create OPA Gatekeeper policies for custom constraints like required labels and an image registry allowlist.

Write a Kubernetes operator

Advanced

Build Kubernetes operators

Outline and write the skeleton of a Kubernetes operator in Go using kubebuilder for managing [custom resource: e.g., 'DatabaseCluster']. Define the CRD schema, reconcile loop logic, status conditions, finalizer for cleanup, events for state changes, and metrics exposed for the operator itself.

Perform a Kubernetes security audit

Intermediate

Audit Kubernetes security

Perform a security audit of the following Kubernetes cluster configuration: [paste relevant manifests or describe configuration]. Check for: privileged containers, host network/PID/IPC usage, missing security contexts, overly permissive RBAC roles, an exposed Kubernetes API, missing etcd encryption at rest, and missing or incomplete audit log configuration.

Observability & Incident Response

Prompts for logging, monitoring, alerting, and handling incidents effectively.

Design a logging strategy

Beginner

Design centralized logging

Design a centralized logging strategy for [application] deployed on [infrastructure]. Define: log levels and when to use each, structured log format (JSON fields to include), log aggregation stack ([ELK/Loki/Datadog]), retention policy, log-based alerting rules, and how to correlate logs across microservices using trace IDs.
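
As a reference point for the "structured log format" and "trace IDs" parts of this prompt, here is a minimal Python sketch using only the standard `logging` module. The field names (`timestamp`, `level`, `service`, `trace_id`, `message`) are an assumed schema, not a standard, and `JsonFormatter` and `build_logger` are illustrative names.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            # `service` and `trace_id` arrive via the `extra=` argument.
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes one JSON line per event to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("checkout").info("payment accepted", extra={"service": "checkout", "trace_id": "abc123"})` emits one JSON line. Because every service logs the same `trace_id` for a request, the aggregation stack can join those lines across microservices.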

Write SLOs and SLIs

Intermediate

Define service reliability targets

Define SLOs and SLIs for [service name] based on its user-facing functions: [describe functions]. For each SLO, specify: the SLI metric and measurement method, target percentage over [28-day rolling window], error budget in minutes/requests, alerting policy (burn rate alerts), and what actions to take when the error budget is exhausted.
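
The error budget and burn rate figures this prompt asks for follow from simple arithmetic, sketched below in Python so you can sanity-check the AI's numbers. It assumes an availability-style SLO measured over a rolling window; the function names are illustrative.

```python
def error_budget_minutes(slo_target: float, window_days: int = 28) -> float:
    """Minutes of total unavailability the SLO allows within the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return observed_error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """Hours until the whole budget is gone if the burn rate stays constant."""
    return window_days * 24 / rate
```

For a 99.9% target over 28 days, `error_budget_minutes(0.999)` is about 40.3 minutes. If 1.4% of requests are currently failing, `burn_rate(0.014, 0.999)` is about 14, and at that rate the full 28-day budget would be spent in roughly 48 hours.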

Write an incident runbook

Intermediate

Create incident runbooks

Write an incident response runbook for [service name] for the alert: '[alert name]'. Include: alert description and impact, immediate triage steps (commands to run), diagnostic decision tree, remediation procedures for each likely root cause, escalation path, customer communication template, and post-incident checklist.

Debug a production issue

Beginner

Debug production incidents

Help me debug a production issue: [describe symptoms, error messages, when it started]. Here are the relevant metrics/logs: [paste data]. Walk me through a structured debugging process: hypothesis formation, verification commands to run, and how to isolate the root cause without making the incident worse.

Write a post-mortem

Intermediate

Write incident post-mortems

Write a blameless post-mortem for the following incident: [describe what happened, impact, duration, timeline]. Include: incident summary, timeline, root cause analysis (5 Whys), contributing factors, impact assessment, what went well, what went wrong, and action items with owners and due dates. Follow the Google SRE post-mortem format.

Set up distributed tracing

Intermediate

Implement distributed tracing

Set up distributed tracing for a microservices application using [Jaeger/Zipkin/Tempo + OpenTelemetry]. Instrument [Node.js/Python/Go] services with OpenTelemetry SDK, propagate trace context across HTTP and message queue boundaries, configure sampling strategy for production, and set up trace-based alerting for latency anomalies.

Design alerting rules

Intermediate

Configure meaningful alerts

Design alerting rules for a [service type] using Prometheus/Alertmanager. Define rules for: error rate (>1% for 5 min = warning, >5% = critical), latency p99 (>500ms = warning, >2s = critical), resource saturation, and business metrics. Include alert labels for routing, severity, and runbook URL. Avoid alert fatigue.

Build a chaos engineering experiment

Advanced

Test system resilience

Design a chaos engineering experiment for [service/system] to test [specific resilience hypothesis: e.g., 'the system handles database failover within 30 seconds']. Define: steady state hypothesis and metrics, chaos injection method using [Chaos Monkey/Litmus/Toxiproxy], blast radius containment, success criteria, and rollback procedure.

Implement on-call rotation

Intermediate

Design sustainable on-call processes

Design an on-call rotation policy for a team of [N] engineers supporting [service] with SLA of [response time]. Define: rotation schedule, escalation chain, on-call responsibilities, alert acknowledgment SLA, handoff process, compensation policy, and how to reduce on-call burden (reduce alert noise, improve runbooks, automate toil).

Create a capacity planning report

Advanced

Plan infrastructure capacity

Create a capacity planning analysis for [service] based on the following current metrics: [paste CPU, memory, storage, network utilization]. Project resource needs for [3/6/12] months assuming [X%] monthly growth. Identify when each resource will hit critical thresholds, and recommend scaling actions with cost estimates.
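
The projection in this prompt is compound growth, and a short Python sketch makes it easy to check the dates the AI reports. It assumes utilization grows by the same percentage every month, which is a simplification; the function names are illustrative.

```python
import math


def projected_utilization(current_pct: float, monthly_growth: float, months: int) -> float:
    """Utilization after `months` of compound growth."""
    return current_pct * (1 + monthly_growth) ** months


def months_until_threshold(current_pct: float, threshold_pct: float,
                           monthly_growth: float) -> int:
    """Whole months until utilization first reaches the threshold."""
    if current_pct >= threshold_pct:
        return 0
    if monthly_growth <= 0:
        raise ValueError("utilization never reaches the threshold without growth")
    # Solve current * (1 + g) ** n >= threshold for the smallest integer n.
    return math.ceil(math.log(threshold_pct / current_pct) / math.log(1 + monthly_growth))
```

For example, CPU at 45% growing 10% per month reaches about 79.7% after six months and crosses an 80% threshold in month seven, so `months_until_threshold(45, 80, 0.10)` returns 7.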

Security & Compliance

Prompts to harden infrastructure and meet compliance requirements.

Harden a Linux server

Intermediate

Harden Linux servers

Write an Ansible playbook to harden an Ubuntu 22.04 server following CIS Benchmark Level 1. Include: SSH hardening (key-only auth, disable root), firewall configuration (ufw), fail2ban setup, automatic security updates, kernel parameter hardening via sysctl, filesystem permissions audit, and auditd configuration.

Design IAM least-privilege policy

Beginner

Apply least-privilege IAM

Design an IAM policy for [AWS/GCP/Azure] for a service that needs to: [list what the service needs to do]. Write the policy with minimum required permissions, use resource-level restrictions where possible, add conditions (MFA required, IP restrictions, time of day), and explain each permission granted and why it is needed.

Set up AWS Security Hub

Intermediate

Centralize security findings

Set up AWS Security Hub with [CIS AWS Foundations / PCI DSS / NIST 800-53] standard enabled across [N] accounts using AWS Organizations. Configure: automatic finding aggregation to security account, EventBridge rules to create Jira tickets for HIGH/CRITICAL findings, suppression rules for accepted risks, and weekly compliance report.

Implement secret scanning

Beginner

Prevent credential leaks

Set up secret scanning in a [GitHub/GitLab] repository to prevent credentials from being committed. Configure: [git-secrets/gitleaks/detect-secrets] as pre-commit hook, CI pipeline secret scanning step, GitHub secret scanning alerts, baseline for existing false positives, and a response process for when a real secret is detected.

Design a compliance-as-code framework

Advanced

Automate compliance checks

Design a compliance-as-code framework for [SOC 2 / ISO 27001 / HIPAA] using [Open Policy Agent / AWS Config / Chef InSpec]. Map [5 key controls] to automated policy checks, define evidence collection automation, integrate checks into CI/CD pipeline, and generate audit-ready compliance reports automatically.

Write a network security audit

Intermediate

Audit network security

Audit the following network architecture for security issues: [describe VPC/network topology with subnets, security groups, NACLs]. Check for: overly permissive inbound rules, missing egress restrictions, public-facing resources that should be private, lack of network segmentation, unencrypted inter-service communication, and missing VPC flow logs.

Implement zero-trust networking

Advanced

Implement zero-trust security

Design a zero-trust network architecture for [application] currently using perimeter-based security. Define: identity-based access control using [service mesh / mTLS], micro-segmentation approach, device trust verification, just-in-time access for privileged operations, and migration path from current architecture.

Set up vulnerability scanning pipeline

Intermediate

Automate vulnerability scanning

Set up an end-to-end vulnerability scanning pipeline for container images and infrastructure. Include: Trivy for container scanning in CI, Amazon Inspector for running workloads, tfsec/Checkov for IaC, severity thresholds that block deployment, aggregation of findings in [platform], and remediation SLAs by severity.

Design a backup and recovery strategy

Intermediate

Plan backup and recovery

Design a backup and recovery strategy for [infrastructure components: databases, file storage, config] meeting RPO of [X hours] and RTO of [Y hours]. Define: backup frequency and retention per data type, cross-region replication for disaster recovery, backup encryption, access controls, regular restore testing schedule, and cost estimate.

Respond to a security incident

Advanced

Respond to security incidents

Write a security incident response playbook for [incident type: e.g., 'compromised AWS access keys']. Cover: immediate containment actions (step-by-step commands), forensic evidence collection, scope assessment, communication to stakeholders, remediation steps, regulatory notification requirements, and hardening actions to prevent recurrence.

Pro Tips

Include your current tool versions

DevOps tooling changes rapidly. Always specify exact versions (for example Kubernetes 1.29, Terraform 1.7, Helm 3.14) to get accurate syntax and avoid deprecated APIs. A prompt without versions often returns configuration written for older releases, which may be rejected or behave differently on the version you run.

Ask for rollback plans alongside deployment plans

Whenever you ask for a deployment or migration plan, add 'include a complete rollback procedure'. Production engineers know that rollback must be as well-planned as the deployment itself. Having the rollback steps written before the change starts shortens recovery when a deployment goes wrong.

Paste error messages verbatim

When debugging, paste the complete error message, stack trace, or kubectl output verbatim — never paraphrase it. AI assistants pattern-match on exact error text, and summarizing often loses the specific details needed for accurate diagnosis.

Request idempotent scripts

Always ask for 'idempotent scripts that are safe to run multiple times'. This single requirement pushes the AI to add state checks (if it already exists, skip), which makes automation scripts safer to re-run in production.
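
To make "check, then act" concrete, here is a minimal Python sketch of one idempotent operation: adding a line to a config file only when it is missing. The function name `ensure_line` and the file used in the usage note are hypothetical.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file at `path`.

    Returns True when the file was changed, False when it was already correct.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # desired state already reached: do nothing
    path.write_text("\n".join(existing + [line]) + "\n")
    return True
```

Calling `ensure_line(Path("app.conf"), "max_connections=200")` twice changes the file only on the first call; the second call returns False and leaves the file untouched. A non-idempotent script that always appends would add a duplicate line on every run.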

Ask for runbook documentation with every script

Add 'generate a runbook section explaining when to use this script, what it does, required permissions, expected output, and how to verify it succeeded'. This documentation is invaluable during 3am incidents when the original author is unavailable.