The SRE and DevOps discipline has always demanded breadth: you need to be part software engineer, part systems architect, part data analyst, and part communicator — often simultaneously, during high-stress outages when every second of downtime has a dollar value attached. In 2026, AI models have become the most powerful force multiplier available to infrastructure engineers, not because they replace expertise, but because they eliminate the blank-page problem and compress hours of templating work into minutes.
The engineers getting the most out of AI tools are not the ones asking generic questions. They are the ones who have learned to write precise, context-rich prompts that treat the AI as a knowledgeable colleague who happens to have never seen their specific stack. This guide gives you the exact prompt templates that work — ones that have been refined through real production use across teams running everything from bare-metal Kubernetes clusters to fully managed cloud-native architectures.
Each section covers a core SRE or DevOps workflow, explains why AI is particularly useful for it, provides a copy-paste prompt template with clearly marked placeholders, and includes a pro tip for squeezing even more value out of the model. Whether you are a senior SRE who wants to draft documents faster or a DevOps engineer exploring AI-assisted automation, these prompts will save you hours every week.
1. Incident Response & Triage
Incident response is one of the highest-leverage areas for AI assistance. During an active incident, cognitive load is at its peak — you are simultaneously diagnosing, communicating, coordinating, and documenting. AI can offload the communication and initial hypothesis-generation tasks, freeing your mental bandwidth for the actual debugging work.
The key is to give the model maximum context up front: your symptoms, the affected services, recent deployments, and any error messages you have already seen. The model will not have access to your observability stack, so you need to bring the data to it. Think of it like briefing a very knowledgeable on-call engineer who has just been paged in with no prior context.
You are a senior SRE helping triage an active production incident. Here is the current situation:

Service affected: [SERVICE_NAME]
Symptoms: [DESCRIBE_SYMPTOMS — e.g., "API p99 latency spiked from 120ms to 4.2s at 02:14 UTC"]
Error logs (paste relevant lines): [ERROR_LOG_SNIPPET]
Recent deployments in the last 6 hours: [DEPLOYMENT_LIST]
Infrastructure: [CLOUD_PROVIDER + SERVICES — e.g., "AWS EKS, RDS Aurora, ElastiCache Redis"]
Current traffic: [TRAFFIC_LEVEL — e.g., "~60% of normal, load shedding active"]

Please:
1. List the 5 most likely root causes ranked by probability
2. For each cause, provide the exact commands or queries I should run to confirm or rule it out
3. Suggest the fastest mitigation path for the top two causes
4. Draft a 3-sentence status update I can post to our incident Slack channel right now
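Since the model cannot pull this data itself, it helps to script the collection. A minimal sketch, assuming a git checkout for deployment history and an application log at /var/log/app/app.log (both are illustrative placeholders; substitute your own sources):

```shell
# Assemble incident context into one file to paste into the triage prompt.
# The log path and error pattern below are examples, not a standard location.
OUT="incident-context.txt"
{
  echo "=== deployments: commits in the last 6 hours ==="
  git log --since="6 hours ago" --oneline 2>&1

  echo "=== recent error lines (adjust path and pattern for your stack) ==="
  grep -iE "error|timeout|refused" /var/log/app/app.log 2>/dev/null | tail -40
} > "$OUT"
echo "Context written to $OUT"
```

Run it once when you open the incident channel, then paste the file's contents under the matching placeholders in the template.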
2. Runbook Generation
Runbooks are one of those things every team knows they need but rarely has enough time to write properly. The result is a graveyard of half-finished wiki pages that are out of date six months after they were written. AI changes this equation dramatically: you can generate a comprehensive runbook draft in under two minutes and then invest your time in reviewing and customizing it rather than writing from scratch.
The best runbooks generated by AI include not just the happy path, but also common failure modes and their mitigations. The prompt below is designed to produce operational runbooks that your most junior on-call engineer can follow at 2 AM without needing to escalate.
Generate a production-quality operational runbook for the following procedure:

Procedure name: [PROCEDURE_NAME — e.g., "Database Failover for Aurora MySQL Cluster"]
Service: [SERVICE_NAME]
Infrastructure: [STACK_DETAILS]
Frequency: [HOW_OFTEN_THIS_RUNS — e.g., "emergency only" or "weekly maintenance"]
Audience: [SKILL_LEVEL — e.g., "on-call engineer, may be junior"]

The runbook must include these sections:
1. Overview and purpose (2-3 sentences)
2. Prerequisites and permissions required
3. Pre-flight checks (what to verify before starting)
4. Step-by-step procedure with exact commands (use placeholders for environment-specific values in [BRACKETS])
5. Verification steps (how to confirm success)
6. Rollback procedure
7. Common failure modes and troubleshooting
8. Escalation path if the procedure fails

Format commands in code blocks. Add a warning box before any step that is destructive or irreversible.
3. Postmortems & Root Cause Analysis
Writing postmortems is emotionally taxing work. You are expected to be analytically rigorous, blameless, and communicative all at once, often within 24-48 hours of a stressful incident when the team is exhausted. AI is exceptionally well-suited to help with the structural and analytical parts of postmortem writing, letting humans focus on the nuanced organizational insights and action items.
AI-assisted postmortems consistently produce more thorough "contributing factors" sections because the model will systematically walk through categories (tooling, process, communication, monitoring) that engineers might unconsciously skip when they already have a narrative in mind. This produces more actionable and honest documents.
You are an experienced SRE writing a blameless postmortem. Here is the incident data:

Incident title: [TITLE]
Date & duration: [DATE], duration [DURATION]
Severity: [SEV_LEVEL]
User impact: [IMPACT — e.g., "~12,000 users unable to complete checkout for 47 minutes"]
Systems affected: [SYSTEMS]
Timeline of events (paste your raw notes or Slack thread): [TIMELINE_NOTES]
Confirmed root cause: [ROOT_CAUSE or "unknown — suspected X"]
Contributing factors identified so far: [ANY_FACTORS_ALREADY_KNOWN]
Mitigations applied during incident: [WHAT_YOU_DID]

Please write a complete blameless postmortem including:
1. Executive summary (3-4 sentences, non-technical)
2. Impact summary with metrics
3. Detailed timeline (clean up my notes into a clear chronological format)
4. Root cause analysis using the 5 Whys method
5. Contributing factors (probe for gaps in: monitoring, alerting, deployment process, documentation, runbooks, communication)
6. What went well
7. Action items table (columns: action, owner role, priority, due date placeholder)
8. Lessons learned

Maintain a blameless, systems-thinking tone throughout.
4. Infrastructure as Code (IaC)
Writing infrastructure as code is where AI assistance delivers some of the most tangible time savings. Terraform, Pulumi, Bicep, and CloudFormation all have steep learning curves and verbose syntax — AI models have been trained on enormous amounts of IaC and can generate production-quality resource definitions faster than most engineers can type. The critical skill is telling the model exactly what security posture, tagging strategy, and naming conventions to apply.
Never deploy AI-generated IaC directly to production without running it through terraform plan, tflint, and checkov. Treat it as a senior engineer's first draft that still needs peer review, not as certified production code.
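That review step is easy to script as a pre-merge gate. A minimal sketch using the tools named above; each is skipped if not installed, and terraform plan assumes terraform init has already run in the module directory:

```shell
# Review gate for AI-generated IaC: run each validator that is installed.
FAILED=0
for tool_cmd in "terraform plan -input=false" "tflint" "checkov -d ."; do
  tool="${tool_cmd%% *}"                 # first word is the binary name
  if command -v "$tool" >/dev/null 2>&1; then
    echo "running: $tool_cmd"
    $tool_cmd || FAILED=1                # record failures but run all validators
  else
    echo "skipping: $tool not installed"
  fi
done
if [ "$FAILED" -eq 0 ]; then
  echo "gate: no failures from installed validators"
else
  echo "gate: FAILED"
fi
```

Wiring this into CI so it runs on every pull request makes the "first draft, not certified code" policy enforceable rather than aspirational.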
Generate a production-ready Terraform module for the following infrastructure:

Cloud provider: [AWS / GCP / Azure]
Resource(s): [RESOURCE_DESCRIPTION — e.g., "EKS cluster with managed node groups"]
Environment: [prod / staging / dev]
Region: [REGION]

Requirements:
- Terraform version: >= [VERSION, e.g., 1.7]
- Provider version constraints: latest stable
- Backend: [S3 + DynamoDB / GCS / Azure Blob]
- Tagging strategy: include tags for environment, team, cost-center, managed-by=terraform
- Security requirements: [e.g., "no public endpoints, encryption at rest required, IMDSv2 only"]
- HA requirements: [e.g., "multi-AZ, min 3 nodes"]
- Variable inputs: make all environment-specific values variables with descriptions and validation rules
- Outputs: expose all values downstream modules would typically need

Also provide:
- A variables.tf with types, descriptions, defaults, and validation rules
- An outputs.tf
- A brief usage example in a README code block
- A list of IAM permissions this module requires to run
5. CI/CD Pipeline Design
CI/CD pipelines are simultaneously critical infrastructure and chronic technical debt. They start simple and accrete complexity over years until nobody fully understands why a particular step exists or what happens if it fails. AI is excellent at both designing greenfield pipelines with best practices baked in and auditing existing pipelines to identify bottlenecks, missing security gates, and unnecessary complexity.
When designing a new pipeline, the most valuable thing you can give the AI is your deployment constraints: how often you deploy, what your rollback strategy is, whether you need canary or blue-green capabilities, and what compliance controls must be enforced. The more specific you are about these constraints, the more useful the output will be.
Design a production-grade CI/CD pipeline for the following context:

Application type: [e.g., "Python FastAPI microservice"]
Container registry: [ECR / GCR / Docker Hub / GHCR]
Deployment target: [e.g., "Kubernetes on EKS, using Helm charts"]
CI/CD platform: [GitHub Actions / GitLab CI / CircleCI / Jenkins]
Deployment strategy: [blue-green / canary / rolling / feature flags]
Team size: [NUMBER] engineers, [NUMBER] deploys per day on average
Compliance requirements: [e.g., "SOC 2 — artifact signing required, audit log of all deploys"]
Current pain points: [e.g., "builds take 18 minutes, no SAST scanning, manual approval for prod"]

Please provide:
1. A full pipeline design with all stages (lint → test → build → scan → push → deploy → verify)
2. Parallelization strategy to minimize wall-clock build time
3. Required secret management approach
4. Rollback trigger conditions and automated rollback steps
5. The complete pipeline YAML for [CI/CD_PLATFORM]
6. A table showing which stage gates each type of issue (security, quality, performance)
7. Estimated time per stage and total pipeline duration
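Before asking the model for the full YAML, it is worth being explicit about the gating semantics you want. The sketch below models fail-fast stage ordering in plain shell; the true commands are placeholders, with plausible real tools suggested in the comments:

```shell
# Each stage runs only if every previous stage succeeded, mirroring CI gates.
run_stage() {
  local name="$1"; shift
  echo "stage: $name"
  "$@" || { echo "stage failed: $name"; exit 1; }
}

run_stage lint  true    # e.g. 'ruff check .' or 'golangci-lint run'
run_stage test  true    # e.g. 'pytest -q'
run_stage build true    # e.g. 'docker build -t app:ci .'
run_stage scan  true    # e.g. 'trivy image app:ci'
echo "all stages passed"
```

Once you know which stages must be serial gates and which can run in parallel, the prompt's parallelization question (item 2) becomes much easier for the model to answer well.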
6. Monitoring & Alerting Setup
Alert fatigue is the silent killer of SRE teams. When alerts are not actionable, or thresholds are so sensitive that they fire during normal traffic fluctuations, engineers start ignoring pages — and that is when actual incidents get missed. AI can help you design alerting systems with the right philosophy: every alert should be actionable, have a known response, and represent something a human needs to act on right now.
Use the prompt below to generate a comprehensive monitoring strategy for a service, including both the SLI/SLO definitions and the specific alert configurations. This is particularly useful when onboarding a new service that previously had no structured observability.
Design a comprehensive monitoring and alerting strategy for the following service:

Service name: [SERVICE_NAME]
Service type: [e.g., "user-facing HTTP API", "async message consumer", "batch data pipeline"]
Technology stack: [LANGUAGES, FRAMEWORKS, DATABASES]
Observability platform: [Datadog / Prometheus+Grafana / CloudWatch / New Relic / Honeycomb]
Current SLA commitments: [e.g., "99.9% availability, p99 latency < 500ms"]
Traffic patterns: [e.g., "peaks 10x on weekdays 9-11am EST, near-zero weekends"]

Please provide:
1. Recommended SLIs (Service Level Indicators) with exact metric names and how to instrument them
2. SLO targets with error budget calculations for a 30-day rolling window
3. Alert definitions for each SLO with:
   - Alert name and severity (P1/P2/P3)
   - Exact threshold and evaluation window
   - Rationale for why this threshold (not lower, not higher)
   - Required runbook link placeholder
   - Suggested alert suppression rules to avoid noise
4. Dashboard layout recommendation (what panels, in what order)
5. Synthetic monitoring checks to add
6. Log-based alerts for error conditions not captured by metrics
7. Sample Prometheus recording rules (or equivalent) for the most expensive queries
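The error budget arithmetic behind item 2 is simple enough to verify by hand: the budget equals the window length times (1 - SLO). For a 99.9% availability target over a 30-day window:

```shell
SLO=0.999
WINDOW_MINUTES=$((30 * 24 * 60))   # 43200 minutes in a 30-day window
BUDGET=$(awk -v w="$WINDOW_MINUTES" -v s="$SLO" 'BEGIN { printf "%.1f", w * (1 - s) }')
echo "error budget: $BUDGET minutes per 30 days"   # → 43.2 minutes
```

Checking the model's error budget numbers against this one-liner is a fast way to catch arithmetic slips before they end up in an SLO document.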
7. Kubernetes Troubleshooting
Kubernetes troubleshooting has a high barrier to entry because the symptom (a pod not running) can have dozens of root causes distributed across networking, RBAC, resource quotas, image pull policies, readiness probes, and more. AI models have absorbed enormous amounts of Kubernetes documentation and Stack Overflow history and can walk through the diagnostic tree with you systematically.
The trick is to paste the actual output of kubectl describe, kubectl logs, and kubectl get events directly into the prompt. The more raw output you provide, the more specific and accurate the diagnosis will be. Do not summarize — paste the actual text.
I have a Kubernetes issue and need help diagnosing it. Here is all the relevant output:

Cluster version: [K8S_VERSION]
Cloud provider / managed service: [EKS / GKE / AKS / self-managed]
Problem description: [WHAT_IS_HAPPENING — e.g., "Deployment stuck at 0/3 ready, pods crash-looping"]

kubectl describe pod [POD_NAME] output:
[PASTE_FULL_DESCRIBE_OUTPUT]

kubectl logs [POD_NAME] --previous output:
[PASTE_LOG_OUTPUT]

kubectl get events --namespace [NAMESPACE] --sort-by='.lastTimestamp' output:
[PASTE_EVENTS_OUTPUT]

Recent changes made (deploys, config changes, cluster upgrades): [RECENT_CHANGES]

Please:
1. Identify the root cause based on the above output
2. Explain why this is happening in plain language
3. Provide the exact kubectl commands to fix it
4. Explain how to verify the fix worked
5. Suggest what to add to prevent this from happening again (admission controllers, LimitRanges, resource quotas, etc.)
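Collecting those outputs is worth scripting so nothing gets summarized away. A small sketch; NAMESPACE and POD are placeholders for your own values:

```shell
# Dump the three diagnostic outputs the prompt asks for into one file.
NAMESPACE="${NAMESPACE:-default}"
POD="${POD:-my-pod}"
OUT="k8s-diagnostics.txt"
{
  echo "=== kubectl describe pod ==="
  kubectl describe pod "$POD" -n "$NAMESPACE"
  echo "=== kubectl logs --previous (last 200 lines) ==="
  kubectl logs "$POD" -n "$NAMESPACE" --previous --tail=200
  echo "=== recent events, oldest first ==="
  kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp'
} > "$OUT" 2>&1
echo "Wrote $OUT"
```

Paste the file's full contents, unedited, under the matching placeholders in the template.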
Pro tip: for networking-related symptoms, also paste the output of kubectl get networkpolicies -A and your CNI plugin details. Network policy misconfiguration is one of the most common and hardest-to-diagnose classes of Kubernetes issues, and the model will catch it immediately if given this context.

8. Capacity Planning
Capacity planning sits at the uncomfortable intersection of historical data analysis, trend forecasting, and business-input uncertainty. It is the kind of work that often does not get done properly because it feels less urgent than the incident queue, even though poor capacity planning is often the upstream cause of the incidents themselves. AI can dramatically accelerate the analytical and documentation phases of capacity planning.
The prompt below is designed to produce a capacity planning analysis that you can take into a quarterly planning meeting with both engineering leadership and finance. It forces the model to produce recommendations that are tied to specific growth assumptions, making the analysis more honest about its underlying assumptions.
Help me create a capacity planning analysis for the following system:

Service: [SERVICE_NAME]
Current resource utilization (provide averages and peaks):
- CPU: [CURRENT_USAGE, e.g., "avg 45%, peak 78% during business hours"]
- Memory: [CURRENT_USAGE]
- Storage: [CURRENT_USAGE + growth rate, e.g., "2.4TB, growing ~80GB/month"]
- Network: [CURRENT_USAGE]
- Database: [CONNECTIONS, IOPS, STORAGE]
Current infrastructure: [INSTANCE_TYPES, COUNT, REGION]
Current monthly cost: $[COST]
Business growth assumptions:
- Expected traffic growth: [e.g., "30% YoY based on sales forecast"]
- Known capacity events: [e.g., "Product launch in Q3, expected 5x spike for 2 weeks"]
- Data retention requirements changing: [YES/NO + details]
Planning horizon: [6 months / 12 months / 18 months]

Please produce:
1. Projected resource requirements at 6, 12, and 18 months under base, optimistic (+50%), and pessimistic (-20%) growth scenarios
2. Identification of the first resource that will become a bottleneck and at what growth level
3. Scaling strategy recommendations (vertical vs. horizontal, auto-scaling configuration)
4. Infrastructure cost projection for each scenario in a comparison table
5. Specific recommended actions with timeline and priority
6. Risk register: what happens if we do nothing for 6 months
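The linear part of item 1 is plain arithmetic once the growth assumptions are fixed. Using the storage figures from the example placeholders above (2.4 TB today, roughly 80 GB per month):

```shell
CURRENT_TB=2.4
GROWTH_TB_PER_MONTH=0.08   # ~80 GB/month, linear growth assumption
for months in 6 12 18; do
  awk -v c="$CURRENT_TB" -v g="$GROWTH_TB_PER_MONTH" -v m="$months" \
    'BEGIN { printf "month %2d: %.2f TB\n", m, c + g * m }'
done
# → month  6: 2.88 TB / month 12: 3.36 TB / month 18: 3.84 TB
```

Running the numbers yourself first lets you spot when the model's projections drift from the stated assumptions, which is exactly the honesty check this prompt is designed to enforce.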
9. On-Call & Escalation Policies
On-call policies are surprisingly hard to get right. Too aggressive and you burn out your engineers; too lenient and you miss critical incidents. The best on-call policies are explicit about what constitutes each severity level, who gets paged under which conditions, and what the expected response time and first actions are for each scenario. AI can help you draft and critique these policies with a systematic coverage of edge cases you might not have considered.
This prompt is also useful for auditing your existing on-call policies. Engineering managers often inherit on-call rotations and procedures that made sense years ago but have grown stale as the system architecture and team structure evolved.
Help me design a comprehensive on-call and escalation policy for my engineering team:

Team structure: [e.g., "12 engineers across 3 time zones: US/Eastern, UK, India"]
Services owned: [LIST_SERVICES with brief descriptions]
Current on-call pain points: [e.g., "too many P2 pages overnight, unclear escalation path, no deputy backup"]
Incident tooling: [PagerDuty / OpsGenie / VictorOps / other]
Business hours: [HOURS + TIMEZONE]
SLA commitments: [e.g., "P1 response < 15 min, P2 < 1 hour"]

Please create:
1. Severity level definitions (P1-P4) with specific, concrete examples for our services
2. On-call rotation structure recommendations (primary, secondary, escalation tiers)
3. Escalation matrix: who gets paged, in what order, with what delay, for each severity
4. First-response checklist for each severity level (what to do in the first 5 minutes)
5. Handoff procedure template for shift changes during active incidents
6. Burnout prevention policies (compensation, override limits, post-incident rest time)
7. Monthly on-call health metrics to track (and thresholds that should trigger a policy review)
8. PagerDuty / OpsGenie configuration checklist to implement this policy
10. Disaster Recovery Planning
Disaster recovery documentation is the infrastructure equivalent of insurance: you invest in it hoping you never need it, but when you do need it, the quality of that documentation is the difference between a recoverable incident and an existential one. The challenge is that DR planning requires thinking through failure scenarios that are genuinely hard to imagine when systems are running smoothly.
AI excels at systematically enumerating failure scenarios that human planners tend to overlook — not because human planners are incompetent, but because the human mind naturally anchors on the most likely or most recent failure mode. The prompt below forces comprehensive coverage of failure categories and produces a DR plan that will survive auditor scrutiny.
You are a senior SRE helping design a disaster recovery plan. Here is our environment:

System overview: [DESCRIPTION — e.g., "Multi-tier e-commerce platform on AWS: frontend (CloudFront + S3), APIs (EKS in us-east-1), databases (Aurora MySQL + ElastiCache)"]
Current RTO target: [e.g., "4 hours for full restoration"]
Current RPO target: [e.g., "15 minutes maximum data loss"]
Current backup strategy: [WHAT_EXISTS_TODAY]
Primary region: [REGION]
DR region (if any): [REGION or "none currently"]
Compliance requirements: [e.g., "SOC 2 Type II, must demonstrate annual DR test"]
Last DR test: [DATE or "never"]

Please produce a complete disaster recovery plan including:
1. Failure scenario catalog: enumerate all scenarios from "single AZ failure" up to "full primary region loss" and "ransomware/data corruption"
2. For each scenario: probability (high/medium/low), blast radius, and required recovery approach
3. Recovery procedures for the top 5 most likely and most impactful scenarios, with step-by-step instructions
4. Data backup verification procedures (not just "backups exist" but "backups are restorable")
5. Failover runbook for primary → DR region
6. Communication plan during a DR event (internal, customer, regulatory)
7. DR test plan: what to test, how often, success criteria, and how to run tests without impacting production
8. Gap analysis: what is missing today to achieve our stated RTO/RPO targets
9. Investment roadmap to close the gaps, with rough cost estimates
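One check from item 4 that is worth automating: a backup schedule can never deliver an RPO tighter than its own interval. A minimal sketch using the example numbers from the template (the hourly interval is a hypothetical current state):

```shell
RPO_TARGET_MIN=15          # "15 minutes maximum data loss"
BACKUP_INTERVAL_MIN=60     # hypothetical current schedule: hourly snapshots
if [ "$BACKUP_INTERVAL_MIN" -le "$RPO_TARGET_MIN" ]; then
  echo "OK: worst-case data loss ${BACKUP_INTERVAL_MIN}m is within the ${RPO_TARGET_MIN}m RPO"
else
  echo "GAP: backups every ${BACKUP_INTERVAL_MIN}m cannot meet a ${RPO_TARGET_MIN}m RPO"
fi
```

Gaps like this are exactly what item 8 of the prompt should surface; stating your actual backup interval in the template makes it much harder for the model (or the team) to paper over them.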
Find SRE & DevOps Roles That Value AI Skills
Godle surfaces engineering positions where AI-forward infrastructure skills are explicitly valued — from principal SRE roles at hyperscalers to staff DevOps engineers at high-growth startups.
Frequently Asked Questions
Can AI really help with incident response?
Are AI-generated Terraform or Kubernetes configs safe to use in production?
Only with review. Treat them as drafts: run AI-generated configurations through your standard validation tooling (e.g., terraform validate, tflint, checkov, kube-score), test in a non-production environment, and apply your organization's security policies before promoting to production. AI models can and do produce configurations with subtle security misconfigurations — overly permissive IAM policies, missing encryption settings, or incorrect security group rules. The productivity gain is real, but the review step is non-negotiable. Many teams have adopted a policy of running checkov automatically on all AI-generated IaC before it is even reviewed by a human engineer.
Which AI tools are most useful for SRE and DevOps work in 2026?
How do I find SRE and DevOps jobs that value AI skills?
What is prompt engineering for DevOps and how do I get better at it?
Ready to Work Smarter as an SRE or DevOps Engineer?
Godle helps infrastructure engineers find roles where their skills — including AI-assisted operations — are recognized and rewarded. Build your profile in minutes and get matched with teams building at scale.