Kai — Kubernetes Engineer

Kai is CloudThinker’s container orchestration expert, specializing in Kubernetes cluster management, workload optimization, autoscaling, and operational troubleshooting across EKS, GKE, AKS, and self-managed clusters.

Capabilities

Domain	Capabilities
Cluster Management	Health monitoring, node management, resource allocation, upgrades
Workload Optimization	Pod right-sizing, resource requests/limits, scheduling efficiency
Autoscaling	HPA/VPA/Cluster Autoscaler optimization, scaling policies
Security	RBAC auditing, network policies, pod security, secrets management
Troubleshooting	Crash loops, OOMKills, scheduling failures, networking issues

Supported Platforms

Platform	Support Level
Amazon EKS	Full support with AWS integration
Google GKE	Full support with GCP integration
Azure AKS	Full support with Azure integration
Self-Managed	Kubernetes 1.24+ with metrics-server

Prompt Patterns

Cluster Health

# Health check
@kai check EKS cluster health and pod distribution

# Resource utilization
@kai analyze cluster resource utilization and identify bottlenecks

# Node analysis
@kai identify nodes with <30% CPU utilization for consolidation

# Multi-cluster view
@kai provide health summary across all Kubernetes clusters
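
The node-analysis prompt above asks for nodes below 30% CPU utilization. How Kai computes this is not documented on this page; the following is only a minimal sketch of the idea, assuming utilization is measured as used CPU divided by allocatable CPU. The `NodeUsage` type, the `underutilized_nodes` function, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeUsage:
    """Hypothetical snapshot of one node's CPU usage (illustrative only)."""
    name: str
    cpu_used_millicores: int
    cpu_allocatable_millicores: int


def underutilized_nodes(nodes, threshold=0.30):
    """Return names of nodes whose CPU utilization is below the threshold."""
    result = []
    for node in nodes:
        utilization = node.cpu_used_millicores / node.cpu_allocatable_millicores
        if utilization < threshold:
            result.append(node.name)
    return result


# Made-up sample data, not real cluster output.
sample = [
    NodeUsage("node-a", 400, 4000),   # 10% utilized
    NodeUsage("node-b", 2600, 4000),  # 65% utilized
    NodeUsage("node-c", 1100, 4000),  # 27.5% utilized
]
print(underutilized_nodes(sample))  # ['node-a', 'node-c']
```

A low-utilization node is only a consolidation candidate; whether its pods can actually be rescheduled depends on requests, affinity rules, and disruption budgets, which this sketch ignores.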

Workload Optimization

# Pod right-sizing
@kai analyze pod resource requests/limits and recommend right-sizing

# Scheduling efficiency
@kai identify pods with resource requests far exceeding actual usage

# Cost optimization
@kai identify underutilized nodes and recommend consolidation strategy

# Namespace analysis
@kai analyze resource allocation across namespaces
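
To make "right-sizing" concrete, here is a minimal sketch of one common heuristic: set the CPU request to the 95th percentile of observed usage plus some headroom. This is an assumption for illustration, not a description of Kai's recommendation logic; the function name, the percentile, and the 20% headroom default are hypothetical choices.

```python
import math


def recommend_cpu_request(usage_samples_millicores, headroom_percent=20):
    """Suggest a CPU request: nearest-rank p95 of usage plus headroom."""
    if not usage_samples_millicores:
        raise ValueError("need at least one usage sample")
    ordered = sorted(usage_samples_millicores)
    # Nearest-rank percentile: rank = ceil(0.95 * n), computed with integers.
    rank = (95 * len(ordered) + 99) // 100
    p95 = ordered[rank - 1]
    return math.ceil(p95 * (100 + headroom_percent) / 100)


# Made-up usage samples in millicores.
samples = [120, 135, 150, 140, 160, 155, 130, 145, 210, 150]
print(recommend_cpu_request(samples))  # 252
```

With ten samples the nearest-rank p95 is the largest value (210m), so the suggestion is 210m plus 20%, or 252m. A longer observation window makes the percentile less sensitive to a single spike.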

Autoscaling

# HPA review
@kai review Horizontal Pod Autoscaler policies and recommend improvements

# Scaling analysis
@kai analyze scaling patterns and recommend threshold adjustments

# VPA assessment
@kai evaluate whether Vertical Pod Autoscaler would benefit our workloads

# Cluster autoscaling
@kai review Cluster Autoscaler configuration for cost efficiency
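
When reviewing HPA thresholds, it helps to recall the scaling rule from the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default). The sketch below reproduces that rule in simplified form; the function name and the `min_replicas`/`max_replicas` defaults are illustrative, and real HPA behavior also involves stabilization windows and pod readiness handling that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA rule: scale by the metric ratio, then clamp to bounds."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas averaging 90% CPU against a 60% target.
print(desired_replicas(4, 90, 60))  # 6
```

Lowering the target metric makes scaling more aggressive, which is the kind of threshold adjustment the prompts above ask Kai to evaluate.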

Troubleshooting

# Crash investigation
@kai investigate pod crash loops in payment namespace

# OOM analysis
@kai identify pods experiencing OOMKilled events and recommend fixes

# Scheduling issues
@kai analyze pending pods and identify scheduling constraints

# Network problems
@kai investigate network connectivity issues between services
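
For OOM analysis, the relevant signal in the Kubernetes API is a container status whose last termination reason is `OOMKilled`. The sketch below scans a pod object (as a plain dictionary, in the shape returned by `kubectl get pod -o json`) for that condition. It is an illustration of what to look for, not Kai's implementation; the function name and the sample pod are hypothetical.

```python
def oom_killed_containers(pod):
    """Return names of containers whose last termination reason was OOMKilled."""
    names = []
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for status in statuses:
        terminated = (status.get("lastState") or {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            names.append(status["name"])
    return names


# Made-up pod object trimmed to the fields used above.
sample_pod = {
    "metadata": {"name": "checkout-7d9f", "namespace": "payment"},
    "status": {
        "containerStatuses": [
            {"name": "app", "restartCount": 5,
             "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
            {"name": "sidecar", "restartCount": 0, "lastState": {}},
        ]
    },
}
print(oom_killed_containers(sample_pod))  # ['app']
```

A typical follow-up is to compare the container's memory limit with its observed peak usage before raising the limit.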

Security

# RBAC audit
@kai audit RBAC configuration against least-privilege principles

# Network policies
@kai analyze network policies and recommend security improvements

# Pod security
@kai identify pods running with excessive privileges

# Secrets audit
@kai audit secrets management and recommend rotation strategy
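
As an illustration of what "excessive privileges" can mean in a pod spec, the sketch below flags a few well-known risky settings: host namespace sharing, privileged containers, and containers running as UID 0. The checks chosen and the `privilege_findings` name are assumptions for this example; a real audit would cover more fields (capabilities, host paths, privilege escalation) and is not limited to what is shown.

```python
def privilege_findings(pod):
    """Return human-readable findings for a few risky pod spec settings."""
    findings = []
    spec = pod.get("spec", {})
    for flag in ("hostNetwork", "hostPID", "hostIPC"):
        if spec.get(flag):
            findings.append(f"pod uses {flag}")
    for container in spec.get("containers", []):
        ctx = container.get("securityContext") or {}
        if ctx.get("privileged"):
            findings.append(f"{container['name']}: privileged container")
        if ctx.get("runAsUser") == 0:
            findings.append(f"{container['name']}: runs as root (UID 0)")
    return findings


# Made-up pod spec for demonstration.
sample_pod = {
    "spec": {
        "hostNetwork": True,
        "containers": [
            {"name": "agent", "securityContext": {"privileged": True}},
            {"name": "web", "securityContext": {"runAsUser": 1000}},
        ],
    }
}
print(privilege_findings(sample_pod))
# ['pod uses hostNetwork', 'agent: privileged container']
```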

Tool Usage

Tool	Kai Use Case
`#dashboard`	Cluster health, node status, resource utilization, pod metrics
`#report`	Optimization analysis, security audits, capacity planning
`#recommend`	Right-sizing, scaling policies, consolidation actions
`#alert`	OOMKills, node pressure, pod failures, resource thresholds
`#chart`	Resource trends, scaling patterns, utilization over time

Examples with Tools

@kai #dashboard EKS cluster health with node and pod metrics
@kai #report cluster optimization opportunities with implementation plan
@kai #recommend HPA policies for variable workloads
@kai #alert on pod OOMKilled events or node pressure conditions

Effective Prompts

Include Cluster Context

# Good
@kai analyze production EKS cluster in us-west-2 for pod resource optimization

# Avoid
@kai check our containers

Define Success Metrics

# Good
@kai improve cluster utilization while maintaining <30s pod startup and 99.9% availability

# Avoid
@kai make cluster better

Connection Requirements

Kai requires Kubernetes cluster access with monitoring capabilities:

Component	Required Access
Kubernetes API	Read access to pods, nodes, deployments, services
Metrics Server	Resource metrics for pods and nodes
Events	Cluster events for troubleshooting
Logs	Container logs for debugging
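
The access listed above maps naturally onto a read-only Kubernetes ClusterRole. The sketch below builds such a manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. This page does not specify the exact permissions Kai needs, so treat the rule set and the role name `kai-read-only` as assumptions and check the setup documentation before applying anything.

```python
import json


def read_only_cluster_role(name="kai-read-only"):
    """Build a read-only ClusterRole covering the components in the table."""
    read = ["get", "list", "watch"]
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},  # hypothetical role name
        "rules": [
            # Core API group: pods, nodes, services, and cluster events.
            {"apiGroups": [""],
             "resources": ["pods", "nodes", "services", "events"],
             "verbs": read},
            # Container logs are exposed as the pods/log subresource.
            {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
            # Deployments live in the apps API group.
            {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": read},
            # Resource metrics served by metrics-server.
            {"apiGroups": ["metrics.k8s.io"],
             "resources": ["pods", "nodes"],
             "verbs": ["get", "list"]},
        ],
    }


print(json.dumps(read_only_cluster_role(), indent=2))
```

The role grants no write verbs, which matches the read access the table describes. It would still need a ClusterRoleBinding to whatever identity the integration uses.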

Common Workflows

Cluster Optimization

# Step 1: Assess
@kai analyze cluster resource utilization

# Step 2: Identify waste
@kai identify pods with >50% overprovisioned resources

# Step 3: Plan
@kai #recommend right-sizing with zero-downtime approach

# Step 4: Monitor
@kai #dashboard track resource utilization after changes
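
Step 2 refers to pods with more than 50% overprovisioned resources. One plausible reading, assumed here, is that more than half of the requested amount goes unused: (request − usage) / request > 0.5. The sketch below applies that definition to made-up data; the function names and pod names are hypothetical, and Kai may define the metric differently.

```python
def overprovisioned_percent(request_millicores, usage_millicores):
    """Share of the request that is unused, as a whole percent (floored)."""
    unused = max(request_millicores - usage_millicores, 0)
    return unused * 100 // request_millicores


def overprovisioned_pods(pods, threshold_percent=50):
    """Return (name, percent) for pods above the overprovisioning threshold."""
    flagged = []
    for name, request, usage in pods:
        percent = overprovisioned_percent(request, usage)
        if percent > threshold_percent:
            flagged.append((name, percent))
    return flagged


# Made-up (name, CPU request, average CPU usage) tuples in millicores.
pods = [
    ("api", 1000, 300),     # 70% unused
    ("worker", 500, 400),   # 20% unused
    ("cache", 2000, 1000),  # exactly 50% unused, not above the threshold
]
print(overprovisioned_pods(pods))  # [('api', 70)]
```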

Incident Response

# Step 1: Identify
@kai identify unhealthy pods and failing deployments

# Step 2: Investigate
@kai analyze logs and events for root cause

# Step 3: Remediate
@kai #recommend immediate actions to restore service

# Step 4: Prevent
@kai #recommend changes to prevent recurrence

Capacity Planning

# Step 1: Baseline
@kai analyze current resource consumption patterns

# Step 2: Project
@kai forecast resource needs for 2x growth

# Step 3: Plan
@kai #recommend node pool configuration for projected growth

# Step 4: Automate
@kai #recommend autoscaling policies for demand variations
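
The projection in Step 2 can be approximated with simple arithmetic. The sketch below assumes that resource requests scale linearly with growth and that nodes should run at a chosen target utilization. Both are simplifying assumptions, and the function name, node size, and 75% target are illustrative, not values from Kai.

```python
import math


def nodes_needed(requested_millicores, growth_factor,
                 node_allocatable_millicores, target_utilization=0.75):
    """Estimate the node count for projected CPU requests at a target utilization."""
    projected = requested_millicores * growth_factor
    usable_per_node = node_allocatable_millicores * target_utilization
    return math.ceil(projected / usable_per_node)


# 24 CPU cores requested today, 2x growth, 4-core nodes filled to 75%.
print(nodes_needed(24000, 2, 4000))  # 16
```

The same calculation should be repeated for memory, with the larger of the two node counts used, since either resource can be the binding constraint.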

Related

Agents: Overview of all agents and collaboration patterns

CloudThinker Language: Complete syntax reference
