# Kai

> Kubernetes Engineer - Cluster management, container optimization, orchestration

Kai is CloudThinker's container orchestration expert, specializing in Kubernetes cluster management, workload optimization, autoscaling, and operational troubleshooting across EKS, GKE, AKS, and self-managed clusters.

***

## The Problem Kai Solves

Kubernetes is powerful but deeply complex. Most teams set resource requests and limits once (or copy them from a template) and never revisit them. Pods get OOMKilled because memory limits are too low, while nodes sit underutilized because requests are too high. The Cluster Autoscaler keeps adding nodes when the real fix is right-sizing workloads. RBAC configurations drift away from least privilege as service accounts accumulate permissions.

Operating Kubernetes well requires daily attention from someone with deep expertise:

* Monitoring resource utilization for hundreds of pods across multiple namespaces
* Diagnosing crash loops by reading logs and events and checking resource constraints
* Tuning HPA thresholds, VPA recommendations, and Cluster Autoscaler behavior
* Auditing RBAC configurations and network policies for security gaps

Most teams have one or two Kubernetes engineers — and they're already overloaded managing infrastructure changes. Proactive optimization rarely happens.

***

## How Existing Tools Compare

| Tool                               | What It Does                             | What's Missing                                                    |
| ---------------------------------- | ---------------------------------------- | ----------------------------------------------------------------- |
| **kubectl**                        | Direct cluster API access                | Raw tool, requires deep expertise, no analysis or recommendations |
| **Lens / k9s**                     | Kubernetes dashboards and CLI            | Visualization only, no AI analysis, no recommendations            |
| **Kubecost**                       | Kubernetes cost allocation and reporting | Cost visibility only, no troubleshooting or optimization guidance |
| **Datadog / Prometheus + Grafana** | Kubernetes metrics and alerting          | Monitoring only, still requires expert interpretation to act      |
| **KEDA / VPA**                     | Autoscaling automation                   | Single-purpose tools, no holistic cluster analysis                |

Kai brings together what normally requires kubectl expertise, monitoring dashboards, cost tools, and security scanners in a single conversational interface that explains issues and recommends specific fixes.

***

## How Kai Works

1. **Connects to the Kubernetes API** — reads pods, nodes, deployments, services, events, and RBAC configurations across all namespaces
2. **Pulls metrics** — correlates Kubernetes API state with metrics-server data (CPU/memory actual vs. requested)
3. **Identifies inefficiency patterns** — OOMKill history, pending pods, underutilized nodes, misconfigured autoscaling policies
4. **Generates specific recommendations** — exact resource request/limit values based on actual P95 utilization, HPA threshold adjustments, RBAC policy changes
5. **Troubleshoots with context** — when a pod fails, Kai reads logs, events, and resource state together to identify the root cause, so you don't have to correlate them manually
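
As a rough illustration of step 4, the Python sketch below turns a series of CPU usage samples into a request recommendation. It assumes the samples were already collected (for example by polling metrics-server over a week), uses the nearest-rank P95, and adds a hypothetical 20% headroom. The headroom value, function names, and sample data are assumptions for illustration, not Kai's documented formula.

```python
def p95(samples: list[int]) -> int:
    """Nearest-rank 95th percentile of a non-empty list of integer samples."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integers only
    return ordered[rank - 1]


def recommend_request(samples: list[int], headroom_pct: int = 20) -> int:
    """P95 usage plus headroom, rounded up to a whole unit.

    headroom_pct=20 is an illustrative assumption, not Kai's documented value.
    """
    return -(-p95(samples) * (100 + headroom_pct) // 100)  # integer ceiling division


# Hypothetical CPU usage samples for one container, in millicores.
cpu_millicores = [120, 135, 150, 110, 180, 160, 140, 130, 125, 400,
                  145, 155, 150, 138, 142, 149, 151, 147, 133, 128]
current_request = 1000  # hypothetical value from the current manifest

print(f"P95={p95(cpu_millicores)}m, suggested request="
      f"{recommend_request(cpu_millicores)}m (currently {current_request}m)")
```

With these samples the single 400m spike falls above the 95th percentile, so P95 is 180m and the sketch suggests 216m instead of the current 1000m. Integer arithmetic keeps the ceiling free of floating-point rounding surprises.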

***

## Capabilities

| Domain                    | Capabilities                                                      |
| ------------------------- | ----------------------------------------------------------------- |
| **Cluster Management**    | Health monitoring, node management, resource allocation, upgrades |
| **Workload Optimization** | Pod right-sizing, resource requests/limits, scheduling efficiency |
| **Autoscaling**           | HPA/VPA/Cluster Autoscaler optimization, scaling policies         |
| **Security**              | RBAC auditing, network policies, pod security, secrets management |
| **Troubleshooting**       | Crash loops, OOMKills, scheduling failures, networking issues     |

***

## Supported Platforms

| Platform         | Support Level                        |
| ---------------- | ------------------------------------ |
| **Amazon EKS**   | Full support with AWS integration    |
| **Google GKE**   | Full support with GCP integration    |
| **Azure AKS**    | Full support with Azure integration  |
| **Self-Managed** | Kubernetes 1.24+ with metrics-server |
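
To confirm that a self-managed cluster meets the 1.24+ requirement, you can read the `serverVersion` object printed by `kubectl version -o json`. The Python sketch below parses its `major` and `minor` fields. Some distributions (EKS, for example) append a `+` to `minor`, such as `27+`, so the sketch keeps only the leading digits. The sample JSON is trimmed and illustrative.

```python
import json
import re

MINIMUM = (1, 24)  # self-managed requirement from the table above


def server_version(kubectl_version_json: str) -> tuple[int, int]:
    """Return (major, minor) from the serverVersion in `kubectl version -o json` output."""
    server = json.loads(kubectl_version_json)["serverVersion"]
    # Keep only the leading digits, so values like "27+" parse as 27.
    major = int(re.match(r"\d+", server["major"]).group())
    minor = int(re.match(r"\d+", server["minor"]).group())
    return major, minor


def meets_minimum(kubectl_version_json: str) -> bool:
    # Tuples compare element by element, so (1, 27) >= (1, 24) is True.
    return server_version(kubectl_version_json) >= MINIMUM


# Trimmed, illustrative output: only the fields the sketch reads.
sample = '{"serverVersion": {"major": "1", "minor": "27+"}}'
print(server_version(sample), meets_minimum(sample))
```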

***

## Prompt Patterns

### Cluster Health

```bash theme={null}
# Health check
@kai check EKS cluster health and pod distribution

# Resource utilization
@kai analyze cluster resource utilization and identify bottlenecks

# Node analysis
@kai identify nodes with <30% CPU utilization for consolidation

# Multi-cluster view
@kai provide health summary across all Kubernetes clusters
```
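
The node-consolidation prompt above amounts to comparing each node's CPU usage with its allocatable CPU. The Python sketch below shows that comparison: it parses common Kubernetes CPU quantity formats (including the nanocore `n` suffix the metrics API reports) and flags nodes under a 30% threshold. Node names and numbers are made up, and a real consolidation decision would also weigh memory usage and whether each node's pods can be rescheduled.

```python
def parse_cpu(quantity: str) -> int:
    """Convert a CPU quantity ("250m", "2", "0.5", or "150000000n") to millicores."""
    if quantity.endswith("n"):  # nanocores, as reported by the metrics API
        return int(quantity[:-1]) // 1_000_000
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)  # whole or fractional cores


def underutilized_nodes(nodes: dict[str, tuple[str, str]],
                        threshold: float = 0.30) -> list[str]:
    """Return names of nodes whose CPU usage is below `threshold` of allocatable."""
    flagged = [
        name for name, (usage, allocatable) in nodes.items()
        if parse_cpu(usage) / parse_cpu(allocatable) < threshold
    ]
    return sorted(flagged)


# Hypothetical node data: name -> (current CPU usage, allocatable CPU).
nodes = {
    "node-a": ("3200m", "3920m"),
    "node-b": ("450m", "3920m"),
    "node-c": ("1", "1930m"),
    "node-d": ("0.2", "1930m"),
}
print(underutilized_nodes(nodes))
```

With this data the sketch flags `node-b` and `node-d`.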

### Workload Optimization

```bash theme={null}
# Pod right-sizing
@kai analyze pod resource requests/limits and recommend right-sizing

# Scheduling efficiency
@kai identify pods with resource requests far exceeding actual usage

# Cost optimization
@kai identify underutilized nodes and recommend consolidation strategy

# Namespace analysis
@kai analyze resource allocation across namespaces
```
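
Namespace-level analysis typically starts from the requests declared in pod specs. This Python sketch sums memory requests per namespace from pod objects shaped like the `items` returned by `kubectl get pods -A -o json`. It handles only binary suffixes (`Ki`, `Mi`, `Gi`, `Ti`) and plain byte counts; decimal suffixes such as `M` or `G` are left out to keep it short. The pods are hypothetical.

```python
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_memory(quantity: str) -> int:
    """Convert "512Mi", "1Gi", or a plain byte count to bytes (binary suffixes only)."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def memory_requests_by_namespace(pods: list[dict]) -> dict[str, int]:
    """Sum container memory requests (in bytes) per namespace."""
    totals: dict[str, int] = {}
    for pod in pods:
        namespace = pod["metadata"]["namespace"]
        totals.setdefault(namespace, 0)  # keep namespaces that request nothing
        for container in pod["spec"]["containers"]:
            request = container.get("resources", {}).get("requests", {}).get("memory")
            if request:
                totals[namespace] += parse_memory(request)
    return totals


# Hypothetical pods, trimmed to the fields the function reads.
pods = [
    {"metadata": {"namespace": "payment"},
     "spec": {"containers": [{"resources": {"requests": {"memory": "512Mi"}}},
                             {"resources": {"requests": {"memory": "256Mi"}}}]}},
    {"metadata": {"namespace": "payment"},
     "spec": {"containers": [{"resources": {"requests": {"memory": "1Gi"}}}]}},
    {"metadata": {"namespace": "web"},
     "spec": {"containers": [{"resources": {}}]}},  # no requests declared
]

for ns, total in memory_requests_by_namespace(pods).items():
    print(f"{ns}: {total / 1024**3:.2f}Gi requested")
```

A namespace that shows 0 is worth a second look: its pods are scheduled without reserving any memory.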

### Autoscaling

```bash theme={null}
# HPA review
@kai review Horizontal Pod Autoscaler policies and recommend improvements

# Scaling analysis
@kai analyze scaling patterns and recommend threshold adjustments

# VPA assessment
@kai evaluate whether Vertical Pod Autoscaler would benefit our workloads

# Cluster autoscaling
@kai review Cluster Autoscaler configuration for cost efficiency
```
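
When reviewing Kai's HPA threshold recommendations, it helps to remember how the HPA controller turns a target into a replica count. The Kubernetes documentation gives the core rule as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, with no change while the ratio stays within a tolerance (0.1 by default). The Python sketch below implements only that rule plus min/max clamping. The real controller also applies stabilization windows, scaling policies, and special handling for missing metrics and unready pods, none of which appear here, and the default `max_replicas=10` is an arbitrary example value.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Apply the core HPA scaling rule, then clamp to [min_replicas, max_replicas]."""
    ratio = current_metric / target_metric
    if abs(1.0 - ratio) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas averaging 90% CPU utilization against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))
# Same deployment at 63%: the ratio (1.05) is within tolerance, so no change.
print(desired_replicas(4, current_metric=63, target_metric=60))
```

The first call scales from 4 to 6 replicas; the second stays at 4. A target that sits too close to normal load therefore causes frequent scaling, while one set too high delays scale-out.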

### Troubleshooting

```bash theme={null}
# Crash investigation
@kai investigate pod crash loops in payment namespace

# OOM analysis
@kai identify pods experiencing OOMKilled events and recommend fixes

# Scheduling issues
@kai analyze pending pods and identify scheduling constraints

# Network problems
@kai investigate network connectivity issues between services
```
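
Behind the OOM analysis prompt, the raw signal lives in each pod's container statuses: a container killed for exceeding its memory limit reports `OOMKilled` as the termination reason in `state` or `lastState`. This Python sketch scans the JSON from `kubectl get pods -A -o json` for that reason. The pod and container names in the sample are hypothetical.

```python
import json


def oomkilled_containers(pod_list_json: str) -> list[tuple[str, str, str, int]]:
    """Return (namespace, pod, container, restartCount) for OOMKilled containers.

    Checks both the current state and lastState, since a restarted container
    only shows the OOMKill in lastState.
    """
    hits = []
    for pod in json.loads(pod_list_json)["items"]:
        meta = pod["metadata"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            states = (cs.get("state", {}), cs.get("lastState", {}))
            if any(s.get("terminated", {}).get("reason") == "OOMKilled" for s in states):
                hits.append((meta["namespace"], meta["name"], cs["name"],
                             cs.get("restartCount", 0)))
    return hits


# Hypothetical output, trimmed to the fields the function reads.
sample = json.dumps({"items": [
    {"metadata": {"namespace": "payment", "name": "checkout-7d9f"},
     "status": {"containerStatuses": [
         {"name": "api", "restartCount": 5,
          "state": {"waiting": {"reason": "CrashLoopBackOff"}},
          "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}]}},
    {"metadata": {"namespace": "payment", "name": "ledger-5c2a"},
     "status": {"containerStatuses": [
         {"name": "app", "restartCount": 0, "state": {"running": {}}, "lastState": {}}]}},
]})

print(oomkilled_containers(sample))
```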

### Security

```bash theme={null}
# RBAC audit
@kai audit RBAC configuration against least-privilege principles

# Network policies
@kai analyze network policies and recommend security improvements

# Pod security
@kai identify pods running with excessive privileges

# Secrets audit
@kai audit secrets management and recommend rotation strategy
```
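
Checks like the excessive-privileges prompt usually start with a handful of pod and container fields. The Python sketch below flags host namespace sharing, privileged containers, a few added capabilities, and containers that do not explicitly set `allowPrivilegeEscalation: false` (the Restricted Pod Security Standard requires it to be false). It inspects only `spec.containers`, so init and ephemeral containers are out of scope, and the capability list is an illustrative subset.

```python
RISKY_CAPABILITIES = ("SYS_ADMIN", "NET_ADMIN", "ALL")  # illustrative subset


def privilege_findings(pod: dict) -> list[str]:
    """List risky security settings in one pod object."""
    findings = []
    spec = pod["spec"]
    for field in ("hostNetwork", "hostPID", "hostIPC"):
        if spec.get(field):
            findings.append(f"pod uses {field}")
    for container in spec.get("containers", []):
        name = container["name"]
        ctx = container.get("securityContext", {})
        if ctx.get("privileged"):
            findings.append(f"{name}: privileged container")
        # The Restricted Pod Security Standard requires an explicit false here.
        if ctx.get("allowPrivilegeEscalation") is not False:
            findings.append(f"{name}: allowPrivilegeEscalation not set to false")
        added = ctx.get("capabilities", {}).get("add", [])
        for cap in RISKY_CAPABILITIES:
            if cap in added:
                findings.append(f"{name}: adds capability {cap}")
    return findings


# Hypothetical pod: a node agent with broad privileges next to a hardened app.
pod = {"spec": {"hostNetwork": True, "containers": [
    {"name": "agent",
     "securityContext": {"privileged": True, "capabilities": {"add": ["SYS_ADMIN"]}}},
    {"name": "app",
     "securityContext": {"allowPrivilegeEscalation": False, "runAsNonRoot": True}},
]}}

for finding in privilege_findings(pod):
    print(finding)
```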

***

## Tool Usage

| Tool         | Kai Use Case                                                   |
| ------------ | -------------------------------------------------------------- |
| `#dashboard` | Cluster health, node status, resource utilization, pod metrics |
| `#report`    | Optimization analysis, security audits, capacity planning      |
| `#recommend` | Right-sizing, scaling policies, consolidation actions          |
| `#alert`     | OOMKills, node pressure, pod failures, resource thresholds     |
| `#chart`     | Resource trends, scaling patterns, utilization over time       |

### Examples with Tools

```bash theme={null}
@kai #dashboard EKS cluster health with node and pod metrics
@kai #report cluster optimization opportunities with implementation plan
@kai #recommend HPA policies for variable workloads
@kai #alert on pod OOMKilled events or node pressure conditions
```
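
An alert on node pressure ultimately evaluates the node's `status.conditions`, where the kubelet reports `MemoryPressure`, `DiskPressure`, and `PIDPressure` with status `"True"` when the node is under pressure. This Python sketch shows the kind of check such an alert could run against a node object. It is not how Kai's `#alert` is implemented, and the node data is hypothetical.

```python
PRESSURE_CONDITIONS = {"MemoryPressure", "DiskPressure", "PIDPressure"}


def node_alerts(node: dict) -> list[str]:
    """Return alert messages for a node object based on its status.conditions."""
    name = node["metadata"]["name"]
    alerts = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype, status = cond["type"], cond["status"]  # "True", "False", or "Unknown"
        if ctype in PRESSURE_CONDITIONS and status == "True":
            alerts.append(f"{name}: {ctype}")
        elif ctype == "Ready" and status != "True":
            alerts.append(f"{name}: NotReady (status={status})")
    return alerts


# Hypothetical nodes, trimmed to the fields the function reads.
nodes = [
    {"metadata": {"name": "node-b"}, "status": {"conditions": [
        {"type": "MemoryPressure", "status": "True"},
        {"type": "DiskPressure", "status": "False"},
        {"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node-c"}, "status": {"conditions": [
        {"type": "Ready", "status": "Unknown"}]}},
]

for node in nodes:
    for alert in node_alerts(node):
        print(alert)
```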

***

## Effective Prompts

<CardGroup cols={2}>
  <Card title="Include Cluster Context" icon="server">
    ```bash theme={null}
    # Good
    @kai analyze production EKS cluster
    in us-west-2 for pod resource
    optimization

    # Avoid
    @kai check our containers
    ```
  </Card>

  <Card title="Define Success Metrics" icon="chart-line">
    ```bash theme={null}
    # Good
    @kai improve cluster utilization
    while maintaining <30s pod startup
    and 99.9% availability

    # Avoid
    @kai make cluster better
    ```
  </Card>
</CardGroup>
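
Putting a number on an availability target makes a success metric concrete. For example, 99.9% availability over a 30-day window leaves about 43.2 minutes of allowed downtime, as this small Python calculation shows. The 30-day window is an assumption; adjust `days` to match your SLO period.

```python
def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Minutes of allowed downtime for an availability target over a window of `days`."""
    window_minutes = days * 24 * 60
    return window_minutes * (1 - availability)


print(f"99.9% over 30 days: {downtime_budget_minutes(0.999):.1f} minutes")
```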

***

## Connection Requirements

Kai requires read access to the Kubernetes API, plus the cluster's resource metrics, events, and container logs:

| Component          | Required Access                                   |
| ------------------ | ------------------------------------------------- |
| **Kubernetes API** | Read access to pods, nodes, deployments, services |
| **Metrics Server** | Resource metrics for pods and nodes               |
| **Events**         | Cluster events for troubleshooting                |
| **Logs**           | Container logs for debugging                      |
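
One way to grant the access in this table is a read-only ClusterRole. The Python sketch below builds one as a JSON manifest (kubectl accepts JSON as well as YAML). The role name `kai-readonly` is hypothetical and the exact permissions Kai needs are defined in CloudThinker's Kubernetes connection guide, so treat this as an illustration. It also grants read access to the RBAC objects Kai audits, and you would still need a ClusterRoleBinding for whatever identity Kai connects with.

```python
import json

READ_VERBS = ["get", "list", "watch"]


def readonly_clusterrole(name: str = "kai-readonly") -> dict:
    """Build a read-only ClusterRole manifest (the name is a hypothetical example)."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [
            # Core API: workloads, nodes, services, events, and container logs (pods/log).
            {"apiGroups": [""],
             "resources": ["pods", "pods/log", "nodes", "services", "events", "namespaces"],
             "verbs": READ_VERBS},
            {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": READ_VERBS},
            # Pod and node resource metrics served by metrics-server.
            {"apiGroups": ["metrics.k8s.io"], "resources": ["pods", "nodes"],
             "verbs": READ_VERBS},
            # RBAC objects, so least-privilege audits can read role definitions.
            {"apiGroups": ["rbac.authorization.k8s.io"],
             "resources": ["roles", "rolebindings", "clusterroles", "clusterrolebindings"],
             "verbs": READ_VERBS},
        ],
    }


# Pipe the output to `kubectl apply -f -` to create the role.
print(json.dumps(readonly_clusterrole(), indent=2))
```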

***

## Common Workflows

### Cluster Optimization

```bash theme={null}
# Step 1: Assess
@kai analyze cluster resource utilization

# Step 2: Identify waste
@kai identify pods with >50% overprovisioned resources

# Step 3: Plan
@kai #recommend right-sizing with zero-downtime approach

# Step 4: Monitor
@kai #dashboard track resource utilization after changes
```
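
Step 2 depends on how "overprovisioned" is measured. One reasonable reading, assumed here, is that more than 50% of a container's CPU request goes unused at P95. The Python sketch below applies that rule to per-container data; the field names `request_m` and `p95_m` and the sample values are hypothetical.

```python
def overprovisioned(containers: list[dict],
                    threshold: float = 0.5) -> list[tuple[str, float]]:
    """Return (name, unused share) for containers whose unused CPU request exceeds threshold.

    Each container dict holds "request_m" (requested millicores) and
    "p95_m" (observed P95 usage in millicores).
    """
    flagged = []
    for c in containers:
        if c["request_m"] <= 0:
            continue  # no request set, so there is nothing to shrink
        unused = (c["request_m"] - c["p95_m"]) / c["request_m"]
        if unused > threshold:
            flagged.append((c["name"], round(unused, 2)))
    return flagged


containers = [
    {"name": "api", "request_m": 1000, "p95_m": 180},
    {"name": "worker", "request_m": 500, "p95_m": 400},
    {"name": "cron", "request_m": 250, "p95_m": 100},
]
print(overprovisioned(containers))
```

Here `api` (82% unused) and `cron` (60% unused) are flagged, while `worker` uses most of its request.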

### Incident Response

```bash theme={null}
# Step 1: Identify
@kai identify unhealthy pods and failing deployments

# Step 2: Investigate
@kai analyze logs and events for root cause

# Step 3: Remediate
@kai #recommend immediate actions to restore service

# Step 4: Prevent
@kai #recommend changes to prevent recurrence
```
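
The first incident-response step often comes down to two signals in pod status: containers stuck waiting with a failure reason such as `CrashLoopBackOff` or `ImagePullBackOff`, and pods still in the `Pending` phase. This Python sketch reports both. The reason list is a common subset rather than an exhaustive one, and the pods are hypothetical.

```python
FAILURE_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull",
                   "CreateContainerConfigError"}  # common subset, not exhaustive


def unhealthy_pods(pods: list[dict]) -> list[tuple[str, str]]:
    """Return (pod name, reason) for pods with failing containers or stuck in Pending."""
    results = []
    for pod in pods:
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        reasons = [
            cs.get("state", {}).get("waiting", {}).get("reason")
            for cs in status.get("containerStatuses", [])
        ]
        failing = [r for r in reasons if r in FAILURE_REASONS]
        if failing:
            results.extend((name, r) for r in failing)
        elif status.get("phase") == "Pending":
            results.append((name, "Pending"))  # e.g. unschedulable, no container status yet
    return results


pods = [
    {"metadata": {"name": "checkout-1"}, "status": {"phase": "Running",
        "containerStatuses": [{"state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
    {"metadata": {"name": "checkout-2"}, "status": {"phase": "Pending",
        "containerStatuses": [{"state": {"waiting": {"reason": "ImagePullBackOff"}}}]}},
    {"metadata": {"name": "web-1"}, "status": {"phase": "Pending"}},
    {"metadata": {"name": "web-2"}, "status": {"phase": "Running",
        "containerStatuses": [{"state": {"running": {}}}]}},
]
for name, reason in unhealthy_pods(pods):
    print(f"{name}: {reason}")
```

Container reasons are checked before the phase because a pod failing to pull its image is also `Pending`, and the specific reason is the more useful signal.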

### Capacity Planning

```bash theme={null}
# Step 1: Baseline
@kai analyze current resource consumption patterns

# Step 2: Project
@kai forecast resource needs for 2x growth

# Step 3: Plan
@kai #recommend node pool configuration for projected growth

# Step 4: Automate
@kai #recommend autoscaling policies for demand variations
```
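
The projection step is, at its core, arithmetic: scale today's total requests by the growth factor and divide by how much of each node you are willing to allocate. The Python sketch below does this for CPU with integer percentages. The 70% target allocation, the 8000m node size, and the request totals are hypothetical; in practice you would run the same calculation for memory and plan for the larger result.

```python
def nodes_needed(total_request_m: int, growth_pct: int, node_allocatable_m: int,
                 target_alloc_pct: int = 70) -> int:
    """Nodes needed so projected CPU requests stay within a target share of each node.

    growth_pct=200 means 2x today's requests; target_alloc_pct=70 leaves
    hypothetical headroom for spikes, rollouts, and node loss.
    """
    projected = total_request_m * growth_pct        # scaled by 100
    usable = node_allocatable_m * target_alloc_pct  # also scaled by 100
    return -(-projected // usable)                  # integer ceiling division


today = nodes_needed(45_000, 100, 8_000)   # current demand
at_2x = nodes_needed(45_000, 200, 8_000)   # projected 2x growth
print(f"today: {today} nodes, at 2x growth: {at_2x} nodes")
```

With 45 cores of requests and 8-core nodes, this gives 9 nodes today and 17 at 2x growth. The rounding means doubling demand does not always exactly double the node count.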

***

## What's Next

<CardGroup cols={2}>
  <Card title="Kubernetes Connection" icon="https://mintcdn.com/cloudthinker/aLd-ttc-SCW-aFky/images/icons/kubernetes.svg?fit=max&auto=format&n=aLd-ttc-SCW-aFky&q=85&s=7c03292954ff635a1994623a5c39971b" href="/guide/connections/kubernetes" width="24" height="24" data-path="images/icons/kubernetes.svg">
    Connect Kai to your EKS, GKE, AKS, or self-managed clusters
  </Card>

  <Card title="Topology" icon="diagram-project" href="/guide/infrastructure/topology">
    Visualize Kubernetes service dependencies for [RCA](/guide/incident/root-cause-analysis)
  </Card>

  <Card title="Deep Response Engine" icon="triangle-exclamation" href="/guide/incident/overview">
    How Kai investigates Kubernetes incidents automatically
  </Card>

  <Card title="Anna" icon="users" href="/guide/agents/anna">
    Coordinate Kai with [Alex](/guide/agents/alex) for combined cluster cost and performance optimization
  </Card>
</CardGroup>