
Karta

A standard way to describe the structure of any Kubernetes workload type.

Karta lets you define a portable, declarative blueprint for any Kubernetes workload — whether it's a simple Deployment, a distributed PyTorchJob, or a custom CRD. Controllers and platforms can then use that blueprint to inspect, modify, and manage workloads without hard-coding knowledge of each type.

The Problem

In Kubernetes, and especially in AI systems, a workload is not a standalone execution unit such as a single Pod. Instead, it is composed of multiple components organized in a complex hierarchy of resources, often exposed via custom resource definitions (CRDs) — for example: PyTorchJob, RayCluster, and MPIJob. Each of these CRDs structures the workload configuration differently, but they all share the same conceptual building blocks: pod specifications, scaling parameters, and status definitions.

If you're building a controller, scheduler, or platform that needs to work with multiple workload types, you end up writing bespoke logic for each one:

  • Where is the pod template?
  • How do I find the replica count?
  • Which status conditions mean "running" vs "failed"?
  • How do I modify the pod spec without breaking the workload?

This doesn't scale. Every new workload type means new integration code — schedulers, controllers, and platforms all end up maintaining per-CRD adapters that implement the same patterns over and over.

The Solution

Karta (a map to navigate resources) introduces a CRD that maps the structure of any workload type into a standard schema. Using JQ-based path expressions, a Karta declaratively defines how to locate pod specifications, scaling parameters, and status fields within any workload hierarchy. Define it once, and any controller can use it to:

  • Extract pod templates, replica counts, status, and metadata
  • Update pod specs, labels, and annotations across all instances
  • Understand workload hierarchy (e.g., a JobSet with master + worker groups)
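To make the path-expression idea concrete, here is a stdlib-only sketch of resolving a simplified dotted path against an unstructured object. This is illustrative only, not the Karta API: the `resolve` helper is hypothetical, and it omits array iteration (`[]`) and the rest of the JQ grammar that Karta's expressions support.

```go
package main

import (
	"fmt"
	"strings"
)

// resolve walks a nested map following a simplified dotted path,
// e.g. ".spec.replicas". It returns the value found and whether
// the full path existed.
func resolve(obj map[string]any, path string) (any, bool) {
	var cur any = obj
	for _, key := range strings.Split(strings.TrimPrefix(path, "."), ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		cur, ok = m[key]
		if !ok {
			return nil, false
		}
	}
	return cur, true
}

func main() {
	// Stand-in for an unstructured workload object.
	workload := map[string]any{
		"spec": map[string]any{
			"replicas": 3,
		},
	}
	v, _ := resolve(workload, ".spec.replicas")
	fmt.Println(v) // 3
}
```

The point is that once paths like these are declared in a Karta, a generic controller never needs type-specific structs for each workload kind.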

In addition to the CRD, Karta provides a Go package that performs the core processing logic: query evaluation to dynamically interpret custom resource schemas, resource extraction that traverses workload hierarchies to identify and group pods, and optimization instruction processors that apply strategies such as gang scheduling to ensure coordinated placement of pods for distributed workloads.

┌──────────────────────────────────────────────────┐
│                   Your Platform                  │
│  (scheduler, controller, dashboard, CLI, etc.)   │
├──────────────────────────────────────────────────┤
│              Karta Component API                 │
│    Extract pods · Update specs · Read status     │
├────────────┬────────────┬────────────┬───────────┤
│ Karta:     │ Karta:     │ Karta:     │ Karta:    │
│ JobSet     │ RayCluster │ PyTorchJob │ YourCRD   │
└────────────┴────────────┴────────────┴───────────┘

Quick Start

Install the CRD

kubectl apply -f https://raw.githubusercontent.com/run-ai/karta/main/charts/krt/crds/run.ai_kartas.yaml

Use the Go library

go get github.com/run-ai/karta@latest

Define a Karta

Here's a Karta for a JobSet — a distributed training workload with master and worker groups:

apiVersion: run.ai/v1alpha1
kind: Karta
spec:
  structureDefinition:
    rootComponent:
      name: jobset
      kind:
        group: jobset.x-k8s.io
        version: v1alpha2
        kind: JobSet
      statusDefinition:
        conditionsDefinition:
          path: .status.conditions
          typeFieldName: type
          statusFieldName: status
        statusMappings:
          running:
          - byConditions:
            - type: StartupPolicyCompleted
              status: "True"
          completed:
          - byConditions:
            - type: Completed
              status: "True"
          failed:
          - byConditions:
            - type: Failed
              status: "True"

    childComponents:
    - name: replicatedjob
      kind:
        group: batch
        version: v1
        kind: Job
      ownerRef: jobset
      specDefinition:
        podTemplateSpecPath: .spec.replicatedJobs[].template.spec.template
      scaleDefinition:
        replicasPath: .spec.replicatedJobs[].replicas
      instanceIdPath: .spec.replicatedJobs[].name  # Instances: "master", "worker"
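Conceptually, evaluating statusMappings like the ones above reduces to checking whether each mapping's required conditions appear in the workload's `.status.conditions`. The following is an illustrative sketch, simplified to a single `byConditions` group per status; the `Condition` type and `matchStatuses` helper are hypothetical, not the library's code.

```go
package main

import "fmt"

// Condition mirrors the type/status pair read from .status.conditions
// via the paths declared in conditionsDefinition.
type Condition struct {
	Type, Status string
}

// matchStatuses returns every mapped status name whose required
// conditions are all present on the workload. Illustrative only.
func matchStatuses(mappings map[string][]Condition, conds []Condition) []string {
	have := map[Condition]bool{}
	for _, c := range conds {
		have[c] = true
	}
	var matched []string
	for name, required := range mappings {
		ok := true
		for _, r := range required {
			if !have[r] {
				ok = false
				break
			}
		}
		if ok {
			matched = append(matched, name)
		}
	}
	return matched
}

func main() {
	mappings := map[string][]Condition{
		"running":   {{Type: "StartupPolicyCompleted", Status: "True"}},
		"completed": {{Type: "Completed", Status: "True"}},
		"failed":    {{Type: "Failed", Status: "True"}},
	}
	conds := []Condition{{Type: "StartupPolicyCompleted", Status: "True"}}
	fmt.Println(matchStatuses(mappings, conds)) // [running]
}
```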

Extract workload information

import (
	"context"
	"fmt"

	"github.com/run-ai/karta/pkg/resource"
)

ctx := context.Background()

// Create a factory from your Karta and workload object
factory := resource.NewComponentFactoryFromObject(karta, jobSetObject)

// Get the child component which has the per-instance data
component, _ := factory.GetComponent("replicatedjob")
summaries, _ := component.GetExtractedInstances(ctx)

// Access pod template specs, metadata, and scale info for each instance
for instanceID, summary := range summaries {
    // instanceID will be "master" or "worker"
    if summary.PodTemplateSpec != nil {
        fmt.Printf("Instance %s image: %s\n", instanceID, summary.PodTemplateSpec.Spec.Containers[0].Image)
    }
}

// Get status from the root component
rootComponent, _ := factory.GetRootComponent()
status, _ := rootComponent.GetStatus(ctx)
// status.MatchedStatuses: matched statuses based on conditions (e.g., ["running"])
// status.Phase: raw phase string from the workload
// status.Conditions: []Condition with Type, Status, Message fields

Update workload specs

The same paths defined in specDefinition are used for both extraction and updates:

// Prepare updates per instance
updates := map[string]resource.FragmentedPodSpec{
    "master": {
        SchedulerName: "my-custom-scheduler",
        Labels: map[string]string{"my-label": "true"},
    },
    "worker": {
        SchedulerName: "my-custom-scheduler",
    },
}

// Apply updates — modifies the underlying unstructured object
err := component.UpdateFragmentedPodSpec(ctx, updates)

// Get the updated object to apply back to the cluster
updatedObject, _ := factory.GetObject()

Pre-built Karta Definitions

Karta supports any workload type. The following are pre-built and tested Karta definitions that ship with the project:

Workload Type           Framework
JobSet                  Kubernetes
PyTorchJob              Kubeflow
RayCluster              Ray
RayJob                  Ray
InferenceService        KServe
Knative Service         Knative
MPIJob                  Kubeflow
NIM Service             NVIDIA
LeaderWorkerSet         Kubernetes
Milvus                  Milvus
DynamoGraphDeployment   NVIDIA Dynamo

See docs/examples/ for the full Karta definitions.

Complex example: NVIDIA Dynamo

The Dynamo Karta shows Karta handling a real-world multi-service inference graph: fragmented pod specs across services, autoscaling with min/max replicas, replica selectors for multi-node workers, gang scheduling, and six additional child resource types (DynamoComponentDeployment, LeaderWorkerSet, PodGang, PodClique, PodCliqueSet, PodCliqueScalingGroup). A single Karta definition replaces what would otherwise require hundreds of lines of per-type controller logic.

Who Uses Karta?

Karta was created at Run:ai (NVIDIA) to power workload management across diverse Kubernetes workload types. It is used internally by multiple services, including workload controllers, scheduler integrations, and platform components.

Documentation

Status

Karta is in active development (pre-1.0). The API may change between minor versions. We welcome feedback and contributions — please open an issue or start a discussion.

Third-Party Software

This project includes third-party software components. See the NOTICE file for attributions and the THIRD_PARTY_LICENSES file for detailed license information.

License

Apache License 2.0 — see LICENSE for the full text.

Copyright (c) 2026 NVIDIA Corporation.
