Kubeflow Training Operator

Kubeflow Training Operator V1 Documentation

WARNING

Old Version!

This page is about Kubeflow Training Operator V1. For the latest information, check the Kubeflow Trainer V2 documentation.

Follow this guide for migrating to Kubeflow Trainer V2.

Simple Example

---
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: kubeflow-user-example-com
  name: tfjob-mnist-with-summaries
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /mnist-with-summaries-logs/test
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "32"
        max: "64"
  trialTemplate:
    primaryContainerName: tensorflow
    # In this example we can collect metrics only from the Worker pods.
    primaryPodLabels:
      training.kubeflow.org/replica-type: worker
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: learning_rate
      - name: batchSize
        description: Batch Size
        reference: batch_size
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: tensorflow
                    image: ghcr.io/kubeflow/katib/tf-mnist-with-summaries:latest
                    command:
                      - "python"
                      - "/opt/tf-mnist-with-summaries/mnist.py"
                      - "--epochs=1"
                      - "--learning-rate=${trialParameters.learningRate}"
                      - "--batch-size=${trialParameters.batchSize}"
                      - "--log-path=/mnist-with-summaries-logs"
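
The `${trialParameters.*}` placeholders in the command above are filled in per Trial from values drawn out of the `parameters` search space. The sketch below illustrates that idea only; it is not Katib's implementation. The helper names `sample_assignment` and `render_command` are hypothetical, and plain uniform sampling is an assumption standing in for the `random` algorithm.

```python
import random
import re

# Search space copied from the Experiment's `parameters` section.
SEARCH_SPACE = {
    "learning_rate": {"type": "double", "min": "0.01", "max": "0.05"},
    "batch_size": {"type": "int", "min": "32", "max": "64"},
}

# `trialParameters`: placeholder name -> referenced search-space parameter.
TRIAL_PARAMETERS = {"learningRate": "learning_rate", "batchSize": "batch_size"}

# Container command copied from the trialSpec.
COMMAND_TEMPLATE = [
    "python",
    "/opt/tf-mnist-with-summaries/mnist.py",
    "--epochs=1",
    "--learning-rate=${trialParameters.learningRate}",
    "--batch-size=${trialParameters.batchSize}",
    "--log-path=/mnist-with-summaries-logs",
]

PLACEHOLDER = re.compile(r"\$\{trialParameters\.(\w+)\}")


def sample_assignment(space, rng):
    """Hypothetical helper: draw one value per parameter from its feasible space."""
    assignment = {}
    for name, spec in space.items():
        if spec["type"] == "int":
            value = rng.randint(int(spec["min"]), int(spec["max"]))
        else:
            value = rng.uniform(float(spec["min"]), float(spec["max"]))
        assignment[name] = str(value)
    return assignment


def render_command(template, trial_parameters, assignment):
    """Hypothetical helper: replace every ${trialParameters.<name>} placeholder."""
    def substitute(match):
        # Follow the `reference` from the placeholder name to the parameter name.
        return assignment[trial_parameters[match.group(1)]]

    return [PLACEHOLDER.sub(substitute, arg) for arg in template]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs print the same command
    assignment = sample_assignment(SEARCH_SPACE, rng)
    print(render_command(COMMAND_TEMPLATE, TRIAL_PARAMETERS, assignment))
```

Each Trial gets its own rendered command, so the two worker replicas of one TFJob share the same learning rate and batch size.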

After you apply the Experiment, tfjob-mnist-with-summaries-* pods are created and the Trials start running.

$ kubectl get pods -n kubeflow-user-example-com
NAME                                                 READY   STATUS    RESTARTS   AGE
ktfms-7786d5c68f-hh64k                               2/2     Running   0          2d22h
random-experiment-enas-6fc65474d4-d7f85              1/1     Running   0          2d22h
tfjob-mnist-with-summaries-7gm58mwn-worker-0         2/2     Running   0          11s
tfjob-mnist-with-summaries-7gm58mwn-worker-1         2/2     Running   0          11s
tfjob-mnist-with-summaries-dkr246f7-worker-0         2/2     Running   0          11s
tfjob-mnist-with-summaries-dkr246f7-worker-1         2/2     Running   0          10s
tfjob-mnist-with-summaries-dnh6tkpk-worker-0         2/2     Running   0          11s
tfjob-mnist-with-summaries-dnh6tkpk-worker-1         2/2     Running   0          11s
tfjob-mnist-with-summaries-random-776f9bb45d-rczlr   1/1     Running   0          4m
tutorial-jupyter-lab-01-0                            2/2     Running   0          2d23h
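
The worker pod names in the listing above can be read as `<trial name>-worker-<index>`. As a rough check, the sketch below groups them by Trial and compares the result with `parallelTrialCount: 3` and `replicas: 2` from the Experiment. The naming pattern is inferred from this output, and `group_workers_by_trial` is a hypothetical helper, not part of any Kubeflow tool.

```python
import re
from collections import defaultdict

# Pod names copied from the `kubectl get pods` output.
POD_NAMES = [
    "ktfms-7786d5c68f-hh64k",
    "random-experiment-enas-6fc65474d4-d7f85",
    "tfjob-mnist-with-summaries-7gm58mwn-worker-0",
    "tfjob-mnist-with-summaries-7gm58mwn-worker-1",
    "tfjob-mnist-with-summaries-dkr246f7-worker-0",
    "tfjob-mnist-with-summaries-dkr246f7-worker-1",
    "tfjob-mnist-with-summaries-dnh6tkpk-worker-0",
    "tfjob-mnist-with-summaries-dnh6tkpk-worker-1",
    "tfjob-mnist-with-summaries-random-776f9bb45d-rczlr",
    "tutorial-jupyter-lab-01-0",
]

# Assumed pattern: <experiment name>-<trial suffix>-worker-<replica index>.
WORKER_POD = re.compile(
    r"^(?P<trial>tfjob-mnist-with-summaries-[a-z0-9]+)-worker-(?P<index>\d+)$"
)


def group_workers_by_trial(pod_names):
    """Hypothetical helper: map each trial name to its sorted worker indices."""
    trials = defaultdict(list)
    for name in pod_names:
        match = WORKER_POD.match(name)
        if match:  # non-worker pods (notebook, suggestion, ...) are skipped
            trials[match.group("trial")].append(int(match.group("index")))
    return {trial: sorted(indices) for trial, indices in trials.items()}


if __name__ == "__main__":
    for trial, workers in group_workers_by_trial(POD_NAMES).items():
        print(trial, workers)
```

The `tfjob-mnist-with-summaries-random-*` pod does not match the worker pattern; its name suggests it is the Suggestion pod for the `random` algorithm rather than a Trial worker.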
