Job Summary


Default Job (non-parallel)

  1. Usually only one Pod is started, unless that Pod fails
  2. The Job is considered complete as soon as the Pod terminates successfully

Key fields (see the manifest sketch below):

  • spec.completions = 1 (defaults to 1)
  • spec.parallelism = 1 (defaults to 1)
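
A minimal sketch of such a non-parallel Job (the Job name, image and command are placeholders, not from the original notes); completions and parallelism are omitted, so both default to 1:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: demo-once                 # placeholder name
spec:
  # completions and parallelism are left unset, so both default to 1:
  # one Pod is started, and the Job completes when that Pod succeeds.
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36       # placeholder image
        command: ["sh", "-c", "echo done"]
```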

Parallel Jobs

  1. Parallel Jobs with a fixed completion count
    • spec.completions > 0; spec.parallelism may be set or left unset (defaults to 1)
    • The Job is complete when the number of successful Pods (exitCode = 0) reaches spec.completions
    • With spec.completionMode = Indexed, each Pod gets a completion index in the range 0 to spec.completions - 1
  2. Parallel Jobs with a work queue (see the sketch after this list)
    • Leave spec.completions unset (it defaults to spec.parallelism); spec.parallelism must be set
    • Pods must coordinate among themselves, or via an external service, to determine which item(s) each should work on
    • Each Pod can independently tell whether all its peers are done, and therefore whether the entire Job is done
    • When any Pod terminates successfully, no new Pods are created
    • Once at least one Pod has terminated successfully and all Pods have terminated, the Job completes successfully
    • Once any Pod has exited successfully, no other Pod should still be doing work for this task; they should all be in the process of exiting
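
A minimal sketch of a work-queue style Job (the name, image and queue endpoint are placeholders/assumptions; the queue-consuming logic lives inside the image): completions is left unset and each Pod pulls items from an external queue until it is empty:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: queue-consumer            # placeholder name
spec:
  parallelism: 3                  # completions is left unset: work-queue mode
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: consumer
        image: registry.example.com/queue-consumer:latest   # placeholder image with the queue-consuming logic
        env:
        - name: QUEUE_URL                                    # hypothetical queue endpoint
          value: "amqp://work-queue.default.svc.cluster.local"
```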

Controlling parallelism:

spec.parallelism >= 0; if it is 0, the Job is effectively paused as soon as it starts. The actual number of Pods running at any instant may be slightly higher or lower than spec.parallelism, for the following reasons:

  • For a fixed-completion-count Job, the number of Pods running in parallel does not exceed the number of remaining completions
  • For a work-queue Job, no new Pods are started after any Pod has succeeded; Pods that are already running are allowed to finish
  • If the JobController has not had time to react, or cannot create Pods for any reason (insufficient resources, lack of ResourceQuota, missing permissions), there may be fewer Pods than requested
  • The JobController may throttle creation of new Pods because of excessive previous Pod failures in the same Job
  • When a Pod is being gracefully terminated, it takes some time for it to actually stop

Job Completion Mode

spec.completions > 0 && spec.completionMode in (NonIndexed, Indexed)

  • NonIndexed (default)
  • Indexed: each Pod gets a completion index, exposed through four mechanisms (see the sketch after this list)
    1. The Pod annotation batch.kubernetes.io/job-completion-index
    2. The Pod label batch.kubernetes.io/job-completion-index (>= v1.28); requires the PodIndexLabel feature gate, which is enabled by default
    3. The Pod hostname, following the pattern $(job-name)-$(index). When an Indexed Job is used in combination with a Service, Pods within the Job can use these deterministic hostnames to address each other via DNS (see “Job with Pod-to-Pod Communication”)
    4. Inside the containerized task, the environment variable JOB_COMPLETION_INDEX
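
A sketch of an Indexed Job (name and image are placeholders) whose container reads the index from the JOB_COMPLETION_INDEX environment variable, which the control plane injects automatically for completionMode: Indexed:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-demo              # placeholder name
spec:
  completions: 5
  parallelism: 2
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36       # placeholder image
        # JOB_COMPLETION_INDEX (0..4 here) is set automatically for Indexed Jobs
        command: ["sh", "-c", "echo processing item $JOB_COMPLETION_INDEX"]
```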

Reference: http://kubernetes.io/docs/concepts/workloads/controllers/job/#completion-mode

Handling Pod and container failures

  • Either make your program handle being restarted in place (spec.template.spec.restartPolicy = OnFailure), or set spec.template.spec.restartPolicy = Never so that failed Pods are replaced instead
  • Your program must deal with temporary files, locks, incomplete output, etc. left behind by previous attempts
  • Each Pod failure is counted towards spec.backoffLimit
  • Alternatively, Pod failures can be counted per index via spec.backoffLimitPerIndex
  • Even setting spec.parallelism = 1, spec.completions = 1 and spec.template.spec.restartPolicy = Never does not guarantee that the program runs exactly once
  • With spec.parallelism > 1 and spec.completions > 1, multiple Pods may run at the same time, so your program must also handle concurrency
  • When the PodDisruptionConditions and JobPodFailurePolicy feature gates are enabled and spec.podFailurePolicy is set, the JobController does not treat a Pod that merely has metadata.deletionTimestamp set as failed until the Pod actually terminates (.status.phase is Failed or Succeeded). Once the Pod terminates, the JobController evaluates .backoffLimit and .podFailurePolicy for the relevant Job to decide whether the now-terminated Pod counts as a failure
  • If those conditions are not met, the JobController counts a terminating Pod as an immediate failure, even if that Pod later terminates with phase = Succeeded

Pod backoff failure policy

Set spec.backoffLimit = X; once the Job has retried X times, it is marked as failed. The default is spec.backoffLimit = 6, with an exponential back-off delay between retries (10s, 20s, 40s, ...) capped at six minutes. The retry count includes:

  • Pods with status.phase = Failed
  • When restartPolicy = OnFailure, retries of containers in Pods whose status.phase is Pending or Running

Official suggestion: set restartPolicy = "Never" and use a logging system, so that output from failed Jobs is not lost.
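
A sketch combining that suggestion with an explicit retry budget (the name, image and limit value are placeholder choices):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: backoff-demo              # placeholder name
spec:
  backoffLimit: 4                 # mark the Job as failed after 4 retries (default is 6)
  template:
    spec:
      restartPolicy: Never        # failed Pods are replaced by the Job controller, not restarted in place
      containers:
      - name: worker
        image: busybox:1.36       # placeholder image
        command: ["sh", "-c", "exit 1"]   # always fails, to illustrate the back-off behavior
```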

Backoff limit per index

Feature gate JobBackoffLimitPerIndex should be enabled.

  • Set spec.backoffLimitPerIndex to control retries per index for Pod failures (see the sketch after this list)
  • Failed indexes are recorded in status.failedIndexes; completed indexes are recorded in status.completedIndexes, regardless of whether backoffLimitPerIndex is set
  • A failing index does not interrupt execution of the other indexes; once all indexes have finished, if any index failed, the entire Indexed Job is marked as failed
  • By setting spec.maxFailedIndexes, the JobController fails the entire Job, terminating its running Pods, once the number of failed indexes exceeds that limit
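
A sketch of an Indexed Job using these per-index limits (names and values are placeholders; requires the JobBackoffLimitPerIndex feature gate):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: per-index-demo            # placeholder name
spec:
  completions: 10
  parallelism: 3
  completionMode: Indexed         # required when using backoffLimitPerIndex
  backoffLimitPerIndex: 1         # each index may fail at most once before it is marked failed
  maxFailedIndexes: 5             # if more than 5 indexes fail, the whole Job is terminated and marked failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36       # placeholder image
        command: ["sh", "-c", "echo index $JOB_COMPLETION_INDEX"]
```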

Pod failure policy

Requires the JobPodFailurePolicy feature gate; enabling PodDisruptionConditions is also recommended (supported in v1.29).

spec.podFailurePolicy lets the cluster handle Pod failures based on container exit codes and Pod conditions.

It gives finer control over Pod failures than the Pod backoff failure policy based on spec.backoffLimit:

  • Avoid unnecessary Pod restarts (for example, by failing the Job as soon as a Pod exits with a code that indicates a software bug)
  • Guarantee Job completion by ignoring Pod failures caused by disruptions (e.g. preemption, API-initiated eviction or taint-based eviction), so that they do not count towards spec.backoffLimit

Note: Because the Pod template specifies restartPolicy: Never, the kubelet does not restart the main container in that particular Pod.

The Ignore action for failed Pods that have the DisruptionTarget condition excludes such Pod disruptions from being counted towards spec.backoffLimit.

Note: If the Job fails, whether through the Pod failure policy or the Pod backoff failure policy, and the Job is running multiple Pods, Kubernetes terminates all Pods in that Job that are still Pending or Running.

API requirements and semantics (see the sketch after this list):

  • spec.template.spec.restartPolicy = Never must be set when spec.podFailurePolicy is used
  • spec.podFailurePolicy.rules are evaluated in order. Once a rule matches a Pod failure, the remaining rules are ignored.
  • spec.podFailurePolicy.rules[*].onExitCodes.containerName may refer to both containers and initContainers
  • spec.podFailurePolicy.rules[*].action
    • FailJob: mark the entire Job as failed
    • Ignore: the failure does not count towards spec.backoffLimit
    • Count: the failure counts towards spec.backoffLimit (the default handling)
    • FailIndex: fail only the matching index, used together with backoff limit per index
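
A sketch of a podFailurePolicy combining the actions above (the Job name, container name, image and exit code are placeholder values):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pod-failure-policy-demo   # placeholder name
spec:
  completions: 8
  parallelism: 2
  backoffLimit: 6
  template:
    spec:
      restartPolicy: Never        # required when podFailurePolicy is used
      containers:
      - name: main                # placeholder container name, referenced below
        image: busybox:1.36       # placeholder image
        command: ["sh", "-c", "echo working"]
  podFailurePolicy:
    rules:
    - action: FailJob             # treat this exit code as a software bug: fail the whole Job immediately
      onExitCodes:
        containerName: main
        operator: In
        values: [42]              # placeholder exit code
    - action: Ignore              # disruptions (preemption, eviction) do not count towards backoffLimit
      onPodConditions:
      - type: DisruptionTarget
```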

Reference: http://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy

Job termination and cleanup

  • When a Pod fails under restartPolicy = Never, or a container exits with an error under restartPolicy = OnFailure, the failure is counted; once spec.backoffLimit is reached, the entire Job is marked as failed and any running Pods are terminated
  • Once spec.activeDeadlineSeconds is reached, all of the Job's running Pods are terminated and the Job status becomes type: Failed with reason: DeadlineExceeded
  • spec.activeDeadlineSeconds takes precedence over spec.backoffLimit: once the Job reaches the time limit, no additional Pods are deployed, even if backoffLimit has not yet been reached (see the sketch below)
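
A sketch showing the two termination knobs together (the values are placeholders): the deadline wins over the retry budget:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: deadline-demo             # placeholder name
spec:
  backoffLimit: 5
  activeDeadlineSeconds: 100      # after 100s the Job fails with reason DeadlineExceeded,
                                  # even if fewer than 5 retries have happened
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36       # placeholder image
        command: ["sh", "-c", "sleep 300"]
```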

Clean up finished Jobs automatically (v1.23 stable)

  • TTL mechanism: spec.ttlSecondsAfterFinished cleans up finished Jobs (Complete or Failed), cascading to their dependent objects, e.g. Pods
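
A sketch of the TTL field (the value and names are placeholders): the finished Job and its Pods are deleted 100 seconds after it completes or fails:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ttl-demo                  # placeholder name
spec:
  ttlSecondsAfterFinished: 100    # delete this Job (and, via cascading deletion, its Pods) 100s after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36       # placeholder image
        command: ["sh", "-c", "echo done"]
```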

Note: If finished Jobs are not cleaned up, keeping them around can degrade cluster performance or, in the worst case, take the cluster offline because of that degradation. Using LimitRanges and ResourceQuotas is a good way to cap the resources Jobs can consume.

Reference: http://kubernetes.io/docs/concepts/workloads/controllers/job/#ttl-mechanism-for-finished-jobs

Some ways to set spec.ttlSecondsAfterFinished:

  • Specify this field in Job manifest
  • Manually set this field for existing, already finished Jobs
  • Use a mutating admission webhook to set this field dynamically at Job creation time (a cluster administrator use case)
  • Use a mutating admission webhook to set this field dynamically after the Job has finished; the webhook needs to detect the Job's status
  • Write your own controller to manage the cleanup TTL for Jobs.

Caveats:

  • Updating the TTL of finished Jobs: if the existing TTL has already expired, Kubernetes does not guarantee that the Job is retained, even if an update extending the TTL succeeds
  • Time skew: the TTL controller uses timestamps to decide when to clean up, and clocks are not always accurate, so Jobs may be cleaned up slightly earlier or later than expected

Job patterns

Use cases: emails to be sent, notifications to be pushed, frames to be rendered, files to be transcoded, ranges of keys in a NoSQL database to be scanned, and so on.

There are different patterns for parallel computation, each with strengths and weaknesses. The tradeoffs are:

  • One Job per work item vs. a single Job for all work items: a single Job for all items is better for large numbers of items
  • One Pod per work item vs. each Pod processing multiple work items: having each Pod process multiple items is better for large numbers of items
  • Several approaches use a work queue, which requires running a queue service and adapting the program or container to use it
  • When the Job is associated with a headless Service, Pods within the Job can communicate with each other to collaborate in a computation (see the sketch after this list)
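
A sketch of that headless-Service pattern (names, image and command are placeholders): Pods of an Indexed Job get deterministic hostnames and can reach each other via DNS through the Service's subdomain:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: workers-svc               # placeholder name
spec:
  clusterIP: None                 # headless Service
  selector:
    job-name: dns-demo            # matches the label the Job controller puts on its Pods
---
apiVersion: batch/v1
kind: Job
metadata:
  name: dns-demo                  # placeholder name
spec:
  completions: 3
  parallelism: 3
  completionMode: Indexed         # gives the Pods hostnames dns-demo-0, dns-demo-1, dns-demo-2
  template:
    spec:
      subdomain: workers-svc      # must match the headless Service name
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36       # placeholder image
        command: ["sh", "-c", "nslookup dns-demo-0.workers-svc || true"]
```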

Reference: http://kubernetes.io/docs/concepts/workloads/controllers/job/#job-patterns

Advanced Usage

  • Suspending a Job: spec.suspend = true (v1.24 stable; see the sketch after this list)
  • Mutable Scheduling Directives (v1.27 stable)
  • Specifying your own Pod selector: spec.selector
  • Job tracking with finalizers: batch.kubernetes.io/job-tracking (v1.26 stable)
  • Elastic Indexed Jobs (v1.27 beta)
    • spec.parallelism and spec.completions can be scaled up or down together, keeping spec.parallelism == spec.completions
    • When the ElasticIndexedJob feature gate is disabled, spec.completions is immutable
  • Delayed creation of replacement pods (v1.29 beta)
    • Feature gate JobPodReplacementPolicy enabled by default
    • Set spec.podReplacementPolicy = Failed to create replacement Pods only once the failed Pod is fully terminated (status.phase = Failed); see the sketch after this list
    • Without podFailurePolicy set, podReplacementPolicy defaults to TerminatingOrFailed: the control plane creates replacement Pods as soon as a Pod is marked for deletion (as soon as it sees that a Pod for this Job has deletionTimestamp set)
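
A sketch touching two of the fields above (names and values are placeholders): the Job is created in a suspended state, and replacement Pods are only created once a failed Pod is fully terminated:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: advanced-demo             # placeholder name
spec:
  suspend: true                   # no Pods are created until suspend is set back to false
  podReplacementPolicy: Failed    # recreate Pods only when they reach status.phase = Failed
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36       # placeholder image
        command: ["sh", "-c", "echo work"]
```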

Alternatives

  • Bare Pods
  • Replication Controller
  • Single Job starts controller Pod

Job usage conclusion (personal take)

To use Jobs safely, you still have to implement business logic such as locking, retries, markers and validation yourself. The JobController does not reduce that development effort, because Pods belonging to a Job can restart or fail for many reasons, such as Node eviction or a failing livenessProbe.

The JobController's advantage is that the number of parallel tasks can be scaled up and down in a controlled way.