Notes
Batch does not manage the cluster itself; it only manages nodes (automatic scale-out/in) and runs jobs. On EKS, Batch manages its own resources independently (it does not affect other pods, nodes, or ASGs). The best practice is to create a dedicated namespace.
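A minimal sketch of creating that dedicated namespace; the name my-aws-batch-namespace matches the one used later in these notes:

```shell
# Create a dedicated namespace so Batch-managed pods stay isolated
# from the rest of the cluster's workloads
kubectl create namespace my-aws-batch-namespace
```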
A compute environment is created in Batch and associated with the EKS cluster. The compute environment is decoupled from EKS: users effectively submit jobs to an abstract compute environment, and Batch takes care of translating jobs into pods.
A job is the smallest unit in Batch; a Batch job on EKS maps to a pod. When a job is submitted, the eksProperties in the job definition carry the parameters the job needs to run on EKS.
The podProperties of a running job have podName and nodeName parameters set for the current job attempt:
aws batch describe-jobs --jobs 2d044787-c663-4ce6-a6fe-f2baf7e51b04
When a job is submitted to EKS, Batch converts it into a pod spec, using labels and taints to ensure the job runs on Batch-managed nodes. The pod spec for a job on EKS has the following defaults:
hostNetwork = true
dnsPolicy = ClusterFirstWithHostNet
Use CloudWatch Logs to monitor Batch jobs running on EKS: https://docs.amazonaws.cn/en_us/batch/latest/userguide/batch-eks-cloudwatch-logs.html
Labels on the pod identify the Batch job ID and the compute environment UUID. Job information is made available to the job runtime by injecting environment variables into the pod:
kubectl describe pod aws-batch.14638eb9-d218-372d-ba5c-1c9ab9c7f2a1 -n my-aws-batch-namespace
Running GPU-based workloads on EKS: https://docs.amazonaws.cn/en_us/batch/latest/userguide/run-eks-gpu-workload.html
The memory and CPU reservation logic on EKS differs from GKE, especially for memory; Batch jobs may be affected by these reserved resources.
Set up the tools (awscli, kubectl), configure permissions (access to EKS), and create the cluster.
Note: Batch only supports EKS clusters with public endpoint access.
The resources created in EKS include:
a dedicated namespace
a ClusterRoleBinding, so Batch can monitor nodes and pods
a Role, created in the dedicated namespace and bound to the user aws-batch
an iamidentitymapping mapped to AWSServiceRoleForBatch (there is an unresolved bug here: the role's path must be removed, so the ARN cannot be copied as-is)
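A sketch of that identity mapping step, assuming eksctl is used and the cluster is named worklearn. The service-linked role's real ARN contains the path /aws-service-role/batch.amazonaws.com/, which the aws-auth ConfigMap does not accept, so the path segment is stripped:

```shell
# Real ARN: arn:aws-cn:iam::xxxxxx:role/aws-service-role/batch.amazonaws.com/AWSServiceRoleForBatch
# aws-auth does not support role paths, so map the path-free form instead:
eksctl create iamidentitymapping \
    --cluster worklearn \
    --arn "arn:aws-cn:iam::xxxxxx:role/AWSServiceRoleForBatch" \
    --username aws-batch
```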
Notes:
The BEST_FIT_PROGRESSIVE and SPOT_CAPACITY_OPTIMIZED allocation strategies are only supported from awscli 2.8.6 onward. Creating an EKS compute environment from the CLI produced the error below; a template generated with --generate-cli-skeleton indeed had no eksConfiguration field. Check the awscli version and upgrade to 2.8.6 or later.
Parameter validation failed:
Unknown parameter in input: "eksConfiguration", must be one of: computeEnvironmentName, type, state, unmanagedvCpus, computeResources, serviceRole, tags
Running the command again succeeds; it also creates the ASG and fills the network configuration into the corresponding launch template.
You can set ec2Configuration.imageType in --compute-resources to select GPU instance types:
The image type to match with the instance type to select an AMI. The supported values are different for ECS and EKS resources.
ECS: ECS_AL2, ECS_AL2_NVIDIA, ECS_AL1, ECS
EKS: EKS, EKS_AL2, EKS_AL2_NVIDIA (e.g. P4 and G4)
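A hedged sketch of what batch-eks-compute-environment.json might contain; the cluster ARN, subnet, security group, instance profile, and instance types below are placeholders, and EKS_AL2_NVIDIA is shown only to illustrate the GPU image type:

```shell
# Write an illustrative compute-environment input file
# (all resource identifiers are placeholders)
cat > batch-eks-compute-environment.json <<'EOF'
{
    "computeEnvironmentName": "My-eks-CE1",
    "type": "MANAGED",
    "state": "ENABLED",
    "eksConfiguration": {
        "eksClusterArn": "arn:aws-cn:eks:cn-north-1:xxxxxxxx:cluster/worklearn",
        "kubernetesNamespace": "my-aws-batch-namespace"
    },
    "computeResources": {
        "type": "EC2",
        "allocationStrategy": "BEST_FIT_PROGRESSIVE",
        "minvCpus": 0,
        "maxvCpus": 16,
        "instanceTypes": ["m5"],
        "ec2Configuration": [{"imageType": "EKS_AL2_NVIDIA"}],
        "subnets": ["subnet-xxxxxxxx"],
        "securityGroupIds": ["sg-xxxxxxxx"],
        "instanceRole": "arn:aws-cn:iam::xxxxxx:instance-profile/myEKSNodeRole"
    }
}
EOF
```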
aws batch create-compute-environment --cli-input-json file://./batch-eks-compute-environment.json
{
    "computeEnvironmentName": "My-eks-CE1",
    "computeEnvironmentArn": "arn:aws-cn:batch:cn-north-1:xxxxxxxx:compute-environment/My-eks-CE1"
}
aws batch describe-compute-environments
Create the job queue:
aws batch create-job-queue --cli-input-json file://./batch-eks-job-queue.json
{
    "jobQueueName": "My-eks-JQ1",
    "jobQueueArn": "arn:aws-cn:batch:cn-north-1:xxxxxx:job-queue/My-eks-JQ1"
}
Create the job definition, which is similar to an ECS job definition; its eksProperties configures pod parameters and can override the pod's command and other settings:
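A hedged sketch of what batch-eks-job-definition.json might look like for the sleep job used below; the image, command, and resource values mirror the pod dump at the end of these notes, but the exact file contents are an assumption:

```shell
# Write an illustrative EKS job-definition input file;
# eksProperties.podProperties describes the pod Batch will create
cat > batch-eks-job-definition.json <<'EOF'
{
    "jobDefinitionName": "MyJobOnEks_Sleep",
    "type": "container",
    "eksProperties": {
        "podProperties": {
            "hostNetwork": true,
            "containers": [
                {
                    "image": "public.ecr.aws/amazonlinux/amazonlinux:2",
                    "command": ["sleep", "60"],
                    "resources": {
                        "limits": {"cpu": "1", "memory": "1024Mi"}
                    }
                }
            ]
        }
    }
}
EOF
```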
aws batch register-job-definition --cli-input-json file://./batch-eks-job-definition.json
{
    "jobDefinitionName": "MyJobOnEks_Sleep",
    "jobDefinitionArn": "arn:aws-cn:batch:cn-north-1:xxxxxxx:job-definition/MyJobOnEks_Sleep:2",
    "revision": 2
}
Create a simple job and submit it to the job queue. Job scheduling can be controlled in the following ways:
Priority: set a priority on the job queue, and a scheduling priority on the job.
Scheduling policy: if no scheduling policy is specified when the job queue is created, the scheduler defaults to first-in, first-out (FIFO).
Fair-share scheduling: tag jobs with a share identifier; the scheduler picks jobs from the share with the lowest usage.
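As a sketch of the fair-share option (the policy name, share identifier, and weight here are hypothetical, and the shorthand syntax is an assumption), a scheduling policy can be created and then referenced when submitting jobs:

```shell
# Create a fair-share scheduling policy (attach it to the queue at creation time)
aws batch create-scheduling-policy \
    --name my-fairshare-policy \
    --fairshare-policy 'shareDecaySeconds=3600,shareDistribution=[{shareIdentifier=teamA,weightFactor=1}]'

# Submit a job into a share, optionally overriding its scheduling priority
aws batch submit-job \
    --job-queue My-eks-JQ1 \
    --job-definition MyJobOnEks_Sleep \
    --job-name My-eks-Job2 \
    --share-identifier teamA \
    --scheduling-priority-override 10
```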
aws batch submit-job --job-queue My-eks-JQ1 \
> --job-definition MyJobOnEks_Sleep --job-name My-eks-Job1
{
    "jobArn": "arn:aws-cn:batch:cn-north-1:xxxxxxxxxxxx:job/fe10768a-a3b5-4596-93f1-b48083332e73",
    "jobName": "My-eks-Job1",
    "jobId": "fe10768a-a3b5-4596-93f1-b48083332e73"
}
aws batch describe-jobs --jobs fe10768a-a3b5-4596-93f1-b48083332e73
The submitted job's JSON can be viewed in the console.

After this, a new m5.large instance launches from the EKS-optimized AMI, with the following user data added:
#!/bin/bash
set -ex
if [ -f /etc/aws-batch/batch.config ]; then
    while read line; do
        [ $(expr "$line" : "^[A-Za-z_][0-9A-Za-z_]*=.*") -gt 0 ] && eval export $line
    done < /etc/aws-batch/batch.config
fi
[ -z "$AWS_BATCH_KUBELET_EXTRA_ARGS" ] && AWS_BATCH_KUBELET_EXTRA_ARGS=""
/etc/eks/bootstrap.sh worklearn \
    --kubelet-extra-args ' '"$AWS_BATCH_KUBELET_EXTRA_ARGS"' ... '
The node failed to join the cluster, with the following errors:
kubelet_node_status.go:70] "Attempting to register node" node="ip-192-168-30-56.cn-north-1.compute.internal"
kubelet.go:2469] "Error getting node" err="node \"ip-192-168-30-56.cn-north-1.compute.internal\" not found"
kubelet_node_status.go:92] "Unable to register node with API server" err="Unauthorized" node="ip-192-168-30-56.cn-north-1.compute.internal"
kubelet.go:2469] "Error getting node" err="node \"ip-192-168-30-56.cn-north-1.compute.internal\" not found"
csi_plugin.go:1063] Failed to contact API server when waiting for CSINode publishing: Unauthorized
It turned out the node role had not been added to the cluster's aws-auth ConfigMap. Adding the following entry fixed it:
- groups:
    - system:bootstrappers
    - system:nodes
  rolearn: arn:aws-cn:iam::xxxxxx:role/myEKSNodeRole
To summarize the Batch node launch logic: Batch first confirms via a dry run that instances can launch in the configured subnets, then raises the desired count on the ASG to bring the nodes up. The following expected error shows up in CloudTrail:
An error occurred (InvalidParameter) when calling the RunInstances operation: Security group sg-0b1e6f21a1a04d078 and subnet subnet-027025e9d9760acdd belong to different networks.
After the fix, the compute environment came up and the job executed successfully.
The node configuration is shown below; the node repels other pods via taints:
apiVersion: v1
kind: Node
metadata:
  annotations:
    alpha.kubernetes.io/provided-node-ip: 192.168.15.116
    csi.volume.kubernetes.io/nodeid: '{"efs.csi.aws.com":"i-xxxxxxxx"}'
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  labels:
    batch.amazonaws.com/compute-environment-revision: "4"
    batch.amazonaws.com/compute-environment-uuid: 6c63cab8-8b00-3021-bb3d-fb990cef9c60
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: m5.xlarge
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: cn-north-1
    failure-domain.beta.kubernetes.io/zone: cn-north-1a
    k8s.io/cloud-provider-aws: f48c3b996b9bce33df562d04d847dfaf
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: ip-192-168-15-116.cn-north-1.compute.internal
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: m5.xlarge
    topology.kubernetes.io/region: cn-north-1
    topology.kubernetes.io/zone: cn-north-1a
  name: ip-192-168-15-116.cn-north-1.compute.internal
  resourceVersion: "34308242"
  uid: ffc8beb4-f326-4135-a471-e0b1d9511012
spec:
  providerID: aws:///cn-north-1a/i-xxxxxxx
  taints:
  - effect: NoSchedule
    key: batch.amazonaws.com/batch-node
  - effect: NoExecute
    key: batch.amazonaws.com/batch-node
The pod configuration is shown below. Labels identify the compute environment and job ID; the job pod carries taint tolerations and Batch environment variables. The default network mode is hostNetwork=true with dnsPolicy=ClusterFirstWithHostNet.
Once the job finishes, the pod is deleted immediately. You can configure the CloudWatch agent to collect logs (the Fluent Bit component needs the taint tolerations added).
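Since Batch nodes carry the batch-node taints shown in the node configuration, a Fluent Bit DaemonSet will not be scheduled onto them unless it tolerates those taints. A sketch of the tolerations to add to the DaemonSet's pod spec:

```yaml
# Tolerations matching the Batch node taints, so log-collector
# pods can land on Batch-managed nodes
tolerations:
- key: batch.amazonaws.com/batch-node
  effect: NoSchedule
  operator: Exists
- key: batch.amazonaws.com/batch-node
  effect: NoExecute
  operator: Exists
```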
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  labels:
    batch.amazonaws.com/compute-environment-uuid: 6c63cab8-8b00-3021-bb3d-fb990cef9c60
    batch.amazonaws.com/job-id: a5b695fc-8847-4a32-bfb5-99c6cf66c1df
    batch.amazonaws.com/node-uid: ffc8beb4-f326-4135-a471-e0b1d9511012
  name: aws-batch.b08aaab0-59e6-39b7-ada4-bbae690412b2
  namespace: my-aws-batch-namespace
spec:
  containers:
  - command:
    - sleep
    - "60"
    env:
    - name: AWS_BATCH_JOB_KUBERNETES_NODE_UID
      value: ffc8beb4-f326-4135-a471-e0b1d9511012
    - name: AWS_BATCH_JOB_ID
      value: a5b695fc-8847-4a32-bfb5-99c6cf66c1df
    - name: AWS_BATCH_JQ_NAME
      value: My-eks-JQ1
    - name: AWS_BATCH_JOB_ATTEMPT
      value: "1"
    - name: AWS_BATCH_CE_NAME
      value: My-eks-CE1
    image: public.ecr.aws/amazonlinux/amazonlinux:2
    imagePullPolicy: IfNotPresent
    name: default
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
      requests:
        cpu: "1"
        memory: 1Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-xddn2
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  nodeName: ip-192-168-15-116.cn-north-1.compute.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: batch.amazonaws.com/batch-node
    operator: Exists
  - effect: NoExecute
    key: batch.amazonaws.com/batch-node
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-xddn2
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace