Pod Security


Security Standards

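  1. Privileged: an unrestricted policy. Such Pods hold broad permissions and are typically system-level or infrastructure-level workloads.

  2. Baseline: a minimally restrictive policy that prevents known privilege escalations.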

    1. HostProcess: Windows only (stable since v1.26). Privileged access to the Windows host must be disallowed.

      • Restricted Fields
        1. spec.securityContext.windowsOptions.hostProcess
        2. spec.containers[*].securityContext.windowsOptions.hostProcess
        3. spec.initContainers[*].securityContext.windowsOptions.hostProcess
        4. spec.ephemeralContainers[*].securityContext.windowsOptions.hostProcess
      • Allowed Values
        1. undefined/nil
        2. false
    2. Host Namespaces: Sharing the host namespaces must be disallowed.

      • Restricted Fields
        1. spec.hostNetwork
        2. spec.hostPID
        3. spec.hostIPC
      • Allowed Values
        1. undefined/nil
        2. false
    3. Privileged Containers: Privileged Pods disable most security mechanisms and must be disallowed.

      • Restricted Fields
        1. spec.containers[*].securityContext.privileged
        2. spec.initContainers[*].securityContext.privileged
        3. spec.ephemeralContainers[*].securityContext.privileged
      • Allowed Values
        1. undefined/nil
        2. false
    4. Capabilities: Adding additional capabilities beyond those listed below must be disallowed.

      • Restricted Fields
        1. spec.containers[*].securityContext.capabilities.add
        2. spec.initContainers[*].securityContext.capabilities.add
        3. spec.ephemeralContainers[*].securityContext.capabilities.add
      • Allowed Values
        • undefined/nil
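        • AUDIT_WRITE: write records to the kernel audit log
        • CHOWN: make arbitrary changes to file UIDs and GIDs
        • DAC_OVERRIDE: bypass discretionary access control (DAC) checks for file read, write, and execute permissions
        • FOWNER: bypass permission checks on operations that normally require the process's filesystem UID to match the file owner (for example chmod)
        • FSETID: keep the set-user-ID and set-group-ID bits when a file is modified
        • KILL: send signals to other processes, bypassing the usual permission checks
        • MKNOD: create special files (device nodes) with mknod
        • NET_BIND_SERVICE: bind a socket to a port below 1024
        • SETFCAP: set capabilities on files, so that a binary can perform specific privileged operations without running as root
        • SETGID: make arbitrary changes to process GIDs and the supplementary group list
        • SETPCAP: modify the process's own capability sets (for example, drop capabilities from the bounding set); used by tools such as Docker and sandboxes
        • SETUID: make arbitrary changes to process UIDs
        • SYS_CHROOT: use the chroot system call, which confines a process to a given directory tree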
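
To make the capabilities rule concrete, here is a minimal sketch of a Pod that should pass the Baseline check because it only adds capabilities from the list above. The Pod name, container name, and image are hypothetical placeholders, not taken from these notes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: baseline-capabilities-demo        # hypothetical name
spec:
  containers:
  - name: web                             # hypothetical container name
    image: registry.example.com/web:1.0   # placeholder image
    securityContext:
      capabilities:
        # Both entries are on the Baseline allow-list.
        # Adding e.g. SYS_ADMIN or NET_ADMIN here would violate Baseline.
        add: ["NET_BIND_SERVICE", "CHOWN"]
```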
    5. HostPath Volumes: HostPath volumes must be forbidden.

      • Restricted Fields
        1. spec.volumes[*].hostPath
      • Allowed Values
        1. undefined/nil
    6. Host Ports: HostPorts should be disallowed entirely (recommended) or restricted to a known list.

      • Restricted Fields
        1. spec.containers[*].ports[*].hostPort
        2. spec.initContainers[*].ports[*].hostPort
        3. spec.ephemeralContainers[*].ports[*].hostPort
      • Allowed Values
        1. undefined/nil
        2. a known list (not supported by the built-in Pod Security Admission controller)
        3. 0
    7. AppArmor: On supported hosts, the runtime/default AppArmor profile is applied by default. The baseline policy should prevent overriding or disabling the default AppArmor profile, or restrict overrides to an allowed set of profiles.

      • Restricted Fields
        1. metadata.annotations["container.apparmor.security.beta.kubernetes.io/*"]
      • Allowed Values
        1. undefined/nil
        2. runtime/default
        3. localhost/*
    8. SELinux: Setting the SELinux type is restricted, and setting a custom SELinux user or role option is forbidden.

      • Restricted Fields
        1. spec.securityContext.seLinuxOptions.type
        2. spec.containers[*].securityContext.seLinuxOptions.type
        3. spec.initContainers[*].securityContext.seLinuxOptions.type
        4. spec.ephemeralContainers[*].securityContext.seLinuxOptions.type
      • Allowed Values
        1. undefined/""
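        2. container_t: the default SELinux type for containers; processes in the container may access the container's own resources and remain confined by SELinux policy
        3. container_init_t: the type for containers whose main process is an init system (such as systemd); still confined by SELinux policy
        4. container_kvm_t: the type for containers that run inside KVM-based virtual machines; still confined by SELinux policy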
      • Restricted Fields
        1. spec.securityContext.seLinuxOptions.[user/role]
        2. spec.containers[*].securityContext.seLinuxOptions.[user/role]
        3. spec.initContainers[*].securityContext.seLinuxOptions.[user/role]
        4. spec.ephemeralContainers[*].securityContext.seLinuxOptions.[user/role]
      • Allowed Values
        1. undefined/""
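
As an illustration of the SELinux rule, the following sketch sets only the type, using one of the allowed values, and leaves user and role unset. All names and the image are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: baseline-selinux-demo             # hypothetical name
spec:
  securityContext:
    seLinuxOptions:
      type: container_t                   # one of the allowed types
      # user and role are intentionally left unset;
      # setting either of them would violate Baseline.
  containers:
  - name: app                             # hypothetical container name
    image: registry.example.com/app:1.0   # placeholder image
```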
    9. /proc Mount Type: The default /proc masks are set up to reduce attack surface, and should be required.

      • Restricted Fields
        1. spec.containers[*].securityContext.procMount
        2. spec.initContainers[*].securityContext.procMount
        3. spec.ephemeralContainers[*].securityContext.procMount
      • Allowed Values
        1. undefined/nil
        2. Default
    10. Seccomp: Seccomp profile must not be explicitly set to Unconfined

      • Restricted Fields
        1. spec.securityContext.seccompProfile.type
        2. spec.containers[*].securityContext.seccompProfile.type
        3. spec.initContainers[*].securityContext.seccompProfile.type
        4. spec.ephemeralContainers[*].securityContext.seccompProfile.type
      • Allowed Values
        1. undefined/nil
        2. RuntimeDefault
        3. Localhost
    11. Sysctls: Sysctls can disable security mechanisms or affect all containers on a host, and should be disallowed except for an allowed “safe” subset. A sysctl is considered safe if it is namespaced in the container or the Pod, and it is isolated from other Pods or processes on the same Node.

      • Restricted Fields
        1. spec.securityContext.sysctls[*].name
      • Allowed Values
        1. undefined/nil
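        2. kernel.shm_rmid_forced: whether shared memory segments are forcibly destroyed once no process is attached to them
        3. net.ipv4.ip_local_port_range: the local (ephemeral) port range
        4. net.ipv4.ip_unprivileged_port_start: the first port that unprivileged users may bind to
        5. net.ipv4.tcp_syncookies: whether SYN cookies are enabled to mitigate SYN flood attacks
        6. net.ipv4.ping_group_range: the range of group IDs allowed to create ICMP Echo (ping) sockets
        7. net.ipv4.ip_local_reserved_ports: local ports reserved from automatic assignment
        8. net.ipv4.tcp_keepalive_time: how long a TCP connection may stay idle before keepalive probes start, in seconds
        9. net.ipv4.tcp_fin_timeout: how long an orphaned TCP connection stays in the FIN-WAIT-2 state, in seconds
        10. net.ipv4.tcp_keepalive_intvl: the interval between TCP keepalive probes, in seconds
        11. net.ipv4.tcp_keepalive_probes: the number of unanswered keepalive probes sent before the connection is dropped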
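
The sysctl rule can be sketched as follows: a Pod that sets two sysctls from the safe subset at the Pod level. The names, image, and the specific values are illustrative assumptions; sysctl values are always written as strings.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: baseline-sysctl-demo              # hypothetical name
spec:
  securityContext:
    sysctls:
    # Both names are in the safe subset listed above.
    - name: net.ipv4.ip_local_port_range
      value: "32768 60999"                # illustrative value
    - name: net.ipv4.tcp_keepalive_time
      value: "600"                        # illustrative value, in seconds
  containers:
  - name: app                             # hypothetical container name
    image: registry.example.com/app:1.0   # placeholder image
```

A sysctl outside the list, such as kernel.msgmax, would be rejected in a namespace that enforces Baseline.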
  3. Restricted: the most restrictive policy. It includes every Baseline control in addition to the ones below.

    1. Volume Types: The restricted policy only permits the following volume types.
      • Restricted Fields
        1. spec.volumes[*]
      • Allowed Values: Non-Null value
        1. spec.volumes[*].configMap
        2. spec.volumes[*].csi
        3. spec.volumes[*].downwardAPI
        4. spec.volumes[*].emptyDir
        5. spec.volumes[*].ephemeral
        6. spec.volumes[*].persistentVolumeClaim
        7. spec.volumes[*].projected
        8. spec.volumes[*].secret
    2. Privilege Escalation: Privilege escalation (such as via set-user-ID or set-group-ID file mode) should not be allowed. This is a Linux-only policy (spec.os.name != windows).
      • Restricted Fields
        1. spec.containers[*].securityContext.allowPrivilegeEscalation
        2. spec.initContainers[*].securityContext.allowPrivilegeEscalation
        3. spec.ephemeralContainers[*].securityContext.allowPrivilegeEscalation
      • Allowed Values
        1. false
    3. Running as Non-root
      • Restricted Fields: Containers must be required to run as non-root users.
        1. spec.securityContext.runAsNonRoot
        2. spec.containers[*].securityContext.runAsNonRoot
        3. spec.initContainers[*].securityContext.runAsNonRoot
        4. spec.ephemeralContainers[*].securityContext.runAsNonRoot
      • Allowed Values
        1. true
      • Restricted Fields: Containers must not set runAsUser to 0
        1. spec.securityContext.runAsUser
        2. spec.containers[*].securityContext.runAsUser
        3. spec.initContainers[*].securityContext.runAsUser
        4. spec.ephemeralContainers[*].securityContext.runAsUser
      • Allowed Values
        1. any non-zero value
        2. undefined/null
    4. Seccomp: Seccomp profile must be explicitly set to one of the allowed values. Both the Unconfined profile and the absence of a profile are prohibited. This is a Linux-only policy (spec.os.name != windows).
      • Restricted Fields
        1. spec.securityContext.seccompProfile.type
        2. spec.containers[*].securityContext.seccompProfile.type
        3. spec.initContainers[*].securityContext.seccompProfile.type
        4. spec.ephemeralContainers[*].securityContext.seccompProfile.type
      • Allowed Values
        1. RuntimeDefault
        2. Localhost
    5. Capabilities: Containers must drop ALL capabilities, and are only permitted to add back the NET_BIND_SERVICE capability. This is a Linux-only policy (spec.os.name != windows).
      • Restricted Fields
        1. spec.containers[*].securityContext.capabilities.drop
        2. spec.initContainers[*].securityContext.capabilities.drop
        3. spec.ephemeralContainers[*].securityContext.capabilities.drop
      • Allowed Values
        1. Any list of capabilities that includes ALL
      • Restricted Fields
        1. spec.containers[*].securityContext.capabilities.add
        2. spec.initContainers[*].securityContext.capabilities.add
        3. spec.ephemeralContainers[*].securityContext.capabilities.add
      • Allowed Values
        1. undefined/nil
        2. NET_BIND_SERVICE
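
Putting the Restricted controls together, the sketch below shows a Pod that should satisfy each of them: an allowed volume type, no privilege escalation, a non-root user, an explicit seccomp profile, and all capabilities dropped. The names, image, UID, and mount path are hypothetical choices for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: restricted-demo                   # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true                    # Running as Non-root
    runAsUser: 10001                      # any non-zero UID (illustrative)
    seccompProfile:
      type: RuntimeDefault                # Seccomp must be set explicitly
  containers:
  - name: app                             # hypothetical container name
    image: registry.example.com/app:1.0   # placeholder image
    securityContext:
      allowPrivilegeEscalation: false     # Privilege Escalation
      capabilities:
        drop: ["ALL"]                     # must include ALL
        add: ["NET_BIND_SERVICE"]         # the only capability that may be added back
    volumeMounts:
    - name: scratch
      mountPath: /tmp
  volumes:
  - name: scratch
    emptyDir: {}                          # one of the permitted volume types
```

The Pod-level securityContext applies to every container, so the per-container block only needs the fields that have no Pod-level equivalent.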

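Setting Pod Security Admission labels on namespaces

  1. enforce: policy violations cause the Pod to be rejected. Applied to Pod objects only.
  2. audit: policy violations add an audit annotation to the event recorded in the audit log, but the Pod is still allowed. Also applied to workload resources such as Deployment and ReplicaSet.
  3. warn: policy violations trigger a user-facing warning, but the Pod is still allowed. Also applied to workload resources such as Deployment and ReplicaSet.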

Corresponding labels

  1. pod-security.kubernetes.io/<MODE>: <LEVEL>
     MODE: enforce, audit, or warn. LEVEL: privileged, baseline, or restricted.
  2. pod-security.kubernetes.io/<MODE>-version: <VERSION>
     MODE: enforce, audit, or warn. VERSION: a valid Kubernetes minor version, or latest.

apiVersion: v1
kind: Namespace
metadata:
  name: my-baseline-namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: v1.29

    # We set these labels to the `enforce` level we _want_ to reach eventually
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: v1.29
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: v1.29

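Exemptions

  1. Username: requests from users with an exempt authenticated (or impersonated) username are ignored
  2. RuntimeClassName: Pods and workload resources (Deployment, ReplicaSet, and so on) that specify an exempt runtime class name are ignored
  3. Namespace: Pods and workload resources in an exempt namespace are ignored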

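NOTE: Exempting a user only exempts them from enforcement when they create Pods directly. Workload resources (controllers) created by that user are not exempt. Controller service accounts (such as system:serviceaccount:kube-system:replicaset-controller) should generally not be exempted, because doing so implicitly exempts every user who can create the corresponding workload resource.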
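
Exemptions are configured statically on the API server rather than through namespace labels. The sketch below assumes the built-in PodSecurity admission plugin and a configuration file passed to kube-apiserver with --admission-control-config-file; the default levels and the exempted namespace are illustrative choices, not recommendations from these notes.

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
  configuration:
    apiVersion: pod-security.admission.config.k8s.io/v1
    kind: PodSecurityConfiguration
    # Defaults applied to namespaces that carry no pod-security labels.
    defaults:
      enforce: "baseline"
      enforce-version: "latest"
      audit: "restricted"
      audit-version: "latest"
      warn: "restricted"
      warn-version: "latest"
    exemptions:
      usernames: []                 # exempt authenticated usernames
      runtimeClasses: []            # exempt runtime class names
      namespaces: ["kube-system"]   # exempt namespaces (illustrative)
```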

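Updates to the following Pod fields are exempt from policy checks. If a Pod update request changes only these fields, it is not rejected even if the Pod violates the current policy level.

  • Any metadata update except changes to the seccomp or AppArmor annotations:
    • container.apparmor.security.beta.kubernetes.io/*
  • Valid updates to .spec.activeDeadlineSeconds
  • Valid updates to .spec.tolerations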

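Metrics for Pod Security levels

  • pod_security_evaluations_total: the number of policy evaluations that occurred, not counting ignored or exempt requests during exporting
  • pod_security_exemptions_total: the number of exempt requests, not counting ignored or out-of-scope requests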