Parameter Server Architecture

Experiments

Goals

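Determine whether communication in the PS architecture always goes over the QPI bus, that is, whether it always crosses NUMA nodes.
Which CPUs a PS runs on also affects communication efficiency, so try to find out which CPUs each PS runs on.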
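
One way to observe where a PS process actually runs is to sample field 39 (`processor`, the CPU the task last executed on) of `/proc/<pid>/stat`. The following is a minimal sketch under that assumption; the helper name `lastCPU` is hypothetical, and the value is only a snapshot, so it would need to be sampled repeatedly to see the full set of CPUs in use.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// lastCPU extracts field 39 ("processor", the CPU the task last ran on)
// from the contents of /proc/<pid>/stat.
func lastCPU(stat string) (int, error) {
	// Field 2 (comm) is wrapped in parentheses and may itself contain
	// spaces or parentheses, so everything is split after the last ')'.
	end := strings.LastIndex(stat, ")")
	if end < 0 {
		return 0, fmt.Errorf("malformed stat line")
	}
	fields := strings.Fields(stat[end+1:])
	// fields[0] is field 3 (state), so field 39 is fields[36].
	const idx = 39 - 3
	if len(fields) <= idx {
		return 0, fmt.Errorf("stat line has too few fields")
	}
	return strconv.Atoi(fields[idx])
}

func main() {
	// "self" is a placeholder; substitute the PID of the PS process.
	data, err := os.ReadFile("/proc/self/stat")
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	cpu, err := lastCPU(string(data))
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Printf("last ran on CPU %d\n", cpu)
}
```

Splitting after the last `)` keeps the field index stable even when the process name contains spaces.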

Procedure

Exploring cpuset in k8s

To find out exactly which CPUs the PS uses, we look at how k8s uses cpuset.

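In k8s, the number of CPUs in a container's cpuset equals the integer CPU limit given in the pod spec. As long as the pod is in the Guaranteed QoS class (limits equal requests) and the CPU value is an integer greater than or equal to 1, the cpu manager (beta and enabled by default since version 1.10) allocates that many exclusive CPUs to it.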

The relevant source code:

// pkg/kubelet/cm/cpumanager/policy_static.go
// Checks whether a pod is Guaranteed QoS and whether its CPU request is an integer
func guaranteedCPUs(pod *v1.Pod, container *v1.Container) int {
    if v1qos.GetPodQOS(pod) != v1.PodQOSGuaranteed {
        return 0
    }
    cpuQuantity := container.Resources.Requests[v1.ResourceCPU]
    if cpuQuantity.Value()*1000 != cpuQuantity.MilliValue() {
        return 0
    }
    // Safe downcast to do for all systems with < 2.1 billion CPUs.
    // Per the language spec, `int` is guaranteed to be at least 32 bits wide.
    // https://golang.org/ref/spec#Numeric_types
    return int(cpuQuantity.Value())
}

// pkg/kubelet/cm/cpumanager/cpu_manager.go
// Reconciles the cpuset state
func (m *manager) reconcileState() (success []reconciledContainer, failure []reconciledContainer) {
    // ...
    klog.V(4).Infof("[cpumanager] reconcileState: updating container (pod: %s, container: %s, container id: %s, cpuset: \"%v\")", pod.Name, container.Name, containerID, cset)
    err = m.updateContainerCPUSet(containerID, cset)
    if err != nil {
        klog.Errorf("[cpumanager] reconcileState: failed to update container (pod: %s, container: %s, container id: %s, cpuset: \"%v\", error: %v)", pod.Name, container.Name, containerID, cset, err)
        failure = append(failure, reconciledContainer{pod.Name, container.Name, containerID})
        continue
    }
    success = append(success, reconciledContainer{pod.Name, container.Name, containerID})
    // ... 
}

func (m *manager) updateContainerCPUSet(containerID string, cpus cpuset.CPUSet) error {
    // TODO: Consider adding a `ResourceConfigForContainer` helper in
    // helpers_linux.go similar to what exists for pods.
    // It would be better to pass the full container resources here instead of
    // this patch-like partial resources.
    return m.containerRuntime.UpdateContainerResources(
        containerID,
        &runtimeapi.LinuxContainerResources{
            CpusetCpus: cpus.String(),
        })
}
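
The integer check in guaranteedCPUs relies on `Value()` rounding up to whole CPUs, so `Value()*1000` matches `MilliValue()` only for whole-CPU requests. The sketch below reproduces just that arithmetic on plain millicore values; `exclusiveCPUs` is a hypothetical name, the QoS check is omitted, and non-negative input is assumed.

```go
package main

import "fmt"

// exclusiveCPUs mirrors the integer check in guaranteedCPUs using plain
// millicore values instead of resource.Quantity (a simplification).
// It returns the number of whole CPUs, or 0 for a fractional request.
func exclusiveCPUs(milli int64) int {
	// Round up to whole CPUs, as Quantity.Value() does.
	whole := (milli + 999) / 1000
	if whole*1000 != milli {
		return 0 // fractional request: not eligible for exclusive CPUs
	}
	return int(whole)
}

func main() {
	for _, m := range []int64{500, 1000, 1500, 2000} {
		fmt.Printf("%dm -> %d exclusive CPUs\n", m, exclusiveCPUs(m))
	}
}
```

By this check, a request of cpu=2 (2000m) qualifies for 2 CPUs, while 1500m yields 0.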

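The experiment environment is cluster 167, which has 20 logical CPUs spread across two NUMA nodes. The PS pods are in the Guaranteed QoS class with cpu=2, yet docker inspect shows no CpusetCpus value set. Checking cgroups directly, cpuset.cpus for the PS pod is 0-19, meaning the container may run on any CPU, which is unexpected.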

Binding a cpuset directly through docker works as expected: after docker run --rm --cpuset-cpus 0-3 nginx:alpine, docker inspect [container-id] shows the CpusetCpus field set, and cpuset.cpus for that container in cgroups is also 0-3.
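
To relate a cpuset.cpus value such as 0-19 or 0-3 to the NUMA question, the list can be expanded and compared against each node's CPU list. The sketch below does this; the node layout in `main` (0-9 on node 0, 10-19 on node 1) is purely an assumption for illustration, since the real layout has to be read from `/sys/devices/system/node/node*/cpulist` and is often interleaved. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList expands a kernel CPU list such as "0-3,8" into CPU IDs.
func parseCPUList(s string) ([]int, error) {
	var cpus []int
	s = strings.TrimSpace(s)
	if s == "" {
		return cpus, nil
	}
	for _, part := range strings.Split(s, ",") {
		bounds := strings.SplitN(part, "-", 2)
		first, err := strconv.Atoi(bounds[0])
		if err != nil {
			return nil, err
		}
		last := first
		if len(bounds) == 2 {
			last, err = strconv.Atoi(bounds[1])
			if err != nil {
				return nil, err
			}
		}
		for c := first; c <= last; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

// nodesSpanned counts how many NUMA nodes contain at least one CPU
// from the given cpuset. nodes maps node ID to that node's CPU list.
func nodesSpanned(cpuset string, nodes map[int]string) (int, error) {
	allowed, err := parseCPUList(cpuset)
	if err != nil {
		return 0, err
	}
	inSet := make(map[int]bool, len(allowed))
	for _, c := range allowed {
		inSet[c] = true
	}
	count := 0
	for _, list := range nodes {
		cpus, err := parseCPUList(list)
		if err != nil {
			return 0, err
		}
		for _, c := range cpus {
			if inSet[c] {
				count++
				break
			}
		}
	}
	return count, nil
}

func main() {
	// Assumed layout, not measured on the experiment machine.
	nodes := map[int]string{0: "0-9", 1: "10-19"}
	for _, set := range []string{"0-19", "0-3"} {
		n, err := nodesSpanned(set, nodes)
		if err != nil {
			fmt.Println("parse failed:", err)
			return
		}
		fmt.Printf("cpuset %s spans %d NUMA node(s)\n", set, n)
	}
}
```

Under the assumed layout, 0-19 spans both nodes while 0-3 stays on one, so only a restricted cpuset could keep a PS from being scheduled across the QPI link.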

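The kubelet configuration in /var/lib/kubelet/config.yaml sets cpuManagerPolicy=none. Under this policy, CPU limits for Guaranteed pods are enforced only through CFS quota in the OS scheduler. The exclusive-CPU allocation for qualifying pods described earlier is provided by the cpu manager's other policy, static, whereas kubelet defaults to none. This explains why no cpuset was applied to the PS pods.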

Single machine n33, ResNet50:

Single job, 1 ps 1 worker: 113 images/sec (gpu 7)

# gpu rxpci txpci
# Idx  MB/s  MB/s
    7  4863   414
    7  6454   827
    7  5580  2530

Single job, 3 ps 1 worker: 113 images/sec (gpu 7)
Single job, 3 ps 2 worker: 103 images/sec (gpu 0 4)

Single job, 3 ps 3 worker:

3 * 93 images/sec (gpu 2 14 15) (started below 80 and stabilized after 2000 batches)

# gpu rxpci txpci
# Idx  MB/s  MB/s
   15  5172  2415
    2  6131   978
   14  4241   603
   15  4032   141
    2  2033   135
   14  2284   584

3 * 96 images/sec (gpu 1 11 13)

# gpu rxpci txpci
# Idx  MB/s  MB/s
    1  5426  1395
   11  5598   933
   13  5745   596
    1  2231   256
   11  5812   784
   13  6150  2869

3 * 84 images/sec (gpu 4 7 10) (speed dropped from 93 to 83)

# gpu rxpci txpci
# Idx  MB/s  MB/s
    4    63   229
    7   450     0
   10   394   169
    4  5789   669
    7  1357   151
   10  1593     0
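
To compare the three 3-worker runs, per-worker throughput can be expressed relative to the 113 images/sec single-worker baseline. The sketch below does that arithmetic with the numbers reported above; `scalingEfficiency` is a hypothetical helper, and treating the 1 ps 1 worker run as the baseline is an assumption.

```go
package main

import "fmt"

// scalingEfficiency returns per-worker throughput relative to the
// single-worker baseline (1.0 would mean perfect linear scaling).
func scalingEfficiency(perWorker, baseline float64) float64 {
	return perWorker / baseline
}

func main() {
	const baseline = 113.0 // images/sec, single job with 1 worker
	runs := []struct {
		gpus      string
		perWorker float64 // images/sec per worker, from the notes
	}{
		{"2 14 15", 93},
		{"1 11 13", 96},
		{"4 7 10", 84},
	}
	for _, r := range runs {
		fmt.Printf("gpu %-8s total %.0f images/sec, efficiency %.1f%%\n",
			r.gpus, 3*r.perWorker, 100*scalingEfficiency(r.perWorker, baseline))
	}
}
```

The gap between the best and worst GPU placement is roughly ten percentage points of efficiency.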

Two jobs, 3 ps 2 worker: 96 images/sec (gpu 0 4 vs 10 14)

# gpu rxpci txpci
# Idx  MB/s  MB/s
    0  6591  2638
    4  5704   916
   10  8900   725
   14  5124   322