
TensorFlow

1. Model

1.1. Graph and Session

In distributed training with a parameter-server architecture, tf.train.replica_device_setter(ps_tasks=3) can be used to control where tf.Variable objects are placed, for example:

import tensorflow as tf  # TensorFlow 1.x API

with tf.device(tf.train.replica_device_setter(ps_tasks=3)):
    # tf.Variable objects are, by default, placed on tasks in "/job:ps" in a
    # round-robin fashion, and only Variable ops are placed on ps tasks.
    w_0 = tf.Variable(...)  # placed on "/job:ps/task:0"
    b_0 = tf.Variable(...)  # placed on "/job:ps/task:1"
    w_1 = tf.Variable(...)  # placed on "/job:ps/task:2"
    b_1 = tf.Variable(...)  # placed on "/job:ps/task:0"

    input_data = tf.placeholder(tf.float32)     # placed on "/job:worker"
    layer_0 = tf.matmul(input_data, w_0) + b_0  # placed on "/job:worker"
    layer_1 = tf.matmul(layer_0, w_1) + b_1     # placed on "/job:worker"

# Alternatively, pin the non-variable ops to one specific worker task.
# task_idx is assumed to be this worker's integer task index, and
# cluster a tf.train.ClusterSpec describing the whole cluster.
with tf.device(tf.train.replica_device_setter(
        worker_device='/job:worker/task:%d' % task_idx,
        cluster=cluster)):
    # Build your model here as if you were using only a single machine.
    pass

# server is assumed to be the tf.train.Server started for this task.
with tf.Session(server.target):
    # Train your model here.
    pass

By default, tf.Variable objects are placed on the ps tasks, and Variable ops are the only ops placed there.

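The ps tasks only store the model parameters. When there are several ps tasks, the parameters are split up and each ps stores its own part. After the workers compute parameter updates, they send each update to the ps that holds the corresponding parameter.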
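The storage and update flow described above can be sketched without TensorFlow. What follows is a toy, single-process simulation under stated assumptions: the ParameterServer class, the ps_for helper, the round-robin sharding by list position, and the plain SGD update are all made up for illustration and are not TensorFlow APIs.

```python
# Toy simulation of sharded parameter storage (not TensorFlow code).
# ParameterServer and ps_for are hypothetical names used for illustration.

class ParameterServer:
    """Holds one shard of the model parameters."""

    def __init__(self):
        self.params = {}

    def apply_update(self, name, grad, lr):
        # Plain SGD step; a real setup would delegate this to an optimizer.
        self.params[name] -= lr * grad


NUM_PS = 3
servers = [ParameterServer() for _ in range(NUM_PS)]
names = ["w_0", "b_0", "w_1", "b_1"]


def ps_for(name):
    """Round-robin mapping from a parameter name to the ps that owns it."""
    return servers[names.index(name) % NUM_PS]


# Shard the parameters: each one lives on exactly one ps.
for name in names:
    ps_for(name).params[name] = 1.0

# A worker computes gradients locally, then sends each gradient to the
# ps that owns the corresponding parameter.
worker_grads = {"w_0": 0.5, "b_0": -0.5, "w_1": 2.0, "b_1": 0.0}
for name, grad in worker_grads.items():
    ps_for(name).apply_update(name, grad, lr=0.5)
```

Note that no ps ever sees the full model here: servers[0] holds only w_0 and b_1, servers[1] only b_0, and servers[2] only w_1.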

tf.train.replica_device_setter returns a device function. Passed to with tf.device(device_function):, it automatically assigns a device to each Operation object as the Operation is constructed.
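To make the idea of a device function concrete, here is a simplified, TensorFlow-free sketch of a round-robin chooser. It is not TensorFlow's implementation: make_device_fn and FakeOp are hypothetical names, and the real setter recognizes more op types than the single "Variable" type assumed here.

```python
# Simplified sketch of a round-robin device function (not TensorFlow's code).
# make_device_fn and FakeOp are hypothetical names used for illustration.
from collections import namedtuple

FakeOp = namedtuple("FakeOp", ["name", "type"])  # stand-in for tf.Operation


def make_device_fn(ps_tasks, worker_device="/job:worker"):
    next_task = [0]  # mutable counter captured by the closure

    def device_fn(op):
        # Only variable ops go to ps tasks; everything else stays on the worker.
        if op.type == "Variable":
            device = "/job:ps/task:%d" % next_task[0]
            next_task[0] = (next_task[0] + 1) % ps_tasks
            return device
        return worker_device

    return device_fn


device_fn = make_device_fn(ps_tasks=3)
ops = [FakeOp("w_0", "Variable"), FakeOp("b_0", "Variable"),
       FakeOp("w_1", "Variable"), FakeOp("b_1", "Variable"),
       FakeOp("layer_0", "MatMul")]

# Call the device function once per op, in construction order.
placements = {op.name: device_fn(op) for op in ops}
```

With ps_tasks=3, the fourth variable wraps around to task 0, which matches the placement comments in the example at the top of this section.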

1.1.1. Concepts

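  • client: the program the user writes. The client builds the tf.Graph and creates a tf.Session to start training.
  • tf.Session: TensorFlow uses the tf.Session class to represent the connection between the client program (typically Python) and the C++ runtime.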
    # Create a default in-process session.
    with tf.Session() as sess:
        # ...

    # Create a remote session.
    with tf.Session("grpc://example.org:2222"):
        # ...
  • cluster: a cluster is described by a tf.train.ClusterSpec. It consists of a set of servers grouped into jobs, typically two kinds (ps and worker), and each job can contain multiple tasks.
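As a rough illustration of the cluster concept, the layout can be written as a plain dict that maps each job name to the addresses of its tasks. The host names, ports, and the task_devices helper below are made-up assumptions; only the TensorFlow 1.x calls mentioned in the trailing comment are real APIs.

```python
# Cluster layout as a plain dict; host names and ports are made-up examples.
cluster_def = {
    "ps": ["ps0.example.org:2222", "ps1.example.org:2222"],
    "worker": ["worker0.example.org:2222", "worker1.example.org:2222",
               "worker2.example.org:2222"],
}


def task_devices(cluster_def):
    """List the device name of every task, e.g. "/job:ps/task:0"."""
    return ["/job:%s/task:%d" % (job, idx)
            for job in sorted(cluster_def)
            for idx in range(len(cluster_def[job]))]


devices = task_devices(cluster_def)

# With TensorFlow 1.x installed, the same dict can be handed to the runtime:
#   cluster = tf.train.ClusterSpec(cluster_def)
#   server = tf.train.Server(cluster, job_name="worker", task_index=0)
```

Each process in the cluster would start one server with its own job_name and task_index, so the five addresses above correspond to five processes.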

1.2. Data parallelism

Data parallelism comes in two forms: in-graph replication and between-graph replication.

1.2.1. in-graph replication

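With in-graph replication, all Operations live in a single tf.Graph: one client builds the graph and assigns every tf.Operation to one of the ps or worker tasks. This is typically used for single-machine, multi-GPU training.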

1.2.2. between-graph replication

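Each /job:worker task has its own client. Every client builds a similar graph in which tf.train.replica_device_setter maps the parameters onto the same /job:ps tasks, while each worker holds its own copy of the compute part of the model.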
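The effect of replicating the compute part while sharing the parameters can be sketched in plain Python. This is a toy under stated assumptions: the model y = w * x, the squared-error loss, the learning rate, and the synchronous averaging step are chosen for illustration only (asynchronous updates are also common with parameter servers), and none of the names are TensorFlow APIs.

```python
# Toy data-parallel step for the model y = w * x with squared-error loss.
# Pure Python; the function and variable names are hypothetical.

def grad_w(w, batch):
    """Mean gradient of (w*x - y)^2 with respect to w over a batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.0  # the shared parameter, which would live on a ps

# Each "worker" runs the same computation on its own shard of the data.
# The shards are equally sized, so averaging their gradients is exact.
shards = [data[0:2], data[2:4]]
worker_grads = [grad_w(w, shard) for shard in shards]

# Synchronous update: average the workers' gradients, then update the
# shared parameter once.
avg_grad = sum(worker_grads) / len(worker_grads)
full_grad = grad_w(w, data)  # single-machine reference value
w_new = w - 0.01 * avg_grad
```

With equally sized shards, the averaged gradient equals the gradient a single machine would compute over the whole batch, which is why the replicas can share one set of parameters.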

Framework and code flow
Communication during parameter updates