Setting up the base environment
A four-node Alibaba Cloud cluster:
172.0.0.1 master1
172.0.0.2 agent1
172.0.0.3 agent2
172.0.0.4 agent3
Of these, the 172.0.0.1 machine can reach the public internet (public IP 47.0.0.1).
Installing Docker
Install it on every machine.
- Install the required packages:
sudo yum install -y yum-utils \
device-mapper-persistent-data \
lvm2
- Configure the Aliyun mirror repo:
sudo yum-config-manager \
--add-repo \
http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
- Install Docker:
yum install docker-ce
- Start the Docker service:
systemctl start docker
Problems encountered:
The Docker daemon would not start. Running systemctl start docker reported:
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
Running systemctl status docker.service showed:
Active: failed (Result: start-limit) Process: 7880 ExecStart=/usr/bin/dockerd (code=exited, status=1/FAILURE) Main PID: 7880 (code=exited, status=1/FAILURE)
Solution:
rm -rf /var/lib/docker/
# Add the following content
# vim /etc/docker/daemon.json
{
"graph": "/mnt/docker-data",
"storage-driver": "overlay"
}
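After writing daemon.json, restarting Docker and checking the storage driver confirms the change took effect. A minimal sketch, assuming Docker is managed by systemd:
systemctl daemon-reload
systemctl restart docker
# Should now report the overlay storage driver and the /mnt/docker-data root
docker info | grep -iE 'storage driver|root dir'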
Installing GlusterFS
- Install on every machine:
yum install glusterfs
yum install glusterfs-cli
yum install glusterfs-libs
yum install glusterfs-server
yum install glusterfs-fuse
yum install glusterfs-geo-replication
- Add host entries to the /etc/hosts file on every machine:
172.0.0.1 master1
172.0.0.2 agent1
172.0.0.3 agent2
172.0.0.4 agent3
- Join the peers:
# On master1:
gluster peer probe agent1
gluster peer probe agent2
gluster peer probe agent3
# Check the peer status on each machine
gluster peer status
- Create two volumes: one replicated volume (replica=2) and one distributed (hash) volume.
Creating the replicated volume:
- Create the /data/repli directory on master1 and agent1.
- Run gluster volume create replica_volume replica 2 transport tcp master1:/data/repli agent1:/data/repli force to create a replicated volume named replica_volume.
- Run gluster volume start replica_volume to start the replicated volume.
- Create the /data/mnt directory on master1 for mounting the replicated volume.
- Run mount -t glusterfs master1:replica_volume /data/mnt to mount the replicated volume.
Creating the hash volume:
- Create the /data/hash directory on each of the four machines.
- Run gluster volume create hash_volume master1:/data/hash agent1:/data/hash agent2:/data/hash agent3:/data/hash force to create a distributed volume named hash_volume.
- Run gluster volume start hash_volume to start it.
- Create the /data/mnt_hash directory on master1 for mounting the hash volume.
- Run mount -t glusterfs master1:hash_volume /data/mnt_hash to mount the hash volume.
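Before moving on, it is worth confirming that both volumes are healthy. A minimal sketch run on master1, assuming the glusterd service is running on all peers:
# Both volumes should show Status: Started and list their bricks
gluster volume info replica_volume
gluster volume info hash_volume
# Write a file through the replicated mount; it should appear on the brick on both master1 and agent1
echo test > /data/mnt/hello.txt
ls /data/repli/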
Setting up the Mesos + ZooKeeper + Marathon cluster
Mesos master node
The master1 node serves as the mesos-master; configure it as follows:
Disable SELinux:
sed -i '/SELINUX/s/enforcing/disabled/' /etc/selinux/config
setenforce 0
Disable the firewall:
systemctl stop firewalld.service
systemctl disable firewalld.service
Add the Mesosphere yum repo:
rpm -Uvh http://repos.mesosphere.io/el/7/noarch/RPMS/mesosphere-el-repo-7-1.noarch.rpm
Install the JDK dependency, Mesos 1.5.0, Marathon, and ZooKeeper:
# Java 8 is required
yum install -y java-1.8.0-openjdk-devel java-1.8.0-openjdk
yum -y install mesos marathon mesosphere-zookeeper
Configure ZooKeeper
master1, agent1, and agent2 serve as the three ZooKeeper nodes, numbered 1, 2, and 3; the ZooKeeper configuration must be completed on all three. Write each node's ID into /var/lib/zookeeper/myid:
master1: echo 1 > /var/lib/zookeeper/myid
agent1: echo 2 > /var/lib/zookeeper/myid
agent2: echo 3 > /var/lib/zookeeper/myid
Edit the ZooKeeper config file:
# Back up the zoo.cfg file
cp /etc/zookeeper/conf/zoo.cfg /etc/zookeeper/conf/zoo.cfg.bak
# Then append the following to zoo.cfg:
server.1=master1:2888:3888
server.2=agent1:2888:3888
server.3=agent2:2888:3888
Configure the Mesos master
- Add zk://172.0.0.1:2181,172.0.0.2:2181,172.0.0.3:2181/mesos to /etc/mesos/zk.
- Set /etc/mesos-master/quorum to an integer greater than half the number of master nodes (for a three-master cluster that would be 2). With only one master here, quorum is set to 1: echo 1 > /etc/mesos-master/quorum.
- Create the Marathon config directory:
mkdir -p /etc/marathon/conf
- Set the mesos-master hostname: echo 172.0.0.1 > /etc/mesos-master/hostname
- Set the Marathon hostname: echo 172.0.0.1 > /etc/marathon/conf/hostname
- Add zk://172.0.0.1:2181,172.0.0.2:2181,172.0.0.3:2181/mesos to /etc/marathon/conf/master.
- Add zk://172.0.0.1:2181,172.0.0.2:2181,172.0.0.3:2181/marathon to /etc/marathon/conf/zk.
Start Mesos, Marathon, and ZooKeeper:
systemctl enable zookeeper && systemctl enable mesos-master && systemctl enable marathon
systemctl start zookeeper && systemctl start mesos-master && systemctl start marathon
Starting Marathon reported an error: no master specified.
Fix: pass the master and zk options when starting Marathon, i.e. edit /usr/lib/systemd/system/marathon.service and change ExecStart to:
ExecStart=/usr/share/marathon/bin/marathon \
--master zk://172.0.0.1:2181,172.0.0.2:2181,172.0.0.3:2181/mesos \
--zk zk://172.0.0.1:2181,172.0.0.2:2181,172.0.0.3:2181/marathon
Then reload systemd and restart Marathon:
systemctl daemon-reload
systemctl start marathon
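At this point the three services can be smoke-tested from master1. A minimal sketch, assuming the default ports (2181 for ZooKeeper, 5050 for the Mesos master, 8080 for Marathon):
# ZooKeeper should answer and report its mode (leader/follower/standalone)
echo stat | nc localhost 2181
# The Mesos master should return its state as JSON
curl -s http://172.0.0.1:5050/master/state | head -c 300
# Marathon should return version and leader information
curl -s http://172.0.0.1:8080/v2/info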
Mesos slave nodes
All four nodes serve as mesos-slaves. Configure them as follows:
- Disable SELinux:
sed -i '/SELINUX/s/enforcing/disabled/' /etc/selinux/config
setenforce 0
- Disable the firewall:
systemctl stop firewalld.service
systemctl disable firewalld.service
- Add the Mesosphere yum repo:
rpm -Uvh http://repos.mesosphere.io/el/7/noarch/RPMS/mesosphere-el-repo-7-1.noarch.rpm
- Install Mesos 1.5.0:
yum install -y mesos
- Write each node's IP into /etc/mesos-slave/hostname:
master1: echo 172.0.0.1 > /etc/mesos-slave/hostname
agent1: echo 172.0.0.2 > /etc/mesos-slave/hostname
agent2: echo 172.0.0.3 > /etc/mesos-slave/hostname
agent3: echo 172.0.0.4 > /etc/mesos-slave/hostname
- Configure Mesos:
echo zk://172.0.0.1:2181,172.0.0.2:2181,172.0.0.3:2181/mesos > /etc/mesos/zk
- Configure Marathon to run Docker containers through Mesos:
echo 'docker,mesos' > /etc/mesos-slave/containerizers
- Start the slave service on every node:
systemctl enable mesos-slave && systemctl start mesos-slave
On every node except master1, also make sure the mesos-master service stays off:
systemctl disable mesos-master
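Once the slaves are up, the master should show all four agents registered. A minimal sketch, assuming the default master port 5050:
# Lists the agents registered with the master
curl -s http://172.0.0.1:5050/slaves
# Check the local agent service on each node
systemctl status mesos-slave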
Problems encountered:
mesos-slave failed with: Failed to perform recovery: Incompatible agent info detected.
Checking the log with journalctl -xe showed:
To restart this agent with a new agent id instead, do as follows:
4月 03 19:13:30 iZbp194ugtbu9s8ct8lt3eZ mesos-slave[1542]: rm -f /var/mesos/meta/slaves/latest
4月 03 19:13:30 iZbp194ugtbu9s8ct8lt3eZ mesos-slave[1542]: This ensures that the agent does not recover old live executors.
Run rm -f /var/mesos/meta/slaves/latest and restart mesos-slave.
Installing and configuring etcd 3.2.15
Install etcd:
yum install etcd -y
Configure etcd
Edit the default etcd config file /etc/etcd/etcd.conf.
On the master1 node:
#[Member]
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379,http://0.0.0.0:4001"
ETCD_NAME="master1"
#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://master1:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://master1:2379,http://master1:4001"
ETCD_INITIAL_CLUSTER="master1=http://master1:2380,agent1=http://agent1:2380,agent2=http://agent2:2380,agent3=http://agent3:2380"
ETCD_INITIAL_CLUSTER_TOKEN="mritd-etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"
On the agent1 node:
#[Member]
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379,http://0.0.0.0:4001"
ETCD_NAME="agent1"
#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://agent1:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://agent1:2379,http://agent1:4001"
ETCD_INITIAL_CLUSTER="master1=http://master1:2380,agent1=http://agent1:2380,agent2=http://agent2:2380,agent3=http://agent3:2380"
ETCD_INITIAL_CLUSTER_TOKEN="mritd-etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"
On the agent2 node:
#[Member]
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379,http://0.0.0.0:4001"
ETCD_NAME="agent2"
#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://agent2:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://agent2:2379,http://agent2:4001"
ETCD_INITIAL_CLUSTER="master1=http://master1:2380,agent1=http://agent1:2380,agent2=http://agent2:2380,agent3=http://agent3:2380"
ETCD_INITIAL_CLUSTER_TOKEN="mritd-etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"
On the agent3 node:
#[Member]
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379,http://0.0.0.0:4001"
ETCD_NAME="agent3"
#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://agent3:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://agent3:2379,http://agent3:4001"
ETCD_INITIAL_CLUSTER="master1=http://master1:2380,agent1=http://agent1:2380,agent2=http://agent2:2380,agent3=http://agent3:2380"
ETCD_INITIAL_CLUSTER_TOKEN="mritd-etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"
Edit /usr/lib/systemd/system/etcd.service:
On all four nodes:
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
WorkingDirectory=/var/lib/etcd/
EnvironmentFile=-/etc/etcd/etcd.conf
User=etcd
# set GOMAXPROCS to number of processors
ExecStart=/bin/bash -c "GOMAXPROCS=$(nproc) /usr/bin/etcd --name=\"${ETCD_NAME}\" \
--data-dir=\"${ETCD_DATA_DIR}\" \
--listen-client-urls=\"${ETCD_LISTEN_CLIENT_URLS}\" \
--advertise-client-urls=\"${ETCD_ADVERTISE_CLIENT_URLS}\" \
--initial-advertise-peer-urls=\"${ETCD_INITIAL_ADVERTISE_PEER_URLS}\" \
--listen-peer-urls=\"${ETCD_LISTEN_PEER_URLS}\" \
--initial-cluster-token=\"${ETCD_INITIAL_CLUSTER_TOKEN}\" \
--initial-cluster=\"${ETCD_INITIAL_CLUSTER}\" \
--initial-cluster-state=\"${ETCD_INITIAL_CLUSTER_STATE}\" \
"
Restart=on-failure
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
Start the etcd service:
systemctl daemon-reload
systemctl start etcd
Problems encountered:
error verifying flags, --advertise-client-urls is required when --listen-client-urls is set explicitly. See 'etcd --help'.
The problem was that --advertise-client-urls was not set. Inspecting /etc/etcd/etcd.conf showed that ETCD_ADVERTISE_CLIENT_URLS="http://master1:2379,http://master1:4001" had been written as ETCD_ADVERTISE_CLIENT_URLS="http://master1:2379,master1:4001"; the second URL was missing its http:// prefix.
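With all four members running, the cluster state can be checked from any node. A minimal sketch using the v2 etcdctl commands bundled with etcd 3.2:
etcdctl --endpoints http://master1:2379 cluster-health
etcdctl --endpoints http://master1:2379 member list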
Installing Calico
The following configuration must be completed on every machine.
- Download the calicoctl binary:
wget -O /usr/local/bin/calicoctl https://github.com/projectcalico/calicoctl/releases/download/v1.6.3/calicoctl
chmod +x /usr/local/bin/calicoctl
- Start calico/node and check that it is running:
# master1 node
ETCD_ENDPOINTS=http://master1:2379 calicoctl node run --node-image=quay.io/calico/node:v2.6.8
# agent nodes (using the self-hosted registry; its setup is described below)
# Add "insecure-registries":["master1:5000"] to /etc/docker/daemon.json
ETCD_ENDPOINTS=http://master1:2379 calicoctl node run --node-image=master1:5000/calico/node:v2.6.8
The agent nodes pull the calico/node image from the Docker registry set up later.
# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7cfe445e5d58 quay.io/calico/node:v2.6.8 "start_runit" 26 hours ago Up 26 hours calico-node
# calicoctl node status
Calico process is running.
- Configure the CNI network:
- Set the environment variables:
# set cni network
# CNI plugin directory
export NETWORK_CNI_PLUGINS_DIR=/var/lib/mesos/cni/plugins
# CNI configuration directory
export NETWORK_CNI_CONF_DIR=/var/lib/mesos/cni/config
- Download the Calico CNI plugins into $NETWORK_CNI_PLUGINS_DIR:
curl -L -o $NETWORK_CNI_PLUGINS_DIR/calico \
https://github.com/projectcalico/cni-plugin/releases/download/v1.11.4/calico
curl -L -o $NETWORK_CNI_PLUGINS_DIR/calico-ipam \
https://github.com/projectcalico/cni-plugin/releases/download/v1.11.4/calico-ipam
chmod +x $NETWORK_CNI_PLUGINS_DIR/calico
chmod +x $NETWORK_CNI_PLUGINS_DIR/calico-ipam
- Write the Calico CNI configuration into $NETWORK_CNI_CONF_DIR:
cat > $NETWORK_CNI_CONF_DIR/calico.conf <<EOF
{
"name": "calico",
"cniVersion": "0.1.0",
"type": "calico",
"ipam": {
"type": "calico-ipam"
},
"etcd_endpoints": "http://master1:2379"
}
EOF
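Note that the exported NETWORK_CNI_* variables above are only shell variables used by the commands that follow; for the Mesos agent itself to use these directories they likely also need to be passed as agent flags. A minimal sketch, assuming the /etc/mesos-slave flag-file mechanism of the Mesosphere packages (corresponding to the Mesos agent options --network_cni_plugins_dir and --network_cni_config_dir):
# Point the agent at the CNI plugin and config directories, then restart it
echo /var/lib/mesos/cni/plugins > /etc/mesos-slave/network_cni_plugins_dir
echo /var/lib/mesos/cni/config > /etc/mesos-slave/network_cni_config_dir
systemctl restart mesos-slave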
- Setting up the Docker registry (the registry is later run through Marathon, so this step can be skipped)
Because agent[1,2,3] cannot reach the public internet, a Docker registry needs to be hosted on master1.
- On master1, run docker pull registry to pull the latest Docker registry image.
- Create the file /data/registry/config.yml with the following content, which allows images to be deleted from the registry after it is created:
version: 0.1
log:
fields:
service: registry
storage:
delete:
enabled: true
cache:
blobdescriptor: inmemory
filesystem:
rootdirectory: /var/lib/registry
http:
addr: :5000
headers:
X-Content-Type-Options: [nosniff]
health:
storagedriver:
enabled: true
interval: 10s
threshold: 3
- Create the /opt/data/registry directory for mounting the registry container's internal image storage.
- Run the registry.
Add the insecure-registries entry to /etc/docker/daemon.json:
{
"graph": "/mnt/docker-data",
"storage-driver": "overlay",
"insecure-registries":[
"master1:5000"
]
}
Then run the registry, naming the container docker_registry:
docker run -d -p 5000:5000 -v /opt/data/registry:/var/lib/registry -v /data/registry/config.yml:/etc/docker/registry/config.yml --name docker_registry registry
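To confirm the registry is serving requests, query its catalog endpoint (part of the standard registry v2 API):
curl http://master1:5000/v2/_catalog
# An empty registry returns: {"repositories":[]}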
- Running the Docker registry through Marathon
- Put the following into /root/zhongchuang-cluster/marathon/registry.json; /etc/docker/registry/config.yml maps to the config.yml created in the previous step:
{
"id": "registry",
"cpus": 0.3,
"mem": 100.0,
"disk": 20000,
"instances": 1,
"constraints": [
[
"hostname",
"CLUSTER",
"172.0.0.1"
]
],
"container": {
"type": "DOCKER",
"docker": {
"image": "registry",
"privileged": true
},
"volumes":[
{
"containerPath":"/var/lib/registry",
"hostPath":"/opt/data/registry",
"mode":"RW"
},
{
"containerPath":"/etc/docker/registry/config.yml",
"hostPath":"/data/registry/config.yml",
"mode":"RO"
}
],
"portMappings": [
{ "containerPort": 5000, "hostPort": 5000 }
]
}
}
- Run curl -X POST http://master1:8080/v2/apps -d@registry.json -H "Content-Type: application/json" to start the Docker registry on Marathon.
- Test that the registry was created successfully:
docker tag busybox master1:5000/busybox
docker push master1:5000/busybox
- On agent1, run docker pull master1:5000/busybox; it succeeds.
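The state of the Marathon-managed registry app can also be checked through the Marathon API. A minimal sketch; the app id registry matches the JSON above:
# Shows instances, tasks, and deployment state for the registry app
curl -s http://master1:8080/v2/apps/registry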
- Running Marathon-lb through Marathon:
{
"id": "/system/marathon-lb",
"cmd": null,
"cpus": 1,
"mem": 128,
"disk": 0,
"instances": 1,
"constraints": [
[
"hostname",
"CLUSTER",
"172.0.0.1"
]
],
"acceptedResourceRoles": [
"*"
],
"container": {
"type": "DOCKER",
"volumes": [],
"docker": {
"image": "docker.io/mesosphere/marathon-lb",
"network": "HOST",
"privileged": true,
"parameters": [],
"forcePullImage": false
}
},
"portDefinitions": [
{
"port": 10000,
"protocol": "tcp",
"name": "default",
"labels": {}
}
],
"args": [
"sse",
"-m",
"http://master1:8080",
"-m",
"http://agent1:8080",
"-m",
"http://agent2:8080",
"-m",
"http://agent3:8080",
"--group",
"external"
],
"healthChecks": []
}
- Problem encountered: the Marathon UI kept showing "waiting". Possible causes and fixes:
- Not enough resources on the mesos-slaves, usually memory. Check at mesos-master:5050.
- The image is missing on the host and is still being pulled or cannot be pulled. Check on the host with docker images | grep xxx and make sure the image name and tag configured in Marathon exist there.
- The hostname in the Marathon constraints attribute must match the content of /etc/mesos-slave/hostname. The constraint was initially written as master1 while /etc/mesos-slave/hostname contained 172.0.0.1.
- The number of mesos-masters is wrong; normally there should be 3, and at minimum the count must be odd. If extra mesos-masters are running, stop and disable those services.
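Deploying and verifying marathon-lb follows the same pattern as the registry app. A minimal sketch, assuming the JSON above is saved as marathon-lb.json and that marathon-lb exposes its admin endpoint on the default port 9090:
curl -X POST http://master1:8080/v2/apps -d@marathon-lb.json -H "Content-Type: application/json"
# Once the task is running, inspect the generated HAProxy configuration
curl -s http://172.0.0.1:9090/_haproxy_getconfig | head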
Installing DC
- Problem encountered: on Alibaba Cloud, when haproxy/nginx reverse-proxies to a port on the same machine and the upstream uses the public IP, requests are inexplicably blocked, even with the security group open.
- Using the internal IP as the reverse-proxy upstream works fine.