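Author: the BabelBird technical team
Background: BabelBird's enterprise cloud drive serves several hundred enterprise customers on a cluster of a few dozen servers spanning K8s nodes, storage nodes, load balancers, and other roles. At that scale, purely manual operations can no longer guarantee efficiency or consistency. This article records how BabelBird standardizes configuration with Ansible, automates releases with a CI/CD pipeline, and what we learned from the many pitfalls along the way.
Note up front: every script in this article is code that runs in our production environment and has been validated for at least one year. The content is hands-on: no concepts, just code and pitfall notes.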
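I. Why automate operations
1.1 The ceiling of manual operations
BabelBird's operations went through several stages, and each one hit a bottleneck:
Stage 1: SSH in and type commands by hand
# Log in to every machine and run the commands manually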
ssh root@192.168.1.10
cd /opt/babelbird
./stop.sh
tar -xzf babelbird-2.3.0.tar.gz
./start.sh
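Problems:
– 10 machines means repeating everything 10 times
– If a command is mistyped on one machine, there is no trace to find it later
– Rollback? Forget it
– At 3 a.m. a slip of the finger nearly took production down
Stage 2: scripting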
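#!/bin/bash
# Deploy to every host listed in hosts.txt (one host per line)
while read -r host; do
    # -n keeps ssh from swallowing the rest of hosts.txt on stdin
    ssh -n "root@${host}" "cd /opt/babelbird && ./deploy.sh babelbird-2.3.0.tar.gz"
done < hosts.txt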
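Better, but with even more problems:
– SSH connections are flaky; when the script dies halfway, you do not know where to resume
– No logs, so a failure does not tell you which machine broke
– No status checks, so you do not know whether the service actually started
– Script versions become chaotic once several people collaborate
Stage 3: Ansible + CI/CD (current)
The current setup addresses the problems above:
– Idempotent execution: running a task 100 times gives the same result as running it once
– Complete logs: every step is recorded
– State-driven: a change is applied only when the configuration differs; otherwise nothing is touched
– Pipelined: a code commit triggers the run automatically, with nobody babysitting it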
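To make "idempotent" and "state-driven" concrete, here is a minimal Python sketch of the check-then-change pattern that modules such as `lineinfile` follow. It is an illustration only, not Ansible's implementation; the function name `ensure_line` and the sample line are assumptions.

```python
import tempfile
from pathlib import Path


def ensure_line(path, line):
    """Ensure `line` is present in the file; return True only if the file changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # desired state already holds, so nothing is written
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


def demo():
    """Apply the same desired state three times and report the 'changed' flags."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "sshd_config"
        return [ensure_line(target, "UseDNS no") for _ in range(3)]


if __name__ == "__main__":
    print(demo())
```

Only the first call reports a change; later calls find the desired state in place and leave the file alone, which is what makes re-running a Playbook safe.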
II. Ansible infrastructure design
2.1 Why Ansible rather than something else
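| Tool | Agent required | Learning curve | Expressiveness | Best suited for |
|---|---|---|---|---|
| Ansible | No (SSH) | Low | Medium (YAML Playbooks) | Configuration management, batch execution |
| Chef | Yes | High | High (Ruby DSL) | Complex configuration management |
| Puppet | Yes | High | High (Puppet DSL/Hiera) | Large-scale infrastructure |
| SaltStack | Yes | Medium | High (YAML+Jinja) | High-performance requirements |
| Terraform | No | Medium | High (HCL) | Infrastructure provisioning |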
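Why BabelBird chose Ansible:
1. Agentless: nothing has to be installed on the managed machines; SSH reachability is enough
2. Idempotent: most built-in modules are idempotent, so repeated runs converge on the same result (the exceptions are command, shell, and raw, which run whatever you hand them)
3. Mature community: official and community Roles cover most common operations, so little has to be written from scratch
4. Low learning curve: everything is YAML, which operations engineers pick up quickly
2.2 Directory layout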
babelbird-ansible/
├── ansible.cfg              # Global Ansible configuration
├── hosts                    # Inventory (static and dynamic are both supported)
├── playbooks/               # Playbooks
│   ├── common/              # Common Playbooks (initialization, config sync)
│   ├── deploy/              # Application deployment Playbooks
│   ├── monitoring/          # Monitoring component deployment
│   └── maintenance/         # Maintenance Playbooks (backup, upgrade, rollback)
├── roles/                   # Reusable Roles
│   ├── common/              # Common role (hostname/time zone/NTP)
│   ├── docker/              # Docker installation and configuration
│   ├── k8s-node/            # K8s node configuration
│   ├── babelbird-app/       # The BabelBird application itself
│   ├── prometheus-node/     # Prometheus exporters
│   └── alerting/            # Alerting components
├── library/                 # Custom Ansible modules
├── callback_plugins/        # Custom callback plugins (richer logging)
├── inventory/               # Dynamic inventory scripts
└── group_vars/              # Global variables
    ├── all.yml
    ├── production.yml
    └── staging.yml
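2.3 Production-grade ansible.cfg
[defaults]
# Path to the inventory file
inventory = ./hosts
# Number of parallel forks; can be raised in production
forks = 20
# Remote user
remote_user = deploy
# Private key (prefer an SSH agent over hard-coding a path here)
# private_key_file = ~/.ssh/id_rsa
# SSH connection timeout in seconds
timeout = 30
# Fact gathering and caching
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 86400
# Logging
log_path = /var/log/ansible/ansible.log
no_log = False
# Callback plugins (richer output)
callback_plugins = ./callback_plugins
callbacks_enabled = timer, profile_tasks, yaml_output
# Skip host key checking (convenient when bootstrapping new machines; note that it
# applies to every connection, not only the first, and removes man-in-the-middle protection)
host_key_checking = False
# Pipelining reduces the number of SSH operations per task; about 30% faster in our runs
pipelining = True
# Asynchronous execution (for long-running tasks)
async_timeout = 3600
poll_interval = 15
[privilege_escalation]
# sudo settings
become = True
become_method = sudo
become_user = root
become_ask_pass = False
[ssh_connection]
# SSH connection tuning
pipelining = True
# Number of connection retries on failure
retries = 3
# ControlPersist cuts down on SSH handshakes
control_path = /tmp/ansible-ssh-%%h-%%p-%%r
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no
[inventory]
# Inventory plugins, including dynamic inventory scripts
enable_plugins = yaml, ini, host_list, script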
2.4 The hosts inventory
# hosts (YAML format, supports nested groups)
all:
  vars:
    ansible_user: deploy
    ansible_port: 22
    ansible_python_interpreter: /usr/bin/python3
    # Common time zone
    timezone: Asia/Shanghai
    # NTP server
    ntp_server: ntp.babelbird.com
    # Docker image registry
    docker_registry: registry.babelbird.com
    docker_registry_username: deploy
    # Application version (injected by CI/CD)
    babelbird_version: ""
    babelbird_deploy_user: babelbird
    babelbird_deploy_group: babelbird
  children:
    # Bastion hosts (not deployment targets, SSH proxy only)
    bastion:
      hosts:
        bastion-01:
          ansible_host: 10.0.0.1
          ansible_user: deploy
        bastion-02:
          ansible_host: 10.0.0.2
          ansible_user: deploy
    # K8s master nodes
    k8s_masters:
      hosts:
        k8s-master-01:
          ansible_host: 10.0.1.11
          etcd_member_name: etcd-01
          node_labels:
            - node-role.kubernetes.io/control-plane
            - node-role.kubernetes.io/master
        k8s-master-02:
          ansible_host: 10.0.1.12
          etcd_member_name: etcd-02
          node_labels:
            - node-role.kubernetes.io/control-plane
            - node-role.kubernetes.io/master
        k8s-master-03:
          ansible_host: 10.0.1.13
          etcd_member_name: etcd-03
          node_labels:
            - node-role.kubernetes.io/control-plane
            - node-role.kubernetes.io/master
    # K8s worker nodes
    k8s_workers:
      vars:
        # Default labels for worker nodes
        default_node_labels:
          babelbird/storage-node: "true"
      hosts:
        k8s-worker-01:
          ansible_host: 10.0.2.11
          node_labels:
            - babelbird/app-node: "true"
            - babelbird/storage-tier: hot
        k8s-worker-02:
          ansible_host: 10.0.2.12
          node_labels:
            - babelbird/app-node: "true"
            - babelbird/storage-tier: hot
        k8s-worker-03:
          ansible_host: 10.0.2.13
          node_labels:
            - babelbird/app-node: "true"
            - babelbird/storage-tier: warm
        k8s-worker-04:
          ansible_host: 10.0.2.14
          node_labels:
            - babelbird/app-node: "true"
            - babelbird/storage-tier: cold
    # Storage nodes (separate Ceph/MinIO cluster)
    storage_nodes:
      hosts:
        storage-node-01:
          ansible_host: 10.0.3.11
          storage_data_disk: /dev/sdb
          storage_journal_disk: /dev/sdc
        storage-node-02:
          ansible_host: 10.0.3.12
          storage_data_disk: /dev/sdb
          storage_journal_disk: /dev/sdc
        storage-node-03:
          ansible_host: 10.0.3.13
          storage_data_disk: /dev/sdb
          storage_journal_disk: /dev/sdc
    # Load balancer nodes (Nginx/HAProxy)
    lb_nodes:
      hosts:
        lb-node-01:
          ansible_host: 10.0.4.11
          haproxy_stats_enabled: true
          haproxy_stats_port: 9000
        lb-node-02:
          ansible_host: 10.0.4.12
          haproxy_stats_enabled: true
          haproxy_stats_port: 9000
    # Monitoring nodes (Prometheus/Grafana/Alertmanager)
    monitoring_nodes:
      hosts:
        monitor-01:
          ansible_host: 10.0.5.11
          prometheus_storage_size: 200Gi
        monitor-02:
          ansible_host: 10.0.5.12
          prometheus_storage_size: 200Gi
    # Database nodes (PostgreSQL primary/replica)
    database_nodes:
      vars:
        pg_port: 5432
        pg_version: "15"
        pg_max_connections: 500
      hosts:
        db-master:
          ansible_host: 10.0.6.11
          pg_role: master
          pg_replication_slots: 5
        db-slave-01:
          ansible_host: 10.0.6.12
          pg_role: replica
          pg_replication_slot_name: db_slave_01
        db-slave-02:
          ansible_host: 10.0.6.13
          pg_role: replica
          pg_replication_slot_name: db_slave_02
    # Cross-cutting group: every K8s node (masters + workers).
    # A group of groups is declared with a nested children key.
    k8s_all:
      children:
        k8s_masters:
        k8s_workers:
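The layout above reserves `inventory/` for dynamic inventory scripts and enables the `script` plugin. As a minimal sketch of what such a script could look like, the Python below emits the JSON structure that Ansible's script plugin reads, including a parent group built from `children`. The host data is a hard-coded, hypothetical subset of the static inventory; a real script would query a CMDB or cloud API instead.

```python
import json

# Hypothetical subset of the static inventory, used only for illustration.
HOSTS = {
    "k8s_masters": {"k8s-master-01": "10.0.1.11", "k8s-master-02": "10.0.1.12"},
    "k8s_workers": {"k8s-worker-01": "10.0.2.11", "k8s-worker-02": "10.0.2.12"},
}


def build_inventory(hosts_by_group):
    """Build the dict an inventory script prints as JSON for `--list`."""
    inventory = {"_meta": {"hostvars": {}}}
    for group, hosts in hosts_by_group.items():
        inventory[group] = {"hosts": sorted(hosts)}
        for name, address in hosts.items():
            inventory["_meta"]["hostvars"][name] = {"ansible_host": address}
    # A parent group lists its member groups under "children".
    inventory["k8s_all"] = {"children": ["k8s_masters", "k8s_workers"]}
    return inventory


if __name__ == "__main__":
    # Ansible invokes the script with --list; because "_meta" carries the
    # host variables, no per-host --host calls are needed.
    print(json.dumps(build_inventory(HOSTS), indent=2))
```

Saved as an executable file under `inventory/`, a script like this can be passed to `ansible-inventory -i <script> --graph` to check that `k8s_all` resolves to both node groups.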
III. Core Playbooks and Roles in practice
3.1 Common initialization Playbook (required on every new machine)
# playbooks/common/init-server.yml
---
# Purpose: run on new or reset machines so that every server shares the same baseline
# Prerequisite: the machine is already defined in the hosts inventory
# Usage: ansible-playbook playbooks/common/init-server.yml -l new-server-01
- name: Server initialization (required on every new machine)
  hosts: all
  become: true
  gather_facts: true  # must stay on: initialization needs system facts
  vars:
    ntp_server: ntp.babelbird.com
    timezone: Asia/Shanghai
    sysctl_params:
      # Kernel tuning (network)
      net.core.somaxconn: 65535
      net.ipv4.tcp_tw_reuse: 1
      net.ipv4.tcp_fin_timeout: 30
      net.ipv4.ip_local_port_range: "1024 65535"
      # File descriptors
      fs.file-max: 2097152
      # Disable IPv6 (if it is not needed)
      net.ipv6.conf.all.disable_ipv6: 1
      net.ipv6.conf.default.disable_ipv6: 1
      # Static huge pages (matters for databases and Java services);
      # this reserves pages up front and is separate from transparent huge pages
      vm.nr_hugepages: 128
      # Swap policy (we would rather hit OOM than swap heavily)
      vm.swappiness: 10
      vm.dirty_ratio: 15
      vm.dirty_background_ratio: 5
  roles:
    - { role: common, tags: [common, init] }
    - { role: docker, tags: [docker] }
    - { role: monitoring-node-exporter, tags: [monitoring] }
  tasks:
    #-------------------------------------------------------
    # Basic environment checks
    #-------------------------------------------------------
    - name: Check that the Linux distribution is supported
      assert:
        that:
          - ansible_facts['distribution'] in ['CentOS', 'Rocky', 'AlmaLinux', 'Ubuntu', 'Debian']
        fail_msg: "Only CentOS/Rocky/AlmaLinux/Ubuntu/Debian are supported; this host runs {{ ansible_facts['distribution'] }}"
        quiet: true
    - name: Check that the play runs with root privileges
      assert:
        that:
          # ansible_become is only defined when set in the inventory, hence the default
          - ansible_user_id == 'root' or (ansible_become | default(false))
        fail_msg: "This Playbook must run as root or as a user with sudo privileges"
        quiet: true
    #-------------------------------------------------------
    # Basic system configuration
    #-------------------------------------------------------
    - name: Set the hostname (if it differs from the inventory)
      hostname:
        name: "{{ inventory_hostname }}"
      when: ansible_facts['hostname'] != inventory_hostname
    - name: Configure /etc/hosts (so that hostnames resolve)
      blockinfile:
        path: /etc/hosts
        marker: "# {mark} ANSIBLE MANAGED BLOCK - babelbird"
        block: |
          127.0.0.1 localhost
          ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
          {% for host in groups['all'] %}
          {{ hostvars[host]['ansible_host'] }} {{ hostvars[host]['inventory_hostname'] }}
          {% endfor %}
    # The timezone module is idempotent, so no `when` guard is needed. The
    # date_time.tz fact holds an abbreviation such as "CST" and cannot be
    # compared with a zone name like Asia/Shanghai.
    - name: Configure the time zone
      timezone:
        name: "{{ timezone }}"
    - name: Install and start time synchronization (chrony, the modern choice)
      block:
        - name: Install chrony
          package:
            name: chrony
            state: present
        - name: Configure chrony
          template:
            src: templates/chrony.conf.j2
            dest: /etc/chrony.conf
            mode: '0644'
          notify: restart chronyd
        - name: Start and enable chronyd
          service:
            name: chronyd
            state: started
            enabled: true
      when: ansible_facts['distribution'] in ['CentOS', 'Rocky', 'AlmaLinux']
    - name: Install and start time synchronization (Debian/Ubuntu)
      block:
        - name: Install chrony (Debian/Ubuntu)
          apt:
            name: chrony
            state: present
            update_cache: yes
        - name: Configure chrony
          template:
            src: templates/chrony.conf.j2
            dest: /etc/chrony/chrony.conf
            mode: '0644'
          notify: restart chronyd
        - name: Start and enable chrony
          service:
            name: chrony
            state: started
            enabled: true
      when: ansible_facts['distribution'] in ['Debian', 'Ubuntu']
    #-------------------------------------------------------
    # Kernel parameter tuning
    #-------------------------------------------------------
    - name: Configure kernel parameters (sysctl)
      sysctl:
        name: "{{ item.key }}"
        value: "{{ item.value }}"
        state: present
        reload: true
        sysctl_file: /etc/sysctl.d/99-babelbird.conf
      loop: "{{ sysctl_params | dict2items }}"
      when: ansible_facts['distribution'] in ['CentOS', 'Rocky', 'AlmaLinux']
    #-------------------------------------------------------
    # System resource limits
    #-------------------------------------------------------
    - name: Configure system resource limits (limits.conf)
      blockinfile:
        path: /etc/security/limits.conf
        marker: "# {mark} ANSIBLE MANAGED - babelbird"
        block: |
          # File descriptor limits
          * soft nofile 1048576
          * hard nofile 1048576
          # Process count limits
          * soft nproc 1048576
          * hard nproc 1048576
          # Locked memory limits (matters for databases and in-memory caches)
          * soft memlock unlimited
          * hard memlock unlimited
          # Core dump size
          * soft core unlimited
          * hard core unlimited
          # Separate entries for root
          root soft nofile 1048576
          root hard nofile 1048576
    # Files in limits.d override limits.conf. Both apply to PAM login sessions only;
    # systemd services take their limits from LimitNOFILE= and similar unit settings.
    - name: Install the limits.d drop-in (overrides limits.conf)
      template:
        src: templates/limits.conf.j2
        dest: /etc/security/limits.d/99-babelbird.conf
        mode: '0644'
      when: ansible_facts.distribution == "Ubuntu"
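    #-------------------------------------------------------
    # Security hardening
    #-------------------------------------------------------
    - name: Disable unnecessary services
      systemd:
        name: "{{ item }}"
        state: stopped
        enabled: false
      loop:
        - postfix  # mail service, not needed
        - rpcbind  # RPC service, not needed
      failed_when: false  # skip silently if the service does not exist
    - name: Harden the SSH daemon
      blockinfile:
        path: /etc/ssh/sshd_config
        marker: "# {mark} ANSIBLE MANAGED - babelbird"
        # sshd keeps the first value it reads for each keyword, so the block
        # goes at the top of the file; validate rejects a broken config
        insertbefore: BOF
        validate: /usr/sbin/sshd -t -f %s
        block: |
          # Disable password login (keys are mandatory)
          PasswordAuthentication no
          # Disable empty passwords
          PermitEmptyPasswords no
          # Disable root login (use the deploy user with sudo)
          PermitRootLogin no
          # Disable X11 forwarding
          X11Forwarding no
          # Maximum authentication attempts
          MaxAuthTries 3
          # Disable reverse DNS lookups (speeds up SSH connections)
          UseDNS no
          # The Protocol directive is omitted: OpenSSH 7.4+ only speaks protocol 2
      notify: restart sshd
    - name: Allow SSH through the firewall before enabling it
      ufw:
        rule: allow
        port: "{{ ansible_port | default(22) }}"
        proto: tcp
      when: ansible_facts['distribution'] == 'Ubuntu'
    - name: Enable the firewall with a default-deny policy
      ufw:
        state: enabled
        policy: deny
      when: ansible_facts['distribution'] == 'Ubuntu'
      failed_when: false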
    #-------------------------------------------------------
    # Common tools
    #-------------------------------------------------------
    # htop and iotop come from EPEL on the CentOS/RHEL family
    - name: Install base tools (CentOS/RHEL family)
      yum:
        name:
          - vim
          - wget
          - curl
          - net-tools
          - tcpdump
          - htop
          - iotop
          - strace
          - lsof
          - rsync
          - git
          - jq
          - unzip
          - bc
          - sysstat
          - dstat
        state: present
      when: ansible_facts['distribution'] in ['CentOS', 'Rocky', 'AlmaLinux']
    - name: Install base tools (Debian/Ubuntu family)
      apt:
        name:
          - vim
          - wget
          - curl
          - net-tools
          - tcpdump
          - htop
          - iotop
          - strace
          - lsof
          - rsync
          - git
          - jq
          - unzip
          - bc
          - sysstat
          - dstat
        state: present
        update_cache: yes
      when: ansible_facts['distribution'] in ['Debian', 'Ubuntu']
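    #-------------------------------------------------------
    # Docker environment (if needed)
    #-------------------------------------------------------
    - name: Decide whether this host needs Docker
      set_fact:
        # group_names holds the inventory groups of the current host
        install_docker: "{{ group_names | intersect(['k8s_masters', 'k8s_workers']) | length > 0 }}"
    - name: Re-gather facts so the checks below see the new state
      setup:
    - name: Read the configured time zone
      command: readlink -f /etc/localtime
      register: tz_link
      changed_when: false
    - name: Read the Docker version (if installed)
      command: docker --version
      register: docker_cli
      failed_when: false
      changed_when: false
    - name: Verify the initialization result
      assert:
        that:
          - ansible_facts['hostname'] == inventory_hostname
          - tz_link.stdout.endswith(timezone)
        quiet: true
      register: init_check
    - name: Print the initialization result
      debug:
        msg: |
          ✅ Server initialization complete
          Hostname: {{ inventory_hostname }}
          IP address: {{ ansible_facts['default_ipv4']['address'] }}
          OS: {{ ansible_facts['distribution'] }} {{ ansible_facts['distribution_version'] }}
          Time zone: {{ ansible_facts['date_time']['tz'] }}
          Docker version: {{ docker_cli.stdout | default('not installed', true) }}
          Kernel version: {{ ansible_facts['kernel'] }}
  handlers:
    - name: restart chronyd
      service:
        name: chronyd
        state: restarted
    - name: restart sshd
      service:
        name: sshd
        state: restarted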
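3.2 Application deployment Playbook
# playbooks/deploy/babelbird-app.yml
---
# Purpose: full deployment, upgrade, and rollback of the BabelBird application
# Usage:
# - Initial deployment of a new cluster: ansible-playbook babelbird-app.yml -e action=deploy
# - Version upgrade: ansible-playbook babelbird-app.yml -e action=upgrade -e version=2.3.1
# - Rollback: ansible-playbook babelbird-app.yml -e action=rollback -e version=2.3.0 -e rollback_revision=<REVISION>
# - Configuration change: ansible-playbook babelbird-app.yml -e action=config-sync
# Note: `action` is also a task keyword, so Ansible may warn about a reserved
# variable name; renaming it (for example to deploy_action) avoids the warning.
- name: BabelBird application deployment
  hosts: k8s_workers
  become: true
  gather_facts: true
  vars:
    # Basic application settings
    app_name: babelbird
    app_user: babelbird
    app_group: babelbird
    app_install_dir: /opt/babelbird
    app_config_dir: /etc/babelbird
    app_log_dir: /var/log/babelbird
    app_data_dir: /data/babelbird
    # Default version (can be overridden on the command line). The second
    # argument to default() makes the fallback apply to the empty string
    # that the inventory sets for babelbird_version.
    version: "{{ babelbird_version | default('2.3.0', true) }}"
    # Docker settings
    docker_image: registry.babelbird.com/babelbird/app:{{ version }}
    docker_pull_policy: Always
    # K8s settings
    k8s_namespace: babelbird-app
    k8s_replicas: 3
    # Database settings (injected through Vault or environment variables)
    db_host: "{{ groups['database_nodes'] | map('extract', hostvars, 'ansible_host') | first }}"
    db_port: 5432
    db_name: babelbird
    # Note: the password does not belong here; encrypt it with ansible-vault or fetch it from an external source
    # db_password: "CHANGE_ME"
  tasks: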
    #-------------------------------------------------------
    # Preparation: create the user and the directory layout
    #-------------------------------------------------------
    - name: Create the application group
      group:
        name: "{{ app_group }}"
        system: true
    - name: Create the application user
      user:
        name: "{{ app_user }}"
        group: "{{ app_group }}"
        system: true
        shell: /sbin/nologin
        create_home: false
        comment: "BabelBird Application User"
    - name: Create the directory layout
      file:
        path: "{{ item }}"
        state: directory
        owner: "{{ app_user }}"
        group: "{{ app_group }}"
        mode: '0755'
      loop:
        - "{{ app_install_dir }}"
        - "{{ app_config_dir }}"
        - "{{ app_log_dir }}"
        - "{{ app_data_dir }}"
        - "{{ app_data_dir }}/uploads"
        - "{{ app_data_dir }}/cache"
        - "{{ app_data_dir }}/temp"
    #-------------------------------------------------------
    # Pull the new image version
    #-------------------------------------------------------
    - name: Log in to the private image registry
      docker_login:
        registry: "{{ docker_registry }}"
        username: "{{ docker_registry_username }}"
        password: "{{ docker_registry_password }}"
      no_log: true  # keep the password out of the logs
    - name: Pull the application image (always refreshed from the registry)
      docker_image:
        name: "{{ docker_image }}"
        source: pull
        force_source: true
      register: image_pull_result
      until: image_pull_result is succeeded
      retries: 3
      delay: 30
      when: docker_pull_policy == 'Always' or not (docker_image is regex('.*:latest'))
    - name: Record the currently running version (used to decide on rollback)
      shell: |
        kubectl get deployment babelbird-api -n {{ k8s_namespace }} -o jsonpath='{.spec.template.spec.containers[0].image}' 2>/dev/null || echo "not-deployed"
      register: current_image
      failed_when: false
      changed_when: false
    #-------------------------------------------------------
    # Generate configuration files
    #-------------------------------------------------------
    - name: Generate the application configuration file
      template:
        src: templates/babelbird/config.yaml.j2
        dest: "{{ app_config_dir }}/config.yaml"
        owner: "{{ app_user }}"
        group: "{{ app_group }}"
        mode: '0600'
      notify: reload babelbird
      vars:
        current_version: "{{ current_image.stdout | default('not-deployed') }}"
    - name: Generate the Kubernetes Deployment manifest
      template:
        src: templates/babelbird/deployment.yaml.j2
        dest: /tmp/babelbird-deployment-{{ version }}.yaml
        mode: '0644'
    - name: Generate the Kubernetes Service manifest
      template:
        src: templates/babelbird/service.yaml.j2
        dest: /tmp/babelbird-service-{{ version }}.yaml
        mode: '0644'
    - name: Generate the Kubernetes HPA manifest
      template:
        src: templates/babelbird/hpa.yaml.j2
        dest: /tmp/babelbird-hpa-{{ version }}.yaml
        mode: '0644'
    #-------------------------------------------------------
    # Run the deployment action (selected by the action variable)
    #-------------------------------------------------------
    # The kubectl tasks assume every target host has kubectl and a kubeconfig.
    # To apply the manifests only once, add run_once: true and delegate_to a
    # control-plane node.
    - name: "[DEPLOY] Full deployment of the new version"
      when: action == 'deploy' or action == 'upgrade'
      block:
        - name: Deploy the application to K8s
          shell: |
            kubectl apply -f /tmp/babelbird-deployment-{{ version }}.yaml
            kubectl apply -f /tmp/babelbird-service-{{ version }}.yaml
            kubectl apply -f /tmp/babelbird-hpa-{{ version }}.yaml
          register: k8s_deploy_result
        - name: Wait for the Deployment to become ready
          shell: |
            kubectl rollout status deployment/babelbird-api -n {{ k8s_namespace }} --timeout=300s
          register: rollout_status
          failed_when: "'deployment \"babelbird-api\" successfully rolled out' not in rollout_status.stdout"
        - name: Check Pod health
          shell: |
            kubectl get pods -n {{ k8s_namespace }} -l app=babelbird-api -o wide
          register: pod_status
          changed_when: false
        - name: Record the deployment history (used for rollback)
          lineinfile:
            path: "{{ app_install_dir }}/.deploy_history"
            line: "{{ ansible_date_time.epoch }} {{ version }} {{ ansible_user }} {{ ansible_host }}"
            create: true
            mode: '0644'
        - name: Print the deployment result
          debug:
            msg: |
              ✅ Deployment succeeded
              Version: {{ version }}
              Image: {{ docker_image }}
              Deployed at: {{ ansible_date_time.iso8601 }}
              Deployed by: {{ ansible_user }}
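    - name: "[ROLLBACK] Roll back to a given revision"
      when: action == 'rollback'
      block:
        - name: Run the rollback
          # --to-revision=0 means "the previous revision"; a default of 1 would
          # jump back to the very first revision
          shell: |
            kubectl rollout undo deployment/babelbird-api -n {{ k8s_namespace }} --to-revision={{ rollback_revision | default(0) }}
          register: rollback_result
          failed_when: false
        - name: Wait for the rollback to finish
          shell: |
            kubectl rollout status deployment/babelbird-api -n {{ k8s_namespace }} --timeout=300s
          register: rollback_status
          when: rollback_result.rc == 0
        - name: Record the rollback in the history
          lineinfile:
            path: "{{ app_install_dir }}/.deploy_history"
            line: "{{ ansible_date_time.epoch }} ROLLBACK to {{ version }} by {{ ansible_user }} {{ ansible_host }}"
            create: true
        - name: Print the rollback result
          debug:
            msg: |
              {% if rollback_result.rc == 0 %}
              ✅ Rollback succeeded
              Rolled back to version: {{ version }}
              {% else %}
              ⚠️ The rollback ran but may not have fully succeeded; please check manually
              {% endif %}
    # Rebuild the ConfigMap from the rendered config file and apply it. Piping
    # `kubectl get` straight into `kubectl replace` would only write the live
    # object back unchanged. Note that the config template task notifies
    # `reload babelbird`, so a changed file still triggers a rollout restart
    # at the end of the play.
    - name: "[CONFIG-SYNC] Sync configuration only"
      when: action == 'config-sync'
      shell: |
        kubectl create configmap babelbird-config -n {{ k8s_namespace }} \
          --from-file=config.yaml={{ app_config_dir }}/config.yaml \
          --dry-run=client -o yaml | kubectl apply -f -
      register: config_sync_result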
    #-------------------------------------------------------
    # Post-deployment verification
    #-------------------------------------------------------
    - name: Health check
      shell: |
        kubectl exec -n {{ k8s_namespace }} \
          $(kubectl get pods -n {{ k8s_namespace }} -l app=babelbird-api -o jsonpath='{.items[0].metadata.name}') \
          -- curl -s http://localhost:8080/api/health
      register: health_check
      failed_when: false
      changed_when: false
      when: action in ['deploy', 'upgrade']
    - name: Print the final status
      debug:
        msg: |
          ========================
          Deployment task finished
          ========================
          Host: {{ inventory_hostname }}
          Version: {{ version }}
          Image: {{ docker_image }}
          Action: {{ action }}
          Health: {{ health_check.stdout | default('not run') }}
          ========================
  handlers:
    - name: reload babelbird
      shell: |
        kubectl rollout restart deployment/babelbird-api -n {{ k8s_namespace }}
      listen: "reload babelbird"
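The Playbook appends one line per deployment to `.deploy_history` ("epoch version user host") and one line per rollback ("epoch ROLLBACK to version by user host"). As a sketch of how that file could feed a rollback decision, the Python below picks the most recent deployed version that differs from the current one. The helper names and the sample history are hypothetical; the Playbook itself only writes the file.

```python
def deployed_versions(lines):
    """Return deployed versions in chronological order, skipping rollback records."""
    versions = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[1] == "ROLLBACK":
            continue  # blank line or "<epoch> ROLLBACK to ..." record
        versions.append(fields[1])
    return versions


def previous_version(lines, current):
    """Return the latest deployed version other than `current`, or None."""
    for version in reversed(deployed_versions(lines)):
        if version != current:
            return version
    return None


# Hypothetical history in the format the Playbook writes.
HISTORY = [
    "1700000000 2.2.9 deploy 10.0.2.11",
    "1700086400 2.3.0 deploy 10.0.2.11",
    "1700172800 2.3.1 deploy 10.0.2.11",
    "1700176400 ROLLBACK to 2.3.0 by deploy 10.0.2.11",
]

if __name__ == "__main__":
    print(previous_version(HISTORY, "2.3.1"))
```

The result could be passed back to the Playbook as `-e version=...`; mapping a version to a Deployment revision number for `--to-revision` would still need `kubectl rollout history`.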
3.3 Core Role example: Docker installation and configuration
# roles/docker/tasks/main.yml
---
# Purpose: install and configure Docker uniformly so that version and behavior match on every server
- name: Check whether Docker is already installed
  command: docker --version
  register: docker_check
  failed_when: false
  changed_when: false
- name: Extract the installed Docker version
  set_fact:
    docker_installed: "{{ docker_check.rc == 0 }}"
    docker_current_version: "{{ docker_check.stdout | regex_replace('Docker version ', '') | regex_replace(',.*', '') if docker_check.rc == 0 else '' }}"
- name: Determine the target Docker version
  set_fact:
    docker_target_version: "{{ docker_version | default('24.0.7') }}"
- name: Install dependencies (CentOS/RHEL)
  yum:
    name:
      - yum-utils
      - device-mapper-persistent-data
      - lvm2
    state: present
  when: ansible_facts['distribution'] in ['CentOS', 'Rocky', 'AlmaLinux']
- name: Install dependencies (Debian/Ubuntu)
  apt:
    name:
      - apt-transport-https
      - ca-certificates
      - curl
      - gnupg
      - lsb-release
    state: present
    update_cache: yes
  when: ansible_facts['distribution'] in ['Debian', 'Ubuntu']
- name: Add the official Docker GPG key (Debian/Ubuntu)
  apt_key:
    url: https://download.docker.com/linux/{{ ansible_facts['distribution'] | lower }}/gpg
    state: present
  when: ansible_facts['distribution'] in ['Debian', 'Ubuntu']
- name: Add the Docker repository (Debian/Ubuntu)
  apt_repository:
    # apt expects Debian architecture names (amd64/arm64), not the kernel's x86_64/aarch64
    repo: "deb [arch={{ {'x86_64': 'amd64', 'aarch64': 'arm64'}[ansible_facts['architecture']] }}] https://download.docker.com/linux/{{ ansible_facts['distribution'] | lower }} {{ ansible_facts['distribution_release'] }} stable"
    state: present
  when: ansible_facts['distribution'] in ['Debian', 'Ubuntu']
- name: Install Docker (Debian/Ubuntu)
  apt:
    name:
      # Docker's .deb versions carry an epoch prefix, e.g. 5:24.0.7-1~ubuntu...
      - docker-ce=5:{{ docker_target_version }}*
      - docker-ce-cli=5:{{ docker_target_version }}*
      - containerd.io
      - docker-buildx-plugin
      - docker-compose-plugin
    state: present
    update_cache: yes
  when: ansible_facts['distribution'] in ['Debian', 'Ubuntu']
- name: Configure the Docker daemon (daemon.json)
  block:
    - name: Create the Docker configuration directory
      file:
        path: /etc/docker
        state: directory
        mode: '0755'
    # daemon.json is strict JSON, so every setting lives in the template.
    # blockinfile would write "# BEGIN ..." marker lines into the file and
    # make it unparseable. The template carries the log rotation settings
    # that keep logs from filling the disk:
    #   "log-driver": "json-file",
    #   "log-opts": {"max-size": "100m", "max-file": "3", "compress": "true"}
    - name: Render the Docker daemon settings
      template:
        src: templates/docker/daemon.json.j2
        dest: /etc/docker/daemon.json
        mode: '0644'
        validate: python3 -m json.tool %s
      notify: restart docker
- name: Start and enable Docker
  service:
    name: docker
    state: started
    enabled: true
- name: Add the deploy user to the docker group (no sudo needed)
  user:
    name: "{{ ansible_user }}"
    groups: docker
    append: yes
#---------------------------------------------------------
# Docker storage driver
#---------------------------------------------------------
- name: Check the current storage driver
  # raw/endraw stops Jinja2 from trying to render docker's Go template
  command: docker info --format '{% raw %}{{.Driver}}{% endraw %}'
  register: storage_driver
  changed_when: false
# daemon.json is JSON, so lineinfile/blockinfile cannot edit it safely
# (a bare key line or a "# BEGIN" marker breaks the file). The setting
# belongs in templates/docker/daemon.json.j2; this task only reports drift.
- name: Warn when the storage driver is not overlay2 (recommended)
  debug:
    msg: "Storage driver is {{ storage_driver.stdout }}; set \"storage-driver\": \"overlay2\" in daemon.json.j2"
  when: storage_driver.stdout != 'overlay2'
#---------------------------------------------------------
# Registry mirror (if any)
#---------------------------------------------------------
# For the same reason, the mirror is rendered by daemon.json.j2 when
# docker_mirror_enabled is true:
#   "registry-mirrors": ["https://mirror.babelbird.com"]
#---------------------------------------------------------
# Verify the installation
#---------------------------------------------------------
- name: Verify the Docker version
  command: docker --version
  register: final_docker_version
  changed_when: false
- name: Verify the Docker service state
  service:
    name: docker
    state: started
  register: docker_service_status
- name: Verify that containers can run
  command: docker run --rm alpine echo "Docker is working"
  register: docker_test
  retries: 3
  delay: 5
  until: docker_test.rc == 0
- name: Print the Docker installation result
  debug:
    msg: |
      ✅ Docker installation complete
      Version: {{ final_docker_version.stdout }}
      Storage driver: {{ storage_driver.stdout }}
      Service state: {{ 'running' if docker_service_status.state == 'started' else 'stopped' }}
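Because `/etc/docker/daemon.json` must stay valid JSON, line-oriented edits are risky and individual keys are better merged as data. The following is a minimal Python sketch of an idempotent merge, the kind of logic a custom module under `library/` could wrap. The function name and the choice of a shallow merge are assumptions for illustration.

```python
import json

# Settings taken from the role above.
DESIRED = {
    "log-driver": "json-file",
    "log-opts": {"max-size": "100m", "max-file": "3", "compress": "true"},
    "storage-driver": "overlay2",
}


def merge_daemon_config(current_text, desired):
    """Merge desired top-level keys into daemon.json text.

    Returns (new_text, changed). The merge is shallow: a nested object such
    as "log-opts" is replaced as a whole rather than merged key by key.
    """
    current = json.loads(current_text) if current_text.strip() else {}
    merged = {**current, **desired}
    changed = merged != current
    return json.dumps(merged, indent=2, sort_keys=True) + "\n", changed


if __name__ == "__main__":
    text, changed = merge_daemon_config('{"data-root": "/data/docker"}', DESIRED)
    print(changed)
    print(text)
```

Comparing parsed data rather than raw text means a second run with the same desired settings reports no change, so a `restart docker` handler would fire only when something actually differs.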
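IV. The CI/CD pipeline: from commit to automatic deployment
4.1 Why CI/CD
Before CI/CD, BabelBird's release process looked like this:
Developer commits code → build script run by hand → verification in the test environment → production release request → manual deployment
Problems:
– Manual steps are error-prone (a mistyped version number, a skipped step)
– Test results are not recorded (who tested, when, and which cases)
– Rollback is manual (scrambling to type commands at 3 a.m.)
– Releases are not audited (who released which version, and when)
With CI/CD:
Developer commits code → automatic build → automatic unit tests → automatic integration tests → automatic deployment to the test environment
→ manual approval (optional) → automatic deployment to pre-production → automatic deployment to production → automatic verification
4.2 GitLab CI/CD configuration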
# .gitlab-ci.yml
# Main CI/CD configuration for BabelBird
stages:
  - build
  - test
  - deploy-staging
  - deploy-production
  - verify
variables:
  # Docker settings
  DOCKER_DRIVER: overlay2
  DOCKER_TLS_CERTDIR: "/certs"
  # Image registry
  REGISTRY: registry.babelbird.com
  REGISTRY_IMAGE: ${REGISTRY}/babelbird/app
  # Helm settings
  HELM_VERSION: "3.13.0"
  # K8s settings
  K8S_NAMESPACE_STAGING: babelbird-staging
  K8S_NAMESPACE_PROD: babelbird-app
#===============================================================
# Global settings
#===============================================================
# Jobs run in the Docker CLI image and talk to a docker:dind service
image: docker:24
services:
  - docker:24-dind
# This before_script runs in every job that does not override it, including
# jobs whose image has no docker CLI; such jobs need their own before_script
before_script:
  # --password-stdin keeps the password out of the process list
  - echo "$REGISTRY_PASSWORD" | docker login -u "$REGISTRY_USER" --password-stdin "$REGISTRY"
  - curl -fsSL https://get.helm.sh/helm-v${HELM_VERSION}-linux-amd64.tar.gz | tar -xz -C /tmp
  - mv /tmp/linux-amd64/helm /usr/local/bin/helm && chmod +x /usr/local/bin/helm
#===============================================================
# Build stage
#===============================================================
build:docker:
  stage: build
  only:
    - develop
    - master
    - tags
    - /^release\/.*$/
  script:
    # Build the backend application
    - echo "Building backend application..."
    # CI_COMMIT_REF_SLUG is used for the branch tag because a ref such as
    # release/2.3.1 contains "/", which is not allowed in an image tag
    - docker build
      --build-arg BUILD_VERSION=${CI_COMMIT_SHORT_SHA}
      --build-arg BUILD_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
      --build-arg GIT_COMMIT=${CI_COMMIT_SHA}
      -t ${REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}
      -t ${REGISTRY_IMAGE}:${CI_COMMIT_REF_SLUG}
      -t ${REGISTRY_IMAGE}:latest
      --target production
      --push
      .
    # Record the version after the push
    - echo "${CI_COMMIT_SHORT_SHA}" > version.txt
    # Multi-architecture build (the flag is --platform, singular)
    - docker buildx build --push --platform linux/amd64,linux/arm64 -t ${REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA} .
  artifacts:
    paths:
      - version.txt
    expire_in: 1 week
  tags:
    - build
  timeout: 30m
#===============================================================
# Test stage
#===============================================================
test:unit:
  stage: test
  image: maven:3.9-eclipse-temurin-17
  before_script: []  # skip the global docker login/helm setup: this image has no docker CLI
  script:
    - mvn clean test -B -Dspring.profiles.active=test
    - mvn jacoco:report
  coverage: '/Total.*?([0-9]{1,3})%/'
  artifacts:
    reports:
      junit: target/surefire-reports/TEST-*.xml
      coverage_report:
        coverage_format: jacoco
        path: target/site/jacoco/jacoco.xml
  only:
    - merge_requests
    - develop
    - master
  tags:
    - test
  timeout: 20m
test:integration:
  stage: test
  image: ${REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}
  before_script: []  # skip the global docker login/helm setup
  services:
    - postgres:15
    - redis:7
  script:
    - echo "Running integration tests..."
    - mvn verify -B -Pintegration-test
  artifacts:
    reports:
      junit: target/surefire-reports/TEST-*.xml
  only:
    - develop
    - master
  tags:
    - test
  timeout: 30m
test:security:
  stage: test
  image: aquasec/trivy:latest
  before_script: []  # skip the global docker login/helm setup
  script:
    - trivy image --exit-code 1 --severity HIGH,CRITICAL ${REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}
  allow_failure: true  # a failed scan does not block the release, but needs manual sign-off
  only:
    - develop
    - master
    - tags
  tags:
    - test
#===============================================================
# Deploy to the test (staging) environment
#===============================================================
deploy:staging:
  stage: deploy-staging
  environment:
    name: staging
    url: https://staging.babelbird.com
    on_stop: cleanup:staging
  script:
    - echo "Deploying to staging environment..."
    # Select the K8s context
    - kubectl config use-context staging
    # Helm upgrade/install. The second argument must be a chart, not an image
    # reference; ./helm/babelbird is an assumed chart path, and the image is
    # selected through image.tag
    - helm upgrade --install babelbird ./helm/babelbird
      --namespace ${K8S_NAMESPACE_STAGING}
      --create-namespace
      --values helm/values-staging.yaml
      --set image.tag=${CI_COMMIT_SHORT_SHA}
      --set replicaCount=2
      --wait --timeout 10m
      --atomic
      --cleanup-on-fail
    # Wait for the Pods to become ready
    - kubectl rollout status deployment/babelbird-app -n ${K8S_NAMESPACE_STAGING} --timeout=300s
    # Smoke test
    - kubectl exec -n ${K8S_NAMESPACE_STAGING} deploy/babelbird-app -- curl -sf http://localhost:8080/api/health
  only:
    - develop
    - master
  tags:
    - deploy
  timeout: 15m
#===============================================================
# Deploy to production (manual approval required)
#===============================================================
deploy:production:
  stage: deploy-production
  # No on_stop here: on_stop must point at a job with environment action "stop",
  # and the manual rollback job below is not an environment stop
  environment:
    name: production
    url: https://www.babelbird.com
  script:
    - echo "Deploying to production environment..."
    - kubectl config use-context production
    # Save the current image reference (used for rollback)
    - kubectl get deployment babelbird-app -n ${K8S_NAMESPACE_PROD} -o jsonpath='{.spec.template.spec.containers[0].image}' > /tmp/current_image.txt
    # Helm upgrade/install. The second argument must be a chart, not an image
    # reference; ./helm/babelbird is an assumed chart path
    - helm upgrade --install babelbird ./helm/babelbird
      --namespace ${K8S_NAMESPACE_PROD}
      --create-namespace
      --values helm/values-production.yaml
      --set image.tag=${CI_COMMIT_SHORT_SHA}
      --set replicaCount=5
      --set hpa.enabled=true
      --set hpa.minReplicas=3
      --set hpa.maxReplicas=20
      --wait --timeout 15m
      --atomic
      --cleanup-on-fail
    # Wait for the Pods to become ready
    - kubectl rollout status deployment/babelbird-app -n ${K8S_NAMESPACE_PROD} --timeout=600s
    # Health check
    - kubectl exec -n ${K8S_NAMESPACE_PROD} deploy/babelbird-app -- curl -sf http://localhost:8080/api/health
    # Record the deployment in Consul KV
    - consul kv put babelbird/deploy/history/${CI_COMMIT_SHORT_SHA} "{\"timestamp\":\"$(date -u)\",\"deployer\":\"${GITLAB_USER_NAME}\",\"commit\":\"${CI_COMMIT_SHA}\"}"
  only:
    - tags
    - /^release\/.*$/
  when: manual  # manual trigger (production always needs human confirmation)
  tags:
    - deploy
  timeout: 30m
#===============================================================
# Verification stage
#===============================================================
verify:production:
  stage: verify
  script:
    - echo "Running post-deployment verification..."
    # Call the business-level health check endpoint
    - curl -sf https://www.babelbird.com/api/health
    # Fetch a key metric from the monitoring endpoint
    - curl -sf https://www.babelbird.com/api/metrics | jq '.activeUsers'
    # Check the Prometheus alert state
    - curl -sf "http://prometheus:9090/api/v1/alerts" | jq '.data.alerts | length'
  only:
    - tags
    - /^release\/.*$/
  tags:
    - verify
  timeout: 10m
#===============================================================
# Cleanup (staging environment lifecycle)
#===============================================================
cleanup:staging:
  stage: deploy-staging
  script:
    - kubectl config use-context staging
    - helm uninstall babelbird -n ${K8S_NAMESPACE_STAGING} || true
  environment:
    name: staging
    action: stop
  when: manual
  only:
    - develop
  tags:
    - deploy
#===============================================================
# Rollback (emergency production rollback)
#===============================================================
rollback:production:
  stage: deploy-production
  script:
    - kubectl config use-context production
    - |
      # Capture the current version
      CURRENT=$(kubectl get deployment babelbird-app -n ${K8S_NAMESPACE_PROD} -o jsonpath='{.spec.template.spec.containers[0].image}')
      # Roll back to the previous revision
      kubectl rollout undo deployment/babelbird-app -n ${K8S_NAMESPACE_PROD}
      kubectl rollout status deployment/babelbird-app -n ${K8S_NAMESPACE_PROD} --timeout=600s
      # Record the rollback event
      consul kv put babelbird/deploy/rollback "{\"from\":\"${CURRENT}\",\"timestamp\":\"$(date -u)\",\"operator\":\"${GITLAB_USER_NAME}\"}"
  # "rollback" is not a valid environment action, so only the name is set
  environment:
    name: production
  when: manual
  only:
    - tags
    - /^release\/.*$/
  tags:
    - deploy
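The build job tags images with the commit SHA and with the branch or tag name. A ref such as `release/2.3.1` contains `/`, which an image tag cannot hold, so the name has to be normalized first. GitLab exposes a normalized form as `CI_COMMIT_REF_SLUG`; the Python sketch below approximates its documented rule (lowercase, at most 63 characters, anything outside `0-9a-z` replaced by `-`, no leading or trailing `-`). It is an illustration of the rule, not GitLab's code.

```python
import re


def ref_slug(ref_name):
    """Approximate GitLab's CI_COMMIT_REF_SLUG for a branch or tag name."""
    slug = re.sub(r"[^0-9a-z]", "-", ref_name.lower())
    return slug[:63].strip("-")


def image_tags(repository, short_sha, ref_name):
    """Return the image references a build would push for one commit."""
    return [f"{repository}:{short_sha}", f"{repository}:{ref_slug(ref_name)}"]


if __name__ == "__main__":
    for tag in image_tags("registry.babelbird.com/babelbird/app", "a1b2c3d4", "release/2.3.1"):
        print(tag)
```

The short SHA stays the tag that deployments pin to, since it never moves; the slug tag is a convenience pointer that is overwritten by each new build of the same branch.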
4.3 Helm Chart values
# helm/values-production.yaml
# Helm values for the BabelBird production environment
# Image settings
image:
  repository: registry.babelbird.com/babelbird/app
  pullPolicy: IfNotPresent
  # The tag is injected by CI/CD through --set
  tag: ""
# Replica count
replicaCount: 5
# Application settings
app:
  name: babelbird
  env: production
  # Note: -Xmx8g equals the 8Gi container memory limit below, which leaves no
  # headroom for metaspace, thread stacks, and off-heap memory
  java_opts: "-Xms4g -Xmx8g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Dfile.encoding=UTF-8"
  # Business settings
  config:
    upload_max_size: "500M"
    preview_enabled: true
    watermarking_enabled: true
# K8s resource requests and limits
resources:
  requests:
    cpu: "1000m"
    memory: "2Gi"
  limits:
    cpu: "4000m"
    memory: "8Gi"
# Health checks
livenessProbe:
  httpGet:
    path: /api/health
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 6
  successThreshold: 1
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /api/ready
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 5
  failureThreshold: 3
  timeoutSeconds: 3
# Startup probe (for slow-starting applications)
startupProbe:
  httpGet:
    path: /api/health
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 10
  failureThreshold: 30  # wait at most 300 seconds for startup
  timeoutSeconds: 5
# HorizontalPodAutoscaler
hpa:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  # Scale on CPU and on a custom metric
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: babelbird_api_requests_per_second
        target:
          type: AverageValue
          averageValue: "50"
# Persistent storage
persistence:
  enabled: true
  storageClass: "ssd-storage-class"
  size: "50Gi"
  accessMode: ReadWriteOnce
# Pod anti-affinity (spread Pods across nodes).
# Note: a required rule allows at most one Pod per node, so replicas beyond the
# number of schedulable nodes stay Pending; with hpa.maxReplicas: 20, consider
# preferredDuringSchedulingIgnoredDuringExecution instead.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: babelbird
        topologyKey: kubernetes.io/hostname
# Tolerations
tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Exists"
    effect: "NoSchedule"
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"
# Security context
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000
  seccompProfile:
    type: RuntimeDefault
securityContext:
  readOnlyRootFilesystem: false
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE
# Ingress
ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-body-size: "500m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
  hosts:
    - host: www.babelbird.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - hosts:
        - www.babelbird.com
      secretName: babelbird-tls
# Service monitoring (Prometheus Operator)
serviceMonitor:
  enabled: true
  interval: 15s
  scrapeTimeout: 10s
  namespace: monitoring
# PodDisruptionBudget (keeps a minimum number of Pods available during
# voluntary disruptions such as node drains)
podDisruptionBudget:
  enabled: true
  minAvailable: 2
# Rolling update strategy
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1
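The `hpa` block sets targets but does not show how a target turns into a replica count. The Kubernetes documentation gives the core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue), skipped when the ratio is within a tolerance (0.1 by default) and bounded by minReplicas and maxReplicas. The Python sketch below works through that rule with the values from this file; it is a simplification for a single metric and ignores stabilization windows and Pods that are not ready.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=3, max_replicas=20, tolerance=0.1):
    """Single-metric HPA rule: scale by the ratio of current to target value."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to the target: no scaling
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 5 replicas averaging 90% CPU against the 70% target
    print(desired_replicas(5, 90, 70))
    # 10 replicas at 200 requests/s per Pod against the target of 50
    print(desired_replicas(10, 200, 50))
```

With several metrics configured, as in this file, the autoscaler evaluates each one and uses the largest resulting replica count.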
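V. Pitfalls and lessons learned
Pitfall 1: SSH connections exhausted when Ansible runs in parallel
Problem: running a Playbook against 20 machines at once produced a large number of SSH connection failures with the error: Failed to connect to new shell - expected 12 received 0.
Root cause: sshd's MaxSessions defaults to 10, and 20 concurrent connections exceeded the limit. MaxSessions caps the sessions multiplexed over a single connection, which ControlMaster relies on. If connections are refused before authentication, for example when every fork goes through the same bastion, the limit to check is MaxStartups (default 10:30:100).
Solution:
# ansible.cfg
[ssh_connection]
pipelining = True
control_path = /tmp/ansible-ssh-%%h-%%p-%%r
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
Also raise the limit in /etc/ssh/sshd_config on the managed machines:
MaxSessions 100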
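Pitfall 2: a Playbook task times out and the run hangs
Problem: a task ran for too long and Ansible gave up with a timeout, while the command itself kept running in the background.
Solution:
# Set an upper bound at the task level
- name: Long-running command
  shell: |
    # the actual work
  async: 3600  # maximum run time of the task, in seconds
  poll: 15     # check the status every 15 seconds
  changed_when: false
# Or fire and forget: start the work and do not wait for it
- name: Restart services in the background
  shell: "systemctl restart {{ item }}"
  async: 300   # upper bound in seconds; it must cover the whole restart
  poll: 0      # do not wait, return immediately
  # services_to_restart is a hypothetical variable holding systemd unit names
  loop: "{{ services_to_restart }}"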
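The `async`/`poll` pair boils down to "check every `poll` seconds and give up after `async` seconds". The Python sketch below models that loop with an injectable clock so it can be run instantly. It illustrates the semantics only; Ansible's own implementation runs a wrapper on the managed host, and the function names here are assumptions.

```python
import time


def wait_for(check, timeout, interval, clock=time.monotonic, sleep=time.sleep):
    """Poll `check` every `interval` seconds until it is true or `timeout` expires.

    Returns (succeeded, number_of_polls).
    """
    deadline = clock() + timeout
    polls = 0
    while True:
        polls += 1
        if check():
            return True, polls
        if clock() + interval > deadline:
            return False, polls  # the next poll would land past the deadline
        sleep(interval)


def simulate(job_seconds, timeout=3600, interval=15):
    """Run wait_for against a fake clock for a job lasting `job_seconds`."""
    now = [0.0]

    def clock():
        return now[0]

    def sleep(seconds):
        now[0] += seconds  # advance the fake clock instead of sleeping

    return wait_for(lambda: now[0] >= job_seconds, timeout, interval, clock, sleep)


if __name__ == "__main__":
    print(simulate(40))
    print(simulate(100, timeout=30))
```

A 40-second job polled every 15 seconds is seen as finished on the fourth poll (at 45 seconds), which is why a short `poll` interval matters more for quick jobs than for hour-long ones.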
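Pitfall 3: the configuration did not roll back together with the Helm release
Problem: a Helm upgrade shipped a broken ConfigMap, and after rolling the release back the ConfigMap was still the new version.
Root cause: Helm stores the rendered manifests and the values of every revision (including values passed with --set), and helm rollback restores them, but only for resources that belong to the release. A ConfigMap that was created or edited outside the chart, for example with kubectl, is not touched by a rollback. Pods also do not restart on a ConfigMap-only change unless the Deployment's Pod template changes.
Solution:
# 1. Every configuration change goes through values-production.yaml (tracked in Git)
# 2. Inspect the release history, then roll back to an explicit revision
#    (helm rollback has no --reuse-values flag; that flag belongs to helm upgrade)
helm history babelbird -n babelbird-app
helm rollback babelbird 1 -n babelbird-app --wait
# 3. Key configuration lives in a ConfigMap rendered by the chart, not in ad-hoc helm --set values
# helm/values-production.yaml
configMapRef:
  name: babelbird-config
  key: config.yaml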
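Pitfall 4: image pulls time out and the deployment fails
Problem: Pods failed to start because the image pull timed out (the default timeout was 3 minutes in our setup), and they stayed in the Pending phase.
Solution:
# 1. Set the image pull policy in the Deployment
spec:
  containers:
    - name: babelbird
      image: registry.babelbird.com/babelbird/app:v2.3.0
      imagePullPolicy: IfNotPresent  # do not pull when the image is already on the node
  # 2. Provide credentials for the private registry
  #    (imagePullSecrets authenticates the pull; it does not change any timeout)
  imagePullSecrets:
    - name: registry-secret
# 3. Tune the kubelet image garbage collection thresholds so cached images survive longer
# /var/lib/kubelet/config.yaml
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
imageMinimumGCAge: 2m0s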
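Pitfall 5: brief outage during rolling updates
Problem: during a K8s rolling update the old Pods were already Terminating while the new Pods were not yet Ready, leaving the service unavailable for tens of seconds.
Solution:
# 1. Rolling update strategy: maxUnavailable=0 keeps the full replica count serving
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1
# 2. Configure a PodDisruptionBudget (protects against voluntary evictions such as node drains)
podDisruptionBudget:
  enabled: true
  minAvailable: 2
# 3. Add a readinessProbe so that an old Pod is removed only after the new Pod is Ready
readinessProbe:
  httpGet:
    path: /api/ready
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 5
  failureThreshold: 3
# 4. Configure graceful termination
terminationGracePeriodSeconds: 60  # enough time for open connections to close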
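To see what `maxUnavailable: 0` and `maxSurge: 1` guarantee, it helps to compute the bounds they put on the Pod count during a rollout. The Python sketch below does that arithmetic, including the percentage form, where Kubernetes rounds maxSurge up and maxUnavailable down. The function names are illustrative.

```python
import math


def resolve(value, replicas, round_up):
    """Turn an absolute number or a percentage string into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        fraction = replicas * int(value[:-1]) / 100
        return math.ceil(fraction) if round_up else math.floor(fraction)
    return int(value)


def rollout_bounds(replicas, max_surge, max_unavailable):
    """Return (minimum available Pods, maximum total Pods) during a rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas - unavailable, replicas + surge


if __name__ == "__main__":
    # Values from this article: 5 replicas, maxSurge 1, maxUnavailable 0
    print(rollout_bounds(5, 1, 0))
    # The Kubernetes defaults of 25% / 25% for comparison
    print(rollout_bounds(5, "25%", "25%"))
```

With the article's settings all 5 replicas stay available and the cluster briefly runs 6 Pods, so one spare slot must be schedulable for the rollout to make progress.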
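VI. Results in numbers
| Metric | Manual operations | With Ansible + CI/CD |
|---|---|---|
| Application release time | 45 minutes (including manual confirmation) | 8 minutes (fully automated) |
| Configuration consistency errors | About 3 per month | 0 |
| Rollback time | More than 30 minutes | 3 minutes |
| New machine initialization | 3 person-hours | 5 minutes |
| Release failure rate | 8% | <1% |
| Emergency release response time (MTTR) | >30 minutes | 5 minutes |
| Operations effort (monthly average) | 160 person-hours | 40 person-hours |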
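For readers who prefer relative figures, the rows with exact before and after values can be converted to percentage reductions. The short Python sketch below does the arithmetic on numbers copied from the table; rows given as bounds ("more than 30 minutes", "<1%") are left out because they have no exact value.

```python
def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent, to one decimal."""
    return round((before - after) / before * 100, 1)


# (metric, before, after) taken from the table above
ROWS = [
    ("Application release time (minutes)", 45, 8),
    ("Operations effort (person-hours per month)", 160, 40),
]

if __name__ == "__main__":
    for metric, before, after in ROWS:
        print(f"{metric}: {reduction_percent(before, after)}% lower")
```

Release time drops by about 82% and monthly operations effort by 75%.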
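VII. Closing thoughts
Automated operations are not built overnight; BabelBird's automation took nearly a year to mature step by step. A few suggestions:
1. Start where it hurts most
Do not try to automate everything at once. Begin with the manual operations that are most frequent and most error-prone, such as releases and rollbacks.
2. Idempotency comes first
Most Ansible modules are idempotent by design, but be careful with the tasks you write yourself: a single line of shell: rm -rf /tmp/* can cause a disaster.
3. Version-control everything
Ansible Playbooks, Roles, Helm Charts, and CI/CD configuration all go into Git. Automation without version control is not real automation.
4. Automation needs a circuit breaker
When an automated release fails, there must be an automatic rollback, so that a bad configuration does not linger in production.
5. Monitor the automation itself
Failed Ansible runs and failed CI/CD pipelines must raise alerts. Blind spots are not limited to business metrics; failures of the operations tooling need monitoring as well.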
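One lightweight way to alert on failed Ansible runs is to scan the PLAY RECAP section of the log for hosts with failed or unreachable counts. The Python sketch below parses the default recap format; the sample output is made up, and the notification itself (webhook, mail) is left out. Checking the exit code of ansible-playbook or using a JSON callback would be more robust; this is only a sketch of the idea.

```python
import re

# One recap line looks like: "host : ok=12 changed=3 unreachable=0 failed=0 ..."
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")

# Hypothetical output in the default ansible-playbook format.
SAMPLE_OUTPUT = """\
PLAY RECAP *********************************************************************
k8s-worker-01              : ok=12   changed=3    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0
k8s-worker-02              : ok=7    changed=1    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0
db-master                  : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0
"""


def failed_hosts(output):
    """Return hosts whose recap shows failed or unreachable tasks."""
    failed = []
    in_recap = False
    for line in output.splitlines():
        if line.startswith("PLAY RECAP"):
            in_recap = True
            continue
        match = RECAP_LINE.match(line.strip()) if in_recap else None
        if not match:
            continue
        counters = dict(pair.split("=") for pair in match.group("counters").split())
        if int(counters.get("failed", 0)) or int(counters.get("unreachable", 0)):
            failed.append(match.group("host"))
    return failed


if __name__ == "__main__":
    print(failed_hosts(SAMPLE_OUTPUT))
```

A wrapper around scheduled Playbook runs could call this on the captured output and send an alert whenever the returned list is not empty.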
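References
– Ansible documentation: https://docs.ansible.com/
– kube-prometheus-stack: https://github.com/prometheus-community/helm-charts
– GitLab CI/CD: https://docs.gitlab.com/ee/ci/
– Helm documentation: https://helm.sh/docs/
– BabelBird website: https://www.babelbird.com
All code and configuration in this article come from BabelBird's production experience and have been verified in use. Questions and suggestions are welcome.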