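1. Nagios Selection Rationale and Deployment Preparation

1.1 Basis for Selecting a Monitoring System

As distributed architectures grow more complex, organizations need a single solution that monitors servers, network devices, and application services together. Nagios, a long-established open-source monitoring system, offers three core advantages:

Cross-platform support: runs on mainstream Linux distributions (CentOS/Ubuntu/Debian)
Plugin ecosystem: extensions such as NRPE and NSClient++ enable a wide range of checks on remote hosts
Alerting policy: supports threshold-based triggers, dependencies, escalation, and other advanced rules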

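1.2 Environment Checklist

Complete the following baseline checks before deployment:

# Example system requirement checks
cat /etc/redhat-release  # confirm CentOS 7.x/8.x
uname -m                # verify x86_64 architecture
free -h                 # at least 2 GB of available memory
df -h /opt              # ensure the /opt partition has 5 GB free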
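
The manual checks above can be scripted. The following is a minimal sketch, assuming a Linux host whose /proc/meminfo exposes MemAvailable; the function names are illustrative, and the 2048 MB and 5 GB thresholds simply mirror the checklist.

```shell
#!/bin/sh
# Preflight sketch: compare available memory and free disk space against minimums.

# Succeeds when <actual> is at least <required> (both integers).
meets_minimum() {
    [ "$1" -ge "$2" ]
}

# Available memory in MB, read from /proc/meminfo (assumes MemAvailable exists).
available_mem_mb() {
    awk '/^MemAvailable:/ { print int($2 / 1024) }' /proc/meminfo
}

# Free space in GB on the filesystem that holds <path>.
free_disk_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Print one PASS/FAIL line per check; return non-zero if any check fails.
preflight() {
    path=${1:-/opt}
    rc=0
    if meets_minimum "$(available_mem_mb)" 2048; then
        echo "PASS memory"
    else
        echo "FAIL memory: less than 2048 MB available"
        rc=1
    fi
    if meets_minimum "$(free_disk_gb "$path")" 5; then
        echo "PASS disk ($path)"
    else
        echo "FAIL disk ($path): less than 5 GB free"
        rc=1
    fi
    return "$rc"
}
```

Calling preflight /opt prints one line per check and returns a non-zero status if either minimum is not met, so it can gate an install script.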

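Disable SELinux enforcement and configure the firewall (setenforce 0 lasts only until the next reboot):

setenforce 0
systemctl stop firewalld
# Or open only the required ports (adjust to your configuration)
firewall-cmd --permanent --add-port={80/tcp,5666/tcp}
firewall-cmd --reload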
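
Because setenforce only changes the running mode, the setting has to be written to the SELinux config file to survive a reboot. A minimal sketch, assuming the standard /etc/selinux/config location; the helper name set_selinux_mode is hypothetical, and choosing permissive rather than disabled is an assumption, not something the steps above require.

```shell
# Rewrite the SELINUX= line in a config file so the mode survives a reboot.
# Usage: set_selinux_mode <enforcing|permissive|disabled> [config-file]
set_selinux_mode() {
    mode=$1
    file=${2:-/etc/selinux/config}
    case "$mode" in
        enforcing|permissive|disabled) ;;
        *) echo "invalid mode: $mode" >&2; return 2 ;;
    esac
    # Edit through a temporary copy, then write back in place so the
    # original file keeps its ownership and permissions.
    tmp=$(mktemp) || return 1
    sed "s/^SELINUX=.*/SELINUX=$mode/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

For example, set_selinux_mode permissive (run as root) updates /etc/selinux/config and leaves the SELINUXTYPE= line untouched.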

2. Core Component Installation

2.1 Dependency Setup

Install the build toolchain and development libraries:

yum install -y gcc glibc glibc-common make httpd php \
gd gd-devel perl postfix wget

Create a dedicated service user and command group:

useradd -m nagios
groupadd nagcmd
usermod -a -G nagcmd nagios
usermod -a -G nagcmd apache

2.2 Installing Nagios Core

Download a stable release from the official download site (4.4.6 in this example):

wget https://assets.nagios.com/downloads/nagioscore/releases/nagios-4.4.6.tar.gz
tar zxvf nagios-4.4.6.tar.gz
cd nagios-4.4.6

Configure, compile, and install:

./configure --with-nagios-user=nagios \
            --with-nagios-group=nagcmd \
            --with-command-group=nagcmd \
            --prefix=/usr/local/nagios
make all
make install
make install-init
make install-config
make install-commandmode
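
After the make targets finish, it helps to confirm that the key artifacts landed under the prefix. A minimal sketch, assuming the /usr/local/nagios prefix used above; the function name and the short list of paths checked are illustrative rather than exhaustive.

```shell
# Report any expected file or directory missing under the install prefix.
# Usage: verify_nagios_install [prefix]; returns non-zero if anything is missing.
verify_nagios_install() {
    prefix=${1:-/usr/local/nagios}
    missing=0
    # bin/nagios comes from "make install", etc/nagios.cfg from
    # "make install-config", var/rw from "make install-commandmode".
    for item in bin/nagios etc/nagios.cfg var/rw; do
        if [ ! -e "$prefix/$item" ]; then
            echo "missing: $prefix/$item"
            missing=1
        fi
    done
    return "$missing"
}
```

Running verify_nagios_install with no arguments checks the default prefix and prints one line per missing item.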

2.3 Web Interface Configuration

Install the Nagios plugin set:

wget https://nagios-plugins.org/download/nagios-plugins-2.3.3.tar.gz
tar zxvf nagios-plugins-2.3.3.tar.gz
cd nagios-plugins-2.3.3
./configure --with-nagios-user=nagios \
            --with-nagios-group=nagcmd
make && make install

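Configure Apache to serve the web interface and CGI scripts behind basic authentication:

# /etc/httpd/conf.d/nagios.conf
ScriptAlias /nagios/cgi-bin "/usr/local/nagios/sbin"
<Directory "/usr/local/nagios/sbin">
    Options ExecCGI
    AllowOverride None
    AuthName "Nagios Access"
    AuthType Basic
    AuthUserFile /usr/local/nagios/etc/htpasswd.users
    Require valid-user
</Directory>
Alias /nagios "/usr/local/nagios/share"
<Directory "/usr/local/nagios/share">
    Options None
    AllowOverride All
    AuthName "Nagios Access"
    AuthType Basic
    AuthUserFile /usr/local/nagios/etc/htpasswd.users
    Require valid-user
</Directory>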

Create the web login user:

htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin
systemctl restart httpd

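3. Basic Monitoring Configuration in Practice

3.1 Host and Service Definitions

Edit /usr/local/nagios/etc/objects/hosts.cfg (the file must be loaded through a cfg_file= entry in nagios.cfg):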

define host{
    use                     linux-server
    host_name               web01
    alias                   Web Server
    address                 192.168.1.100
    max_check_attempts      5
    check_period            24x7
    notification_interval   30
    notification_period     24x7
}
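
When many hosts share the same settings, the definition can be generated rather than typed by hand. A minimal sketch: gen_host is a hypothetical helper, and the fixed values simply repeat those of the web01 example.

```shell
# Print a host definition; fixed values mirror the web01 example.
# Usage: gen_host <host_name> <alias> <address>
gen_host() {
    cat <<EOF
define host{
    use                     linux-server
    host_name               $1
    alias                   $2
    address                 $3
    max_check_attempts      5
    check_period            24x7
    notification_interval   30
    notification_period     24x7
}
EOF
}
```

For example, gen_host web02 "Web Server 2" 192.168.1.101 >> /usr/local/nagios/etc/objects/hosts.cfg appends a second host (the name and address here are made up for illustration).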

Define the service check in services.cfg:

define service{
    use                     generic-service
    host_name               web01
    service_description     HTTP
    check_command           check_http
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
}

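3.2 Plugin Extension Configuration

Use NRPE to run local checks on remote hosts:

# Install NRPE on the monitored node
yum install -y epel-release
yum install -y nrpe nagios-plugins-all

Configure /etc/nagios/nrpe.cfg:

# IP address of the Nagios server (kept on its own line: nrpe.cfg treats trailing text as part of the value)
allowed_hosts=192.168.1.10
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /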
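
Before wiring these commands into Nagios, it is worth running them by hand on the monitored node and reading the exit code, since Nagios derives the state from it (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal sketch; state_name and run_plugin are illustrative helper names.

```shell
# Map a plugin exit code to its Nagios state name.
# 3 is UNKNOWN by convention; any other code is also reported as UNKNOWN here.
state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Run a plugin command line, let its output through, then print the state.
run_plugin() {
    "$@"
    state_name "$?"
}
```

For example, run_plugin /usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p / prints the plugin output followed by the state it maps to.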

Define the check command on the Nagios server:

define command{
    command_name    check_nrpe
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
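
The check_nrpe command only takes effect once a service references it, passing the remote command name as $ARG1$ after the "!" separator. A minimal sketch that prints such a service definition; gen_nrpe_service is a hypothetical helper, and the generic-service template is assumed to exist as in the earlier HTTP example.

```shell
# Print a service definition that runs an NRPE command on a remote host.
# Usage: gen_nrpe_service <host_name> <service_description> <nrpe_command>
gen_nrpe_service() {
    cat <<EOF
define service{
    use                     generic-service
    host_name               $1
    service_description     $2
    check_command           check_nrpe!$3
}
EOF
}
```

For example, gen_nrpe_service web01 "CPU Load" check_load produces a service that runs the check_load command defined in nrpe.cfg on web01.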

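3.3 Alert Policy Tuning

Configure notification dependencies (router01 and the webservers host group must already be defined):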

define hostdependency{
    dependent_host_name      web01
    dependent_hostgroup_name webservers
    host_name               router01
    notification_failure_criteria d,u
}

Set up alert escalation rules:

define serviceescalation{
    host_name               web01
    service_description     HTTP
    first_notification      1
    last_notification       0
    notification_interval   10
    escalation_period       24x7
    escalation_options      c,r
    contact_groups          admins,managers
}
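
An escalation applies to notification numbers from first_notification through last_notification, and a last_notification of 0 means there is no upper bound. With the values above (1 and 0), every notification for the HTTP service therefore goes to admins and managers. The range test can be sketched as follows; escalation_applies is an illustrative name, not part of Nagios.

```shell
# Succeeds when notification number <n> falls inside an escalation's range.
# A last_notification of 0 means the escalation never expires.
# Usage: escalation_applies <n> <first_notification> <last_notification>
escalation_applies() {
    n=$1
    first=$2
    last=$3
    [ "$n" -ge "$first" ] || return 1
    [ "$last" -eq 0 ] || [ "$n" -le "$last" ]
}
```

For a rule with first_notification 3 and last_notification 5, the function succeeds for notifications 3, 4, and 5 only.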

4. Troubleshooting Common Operational Issues

4.1 Diagnosing Service Startup Failures

Check the log files:

tail -f /usr/local/nagios/var/nagios.log
journalctl -u nagios -n 50 --no-pager

Validate the configuration file syntax:

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
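
Since a restart with a broken configuration takes monitoring down, the validation step can gate the restart. A minimal sketch; reload_if_valid is a hypothetical helper that takes both commands as strings, and the systemd unit name nagios is assumed from the journalctl example above.

```shell
# Run a validation command; run the second command only if validation succeeds.
# Usage: reload_if_valid "<validate command>" "<reload command>"
reload_if_valid() {
    if sh -c "$1" >/dev/null 2>&1; then
        sh -c "$2"
    else
        echo "configuration invalid; reload skipped" >&2
        return 1
    fi
}
```

For example: reload_if_valid "/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg" "systemctl restart nagios".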

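4.2 Handling Plugin Execution Errors

Common errors and their fixes:

CHECK_NRPE: Error - Could not complete SSL handshake
Fix: confirm that allowed_hosts includes the Nagios server, then align the NRPE versions and TLS settings on both ends, or disable SSL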
```
# Add to nrpe.cfg
ssl_version=TLSv1.2
# Or disable SSL by starting both nrpe and check_nrpe with -n (test environments only)
```

Return code of 127 for check
Fix: 127 means the plugin could not be found, so verify the plugin path first, then its permissions

ls -l /usr/lib64/nagios/plugins/
chmod a+rx /usr/lib64/nagios/plugins/*
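
The shell reports exit code 127 when a command does not exist and 126 when it exists but cannot be executed, so the two cases need different fixes. A minimal sketch that tells them apart; probe_plugin is an illustrative helper name.

```shell
# Classify a plugin path the way the shell would when trying to execute it.
# Prints "missing" (the exit 127 case), "not-executable" (the exit 126 case), or "ok".
probe_plugin() {
    if [ ! -e "$1" ]; then
        echo missing
    elif [ ! -x "$1" ]; then
        echo not-executable
    else
        echo ok
    fi
}
```

For example, probe_plugin /usr/lib64/nagios/plugins/check_load shows whether the path used in nrpe.cfg actually resolves to an executable file.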

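4.3 Performance Tuning Recommendations

Adjust check intervals: set check_interval=10 for non-critical services

Enable object caching: set the following in nagios.cfg:

object_cache_file=/usr/local/nagios/var/objects.cache
precached_object_file=/usr/local/nagios/var/objects.precache

Implement distributed monitoring: aggregate data from multiple sites through NSCA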

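5. Exploring Advanced Features

5.1 Visualizing Monitoring Data

Integrate Grafana to display historical data:

Set up an InfluxDB time-series database
Use pnp4nagios or Graphite as the data source
Create custom dashboards for key metrics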

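5.2 Integration with Automated Operations

Automate through the CGI command interface:

# Submit a passive check result with curl (prompts for the nagiosadmin password)
curl -u nagiosadmin -X POST "http://nagios-server/nagios/cgi-bin/cmd.cgi" \
     -d "cmd_typ=30&cmd_mod=2&host=web01&service=HTTP&plugin_state=0&plugin_output=OK"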
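
On the Nagios server itself, the same passive result can be written straight to the external command file, which avoids the web layer. A minimal sketch, assuming check_external_commands=1 and the default command file location of a source install (/usr/local/nagios/var/rw/nagios.cmd); passive_result is a hypothetical helper that only formats the line.

```shell
# Format a passive service check result in external-command syntax.
# Usage: passive_result <host> <service> <return_code> <output>
passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3" "$4"
}
```

For example, passive_result web01 HTTP 0 "OK" > /usr/local/nagios/var/rw/nagios.cmd submits an OK result for the HTTP service on web01; the writing user needs to belong to the nagcmd group.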

5.3 Containerized Deployment

Deploy quickly with Docker (this minimal image starts only the Nagios daemon, not httpd):

FROM centos:7
RUN yum install -y epel-release && \
    yum install -y nagios nagios-plugins httpd php
COPY nagios.cfg /etc/nagios/
CMD ["/usr/sbin/nagios", "/etc/nagios/nagios.cfg"]

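With the steps in this guide, operations staff can work through the whole process from environment preparation to advanced configuration. Validate the configuration in a test environment before deploying to production, then expand monitoring coverage gradually. In large enterprise environments, consider integrating a CMDB for dynamic host discovery, or using configuration management tools to keep the monitoring configuration under version control.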

Nagios Monitoring System: Quick Deployment and Basic Configuration Guide