Deploying LDMS

Lightweight Distributed Metric Service (LDMS)

Installing LDMS on the client

Building and installing LDMS

Install the build dependencies:

# yum check-update; yum groupinstall -y 'Development Tools'; yum install -y git
# yum install -y autoconf automake libtool make bison flex gettext-devel libevent-devel openssl-devel python3-devel python36-Cython

Clone the LDMS source code and check out the OVIS-4.3.4 release:

# cd /usr/local/src
# git clone https://github.com/ovis-hpc/ovis.git
# cd ovis
# git checkout OVIS-4.3.4

Build and install from the source tree, with /opt/ovis as the installation prefix:

# cd /usr/local/src/ovis
# sh autogen.sh
# mkdir build
# cd build
# ../configure -h
# ../configure --prefix=/opt/ovis
# make
# make install

Confirm that LDMS was installed successfully:

# ls /opt/ovis/bin/
# ls /opt/ovis/sbin/
# ls /opt/ovis/lib/ovis-ldms/
# ls /opt/ovis/share/man/man7/
# ls /opt/ovis/lib/python3.6/site-packages/
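
The checks above can be bundled into a small helper. This is only a sketch and not part of LDMS: the function name check_ovis_install is hypothetical, and it tests just the four directories listed above (the Python site-packages path is left out because its version component may differ between systems).

```shell
# Hypothetical helper: report which expected LDMS install directories are missing.
# The directory list mirrors the ls commands above; the prefix is a parameter.
check_ovis_install() {
    prefix="${1:-/opt/ovis}"
    missing=0
    for sub in bin sbin lib/ovis-ldms share/man/man7; do
        if [ ! -d "$prefix/$sub" ]; then
            echo "missing: $prefix/$sub"
            missing=1
        fi
    done
    return "$missing"
}

# Prints nothing and returns 0 when every directory exists.
check_ovis_install /opt/ovis || echo "LDMS install looks incomplete"
```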

Starting LDMS from systemd

Create the directory that holds the configuration:

# cd /opt/ovis
# mkdir -p etc/ldms
# cd etc/ldms

Create sampler.conf and define the sampler settings:

env SAMPLE_INTERVAL=60000000
env SAMPLE_OFFSET=0
env MYHOST=$(eval hostname -s)
env COMPONENT_ID=$(echo ${MYHOST} | sed 's/cas//g')

load name=meminfo
config name=meminfo producer=${MYHOST} instance=${MYHOST}/meminfo component_id=${COMPONENT_ID}
start name=meminfo interval=${SAMPLE_INTERVAL} offset=${SAMPLE_OFFSET}
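
Two values in this file are computed rather than typed in: COMPONENT_ID is the short hostname with the "cas" prefix stripped, and the interval is expressed in microseconds. The sketch below reproduces both in plain shell so they can be checked before ldmsd is started; the function names are hypothetical and are not LDMS commands.

```shell
# Same sed expression as in sampler.conf: strip "cas" from the hostname.
component_id_from_host() {
    echo "$1" | sed 's/cas//g'
}

# ldmsd intervals are given in microseconds; convert from seconds.
seconds_to_usec() {
    echo $(( $1 * 1000000 ))
}

component_id_from_host cas034   # prints 034
seconds_to_usec 60              # prints 60000000
```

A hostname that does not follow the cas<number> pattern would leave non-numeric text in COMPONENT_ID, so the sed expression has to be adapted to the local naming scheme.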

Create ldmsd.sampler.env to set the environment variables that ldmsd needs:

# LDMS transport option (sock, rdma, or ugni)
LDMSD_XPRT=sock
# LDMS Daemon service port
LDMSD_PORT=411
# LDMS spank transport option (sock, rdma, or ugni)
# LDMSD_SPANKD_XPRT=sock
# LDMS spankd port
# LDMSD_SPANKD_PORT=10000
# LDMS memory allocation
LDMSD_MEM=128M
LDMSD_VERBOSE=ERROR
# Log file control. The default is to log to syslog.
# LDMSD_LOG_OPTION="-l /var/log/ldmsd.log"
# Authentication method (the unit file passes this to ldmsd with -a)
LDMSD_AUTH_PLUGIN=none
# Authentication options
#LDMSD_AUTH_OPTION="-A conf=/opt/ovis/etc/ldms/ldmsauth.conf"
# LDMS plugin configuration file, see /opt/ovis/etc/ldms/sampler.conf for an example
LDMSD_PLUGIN_CONFIG_FILE=/opt/ovis/etc/ldms/sampler.conf
# These are configured by configure script, no need to change.
LDMSD_PLUGIN_LIBPATH=/opt/ovis/lib/ovis-ldms
ZAP_LIBPATH=/opt/ovis/lib/ovis-ldms
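
The systemd unit that reads this file expands variables such as ${LDMSD_XPRT}, ${LDMSD_PORT}, and ${LDMSD_AUTH_PLUGIN} on the ldmsd command line, so an unset variable leaves an option without its argument. The following sketch checks an env file for unset variables. It assumes the file contains only simple KEY=VALUE lines that a shell can source, and the function name check_env_vars is hypothetical.

```shell
# Hypothetical helper: list variables that an env file leaves unset or empty.
# The body runs in a subshell so sourcing the file does not change the caller.
check_env_vars() (
    env_file="$1"; shift
    . "$env_file" || exit 2
    status=0
    for name in "$@"; do
        eval "value=\${$name-}"
        if [ -z "$value" ]; then
            echo "unset: $name"
            status=1
        fi
    done
    exit "$status"
)

# Example (path and variable list taken from this section):
# check_env_vars /opt/ovis/etc/ldms/ldmsd.sampler.env \
#     LDMSD_XPRT LDMSD_PORT LDMSD_AUTH_PLUGIN LDMSD_VERBOSE LDMSD_MEM
```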

Create /opt/ovis/etc/ldms/ldmsd.sampler.service so that the service can be started through systemd:

[Unit]
Description = LDMS Sampler Daemon
Documentation = http://ovis.ca.sandia.gov

[Service]
Type = forking
EnvironmentFile = /opt/ovis/etc/ldms/ldmsd.sampler.env
Environment = HOSTNAME=%H
ExecStartPre = /bin/mkdir -p /opt/ovis/var/run/ldmsd
ExecStartPre = -/bin/bash -c "test -n \"${LDMS_JOBINFO_DATA_FILE}\" && touch ${LDMS_JOBINFO_DATA_FILE} || touch /var/run/ldms_jobinfo.data"
ExecStart = /opt/ovis/sbin/ldmsd \
                -x ${LDMSD_XPRT}:${LDMSD_PORT} \
                -c ${LDMSD_PLUGIN_CONFIG_FILE} \
                -a ${LDMSD_AUTH_PLUGIN} \
                -v ${LDMSD_VERBOSE} \
                -m ${LDMSD_MEM} \
                $LDMSD_LOG_OPTION \
                $LDMSD_AUTH_OPTION \
                -r /opt/ovis/var/run/ldmsd/sampler.pid

[Install]
WantedBy = default.target

Link the unit file into the systemd unit directory:

# ln -sf /opt/ovis/etc/ldms/ldmsd.sampler.service /usr/lib/systemd/system/ldmsd.sampler.service

Reload the unit files, then start the service:

# systemctl daemon-reload
# systemctl start ldmsd.sampler

Environment file for the LDMS command-line tools

Create /etc/profile.d/ovis.sh to set the environment variables for the LDMS commands and Python modules:

export LD_LIBRARY_PATH=/opt/ovis/lib:$LD_LIBRARY_PATH
export PATH=/opt/ovis/bin/:/opt/ovis/sbin/:${PATH}
export PYTHONPATH=/opt/ovis/lib/python3.6/site-packages/:${PYTHONPATH}

Apply the environment variables:

# chmod +x /etc/profile.d/ovis.sh
# source /etc/profile.d/ovis.sh

After logging in again, confirm that the LDMS commands can be called directly:

# which ldms_ls
/opt/ovis/sbin/ldms_ls

Deploying LDMS on the server

Install the build dependencies:

# yum check-update; yum groupinstall -y 'Development Tools'; yum install -y git
# yum install -y autoconf automake libtool make bison flex gettext-devel libevent-devel openssl-devel python3-devel python36-Cython

Clone the LDMS source code and check out the OVIS-4.3.4 release:

# cd /usr/local/src
# git clone https://github.com/ovis-hpc/ovis.git
# cd ovis
# git checkout OVIS-4.3.4

Build and install from the source tree, with /opt/ovis as the installation prefix:

# cd /usr/local/src/ovis
# sh autogen.sh
# mkdir build
# cd build
# ../configure -h
# ../configure --prefix=/opt/ovis
# make
# make install

Create the directory that holds the configuration:

# cd /opt/ovis
# mkdir -p etc/ldms
# cd etc/ldms

Create ldmsd.aggregator.env to set the environment variables that ldmsd needs:

# This file contains environment variables for ldmsd.aggregator, which will affect
# ldmsd initial configuration (e.g. transport, named socket path)
# LDMS transport option (sock, rdma, or ugni)
LDMSD_XPRT=sock
# LDMS Daemon service port
LDMSD_PORT=412
# LDMS memory allocation
LDMSD_MEM=7G
# Number of event threads
LDMSD_NUM_THREADS=1
LDMSD_ULIMIT_NOFILE=100000
LDMSD_VERBOSE=ERROR
# Log file control. The default is to log to syslog.
LDMSD_LOG_OPTION="-l /var/log/ldmsd.log"

# Authentication method
LDMSD_AUTH_PLUGIN=none

# Authentication options
# LDMSD_AUTH_OPTION="-A conf=/opt/ovis/etc/ldms/ldmsauth.conf"

LDMSD_PLUGIN_CONFIG_FILE=/opt/ovis/etc/ldms/aggregator.conf

# These are configured by configure script, no need to change.
LDMSD_PLUGIN_LIBPATH=/opt/ovis/lib/ovis-ldms
ZAP_LIBPATH=/opt/ovis/lib/ovis-ldms

Create aggregator.conf for the server-side configuration. First add the clients as producers, using the OSS nodes as an example:

prdcr_add name=hwoss1 host=hwoss1 port=411 xprt=sock type=active interval=20000000
prdcr_add name=hwoss2 host=hwoss2 port=411 xprt=sock type=active interval=20000000
prdcr_add name=hwoss3 host=hwoss3 port=411 xprt=sock type=active interval=20000000
prdcr_add name=hwoss4 host=hwoss4 port=411 xprt=sock type=active interval=20000000
prdcr_add name=hwoss5 host=hwoss5 port=411 xprt=sock type=active interval=20000000
prdcr_add name=hwoss6 host=hwoss6 port=411 xprt=sock type=active interval=20000000
prdcr_add name=hwoss7 host=hwoss7 port=411 xprt=sock type=active interval=20000000
prdcr_add name=hwoss8 host=hwoss8 port=411 xprt=sock type=active interval=20000000
prdcr_add name=hwoss9 host=hwoss9 port=411 xprt=sock type=active interval=20000000
prdcr_add name=hwoss10 host=hwoss10 port=411 xprt=sock type=active interval=20000000
prdcr_start_regex regex=.*

Note: the interval parameter here is the interval, in microseconds, at which the producer connection is checked and a reconnect is attempted.
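
Because the producer lines differ only in the host number, they can be generated instead of typed, which avoids copy-and-paste mistakes in the name and host fields. The sketch below assumes hosts are named <prefix>1 through <prefix>N as in the example above; the function name gen_prdcr_lines is hypothetical.

```shell
# Emit one prdcr_add line per host <prefix>1 .. <prefix>N, then start them all.
# Defaults (port 411, 20 s reconnect interval) match the example above.
gen_prdcr_lines() {
    prefix="$1"; count="$2"; port="${3:-411}"; interval="${4:-20000000}"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "prdcr_add name=${prefix}${i} host=${prefix}${i} port=${port} xprt=sock type=active interval=${interval}"
        i=$((i + 1))
    done
    echo "prdcr_start_regex regex=.*"
}

gen_prdcr_lines hwoss 10
```

The output can be redirected into aggregator.conf or pasted into it by hand.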

Next, set the interval at which the server pulls updates, using a regular expression to match all producers:

updtr_add name=update_all interval=1000000 offset=500
updtr_prdcr_add name=update_all regex=.*
updtr_start name=update_all

Note: the server-side interval may be shorter than the sampler interval set on the client. The server checks the timestamp of the data it fetches, so the same sample is not written twice.

Finally, configure storage so that the data is written to a TimescaleDB time-series database:

load name=store_timescale
config name=store_timescale user=postgres pwfile=/root/password.txt hostaddr=172.16.0.190 port=5432 dbname=ldms

strgp_add name=meminfo_timescale plugin=store_timescale container=meminfo schema=meminfo
strgp_add name=procnetdev_timescale plugin=store_timescale container=procnetdev schema=procnetdev
strgp_add name=procstat_timescale plugin=store_timescale container=procstat schema=procstat
strgp_add name=loadavg_timescale plugin=store_timescale container=loadavg schema=loadavg

strgp_start name=meminfo_timescale
strgp_start name=procnetdev_timescale
strgp_start name=procstat_timescale
strgp_start name=loadavg_timescale
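
Each schema needs one strgp_add line and one strgp_start line that follow the same naming pattern. As a sketch, the pairs can be generated from a list of schema names; the function name gen_strgp_lines and the <schema>_<suffix> naming rule are taken from the example above and are otherwise an assumption.

```shell
# Emit strgp_add lines for every schema, a blank line, then the strgp_start lines.
# $1 = store plugin name, $2 = policy name suffix, remaining args = schema names.
gen_strgp_lines() {
    plugin="$1"; suffix="$2"; shift 2
    for schema in "$@"; do
        echo "strgp_add name=${schema}_${suffix} plugin=${plugin} container=${schema} schema=${schema}"
    done
    echo
    for schema in "$@"; do
        echo "strgp_start name=${schema}_${suffix}"
    done
}

gen_strgp_lines store_timescale timescale meminfo procnetdev procstat loadavg
```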

Create ldmsd.aggr.service so that the service can be started through systemd:

[Unit]
Description = LDMS Daemon
Documentation = http://ovis.ca.sandia.gov

[Service]
Type = forking
# systemd expands variables only in Exec* command lines, so the limit is written literally
LimitNOFILE = 100000
EnvironmentFile = /opt/ovis/etc/ldms/ldmsd.aggregator.env
Environment = HOSTNAME=%H
ExecStartPre = /bin/mkdir -p /opt/ovis/var/run/ldmsd

ExecStart = /opt/ovis/sbin/ldmsd \
                -x ${LDMSD_XPRT}:${LDMSD_PORT} \
                -c ${LDMSD_PLUGIN_CONFIG_FILE} \
                -a ${LDMSD_AUTH_PLUGIN} \
                -v ${LDMSD_VERBOSE} \
                -m ${LDMSD_MEM} \
                $LDMSD_LOG_OPTION \
                -P ${LDMSD_NUM_THREADS} \
                -r /opt/ovis/var/run/ldmsd/aggregator.pid

[Install]
WantedBy = default.target

Link the unit file into the systemd unit directory:

# ln -sf /opt/ovis/etc/ldms/ldmsd.aggr.service /usr/lib/systemd/system/ldmsd.aggr.service

Create /etc/profile.d/ovis.sh to set the environment variables for the LDMS commands and Python modules:

export LD_LIBRARY_PATH=/opt/ovis/lib:$LD_LIBRARY_PATH
export PATH=/opt/ovis/bin/:/opt/ovis/sbin/:${PATH}
export PYTHONPATH=/opt/ovis/lib/python3.6/site-packages/:${PYTHONPATH}

Apply the environment variables:

# chmod +x /etc/profile.d/ovis.sh
# source /etc/profile.d/ovis.sh

Reload the unit files, then start the service:

# systemctl daemon-reload
# systemctl start ldmsd.aggr

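Managing and loading LDMS plugins

LDMS acts as a metric collector (a sampler, similar to a client) or as a metric aggregator (similar to a server) depending on the plugins it loads.
Some of these plugins ship with LDMS, and others are contributed by other organizations.
Plugins can be loaded dynamically (the change is lost on restart) or through the configuration file (applied on every restart).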

List the available plugins in /opt/ovis/lib/ovis-ldms:

# ll /opt/ovis/lib/ovis-ldms/*.so

The man pages for common sampler plugins are in /opt/ovis/share/man/man7/.

IPMI Sampler

The IPMI plugin that ships with LDMS must be enabled at build time with the --enable-ipmireader option:

# cd /usr/local/src
# git clone https://github.com/ovis-hpc/ovis.git
# cd ovis
# git checkout OVIS-4.3.4
# sh autogen.sh
# mkdir build
# cd build
# ../configure --prefix=/opt/ovis --enable-ipmireader
# make
# make install

Check that the ipmireader libraries were installed:

# ls /opt/ovis/lib/ovis-ldms/libipmi*
/opt/ovis/lib/ovis-ldms/libipmireader.a /opt/ovis/lib/ovis-ldms/libipmisensors.a
/opt/ovis/lib/ovis-ldms/libipmireader.la /opt/ovis/lib/ovis-ldms/libipmisensors.la
/opt/ovis/lib/ovis-ldms/libipmireader.so /opt/ovis/lib/ovis-ldms/libipmisensors.so
/opt/ovis/lib/ovis-ldms/libipmireader.so.0 /opt/ovis/lib/ovis-ldms/libipmisensors.so.0
/opt/ovis/lib/ovis-ldms/libipmireader.so.0.0.0 /opt/ovis/lib/ovis-ldms/libipmisensors.so.0.0.0

Add ipmisensors to sampler.conf:

env SAMPLE_INTERVAL=60000000
env SAMPLE_OFFSET=0
env MYHOST=$(eval hostname -s)
env COMPONENT_ID=$(echo ${MYHOST} | sed 's/cas//g')
load name=ipmisensors
config name=ipmisensors producer=${MYHOST} instance=${MYHOST}/ipmisensors component_id=${COMPONENT_ID} address=localhost
start name=ipmisensors interval=${SAMPLE_INTERVAL} offset=${SAMPLE_OFFSET}

LLNL Lustre Sampler

Clone the LLNL LDMS plugin source code and check out version 1.12:

# git clone https://github.com/LLNL/ldms-plugins-llnl.git
# cd ldms-plugins-llnl
# git checkout 1.12
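
Enter the directory, then build and install the LLNL plugins. At configure time, specify the plugin installation prefix (the same as the LDMS prefix) and the LDMS prefix (so that the LDMS include and lib directories can be found):
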
# ./bootstrap
# mkdir build
# cd build
# ../configure --prefix=/opt/ovis --with-libldms-prefix=/opt/ovis
# make
# make install

Check that the LLNL plugins were installed:

# find /opt/ovis/ -iname '*llnl*.so'
/opt/ovis/lib/ovis-ldms/libllnl_lustre_client.so
/opt/ovis/lib/ovis-ldms/libllnl_lustre_ost.so
/opt/ovis/lib/ovis-ldms/libllnl_lustre_mdt.so

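The LLNL Lustre Sampler consists of llnl_lustre_client, llnl_lustre_mdt, and llnl_lustre_ost; configure the one that matches the role of the node.

The only option these three plugins take is interval, the sampling interval in microseconds. For example, on an MDS node: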

load name=llnl_lustre_mdt
config name=llnl_lustre_mdt
start name=llnl_lustre_mdt interval=60000000

The LLNL plugins all run with their default configuration; additional parameters are neither needed nor accepted.

In the server-side configuration file aggregator.conf, add storage settings for the data that comes from the LLNL plugins:

strgp_add name=llnl_lustre_ost_job_stats_csv plugin=store_csv container=llnl_lustre_ost_job_stats schema=llnl_lustre_ost_job_stats
strgp_add name=llnl_lustre_mdt_job_stats_csv plugin=store_csv container=llnl_lustre_mdt_job_stats schema=llnl_lustre_mdt_job_stats
strgp_add name=llnl_lustre_ost_csv plugin=store_csv container=llnl_lustre_ost schema=llnl_lustre_ost
strgp_add name=llnl_lustre_mdt_csv plugin=store_csv container=llnl_lustre_mdt schema=llnl_lustre_mdt
strgp_add name=llnl_lustre_client_csv plugin=store_csv container=llnl_lustre_client schema=llnl_lustre_client

strgp_start name=llnl_lustre_ost_job_stats_csv
strgp_start name=llnl_lustre_mdt_job_stats_csv
strgp_start name=llnl_lustre_ost_csv
strgp_start name=llnl_lustre_mdt_csv
strgp_start name=llnl_lustre_client_csv

The llnl_lustre_mdt plugin runs on MDS nodes and has two schemas, llnl_lustre_mdt and llnl_lustre_mdt_job_stats; the former reads from /proc/fs/lustre/mdt/*/stats and the latter from /proc/fs/lustre/mdt/*/job_stats.
The llnl_lustre_ost plugin runs on OSS nodes and has two schemas, llnl_lustre_ost and llnl_lustre_ost_job_stats; the former reads from /proc/fs/lustre/ost/*/stats and the latter from /proc/fs/lustre/ost/*/job_stats.
The llnl_lustre_client plugin runs on Lustre client nodes and has a single schema of the same name; it reads from /proc/fs/lustre/llite/*/stats.
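
The role-to-plugin mapping described here can be written down as a small lookup, which is handy when one script prepares sampler.conf for every node type. This is a sketch only: the role labels mds, oss, and client and both function names are hypothetical, while the plugin names and the three configuration lines follow the MDS example in this section.

```shell
# Map a node role to the LLNL Lustre plugin that should run on it.
lustre_plugin_for_role() {
    case "$1" in
        mds)    echo llnl_lustre_mdt ;;
        oss)    echo llnl_lustre_ost ;;
        client) echo llnl_lustre_client ;;
        *)      echo "unknown role: $1" >&2; return 1 ;;
    esac
}

# Print the three sampler.conf lines for a role; interval is in microseconds.
gen_lustre_sampler_conf() {
    plugin=$(lustre_plugin_for_role "$1") || return 1
    interval="${2:-60000000}"
    printf 'load name=%s\nconfig name=%s\nstart name=%s interval=%s\n' \
        "$plugin" "$plugin" "$plugin" "$interval"
}

gen_lustre_sampler_conf mds
```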

Writing LDMS data to persistent storage

The LDMS aggregator can use store plugins to write time-series data to CSV text files or to a database.

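Writing LDMS data to CSV files

The CSV store plugin (store_csv) is available in a default build. It records data line by line in a text file, much like a log.

Enable the CSV store plugin in the server-side configuration file: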

load name=store_csv
config name=store_csv path=/data/csv buffer=0

Common store_csv parameters:

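buffer=0: do not use the system buffer;
path=<PATH>: the storage directory, created automatically if it does not exist;
altheader=1: write the header to a separate file (default 0);
rolltype=1: split files by timestamp (default 0; enabling it is not recommended);
rollover=<NUMBER>: the split interval in seconds, only valid when rolltype is enabled.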

Configure storage for each sampler, using meminfo as an example:

strgp_add name=meminfo-store_csv plugin=store_csv container=meminfo schema=meminfo
strgp_start name=meminfo-store_csv

Inspecting LDMS monitoring data

List all metric set instances collected on this node:

# ldms_ls -h localhost -p 411

Show the attributes of one sampler plugin's metric set:

# ldms_ls -h localhost -p 411 -v cas034/meminfo
Schema Instance Flags Msize Dsize UID GID Perm Update Duration Info
-------------- ------------------------ ------ ------ ------ ------ ------ ---------- ----------------- ----------------- --------
meminfo cas034/meminfo CL 2488 440 0 0 -r--r----- 1603692532.053091 0.000517 "updt_hint_us"="1000000:50000"
-------------- ------------------------ ------ ------ ------ ------ ------ ---------- ----------------- ----------------- --------
Total Sets: 1, Meta Data (kB): 2.49, Data (kB) 0.44, Memory (kB): 2.93
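
The Update column appears to be a Unix timestamp in seconds with a microsecond fraction. Assuming that reading is right, the value can be turned into a readable time as sketched below. The function name epoch_to_utc is hypothetical, and the -d @SECONDS form requires GNU date.

```shell
# Convert an ldms_ls "Update" value (epoch seconds plus a fraction) to UTC.
# Assumes GNU date, which accepts -d @SECONDS.
epoch_to_utc() {
    seconds="${1%%.*}"    # drop the fractional part
    date -u -d "@${seconds}" '+%Y-%m-%d %H:%M:%S'
}

epoch_to_utc 1603692532.053091
```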

Show the data that one sampler plugin has collected:

# ldms_ls -h localhost -p 411 -l cas034/meminfo | head -10
cas034/meminfo: consistent, last update: Mon Oct 26 14:10:36 2020 +0800 [54415us]
M u64 component_id 34
D u64 job_id 0
D u64 app_id 0
D u64 MemTotal 196490388
D u64 MemFree 94002756
D u64 MemAvailable 94291512
D u64 Buffers 4148
D u64 Cached 701276
D u64 SwapCached 30152
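
Each metric line in this output has four whitespace-separated fields: a flag (M or D), a type, the metric name, and the value. That makes the output easy to post-process with awk. The sketch below extracts single metrics from a copy of the sample output above and derives the memory in use; the function name metric_value is hypothetical, and in practice the input would come from piping ldms_ls -l into it.

```shell
# Print the value of one metric from `ldms_ls -l` output read on stdin.
# Metric lines have four fields: flag, type, name, value.
metric_value() {
    awk -v name="$1" '$3 == name { print $4; exit }'
}

# Sample lines copied from the output above.
sample='cas034/meminfo: consistent, last update: Mon Oct 26 14:10:36 2020 +0800 [54415us]
M u64 component_id 34
D u64 MemTotal 196490388
D u64 MemFree 94002756'

total=$(printf '%s\n' "$sample" | metric_value MemTotal)
free=$(printf '%s\n' "$sample" | metric_value MemFree)
echo "MemTotal - MemFree (kB): $((total - free))"
```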

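Run ldms_ls --help for more details.

Some plugins need special parameters. For example, lustre2_client by default reads data from /proc/fs/lustre/{osc, mdc, llite} and /sys/kernel/debug/lustre/{osc, mdc, llite}; if the Lustre information is not under the default paths, set the osc_path, mdc_path, and llite_path parameters to point to it.

Because the full data set collected by lustre2_client is very large, the configuration below omits the osc and mdc parameters and sets only llite=*, which collects the information for scrfssjtu under /sys/kernel/debug/lustre/llite: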

load name=lustre2_client
config name=lustre2_client producer=${PRODUCER} component_id=${COMPONENT_ID} instance=${PRODUCER}/lustre2_client schema=lustre2_client llite=* job_set=${PRODUCER}/jobinfo
start name=lustre2_client interval=${SAMPLE_INTERVAL} offset=${SAMPLE_OFFSET}

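Loading plugins dynamically with ldmsd_controller

The following uses the meminfo plugin as an example of how to load a plugin dynamically in LDMS.

Connect to the local ldmsd service with ldmsd_controller: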

# ldmsd_controller --host localhost --port 411
Welcome to the LDMSD control processor
sock:localhost:411> help
...

Use help to get help information:

sock:localhost:411> help
...
sock:localhost:411> help prdcr_add
...

Use status to see the plugins that LDMS currently has loaded; the list is empty by default:

sock:localhost:411> status
Name Type Interval Offset Libpath
------------ ------------ ------------ ------------ ------------
Name Host Port Transport State
---------------- ---------------- ------------ ------------ ------------
Name Interval Offset Mode State
---------------- ------------ ------------ --------------- ------------
Name Container Schema Plugin State
---------------- ---------------- ---------------- ---------------- ------------

Use load to load the meminfo plugin:

sock:localhost:411> load name=meminfo

Use config to set the plugin parameters. The key parameters are producer (the node the data comes from), instance (the name of the metric set, conventionally hostname/sampler_name), and component_id (the numeric ID of the node):

sock:localhost:411> config name=meminfo producer=cas005 instance=cas005/meminfo component_id=1

Use start to begin collecting data. The sampling interval is given here, in microseconds; this command sets it to 60 seconds:

sock:localhost:411> start name=meminfo interval=60000000 offset=0

Confirm that the plugin is loaded. The output shows the plugin name, plugin type, sampling interval, sampling offset, and library path:

sock:localhost:411> status
Name Type Interval Offset Libpath
------------ ------------ ------------ ------------ ------------
meminfo sampler 60000000 0 /opt/ovis/lib/ovis-ldms/libmeminfo.so
Name Host Port Transport State
---------------- ---------------- ------------ ------------ ------------
Name Interval Offset Mode State
---------------- ------------ ------------ --------------- ------------
Name Container Schema Plugin State
---------------- ---------------- ---------------- ---------------- ------------

Exit the controller console and use ldms_ls again to list the metric sets on the node:

# ldms_ls -h localhost -x sock -p 411
cas005/meminfo

Show the attributes of the metric set:

# ldms_ls -h localhost -x sock -p 411 -v cas005/meminfo
Schema Instance Flags Msize Dsize UID GID Perm Update Duration Info
-------------- ------------------------ ------ ------ ------ ------ ------ ---------- ----------------- ----------------- --------
meminfo cas005/meminfo CL 2488 440 0 0 -r--r----- 1606886760.013959 0.000286 "updt_hint_us"="60000000:0"
-------------- ------------------------ ------ ------ ------ ------ ------ ---------- ----------------- ----------------- --------
Total Sets: 1, Meta Data (kB): 2.49, Data (kB) 0.44, Memory (kB): 2.93

Show the metrics collected in the set:

# ldms_ls -h localhost -x sock -p 411 -l cas005/meminfo
M u64 component_id 1
D u64 jobid 0
D u64 app_id 0
D u64 MemTotal 196490388
D u64 MemFree 191586092
D u64 MemAvailable 192571340
D u64 Buffers 5732
D u64 Cached 1534260
...

References