node_exporter metrics

node_exporter exposes a bazillion metrics, most of which will never be used, so why collect them? I tried to go through all of them and make some notes.

In my case I'm on a single virtual machine with Kafka, so no Docker, Kubernetes, etc., just a good old plain virtual machine.

Current version: look for node_exporter-*.linux-amd64.tar.gz (1.3.1 at the moment).

sudo wget -O /opt/node_exporter-1.3.1.linux-amd64.tar.gz
sudo tar -xzf /opt/node_exporter-1.3.1.linux-amd64.tar.gz -C /opt
sudo rm /opt/node_exporter-1.3.1.linux-amd64.tar.gz
sudo ln -s /opt/node_exporter-1.3.1.linux-amd64 /opt/node_exporter

Add it to PATH:

sudo tee /etc/environment > /dev/null <<EOT
source /etc/environment

From now on we can simply run node_exporter and open http://localhost:9100/metrics
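The endpoint returns plain text, one sample per line, so it is easy to poke at from a script. Below is a minimal sketch in Python, assuming the URL above. parse_metrics and fetch_metrics are hypothetical helpers of mine, not part of node_exporter, and the parser only handles the simple `name{labels} value` lines shown in these notes.

```python
import re
import urllib.request

# name, optional {labels}, then a single value token
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# key="value" pairs inside the braces (escaped characters are kept as-is)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) for every sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # this sketch ignores anything it does not understand
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


def fetch_metrics(url="http://localhost:9100/metrics"):
    """Download and parse the endpoint (assumes node_exporter is running)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling fetch_metrics() on the machine running node_exporter should give a list that can be filtered by metric name.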

To see the available options, run node_exporter --help

To run it as a service:

sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOT

ExecStart=/opt/node_exporter/node_exporter --collector.disable-defaults --web.max-requests=10 --web.disable-exporter-metrics



  • it should not be run as the root user, but for the sake of simplicity it is left as is
  • if no command-line arguments are passed, all default collectors are enabled; to run it with nothing enabled, use --collector.disable-defaults --web.disable-exporter-metrics
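With the defaults disabled, each collector you do want is switched back on with its own --collector.<name> flag. Here is a small sketch that builds such a command line. exec_start is a hypothetical helper, and the collector list is only an example, not a recommendation.

```python
def exec_start(binary, collectors):
    """Build a node_exporter command line with only the given collectors enabled."""
    flags = ["--collector.disable-defaults", "--web.disable-exporter-metrics"]
    # one --collector.<name> flag per collector we want back
    flags += [f"--collector.{name}" for name in collectors]
    return " ".join([binary] + flags)


# example selection, adjust to whatever you decide to keep
print(exec_start("/opt/node_exporter/node_exporter", ["cpu", "filesystem", "meminfo"]))
```

The resulting string is what would go after ExecStart= in the unit file.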

Chrome Prometheus Formatter extension

To make it easier to discover what is available, there is a Prometheus Formatter Chrome extension which colorizes the endpoint output.

prometheus formatter

sources can be found here


Used in most tutorials to calculate the CPU usage percentage.

node_cpu_seconds_total{cpu="0",mode="idle"} 418594.7
node_cpu_seconds_total{cpu="0",mode="iowait"} 569.35
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 10.64
node_cpu_seconds_total{cpu="0",mode="softirq"} 892.53
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 4018.45
node_cpu_seconds_total{cpu="0",mode="user"} 11230.68
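These are counters, so a single reading says little. The usage percentage comes from the difference between two readings: the share of elapsed CPU time that was not idle. A minimal sketch of that arithmetic follows. cpu_usage_percent is a hypothetical helper, and the second reading is made up for illustration.

```python
def cpu_usage_percent(before, after):
    """Percent of CPU time not spent idle between two readings.

    before/after map mode -> node_cpu_seconds_total value for one CPU.
    """
    total = sum(after[mode] - before.get(mode, 0.0) for mode in after)
    if total <= 0:
        return 0.0  # no time elapsed (or counters were reset)
    idle = after["idle"] - before.get("idle", 0.0)
    return 100.0 * (1.0 - idle / total)


# first reading: values from above (some modes left out for brevity)
before = {"idle": 418594.7, "system": 4018.45, "user": 11230.68}
# second reading: hypothetical values a little later
after = {"idle": 418608.2, "system": 4018.95, "user": 11231.68}
print(f"{cpu_usage_percent(before, after):.1f}%")
```

Treating iowait as busy or idle is a choice. This sketch counts everything except idle as busy.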


Watch for free disk space.

node_filesystem_avail_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 7.015657472e+09
node_filesystem_size_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 1.0222829568e+10
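The two values only become useful as a ratio. Here is a sketch of the calculation using the sample values above. free_percent is a hypothetical helper, and the 10% threshold is an arbitrary assumption.

```python
def free_percent(avail_bytes, size_bytes):
    """Percent of the filesystem still available to non-root users."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return 100.0 * avail_bytes / size_bytes


avail = 7.015657472e+09   # node_filesystem_avail_bytes for /
size = 1.0222829568e+10   # node_filesystem_size_bytes for /
percent = free_percent(avail, size)
print(f"{percent:.1f}% free on /")
if percent < 10.0:  # arbitrary threshold, pick your own
    print("disk is getting full")
```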


Unfortunately there are no queue metrics, but we can still watch the number of I/O operations currently in progress, or compare reads vs. writes by count, bytes and time.

node_disk_io_now{device="sda"} 0


Technically, if the number of allocated file descriptors reaches the maximum, the system will die, but I'm not sure how big an app would have to be for that to happen.

node_filefd_allocated 2656
node_filefd_maximum 9.223372036854776e+18


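System load average over the last 1, 5 and 15 minutes, a rough indicator of CPU usage (compare it with the number of cores).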

node_load1 0.06
node_load5 0.03
node_load15 0
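A load value only means something relative to the number of cores. A tiny sketch, assuming the script runs on the monitored machine so that os.cpu_count() matches it. load_per_core is a hypothetical helper.

```python
import os


def load_per_core(load, cores=None):
    """Load average divided by the core count; above 1.0 means work is queuing."""
    # fall back to the local core count, or 1 if it cannot be determined
    cores = cores or os.cpu_count() or 1
    return load / cores


print(load_per_core(0.06))  # node_load1 from above, local core count
```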


Technically this can be used to watch SSH sessions, e.g. imagine you are paranoid and want an alert whenever an SSH connection is established.

node_logind_sessions{class="user",remote="true",seat="",type="tty"} 2


All the memory-related stuff, at the very least to make sure memory does not become a bottleneck.

node_memory_MemFree_bytes 1.54710016e+08
node_memory_MemTotal_bytes 3.108298752e+09


Can be used to watch incoming and outgoing traffic.

node_network_receive_bytes_total{device="ens160"} 5.79413802e+09
node_network_transmit_bytes_total{device="ens160"} 8.744797443e+09
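Both values are byte counters since boot, so traffic is the difference between two readings divided by the time between them. A sketch of that follows. throughput_bits_per_second is a hypothetical helper, and treating a negative delta as a counter reset is my assumption about how to handle reboots.

```python
def throughput_bits_per_second(prev, curr, interval_seconds):
    """Receive/transmit rate in bits per second from two counter readings.

    prev/curr map "receive"/"transmit" -> node_network_*_bytes_total value.
    """
    rates = {}
    for direction in ("receive", "transmit"):
        delta = curr[direction] - prev[direction]
        if delta < 0:
            # counter went backwards (e.g. reboot): count from zero again
            delta = curr[direction]
        rates[direction] = delta * 8 / interval_seconds
    return rates
```

For example, two readings 15 seconds apart that differ by 1500 received bytes give 800 bits per second.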


Theoretically, if you have a bazillion servers, this can be useful for spotting outdated ones.

node_os_info{build_id="",id="ubuntu",id_like="debian",image_id="",image_version="",name="Ubuntu",pretty_name="Ubuntu 20.04.3 LTS",variant="",variant_id="",version="20.04.3 LTS (Focal Fossa)",version_codename="focal",version_id="20.04"} 1
node_os_version{id="ubuntu",id_like="debian",name="Ubuntu"} 20.04


Can be used to watch for a lack of server resources (CPU, IO, memory).

# counter Total time in seconds that processes have waited for CPU time
node_pressure_cpu_waiting_seconds_total 5063.650201
# counter Total time in seconds no process could make progress due to IO congestion
node_pressure_io_stalled_seconds_total 586.542031
# counter Total time in seconds that processes have waited due to IO congestion
node_pressure_io_waiting_seconds_total 633.69588
# counter Total time in seconds no process could make progress due to memory congestion
node_pressure_memory_stalled_seconds_total 0.08806399999999999
# counter Total time in seconds that processes have waited for memory
node_pressure_memory_waiting_seconds_total 0.589262
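Since these counters hold seconds spent waiting, the growth between two readings divided by the elapsed time is the share of that period lost to waiting. A minimal sketch, where stalled_percent is a hypothetical helper and the second reading in the example is made up:

```python
def stalled_percent(prev_seconds, curr_seconds, interval_seconds):
    """Share of the interval (0..100) spent waiting, from two counter readings."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # a negative difference would mean a counter reset; treat it as no waiting
    waited = max(curr_seconds - prev_seconds, 0.0)
    return min(100.0 * waited / interval_seconds, 100.0)


# node_pressure_cpu_waiting_seconds_total from above, then a hypothetical
# reading taken 60 seconds later
print(stalled_percent(5063.650201, 5066.650201, 60))
```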


From man ps:

  • D  uninterruptible sleep (usually IO)
  • I  idle kernel thread
  • R  running or runnable (on run queue)
  • S  interruptible sleep (waiting for an event to complete)
  • T  stopped by job control signal
  • t  stopped by debugger during the tracing
  • W  paging (not valid since the 2.6.xx kernel)
  • X  dead (should never be seen)
  • Z  defunct ("zombie") process, terminated but not reaped by its parent

These can be seen with ps -o pid,state,command

Technically this can be used to watch for spikes in the number of processes, and also for zombie processes.

node_processes_state{state="I"} 70
node_processes_state{state="R"} 1
node_processes_state{state="S"} 120

node_processes_threads_state{thread_state="I"} 70
node_processes_threads_state{thread_state="R"} 1
node_processes_threads_state{thread_state="S"} 315
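To cross-check these numbers by hand, the same per-state counts can be rebuilt from ps output. A sketch, assuming the text comes from something like `ps -eo state=` (one state letter per line). count_states is a hypothetical helper.

```python
from collections import Counter


def count_states(ps_output):
    """Count processes per state letter, one process per line of input."""
    counts = Counter()
    for line in ps_output.splitlines():
        line = line.strip()
        if line:
            counts[line[0]] += 1  # first character is the state letter
    return counts


sample = "S\nS\nR\nI\nZ\n"  # made-up output for illustration
states = count_states(sample)
if states["Z"] > 0:
    print(f"{states['Z']} zombie process(es) found")
```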


Can be used to watch for services stuck waiting for IO

node_procs_blocked 0


Can be used to monitor that required services are up and running

node_systemd_unit_state{name="zookeeper.service",state="activating",type="simple"} 0
node_systemd_unit_state{name="zookeeper.service",state="active",type="simple"} 1
node_systemd_unit_state{name="zookeeper.service",state="deactivating",type="simple"} 0
node_systemd_unit_state{name="zookeeper.service",state="failed",type="simple"} 0
node_systemd_unit_state{name="zookeeper.service",state="inactive",type="simple"} 0
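Each unit gets one line per possible state, and the line with value 1 is the current state. A sketch of picking it out, with the samples given as (labels, value) pairs. current_state is a hypothetical helper.

```python
def current_state(samples, unit):
    """Return the state whose sample equals 1 for the given unit, or None."""
    for labels, value in samples:
        if labels.get("name") == unit and value == 1:
            return labels.get("state")
    return None


# the zookeeper.service lines from above, as (labels, value) pairs
samples = [
    ({"name": "zookeeper.service", "state": "activating"}, 0),
    ({"name": "zookeeper.service", "state": "active"}, 1),
    ({"name": "zookeeper.service", "state": "deactivating"}, 0),
    ({"name": "zookeeper.service", "state": "failed"}, 0),
    ({"name": "zookeeper.service", "state": "inactive"}, 0),
]
if current_state(samples, "zookeeper.service") != "active":
    print("zookeeper.service is not running")
```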