Kafka JMX Prometheus Metrics
At the moment of writing, a minimal Docker example looks like this:
docker run -it --rm --name=kafka -p 9092:9092 \
-e CLUSTER_ID=DDFn9qNCTrin1hUV7Rd6nQ \
-e KAFKA_NODE_ID=1 \
-e KAFKA_PROCESS_ROLES=broker,controller \
-e KAFKA_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093 \
-e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092 \
-e KAFKA_CONTROLLER_QUORUM_VOTERS=1@localhost:9093 \
-e KAFKA_CONTROLLER_LISTENER_NAMES=CONTROLLER \
-e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 \
-e KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS=0 \
-e KAFKA_TRANSACTION_STATE_LOG_MIN_ISR=1 \
-e KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR=1 \
confluentinc/cp-kafka:7.9.0
To check that everything is OK, run the following commands:
# create topic
docker exec -it kafka kafka-topics --bootstrap-server localhost:9092 --create --topic topic1
# consume
docker exec -it kafka kafka-console-consumer --bootstrap-server localhost:9092 --topic topic1
# produce
docker exec -it kafka kafka-console-producer --bootstrap-server localhost:9092 --topic topic1
Metrics
Minimally, to get up and running we need the following
jmx.yml
rules:
- pattern: '.*'
plus the following environment variables:
KAFKA_JMX_PORT=9101
KAFKA_JMX_HOSTNAME=localhost
KAFKA_OPTS=-javaagent:/usr/share/java/cp-base-new/jmx_prometheus_javaagent-0.18.0.jar=9999:/etc/kafka/jmx.yml
The last one tells the JVM to use the jmx_prometheus_javaagent
that is baked into the container; the agent will listen on port 9999
and be configured by /etc/kafka/jmx.yml,
which we need to mount, i.e.:
docker run -it --rm --name=kafka -p 9092:9092 -p 9101:9101 -p 9999:9999 \
-e CLUSTER_ID=DDFn9qNCTrin1hUV7Rd6nQ \
-e KAFKA_NODE_ID=1 \
-e KAFKA_PROCESS_ROLES=broker,controller \
-e KAFKA_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093 \
-e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092 \
-e KAFKA_CONTROLLER_QUORUM_VOTERS=1@localhost:9093 \
-e KAFKA_CONTROLLER_LISTENER_NAMES=CONTROLLER \
-e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 \
-e KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS=0 \
-e KAFKA_TRANSACTION_STATE_LOG_MIN_ISR=1 \
-e KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR=1 \
-e KAFKA_JMX_PORT=9101 \
-e KAFKA_JMX_HOSTNAME=localhost \
-e KAFKA_OPTS=-javaagent:/usr/share/java/cp-base-new/jmx_prometheus_javaagent-0.18.0.jar=9999:/etc/kafka/jmx.yml \
-v "$PWD/jmx.yml:/etc/kafka/jmx.yml" \
confluentinc/cp-kafka:7.9.0
note: we are also exposing the JMX port 9101;
you may want to connect to it with jconsole
to see which metrics are available - that helps while configuring the Prometheus agent
and your metrics should be available at http://localhost:9999/metrics
Detailed docs about the metrics can be found here: https://docs.confluent.io/platform/current/kafka/monitoring.html
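To get a feel for what the agent returns, here is a small sketch that parses the Prometheus text exposition format. The payload is hard-coded (with hypothetical values) so the snippet runs without a live broker; against a real setup you would fetch http://localhost:9999/metrics instead.

```python
# Minimal sketch: parse a few lines of the Prometheus text exposition
# format, as served by the JMX exporter agent. The sample payload below
# is hard-coded so this runs without a live broker.
sample = """\
# HELP kafka_server_disk_write_bytes_total The total number of bytes written by the broker process
# TYPE kafka_server_disk_write_bytes_total counter
kafka_server_disk_write_bytes_total 12345.0
# HELP jvm_threads_current Current thread count of a JVM
# TYPE jvm_threads_current gauge
jvm_threads_current 71.0
"""

def parse_metrics(text):
    """Return {metric_name: value} for non-comment sample lines."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        name, value = line.rsplit(" ", 1)
        metrics[name] = float(value)
    return metrics

metrics = parse_metrics(sample)
print(metrics["kafka_server_disk_write_bytes_total"])  # 12345.0
```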
Rules
By default, our config exposes all metrics as-is, which produces a bazillion metrics. Some of them look weird, it is not always easy to tell whether a metric is a gauge or a counter, and after all we do not need all of them.
Instead of
rules:
- pattern: '.*'
we may create dedicated rules.
Let's start with the simplest possible one:
rules:
# https://docs.confluent.io/platform/current/kafka/monitoring.html#linux-disk-write-bytes
# mbean: kafka.server:type=KafkaServer,name=linux-disk-write-bytes
- pattern: kafka.server<type=KafkaServer, name=linux-disk-write-bytes><>Value
name: kafka_server_disk_write_bytes_total
help: The total number of bytes written by the broker process, including writes from all disks
type: COUNTER
- pattern: kafka.server<type=Fetch><>queue-size
# - pattern: '.*' # uncomment me to see the rest
notes:
- even though we have defined the exact metrics we want, the exporter will still produce some JVM- and process-level metrics
- keep an eye on metric names: in the example above, kafka_server_disk_write_bytes would be incorrect - counters want the _total suffix
- keep an eye on the generated comments and be careful when choosing COUNTER vs GAUGE
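The COUNTER vs GAUGE choice matters downstream: Prometheus derives per-second rates from counters, while gauges are read as-is. A rough sketch with two hypothetical scrapes:

```python
# Sketch of why the COUNTER/GAUGE choice matters: Prometheus computes
# per-second rates from counters, while gauges are read as-is.
# Two hypothetical scrapes of a counter, 15 seconds apart:
samples = [(0.0, 1_000.0), (15.0, 1_600.0)]  # (timestamp, value)

(t0, v0), (t1, v1) = samples
rate = (v1 - v0) / (t1 - t0)  # roughly what rate() would report
print(rate)  # 40.0 (units per second)
```

If a monotonically increasing value is mislabeled as a gauge, such rate queries stop working as expected.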
Regex
There is a way to catch multiple metrics by using regular expressions.
For example, if we want to keep track of controller metrics, we may do something like this:
rules:
# https://docs.confluent.io/platform/current/kafka/monitoring.html#activecontrollercount
# kafka.controller:type=KafkaController,name=ActiveControllerCount
- pattern: kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value
name: kafka_controller_active_controller_count
help: The number of active controllers in the cluster. Valid values are 0 or 1. Alert if the aggregated sum across all brokers in the cluster is anything other than 1 because there should be exactly one controller per cluster.
type: GAUGE
# https://docs.confluent.io/platform/current/kafka/monitoring.html#activebrokercount
# kafka.controller:type=KafkaController,name=ActiveBrokerCount
- pattern: kafka.controller<type=KafkaController, name=ActiveBrokerCount><>Value
name: kafka_controller_active_broker_count
help: The number of active brokers as observed by this controller.
type: GAUGE
it will produce:
# HELP kafka_controller_active_controller_count The number of active controllers in the cluster. Valid values are 0 or 1. Alert if the aggregated sum across all brokers in the cluster is anything other than 1 because there should be exactly one controller per cluster.
# TYPE kafka_controller_active_controller_count gauge
kafka_controller_active_controller_count 1.0
# HELP kafka_controller_active_broker_count The number of active brokers as observed by this controller.
# TYPE kafka_controller_active_broker_count gauge
kafka_controller_active_broker_count 1.0
or, instead, we may do:
rules:
- pattern: kafka.controller<type=KafkaController, name=(.+)><>Value
name: kafka_controller_$1_count
type: GAUGE
which will give us:
# HELP kafka_controller_LastAppliedRecordOffset_count Attribute exposed for management kafka.controller:name=LastAppliedRecordOffset,type=KafkaController,attribute=Value
# TYPE kafka_controller_LastAppliedRecordOffset_count gauge
kafka_controller_LastAppliedRecordOffset_count 4165.0
...
# HELP kafka_controller_EventQueueOperationsStartedCount_count Attribute exposed for management kafka.controller:name=EventQueueOperationsStartedCount,type=KafkaController,attribute=Value
# TYPE kafka_controller_EventQueueOperationsStartedCount_count gauge
kafka_controller_EventQueueOperationsStartedCount_count 9631.0
if we do not want all of them we may do:
rules:
- pattern: kafka.controller<type=KafkaController, name=(ActiveControllerCount|ActiveBrokerCount)><>Value
name: kafka_controller_$1_count
type: GAUGE
and output will be:
# HELP kafka_controller_ActiveControllerCount_count Attribute exposed for management kafka.controller:name=ActiveControllerCount,type=KafkaController,attribute=Value
# TYPE kafka_controller_ActiveControllerCount_count gauge
kafka_controller_ActiveControllerCount_count 1.0
# HELP kafka_controller_ActiveBrokerCount_count Attribute exposed for management kafka.controller:name=ActiveBrokerCount,type=KafkaController,attribute=Value
# TYPE kafka_controller_ActiveBrokerCount_count gauge
kafka_controller_ActiveBrokerCount_count 1.0
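The pattern/$1 mechanics can be simulated with Python regexes. This is an approximation of how the exporter rewrites names, not its exact code; the mBean strings mirror the form shown in the rules above:

```python
import re

# Sketch of jmx_exporter's pattern/$1 substitution, simulated with a
# Python regex. An approximation, not the exporter's actual code.
pattern = r"kafka\.controller<type=KafkaController, name=(ActiveControllerCount|ActiveBrokerCount)><>Value"
name_template = "kafka_controller_$1_count"

def rewrite(mbean):
    m = re.fullmatch(pattern, mbean)
    if m is None:
        return None  # rule does not apply; exporter falls through to the next rule
    return name_template.replace("$1", m.group(1))

print(rewrite("kafka.controller<type=KafkaController, name=ActiveBrokerCount><>Value"))
# kafka_controller_ActiveBrokerCount_count
print(rewrite("kafka.controller<type=KafkaController, name=LastAppliedRecordOffset><>Value"))
# None
```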
To lowercase metric names, do this:
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
- pattern: kafka.controller<type=KafkaController, name=(ActiveControllerCount|ActiveBrokerCount)><>Value
name: kafka_controller_$1_count
type: GAUGE
# HELP kafka_controller_activecontrollercount_count Attribute exposed for management kafka.controller:name=ActiveControllerCount,type=KafkaController,attribute=Value
# TYPE kafka_controller_activecontrollercount_count gauge
kafka_controller_activecontrollercount_count 1.0
# HELP kafka_controller_activebrokercount_count Attribute exposed for management kafka.controller:name=ActiveBrokerCount,type=KafkaController,attribute=Value
# TYPE kafka_controller_activebrokercount_count gauge
kafka_controller_activebrokercount_count 1.0
note: so technically it is a question of whether you want dedicated help messages per metric or whether you want to do everything at once
Labels
There is also a way to attach labels to metrics, like so:
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
- pattern: kafka.controller<type=KafkaController, name=(ActiveControllerCount|ActiveBrokerCount)><>Value
name: kafka_controller_count
labels:
type: $1
type: GAUGE
# HELP kafka_controller_count Attribute exposed for management kafka.controller:name=ActiveBrokerCount,type=KafkaController,attribute=Value
# TYPE kafka_controller_count gauge
kafka_controller_count{type="ActiveBrokerCount",} 1.0
kafka_controller_count{type="ActiveControllerCount",} 1.0
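The point of labels is that consumers can filter and aggregate on them. A sketch that parses the labeled samples above (hard-coded, so no broker is needed) and selects by label:

```python
import re

# Sketch: once metrics carry labels, consumers can filter/aggregate on
# them. Parse labeled samples (hard-coded, so no broker is needed).
sample = '''\
kafka_controller_count{type="ActiveBrokerCount",} 1.0
kafka_controller_count{type="ActiveControllerCount",} 1.0
'''

def parse_labeled(text):
    """Return a list of (name, labels_dict, value) tuples."""
    out = []
    for line in text.strip().splitlines():
        m = re.match(r'(\w+)\{(.*?)\}\s+(\S+)', line)
        labels = dict(re.findall(r'(\w+)="([^"]*)"', m.group(2)))
        out.append((m.group(1), labels, float(m.group(3))))
    return out

samples = parse_labeled(sample)
active = [v for name, labels, v in samples if labels["type"] == "ActiveControllerCount"]
print(sum(active))  # 1.0
```

In Prometheus itself this kind of selection would be a label matcher, e.g. `kafka_controller_count{type="ActiveControllerCount"}`.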
You may really want labels for topic-related metrics.
Here is a slightly different set of commands to run:
# start kafka
docker run -it --rm --name=kafka -p 9999:9999 \
-e CLUSTER_ID=DDFn9qNCTrin1hUV7Rd6nQ \
-e KAFKA_NODE_ID=1 \
-e KAFKA_PROCESS_ROLES=broker,controller \
-e KAFKA_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093 \
-e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092 \
-e KAFKA_CONTROLLER_QUORUM_VOTERS=1@localhost:9093 \
-e KAFKA_CONTROLLER_LISTENER_NAMES=CONTROLLER \
-e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 \
-e KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS=0 \
-e KAFKA_TRANSACTION_STATE_LOG_MIN_ISR=1 \
-e KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR=1 \
-e KAFKA_JMX_PORT=9101 \
-e KAFKA_JMX_HOSTNAME=localhost \
-e KAFKA_OPTS=-javaagent:/usr/share/java/cp-base-new/jmx_prometheus_javaagent-0.18.0.jar=9999:/etc/kafka/jmx.yml \
-v "$PWD/jmx.yml:/etc/kafka/jmx.yml" \
confluentinc/cp-kafka:7.9.0
# create topic
docker run -it --rm --link=kafka confluentinc/cp-kafka:7.9.0 kafka-topics --bootstrap-server kafka:9092 --create --topic topic1
# start consumer
docker run -it --rm --link=kafka confluentinc/cp-kafka:7.9.0 kafka-console-consumer --bootstrap-server kafka:9092 --topic topic1
# start producer
docker run -it --rm --link=kafka confluentinc/cp-kafka:7.9.0 kafka-console-producer --bootstrap-server kafka:9092 --topic topic1
# type something in producer, press enter, repeat few times
# then create few more topics and repeat steps
Let's try to expose MessagesInPerSec, BytesInPerSec, and BytesOutPerSec per topic:
rules:
# https://docs.confluent.io/platform/current/kafka/monitoring.html#messagesinpersec
# kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic={topicName}
- pattern: kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec, topic=(.+)><>Count
name: kafka_messages_total
help: The incoming message rate per topic.
labels:
topic: $1
type: COUNTER
# https://docs.confluent.io/platform/current/kafka/monitoring.html#bytesinpersec
# kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic={topicName}
- pattern: kafka.server<type=BrokerTopicMetrics, name=BytesInPerSec, topic=(.+)><>Count
name: kafka_incoming_bytes_total
help: The incoming byte rate from clients, per topic.
labels:
topic: $1
type: COUNTER
# https://docs.confluent.io/platform/current/kafka/monitoring.html#bytesoutpersec
# kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic={topicName}
- pattern: kafka.server<type=BrokerTopicMetrics, name=BytesOutPerSec, topic=(.+)><>Count
name: kafka_outgoing_bytes_total
help: The outgoing byte rate to clients per topic.
labels:
topic: $1
type: COUNTER
# HELP kafka_outgoing_bytes_total The outgoing byte rate to clients per topic.
# TYPE kafka_outgoing_bytes_total counter
kafka_outgoing_bytes_total{topic="topic1",} 300.0
kafka_outgoing_bytes_total{topic="topic2",} 146.0
# HELP kafka_messages_total The incoming message rate per topic.
# TYPE kafka_messages_total counter
kafka_messages_total{topic="topic1",} 5.0
kafka_messages_total{topic="__consumer_offsets",} 6.0
kafka_messages_total{topic="topic2",} 2.0
# HELP kafka_incoming_bytes_total The incoming byte rate from clients, per topic.
# TYPE kafka_incoming_bytes_total counter
kafka_incoming_bytes_total{topic="topic1",} 300.0
kafka_incoming_bytes_total{topic="topic2",} 146.0
kafka_incoming_bytes_total{topic="__consumer_offsets",} 1410.0
Here is one more example with topics + partitions:
rules:
# https://docs.confluent.io/platform/current/kafka/monitoring.html#messagesinpersec
# kafka.log:type=Log,name=Size,topic=topic1,partition=0
- pattern: kafka.log<type=Log, name=Size, topic=(.+), partition=(.+)><>Value
name: kafka_log_size_bytes
help: A metric for the total size in bytes of all log segments that belong to a given partition.
labels:
topic: $1
partition: $2
type: GAUGE
# https://docs.confluent.io/platform/current/kafka/monitoring.html#logstartoffset
# kafka.log:type=Log,name=LogStartOffset,topic=topic1,partition=0
- pattern: kafka.log<type=Log, name=LogStartOffset, topic=(.+), partition=(.+)><>Value
name: kafka_log_start_offset
help: The offset of the first message in a partition. Use with LogEndOffset to calculate the current message count for a topic.
labels:
topic: $1
partition: $2
type: GAUGE
# https://docs.confluent.io/platform/current/kafka/monitoring.html#logendoffset
# kafka.log:type=Log,name=LogEndOffset,topic=topic1,partition=0
- pattern: kafka.log<type=Log, name=LogEndOffset, topic=(.+), partition=(.+)><>Value
name: kafka_log_end_offset
help: The offset of the last message in a partition. Use with LogStartOffset to calculate the current message count for a topic.
labels:
topic: $1
partition: $2
type: GAUGE
# HELP kafka_log_end_offset The offset of the last message in a partition. Use with LogStartOffset to calculate the current message count for a topic.
# TYPE kafka_log_end_offset gauge
kafka_log_end_offset{partition="0",topic="topic1",} 5.0
kafka_log_end_offset{partition="0",topic="topic2",} 2.0
# HELP kafka_log_start_offset The offset of the first message in a partition. Use with LogEndOffset to calculate the current message count for a topic.
# TYPE kafka_log_start_offset gauge
kafka_log_start_offset{partition="0",topic="topic1",} 0.0
kafka_log_start_offset{partition="0",topic="topic2",} 0.0
# HELP kafka_log_size_bytes A metric for the total size in bytes of all log segments that belong to a given partition.
# TYPE kafka_log_size_bytes gauge
kafka_log_size_bytes{partition="0",topic="topic2",} 146.0
kafka_log_size_bytes{partition="0",topic="topic1",} 300.0
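As the help texts suggest, the current message count per partition is LogEndOffset minus LogStartOffset; in Prometheus that would be the query `kafka_log_end_offset - kafka_log_start_offset`. The same arithmetic on the sample values above:

```python
# Sketch: current message count per partition = LogEndOffset - LogStartOffset,
# computed here on the sample values from the scrape output above.
start = {("topic1", "0"): 0.0, ("topic2", "0"): 0.0}  # kafka_log_start_offset
end = {("topic1", "0"): 5.0, ("topic2", "0"): 2.0}    # kafka_log_end_offset

counts = {key: end[key] - start[key] for key in end}
print(counts[("topic1", "0")])  # 5.0
```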
So you may filter metrics as much as you want; the key ones are mentioned in the docs:
There are many metrics reported at the broker and controller level that can be monitored and used to troubleshoot issues with your cluster. At minimum, you should monitor and set alerts on ActiveControllerCount, OfflinePartitionsCount, and UncleanLeaderElectionsPerSec.
But if you are not sure, it should be OK to stick with the defaults:
startDelaySeconds: 10
rules:
- pattern: '.*'
note: startDelaySeconds
is used to give the app some time to initialize everything; in some examples you will even see values like 120