Kafka JMX Prometheus Metrics
At the moment of writing, a minimal Docker example looks like this:
docker run -it --rm --name=kafka -p 9092:9092 \
-e CLUSTER_ID=DDFn9qNCTrin1hUV7Rd6nQ \
-e KAFKA_NODE_ID=1 \
-e KAFKA_PROCESS_ROLES=broker,controller \
-e KAFKA_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093 \
-e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092 \
-e KAFKA_CONTROLLER_QUORUM_VOTERS=1@localhost:9093 \
-e KAFKA_CONTROLLER_LISTENER_NAMES=CONTROLLER \
-e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 \
-e KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS=0 \
-e KAFKA_TRANSACTION_STATE_LOG_MIN_ISR=1 \
-e KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR=1 \
confluentinc/cp-kafka:7.9.0
To check that everything is OK, run the following commands:
# create topic
docker exec -it kafka kafka-topics --bootstrap-server localhost:9092 --create --topic topic1
# consume
docker exec -it kafka kafka-console-consumer --bootstrap-server localhost:9092 --topic topic1
# produce
docker exec -it kafka kafka-console-producer --bootstrap-server localhost:9092 --topic topic1
Metrics
Minimally, to get up and running we need the following
jmx.yml
rules:
- pattern: '.*'
plus the following environment variables:
KAFKA_JMX_PORT=9101
KAFKA_JMX_HOSTNAME=localhost
KAFKA_OPTS=-javaagent:/usr/share/java/cp-base-new/jmx_prometheus_javaagent-0.18.0.jar=9999:/etc/kafka/jmx.yml
The last one tells the JVM to use the jmx_prometheus_javaagent
that is baked into the container; the agent will listen on port 9999
and be configured by /etc/kafka/jmx.yml,
which we need to mount, i.e.:
docker run -it --rm --name=kafka -p 9092:9092 -p 9101:9101 -p 9999:9999 \
-e CLUSTER_ID=DDFn9qNCTrin1hUV7Rd6nQ \
-e KAFKA_NODE_ID=1 \
-e KAFKA_PROCESS_ROLES=broker,controller \
-e KAFKA_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093 \
-e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092 \
-e KAFKA_CONTROLLER_QUORUM_VOTERS=1@localhost:9093 \
-e KAFKA_CONTROLLER_LISTENER_NAMES=CONTROLLER \
-e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 \
-e KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS=0 \
-e KAFKA_TRANSACTION_STATE_LOG_MIN_ISR=1 \
-e KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR=1 \
-e KAFKA_JMX_PORT=9101 \
-e KAFKA_JMX_HOSTNAME=localhost \
-e KAFKA_OPTS=-javaagent:/usr/share/java/cp-base-new/jmx_prometheus_javaagent-0.18.0.jar=9999:/etc/kafka/jmx.yml \
-v "$PWD/jmx.yml:/etc/kafka/jmx.yml" \
confluentinc/cp-kafka:7.9.0
note: we are also exposing the JMX port 9101;
you may want to connect to it with jconsole
to see which metrics are available - that helps while configuring the Prometheus agent
and your metrics should be available at http://localhost:9999/metrics
Detailed docs about the metrics can be found here: https://docs.confluent.io/platform/current/kafka/monitoring.html
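To get a feel for what the agent returns, here is a small sketch that parses the Prometheus text exposition format. The payload is hard-coded (with hypothetical values) so the snippet runs without a live broker; against a real setup you would fetch http://localhost:9999/metrics instead.

```python
# Minimal sketch: parse a few lines of the Prometheus text exposition
# format, as served by the JMX exporter agent. The sample payload below
# is hard-coded so this runs without a live broker.
sample = """\
# HELP kafka_server_disk_write_bytes_total The total number of bytes written by the broker process
# TYPE kafka_server_disk_write_bytes_total counter
kafka_server_disk_write_bytes_total 12345.0
# HELP jvm_threads_current Current thread count of a JVM
# TYPE jvm_threads_current gauge
jvm_threads_current 71.0
"""

def parse_metrics(text):
    """Return {metric_name: value} for non-comment sample lines."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        name, value = line.rsplit(" ", 1)
        metrics[name] = float(value)
    return metrics

metrics = parse_metrics(sample)
print(metrics["kafka_server_disk_write_bytes_total"])  # 12345.0
```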
Rules
By default, our config exposes all metrics as-is, which produces a bazillion metrics. Some of them look weird, it is not always easy to tell whether a metric is a gauge or a counter, and after all we do not need all of them.
Instead of
rules:
- pattern: '.*'
we may create dedicated rules.
Let's start with the simplest possible one:
rules:
# https://docs.confluent.io/platform/current/kafka/monitoring.html#linux-disk-write-bytes
# mbean: kafka.server:type=KafkaServer,name=linux-disk-write-bytes
- pattern: kafka.server<type=KafkaServer, name=linux-disk-write-bytes><>Value
name: kafka_server_disk_write_bytes_total
help: The total number of bytes written by the broker process, including writes from all disks
type: COUNTER
- pattern: kafka.server<type=Fetch><>queue-size
# - pattern: '.*' # uncomment me to see the rest
notes:
- even though we have defined the exact metrics we want, the exporter will still produce some JVM- and process-level metrics
- keep an eye on metric names: in the example above, kafka_server_disk_write_bytes would be incorrect - counters want the _total suffix
- keep an eye on the generated comments and be careful when choosing COUNTER vs GAUGE
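The COUNTER vs GAUGE choice matters downstream: Prometheus derives per-second rates from counters, while gauges are read as-is. A rough sketch with two hypothetical scrapes:

```python
# Sketch of why the COUNTER/GAUGE choice matters: Prometheus computes
# per-second rates from counters, while gauges are read as-is.
# Two hypothetical scrapes of a counter, 15 seconds apart:
samples = [(0.0, 1_000.0), (15.0, 1_600.0)]  # (timestamp, value)

(t0, v0), (t1, v1) = samples
rate = (v1 - v0) / (t1 - t0)  # roughly what rate() would report
print(rate)  # 40.0 (units per second)
```

If a monotonically increasing value is mislabeled as a gauge, such rate queries stop working as expected.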
Regex
There is a way to catch multiple metrics by using regular expressions.
For example, if we want to keep track of controller metrics, we may do something like this:
rules:
# https://docs.confluent.io/platform/current/kafka/monitoring.html#activecontrollercount
# kafka.controller:type=KafkaController,name=ActiveControllerCount
- pattern: kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value
name: kafka_controller_active_controller_count
help: The number of active controllers in the cluster. Valid values are 0 or 1. Alert if the aggregated sum across all brokers in the cluster is anything other than 1 because there should be exactly one controller per cluster.
type: GAUGE
# https://docs.confluent.io/platform/current/kafka/monitoring.html#activebrokercount
# kafka.controller:type=KafkaController,name=ActiveBrokerCount
- pattern: kafka.controller<type=KafkaController, name=ActiveBrokerCount><>Value
name: kafka_controller_active_broker_count
help: The number of active brokers as observed by this controller.
type: GAUGE
it will produce:
# HELP kafka_controller_active_controller_count The number of active controllers in the cluster. Valid values are 0 or 1. Alert if the aggregated sum across all brokers in the cluster is anything other than 1 because there should be exactly one controller per cluster.
# TYPE kafka_controller_active_controller_count gauge
kafka_controller_active_controller_count 1.0
# HELP kafka_controller_active_broker_count The number of active brokers as observed by this controller.
# TYPE kafka_controller_active_broker_count gauge
kafka_controller_active_broker_count 1.0
or, instead, we may do:
rules:
- pattern: kafka.controller<type=KafkaController, name=(.+)><>Value
name: kafka_controller_$1_count
type: GAUGE
which will give us:
# HELP kafka_controller_LastAppliedRecordOffset_count Attribute exposed for management kafka.controller:name=LastAppliedRecordOffset,type=KafkaController,attribute=Value
# TYPE kafka_controller_LastAppliedRecordOffset_count gauge
kafka_controller_LastAppliedRecordOffset_count 4165.0
...
# HELP kafka_controller_EventQueueOperationsStartedCount_count Attribute exposed for management kafka.controller:name=EventQueueOperationsStartedCount,type=KafkaController,attribute=Value
# TYPE kafka_controller_EventQueueOperationsStartedCount_count gauge
kafka_controller_EventQueueOperationsStartedCount_count 9631.0
if we do not want all of them we may do:
rules:
- pattern: kafka.controller<type=KafkaController, name=(ActiveControllerCount|ActiveBrokerCount)><>Value
name: kafka_controller_$1_count
type: GAUGE
and output will be:
# HELP kafka_controller_ActiveControllerCount_count Attribute exposed for management kafka.controller:name=ActiveControllerCount,type=KafkaController,attribute=Value
# TYPE kafka_controller_ActiveControllerCount_count gauge
kafka_controller_ActiveControllerCount_count 1.0
# HELP kafka_controller_ActiveBrokerCount_count Attribute exposed for management kafka.controller:name=ActiveBrokerCount,type=KafkaController,attribute=Value
# TYPE kafka_controller_ActiveBrokerCount_count gauge
kafka_controller_ActiveBrokerCount_count 1.0
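The pattern/$1 mechanics can be simulated with Python regexes. This is an approximation of how the exporter rewrites names, not its exact code; the mBean strings mirror the form shown in the rules above:

```python
import re

# Sketch of jmx_exporter's pattern/$1 substitution, simulated with a
# Python regex. An approximation, not the exporter's actual code.
pattern = r"kafka\.controller<type=KafkaController, name=(ActiveControllerCount|ActiveBrokerCount)><>Value"
name_template = "kafka_controller_$1_count"

def rewrite(mbean):
    m = re.fullmatch(pattern, mbean)
    if m is None:
        return None  # rule does not apply; exporter falls through to the next rule
    return name_template.replace("$1", m.group(1))

print(rewrite("kafka.controller<type=KafkaController, name=ActiveBrokerCount><>Value"))
# kafka_controller_ActiveBrokerCount_count
print(rewrite("kafka.controller<type=KafkaController, name=LastAppliedRecordOffset><>Value"))
# None
```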
To lowercase metric names, do this:
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
- pattern: kafka.controller<type=KafkaController, name=(ActiveControllerCount|ActiveBrokerCount)><>Value
name: kafka_controller_$1_count
type: GAUGE
# HELP kafka_controller_activecontrollercount_count Attribute exposed for management kafka.controller:name=ActiveControllerCount,type=KafkaController,attribute=Value
# TYPE kafka_controller_activecontrollercount_count gauge
kafka_controller_activecontrollercount_count 1.0
# HELP kafka_controller_activebrokercount_count Attribute exposed for management kafka.controller:name=ActiveBrokerCount,type=KafkaController,attribute=Value
# TYPE kafka_controller_activebrokercount_count gauge
kafka_controller_activebrokercount_count 1.0
note: so technically it is a question of whether you want dedicated help messages per metric or whether you want to do everything at once
Labels
There is also a way to attach labels to metrics, like so:
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
- pattern: kafka.controller<type=KafkaController, name=(ActiveControllerCount|ActiveBrokerCount)><>Value
name: kafka_controller_count
labels:
type: $1
type: GAUGE
# HELP kafka_controller_count Attribute exposed for management kafka.controller:name=ActiveBrokerCount,type=KafkaController,attribute=Value
# TYPE kafka_controller_count gauge
kafka_controller_count{type="ActiveBrokerCount",} 1.0
kafka_controller_count{type="ActiveControllerCount",} 1.0
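The point of labels is that consumers can filter and aggregate on them. A sketch that parses the labeled samples above (hard-coded, so no broker is needed) and selects by label:

```python
import re

# Sketch: once metrics carry labels, consumers can filter/aggregate on
# them. Parse labeled samples (hard-coded, so no broker is needed).
sample = '''\
kafka_controller_count{type="ActiveBrokerCount",} 1.0
kafka_controller_count{type="ActiveControllerCount",} 1.0
'''

def parse_labeled(text):
    """Return a list of (name, labels_dict, value) tuples."""
    out = []
    for line in text.strip().splitlines():
        m = re.match(r'(\w+)\{(.*?)\}\s+(\S+)', line)
        labels = dict(re.findall(r'(\w+)="([^"]*)"', m.group(2)))
        out.append((m.group(1), labels, float(m.group(3))))
    return out

samples = parse_labeled(sample)
active = [v for name, labels, v in samples if labels["type"] == "ActiveControllerCount"]
print(sum(active))  # 1.0
```

In Prometheus itself this kind of selection would be a label matcher, e.g. `kafka_controller_count{type="ActiveControllerCount"}`.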
You may really want labels for topic-related metrics.
Here is a slightly different set of commands to run:
# start kafka
docker run -it --rm --name=kafka -p 9999:9999 \
-e CLUSTER_ID=DDFn9qNCTrin1hUV7Rd6nQ \
-e KAFKA_NODE_ID=1 \
-e KAFKA_PROCESS_ROLES=broker,controller \
-e KAFKA_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093 \
-e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092 \
-e KAFKA_CONTROLLER_QUORUM_VOTERS=1@localhost:9093 \
-e KAFKA_CONTROLLER_LISTENER_NAMES=CONTROLLER \
-e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 \
-e KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS=0 \
-e KAFKA_TRANSACTION_STATE_LOG_MIN_ISR=1 \
-e KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR=1 \
-e KAFKA_JMX_PORT=9101 \
-e KAFKA_JMX_HOSTNAME=localhost \
-e KAFKA_OPTS=-javaagent:/usr/share/java/cp-base-new/jmx_prometheus_javaagent-0.18.0.jar=9999:/etc/kafka/jmx.yml \
-v "$PWD/jmx.yml:/etc/kafka/jmx.yml" \
confluentinc/cp-kafka:7.9.0
# create topic
docker run -it --rm --link=kafka confluentinc/cp-kafka:7.9.0 kafka-topics --bootstrap-server kafka:9092 --create --topic topic1
# start consumer
docker run -it --rm --link=kafka confluentinc/cp-kafka:7.9.0 kafka-console-consumer --bootstrap-server kafka:9092 --topic topic1
# start producer
docker run -it --rm --link=kafka confluentinc/cp-kafka:7.9.0 kafka-console-producer --bootstrap-server kafka:9092 --topic topic1
# type something in producer, press enter, repeat few times
# then create few more topics and repeat steps
Let's try to expose MessagesInPerSec, BytesInPerSec, and BytesOutPerSec per topic:
rules:
# https://docs.confluent.io/platform/current/kafka/monitoring.html#messagesinpersec
# kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic={topicName}
- pattern: kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec, topic=(.+)><>Count
name: kafka_messages_total
help: The incoming message rate per topic.
labels:
topic: $1
type: COUNTER
# https://docs.confluent.io/platform/current/kafka/monitoring.html#bytesinpersec
# kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic={topicName}
- pattern: kafka.server<type=BrokerTopicMetrics, name=BytesInPerSec, topic=(.+)><>Count
name: kafka_incoming_bytes_total
help: The incoming byte rate from clients, per topic.
labels:
topic: $1
type: COUNTER
# https://docs.confluent.io/platform/current/kafka/monitoring.html#bytesoutpersec
# kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic={topicName}
- pattern: kafka.server<type=BrokerTopicMetrics, name=BytesOutPerSec, topic=(.+)><>Count
name: kafka_outgoing_bytes_total
help: The outgoing byte rate to clients per topic.
labels:
topic: $1
type: COUNTER
# HELP kafka_outgoing_bytes_total The outgoing byte rate to clients per topic.
# TYPE kafka_outgoing_bytes_total counter
kafka_outgoing_bytes_total{topic="topic1",} 300.0
kafka_outgoing_bytes_total{topic="topic2",} 146.0
# HELP kafka_messages_total The incoming message rate per topic.
# TYPE kafka_messages_total counter
kafka_messages_total{topic="topic1",} 5.0
kafka_messages_total{topic="__consumer_offsets",} 6.0
kafka_messages_total{topic="topic2",} 2.0
# HELP kafka_incoming_bytes_total The incoming byte rate from clients, per topic.
# TYPE kafka_incoming_bytes_total counter
kafka_incoming_bytes_total{topic="topic1",} 300.0
kafka_incoming_bytes_total{topic="topic2",} 146.0
kafka_incoming_bytes_total{topic="__consumer_offsets",} 1410.0
Here is one more example with topics + partitions:
rules:
# https://docs.confluent.io/platform/current/kafka/monitoring.html#messagesinpersec
# kafka.log:type=Log,name=Size,topic=topic1,partition=0
- pattern: kafka.log<type=Log, name=Size, topic=(.+), partition=(.+)><>Value
name: kafka_log_size_bytes
help: A metric for the total size in bytes of all log segments that belong to a given partition.
labels:
topic: $1
partition: $2
type: GAUGE
# https://docs.confluent.io/platform/current/kafka/monitoring.html#logstartoffset
# kafka.log:type=Log,name=LogStartOffset,topic=topic1,partition=0
- pattern: kafka.log<type=Log, name=LogStartOffset, topic=(.+), partition=(.+)><>Value
name: kafka_log_start_offset
help: The offset of the first message in a partition. Use with LogEndOffset to calculate the current message count for a topic.
labels:
topic: $1
partition: $2
type: GAUGE
# https://docs.confluent.io/platform/current/kafka/monitoring.html#logendoffset
# kafka.log:type=Log,name=LogEndOffset,topic=topic1,partition=0
- pattern: kafka.log<type=Log, name=LogEndOffset, topic=(.+), partition=(.+)><>Value
name: kafka_log_end_offset
help: The offset of the last message in a partition. Use with LogStartOffset to calculate the current message count for a topic.
labels:
topic: $1
partition: $2
type: GAUGE
# HELP kafka_log_end_offset The offset of the last message in a partition. Use with LogStartOffset to calculate the current message count for a topic.
# TYPE kafka_log_end_offset gauge
kafka_log_end_offset{partition="0",topic="topic1",} 5.0
kafka_log_end_offset{partition="0",topic="topic2",} 2.0
# HELP kafka_log_start_offset The offset of the first message in a partition. Use with LogEndOffset to calculate the current message count for a topic.
# TYPE kafka_log_start_offset gauge
kafka_log_start_offset{partition="0",topic="topic1",} 0.0
kafka_log_start_offset{partition="0",topic="topic2",} 0.0
# HELP kafka_log_size_bytes A metric for the total size in bytes of all log segments that belong to a given partition.
# TYPE kafka_log_size_bytes gauge
kafka_log_size_bytes{partition="0",topic="topic2",} 146.0
kafka_log_size_bytes{partition="0",topic="topic1",} 300.0
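As the help texts suggest, the current message count per partition is LogEndOffset minus LogStartOffset; in Prometheus that would be the query `kafka_log_end_offset - kafka_log_start_offset`. The same arithmetic on the sample values above:

```python
# Sketch: current message count per partition = LogEndOffset - LogStartOffset,
# computed here on the sample values from the scrape output above.
start = {("topic1", "0"): 0.0, ("topic2", "0"): 0.0}  # kafka_log_start_offset
end = {("topic1", "0"): 5.0, ("topic2", "0"): 2.0}    # kafka_log_end_offset

counts = {key: end[key] - start[key] for key in end}
print(counts[("topic1", "0")])  # 5.0
```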
So you may filter metrics as much as you want; the key ones are mentioned in the docs:
There are many metrics reported at the broker and controller level that can be monitored and used to troubleshoot issues with your cluster. At minimum, you should monitor and set alerts on ActiveControllerCount, OfflinePartitionsCount, and UncleanLeaderElectionsPerSec.
But if you are not sure, it should be OK to stick with the defaults:
startDelaySeconds: 10
rules:
- pattern: '.*'
note: startDelaySeconds
is used to give the app some time to initialize everything; in some examples you will even see values like 120