Kafka Prometheus Exporter

An example of configuring the Prometheus JMX exporter for Kafka.

How it works:

In Java there is a way to pass a so-called javaagent, which can modify bytecode before it is run by the JVM
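For the jmx_prometheus_javaagent the option string after the `=` is `port:config.yaml`, so the general form looks like this (the paths here are placeholders, not real files):

```
java -javaagent:/path/to/jmx_prometheus_javaagent.jar=<port>:<config.yml> -jar application.jar
```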

A really good explanation is available in this video, starting at 12:20

Confluent has already created and published the JMX exporter we need

There are also configuration examples that can be used as a starting point

Kafka

Before anything else, let's set up a basic Kafka (a more detailed description can be found here)

sudo wget -O /opt/confluent-community-7.0.1.tar.gz https://packages.confluent.io/archive/7.0/confluent-community-7.0.1.tar.gz
sudo tar -xzf /opt/confluent-community-7.0.1.tar.gz -C /opt
sudo rm /opt/confluent-community-7.0.1.tar.gz
sudo ln -s /opt/confluent-7.0.1 /opt/confluent

JMX Exporter for Kafka

sudo mkdir /opt/confluent-7.0.1/share/java/prometheus
sudo wget -O /opt/confluent-7.0.1/share/java/prometheus/jmx_prometheus_javaagent-0.16.1.jar https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.16.1/jmx_prometheus_javaagent-0.16.1.jar

configs

sudo mkdir /etc/confluent

sudo wget -O /etc/confluent/zookeeper.yml https://raw.githubusercontent.com/confluentinc/jmx-monitoring-stacks/6.1.0-post/shared-assets/jmx-exporter/zookeeper.yml

sudo wget -O /etc/confluent/kafka_broker.yml https://raw.githubusercontent.com/confluentinc/jmx-monitoring-stacks/6.1.0-post/shared-assets/jmx-exporter/kafka_broker.yml

sudo wget -O /etc/confluent/confluent_schemaregistry.yml https://raw.githubusercontent.com/confluentinc/jmx-monitoring-stacks/6.1.0-post/shared-assets/jmx-exporter/confluent_schemaregistry.yml

sudo wget -O /etc/confluent/confluent_rest.yml https://raw.githubusercontent.com/confluentinc/jmx-monitoring-stacks/6.1.0-post/shared-assets/jmx-exporter/confluent_rest.yml

start zookeeper

EXTRA_ARGS=-javaagent:/opt/confluent-7.0.1/share/java/prometheus/jmx_prometheus_javaagent-0.16.1.jar=9101:/etc/confluent/zookeeper.yml /opt/confluent-7.0.1/bin/zookeeper-server-start /opt/confluent-7.0.1/etc/kafka/zookeeper.properties

start kafka

EXTRA_ARGS=-javaagent:/opt/confluent-7.0.1/share/java/prometheus/jmx_prometheus_javaagent-0.16.1.jar=9102:/etc/confluent/kafka_broker.yml /opt/confluent-7.0.1/bin/kafka-server-start /opt/confluent-7.0.1/etc/kafka/server.properties

start schema registry

EXTRA_ARGS=-javaagent:/opt/confluent-7.0.1/share/java/prometheus/jmx_prometheus_javaagent-0.16.1.jar=9103:/etc/confluent/confluent_schemaregistry.yml /opt/confluent-7.0.1/bin/schema-registry-start /opt/confluent-7.0.1/etc/schema-registry/schema-registry.properties

start rest proxy

KAFKAREST_OPTS=-javaagent:/opt/confluent-7.0.1/share/java/prometheus/jmx_prometheus_javaagent-0.16.1.jar=9104:/etc/confluent/confluent_rest.yml /opt/confluent-7.0.1/bin/kafka-rest-start /opt/confluent-7.0.1/etc/kafka-rest/kafka-rest.properties
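Running these in a terminal is fine for a demo; for a more permanent setup, the same variable can go into a systemd unit, roughly like this (a sketch — the unit file name and layout are my assumptions):

```ini
# /etc/systemd/system/kafka.service (fragment; file name and layout assumed)
[Service]
Environment="EXTRA_ARGS=-javaagent:/opt/confluent-7.0.1/share/java/prometheus/jmx_prometheus_javaagent-0.16.1.jar=9102:/etc/confluent/kafka_broker.yml"
ExecStart=/opt/confluent-7.0.1/bin/kafka-server-start /opt/confluent-7.0.1/etc/kafka/server.properties
```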

It takes a few minutes for metrics to appear; meanwhile, let's generate some activity:

# create topic
/opt/confluent-7.0.1/bin/kafka-topics --bootstrap-server localhost:9092 --partitions 1 --create --topic demo
/opt/confluent-7.0.1/bin/kafka-topics --bootstrap-server localhost:9092 --list

# optionally start consumer
/opt/confluent-7.0.1/bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic demo

# start producer and send few messages
/opt/confluent-7.0.1/bin/kafka-console-producer --bootstrap-server localhost:9092 --topic demo

And if everything is fine, we should get our metrics:

curl http://localhost:9101/metrics
curl http://localhost:9102/metrics
curl http://localhost:9103/metrics
curl http://localhost:9104/metrics
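To actually scrape these endpoints, a minimal Prometheus config could look like this (a sketch; the job names are my choice, matching the job labels used in the queries below, and the targets assume everything runs on one host):

```yaml
# prometheus.yml fragment (job names chosen to match the queries in this doc)
scrape_configs:
  - job_name: zookeeper
    static_configs:
      - targets: ['localhost:9101']
  - job_name: kafka-broker
    static_configs:
      - targets: ['localhost:9102']
  - job_name: schema-registry
    static_configs:
      - targets: ['localhost:9103']
  - job_name: kafka-rest
    static_configs:
      - targets: ['localhost:9104']
```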

In my case it is a single-node dev environment, so on the one hand I don't want to expose cluster-wide metrics; on the other, I went through the Grafana dashboards to see what is collected. Here are my minimal viable configs.

Zookeeper

---
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  - pattern: "org.apache.ZooKeeperService<name0=(.+), name1=InMemoryDataTree><>(WatchCount|NodeCount): (\\d+)"
    name: zookeeper_inmemorydatatree_$2
    type: GAUGE
    labels:
      server_name: $1
      server_id: '1'
  - pattern: 'org.apache.ZooKeeperService<name0=(.+)><>(.+): (.+)'
    name: zookeeper_$2
    type: GAUGE
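A rough sketch of what the exporter does with such a rule: match the mBean name against the pattern, substitute the capture groups into the name and label templates, and (because of lowercaseOutputName) lowercase the resulting metric name. This is a simplified model for illustration, not the exporter's actual code:

```python
import re

# Simplified model of a jmx_exporter rule (illustration only):
# match the mBean string, substitute capture groups, lowercase the name.
pattern = re.compile(
    r"org.apache.ZooKeeperService"
    r"<name0=(.+), name1=InMemoryDataTree><>(WatchCount|NodeCount): (\d+)"
)
name_template = "zookeeper_inmemorydatatree_$2"

# Example mBean attribute as exposed over JMX
mbean = (
    "org.apache.ZooKeeperService<name0=StandaloneServer_port2181, "
    "name1=InMemoryDataTree><>NodeCount: 134"
)

m = pattern.match(mbean)
metric_name = name_template.replace("$2", m.group(2)).lower()  # lowercaseOutputName
labels = {"server_name": m.group(1), "server_id": "1"}
value = float(m.group(3))

print(metric_name, labels, value)
# metric_name == "zookeeper_inmemorydatatree_nodecount"
```

This produces exactly the `zookeeper_inmemorydatatree_nodecount{server_id="1",server_name="StandaloneServer_port2181"}` sample shown below.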
# RAM usage, ~116MB
process_resident_memory_bytes 1.21942016E8

# CPU Usage
process_cpu_seconds_total 2.99
# irate(process_cpu_seconds_total{job=\"zookeeper\"}[5m])*100

# JVM Memory Used - ~28MB heap + ~20MB non-heap
jvm_memory_bytes_used{area="heap",} 2.9829424E7
jvm_memory_bytes_used{area="nonheap",} 2.0637672E7
; sum without(area)(jvm_memory_bytes_used{job=\"zookeeper\"})
# JVM Memory Max - 512mb - seems to be the -Xmx value
jvm_memory_bytes_max{area="heap",} 5.36870912E8
; jvm_memory_bytes_max{job=\"zookeeper\",area=\"heap\"}

# Time spent in GC
jvm_gc_collection_seconds_count{gc="G1 Young Generation",} 2.0
jvm_gc_collection_seconds_sum{gc="G1 Young Generation",} 0.018
jvm_gc_collection_seconds_count{gc="G1 Old Generation",} 0.0
jvm_gc_collection_seconds_sum{gc="G1 Old Generation",} 0.0
; sum without(gc)(rate(jvm_gc_collection_seconds_sum{job=\"zookeeper\"}[5m]))

# For clusters: Quorum Size of Zookeeper ensemble
# count(zookeeper_status_quorumsize{job=\"zookeeper\"})

# Number of Alive Connections
zookeeper_numaliveconnections 1.0
; sum(zookeeper_numaliveconnections{job=\"zookeeper\"})

# Number of queued requests in the server. This goes up when the server receives more requests than it can process
zookeeper_outstandingrequests 0.0
; zookeeper_outstandingrequests{job=\"zookeeper\"}

# Number of ZNodes
zookeeper_inmemorydatatree_nodecount{server_id="1",server_name="StandaloneServer_port2181",} 134.0
; avg(zookeeper_inmemorydatatree_nodecount{job=\"zookeeper\"})

# Number of Watchers
zookeeper_inmemorydatatree_watchcount{server_id="1",server_name="StandaloneServer_port2181",} 0.0
; sum(zookeeper_inmemorydatatree_watchcount{job=\"zookeeper\"})

# Amount of time, in milliseconds, it takes for the server to respond to a client request
zookeeper_ticktime 3000.0
zookeeper_minrequestlatency 0.0
; zookeeper_minrequestlatency{job=\"zookeeper\"} * zookeeper_ticktime
; zookeeper_avgrequestlatency{job=\"zookeeper\"} * zookeeper_ticktime
; zookeeper_maxrequestlatency{job=\"zookeeper\"} * zookeeper_ticktime

# presumably bytes
zookeeper_datadirsize 2.01350043E8
zookeeper_logdirsize  2.01350043E8

Kafka

TODO: the config and the number of metrics are huge; need more time to figure out what is needed and what isn't

These are from the Kafka cluster dashboard:

# Number of active controllers in the cluster
kafka_controller_kafkacontroller_activecontrollercount 1.0
# stat: kafka_controller_kafkacontroller_activecontrollercount{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\"} > 0

# Online Partitions count
kafka_server_replicamanager_partitioncount 51.0
# stat: sum(kafka_server_replicamanager_partitioncount{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\"})

# Broker network throughput
kafka_server_brokertopicmetrics_bytesinpersec 2002.0
kafka_server_brokertopicmetrics_bytesinpersec{topic="demo",} 2002.0
kafka_server_brokertopicmetrics_bytesoutpersec 0.0
# sum(rate(kafka_server_brokertopicmetrics_bytesinpersec{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\",topic!=\"\"}[5m]))
# sum(rate(kafka_server_brokertopicmetrics_bytesoutpersec{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\",topic!=\"\"}[5m]))

# Offline Partitions Count
kafka_controller_kafkacontroller_offlinepartitionscount 0.0
# sum(kafka_controller_kafkacontroller_offlinepartitionscount{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\"})

# Produce request rate
kafka_network_requestmetrics_requestspersec{request="Produce",version="9",} 23.0
# sum(rate(kafka_network_requestmetrics_requestspersec{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\",request=\"Produce\"}[5m]))

# Consumer Fetch Request Per Sec
kafka_network_requestmetrics_requestspersec{request="FetchConsumer",version="11",} 2.0
# sum(rate(kafka_network_requestmetrics_requestspersec{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\",request=\"FetchConsumer\"}[5m]))

# Broker Fetch Request Per Sec
# sum(rate(kafka_network_requestmetrics_requestspersec{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\",request=\"Fetch\"}[5m]))

# Offset Commit Request Per Sec
kafka_network_requestmetrics_requestspersec{request="OffsetCommit",version="11",} 2.0
# sum(rate(kafka_network_requestmetrics_requestspersec{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\",request=\"OffsetCommit\"}[5m]))

# CPU usage
process_cpu_seconds_total 7.67
# irate(process_cpu_seconds_total{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\"}[5m])*100

# JVM Memory Used
jvm_memory_bytes_used{area="heap",} 1.66297072E8
jvm_memory_bytes_max{area="heap",} 1.073741824E9
# sum without(area)(jvm_memory_bytes_used{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\"})
# jvm_memory_bytes_max{job=\"kafka-broker\",area=\"heap\",env=\"$env\",instance=~\"$instance\"}

# Time spent in GC (% of time in GC, threshold 80%)
jvm_gc_collection_seconds_sum{gc="G1 Young Generation",} 0.052
# sum without(gc)(rate(jvm_gc_collection_seconds_sum{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\"}[5m]))

# Messages In
kafka_server_brokertopicmetrics_messagesinpersec 69.0
kafka_server_brokertopicmetrics_messagesinpersec{topic="demo",} 69.0
# sum without(instance,topic)(rate(kafka_server_brokertopicmetrics_messagesinpersec{job=\"kafka-broker\",env=\"$env\",topic!=\"\"}[5m]))

# Network Processor Avg Usage Percent - average fraction of time the network processor threads are idle. Values are between 0 (all resources are used) and 1 (all resources are available)
kafka_network_socketserver_networkprocessoravgidlepercent 1.0
# 1-kafka_network_socketserver_networkprocessoravgidlepercent{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\"}

# Zookeeper Request Latency - latency in milliseconds for ZooKeeper requests from broker
kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms 108.0
kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{quantile="0.50",} 3.0
kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{quantile="0.95",} 9.0
# kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\",quantile=~\"$percentile\"}

# Log size per Topic
kafka_log_log_size{partition="0",topic="demo",} 2898.0
# sum(kafka_log_log_size{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\"}) by (topic)

# Producer - RequestQueueTimeMs - A high value can imply there aren't enough IO threads, the CPU is a bottleneck, or the request queue isn't large enough. The request queue size should match the number of connections.
kafka_network_requestmetrics_requestqueuetimems{request="Produce",} 23.0
kafka_network_requestmetrics_requestqueuetimems{request="Produce",quantile="0.50",} 0.0
# kafka_network_requestmetrics_requestqueuetimems{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\",quantile=~\"$percentile\",request=\"Produce\"}

# In most cases, a high value can imply slow local storage or the storage is a bottleneck. One should also investigate LogFlushRateAndTimeMs to know how long page flushes are taking, which will also indicate a slow disk. In the case of FetchFollower requests, time spent in LocalTimeMs can be the result of a ZooKeeper write to change the ISR
kafka_network_requestmetrics_localtimems{request="Produce",} 23.0
kafka_network_requestmetrics_localtimems{request="Produce",quantile="0.50",} 1.0
# kafka_network_requestmetrics_localtimems{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\",quantile=~\"$percentile\",request=\"Produce\"}

# A high value can imply a slow network connection. For fetch request, if the remote time is high, it could be that there is not enough data to give in a fetch response. This can happen when the consumer or replica is caught up and there is no new incoming data. If this is the case, remote time will be close to the max wait time, which is normal. Max wait time is configured via replica.fetch.wait.max.ms and fetch.max.wait.ms.
kafka_network_requestmetrics_remotetimems{request="Produce",} 23.0
kafka_network_requestmetrics_remotetimems{request="Produce",quantile="0.50",} 0.0
# kafka_network_requestmetrics_remotetimems{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\",quantile=~\"$percentile\",request=\"Produce\"}

# A high value can imply there aren't enough network threads or the network can't dequeue responses quickly enough, causing back pressure in the response queue.
kafka_network_requestmetrics_responsequeuetimems{request="Produce",} 23.0
# kafka_network_requestmetrics_responsequeuetimems{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\",quantile=~\"$percentile\",request=\"Produce\"}

# A high value can imply the zero-copy from disk to the network is slow, or the network is the bottleneck because the network can't dequeue responses off the TCP socket as quickly as they're being created. If the network buffer gets full, Kafka will block.
kafka_network_requestmetrics_responsesendtimems{request="Produce",} 23.0
# kafka_network_requestmetrics_responsesendtimems{job=\"kafka-broker\",env=\"$env\",instance=~\"$instance\",quantile=~\"$percentile\",request=\"Produce\"}

# Same set for {request="Fetch"}

# Connections count per listener
kafka_server_socketservermetrics_connection_count{listener="PLAINTEXT",network_processor="0",} 1.0
# sum(kafka_server_socketservermetrics_connection_count{job=\"kafka-broker\", env=\"$env\", instance=~\"$instance\"}) by (listener)

# Connections creation rate per listener
kafka_server_socketservermetrics_connection_creation_rate{listener="PLAINTEXT",network_processor="1",} 0.0
# sum(kafka_server_socketservermetrics_connection_creation_rate{job=\"kafka-broker\", env=\"$env\", instance=~\"$instance\"}) by (listener)

# Connections close rate per listener
kafka_server_socketservermetrics_connection_close_rate{listener="PLAINTEXT",network_processor="1",} 0.0
# sum(kafka_server_socketservermetrics_connection_close_rate{job=\"kafka-broker\", env=\"$env\", instance=~\"$instance\"}) by (listener)

# Connections per client version
kafka_server_socketservermetrics_connections{client_software_name="apache-kafka-java",client_software_version="7.0.1-ccs",listener="PLAINTEXT",network_processor="2",} 0.0
# sum(kafka_server_socketservermetrics_connections{job=\"kafka-broker\", env=\"$env\", instance=~\"$instance\"}) by (client_software_name, client_software_version)

# Number of consumer group per state
# sum(kafka_coordinator_group_groupmetadatamanager_numgroupsstable{job=\"kafka-broker\", env=\"$env\", instance=~\"$instance\"})
# sum(kafka_coordinator_group_groupmetadatamanager_numgroupspreparingrebalance{job=\"kafka-broker\", env=\"$env\", instance=~\"$instance\"})
# sum(kafka_coordinator_group_groupmetadatamanager_numgroupsdead{job=\"kafka-broker\", env=\"$env\", instance=~\"$instance\"})
# sum(kafka_coordinator_group_groupmetadatamanager_numgroupscompletingrebalance{job=\"kafka-broker\", env=\"$env\", instance=~\"$instance\"})
# sum(kafka_coordinator_group_groupmetadatamanager_numgroupsempty{job=\"kafka-broker\", env=\"$env\", instance=~\"$instance\"})

And these are from the Kafka topics dashboard:

# Total # of Topics
kafka_controller_kafkacontroller_globaltopiccount 2.0
# sum(kafka_controller_kafkacontroller_globaltopiccount)

# Messages In
kafka_server_brokertopicmetrics_messagesinpersec 69.0
kafka_server_brokertopicmetrics_messagesinpersec{topic="demo",} 69.0
# sum without(instance) (rate(kafka_server_brokertopicmetrics_messagesinpersec{job=\"kafka-broker\",topic=~\"$topic\",env=~\"$env\"}[5m]))

# Log size
kafka_log_log_size{partition="0",topic="demo",} 2898.0
# sum(kafka_log_log_size{job=\"kafka-broker\",env=\"$env\",topic=~\"$topic\"}) by (topic)

# Total # of Partitions
kafka_controller_kafkacontroller_globalpartitioncount 51.0
# sum(kafka_controller_kafkacontroller_globalpartitioncount)

# Bytes In
kafka_server_brokertopicmetrics_bytesinpersec{topic="demo",} 2002.0
# sum without(instance) (rate(kafka_server_brokertopicmetrics_bytesinpersec{job=\"kafka-broker\",topic=~\"$topic\",env=~\"$env\"}[5m]))

# Bytes Out
kafka_server_brokertopicmetrics_bytesoutpersec 0.0
# sum without(instance) (rate(kafka_server_brokertopicmetrics_bytesoutpersec{job=\"kafka-broker\",topic=~\"$topic\",env=~\"$env\"}[5m]))

# Produce Request per sec
kafka_server_brokertopicmetrics_totalproducerequestspersec{topic="demo",} 23.0
# sum(rate(kafka_server_brokertopicmetrics_totalproducerequestspersec{job=\"kafka-broker\", env=\"$env\", topic=~\"$topic\"}[5m])) by (topic)

# Fetch Request per sec
kafka_server_brokertopicmetrics_totalfetchrequestspersec 0.0
# sum(rate(kafka_server_brokertopicmetrics_totalfetchrequestspersec{job=\"kafka-broker\", env=\"$env\",topic=~\"$topic\"}[5m])) by (topic)

# Start Offset
kafka_log_log_logstartoffset{partition="0",topic="demo",} 0.0
# kafka_log_log_logstartoffset{job=\"kafka-broker\",env=~\"$env\",topic=~\"$topic\"}

# End Offset
kafka_log_log_logendoffset{partition="0",topic="demo",} 86.0
# kafka_log_log_logendoffset{job=\"kafka-broker\",env=~\"$env\",topic=~\"$topic\"}

Schema Registry

# Schema Registry Instances
kafka_schema_registry_registered_count 0.0
# count(kafka_schema_registry_registered_count{job=\"schema-registry\",env=\"$env\"})

# CPU Usage
process_cpu_seconds_total 8.27
# irate(process_cpu_seconds_total{job=\"schema-registry\",env=\"$env\"}[5m])*100

# JVM Memory, JVM GC
# ...

# Active Connections
kafka_schema_registry_jetty_metrics_connections_active 0.0
# kafka_schema_registry_jetty_metrics_connections_active{job=\"schema-registry\",env=\"$env\"}

# Requests Rate
kafka_schema_registry_jersey_metrics_request_rate 0.05037445007891997
# kafka_schema_registry_jersey_metrics_request_rate{job=\"schema-registry\",env=\"$env\"}

# Requests latency 99p
kafka_schema_registry_jersey_metrics_request_latency_99 50.0
# kafka_schema_registry_jersey_metrics_request_latency_99{job=\"schema-registry\",env=\"$env\"}

There are many more interesting metrics, like compatibility check errors, auth errors and so on; I need to come back to this after some data has been collected.

The same is true for Kafka REST.

Also, both of them have embedded consumers and producers.

consumer

# The number of commit calls per second
kafka_consumer_consumer_coordinator_metrics_commit_rate{job="schema"}

# The average time in ms a request was throttled by a broker
kafka_consumer_consumer_fetch_manager_metrics_fetch_throttle_time_avg{job="schema"}

# Rate of failed authentication attempts
kafka_consumer_consumer_metrics_failed_authentication_rate{job="schema"}

# The number of total rebalance events per hour, both successful and unsuccessful rebalance attempts
kafka_consumer_consumer_coordinator_metrics_rebalance_rate_per_hour{job="schema"} + kafka_consumer_consumer_coordinator_metrics_failed_rebalance_rate_per_hour{job="schema"}

# The average number of bytes consumed per second
rate(kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_total{job="schema",topic="_schemas"}[1m])

# The average number of records consumed per second
rate(kafka_consumer_consumer_fetch_manager_metrics_records_consumed_total{job="schema",topic="_schemas"}[1m])

# The number of fetch requests per second
rate(kafka_consumer_consumer_fetch_manager_metrics_fetch_total{job="schema"}[1m])

# The average number of bytes fetched per request for a topic
kafka_consumer_consumer_fetch_manager_metrics_fetch_size_avg{job="schema",topic="_schemas"}

# The average time taken for a fetch request
kafka_consumer_consumer_fetch_manager_metrics_fetch_latency_avg{job="schema"}

# The average time taken for a commit request
kafka_consumer_consumer_coordinator_metrics_commit_latency_avg{job="schema"}

# Number of simultaneous connections
kafka_consumer_consumer_metrics_connection_count{job="schema"}

# response rate per node
kafka_consumer_consumer_node_metrics_response_rate{job="schema"}

producer

# The average per-second number of retried record sends for a topic. An increase could signal connectivity problems from the application to the broker
kafka_producer_producer_metrics_record_retry_rate{job="schema"}

# The average per-second number of record sends that resulted in errors.
rate(kafka_producer_producer_metrics_record_error_total{job="schema"}[1m])

# The total amount of buffer memory that is not being used (either unallocated or in the free list)
kafka_producer_producer_metrics_buffer_available_bytes{job="schema"}

# The average time in ms a request was throttled by a broker.
kafka_producer_producer_metrics_produce_throttle_time_avg{job="schema"}

# The average compression rate of record batches for a topic, defined as the average ratio of the compressed batch size over the uncompressed size.
kafka_producer_producer_metrics_compression_rate_avg{job="schema"}

# The average request latency in ms
kafka_producer_producer_metrics_request_latency_avg{job="schema"}

# The average time in ms record batches spent in the send buffer
kafka_producer_producer_metrics_record_queue_time_avg{job="schema"}

# The rate of failed authentications per second
rate(kafka_producer_producer_metrics_failed_authentication_total{job="schema"}[1m])

# The average number of bytes sent per second to the broker.
rate(kafka_producer_producer_metrics_outgoing_byte_total{job="schema"}[1m])

# The average number of bytes sent per second to the broker per topic.
sum(rate(kafka_producer_producer_topic_metrics_byte_total{job="schema"}[1m])) by (instance,topic)

# The average number of messages sent per second to the broker.
rate(kafka_producer_producer_metrics_record_send_total{job="schema"}[1m])

# The average number of messages sent per second to the broker per topic.
sum(rate(kafka_producer_producer_topic_metrics_record_send_total{job="schema"}[1m])) by (instance, topic)

# The average number of bytes sent per partition per-request.
kafka_producer_producer_metrics_batch_size_avg{job="schema"}

# The current number of in-flight requests awaiting a response.
kafka_producer_producer_metrics_requests_in_flight{job="schema"}

# The average record size
kafka_producer_producer_metrics_record_size_avg{job="schema"}

# The average number of requests sent per second to the broker.
rate(kafka_producer_producer_metrics_request_total{job="schema"}[1m])

# The average number of records per request.
kafka_producer_producer_metrics_request_size_avg{job="schema"}

# The average number of responses received per second from the broker.
rate(kafka_producer_producer_metrics_response_total{job="schema"}[1m])

# The current number of active connections.
kafka_producer_producer_metrics_connection_count{job="schema"}

# New connections established per second in the window.
kafka_producer_producer_metrics_connection_creation_rate{job="schema"}

# Connections closed per second in the window.
kafka_producer_producer_metrics_connection_close_rate{job="schema"}

# The fraction of time the I/O thread spent doing I/O.
kafka_producer_producer_metrics_io_ratio{job="schema"}

# The average length of time the I/O thread spent waiting for a socket ready for reads or writes in nanoseconds.
kafka_producer_producer_metrics_io_wait_time_ns_avg{job="schema"}

# The average length of time for I/O per select call in nanoseconds.
kafka_producer_producer_metrics_io_time_ns_avg{job="schema"}

# The fraction of time the I/O thread spent waiting.
kafka_producer_producer_metrics_io_wait_ratio{job="schema"}

# Number of times the I/O layer checked for new I/O to perform per second.
kafka_producer_producer_metrics_select_rate{job="schema"}

jetty

# The number of requests in the jetty thread pool queue
kafka_schema_registry_jetty_metrics_request_queue_size{job="schema"}

jersey

# The average number of HTTP requests per second
kafka_schema_registry_jersey_metrics_subjects_get_schema_request_rate{job="schema"}

# The average number of HTTP requests per second.
kafka_schema_registry_jersey_metrics_subjects_versions_deleteschemaversion_schema_request_rate

# The average number of HTTP responses per second.
kafka_schema_registry_jersey_metrics_schemas_get_schemas_response_rate

# The average number of requests per second that resulted in HTTP error responses
kafka_schema_registry_jersey_metrics_metadata_id_request_error_rate

# A bazillion metrics per endpoint measuring request counts, timings and error rates

Links: