Troubleshooting and Deep Dive for Kafka

What is Kafka:

“Kafka™ is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.” - Official Kafka Site

Kafka is essentially a powerful, fast message broker used to transfer payloads/messages from many applications to many applications. It is a Java-based application that exposes its metrics through MBeans.

Kafka Components:

There are four main components to Kafka:

  • Broker: The cluster of nodes responsible for the mechanisms to write and read messages. (The main piece of Kafka; always Java, usually coordinated by Apache Zookeeper.)
  • Producer: The application(s) that write the messages you are interested in. (Most commonly Java, but possibly other languages.)
  • Consumer: The application(s) that read those messages. (Most commonly Java, but possibly other languages.)
  • Topics: The mailboxes of messages that Producers write to and Consumers subscribe to. When writing or reading a message in Kafka, you must specify which "topic" to use. You can think of a topic like a channel in Slack: you join the ones you want to post to and read from. Each topic also tracks offsets, which tell a consumer how many messages it has read and how many are left to read (see the example commands after this list).
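
For a quick, concrete feel for topics, the command-line tools that ship with Kafka let you post to and read from a topic by name. A minimal sketch (exact flags vary by Kafka version; my_topic and the localhost addresses are placeholders):

# Write messages to a topic (each line typed becomes one message)
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my_topic

# Read the same topic back from offset 0
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic my_topic --from-beginning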

A fuller dive into Kafka can be found in the official Kafka documentation, as well as in a Datadog blog post on monitoring Kafka.

Datadog Kafka Integrations:

It is important to note that we currently have two distinct Kafka integrations: the first is simply named Kafka, while the second is Kafka_Consumer.

The Kafka integration uses Datadog's JMXFetch application to pull metrics, just like our other Java-based integrations such as Cassandra, JMX, and Tomcat. Metrics are pulled from MBeans; the engineering team has included a list of commonly used MBeans in the kafka.yaml file. You can extend this list with any other beans you would like, for example if your version of Kafka exposes additional metrics.
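
A minimal sketch of what such an extension might look like under init_config in kafka.yaml (the bean and alias below are illustrative, not part of the default config; confirm the MBean exists in your Kafka version, e.g. with JConsole):

init_config:
  conf:
    - include:
        domain: 'kafka.server'
        # illustrative bean; check what your broker actually exposes
        bean: 'kafka.server:type=ReplicaManager,name=PartitionCount'
        attribute:
          Value:
            metric_type: gauge
            alias: kafka.replication.partition_count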

The Kafka_Consumer integration collects metrics like our standard Python-based checks do, using an internal Zookeeper API. Zookeeper is an Apache application responsible for managing the configuration of the cluster of nodes that makes up the Kafka broker. (In version 0.9 of Kafka things are a bit different: Zookeeper is no longer required; see the Troubleshooting section for more information.) This check picks up only three metrics, and they do not come from JMXFetch.
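
For reference, the check's connection settings live in kafka_consumer.yaml, one instance per cluster. A minimal sketch (hosts and ports are placeholders):

init_config:

instances:
  - kafka_connect_str: localhost:9092   # Kafka broker
    zk_connect_str: localhost:2181      # Zookeeper, which the check queries for offsets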

Troubleshooting:

There are a few common issues you may face with the Kafka integrations. Here is a list of the ones most frequently affecting users.

1) The most common issue stems from a newer feature of Kafka that Datadog does not yet support, and it specifically affects the Kafka_Consumer integration. Since version 0.9 of Kafka, consumer offsets can be stored within Kafka itself, no longer requiring the Zookeeper application. When offsets are stored this way, the kafka.consumer.offset metric from the Kafka_Consumer integration is not available, because we calculate it from values pulled out of Zookeeper. There are efforts underway to support this, but it is not available yet. In the meantime, you can use LinkedIn's Burrow and the community-written Datadog plugin for it to get these Kafka consumer metrics.

2) The second most common issue is the following error for the Kafka Integration:

instance #kafka-localhost-<PORT_NUM> [ERROR]: 'Cannot connect to instance localhost:<PORT_NUM>. java.io.IOException: Failed to retrieve RMIServer stub

This error means that the Datadog Agent is unable to connect to the Kafka instance to retrieve metrics from the exposed MBeans over the RMI protocol.

It can be resolved by including the following JVM (Java Virtual Machine) arguments when starting the Kafka instance. (This is required separately for the Producer, Consumer, and Broker, as they are all distinct Java instances.)

-Dcom.sun.management.jmxremote.port=<PORT_NUM> -Dcom.sun.management.jmxremote.rmi.port=<PORT_NUM>
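
In practice, the start scripts that ship with Kafka honor environment variables for this, so you do not have to edit the scripts themselves. A minimal sketch for a broker (port 9999 is a placeholder, and disabling authentication/SSL as shown is only appropriate on a trusted network):

# assumes the standard Kafka distribution scripts
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.port=9999 \
  -Dcom.sun.management.jmxremote.rmi.port=9999"
bin/kafka-server-start.sh config/server.properties

Whichever port you choose must match the port configured for that instance in kafka.yaml.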


3) The next issue affects the Kafka integration: users may not see Consumer and Producer metrics in their account. By default, we only collect broker-based metrics.

Additionally, there are cases where users run custom Producer and Consumer clients that are not written in Java and/or do not expose MBeans; for those clients, enabling this collection would still yield zero metrics. If you are running Java-based Producers and Consumers, you can start pulling in their metrics by uncommenting this section of the kafka.yaml file and pointing the Agent to the proper ports:

#   - host: remotehost
#     port: 9998 # Producer
#     tags:
#       kafka: producer0
#       env: stage
#       newTag: test
#   - host: remotehost
#     port: 9997 # Consumer
#     tags:
#       kafka: consumer0
#       env: stage
#       newTag: test
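
After uncommenting and editing this section, restart the Agent so the new instances are picked up, and confirm they report without errors. On a typical Linux install of Agent v5 (service commands vary by platform and Agent version):

sudo /etc/init.d/datadog-agent restart
sudo /etc/init.d/datadog-agent info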


4) This issue is specific to the Kafka_Consumer check. If you specify a partition in your kafka_consumer.yaml file that doesn't exist in your environment, you will see the following error in info.log:

instance - #0 [Error]: ''

The solution is to specify only partitions that actually exist for your topic. This corresponds to this line:

# my_topic: [0, 1, 4, 12]
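
For context, here is roughly how that line sits inside kafka_consumer.yaml under consumer_groups (the group and topic names are placeholders; the partition list must contain only partitions that actually exist for the topic):

consumer_groups:
  my_consumer_group:            # name of the consumer group
    my_topic: [0, 1, 4, 12]     # topic mapped to its existing partitions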

Feel free to email help@datadog.com should any new Kafka or Kafka_Consumer issues arise!
