This post is Part 1 of a 3-part series about monitoring Apache Kafka. [Part 2][part2] is about collecting operational data from Kafka, and [Part 3][part3] details how to monitor Kafka with Datadog.
Kafka is a distributed, partitioned, replicated [log][what-is-a-log] service developed by LinkedIn and open-sourced in 2011. In essence, it is a massively scalable pub/sub message queue architected as a distributed transaction log. It was created to provide "a unified platform for handling all the real-time data feeds a large company might have".[1][design-motivation]
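To make the idea of a message queue built on a log more concrete, here is a minimal single-process sketch in Python. It is not Kafka's API: `PartitionLog` and `Subscriber` are hypothetical names chosen for illustration, and distribution, partitioning, and replication are left out entirely. It only shows the core property of a log, namely that records are appended in order, addressed by offset, and not removed when they are read.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy in-memory stand-in for a single log (hypothetical; not Kafka's API)."""
    records: list = field(default_factory=list)

    def append(self, message: str) -> int:
        """Append a message to the end of the log and return its offset."""
        self.records.append(message)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 10) -> list:
        """Return up to max_records messages starting at offset; nothing is removed."""
        return self.records[offset:offset + max_records]


@dataclass
class Subscriber:
    """Each subscriber tracks its own position, so readers do not affect one another."""
    name: str
    offset: int = 0

    def poll(self, log: PartitionLog) -> list:
        """Read the records after this subscriber's position and advance past them."""
        batch = log.read(self.offset)
        self.offset += len(batch)
        return batch


log = PartitionLog()
for event in ("signup", "login", "purchase"):
    log.append(event)

billing = Subscriber("billing")
analytics = Subscriber("analytics")

first = billing.poll(log)     # billing reads the three records written so far
log.append("logout")
second = billing.poll(log)    # billing picks up only the record appended since
replay = analytics.poll(log)  # analytics starts at offset 0 and reads everything
```

Because reading does not consume records, a subscriber that starts late (here, `analytics`) can still replay the log from the beginning, which is the main behavioral difference from a queue that deletes messages on delivery.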
There are a few key differences between Kafka and other queueing systems like [RabbitMQ], [ActiveMQ], or [Redis's Pub/Sub][Redis]:
- As mentioned above, it is fundamentally a replicated log service.
- It does [not use AMQP][not-AMQP] or any other pre-existing protocol for communication. Instead, it uses a custom binary TCP-based protocol.
- It is [very fast][kafka-benchmark], even in a small cluster.
- It has str