In a previous post in this “naming” series, we discussed how to name Kafka topics. The intention was to give them semantically meaningful names while avoiding collisions and ambiguity. The focus was on the “nature” of the topics’ data.
This post will discuss the two main “clients” connected to those topics: producers who write data into them and consumers who read data from them. The focus will move away from the “data” towards the applications involved in the data flow.
Producers first
Nothing can be consumed unless it is produced first; therefore, let’s start with producers.
They do a very “simple” job:
- Pull metadata from the cluster to understand which brokers take the “leader” role for which topics/partitions.
- Serialise the data into byte arrays.
- Send the data to the appropriate broker.
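The three steps above can be sketched in a few lines. This is a hypothetical illustration, not the real Kafka client API; the topic name, broker names, and function names are made-up placeholders.

```python
import json

# Step 1 (sketch): metadata pulled from the cluster, mapping each
# topic/partition to the broker currently acting as its leader.
leaders = {
    ("payments.transaction", 0): "broker-1",
    ("payments.transaction", 1): "broker-2",
}

def serialise(event: dict) -> bytes:
    # Step 2 (sketch): turn the event into a byte array (JSON for simplicity).
    return json.dumps(event).encode("utf-8")

def route(topic: str, partition: int) -> str:
    # Step 3 (sketch): pick the broker leading that topic/partition.
    return leaders[(topic, partition)]

payload = serialise({"amount": 10})
broker = route("payments.transaction", 1)
```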
In reality, it is much more complicated than this, but this is a good enough abstraction. Out of the dozens of configuration settings producers support, only two settings accept a “naming convention”.
client.id
The Kafka documentation defines client.id as follows:
An id string to pass to the server when making requests. The purpose of this is to be able to track the source of requests beyond just ip/port by allowing a logical application name to be included in server-side request logging.
We want a naming convention that makes mapping Producer applications to domains and teams easy. Furthermore, these names should be descriptive enough to understand what the Producer application aims to achieve.
Organising your JMX metrics
There is also an extra role that client.id plays that people tend to forget: it namespaces observability metrics. For example, producers emit metrics under the following JMX MBean namespaces:
kafka.producer:type=producer-metrics,client-id={clientId}
kafka.producer:type=producer-node-metrics,client-id={clientId},node-id=([0-9]+)
kafka.producer:type=producer-topic-metrics,client-id={clientId},topic={topic}
Notice how all of them use clientId as part of the namespace name. Therefore, if we don’t assign meaningful values to client.id, we won’t be able to distinguish the appropriate metrics when multiple producers consolidate their metrics into a single metrics system (like Prometheus), especially if they come from the same application (i.e., 1 application using N producers).
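To make this concrete, here is a small sketch of how client.id ends up embedded in the MBean name, so two producers inside the same application remain distinguishable. The company (“acme”) and domain names are made-up placeholders.

```python
# MBean name pattern from the producer metrics namespace (first of the
# three listed above), with client.id substituted in.
MBEAN = "kafka.producer:type=producer-metrics,client-id={client_id}"

# Two producers in one hypothetical app, disambiguated by entity name.
client_ids = [
    "prod-com.acme.payments.checkout.payment",
    "prod-com.acme.payments.checkout.refund",
]
mbeans = [MBEAN.format(client_id=c) for c in client_ids]
```

With meaningless client.id values (e.g., the client library’s auto-generated defaults), both MBean names would be indistinguishable at a glance in your metrics system.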
client.id also regularly features in other observability components, like logs.
Naming convention
The proposed convention looks like this:
[environment]-com.[your-company].[domain].[subdomain(s)].[app-name].[entity/event-name]
| Component | Description |
|---|---|
| [environment] | (Logical) environment that the producer is part of. For more details, see https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#environment |
| com.[your-company] | Follows a “Java-like” namespacing approach to avoid collisions with other components emitting metrics to the centralised metric database |
| [domain].[subdomain(s)] | Leveraging DDD to organise your system “logically” based on business/domain components. Break down into a domain and subdomains as explained in https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#domain-subdomain-s |
| [app-name] | The app name should be specific enough to make it easy to find the codebase involved and the team that owns it. |
| [entity/event-name] | Describes what information the producer is sending. It doesn’t need to include the full topic name since the context is already clear (e.g., payment, transaction, account). This field is not mandatory. |
Why do we need an entity/event name?
When your application has multiple producers, client.id needs to be unique for each one. Therefore, the ‘entity/event’ in the last section of the client.id name disambiguates them. You don’t need to define an entity/event name if you only use one producer for the application.
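A small helper makes the convention (and the optional entity/event part) concrete. This is a sketch; the company, domain, and app names are invented for illustration.

```python
def client_id(env, company, domain, subdomains, app, entity=None):
    # Builds a client.id following the proposed convention:
    # [environment]-com.[your-company].[domain].[subdomain(s)].[app-name].[entity/event-name]
    parts = [domain, *subdomains, app] + ([entity] if entity else [])
    return f"{env}-com.{company}." + ".".join(parts)

# Single producer: the entity/event part can be omitted.
single = client_id("prod", "acme", "payments", ["cards"], "authoriser")

# Multiple producers in one app: the entity disambiguates them.
multi = client_id("prod", "acme", "payments", ["cards"], "authoriser", "refund")
```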
Don’t we need a ‘version’ part?
Other naming conventions define a ‘version’ as part of their respective names. This is only necessary when the client is related to state; for example, Consumers and Streams apps must store committed offsets.
Producers, on the other hand, are completely stateless. Adding a ‘version’ part would only make sense if we keep multiple Producer application versions running side-by-side. Even then, one could argue that versioning the application itself would be a better strategy than versioning the Producer’s client.id.
transactional.id
The Kafka documentation defines transactional.id as follows:
The TransactionalId to use for transactional delivery. This enables reliability semantics which span multiple producer sessions since it allows the client to guarantee that transactions using the same TransactionalId have been completed prior to starting any new transactions. If no TransactionalId is provided, then the producer is limited to idempotent delivery. If a TransactionalId is configured, enable.idempotence is implied
There are a few “small” differences between client.id and transactional.id:
- client.id doesn’t need to be unique (but I strongly recommend it). transactional.id MUST be unique.
- client.id is more “visible” towards developers (through O11Y). transactional.id is mostly opaque, operating behind the scenes in the transaction management subsystem.
- client.id can change, although it would make your O11Y information very confusing. transactional.id MUST be stable between restarts.
Other than that, there is nothing special about transactional.id, so I recommend using the same naming convention that I have proposed for client.id in the section above.
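For example, a transactional producer’s configuration could reuse the client.id value as the base of its transactional.id. The “-0” instance suffix here is my own assumption (one way to keep the value unique per instance yet stable across restarts, e.g., a StatefulSet ordinal); the post’s convention covers only the base name.

```python
# Sketch of a producer config using the same convention for both IDs.
# All names are placeholders; "-0" is an assumed stable instance suffix.
base = "prod-com.acme.payments.cards.authoriser"

config = {
    "client.id": base,
    # Must be unique across all producer instances AND stable between
    # restarts of the same instance.
    "transactional.id": f"{base}-0",
}
```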
Consumers second
We have sorted producers, and they are happily producing data. It’s time to look at the other side: consumers.
They too do a very “simple” job:
- Get a bunch of topic/partitions assigned as part of the consumer group partition assignment process.
- Connect to the brokers acting as leaders for those topic/partitions.
- Regularly attempt to pull new data from the assigned topic/partitions.
- When there is something available, read it (as byte arrays) through the connection.
- When it arrives in the application space, deserialise the data into actual objects.
A few configuration settings play a role in this process.
group.id
The Kafka documentation defines group.id as follows:
A unique string that identifies the consumer group this consumer belongs to. This property is required if the consumer uses either the group management functionality by using subscribe(topic) or the Kafka-based offset management strategy.
We want a naming convention that makes mapping Consumer applications to domains and teams easy. Furthermore, these names should be descriptive enough to understand what the Consumer application aims to achieve.
The proposed naming convention is as follows:
[environment]-com.[your-company].[domain].[subdomain(s)].[app-name].[entity/event-name]-[version]
| Component | Description |
|---|---|
| [environment] | (Logical) environment that the consumer is part of. For more details, see https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#environment |
| com.[your-company] | Follows a “Java-like” namespacing approach to avoid collisions with other components emitting metrics to the centralised metric database |
| [domain].[subdomain(s)] | Leveraging DDD to organise your system “logically” based on business/domain components. Break down into a domain and subdomains as explained in https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#domain-subdomain-s |
| [app-name] | The app name should be specific enough to make it easy to find the codebase involved and the team that owns it. |
| [entity/event-name] | Describes what information the consumer is processing. It doesn’t need to include the full topic name since the context is already clear (e.g., payment, transaction, account). This field is not mandatory. |
| [version] | Only introduce or change this value if you need to run side-by-side versions of the app or simply start from scratch. Format: vXY (e.g., ‘v01’, ‘v14’). This field is not mandatory. |
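As with client.id, a small helper shows how the parts compose, including the optional entity and version. All names are invented placeholders.

```python
def group_id(env, company, domain, subdomains, app, entity=None, version=None):
    # Builds a group.id following the proposed convention:
    # [environment]-com.[your-company].[domain].[subdomain(s)].[app-name].[entity/event-name]-[version]
    gid = f"{env}-com.{company}." + ".".join([domain, *subdomains, app])
    if entity:
        gid += f".{entity}"      # disambiguates multiple consumers in one app
    if version:
        gid += f"-{version}"     # only set when resetting / running side-by-side
    return gid

gid = group_id("prod", "acme", "payments", ["settlement"], "settler",
               "transaction", "v01")
```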
Why do we need an entity/event name?
When your application has multiple consumers, it needs a unique group.id for every one of them. Therefore, the ‘entity/event’ in the last section of the group.id name should disambiguate between them, and it becomes mandatory.
You don’t need to define an entity/event name if you only use one consumer for the application.
Why versioning the group.id value?
The Kafka Consumer uses group.id to define a consumer group for multiple instances of the same application. Those instances collaborate within the group, sharing partitions, picking up partitions from failed instances and committing offsets so other instances don’t process records that another instance has processed already.
Offsets are committed under the group.id name. Therefore, it is critical to use the same group.id value across application deployments to guarantee that the application continues consuming from where it left off.
However, there are times when we might want to reset the consumer. The easiest way to do that is to change the group.id. In this case, we can use the ‘version’ part to create a new consumer group that ignores where the previous deployment’s instances got up to and falls back on auto.offset.reset to decide where to start consuming.
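An illustrative reset, using the made-up group.id from earlier: bumping only the version suffix yields a brand-new consumer group, so the offsets committed under the old name are ignored and auto.offset.reset decides the starting point.

```python
# Old and new group.ids differ only in the trailing version part.
old = "prod-com.acme.payments.settlement.settler.transaction-v01"

# Replace the last "-vXY" segment to form the reset group name.
new = old.rsplit("-", 1)[0] + "-v02"
```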
If I’m versioning my application, should I use it for the ‘version’ value?
Short answer: NO
Longer answer: you probably are (loosely) semantic versioning your application; every merged PR will represent a new version. You don’t want to change your group.id every time your application version changes. The ‘version’ mentioned in the group.id is very specific to the consumer group and how it manages offsets. Don’t mix the two together.
group.instance.id
The Kafka documentation defines group.instance.id as follows:
A unique identifier of the consumer instance provided by the end user. Only non-empty strings are permitted. If set, the consumer is treated as a static member, which means that only one instance with this ID is allowed in the consumer group at any time. This can be used in combination with a larger session timeout to avoid group rebalances caused by transient unavailability (e.g. process restarts). If not set, the consumer will join the group as a dynamic member, which is the traditional behavior.
In other words, while group.id identifies one or more instances that belong to a consumer group, group.instance.id uniquely identifies each instance.
The main purpose of group.instance.id is to enable static membership in the consumer group. This helps reduce group rebalances when instances become briefly unavailable. The assumption is that it is better to delay processing of the partitions consumed by the temporarily missing instance than to rebalance the whole group, affecting all other instances.
I recommend using the same group.id naming convention PLUS a suffix that identifies the instance uniquely and is stable between restarts.
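Concretely, that could look like the sketch below. The group name is the made-up example from earlier, and the “settler-0” suffix is an assumed stable per-instance identifier (e.g., a StatefulSet pod ordinal); any value that is unique per instance and survives restarts works.

```python
# group.id built with the proposed convention (placeholder names).
group = "prod-com.acme.payments.settlement.settler.transaction-v01"

# Assumed restart-stable instance identifier, e.g. a pod ordinal.
instance = "settler-0"

# Static membership: group.id PLUS a unique, stable instance suffix.
group_instance_id = f"{group}-{instance}"
```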
client.id
client.id serves the exact same purpose in consumers as in producers. Therefore, I will refer you to the earlier section on the producer’s client.id for a naming convention proposal. See https://javierholguera.com/2024/09/12/naming-kafka-objects-i-producers-and-consumers/#client-id
Conclusions
Naming is difficult and requires care. However, investing in good naming conventions reduces accidental complexity, helps with debugging and diagnosing your system, and supports development through its end-to-end lifecycle.
In this post, I proposed multiple naming conventions that aim to be semantically meaningful, allow you to organize your system into sensible components, and support your system’s incremental evolution.