Naming Kafka objects (III) – Kafka Connectors

We discussed naming conventions for Kafka topics and Kafka Producers/Consumers in the previous two posts. This time, we are focusing on Kafka Connect and the connectors running on it.

We will not discuss naming conventions related to Kafka Connect clusters (e.g., config/storage/offset topic names, group.id, etc.). They are normally managed by SysAdmin/DevOps teams, and these posts zoom in on developer-related naming conventions.

Kafka Connect in a few words

What is Kafka Connect?

Kafka Connect is a tool for streaming data between Apache Kafka and external systems, such as databases, cloud services, or file systems. It simplifies data integration by providing a scalable and fault-tolerant framework for connecting Kafka with other data sources and sinks, without the need to write custom code.

In other words, Kafka doesn’t exist in a vacuum; there are different “non-Kafka” systems that it needs to interact with, i.e., consume from and produce to. Kafka Connect simplifies this task massively by offering “connector plugins” that will translate between:

  • Protocols: SFTP->Kafka, Kafka->SFTP, Kafka->HTTP, Salesforce API->Kafka, etc.
  • Formats: CSV->Avro, Avro->JSON, Avro->Parquet, etc.

Theoretically, Kafka Connect can also translate between schemas (i.e., data mapping) via Single Message Transformations. However, I advise against using them except for the most trivial transformations.

Kafka Connect defines two types of connectors:

  • Source connectors consume data from non-Kafka systems (e.g., databases via CDC, file systems, other message brokers) and produce it to Kafka topics. These connectors are “Kafka Producers” with a client connection to the source data system.
  • Sink connectors consume data from Kafka topics and produce it to non-Kafka systems (e.g., databases, file systems, APIs). Internally, they work as “Kafka Consumers” with a client connection to the destination system.

Naming Conventions

Now that we have a basic understanding of Kafka Connect, let’s examine the most relevant settings that require precise, meaningful naming conventions.

connector.name

This is obviously the “number one” setting to define. A few things to consider:

  • It has to be globally unique within the Connect cluster. In other words, no two connectors in the cluster can share the same name.
  • It is part of the path to access information about the connector config and/or status via the Connect REST API (unless you use the expand option and get them all at once).
  • For Sink connectors, it determines the name of the default underlying consumer group (a connect- prefix plus the connector name). In other words, if your connector is called my-connector, the underlying consumer group will be called connect-my-connector by default (see the sketch below).
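
As an illustration, here is a minimal Java sketch (JDK 11+ HttpClient; the Connect host and connector name are hypothetical) showing the two places where the connector name surfaces: the Connect REST API path and the default consumer group derived from it.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ConnectorStatusCheck {
        public static void main(String[] args) throws Exception {
            String connectorName = "prod.payments.reconciliation.confirmed-payments-to-warehouse-v01"; // hypothetical
            String connectUrl = "http://connect.internal:8083";                                         // hypothetical

            // The connector name is part of the REST API path.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(connectUrl + "/connectors/" + connectorName + "/status"))
                    .GET()
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());

            // For sink connectors, the default underlying consumer group is derived from the name.
            System.out.println("Default consumer group: connect-" + connectorName);
        }
    }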

With that in mind, the proposed naming convention is as follows:

[environment].[domain].[subdomain(s)].[connector name]-[connector version]

  • environment: (Logical) environment that the connector is part of. For more details, see https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#environment
  • domain.subdomain(s): Leveraging DDD to organise your system “logically” based on business/domain components. Break it down into a domain and subdomains as explained in https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#domain-subdomain-s
  • connector-name: A descriptive name for what the connector is meant to do.
  • connector-version: As the connector evolves, we might need to run side-by-side versions of it or reset the connector completely by giving it a new version. Format: vXY (e.g., ‘v01’, ‘v14’). This field is not mandatory; you can skip it if this is the first deployment.
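
For example, a sink connector that ships confirmed payments to a data warehouse might be registered with a configuration along these lines (a hedged sketch; the connector class, topic, and names are hypothetical):

    import java.util.Map;

    public class ConnectorConfigExample {
        public static void main(String[] args) {
            // [environment].[domain].[subdomain(s)].[connector name]-[connector version]
            String connectorName = "prod.payments.reconciliation.confirmed-payments-to-warehouse-v01";

            Map<String, String> config = Map.of(
                    "name", connectorName,
                    "connector.class", "com.example.warehouse.WarehouseSinkConnector", // hypothetical class
                    "topics", "prod.shared.event.payments.reconciliation.payment-confirmed-by-payment-id",
                    "tasks.max", "2"
            );

            // This map would be POSTed as JSON to the Connect REST API (POST /connectors).
            config.forEach((k, v) -> System.out.println(k + "=" + v));
        }
    }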

Do we really need [environment]?

This is a legit question. We said that the connector name must be (globally) unique in a given Kafka Connect cluster where it is deployed. A Kafka Connect cluster can only be deployed against a single Kafka cluster. Therefore, it can only sit in a single (physical) environment. If that is the case, isn’t the “environment” implicit?

Not necessarily:

  1. Your Kafka cluster might be serving multiple logical environments (DEV1, DEV2, etc.). As a result, a single Kafka Connect cluster might be sitting across multiple logical environments even if it belongs to a single physical environment. In this deployment topology, you might have the same connector in multiple logical environments, which would require the [environment] component to disambiguate and guarantee uniqueness.
  2. Alternatively, you might deploy multiple Kafka Connect clusters, each serving a single logical environment, against a single (physical) Kafka cluster. You might be tempted to think that, in this scenario, [environment] is not needed since the connector name will be unique within its cluster. However, “behind the scenes”, sink connectors create a Kafka Consumer whose group name matches the connector name (plus a connect- prefix). Therefore, if multiple Connect clusters run a connector with the same name, they create the same consumer group, and all sorts of “issues” will arise (in practice, they end up forming one big consumer group that spans the topics of all the logical environments in that physical Kafka cluster).

In summary, if you don’t use any concept of “logical environment(s)” and can guarantee that a given connector will be globally unique in the Kafka cluster, you don’t need the [environment] component.

consumer.override.group.id

Starting with 2.3.0, client configuration overrides can be configured individually per connector by using the prefixes producer.override. and consumer.override. for Kafka sources or Kafka sinks respectively. These overrides are included with the rest of the connector’s configuration properties.

Generally, I don’t recommend playing with consumer.override.group.id. Instead, it is better to give an appropriate name to your connector (via connector.name), as per the previous section.

However, there might be scenarios where you can’t or don’t want to change your connector.name yet you still need to alter your default sink connector’s consumer group. Some examples:

  • You have already deployed your connectors without [environment] in your connector.name (or other components) and now you want to retrofit them into your consumer group.
  • You have strict consumer group or connector.name naming conventions that aren’t compatible with each other.
  • You want to “rewind” your consumer group but, for whatever reason, don’t want to change the connector.name.

In terms of a naming convention, I would recommend the simplest option possible:

[environment or any-other-component].[connector.name]

In other words, I believe your consumer group name should track as closely as possible your connector.name to avoid misunderstandings.
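
As a minimal sketch of the first scenario (all names are hypothetical), the override sits alongside the rest of the connector configuration and retrofits the logical environment into the consumer group without renaming the connector:

    import java.util.Map;

    public class GroupIdOverrideExample {
        public static void main(String[] args) {
            // Keep the overridden group.id tracking the connector name as closely as possible.
            Map<String, String> config = Map.of(
                    "name", "payments.reconciliation.confirmed-payments-to-warehouse-v01",
                    "consumer.override.group.id",
                    "dev2.payments.reconciliation.confirmed-payments-to-warehouse-v01"
            );
            config.forEach((k, v) -> System.out.println(k + "=" + v));
        }
    }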

consumer/producer.override.client.id

client.id was discussed in a previous post about Producer’s client.id and Consumer’s client.id.

As discussed in that post, it is responsible for a few things:

  • Shows up in logs to make it easier to correlate them with specific producer/consumer instances in an application with many of them (like a Kafka Streams app or a Kafka Connect cluster).
  • It shows up in the namespace/path for JMX metrics coming from producers and consumers.

With that in mind, knowing that we already have a pretty solid, meaningful and (globally) unique connector.name convention, this is how we can name our producer/consumer client.id values.

  • Source connector: set producer.override.client.id to {connector.name}-producer
  • Sink connector: set consumer.override.client.id to {connector.name}-consumer
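
Applied to a hypothetical source/sink pair (reusing the made-up connector names from the earlier sketches), the overrides would look like this:

    import java.util.Map;

    public class ClientIdOverridesExample {
        public static void main(String[] args) {
            String sourceName = "prod.payments.gateway.salesforce-accounts-source-v01";                // hypothetical
            String sinkName = "prod.payments.reconciliation.confirmed-payments-to-warehouse-v01";      // hypothetical

            Map<String, String> sourceOverride = Map.of("producer.override.client.id", sourceName + "-producer");
            Map<String, String> sinkOverride = Map.of("consumer.override.client.id", sinkName + "-consumer");

            System.out.println(sourceOverride);
            System.out.println(sinkOverride);
        }
    }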

Conclusion

We have discussed the most relevant properties that require naming conventions in Kafka Connect connectors. As usual, we aim to have semantically meaningful values that we can use to “reconcile” what’s running in our systems and what every team (and developer) owns and maintains.

By now, we can see a consistent naming approach emerging, rooted in environments, DDD naming conventions and some level of versioning (when required).

Naming Kafka objects (II) – Producers and Consumers

In a previous post in this “naming” series, we discussed how to name Kafka topics. The intention was to name them semantically meaningfully while avoiding collisions and ambiguity. The focus was on the “nature” of the topics’ data.

This post will discuss the two main “clients” connected to those topics: producers who write data into them and consumers who read data from them. The focus will move away from the “data” towards the applications involved in the data flow.

  1. Producers first
    1. client.id
      1. Organising your JMX metrics
      2. Naming convention
        1. Why do we need an entity/event name?
        2. Don’t we need a ‘version’ part?
    2. transactional.id
  2. Consumers second
    1. group.id
      1. Why do we need an entity/event name?
      2. Why versioning the group.id value?
      3. If I’m versioning my application, should I use it for the ‘version’ value?
    2. group.instance.id
    3. client.id
  3. Conclusions

Producers first

Nothing can be consumed unless it is produced first; therefore, let’s start with producers.

They do a very “simple” job:

  1. Pull metadata from the cluster to understand which brokers take the “leader” role for which topics/partitions.
  2. Serialise the data into byte arrays.
  3. Send the data to the appropriate broker.

In reality, it is much more complicated than this, but this is a good enough abstraction. Out of the dozens of configuration settings producers support, only two settings accept a “naming convention”.

client.id

The Kafka documentation defines client.id as follows:

An id string to pass to the server when making requests. The purpose of this is to be able to track the source of requests beyond just ip/port by allowing a logical application name to be included in server-side request logging.

We want a naming convention that makes mapping Producer applications to domains and teams easy. Furthermore, these names should be descriptive enough to understand what the Producer application aims to achieve.

Organising your JMX metrics

There is also an extra role that client.id plays that people tend to forget: it namespaces observability metrics. For example, producers emit metrics under the following JMX MBean namespaces:

  • kafka.producer:type=producer-metrics,client-id={clientId}
  • kafka.producer:type=producer-node-metrics,client-id={clientId},node-id=([0-9]+)
  • kafka.producer:type=producer-topic-metrics,client-id={clientId},topic={topic}

Notice how all of them use clientId as part of the namespace name. Therefore, if we don’t assign meaningful values to client.id, we won’t be able to distinguish the appropriate metrics when multiple producers consolidate their metrics into a single metrics system (like Prometheus), especially if they come from the same application (i.e., 1 application using N producers).

client.id also regularly features in other observability components like logs.

Naming convention

The proposed convention looks like this:

[environment]-com.[your-company].[domain].[subdomain(s)].[app name].[entity/event name]

  • [environment]: (Logical) environment that the producer is part of. For more details, see https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#environment
  • com.[your-company]: Follows a “Java-like” namespacing approach to avoid collisions with other components emitting metrics to the centralised metric database.
  • [domain].[subdomain(s)]: Leveraging DDD to organise your system “logically” based on business/domain components. Break it down into a domain and subdomains as explained in https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#domain-subdomain-s
  • [app-name]: The app name should be specific enough to make it easy to find the codebase involved and the team that owns it.
  • [entity/event-name]: Describes what information the producer is sending. It doesn’t need to include the full topic name since the context is already clear (e.g., payment, transaction, account). This field is not mandatory.
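
Putting the convention together, a minimal producer sketch (the bootstrap server, company name and app name are made up) would wire the resulting client.id in like this:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class PaymentProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.internal:9092"); // hypothetical
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            // [environment]-com.[your-company].[domain].[subdomain(s)].[app name].[entity/event name]
            props.put(ProducerConfig.CLIENT_ID_CONFIG,
                    "prod-com.acme.payments.reconciliation.payment-service.payment-confirmed");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // JMX metrics for this producer now appear under client-id=prod-com.acme.payments...
            }
        }
    }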

Why do we need an entity/event name?

When your application has multiple producers, client.id needs to be unique for each one. Therefore, the ‘entity/event’ in the last section of the client.id name disambiguates them. You don’t need to define an entity/event name if you only use one producer for the application.

Don’t we need a ‘version’ part?

Other naming conventions define a ‘version’ as part of their respective names. This is only necessary when the client is related to state; for example, Consumers and Streams apps must store committed offsets.

Producers, on the other hand, are completely stateless. Adding a ‘version’ part would only make sense if we keep multiple Producer application versions running side-by-side. Even then, one would argue that versioning the application itself would be a better strategy than versioning the Producer client.id.

transactional.id

The Kafka documentation defines transactional.id as follows:

The TransactionalId to use for transactional delivery. This enables reliability semantics which span multiple producer sessions since it allows the client to guarantee that transactions using the same TransactionalId have been completed prior to starting any new transactions. If no TransactionalId is provided, then the producer is limited to idempotent delivery. If a TransactionalId is configured, enable.idempotence is implied

There are a few “small” differences between client.id and transactional.id:

  1. client.id doesn’t need to be unique (but I strongly recommend it). transactional.id MUST be unique.
  2. client.id is more “visible” towards developers (through O11Y). transactional.id is mostly opaque, operating behind the scenes in the transaction management subsystem.
  3. client.id can change, although it would make your O11Y information very confusing. transactional.id MUST be stable between restarts.

Other than that, there is nothing special about transactional.id, so I recommend using the same naming convention that I have proposed for client.id in the section above.
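
For completeness, here is a minimal transactional producer sketch (the broker address, names and topic are hypothetical) showing where transactional.id is configured, following the same convention:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TransactionalProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.internal:9092"); // hypothetical
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            // Same convention as client.id; MUST be unique and stable between restarts.
            props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG,
                    "prod-com.acme.payments.reconciliation.payment-service.payment-confirmed");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.initTransactions();
                producer.beginTransaction();
                producer.send(new ProducerRecord<>(
                        "prod.shared.event.payments.reconciliation.payment-confirmed-by-payment-id",
                        "payment-123", "{...}"));
                producer.commitTransaction();
            }
        }
    }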

Consumers second

We have sorted producers, and they are happily producing data. It’s time to look at the other side: consumers.

They too do a very “simple” job:

  1. Get a bunch of topic/partitions assigned as part of the consumer group partition assignment process.
  2. Connect to the brokers acting as leaders for those topic/partitions.
  3. Regularly attempt to pull new data from the assigned topic/partitions.
  4. When there is something available, read it (as byte arrays) through the connection.
  5. When it arrives in the application space, deserialise the data into actual objects.

A few configuration settings play a role in this process.

group.id

The Kafka documentation defines group.id as follows:

A unique string that identifies the consumer group this consumer belongs to. This property is required if the consumer uses either the group management functionality by using subscribe(topic) or the Kafka-based offset management strategy.

We want a naming convention that makes mapping Consumer applications to domains and teams easy. Furthermore, these names should be descriptive enough to understand what the Consumer application aims to achieve.

The proposed naming convention is as follows:

[environment]-com.[company-name].[domain].[subdomain(s)].[app name].[entity/event-name]-[version]

  • [environment]: (Logical) environment that the consumer is part of. For more details, see https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#environment
  • com.[your-company]: Follows a “Java-like” namespacing approach to avoid collisions with other components emitting metrics to the centralised metric database.
  • [domain].[subdomain(s)]: Leveraging DDD to organise your system “logically” based on business/domain components. Break it down into a domain and subdomains as explained in https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#domain-subdomain-s
  • [app-name]: The app name should be specific enough to make it easy to find the codebase involved and the team that owns it.
  • [entity/event-name]: Describes what information the consumer is reading. It doesn’t need to include the full topic name since the context is already clear (e.g., payment, transaction, account). This field is not mandatory.
  • [version]: Only introduce or change this value if you need to run side-by-side versions of the app or simply start from scratch. Format: vXY (e.g., ‘v01’, ‘v14’). This field is not mandatory.
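
As a minimal sketch (the broker address, company, app and topic names are made up), the resulting group.id is wired in like this:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class PaymentConsumerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.internal:9092"); // hypothetical
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            // [environment]-com.[company-name].[domain].[subdomain(s)].[app name].[entity/event-name]-[version]
            props.put(ConsumerConfig.GROUP_ID_CONFIG,
                    "prod-com.acme.payments.reconciliation.reporting-service.payment-confirmed-v01");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("prod.shared.event.payments.reconciliation.payment-confirmed-by-payment-id"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.key() + " -> " + record.value());
                }
            }
        }
    }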

Why do we need an entity/event name?

When your application has multiple consumers, it needs a unique group.id for every one of them. Therefore, the ‘entity/event’ in the last section of the group.id name should disambiguate between them, and it becomes mandatory.

You don’t need to define an entity/event name if you only use one consumer for the application.

Why versioning the group.id value?

The Kafka Consumer uses group.id to define a consumer group for multiple instances of the same application. Those instances collaborate within the group, sharing partitions, picking up partitions from failed instances and committing offsets so other instances don’t process records that another instance has processed already.

Offsets are committed under the group.id name. Therefore, it is critical to use the same group.id value across application deployments to guarantee that the application continues to consume from where it left off.

However, there are times when we might want to effectively reset the consumer. The easiest way to do that is to change the group.id. In this case, we can use ‘version’ to create a new consumer group that ignores where the previous deployment’s instances got up to and falls back to auto.offset.reset to decide where to start consuming.
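
A short sketch of that reset, reusing the hypothetical group from the previous example: bumping the version creates a brand-new consumer group with no committed offsets, so auto.offset.reset decides where it starts.

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;

    public class GroupResetExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            // v01 -> v02: no committed offsets exist yet for the new group...
            props.put(ConsumerConfig.GROUP_ID_CONFIG,
                    "prod-com.acme.payments.reconciliation.reporting-service.payment-confirmed-v02");
            // ...so this setting decides where the new group starts consuming.
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            System.out.println(props);
        }
    }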

If I’m versioning my application, should I use it for the ‘version’ value?

Short answer: NO

Longer answer: you are probably (loosely) applying semantic versioning to your application; every merged PR will represent a new version. You don’t want to change your group.id every time your application version changes. The ‘version’ mentioned in the group.id is very specific to the consumer group and how it manages offsets. Don’t mix the two together.

group.instance.id

The Kafka documentation defines group.instance.id as follows:

A unique identifier of the consumer instance provided by the end user. Only non-empty strings are permitted. If set, the consumer is treated as a static member, which means that only one instance with this ID is allowed in the consumer group at any time. This can be used in combination with a larger session timeout to avoid group rebalances caused by transient unavailability (e.g. process restarts). If not set, the consumer will join the group as a dynamic member, which is the traditional behavior.

In other words, while group.id identifies 1 or more instances that belong to a consumer group, group.instance.id identifies unique instances.

The main purpose of group.instance.id is to enable static membership in the consumer group. This helps reduce group rebalances when instances become briefly unavailable. The assumption is that it is better to delay consumption of the partitions assigned to the temporarily missing instance than to rebalance the complete group, affecting all other instances.

I recommend using the same group.id naming convention PLUS something that identifies the instance uniquely and is stable between restarts.
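
For example, a hedged sketch assuming the instances run as a Kubernetes StatefulSet, whose pod hostnames provide an identifier that is unique per instance and stable across restarts:

    import java.net.InetAddress;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;

    public class StaticMembershipExample {
        public static void main(String[] args) throws Exception {
            String groupId = "prod-com.acme.payments.reconciliation.reporting-service.payment-confirmed-v01";
            // A StatefulSet pod name like "reporting-service-2" is stable across restarts.
            String instanceSuffix = InetAddress.getLocalHost().getHostName();

            Properties props = new Properties();
            props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
            props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, groupId + "." + instanceSuffix);
            // A larger session timeout gives a restarting instance time to come back
            // before a rebalance kicks in.
            props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "60000");
            System.out.println(props);
        }
    }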

client.id

client.id serves the exact same purpose in consumers and producers. Therefore, I will refer you to the previous section for producer’s client.id for a naming convention proposal. See https://javierholguera.com/2024/09/12/naming-kafka-objects-i-producers-and-consumers/#client-id

Conclusions

Naming is difficult and requires care. However, investing in good naming conventions reduces accidental complexity, helps with debugging and diagnosing your system, and supports development through its end-to-end lifecycle.

In this post, I proposed multiple naming conventions that aim to be semantically meaningful, allow you to organize your system into sensible components, and support your system’s incremental evolution.

Naming Kafka objects (I) – Topics

There are only two hard things in Computer Science: cache invalidation and naming things.

Phil Karlton

I started working with Apache Kafka in 2015 when I joined Funding Circle. We were early adopters of the technology (and the architectural approach that comes with it) and had the pleasure (not really) of working with pre-1.0 versions. Since then, I have worked with Kafka in every single company I have been part of.

Fast-forward almost 10 years, and I think I have accrued enough experience to take the risk of proposing solutions to one of the hardest problems in software engineering: naming things.

In this first post, I propose a naming convention for the most important object in a Kafka system: Kafka topics. In subsequent posts, I’ll cover other objects like consumer groups, Stream topologies, and Kafka Connect connectors.

  1. DDD all things
  2. Topic naming convention
    1. [environment]
    2. [visibility]
    3. [topic-type]
    4. [domain].[subdomain(s)]
    5. [record-name]
    6. [key-name]
    7. [topic-version]
  3. Conclusions

DDD all things

I asked ChatGPT for a brief definition of what DDD is (emphasis mine):

Domain-Driven Design (DDD) is a software development approach that focuses on modeling software based on the real-world domain it represents, emphasizing collaboration between technical experts and domain experts to create a shared understanding and ensure that the software reflects the core business logic and needs.

In other words, when applying DDD, you have to adopt the vocabulary of the business and structure your system based on how the company structures itself (and collaborates). This is very aligned with a sociotechnical approach to software architecture.

I like leveraging DDD to structure systems and communicate intent. Naming things is the cornerstone for both. Names (and namespaces) organise concepts (i.e., structure) while revealing what they are “about” (i.e., intent).

Topic naming convention

Kafka topics contain the information stored and streamed in Apache Kafka. In that sense, they are not too different from other data containers like database tables, messaging queues, S3 buckets, files, etc.

With that in mind, there are three factors to take into account:

  • A topic is named after the content (not after any given consumer or producer). If you think about it, you would never name a database table based on what reads/writes from it. Neither should you with Kafka topics.
  • The topic names must explicitly convey the topic content (i.e., what data is stored in there). One of Kafka’s selling points is the ability for producers and consumers to exchange information without coupling; topic names that clearly define data content underpin this exchange.
  • Your naming convention must avoid naming collisions. It should support the ability to choose the right name for a topic without worrying that this name might already be taken.

With that in mind, the following convention ticks all those requirements:


[environment].[visibility].[topic type].[domain].[subdomain(s)].[record name]-by-[key name]-[topic version]

Let’s break down each component.

[environment]

Mandatory: yes

The environment describes the “logical” environment that a topic belongs to.

Take into account that Kafka clusters can be expensive, whether self-hosted (e.g., K8S via the Strimzi Operator) or via SaaS (e.g., AWS MSK, Confluent Cloud). If you have a low-volume environment, like most pre-production environments, it makes economic sense to “reuse” the same physical cluster for multiple logical environments (e.g., DEV, QA, Sandbox, UAT, Pre-PROD), particularly in a climate where cost savings are king (see https://javierholguera.com/2024/07/02/software-engineering-trends-that-are-reverting-i/).

It also supports ephemeral environments if you achieve the necessary CI/CD maturity.

[visibility]

Mandatory: yes

David Parnas enunciated the principle of “information hiding” in 1972. It’s 2024, and we still haven’t found a better way to reduce coupling and complexity than “hiding what you don’t need to see”; in other words, encapsulation and abstraction are still king to reduce cognitive overload and improve maintainability. This is true for data as well, not just code/behaviour.

With that in mind, it makes sense to categorise topics depending on “who should have access to this data”. Kafka doesn’t have any such taxonomy built-in; instead, you can “play” with roles and ACLs to limit/grant access. However, that is more of an InfoSec approach than naturally categorising topics (and data) based on their nature. One can argue that if topics are properly categorised in terms of visibility, we are one step away from automating ACL access that enforces such visibility at the producer/consumer level.
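
To illustrate that automation step, here is a hedged AdminClient sketch (the broker address, principal and prefix are made up) that grants read access to every topic under a ‘shared’ visibility prefix:

    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.common.acl.AccessControlEntry;
    import org.apache.kafka.common.acl.AclBinding;
    import org.apache.kafka.common.acl.AclOperation;
    import org.apache.kafka.common.acl.AclPermissionType;
    import org.apache.kafka.common.resource.PatternType;
    import org.apache.kafka.common.resource.ResourcePattern;
    import org.apache.kafka.common.resource.ResourceType;

    public class VisibilityAclExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.internal:9092"); // hypothetical

            try (AdminClient admin = AdminClient.create(props)) {
                // "shared" visibility: this internal service may read any topic under the prefix.
                AclBinding readSharedTopics = new AclBinding(
                        new ResourcePattern(ResourceType.TOPIC, "prod.shared.", PatternType.PREFIXED),
                        new AccessControlEntry("User:reporting-service", "*",
                                AclOperation.READ, AclPermissionType.ALLOW));
                admin.createAcls(List.of(readSharedTopics)).all().get();
            }
        }
    }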

Another strong reason to categorise visibility is incompatible changes (backward and/or forward). When we know they apply to a topic with limited visibility, we can more easily manage the blast radius of such changes.

I propose the following visibility options.

  • External: For topics used to exchange data with systems outside yours (e.g., third parties, SaaS, customers) if they support a Kafka-based interface. These topics require strong backward and forward compatibility (unless otherwise negotiated with the other parties).
  • Shared: Topics that are visible to all other components of the system and are used to exchange data between system areas (similar to internal REST APIs). These topics require strong backward and forward compatibility.
  • Internal: Topics that should only be accessed within the same DDD domain or bounded context. Ideally, that would be a limited number of well-known applications that can be maintained and deployed together.
  • Private: Topics that are owned by a single application. They are similar to a private database/table that supports the functioning of a single microservice. Compatibility is decided on a per-application basis by the team maintaining it.

[topic-type]

Mandatory: yes

Not all messages are born equal. Clemens Vasters has a good introduction to different types of messages in his “Events, Data points and messages” talk from 2017. While Kafka is not the best option for some of these types of messages, we all end up “abusing” it to an extent, if anything, because having multiple messaging technologies in your stack might be worse than stretching one of them beyond its optimal use case.

With that in mind, and without going to the level of granularity that Clemens proposes, the list below covers the “most common” types. Choosing the right type helps convey a lot of meaning about the nature of the data that is stored in the topic.

  • Commands (topic-type: command): A command conveys the intention to change the state of the system (e.g., “ConfirmPayment”).
  • Events (topic-type: event): An event captures something that has happened and is being broadcast for other systems to consider (e.g., “Payment Confirmed”).
  • Entities (topic-type: entity): An entity groups fields that represent a single, consistent unit of information. When you design event-first architectures, events are modelled first (to support business processes), and entities “emerge” from them to group information, generally around DDD boundaries (for example, “Payment”, “Customer”, “Account”).
  • Change Data Capture (topic-type: cdc): CDC topics define a stream of changes happening to the state of a given entity, commonly represented as a database table. In other words, they are a stream of CRUD changes to a table. Capturing the before/after of the affected entity simplifies downstream consumers.
  • Notifications (topic-type: notification): Unlike a normal event, they don’t contain all the information that was captured as part of the event happening. Instead, they contain a link back to a place where the information was captured. They are useful when the payload would be “too big” to embed in the event, so instead, the consumer has the option to “trace back” to independent storage like S3 for full details.

[domain].[subdomain(s)]

Mandatory: yes

This part leverages DDD (or any other approach that you choose to organise your system) to guarantee there aren’t any naming collisions.

It is also quite powerful as a first layer to navigate the hundreds (or thousands) of topics in your system. If you combine visibility AND domain/subdomains(s), you can quickly identify the appropriate stream containing the information that your consumer requires.

For example, if your consumer needs to react to a payment being confirmed, the following topic would be quite easy to find.

prod.shared.event.payments.reconciliation.payment-confirmed-by-payment-id

You can search by environment, visibility, and domain/subdomain(s), which narrows the list down to, at most, a few dozen topics in “reconciliation” before you find the right one.
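
For instance, creating that topic programmatically with the AdminClient might look like this (a sketch; the broker address, partition count and replication factor are placeholders):

    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.internal:9092"); // hypothetical

            try (AdminClient admin = AdminClient.create(props)) {
                // [environment].[visibility].[topic type].[domain].[subdomain(s)].[record name]-by-[key name]
                NewTopic topic = new NewTopic(
                        "prod.shared.event.payments.reconciliation.payment-confirmed-by-payment-id",
                        6, (short) 3);
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }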

[record-name]

Mandatory: yes

A descriptive name for the data stored in the topic, similar to naming database tables.

The record name “grammar” will be influenced by the topic type.

  • Commands: Imperative verb + Noun (e.g., SendEmail, ApproveOrder).
  • Events: Noun + Past Tense (e.g., PaymentReceived, AccountClosed).
  • Entities: Noun (e.g., Order, Account, Payment, Customer).
  • Change Data Capture: The name of the source table.
  • Notifications: Noun + Past Tense, like ‘events’.

[key-name]

Mandatory: yes

The name of the field that serves as the key for the record. It is VERY important to clearly identify what field in the message payload is used as the key to partition records in the topic.

Remember that in Kafka, strict ordering is only guaranteed within a partition (not across partitions in a given topic). Therefore, whatever key is used to partition the records will be the ordering criterion.

For example, if we have an “entities” topic partitioned by the entity’s natural key, we know all changes to that entity will be strictly ordered. If, for any reason, we chose a different key, changes to the same entity might land on different partitions and be “consumed” in a different order than they happened.
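
A short sketch to make the point concrete (the broker address and topic name are hypothetical): keying every record by the payment id keeps all changes for a given payment on one partition and, therefore, strictly ordered.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class KeyedOrderingExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.internal:9092"); // hypothetical
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            String topic = "prod.shared.entity.payments.reconciliation.payment-by-payment-id"; // hypothetical
            String paymentId = "payment-123";

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Both updates share the same key, land on the same partition,
                // and are therefore consumed in the order they were produced.
                producer.send(new ProducerRecord<>(topic, paymentId, "{\"status\":\"PENDING\"}"));
                producer.send(new ProducerRecord<>(topic, paymentId, "{\"status\":\"CONFIRMED\"}"));
            }
        }
    }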

[topic-version]

Mandatory: no

The final field is only here because incompatible changes DO happen despite our best intentions to avoid them.

When we face such a change, we normally need to follow a process like this:

  1. Create a second version of the topic (i.e., -v2).
  2. Start a migrating topology that upcasts messages from -v1 to -v2 continuously.
  3. Point all our consumers to the new -v2 topic.
  4. Once the consumers are up-to-date, migrate all our producers to the -v2 topic.
  5. Remove -v1 topic.

Since this situation shouldn’t happen for most of your topics, this name component is optional (until you need it).

Conclusions

It is not a simple naming convention, and although it is unlikely, you must be careful not to overrun the topic name length limit (249 characters).

However, if you follow it thoroughly, you will find that what would be otherwise a mess of topics where it is impossible to find the right information becomes a structured, well-organised, self-serving data platform.