A mental model for Data Integrations

If you’ve ever worked on a data integration project and felt overwhelmed by all the moving parts, you’re not alone. It’s easy to get lost in the details or to over-engineer parts that don’t really need it.

This post introduces a simple model that can help you make sense of it all. By breaking data integrations into three clear levels (protocol, format, and schema), you’ll be better equipped to focus your energy where it matters and rely on proven tools where it doesn’t.

What is a “Data Integration”?

Whenever two systems connect or integrate, they inevitably share some form of information. For example, if System A requests System B to process a payment, System A must provide certain data for the payment to proceed.

However, this kind of functional exchange is not what I define as a “Data Integration.” Instead, a Data Integration occurs when the primary goal is to share data itself, rather than invoking another system’s functionality. In other words, Data Integration refers specifically to integrations where one system consumes data provided by another, rather than utilizing that system’s behavioral capabilities.

We now live in a world dominated by data-intensive applications, making Data Integrations the most critical form of system integration. These integrations have displaced traditional functional integrations, which used to prevail when data was less ubiquitous and valuable.

A 3-level model for data integrations

Every data integration can be broken down into three levels or layers, each with distinct responsibilities (and constraints). The beauty of it is that they are (almost) completely independent of each other and exchangeable.

I like my data integrations like I like my cake: layered

Protocol Level

This is the foundational level and it is all about “exchanging bytes” between systems, without concern for anything else.

This layer is typically hidden behind higher-level abstractions and, while all abstractions leak, you can be generally confident that you won’t need to look under the hood to check out what is happening here and how it works. Things just work.

Examples of protocols:

  • HTTP(S), which powers all REST-based APIs.
  • SMTP to exchange emails.
  • (S)FTP to exchange files.
  • Proprietary protocols, like those powering technologies like SQL Server, PostgreSQL, Apache Kafka, Google’s gRPC, etc. They tend to be built on top of TCP.

As with all technologies, the more established a protocol is, the stronger and more reliable its abstractions become, requiring you to know less about underlying details.

Moreover, technology is subject to the Lindy Effect (the idea that longevity indicates future durability); therefore, investing time in learning these protocols will yield the highest ROI for your professional career.

Format Level

The next level is all about giving shape to those bytes. In other words, if we were to make an analogy with human communication, protocols are about deciding whether to talk by phone or to exchange letters, while formats are about choosing the language used (e.g., English, Spanish).

There is a long tail of formats, some more established, some more recent (and immature). There is also a surprising amount of “variability” in some well established formats when it comes to what is (generally) considered acceptable (for example, JSON’s loose specification has introduced significant variability).

Examples of formats:

  • Good old text in UTF-8 or other Unicode encodings (e.g., UTF-16, UTF-32).
  • Internet/Web formats like JSON or XML.
  • File exchange formats like CSV.
  • Modern formats for Big Data (Avro, Parquet), RPC (Protocol Buffers, Thrift, MessagePack), etc.
  • Proprietary formats, like what technologies like SQL Server or PostgreSQL use to send information back and forth between client and server.

While some of these formats will get regularly paired with certain protocols (for example, HTTP/JSON, CSV/SFTP, Kafka/Avro, gRPC/Proto), in many cases they are completely decoupled and can be swapped. For example, it’s entirely possible to use XML with HTTP, upload Avro files to an SFTP server or write a CSV as a UTF-8 message to Kafka (if you are crazy enough to consider it).

Schema Level

The final level, where application-specific logic and business requirements become concrete. In other words, this is the level where your specific use case, business scenario, API specification, etc. becomes reality. No abstraction, no commoditization, you are doing all the work here.

In this level you need to do things like:

  • Data modelling to create schemas and data structures representing your business entities and relationships.
  • Data transformations to convert input schemas to output schemas, including rules for data enrichment, filtering, aggregation, or reformatting (see the sketch after this list).
  • Implement non-functional requirements like:
    • Identify sensitive and non-sensitive data, ensuring compliance with security standards.
    • Clarify how frequently data is refreshed (e.g., daily, hourly, near-real time).
    • Track where data originated and how it has been transformed or moved.
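
To make the schema-level work concrete, here is a minimal sketch of a transformation between an input and an output schema. Every type and field name below (InboundPayment, LedgerEntry, etc.) is hypothetical, invented purely for illustration:

```java
import java.math.BigDecimal;
import java.time.Instant;

// Hypothetical input schema, as received from the upstream system.
record InboundPayment(String id, String payerEmail, BigDecimal amount, String currency) {}

// Hypothetical output schema, as required by the downstream consumer.
record LedgerEntry(String paymentId, BigDecimal amountInCents, String isoCurrency, Instant recordedAt) {}

class PaymentTransformer {
    // Schema-level logic: field mapping, enrichment (a processing timestamp)
    // and deliberate filtering of sensitive data (payerEmail is dropped).
    LedgerEntry transform(InboundPayment in) {
        return new LedgerEntry(
            in.id(),
            in.amount().movePointRight(2),  // reformatting: units -> cents
            in.currency().toUpperCase(),    // normalising the currency code
            Instant.now()                   // enrichment: processing time
        );
    }
}
```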

How is this model useful?

There are several reasons why this three-level model is particularly useful for Data Integrations:

Clear separation of concerns

Each level has distinct responsibilities, making complex integrations easier to understand and manage. Engineers can tackle each level independently, enabling different people to work on different levels in parallel without conflicts.

Improved reusability

Since each level is independent and components at any level can easily be swapped, it becomes simpler to reuse existing code, expertise, and infrastructure.

For example, if an existing integration uses HTTP with JSON, it requires minimal effort to replace the JSON serializer with an XML serializer, while continuing to leverage the existing protocol-level (HTTP) and schema-level implementations.
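
As a minimal sketch of why this swap is cheap, imagine the format level hidden behind a small interface. The names below (PayloadFormat, JsonFormat, XmlFormat) are hypothetical, and Jackson is just a stand-in for whatever serialisation library your stack already uses:

```java
// The format level as an interface: the protocol and schema levels
// only ever see bytes and domain objects, never JSON or XML specifics.
interface PayloadFormat {
    byte[] serialize(Object value) throws Exception;
}

// Hypothetical JSON implementation, backed by Jackson.
class JsonFormat implements PayloadFormat {
    private final com.fasterxml.jackson.databind.ObjectMapper mapper =
            new com.fasterxml.jackson.databind.ObjectMapper();
    public byte[] serialize(Object value) throws Exception {
        return mapper.writeValueAsBytes(value);
    }
}

// Swapping JSON for XML touches only this one binding; the HTTP
// (protocol-level) and schema-level code remain untouched.
class XmlFormat implements PayloadFormat {
    private final com.fasterxml.jackson.dataformat.xml.XmlMapper mapper =
            new com.fasterxml.jackson.dataformat.xml.XmlMapper();
    public byte[] serialize(Object value) throws Exception {
        return mapper.writeValueAsBytes(value);
    }
}
```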

Targeted commoditization

The Protocol level, as defined, is the most mature and heavily abstracted. Similarly, the Format level is also mature in most scenarios. This maturity drives commoditization, enabling the use of standard, off-the-shelf technologies to handle conversions between protocols and formats with minimal custom code.

For instance, technologies like Azure Data Factory or Apache Airflow can convert seamlessly between SMTP and HTTP, or between XML and JSON, using no-code or low-code interfaces, provided schema-level details remain the same.

This commoditization accelerates Data Integration development and allows engineers to concentrate on schema-level transformations, where the real business logic resides.

Summary

In this post, I shared a model for thinking about Data Integrations that has served me well over the years. It may not be the most sophisticated, but its simplicity makes it practical: it helps you distinguish the parts that truly require your attention (like the schema level) from those you can reliably delegate to off-the-shelf technology (such as the protocol and format levels).

The “Passive-Aggressive” Event

Many developers believe that events inherently offer advantages over other message types, such as commands, making them “better” by default. In other words, they think that messaging between services should heavily rely on events while minimising the use of commands.

In this article, I aim to dispel that myth. As is often the case, the correct answer is: “It depends.” Neither events nor commands are inherently superior; each is best suited to specific scenarios.

  1. Commands are from Mars, Events are from Venus
  2. Coming from the left or the right side
  3. How does a “good” event look?
  4. What is the “passive-aggressive” event?
    1. Modelling with an event
    2. Using a command instead
    3. Why does it matter?
  5. Events enable decoupling, commands don’t?

Commands are from Mars, Events are from Venus

Clemens Vasters, Principal Architect of Azure Messaging Systems (Azure Service Bus, Azure Event Hubs, etc.), proposed a classification of distributed messaging types that I find particularly insightful. He divides them into two categories:

  1. Intents (left column): Reflect a transfer of control from one service to another, such as commands and queries.
  2. Facts (right column): Represent something that has happened in the past and, as such, cannot be retracted. Domain events (such as in Event Sourcing) fall into this category.

Coming from the left or the right side

Most developers start their distributed messaging journey favouring one side over the other: either systems dominated by “intents” (especially commands) or fundamentally “event-driven” choreographies and orchestrations.

Until about five years ago, your first experience with distributed messaging was likely using brokers like SQS or RabbitMQ to queue commands and load-balance work across worker pools. Applications at the time were less data-intensive and more focused on behaviour and interaction.

Then came the Big Data movement, “Designing Data-Intensive Applications” (DDIA), and Kafka. Suddenly, we modelled systems as data flows represented by events. Data was king, and systems became a series of computations built on top of data movements.

Regardless of how your journey began, it likely left you with a “blind spot” toward the other side. If you started with events, everything looked like an event. If you started with commands, everything looked like a command.

I know this firsthand; I spent ten years primarily building event-driven systems, where events were foundational and Kafka was the default transport layer. Over time, I noticed the emergence of a problematic pattern I call the “passive-aggressive” event.

How does a “good” event look?

Before we understand the “passive-aggressive” event anti-pattern, we need to define what a good event is. Jonas Bonér offers a helpful perspective in his legendary Voxxed Days talk on how events are reshaping modern systems.

In short:

  • Events represent facts from the past and, therefore, cannot be changed. However, they can be corrected and accumulated.
  • Events are broadcast without any expectations from the producer about what consumers will do. In fact, the producer neither knows nor cares who the consumers are (Bonér describes this as “anonymous”).

This model supports greater system autonomy and decoupling. However, “can” is the operative word—abusing events can lead to an “insidious” form of coupling.

What is the “passive-aggressive” event?

In simple terms, a “passive-aggressive” event is a command improperly modelled as an event. The giveaway is when the producer cares about the consumer’s reaction or expects feedback.

In proper event modelling, the producer should have zero expectations about how or whether the event is handled. In a passive-aggressive event, a hidden “backchannel” exists where the producer expects actions or responses from the consumer.

Let’s explore this with an example.

Imagine an application that manages payments. When money is received, it is flagged as “Received.” Upon reception, we want to notify the sender by email. Once the email is sent, the payment is moved to “Confirmed.”

There are two services:

  • A Payment Service that manages the payment lifecycle.
  • An Email Service responsible solely for sending emails.

Modelling with an event

Let’s model this interaction with an event first.

Flow:

  1. A user sends a payment.
  2. The Payment Service receives and stores it as “Received.”
  3. It broadcasts an event: “PaymentReceived.”
  4. The Email Service consumes the event, sends an email, and then emits a “PaymentConfirmed” event or calls the Payment Service directly to update the payment state.
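
A minimal sketch of this flow, with hypothetical types, makes the problem visible. Note the final step, where the Email Service reaches back into the payment lifecycle:

```java
// Broadcast by the Payment Service: so far, a legitimate event.
record PaymentReceived(String paymentId, String senderEmail) {}

// Emitted by the Email Service to move the payment to "Confirmed".
record PaymentConfirmed(String paymentId) {}

class EmailServiceConsumer {
    void onPaymentReceived(PaymentReceived event) {
        sendEmail(event.senderEmail(), "Your payment was received");
        // The hidden backchannel: the Email Service now owns a piece of
        // the payment lifecycle (moving it to "Confirmed"), business
        // logic that has leaked out of the Payment Service.
        publish(new PaymentConfirmed(event.paymentId()));
    }

    private void sendEmail(String to, String subject) { /* SMTP call */ }
    private void publish(Object event) { /* broker publish */ }
}
```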

Using a command instead

Let’s model it with a command now.

Flow:

  1. A user sends a payment.
  2. The Payment Service receives and stores it as “Received.”
  3. The Payment Service sends a command to the Email Service: “SendConfirmationEmail.”
  4. The Email Service processes the request and responds with a success or failure message.
  5. The Payment Service updates the payment state to “Confirmed” based on the response.
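
And a matching sketch of the command-based flow, again with hypothetical types. The Email Service now exposes a narrow, reusable contract and knows nothing about payment states:

```java
// Sent by the Payment Service: an explicit request with an expected reply.
record SendConfirmationEmail(String recipient, String templateId) {}
record EmailResult(boolean success) {}

class EmailService {
    // The Email Service's entire contract: send an email, report back.
    EmailResult handle(SendConfirmationEmail command) {
        return new EmailResult(sendEmail(command.recipient(), command.templateId()));
    }
    private boolean sendEmail(String to, String templateId) { return true; /* SMTP call */ }
}

class PaymentService {
    // The payment lifecycle stays fully encapsulated here.
    void onEmailResult(String paymentId, EmailResult result) {
        if (result.success()) markConfirmed(paymentId);
    }
    private void markConfirmed(String paymentId) { /* update state to "Confirmed" */ }
}
```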

Why does it matter?

The second model maintains better encapsulation (a principle emphasised by David Parnas since the 1970s). In the event model, business logic leaks into the Email Service, breaking the Payment Service’s encapsulation and increasing the cost of change.

Furthermore, the Email Service becomes less reusable because it must understand business logic unrelated to email delivery.

If this approach is repeated across integrations—e.g., linking Email Service with User Account Service, Bank Account Service, and so on—it becomes unsustainable, leading to a “Distributed Monolith”: the worst aspects of monoliths and microservices combined.

Clear boundaries and separation of concerns (echoing SOLID’s Single Responsibility Principle and DDD’s Bounded Contexts) improve organisational flow and reduce cognitive load. Clean, explicit contracts facilitate easier understanding and modification.

Events enable decoupling, commands don’t?

Many like to repeat this mantra, but it is misleading.

Encapsulation is what enables decoupling. Encapsulation allows systems to integrate via low-touch interfaces, exposing only what is necessary.

In our example, using a command allows the Payment Service to know only the minimal information needed to send an email. The Email Service is concerned solely with sending emails, not with the reason behind the email being sent.

Does this mean we should always prefer commands to events? Absolutely not. When it’s safe to simply publish an event and let downstream consumers react (or not), events are preferable. However, we must be cautious not to inadvertently increase coupling through hidden dependencies and back channels.

A few broad rules about microservices

I know, we don’t like microservices anymore, and they are out of fashion. However, I still think about how I was trying to build them until the sudden change in industry fashion tried to convince me that building modular monoliths is entirely different from how monoliths were meant to be built in the past.

In this post, I want to reflect on a few broad rules I follow(ed) when building microservices. I believe broad rules and flexible heuristics are appropriate when making architectural decisions because context (e.g., business, technical, human) is more critical than (blindly) following rules.

  1. What is a microservice?
  2. Rule 1 – Teams don’t share microservices
  3. Rule 2 – Microservices don’t share (private) data storage
    1. Public vs private data storage
    2. Avoiding resource contention
  4. Rule 3 – Avoid distributing transactions through the network
  5. Rule 4 – Network latency adds up
  6. Rule 5 – No Service to Service calls between services
  7. Conclusions

What is a microservice?

We should first qualify what a microservice represents.

Most people equate a microservice with a single service deployable. For example, my microservice is a REST API web server receiving traffic on port 8080. This is not the right way of thinking about microservices because microservices weren’t invented as a technical concept but as a sociotechnical one, a solution to the organisational scalability problem (in a technical context, nonetheless).

The rules that define microservice boundaries cannot be purely technical (albeit this is a crucial factor); they also need to incorporate aspects like team structure, cognitive load, product boundaries, geographical boundaries, etc.

Therefore, it becomes easier to define what a microservice is not:

  • A microservice is not a single process (e.g., a single JVM running in a container).
  • A microservice is not a single deployable (e.g., a single Docker image).
  • A microservice is not defined in a single repository (e.g., GitHub).

On the contrary, the following things are perfectly possible:

  • One microservice runs as multiple processes because, in its Kubernetes pod, we also have sidecar containers performing roles like Service Mesh, Ambassador, etc.
  • Multiple deployables working in coordination define one microservice. For example, in a CQRS or Event Sourcing architecture, you might separate your command handler and/or write-model producer from your query handler and/or event consumer.
  • If we choose to use a company-wide mono repo or product mono repos (all parts of a given product), one repository contains multiple microservices.

To summarise these rules, we can say:

A Microservice is an application (i.e., a logical boundary for behaviour represented by code and state resources) that is developed, deployed, and maintained together.

Rule 1 – Teams don’t share microservices

In other words, every microservice has only one team that owns it.

This principle doesn’t rule out any of the following:

  • Multiple teams (and their products) depending on a given microservice. This is fine because it doesn’t change the fact that only one team owns the microservice.
  • Multiple teams contribute to the microservice code. This, too, is okay, provided the team that owns the microservice is happy to support some kind of coordinated contribution process (like the Inner Source model).

For reasons of accountability, only one team must own a service. We trust teams to build services and run them in production; in exchange, we must empower them to make the right decisions (within whatever architectural framework the company has adopted).

This empowerment is voided if it doesn’t come with the necessary accountability for the consequences of those decisions. “Sharing” ownership, at a minimum, dilutes accountability; at worst, it completely prevents it.

Rule 2 – Microservices don’t share (private) data storage

Private data storage is where a single microservice stores information for execution. This can be master data for lookups, historical payments data for de-duplication and/or state validation, etc.

Microservices must be able to change the schema of their data storage freely; after all, it is their private concern.

We achieve this by ensuring that two microservices never share the same private data storage.

Public vs private data storage

It is important to differentiate what private data storage is and when we see it as public.

Private data storage is created to support the functionality of a microservice; when the microservice logic changes, the storage (might) also change. Private data storages contain private data designed to be consumed by the microservice but not shared or available to other microservices. Private data storages don’t make any promises regarding schema changes (e.g., backward compatibility, forward compatibility) beyond what the single microservice requires. To sum up, private data is an implementation concern.

Public data storage is the opposite of what is described above: it is designed for sharing, uses schemas with compatibility guarantees, and is easy to consume. In other words, public data is an API.

The following list contains some examples of public and private data storage:

  • Private Kafka topics (Private): Akin to tables in a private database. For an explanation of what a “private topic” means, see: https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#visibility
  • Shared/external Kafka topics (Public): Akin to REST API endpoints. For an explanation of what a “shared/external topic” means, see: https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#visibility
  • Private Blob Storage (S3/Azure Blob Storage) folders (Private): Only used by a single microservice and/or application.
  • Public Blob Storage (S3/Azure Blob Storage) folders (Public): Produced by one microservice, available for others to read.
  • Relational databases (Private): Only used by a single microservice and/or application.
  • NoSQL databases (Private): Only used by a single microservice and/or application.
  • Like-for-like shared/external Kafka topics sunk into databases (Public): A particular case of the above. If a topic producer decides to offer a “queryable” version of the same data as a (SQL/NoSQL) database, and it is captured as a like-for-like sink of a shared/external topic, it is public because its schema follows the same compatibility rules as the Kafka topic(s), and its database (and tables/containers) are provisioned exclusively for sharing purposes, not as a private concern of any specific microservice.

Avoiding resource contention

Another reason for separating microservices databases is to avoid resource contention.

In a scenario where two microservices share the same database (or other infrastructure resources), they can run into the Noisy Neighbour anti-pattern:

  1. Application A receives a spike of traffic/load and starts accessing the shared resources more intensely.
  2. Application B starts randomly failing when it cannot access the shared resources (or it takes longer than it can tolerate, leading to timeouts).

Ensuring every microservice accesses independent resources guarantees we don’t suffer these problems.

This principle can lead to increased infrastructure costs. For that reason, it is perfectly reasonable to consider the following exceptions:

  • Reuse underlying infrastructure in pre-production environments, where the consequences of occasional resource contention are not particularly worrying.
  • Reuse underlying infrastructure between services whose volume is expected to be low. As long as the microservices aren’t coupled at the logical level (i.e., the data itself, not the storage infrastructure), it is relatively easy to “separate” their infrastructure in the future if required (compared to separating them if coupled at the logical schema level).

For the last point, I would advise against doing this with microservices shared by organisationally distant teams (e.g., crossing department or division boundaries, minimal timezone overlap, or any other barrier that prevents fluid communication).

Rule 3 – Avoid distributing transactions through the network

I always recommend considering DDD heuristics to drive your microservice design. I use the DDD “Aggregate Root” concept to help me model microservices and their responsibilities. DDD defines “Aggregate Root” as follows:

An Aggregate Root in Domain Driven Design (DDD) is a design concept where an entity or object serves as an entry point to a collection of related objects, or aggregates. The aggregate root guarantees the consistency of changes being made within the aggregate by forbidding external objects from holding references to its members. This means all modifications within the aggregate are controlled and coordinated by the aggregate root, ensuring the aggregate as a whole remains in a consistent state. This concept helps enforce business rules and simplifies the model by limiting relationships between objects.
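
To make the concept concrete, here is a minimal, hypothetical sketch of a Payment aggregate root. All state transitions go through it, which is what makes it natural to keep inside a single microservice:

```java
// The aggregate root is the single entry point for state changes and
// enforces the invariants; callers never mutate its internals directly.
class Payment {
    enum Status { RECEIVED, CONFIRMED, REFUNDED }

    private final String id;
    private Status status;

    Payment(String id) {
        this.id = id;
        this.status = Status.RECEIVED;
    }

    void confirm() {
        if (status != Status.RECEIVED)
            throw new IllegalStateException("only received payments can be confirmed");
        status = Status.CONFIRMED;
    }

    void refund() {
        if (status != Status.CONFIRMED)
            throw new IllegalStateException("only confirmed payments can be refunded");
        status = Status.REFUNDED;
    }
}
```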

An aggregate root should always have one single “source of truth”, i.e., one microservice that manages its state (and modifications). We want this because it means we avoid (as much as possible) distributing transactions over multiple services (through the network).

The alternative (i.e., distributed transactions) suffers from a variety of problems:

  • Performance problems when leveraging Distributed Transaction Coordination technology like XA Transactions or Microsoft DTC (i.e., 2-phase commits).
  • Complexity when using patterns like Saga pattern and/or Compensating Transaction pattern.

Designing your Aggregate Roots perfectly doesn’t guarantee you won’t need some of those patterns. However, it will minimise how often you need them.

In summary, if your microservice setup splits an aggregate root, you are doing it wrong; you should “merge” those two services.

Rule 4 – Network latency adds up

Crossing the network is one of the slowest operations. It also introduces massive uncertainty and new failure scenarios compared to running a single process in the same memory space. Jonas Bonér has a fantastic talk about the “dangers” of networks’ non-deterministic behaviour compared to the “consistency” one can expect from in-memory communication.

This is true when you call other microservices (e.g., directly via REST or indirectly via asynchronous communication) and when talking to external infrastructure dependencies like databases.

When considering “dividing” your system into multiple microservices, consider the impact on end-to-end latency against any non-functional requirements for latency (e.g., 99th percentile latency).

Rule 5 – No Service to Service calls between services

This rule only applies if you are following a strict “Event-Driven Architecture”. Even in that scenario, there will be cases where S2S calls will be “necessary” to avoid unnecessary complexity.

One of the benefits of microservice architecture is the decoupling we get between services that depend on each other indirectly. In a monolith, all modules live and fail together, causing a large “blast radius” when something goes wrong (i.e., the whole thing fails).

In traditional microservices (e.g., synchronous communication via REST/HTTP or gRPC), there is a decoupling in “space” (i.e., the services don’t share the same hardware). However, they are still coupled “in time” (i.e., to an extent, they all need to be healthy for the system to perform). Some patterns, like circuit breakers, aim to mitigate this risk.

Avoiding S2S calls breaks the coupling “in time” by introducing a middleware (e.g., message broker, distributed log) that guarantees producers and consumers don’t need to be online simultaneously, only the middleware does. This middleware software tends to be designed to be highly available and resilient to network and software failures. For example, Kafka has some parts verified using TLA+.

To sum up, “forcing” microservices to communicate asynchronously causes teams to consider their architecture in terms of:

  • Eventual consistency
  • Asynchronous communication
  • Commands and events exchanged between them

This leads to more resilient, highly available systems in exchange for (potential) complexity. If you follow the principles of the Reactive Manifesto, you’ll consider this a staple. However, it might feel technically challenging if you are used to n-tier monoliths sitting on an extensive Oracle/SQLServer database.

Conclusions

There are few hard rules that one must always follow in anything related to building software. It is such a contextual activity that, for every question, there is almost always an “It depends” answer. That said, having a target architecture, a north star that the team collectively agrees to aim for, is good. When it is not followed, some analysis (ideally recorded for the future) should be done about why a decision was made against the “ideal” design.

In this post, I proposed a few rules I tend to follow (and recommend) when building microservices. Sometimes, it will make sense to break them; however, if you find yourself breaking them “all the time”, you might not be doing microservices in anything other than the name (and that, too, could be okay, but just call it what it is :))

Naming Kafka objects (III) – Kafka Connectors

We discussed naming conventions for Kafka topics and Kafka Producers/Consumers in the previous two posts. This time, we are focusing on Kafka Connect and the connectors running on it.

We will not discuss naming conventions related to Kafka Connect clusters (e.g., config/storage/offset topic names, group.id, etc.). They are normally managed by SysAdmin/DevOps teams, and these posts zoom in on developer-related naming conventions.

Kafka Connect in a few words

What is Kafka Connect?

Kafka Connect is a tool for streaming data between Apache Kafka and external systems, such as databases, cloud services, or file systems. It simplifies data integration by providing a scalable and fault-tolerant framework for connecting Kafka with other data sources and sinks, without the need to write custom code.

In other words, Kafka doesn’t exist in a vacuum; there are different “non-Kafka” systems that it needs to interact with, i.e., consume from and produce to. Kafka Connect simplifies this task massively by offering “connector plugins” that will translate between:

  • Protocols: SFTP->Kafka, Kafka->SFTP, Kafka->HTTP, Salesforce API->Kafka, etc.
  • Formats: CSV->Avro, Avro->JSON, Avro->Parquet, etc.

Theoretically, Kafka Connect can also translate between schemas (i.e., data mapping) via Single Message Transformations. However, I advise against using them except for the most trivial transformations.

Kafka Connect defines two types of connectors:

  • Source connectors consume data from non-Kafka systems (e.g., databases via CDC, file systems, other message brokers) and produce it for Kafka topics. These connectors are “Kafka producers” with a client connection to the source data system.
  • Sink connectors consume data from Kafka topics and produce it to non-Kafka systems (e.g., databases, file systems, APIs). Internally, they work as “Kafka Consumers” with a client connection to the destination system.

Naming Conventions

Now that we have a basic understanding of Kafka Connect, let’s examine the most relevant settings that require precise, meaningful naming conventions.

connector.name

This is obviously the “number one” setting to define. A few things to consider:

  • It has to be globally unique within the Connect cluster. In other words, no two connectors in the cluster can share the same name.
  • It is part of the path to access information about the connector config and/or status via the Connect REST API (unless you use the expand option and get them all at once).
  • For Sink connectors, it serves as the default underlying consumer group (plus a connect- prefix). In other words, if your connector is called my-connector, the underlying consumer group will be called connect-my-connector by default.

With that in mind, the proposed naming convention is as follows:

[environment].[domain].[subdomain(s)].[connector name]-[connector version]

  • [environment]: (Logical) environment that the connector is part of. For more details, see https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#environment
  • [domain].[subdomain(s)]: Leveraging DDD to organise your system “logically” based on business/domain components. Break down into a domain and subdomains as explained in https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#domain-subdomain-s
  • [connector name]: A descriptive name for what the connector is meant to do.
  • [connector version]: As the connector evolves, we might need to run side-by-side versions of it or reset the connector completely, giving it a new version. Format: vXY (e.g., ‘v01’, ‘v14’). This field is not mandatory; you can skip it if this is the first deployment.
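
For illustration, here is how a hypothetical sink connector configuration might apply this convention. Every value (environment, domain, connector class, topic) is invented for the example:

```java
import java.util.Map;

class ConnectorConfigExample {
    // Hypothetical sink connector config; the name follows
    // [environment].[domain].[subdomain(s)].[connector name]-[connector version].
    static final Map<String, String> CONFIG = Map.of(
        "name", "dev1.payments.reconciliation.settlement-report-sink-v02",
        "connector.class", "io.confluent.connect.s3.S3SinkConnector", // stand-in
        "topics", "dev1.shared.event.payments.reconciliation.payment-confirmed-by-payment-id",
        "tasks.max", "1"
    );
    // Submitted to the Connect REST API, this sink connector would default
    // to a consumer group called
    // "connect-dev1.payments.reconciliation.settlement-report-sink-v02".
}
```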

Do we really need [environment]?

This is a legit question. We said that the connector name must be (globally) unique in a given Kafka Connect cluster where it is deployed. A Kafka Connect cluster can only be deployed against a single Kafka cluster. Therefore, it can only sit in a single (physical) environment. If that is the case, isn’t the “environment” implicit?

Not necessarily:

  1. Your Kafka cluster might be serving multiple logical environments (DEV1, DEV2, etc.). As a result, a single Kafka Connect cluster might be sitting across multiple logical environments even if it belongs to a single physical environment. In this deployment topology, you might have the same connector in multiple logical environments, which would require the [environment] component to disambiguate and guarantee uniqueness.
  2. Alternatively, you might deploy multiple Kafka Connect clusters serving single logical environments against a single (physical) Kafka cluster. You might be tempted to think that in this scenario [environment] is not needed, since the connector name will be unique within its cluster. However, “behind the scenes”, sink connectors create a Kafka Consumer whose consumer group matches the connector name (plus a connect- prefix). Therefore, if multiple Connect clusters use the same connector name, they create the same consumer group, and all sorts of “issues” will arise (in practice, they end up forming one big consumer group targeting the topics across all logical environments in that physical Kafka cluster).

In summary, if you don’t use any concept of “logical environment(s)” and can guarantee that a given connector will be globally unique in the Kafka cluster, you don’t need the [environment] component.

consumer.override.group.id

Starting with 2.3.0, client configuration overrides can be configured individually per connector by using the prefixes producer.override. and consumer.override. for Kafka sources or Kafka sinks respectively. These overrides are included with the rest of the connector’s configuration properties.

Generally, I don’t recommend playing with consumer.override.group.id. Instead, it is better to give an appropriate name to your connector (via connector.name), as per the previous section.

However, there might be scenarios where you can’t or don’t want to change your connector.name yet you still need to alter your default sink connector’s consumer group. Some examples:

  • You have already deployed your connectors without [environment] in your connector.name (or other components) and now you want to retrofit them into your consumer group.
  • You have strict consumer group or connector.name naming conventions that aren’t compatible with each other.
  • You want to “rewind” your consumer group but, for whatever reason, don’t want to change the connector.name.

In terms of a naming convention, I would recommend the simplest option possible:

[environment or any-other-component].[connector.name]

In other words, I believe your consumer group name should track as closely as possible your connector.name to avoid misunderstandings.

consumer/producer.override.client.id

client.id was discussed in a previous post about Producer’s client.id and Consumer’s client.id.

As discussed in that post, it is responsible for a few things:

  • It shows up in logs to make it easier to correlate them with specific producer/consumer instances in an application with many of them (like a Kafka Streams app or a Kafka Connect cluster).
  • It shows up in the namespace/path for JMX metrics coming from producers and consumers.

With that in mind, knowing that we already have a pretty solid, meaningful and (globally) unique connector.name convention, this is how we can name our producer/consumer client.id values.

  • Source connector: producer.override.client.id = {connector.name}-producer
  • Sink connector: consumer.override.client.id = {connector.name}-consumer

Conclusion

We have discussed the most relevant properties that require naming conventions in Kafka Connect connectors. As usual, we aim for semantically meaningful values that we can use to “reconcile” what’s running in our systems with what every team (and developer) owns and maintains.

By now, we can see a consistent naming approach emerging, rooted in environments, DDD naming conventions, and some level of versioning (when required).

Naming Kafka objects (II) – Producers and Consumers

In a previous post in this “naming” series, we discussed how to name Kafka topics. The intention was to give them semantically meaningful names while avoiding collisions and ambiguity. The focus was on the “nature” of the topics’ data.

This post will discuss the two main “clients” connected to those topics: producers who write data into them and consumers who read data from them. The focus will move away from the “data” towards the applications involved in the data flow.

  1. Producers first
    1. client.id
      1. Organising your JMX metrics
      2. Naming convention
        1. Why do we need an entity/event name?
        2. Don’t we need a ‘version’ part?
    2. transactional.id
  2. Consumers second
    1. group.id
      1. Why do we need an entity/event name?
      2. Why versioning the group.id value?
      3. If I’m versioning my application, should I use it for the ‘version’ value?
    2. group.instance.id
    3. client.id
  3. Conclusions

Producers first

Nothing can be consumed unless it is produced first; therefore, let’s start with producers.

They do a very “simple” job:

  1. Pull metadata from the cluster to understand which brokers take the “leader” role for which topics/partitions.
  2. Serialise the data into byte arrays.
  3. Send the data to the appropriate broker.

In reality, it is much more complicated than this, but this is a good enough abstraction. Out of the dozens of configuration settings producers support, only two settings accept a “naming convention”.

client.id

The Kafka documentation defines client.id as follows:

An id string to pass to the server when making requests. The purpose of this is to be able to track the source of requests beyond just ip/port by allowing a logical application name to be included in server-side request logging.

We want a naming convention that makes mapping Producer applications to domains and teams easy. Furthermore, these names should be descriptive enough to understand what the Producer application aims to achieve.

Organising your JMX metrics

There is also an extra role that client.id plays that people tend to forget: it namespaces observability metrics. For example, producers emit metrics under the following JMX MBean namespaces:

  • kafka.producer:type=producer-metrics,client-id={clientId}
  • kafka.producer:type=producer-node-metrics,client-id={clientId},node-id=([0-9]+)
  • kafka.producer:type=producer-topic-metrics,client-id={clientId},topic={topic}

Notice how all of them use clientId as part of the namespace name. Therefore, if we don’t assign meaningful values to client.id, we won’t be able to distinguish the appropriate metrics when multiple producers consolidate their metrics into a single metrics system (like Prometheus), especially if they come from the same application (i.e., 1 application using N producers).

client.id also regularly features in other observability components like logs.

Naming convention

The proposed convention looks like this:

[environment]-com.[your-company].[domain].[subdomain(s)].[app name].[entity/event name]

  • [environment]: (Logical) environment that the producer is part of. For more details, see https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#environment
  • com.[your-company]: Follows a “Java-like” namespacing approach to avoid collisions with other components emitting metrics to the centralised metric database.
  • [domain].[subdomain(s)]: Leveraging DDD to organise your system “logically” based on business/domain components. Break down into a domain and subdomains as explained in https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#domain-subdomain-s
  • [app name]: The app name should be specific enough to make it easy to find the codebase involved and the team that owns it.
  • [entity/event name]: Describes what information the producer is sending. It doesn’t need to include the full topic name since the context is already clear (e.g., payment, transaction, account). This field is not mandatory.
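
A hypothetical example of this convention applied to producer configuration (the company, domain, and app names are invented):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

class ProducerNaming {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        // [environment]-com.[your-company].[domain].[subdomain(s)].[app name].[entity/event name]
        props.put(ProducerConfig.CLIENT_ID_CONFIG,
                "dev1-com.acme.payments.reconciliation.settlement-engine.payment");
        return props;
    }
}
```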

Why do we need an entity/event name?

When your application has multiple producers, client.id needs to be unique for each one. Therefore, the ‘entity/event’ in the last section of the client.id name disambiguates them. You don’t need to define an entity/event name if you only use one producer for the application.

Don’t we need a ‘version’ part?

Other naming conventions define a ‘version’ as part of their respective names. This is only necessary when the client is related to state; for example, Consumers and Streams apps must store committed offsets.

Producers, on the other hand, are completely stateless. Adding a ‘version’ part would only make sense if we kept multiple Producer application versions running side-by-side. Even then, one would argue that versioning the application itself would be a better strategy than versioning the Producer client.id.

transactional.id

The Kafka documentation defines transactional.id as follows:

The TransactionalId to use for transactional delivery. This enables reliability semantics which span multiple producer sessions since it allows the client to guarantee that transactions using the same TransactionalId have been completed prior to starting any new transactions. If no TransactionalId is provided, then the producer is limited to idempotent delivery. If a TransactionalId is configured, enable.idempotence is implied

There are a few “small” differences between client.id and transactional.id:

  1. client.id doesn’t need to be unique (but I strongly recommend it). transactional.id MUST be unique.
  2. client.id is more “visible” towards developers (through O11Y). transactional.id is mostly opaque, operating behind the scenes in the transaction management subsystem.
  3. client.id can change, although it would make your O11Y information very confusing. transactional.id MUST be stable between restarts.

Other than that, there is nothing special about transactional.id, so I recommend using the same naming convention that I have proposed for client.id in the section above.

Consumers second

We have sorted producers, and they are happily producing data. It’s time to look at the other side: consumers.

They too do a very “simple” job:

  1. Get a bunch of topic/partitions assigned as part of the consumer group partition assignment process.
  2. Connect to the brokers acting as leaders for those topic/partitions.
  3. Regularly attempt to pull new data from the assigned topic/partitions.
  4. When there is something available, read it (as byte arrays) through the connection.
  5. When the data arrives in the application space, deserialise it into actual objects.

A few configuration settings play a role in this process.

group.id

The Kafka documentation defines group.id as follows:

A unique string that identifies the consumer group this consumer belongs to. This property is required if the consumer uses either the group management functionality by using subscribe(topic) or the Kafka-based offset management strategy.

We want a naming convention that makes mapping Consumer applications to domains and teams easy. Furthermore, these names should be descriptive enough to understand what the Consumer application aims to achieve.

The proposed naming convention is as follows:

[environment]-com.[company-name].[domain].[subdomain(s)].[app name].[entity/event-name]-[version]

  • [environment]: (Logical) environment that the consumer is part of. For more details, see https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#environment
  • com.[your-company]: Follows a “Java-like” namespacing approach to avoid collisions with other components emitting metrics to the centralised metric database.
  • [domain].[subdomain(s)]: Leveraging DDD to organise your system “logically” based on business/domain components. Break down into a domain and subdomains as explained in https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#domain-subdomain-s
  • [app name]: The app name should be specific enough to make it easy to find the codebase involved and the team that owns it.
  • [entity/event-name]: Describes what information the consumer is reading. It doesn’t need to include the full topic name since the context is already clear (e.g., payment, transaction, account). This field is not mandatory.
  • [version]: Only introduce or change this value if you need to run side-by-side versions of the app or simply start from scratch. Format: vXY (e.g., ‘v01’, ‘v14’). This field is not mandatory.

Why do we need an entity/event name?

When your application has multiple consumers, it needs a unique group.id for every one of them. Therefore, the ‘entity/event’ in the last section of the group.id name should disambiguate between them, and it becomes mandatory.

You don’t need to define an entity/event name if you only use one consumer for the application.

Why versioning the group.id value?

The Kafka Consumer uses group.id to define a consumer group for multiple instances of the same application. Those instances collaborate within the group, sharing partitions, picking up partitions from failed instances and committing offsets so other instances don’t process records that another instance has processed already.

Offsets are committed under the group.id name. Therefore, it is critical to use the same group.id value across application deployments to guarantee that the application continues to consume from where it left off.

However, there are times when we might want to effectively reset the consumer; the easiest way to do that is to change the group.id. In this case, we can use ‘version’ to create a new consumer group that ignores where the previous deployment got up to and falls back to auto.offset.reset to decide where to start consuming.

If I’m versioning my application, should I use it for the ‘version’ value?

Short answer: NO

Longer answer: you are probably (loosely) following semantic versioning for your application; every merged PR will represent a new version. You don’t want to change your group.id every time your application version changes. The ‘version’ mentioned in the group.id is very specific to the consumer group and how it manages offsets. Don’t mix the two together.

group.instance.id

The Kafka documentation defines group.instance.id as follows:

A unique identifier of the consumer instance provided by the end user. Only non-empty strings are permitted. If set, the consumer is treated as a static member, which means that only one instance with this ID is allowed in the consumer group at any time. This can be used in combination with a larger session timeout to avoid group rebalances caused by transient unavailability (e.g. process restarts). If not set, the consumer will join the group as a dynamic member, which is the traditional behavior.

In other words, while group.id identifies 1 or more instances that belong to a consumer group, group.instance.id identifies unique instances.

The main purpose of group.instance.id is to enable static membership in the consumer group. This helps reduce group rebalances when instances become briefly unavailable. The assumption is that it is better to delay processing the partitions consumed by the temporarily missing instance than to rebalance the complete group, affecting all other instances.

I recommend using the same group.id naming convention PLUS something that identifies the instance uniquely and is stable between restarts.

client.id

client.id serves the exact same purpose in consumers and producers. Therefore, I will refer you to the previous section for producer’s client.id for a naming convention proposal. See https://javierholguera.com/2024/09/12/naming-kafka-objects-i-producers-and-consumers/#client-id
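
Pulling the consumer-side conventions together, a hypothetical configuration might look like this (all names are invented for illustration):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

class ConsumerNaming {
    static Properties consumerProps(String stableInstanceId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        // group.id: [environment]-com.[company-name].[domain].[subdomain(s)].[app name].[entity/event-name]-[version]
        props.put(ConsumerConfig.GROUP_ID_CONFIG,
                "dev1-com.acme.payments.reconciliation.settlement-engine.payment-v02");
        // group.instance.id: the group.id plus a per-instance suffix that is
        // stable between restarts (e.g., a StatefulSet pod name).
        props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG,
                "dev1-com.acme.payments.reconciliation.settlement-engine.payment-v02-" + stableInstanceId);
        // client.id: same convention as for producers.
        props.put(ConsumerConfig.CLIENT_ID_CONFIG,
                "dev1-com.acme.payments.reconciliation.settlement-engine.payment");
        return props;
    }
}
```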

Conclusions

Naming is difficult and requires care. However, investing in good naming conventions reduces accidental complexity, helps with debugging and diagnosing your system, and supports development through its end-to-end lifecycle.

In this post, I proposed multiple naming conventions that aim to be semantically meaningful, allow you to organize your system into sensible components, and support your system’s incremental evolution.

Naming Kafka objects (I) – Topics

There are only two hard things in Computer Science: cache invalidation and naming things.

Phil Karlton

I started working with Apache Kafka in 2015 when I joined Funding Circle. We were early adopters of the technology (and the architectural approach that comes with it) and had the pleasure (not really) of working with pre-1.0 versions. Since then, I have worked with Kafka in every single company I have been part of.

Fast-forward almost 10 years, and I think I have accrued enough experience to take the risk of proposing solutions to one of the hardest problems in software engineering: naming things.

In this first post, I propose a naming convention for the most important object in a Kafka system: Kafka topics. In subsequent posts, I’ll cover other objects like consumer groups, Stream topologies, and Kafka Connect connectors.

  1. DDD all things
  2. Topic naming convention
    1. [environment]
    2. [visibility]
    3. [topic-type]
    4. [domain].[subdomain(s)]
    5. [record-name]
    6. [key-name]
    7. [topic-version]
  3. Conclusions

DDD all things

I asked ChatGPT for a brief definition of what DDD is (emphasis mine):

Domain-Driven Design (DDD) is a software development approach that focuses on modeling software based on the real-world domain it represents, emphasizing collaboration between technical experts and domain experts to create a shared understanding and ensure that the software reflects the core business logic and needs.

In other words, when applying DDD, you have to adopt the vocabulary of the business and structure your system based on how the company itself is structured (and collaborates). This is very aligned with a sociotechnical approach to software architecture.

I like leveraging DDD to structure systems and communicate intent. Naming things is the cornerstone for both. Names (and namespaces) organise concepts (i.e., structure) while revealing what they are “about” (i.e., intent).

Topic naming convention

Kafka topics contain the information stored and streamed in Apache Kafka. In that sense, they are not too different from other data containers like database tables, messaging queues, S3 buckets, files, etc.

With that in mind, there are three factors to take into account:

  • A topic is named after its content (not after any given consumer or producer). If you think about it, you would never name a database table based on what reads/writes from it. Neither should you with Kafka topics.
  • The topic names must explicitly convey the topic content (i.e., what data is stored in there). One of Kafka’s selling points is the ability for producers and consumers to exchange information without coupling; topic names that clearly define data content underpin this exchange.
  • Your naming convention must avoid naming collisions. It should support the ability to choose the right name for a topic without worrying that this name might be taken.

With that in mind, the following convention ticks all those requirements:


[environment].[visibility].[topic type].[domain].[subdomain(s)].[record name]-by-[key name]-[topic version]

Let’s break down each component.

[environment]

Mandatory: yes

The environment describes the “logical” environment that a topic belongs to.

Take into account that Kafka clusters can be expensive, whether self-hosted (e.g., K8S via the Strimzi Operator) or via SaaS (e.g., AWS MSK, Confluent Cloud). If you have a low-volume environment, like most pre-production environments, it makes economic sense to “reuse” the same physical cluster for multiple logical environments (e.g., DEV, QA, Sandbox, UAT, Pre-PROD), particularly in an environment where cost savings are king (see https://javierholguera.com/2024/07/02/software-engineering-trends-that-are-reverting-i/).

It also supports ephemeral environments if you achieve the necessary CI/CD maturity.

[visibility]

Mandatory: yes

David Parnas enunciated the principle of “information hiding” in 1972. It’s 2024, and we still haven’t found a better way to reduce coupling and complexity than “hiding what you don’t need to see”; in other words, encapsulation and abstraction are still king to reduce cognitive overload and improve maintainability. This is true for data as well, not just code/behaviour.

With that in mind, it makes sense to categorise topics depending on “who should have access to this data”. Kafka doesn’t have any such taxonomy built-in; instead, you can “play” with roles and ACLs to limit/grant access. However, that is more of an InfoSec approach than naturally categorising topics (and data) based on their nature. One can argue that if topics are properly categorised in terms of visibility, we are one step away from automating ACL access that enforces such visibility at the producer/consumer level.

Another strong reason to categorise visibility is incompatible changes (backward and/or forward). When we know they apply to a topic with limited visibility, we can more easily manage the blast radius of such changes.

I propose the following visibility options.

  • External: For topics used to exchange data with systems outside yours (e.g., third parties, SaaS, customers) if they support a Kafka-based interface. These topics require strong backward and forward compatibility (unless otherwise negotiated with the other parties).
  • Shared: Topics that are visible to all other components of the system, used to exchange data between system areas (similar to internal REST APIs). These topics require strong backward and forward compatibility.
  • Internal: Topics that should only be accessed within the same DDD domain or bounded context. Ideally, that would be a limited number of well-known applications that can be maintained and deployed together.
  • Private: Topics that are owned by a single application. They are similar to a private database/table that supports the functioning of a single microservice. Compatibility is decided on a per-application basis by the team maintaining it.

[topic-type]

Mandatory: yes

Not all messages are born equal. Clemens Vasters has a good introduction to different types of messages in his “Events, Data points and messages” talk from 2017. While Kafka is not the best option for some of these types of messages, we all end up “abusing” it to an extent, if anything, because having multiple messaging technologies in your stack might be worse than stretching one of them beyond its optimal use case.

With that in mind, and without going to the level of granularity that Clemens proposes, the list below covers the “most common” types. Choosing the right type helps convey a lot of meaning about the nature of the data stored in the topic.

  • Commands (topic-type: command): A command conveys the intention to change the state of the system (e.g., “ConfirmPayment”).
  • Events (topic-type: event): An event captures something that has happened and is being broadcast for other systems to consider (e.g., “Payment Confirmed”).
  • Entities (topic-type: entity): An entity groups fields that represent a single, consistent unit of information. When you design event-first architectures, events are modelled first (to support business processes), and entities “emerge” from them to group information, generally around DDD boundaries (for example, “Payment”, “Customer”, “Account”).
  • Change Data Capture (topic-type: cdc): CDC topics define a stream of changes happening to the state of a given entity, commonly represented as a database table. In other words, they are a stream of CRUD changes to a table. Capturing the before/after of the affected entity simplifies downstream consumers.
  • Notifications (topic-type: notification): Unlike a normal event, they don’t contain all the information that was captured as part of the event happening. Instead, they contain a link back to a place where the information was captured. They are useful when the payload would be “too big” to embed in the event, so instead, the consumer has the option to “trace back” to independent storage like S3 for full details.

[domain].[subdomain(s)]

Mandatory: yes

This part leverages DDD (or any other approach that you choose to organise your system) to guarantee there aren’t any naming collisions.

It is also quite powerful as a first layer to navigate the hundreds (or thousands) of topics in your system. If you combine visibility AND domain/subdomain(s), you can quickly identify the appropriate stream containing the information that your consumer requires.

For example, if your consumer needs to react to a payment being confirmed, the following topic would be quite easy to find.

prod.shared.event.payments.reconciliation.payment-confirmed-by-payment-id

You can search by the environment, visibility, and domain/subdomain and list a few dozen topics at most that will be part of the “reconciliation” domain before finding the right one.

[record-name]

Mandatory: yes

A descriptive name for the data stored in the topic, similar to naming database tables.

The record name “grammar” will be influenced by the topic type.

  • Commands: Present Tense + Noun (e.g., SendEmail, ApproveOrder)
  • Events: Noun + Past Tense (e.g., PaymentReceived, AccountClosed)
  • Entities: Noun (e.g., Order, Account, Payment, Customer)
  • Change Data Capture: Table Name
  • Notifications: Noun + Past Tense (like ‘events’)

[key-name]

Mandatory: yes

The name of the field that serves as the key for the record. It is VERY important to clearly identify what field in the message payload is used as the key to partition records in the topic.

Remember that in Kafka, strict ordering is only guaranteed within a partition (and not across partitions in a given topic). Therefore, whatever key you partition by becomes the ordering criterion.

For example, if we have an “entities” topic partitioned by the entity’s natural key, we know all changes to that entity will be strictly ordered. If, for any reason, we chose a different key, the same changes to that entity might land on different partitions within the topic and be “consumed” in a different order than they happened.
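As an illustration, with the confluent-kafka Python client the ordering guarantee follows directly from the key you pass when producing; the topic and field names below are hypothetical:

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_payment(payment: dict) -> None:
    # Keying by payment-id means every change to the same payment lands on
    # the same partition, so consumers see those changes strictly in order.
    producer.produce(
        topic="prod.shared.entity.payments.payment-by-payment-id",
        key=payment["payment_id"],
        value=json.dumps(payment),
    )

publish_payment({"payment_id": "p-42", "status": "CONFIRMED"})
producer.flush()
```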

[topic-version]

Mandatory: no

The final field is only here because incompatible changes DO happen despite our best intentions to avoid them.

When we face such a change, we normally need to follow a process like this:

  1. Create a second version of the topic (i.e., -v2).
  2. Start a migrating topology that upcasts messages from -v1 to -v2 continuously.
  3. Point all our consumers to the new -v2 topic.
  4. Once the consumers are up-to-date, migrate all our producers to the -v2 topic.
  5. Remove -v1 topic.

Since this situation shouldn’t happen for most of your topics, this name component is optional (until you need it).
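Step 2 above (the migrating topology) can be as small as a consume-transform-produce loop. A minimal sketch using the confluent-kafka Python client, with assumed topic names and a made-up upcasting rule:

```python
import json
from confluent_kafka import Consumer, Producer

V1 = "prod.shared.event.payments.payment-confirmed-by-payment-id"
V2 = V1 + "-v2"

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "payment-confirmed-upcaster",
    "auto.offset.reset": "earliest",
})
consumer.subscribe([V1])
producer = Producer({"bootstrap.servers": "localhost:9092"})

def upcast(v1_payload: dict) -> dict:
    # Whatever incompatible change triggered -v2 gets absorbed here
    # (e.g., renaming or splitting a field consumers now expect).
    v2_payload = dict(v1_payload)
    v2_payload["schema_version"] = 2
    return v2_payload

while True:  # runs continuously until v1 is retired
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    producer.produce(V2, key=msg.key(),
                     value=json.dumps(upcast(json.loads(msg.value()))))
    producer.flush()  # simplistic; batch flushes in real life
```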

Conclusions

It is not a simple naming convention, and, although it is unlikely, you must be careful not to overrun Kafka’s topic name length limit (249 characters).

However, if you follow it thoroughly, you will find that what would otherwise be a mess of topics, where it is impossible to find the right information, becomes a structured, well-organised, self-service data platform.

You probably don’t know monoliths

In the last couple of months or so, there has been an uptick in posts bashing microservices and confidently stating that all you need is a monolith.

In this post, I will explain why I think microservices have a bad reputation, why monoliths are making a comeback, and what the dangers of monoliths are.

What follows is a bit of a ranty post. If you are looking for rational, technically sound content, consider other posts in this blog and skip this one 🙂

Everybody hates microservices

Well, except for all the people who have quietly and successfully built systems with them for the last 15 years. Focus instead on the LinkedIn haters who have been posting forever that “all you need is a monolith” and that microservices are wasteful and complex.

These people come in two flavours.

The “I don’t like change” type

Seasoned practitioners decided that microservices weren’t their cup of tea and stuck to their guns. In any other industry, they would face an “adapt or die” challenge and be wiped out of the field. However, Software Engineering has been ballooning for decades; we have absorbed people with all sorts of random backgrounds, and, of course, we haven’t left anyone behind and kept the pseudo-neo-luddites around for good measure.

That said, it is useful to have contrarians around because they serve as a counterbalance to the trigger-happy techno-optimists who blindly embrace anything new because “new is always better.”

The “Bullshit merchant” type

This is a new phenomenon resulting mostly from the advent of social networks and the rise of personal branding. These guys need to make some kind of noise to be relevant, so why not choose something relatively established and make strident noises about it to capture people’s attention? After all, we live in the attention economy now.

While the change-resistant folks have a utility as a counterbalance (and, sometimes, as early flaggers of very serious problems), the bullshit merchants only add noise to the conversation and live to serve their own agenda.

How do you spot them?

  • They say things like “I have never seen anything like this in my career”, and they only left university a few years ago.
  • They use words like “huge”, “insane”, “incredible”, etc., to describe mundane novelty and/or change.
  • They work at consultancy companies (hello, dear McKinsey reader!).
  • They have something they want to sell to you, even if you’re not sure what it is (maybe expertise?).
  • They post about random events in the world and how they connect to B2C SaaS sales.

Real issues with microservices

Discrediting the people who discredit microservices doesn’t make them the right architectural choice, right? It would also be ridiculous to pretend that microservices are always “the right tool for the job”. There are plenty of cases where you should not use them. Why did it all go wrong?

The unfathomable sizing

Sizing a monolith is easy: you don’t have to. You just dump everything, every line of code, every new feature, there. A whole world of pain is avoided.

SOA architectures predate microservices. Yet, somehow, they managed to avoid the never-ending discussion about sizing. Most likely because, with “few enough” services, it was distinctly obvious when it was time to stand a new one up separately. It would be screaming in your face, no choice given.

Microservices, on the other hand, were meant to be “micro”, i.e., small. Otherwise, you would lose their alleged benefits, such as using the appropriate technology stack, offering clear and granular enough boundaries for teams to grow and operate independently, and being able to scale parts of the system separately.

But what was the right size? Nobody knew. One of the OG articles on microservices, Martin Fowler’s blog post, asks the same questions about size without answering them. It’s 10 years old, and it is quite telling of what would lie ahead. So many heuristics:

  • A two-pizza team should be able to maintain as many microservices as team members.
  • Sam Newman, the father of microservices, proposed sizing them as something that could be rewritten in two weeks.
  • There were various heuristics (which I personally adhere to) based on DDD (Domain-Driven Design) and transactional boundaries to guarantee data consistency and cognitive load balance.
  • I once attended a talk where a guy happily claimed to create a new microservice every time he had to implement a new feature at his 2-people startup. God knows what happened to them; surely nothing good.

In other words, nobody knew how to size them. When something is difficult to grasp and doesn’t have clear guidelines, expect the most horrible abuses, followed by undesired side effects and a sudden realization that [insert your technique here] is baaaad.

Getting ready for the improbable success

Most startups fail, and most of those that don’t will never achieve planetary scale or experience hockey-stick growth.


Microservices were, first and foremost, a tool to scale organisations. They were a sociotechnical architecture pattern that guaranteed that, as companies grew their engineering departments, they would not fall for diminishing returns. Thanks to microservices, people would work and deploy independently in highly cohesive, loosely coupled teams that were aligned to organizational goals. This is what Facebook, Netflix, Amazon, Google, Microsoft, etc., were doing to great success.

Unfortunately, we got the causality arrow wrong. Just like buying a luxurious car doesn’t make you rich, adopting microservices doesn’t make you (or help you be) successful. Successful organisations were forced to adopt microservices (even before they were named) as a consequence of their success (and organisational growth). It was a tax to pay (since microservices, like most things in life, aren’t a free lunch) to continue riding the J curve to surreal market valuations.

It follows that adopting microservices in preparation for the inevitable success would be a reverse self-fulfilling prophecy: detrimental to hitting the jackpot.

A long-tail of technologies

While one could implement microservices with the most rudimentary tech stack that already existed in the late 90s and early 2000s, most people ended up dragging in a bunch of the usual suspects, which only increased the suspicion that “microservices were hard” and unnecessarily complex:

  • (Docker) containers because you were meant to adopt the right tool for the job, which meant a polyglot stack (Python, Node, Java, Go, etc.)
  • An orchestrator to manage those containers, like Kubernetes. This one deserves its own post, as it has come to be seen by many (not me) as a trojan horse planted by Google to slow down the startup ecosystem and maintain technological dominance 🙂
  • Polyglot persistence, where NoSQLs like MongoDB would be the key to webscale
  • Your favourite cloud provider because managing all those technologies would require an army of DevOps/SysOps, and the cloud provider did all of that for you (for a penny)
  • A variety of testing tools to support complex E2E scenarios that didn’t quite exist when you were hitting a single application/service
  • Various microservice-related patterns like circuit breakers, sagas, choreography, orchestration, etc.

In other words, we came to associate microservices with many other technologies and tools that were adopted together, even if not always needed, increasing the cognitive overhead for the whole solution.

Why are monoliths back now?

Well, it’s the economy, stupid! Or, more specifically, the end of ZIRP (Zero Interest Rate Policy) and a renewed focus on costs of all kinds.

A part of this is easy to understand: microservices are perceived as expensive, hence we should ditch them. However, this is the wrong reason to discard microservices. If you believe you need them, the extra cost is significantly smaller than the horrors of not adopting them: slowly grinding your tech department to a halt, incapable of delivering new features or delivering them on significantly longer cycles.

What is more interesting is how the end of ZIRP will affect organisations’ bottom line. For years, “growth” was the only metric: more customers and more market share. There was a drive to grow organisations, offer more products and continue in this virtuous cycle until some kind of exit; revenue and profit were an afterthought, something that would materialise eventually. If it worked for Google or Amazon, why wouldn’t it work for me? This added a lot of pressure for tech departments to scale: hire more engineers, onboard them as fast as possible, and continue launching.

The party is over, though.

VCs and other liquidity providers want to see results in the short term instead of some fantastic future growth that never materialises into ROI. They want more revenue and more profit for the same (or less) investment. That means “doing more with less” (as every CEO puts it) or, in other words, forgetting about hiring/growing and doubling the workload on your existing staff. If you don’t like it, go join the long queue of engineers who have been laid off from busted startups and “more nimble, prepared for future growth” FAANG companies.

In this context, you probably need microservices less often than before. If you don’t have to solve an organisational scalability problem and nothing suggests you are gonna be one of the unicorns that experience tremendous demand growth, why would you jump on that wagon?

You probably started with a single service anyway. Stick to it for as long as it makes sense and ignore the labels.

Is that it? No more microservices?

Are we gonna worship monoliths now like we once (mostly) worshipped microservices?

I truly hope not. I was part of the industry before microservices were given a name. I have seen some absolutely horrendous codebases that could only be deployed every 3+ months because they were too big and too bloated to do it any more often. I have also been involved in a few “microservices migrations” where the company would try to transition away from the monolith(s); spoiler alert, it was NOT pretty, and it was NOT successful. If people blindly advocating monoliths over microservices think that is progress, they don’t know what they are talking about.

Maybe what they call a “monolith” is your ever-so-slightly enlarged, quite-young-startup microservice. Comparing that to a real monolith would be like comparing your dog to a T-Rex.


If they had truly experienced monoliths (the Airbus A380 type), they would not be happily advocating them. For all the “distributed monolith” systems out there, we have 10x more monoliths that have taken (or will take) years to migrate to appropriate architectural approaches, because the hardest problem is always the data (and yes, it is for microservices too, as Christian Posta called out 5 years ago).

My recommendation is to ask the business/product department lots of questions, dig as deep as possible into future growth plans (particularly headcount) and draft an architectural roadmap that accounts for that. If your whole engineering department is a single team of colocated developers, don’t even think about microservices.

If your company is planning to open multiple offices and hire developers across the globe in separate time zones, you need to start considering patterns that enable people to work as asynchronously and independently as possible. Can you do that with a monolith? Sure. Is it easier than adopting microservices? Unlikely.

There is a time and place for microservices, just as there is for monoliths. Anyone pretending otherwise is not to be trusted.

Software engineering trends that are reverting (I)

When I entered the software industry a long time ago, people who had been part of it warned me that software trends came and went and eventually returned. “This thing that you call ‘new’, I have seen it before”. I refused to believe it. Like a wannabe Barney Stinson, I thought ‘new’ was always the way.

I have been around long enough to see this phenomenon with my own eyes. In this series of posts, I want to call out a few examples of “trends” (i.e., new things) that a) aren’t new anymore and b) people are walking away from. The series starts with one of the most “controversial” trends in the last 10-15 years: microservices!

Microservices are so 2010s

Microservices are dead! I’m joking. They are not dead, but they are not the default option anymore. We are back to… monoliths.

While there have always been people who thought microservices weren’t a good idea, the inflexion point was the (in)famous blog post from Amazon Prime Video about replacing their serverless architecture with a good, old monolith (FaaS is just an “extreme” version of microservices).

Why was this more significant than the thousands of posts claiming microservices were unnecessary complexity, talking about distributed monoliths and criticising an architectural approach that came from FAANG and only suited FAANG? Well, because… it came from FAANG. The haters could claim that even a FAANG company had realised microservices weren’t a good idea (“We won!”).

Realistically, this would have been anecdotal if it weren’t for something more important than a bunch of guys finding a way to save money when they serve millions of daily viewers (do YOU have THAT problem?).

It’s the economy, stupid

US FED interest rate since 1990

The image above shows the US Fed’s official interest rates. Historically, interest rates have been pretty high (about 5%, according to a recent interview with Nassim Taleb on Bloomberg). From 2008 to post-COVID 2022, we experienced an anomaly: close-to-0% rates for almost 15 years. Investors desperate to find good returns for their money poured billions into tech companies, hoping to land the next Google or Facebook/Meta.

[Chart source: https://goingdigital.oecd.org/en/indicator/35]

Lots of startups with huge funding rounds started to cosplay as future members of the FAANG club: copy their HR policies, copy their lovely offices, and, of course, copy their architectural solutions because, you know, we are going to be so great that we need to be ready, or we might die of success.

We all built Cloud Native systems with shared-nothing architectures that followed every principle in the Reactive Manifesto and were prepared to scale… to the moon! 🚀 Microservices were the standard choice unless you were more adventurous and wanted to go full AWS Lambda (or a similar FaaS offering) and embrace FinOps in its purest form.

The only drawback was that it was expensive (let’s ignore complexity for now, shall we?). That didn’t matter when the money was flowing, but now the music has stopped, and everybody is intensely staring at their cloud provider bill, wondering what they can do to pay a fraction of it.

What is next?

Downsizing all things.

  • Microservices/FaaS → Monolith(s). “Collapse” multiple codebases into one and deploy them as a single unit. The hope is that teams have become more disciplined at modularising (unlikely) and “build systems” have become more efficient at managing large codebases (possibly).
  • Messaging (Kafka et al.) → Avoid middleware as much as possible. Middleware is expensive technology. With monoliths, there will be fewer network calls that require it; direct communication (e.g., HTTP, gRPC) will be the standard (again) when necessary. Chunkier monoliths will reduce network traffic compared to microservices.
  • NoSQL → Relational. Many NoSQL databases optimise for high throughput / low latency / high durability, all of which will happily be sacrificed for cost savings. Relational databases are easier to operate and run yourself (i.e., self-host), which is the cheapest option (some NoSQL, like CosmosDB or DynamoDB, can’t be self-hosted). On the complexity side, relational databases are seen as easier for developers to understand (until you see things like this).
  • Stream Processing → Gone, except for truly big data. Stream processing is expensive and complex. Most businesses won’t care enough about latency to pay for it, nor will they have volumes that require it.
  • Kubernetes → Cloud-specific container solutions. We should see a transition towards more “Heroku-like” execution platforms. It will be a tradeoff between flexibility (which K8s offers in bucketloads) and cost/simplicity. Sometimes, containers will be ditched too and replaced by language-specific solutions (like Azure Spring Apps) to raise the abstraction bar even higher.
  • Multi-region / Multi-AZ deployments → No multi-region unless it is a compliance requirement, and fewer multi-AZ deployments. Elon has proved that a semi-broken Twitter is still good enough, so why would companies building less critical software aim for 3-5 nines?
  • Event-Driven Architecture → Here to stay. This approach isn’t more or less expensive than batch processing (if anything, it’s cheaper) and still models business flows more accurately.

What are we gaining and losing?

Microservices are neither the silver bullet nor the worst idea ever. As with most things, they have PROs and CONs. If we ditch them (or push back harder against their adoption), we will win things and lose things.

What do we win?

  • It is easier to develop against a single codebase.
  • Local testing is simpler because running a single service in your machine is more straightforward than running ten. Remote testing is also more accessible, as hitting one API is less complicated than hitting many across the network.
  • It is also easier to deploy a single service than many.
  • Easier maintainability/evolvability. When a business process has been incorrectly modelled, it is easier to fix on a monolith (with, ideally, single data storage) than across many services with public APIs and different data storages.

What do we lose?

  • Once a codebase is large enough, it is tough to work against it. Software is fractal, which is also valid for “build systems”: you want to divide and conquer.
  • Deploying a single service can be more challenging if multiple people (or, even worse, teams) need to release changes simultaneously. More frequent deployments can alleviate the problem, but most companies don’t go from a dev branch to PROD in hours but days/weeks.
  • The blast radius for incorrect changes will be higher. Systems are more resilient when they are appropriately compartmentalized.
  • Organisations growing (are there any left?) will struggle to increase their team’s productivity linearly with the headcount when the monolith becomes the bottleneck for all software engineering activities.
  • FinOps and general cost observability against business value will massively suffer. A single monolith will lump everything together. With multiple teams involved, it will be harder to understand who is making good implementation decisions and who isn’t, as the cost will be amalgamated into a single data point.

Summary

Microservices are not dead. However, they are now viewed with suspicion because they are expensive in terms of infrastructure cost and, indirectly, engineering hours, due to their increased complexity. At the same time, they remain crucial to unlocking organisational productivity as the engineering team grows beyond a bunch of guys sitting together.

As the industry turns its back on FAANG practices and we sacrifice various “-ilities” on the altar of cost savings, the future of microservices will be decided by how often we identify the cases where they are the absolutely right solution and how well we articulate their case. When in doubt, the answer will be (and perhaps it should have always been) ‘NO’.

As a parting thought, I have been involved in 3 large-scale monolith refactors/rewrites to microservices. All these projects were incredibly complex, significantly delayed and more of a failure than a success (some never entirely completed). Starting with a monolith is, most of the time, the correct answer. However, delaying a transition to smaller, independent services is almost always as bad (if not worse) than starting with microservices would have been in the first place. We are entering a new era where short-term thinking will be even more prevalent than before.

Software is fractal

Fractal image generated by ChatGPT

I have been building software for nearly 20 years and keep stumbling upon the idea that software is fractal. It has been a nagging feeling that I had been unable to concretise. This post is a (probably poor) attempt to do so, and to explain why it matters.

What does “fractal” mean?

Let’s ask ChatGPT, the source of all modern knowledge.

A fractal is a complex shape that looks similar at any scale. If you zoom in on a part of a fractal, the shape you see is similar to the whole. Fractals are often self-similar and infinitely detailed. They can be found in nature in patterns like snowflakes, coastlines, and leaf arrangements.

ChatGPT, 2024

In other words, fractal structures exhibit similar shapes (and, potentially, behaviour) at different levels of scale (i.e., zooming in and out). Thus, one can use a similar set of concepts and rules to understand them regardless of where on the scale you focus.

Examples of software “fractalism”

What better way to prove that software is fractal than with examples? Pretty much all software systems are a combination of the following components:

  • Code, which captures desired behaviour.
  • Messages, which represent communication between code components.
  • Load Distribution, which guarantees the best possible performance.

Example 1 – Code

First, let’s look at how we capture “behaviour”, i.e., code that executes what we want the software to do. While not the lowest level, a reasonable low level would be classes and functions capturing code written by developers according to specifications that capture the desired behaviour.

Every class or function offers a series of “promises” (à la Promise Theory) about what it can do. Classes and functions use other classes and functions to compose higher-order behaviours. Modules and components aggregate classes and functions, producing new promises for more complex behaviours.

Eventually, modules and components are built up to (micro/macro)services, resulting in fully-fledged systems.

At every level, we shed details. The newly formed aggregate hides information about how it implements its promises, which isn’t relevant to those consuming them. It is crucial to design the appropriate APIs (aka boundaries) and resist premature abstractions built on incomplete information (see the Rule of Three).
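A trivial sketch of this layering in Python (all names invented for illustration): each level exposes a small promise and hides how it keeps it.

```python
# Low level: functions promising narrow behaviours.
def validate_card(card_number: str) -> bool:
    return len(card_number) == 16 and card_number.isdigit()

def charge(card_number: str, amount_cents: int) -> str:
    return f"charge-{card_number[-4:]}-{amount_cents}"  # pretend gateway call

# One level up: a module-level promise ("take a payment") that hides
# the validation/charging details from its consumers.
def take_payment(card_number: str, amount_cents: int) -> str:
    if not validate_card(card_number):
        raise ValueError("invalid card")
    return charge(card_number, amount_cents)

# Another level up, this would be a service exposing POST /payments,
# keeping the same promise behind an HTTP boundary.
print(take_payment("4242424242424242", 1250))
```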

Example 2 – Messages

It is all good to package your code with the appropriate boundaries and promises, but it will only be helpful if someone/something uses it. Users and machines decide to consume your promises, but they require a way to “exercise” them. Messaging happens again at all possible layers.

Clients and servers can be of any nature: at a low level, they can be two classes or functions. For example, one function A invokes another function B. B expects a given set of parameters, which A prepares and “sends” in its invocation. This is akin to a message from A to B, which waits for the result. The interaction can be async/sync, blocking or non-blocking; it doesn’t matter. The mechanics are the same.

Object orientation done right was all about message passing. Smalltalk, the OG object-oriented language, represented every interaction as a message sent to an object.

Components and modules rely on similar interactions. They can be as low level as between classes and functions (i.e., direct invocation through memory address) or more abstracted (like in-memory event buses like Spring Application Event).

Once we move further up and need to cross the network to communicate, messages become more explicit. Events, commands, notifications, etc., represent messages that different code components/services exchange in a (not always) beautiful choreography to implement complex behaviours. In this layer, you find REST APIs with JSON payloads, gRPC (or, if you are unlucky enough, earlier incarnations like CORBA, DCOM, Java RMI, .NET Remoting, etc.), Kafka/Avro, etc.

At the top, we have systems interacting with other systems. It is the same metaphor (and, quite often, the same protocols and formats) but with reduced trust and a more straightforward boundary/interface.

Some components repeat themselves across all levels:

  • Sender / Recipient information
  • Body (parameters and values)
  • Metadata (time reference, size, integrity checksums)
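Those three components describe a function call, an in-process event, and a cross-network record equally well. A small illustrative sketch (the structure is mine, not a standard):

```python
import time
import zlib
from dataclasses import dataclass, field

@dataclass
class Message:
    sender: str      # a calling function, a module, or a remote service
    recipient: str   # a callee, an event bus subscriber, or a topic
    body: dict       # parameters and values
    metadata: dict = field(default_factory=dict)

    def __post_init__(self):
        # Metadata that shows up at every level: a time reference,
        # the size, and an integrity checksum.
        raw = repr(self.body).encode()
        self.metadata.setdefault("timestamp", time.time())
        self.metadata.setdefault("size", len(raw))
        self.metadata.setdefault("crc32", zlib.crc32(raw))

# The same shape regardless of scale:
in_process = Message("OrderService.place", "PaymentService.charge",
                     {"order_id": "o-1", "amount_cents": 999})
cross_network = Message("orders-api",
                        "prod.shared.command.payments.charge-by-order-id",
                        {"order_id": "o-1", "amount_cents": 999})
```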

Example 3 – Load Distribution

So far, we have behaviour represented as code and “communication intent” captured as messages. We now need to make sure those intents travel between code components. Enter “Load Distribution”.

At the lowest level, we have memory addresses (i.e., find my message at this pointer and read consecutively) and the kernel scheduler assigning CPU time “equally”. This is both a way to distribute the message representing the intent and a form of load balancing, since multiple code components “compete” for scarce resources. These components “wait” until resources are assigned; in other words, they queue. We scale by throwing more CPU cores at every problem (if we are lucky enough to have a parallelisable problem).

Moving up the stack, we find thread pools and managers that aim to multiplex the underlying physical (or virtual) CPUs as exposed by the OS. It is the same premise, but applied at a higher level of abstraction with added support from the tooling. We start dealing with various types of explicit (in-memory) queues and other synchronisation objects.
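The same “queue in front of scarce workers” shape is visible directly in Python’s standard library; a minimal sketch:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle(request_id: int) -> str:
    time.sleep(0.1)  # pretend work on a scarce CPU
    return f"done-{request_id}"

# Four workers multiplexing twenty requests: everything beyond four queues,
# exactly like requests queue in front of a load balancer one level up.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle, range(20)))

print(results[:3])
```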

Things become more interesting when we get to the service level and the network is involved. Addresses move from memory to explicit network addresses (i.e., the IP protocol), load balancers become network appliances, and where we once threw CPU cores at problems, we now throw instances/containers (if we are lucky enough to have a shared-nothing architecture).

At the top level, we start thinking of distributing load across (cloud provider) Regions. While uncommon from a load perspective (most companies don’t need this), it can be required for availability/redundancy purposes. At this point, we have zoomed out so much that the whole system deployed to a set of Availability Zones in a single Region is just a node in the graph.

Why does this all matter?

I believe all of this matters because, when problems, patterns and structures repeat at a certain level, solutions become reusable.

For example, our mental model should not be different when we test a class/function or a whole service. What differs is “how” we apply that mental model.

  • Boundary — Class/Function: method/function signature. Service: public API.
  • Arrange — Class/Function: prepare input data, set dependency expectations (e.g., mocks, stubs), inject dependencies (other classes). Service: prepare the input message, roll out infrastructure (database, message bus), prepare required state (e.g., preload the database).
  • Act — Class/Function: invoke the method/function. Service: send a “message” to the public API.
  • Assert — Class/Function: await results, validate returned data, verify dependency expectations. Service: await results, validate the response message, validate state mutations.
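In code, the two scales end up nearly identical in shape. A sketch in pytest style, with a hypothetical payment function and endpoint (the service-level test assumes the requests library and an API running locally):

```python
import requests

def take_payment(card: str, amount_cents: int) -> str:
    # Stand-in for the module under test (same promise as the earlier sketch).
    return f"charge-{card[-4:]}-{amount_cents}"

def test_take_payment_charges_valid_card():
    # Arrange
    card, amount = "4242424242424242", 1250
    # Act: direct invocation is the "message"
    receipt = take_payment(card, amount)
    # Assert: validate the returned data
    assert receipt.endswith("-1250")

def test_payments_api_charges_valid_card():
    # Arrange: infrastructure and state are assumed provisioned out of band
    payload = {"card": "4242424242424242", "amount_cents": 1250}
    # Act: the "message" is now an HTTP request to the public API
    response = requests.post("http://localhost:8080/payments",
                             json=payload, timeout=5)
    # Assert: validate the response message (and, ideally, state mutations)
    assert response.status_code == 201
```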

This is just one of many examples where, while the specifics change, the mental model remains the same. Other examples include:

  • (Some) SOLID principles apply at multiple levels
    • Both SRP and ISP (Interface Segregation) are about granularity, which, taken far enough, leads to microservices architectures.
    • Extending is always safer than modifying, whatever your boundary.
    • The Inversion of Control that DI seeks can be achieved at the service level by injecting client libraries that decouple consumers from the actual service implementation(s).
  • Data structures showing up at multiple levels
    • Hashmaps (along with lists) are the most common data structures in software engineering (every JavaScript object is basically one).
    • Key-value databases are wildly popular and pretty much the same thing at a different scale.
  • Contracts as a way to enforce promises (see the sketch below)
    • At the language level, we have method signatures with types (if you are using statically typed languages).
    • At the system level, we have attempts like Protobuf, Avro, JSON Schema and the whole “Data Contracts” movement to replicate them.
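As a final illustration of that last bullet, here is the same promise expressed at two scales (the JSON Schema document below is hand-written for illustration):

```python
# Language level: a typed signature is the contract.
def close_account(account_id: str, reason: str) -> bool: ...

# System level: the "same" contract, now as a JSON Schema document
# that a consumer written in any language can validate against.
CLOSE_ACCOUNT_SCHEMA = {
    "type": "object",
    "properties": {
        "account_id": {"type": "string"},
        "reason": {"type": "string"},
    },
    "required": ["account_id", "reason"],
}
```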

If this rings true, then we should focus more on the rules that handle these commonalities and less on the specifics. If we see these patterns emerging, solutions that we know well in one scale can apply to other scales, just like your knowledge about SQL helps you with multiple databases or your mastery of Java translates (to a degree) to Kotlin or C#.

Summary

This metaphor is neither fully consistent (i.e., there will be many cases where it doesn’t apply) nor complete (i.e., I’d need to dedicate the rest of my career to finding all the missing examples).

However, I believe once you start “seeing it”, it cannot be “unseen”. You will find more and more cases where your knowledge reapplies and there is a feeling of familiarity.

Some might call it “common sense,” but I prefer to think of it as more elegant than that.