A few broad rules about microservices

I know, we don’t like microservices anymore, and they are out of fashion. However, I still think about how I was trying to build them until the sudden change in industry fashion convinced me that building modular monoliths was entirely different from how monoliths were meant to be built in the past.

In this post, I want to reflect on a few broad rules I follow(ed) when building microservices. I believe broad rules and flexible heuristics are appropriate when making architectural decisions because context (e.g., business, technical, human) is more critical than (blindly) following rules.

What is a microservice?

We should first qualify what a microservice represents.

Most people equate a microservice with a single service deployable. For example, my microservice is a REST API web server receiving traffic on port 8080. This is not the right way of thinking about microservices because microservices weren’t invented as a technical concept but as a sociotechnical one, a solution to the organisational scalability problem (in a technical context, nonetheless).

The rules that define microservice boundaries cannot be purely technical (albeit this is a crucial factor); they also need to incorporate aspects like team structure, cognitive load, product boundaries, geographical boundaries, etc.

Therefore, it becomes easier to define what a microservice is not:

A microservice is not a single process (e.g., a single JVM running in a container).
A microservice is not a single deployable (e.g., a single Docker image).
A microservice is not defined in a single repository (e.g., GitHub).

On the contrary, the following things are perfectly possible:

One microservice runs as multiple processes because, in its Kubernetes pod, we also have sidecar containers performing roles like Service Mesh, Ambassador, etc.
Multiple deployables working in coordination define one microservice. For example, in a CQRS or Event Sourcing architecture, you might separate your command handler and/or writer model producer from your query handler and/or event consumer.
If we choose to use a company-wide mono repo or product mono repos (all parts of a given product), one repository contains multiple microservices.

To summarise these rules, we can say:

A microservice is an application (i.e., a logical boundary for behavior represented by code and state resources) that is developed, deployed and maintained together.

In other words, every microservice has only one team that owns it.

This principle doesn’t rule out any of the following:

Multiple teams (and their products) depending on a given microservice. This is fine because it doesn’t change the fact that only one team owns the microservice.
Multiple teams contribute to the microservice code. This, too, is okay, provided the team that owns the microservice is happy to support some kind of coordinated contribution process (like the Inner Source model).

For reasons of accountability, only one team must own a service. We trust teams to build services and run them in production; in exchange, we must empower them to make the right decisions (within whatever architectural framework the company has adopted).

This empowerment is voided if it doesn’t come with the necessary accountability for the consequences of those decisions. “Sharing” ownership, at a minimum, dilutes accountability; at worst, it completely prevents it.

Private data storage is where a single microservice stores information for execution. This can be master data for lookups, historical payments data for de-duplication and/or state validation, etc.

Microservices must be able to change the schema of their data storage freely; after all, it is their private concern.

We achieve this by ensuring that two microservices never share the same private data storage.

Public vs private data storage

It is important to differentiate what private data storage is and when we see it as public.

Private data storage is created to support the functionality of a microservice; when the microservice logic changes, the storage (might) also change. Private data storages contain private data designed to be consumed by the microservice but not shared or available to other microservices. Private data storages don’t make any promises regarding schema changes (e.g., backward compatibility, forward compatibility) beyond what the single microservice requires. To sum up, private data is an implementation concern.

Public data storage is the opposite of what is described above: it is designed for sharing, uses schemas with compatibility guarantees, and is easy to consume. In other words, public data is an API.

The following table contains some examples of public and private data storage:

Storage	Type	Description
Private Kafka Topics	Private	Akin to tables in a private database. For an explanation of what a “private topic” means, see: https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#visibility
Shared/External Kafka Topics	Public	Akin to REST API endpoints. For an explanation of what a “shared/external topic” means, see: https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#visibility
Private Blob Storage (S3/Azure Blob Storage) folders	Private	Only used by a single microservice and/or application
Public Blob Storage (S3/Azure Blob Storage) folders	Public	Produced by one microservice, available for others to read.
Relational databases	Private	Only used by a single microservice and/or application.
NoSQL databases	Private	Only used by a single microservice and/or application.
Like-for-Like shared/external Kafka topics sunk into databases	Public	This is a particular case of the above. If a topic producer decides to offer a “queryable” version of the same data as a (SQL/NoSQL) database and it is captured as a like-for-like sink of a shared/external topic, it is public because: – Its schema will follow the same compatibility rules as the Kafka topic(s). – Its database (and tables/containers) are provisioned exclusively for sharing purposes, not as a private concern of any specific microservice.

Avoiding resource contention

Another reason for separating microservices databases is to avoid resource contention.

In a scenario where two microservices share the same database (or other infrastructure resources), they can run into Noisy Neighbour antipattern problems:

Application A receives a spike of traffic/load and starts accessing the shared resources more intensely.
Application B starts randomly failing when it cannot access the shared resources (or it takes longer than it can tolerate, leading to timeouts).

Ensuring every microservice accesses independent resources guarantees we don’t suffer these problems.

This principle can lead to increased infrastructure costs. For that reason, it is perfectly reasonable to consider the following exceptions:

Reuse underlying infrastructure in environments before the production environment, where the consequences of occasional resource contention are not particularly worrying.
Reuse underlying infrastructure between services whose volume is expected to be low. As long as the microservices aren’t coupled at the logical level (i.e., the data itself, not the storage infrastructure), it is relatively easy to “separate” their infrastructure in the future if required (compared to separating them if coupled at the logical schema level).

For the last point, I would advise against doing this with microservices owned by teams that are somewhat distant “organizationally” (e.g., crossing departments or division boundaries, minimal timezone overlap, or any other barrier that prevents fluid communication).

Rule 3 – Avoid distributing transactions through the network

I always recommend considering DDD heuristics to drive your microservice design. I use the DDD “Aggregate Root” concept to help me model microservices and their responsibilities. DDD defines “Aggregate Root” as follows:

An Aggregate Root in Domain Driven Design (DDD) is a design concept where an entity or object serves as an entry point to a collection of related objects, or aggregates. The aggregate root guarantees the consistency of changes being made within the aggregate by forbidding external objects from holding references to its members. This means all modifications within the aggregate are controlled and coordinated by the aggregate root, ensuring the aggregate as a whole remains in a consistent state. This concept helps enforce business rules and simplifies the model by limiting relationships between objects.

An aggregate root should always have one single “source of truth”, i.e., one microservice that manages its state (and modifications). We want this because it means we avoid (as much as possible) distributing transactions over multiple services (through the network).

The alternative (i.e., distributed transactions) suffers from a variety of problems:

Performance problems when leveraging Distributed Transaction Coordination technology like XA Transactions or Microsoft DTC (i.e., 2-phase commits).
Complexity when using patterns like Saga and/or Compensating Transaction.

Designing your Aggregate Roots perfectly doesn’t guarantee you won’t need some of those patterns. However, it will minimise how often you need them.

In summary, if your microservice setup splits an aggregate root, you are doing it wrong; you should “merge” those two services.
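To make the aggregate root idea more concrete, here is a minimal sketch (not taken from any real codebase). It assumes a hypothetical `Payment` aggregate whose `Refund` members can only be added through the root, so the invariant "refunds never exceed the captured amount" is checked in one place, inside one service. All class and field names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Refund:
    """Member of the aggregate; immutable, so outside code cannot alter it."""
    amount_cents: int


@dataclass
class Payment:
    """Hypothetical aggregate root: the only entry point for changes."""
    payment_id: str
    captured_cents: int
    _refunds: List[Refund] = field(default_factory=list)

    def refund(self, amount_cents: int) -> None:
        # The invariant is enforced here, in one process, with no network hop.
        if amount_cents <= 0:
            raise ValueError("refund must be positive")
        if self.refunded_cents + amount_cents > self.captured_cents:
            raise ValueError("refunds cannot exceed the captured amount")
        self._refunds.append(Refund(amount_cents))

    @property
    def refunded_cents(self) -> int:
        return sum(r.amount_cents for r in self._refunds)

    @property
    def refunds(self) -> Tuple[Refund, ...]:
        # Expose a read-only copy rather than the internal list.
        return tuple(self._refunds)
```

If `Payment` and `Refund` lived in two different services, the same check would need a distributed transaction or a saga; keeping them behind one root keeps it a local concern.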

Rule 4 – Network latency adds up

Crossing the network is one of the slowest operations. It also introduces massive uncertainty and new failure scenarios compared to running a single process in the same memory space. Jonas Bonér has a fantastic talk about the “dangers” of networks’ non-deterministic behaviour compared to the “consistency” one can expect from in-memory communication.

This is true when you call other microservices (e.g., directly via REST or indirectly via asynchronous communication) and when talking to external infrastructure dependencies like databases.

When considering “dividing” your system into multiple microservices, consider the impact on end-to-end latency against any non-functional requirements for latency (e.g., 99th percentile latency).
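A rough back-of-the-envelope sketch of why this matters. It assumes sequential calls whose latencies are independent of each other (a simplification; real hops are often correlated), and the per-hop figures are made-up illustrative numbers.

```python
def end_to_end_latency_ms(hop_latencies_ms):
    """Sequential hops: their latencies simply add up."""
    return sum(hop_latencies_ms)


def chance_of_hitting_a_tail(hops, percentile=0.99):
    """Probability that at least one of `hops` independent calls is slower than its own percentile."""
    return 1 - percentile ** hops


# Hypothetical request path: gateway -> service A -> service B -> database.
typical_ms = end_to_end_latency_ms([2, 5, 5, 3])
tail_risk = chance_of_hitting_a_tail(4)

print(typical_ms)           # 15
print(round(tail_risk, 4))  # 0.0394
```

Under these assumptions, roughly 4% of requests that cross four hops would hit at least one hop's 99th-percentile latency, compared with 1% for a single hop.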

Rule 5 – No Service to Service (S2S) calls

This rule only applies if you are following a strict “Event Driven Architecture”. Even in that scenario, there will be cases where S2S calls will be “necessary” to avoid unnecessary complexity.

One of the benefits of microservice architecture is the decoupling that we get from services that depend on each other indirectly. In a monolith, all modules live and fail together, causing a large “blast radius” when something goes wrong (i.e., the whole thing fails).

In traditional microservices (e.g., synchronous communication via REST/HTTP or gRPC), there is a decoupling in “space” (i.e., the services don’t share the same hardware). However, they are still coupled “in time” (i.e., to an extent, they all need to be healthy for the system to perform). Some patterns, like circuit breakers, aim to mitigate this risk.

Avoiding S2S calls breaks the coupling “in time” by introducing a middleware (e.g., message broker, distributed log) that guarantees producers and consumers don’t need to be online simultaneously, only the middleware. This middleware software tends to be designed to be highly available and resilient to network and software failure. For example, Kafka has some parts verified using TLA+.

This was an excellent talk at #kafkasummit. Great to see formal methods (here: TLA+) being used to verify the @apachekafka replication protocol. https://t.co/pKNgBbaSeI https://t.co/0OLviunk5i
— Martin Kleppmann (@martinkl) October 16, 2018

To sum up, “forcing” microservices to communicate asynchronously causes teams to consider their architecture in terms of:

Eventual consistency
Asynchronous communication
Commands and events exchanged between them

This leads to more resilient, highly available systems in exchange for (potential) complexity. If you follow the principles of the Reactive Manifesto, you’ll consider this a staple. However, it might feel technically challenging if you are used to n-tier monoliths sitting on an extensive Oracle/SQLServer database.
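The decoupling “in time” can be illustrated with a toy, in-memory stand-in for a distributed log. This is only a sketch of the idea, not how Kafka or any real broker is implemented; `ToyLog` and its methods are hypothetical names.

```python
class ToyLog:
    """In-memory stand-in for a broker topic: an append-only list plus per-group offsets."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer group -> next offset to read

    def publish(self, record):
        # The producer only needs the log to be available, not the consumer.
        self._records.append(record)

    def poll(self, group):
        # The consumer resumes from its own offset whenever it comes back online.
        start = self._offsets.get(group, 0)
        batch = self._records[start:]
        self._offsets[group] = len(self._records)
        return batch


log = ToyLog()
log.publish("payment-created")   # the consumer is offline at this point
log.publish("payment-settled")
received = log.poll("ledger")    # the consumer comes online later and catches up
```

With a direct S2S call, both `publish` calls would have failed (or blocked) while the consumer was down; here only the log needs to be up.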

Conclusions

There are few hard rules that one must always follow in anything related to building software. It is such a contextual activity that, for every question, there is almost always an “It depends” answer. That said, having a target architecture, a north star that the team collectively agrees to aim for, is good. When it is not followed, some analysis (ideally recorded for the future) should be done about why a decision was made against the “ideal” design.

In this post, I proposed a few rules I tend to follow (and recommend) when building microservices. Sometimes, it will make sense to break them; however, if you find yourself breaking them “all the time”, you might not be doing microservices in anything other than the name (and that, too, could be okay, but just call it what it is :))

Naming Kafka objects (III) – Kafka Connectors

We discussed naming conventions for Kafka topics and Kafka Producers/Consumers in the previous two posts. This time, we are focusing on Kafka Connect and the connectors running on it.

We will not discuss naming conventions related to Kafka Connect clusters (e.g., config/storage/offset topic names, group.id, etc.). They are normally managed by SysAdmin/DevOps teams, and these posts are zooming in on developer-related naming conventions.

Kafka Connect in a few words

What is Kafka Connect?

Kafka Connect is a tool for streaming data between Apache Kafka and external systems, such as databases, cloud services, or file systems. It simplifies data integration by providing a scalable and fault-tolerant framework for connecting Kafka with other data sources and sinks, without the need to write custom code.

In other words, Kafka doesn’t exist in a vacuum; there are different “non-Kafka” systems that it needs to interact with, i.e., consume from and produce to. Kafka Connect simplifies this task massively by offering “connector plugins” that will translate between:

Protocols: SFTP->Kafka, Kafka->SFTP, Kafka->HTTP, Salesforce API->Kafka, etc.
Formats: CSV->Avro, Avro->JSON, Avro->Parquet, etc.

Theoretically, Kafka Connect can also translate between schemas (i.e., data mapping) via Single Message Transformations. However, I advise against using them except for the most trivial transformations.

Kafka Connect defines two types of connectors:

Source connectors consume data from non-Kafka systems (e.g., databases via CDC, file systems, other message brokers) and produce it for Kafka topics. These connectors are “Kafka producers” with a client connection to the source data system.
Sink connectors consume data from Kafka topics and produce it to non-Kafka systems (e.g., databases, file systems, APIs). Internally, they work as “Kafka Consumers” with a client connection to the destination system.

Naming Conventions

Now that we have a basic understanding of Kafka Connect, let’s examine the most relevant settings that require precise, meaningful naming conventions.

`connector.name`

This is obviously the “number one” setting to define. A few things to consider:

It has to be globally unique within the Connect cluster. In other words, no connector in the cluster can share the same name.
It is part of the path to access information about the connector config and/or status via the Connect REST API (unless you use the expand option and get them all at once).
For Sink connectors, it serves as the default underlying consumer group (plus a connect- prefix). In other words, if your connector is called my-connector, the underlying consumer group will be called connect-my-connector by default.

With that in mind, the proposed naming convention is as follows:

[environment].[domain].[subdomain(s)].[connector name]-[connector version]

Component	Description
`environment`	(Logical) environment that the connector is part of. For more details, see https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#environment
`domain.subdomain(s)`	Leveraging DDD to organise your system “logically” based on business/domain components. Break down into a domain and subdomains as explained in https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#domain-subdomain-s
`connector-name`	A descriptive name for what the connector is meant to do.
`connector-version`	As the connector evolves, we might need to run side-by-side versions of it or reset the connector completely giving it a new version. Format: vXY (e.g., ‘v01’, ‘v14’). This field is not mandatory; you can skip it if this is the first deployment.
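As a minimal sketch, the convention above could be assembled with a small helper like the one below. The function name, the validation rule (lowercase alphanumerics and hyphens per component) and the example values are assumptions for illustration, not part of Kafka Connect.

```python
import re
from typing import Optional, Sequence


def connector_name(environment: str, domain: str, subdomains: Sequence[str],
                   name: str, version: Optional[int] = None) -> str:
    """Builds [environment].[domain].[subdomain(s)].[connector name]-[connector version]."""
    parts = [environment, domain, *subdomains, name]
    for part in parts:
        # Assumption: no dots inside a component, so the dot separator stays unambiguous.
        if not re.fullmatch(r"[a-z0-9-]+", part):
            raise ValueError(f"invalid component: {part!r}")
    base = ".".join(parts)
    # The version is optional and formatted as vXY (e.g., v01, v14).
    return base if version is None else f"{base}-v{version:02d}"


first_deployment = connector_name("dev1", "payments", ["reconciliation"], "s3-sink")
second_version = connector_name("dev1", "payments", ["reconciliation"], "s3-sink", version=2)

print(first_deployment)  # dev1.payments.reconciliation.s3-sink
print(second_version)    # dev1.payments.reconciliation.s3-sink-v02
```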

Do we really need `[environment]`?

This is a legit question. We said that the connector name must be (globally) unique in a given Kafka Connect cluster where it is deployed. A Kafka Connect cluster can only be deployed against a single Kafka cluster. Therefore, it can only sit in a single (physical) environment. If that is the case, isn’t the “environment” implicit?

Not necessarily:

Your Kafka cluster might be serving multiple logical environments (DEV1, DEV2, etc.). As a result, a single Kafka Connect cluster might be sitting across multiple logical environments even if it belongs to a single physical environment. In this deployment topology, you might have the same connector in multiple logical environments, which would require the [environment] component to disambiguate and guarantee uniqueness.
Alternatively, you might deploy multiple Kafka Connect clusters serving single logical environments against a single (physical) Kafka cluster. You might be tempted to think that [environment] is not needed in this scenario since the connector name will be unique within its cluster. However, “behind the scenes”, sink connectors create a Kafka consumer whose consumer group name matches the connector name (plus a connect- prefix). Therefore, if connectors with the same name in multiple Connect clusters create the same consumer group, all sorts of “issues” will arise (in practice, they end up forming one big consumer group targeting the topics across all logical environments in that physical Kafka cluster).

In summary, if you don’t use any concept of “logical environment(s)” and can guarantee that a given connector will be globally unique in the Kafka cluster, you don’t need the [environment] component.
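The second scenario can be sketched as follows. The Connect cluster names and connector names are hypothetical; the only behaviour relied upon is the default described in this post, i.e., a sink connector's consumer group is its name plus a connect- prefix.

```python
def default_sink_consumer_group(connector_name: str) -> str:
    # Default group for a sink connector: the connector name plus a "connect-" prefix.
    return f"connect-{connector_name}"


# Two hypothetical Connect clusters (one per logical environment) on the same Kafka cluster.
without_env = {
    "connect-cluster-dev1": "payments.ledger.jdbc-sink",
    "connect-cluster-dev2": "payments.ledger.jdbc-sink",
}
with_env = {
    "connect-cluster-dev1": "dev1.payments.ledger.jdbc-sink",
    "connect-cluster-dev2": "dev2.payments.ledger.jdbc-sink",
}

groups_without_env = {default_sink_consumer_group(n) for n in without_env.values()}
groups_with_env = {default_sink_consumer_group(n) for n in with_env.values()}

print(len(groups_without_env))  # 1: both connectors end up in the same consumer group
print(len(groups_with_env))     # 2: each connector gets its own consumer group
```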

consumer.override.group.id

Starting with 2.3.0, client configuration overrides can be configured individually per connector by using the prefixes producer.override. and consumer.override. for Kafka sources or Kafka sinks respectively. These overrides are included with the rest of the connector’s configuration properties.

Generally, I don’t recommend playing with consumer.override.group.id. Instead, it is better to give an appropriate name to your connector (via connector.name), as per the previous section.

However, there might be scenarios where you can’t or don’t want to change your connector.name yet you still need to alter your default sink connector’s consumer group. Some examples:

You have already deployed your connectors without [environment] in your connector.name (or other components) and now you want to retrofit them into your consumer group.
You have strict consumer group or connector.name naming conventions that aren’t compatible with each other.
You want to “rewind” your consumer group but, for whatever reason, don’t want to change the connector.name.

In terms of a naming convention, I would recommend the simplest option possible:

[environment or any-other-component].[connector.name]

In other words, I believe your consumer group name should track as closely as possible your connector.name to avoid misunderstandings.
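A minimal sketch of the retrofit scenario, with the connector configuration represented as a Python dictionary. The helper name, connector name and topic name are hypothetical; `name`, `topics` and `consumer.override.group.id` are the actual configuration keys. Whether the worker accepts such overrides depends on its `connector.client.config.override.policy` setting, so treat this as an illustration rather than a drop-in configuration.

```python
def with_group_override(config: dict, prefix: str) -> dict:
    """Returns a copy of a sink connector config using [prefix].[connector name] as consumer group."""
    updated = dict(config)
    updated["consumer.override.group.id"] = f"{prefix}.{config['name']}"
    return updated


# Hypothetical connector deployed earlier without an environment component in its name.
existing = {"name": "payments.ledger.jdbc-sink", "topics": "dev1.payments.ledger.entries"}
retrofitted = with_group_override(existing, "dev1")

print(retrofitted["consumer.override.group.id"])  # dev1.payments.ledger.jdbc-sink
```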

consumer/producer.override.client.id

client.id was discussed in a previous post about Producer’s client.id and Consumer’s client.id.

As discussed in that post, it is responsible for a few things:

Shows up in logs to make it easier to correlate them with specific producer/consumer instances in an application with many of them (like a Kafka Streams app or a Kafka Connect cluster).
It shows up in the namespace/path for JMX metrics coming from producers and consumers.

With that in mind, knowing that we already have a pretty solid, meaningful and (globally) unique connector.name convention, this is how we can name our producer/consumer client.id values.

Connector Type	Property Override	Value
Source connector	`producer.override.client.id`	`{connector.name}-producer`
Sink connector	`consumer.override.client.id`	`{connector.name}-consumer`
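The table above can be turned into a small helper that produces the matching configuration fragment. This is a sketch: the function name and the example connector name are assumptions, and the connector's own name is read from the `name` key of its configuration.

```python
def client_id_override(connector_name: str, connector_type: str) -> dict:
    """Maps the table above to a config fragment for a source or sink connector."""
    if connector_type == "source":
        return {"producer.override.client.id": f"{connector_name}-producer"}
    if connector_type == "sink":
        return {"consumer.override.client.id": f"{connector_name}-consumer"}
    raise ValueError("connector_type must be 'source' or 'sink'")


# Hypothetical sink connector following the naming convention from this post.
sink_config = {"name": "dev1.payments.ledger.jdbc-sink"}
sink_config.update(client_id_override(sink_config["name"], "sink"))

print(sink_config["consumer.override.client.id"])  # dev1.payments.ledger.jdbc-sink-consumer
```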

Conclusion

We have discussed the most relevant properties that require naming conventions in Kafka Connect connectors. As usual, we aim to have semantically meaningful values that we can use to “reconcile” what’s running in our systems and what every team (and developer) owns and maintains.

By now, we can see a consistent naming approach emerging, rooted in environments, DDD naming conventions and some level of versioning (when required).

Naming Kafka objects (II) – Producers and Consumers

In a previous post in this “naming” series, we discussed how to name Kafka topics. The intention was to name them semantically meaningfully while avoiding collisions and ambiguity. The focus was on the “nature” of the topics’ data.

This post will discuss the two main “clients” connected to those topics: producers who write data into them and consumers who read data from them. The focus will move away from the “data” towards the applications involved in the data flow.

Producers first

Nothing can be consumed unless it has been produced first; therefore, let’s start with producers.

They do a very “simple” job:

Pull metadata from the cluster to understand which brokers take the “leader” role for which topics/partitions.
Serialise the data into byte arrays.
Send the data to the appropriate broker.

In reality, it is much more complicated than this, but this is a good enough abstraction. Out of the dozens of configuration settings producers support, only two settings accept a “naming convention”.

`client.id`

The Kafka documentation defines client.id as follows:

An id string to pass to the server when making requests. The purpose of this is to be able to track the source of requests beyond just ip/port by allowing a logical application name to be included in server-side request logging.

We want a naming convention that makes mapping Producer applications to domains and teams easy. Furthermore, these names should be descriptive enough to understand what the Producer application aims to achieve.

Organising your JMX metrics

There is also an extra role that client.id plays that people tend to forget: it namespaces observability metrics. For example, producers emit metrics under the following JMX MBean namespaces:

kafka.producer:type=producer-metrics,client-id={clientId}
kafka.producer:type=producer-node-metrics,client-id={clientId},node-id=([0-9]+)
kafka.producer:type=producer-topic-metrics,client-id={clientId},topic={topic}

Notice how all of them use clientId as part of the namespace name. Therefore, if we don’t assign meaningful values to client.id, we won’t be able to distinguish the appropriate metrics when multiple producers consolidate their metrics into a single metrics system (like Prometheus), especially if they come from the same application (i.e., 1 application using N producers).

client.id also regularly features in other observability components like logs.

Naming convention

The proposed convention looks like this:

[environment]-com.[your-company].[domain].[subdomain(s)].[app name].[entity/event name]

Component	Description
`[environment]`	(Logical) environment that the producer is part of. For more details, see https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#environment
`com.[your-company]`	Follows a “Java-like” namespacing approach to avoid collisions with other components emitting metrics to the centralised metric database
`[domain].[subdomain(s)]`	Leveraging DDD to organise your system “logically” based on business/domain components. Break down into a domain and subdomains as explained in https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#domain-subdomain-s
`[app-name]`	The app name should be specific enough to make it easy to find the codebase involved and the team that owns it.
`[entity/event-name]`	Describes what information the producer is sending. It doesn’t need to include the full topic name since the context is already clear (e.g., payment, transaction, account). This field is not mandatory.
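A minimal sketch of the convention as a helper function, together with the producer-metrics MBean name it would lead to. The function names, the company name `acme` and the other example values are assumptions for illustration.

```python
from typing import Optional, Sequence


def producer_client_id(environment: str, company: str, domain: str,
                       subdomains: Sequence[str], app_name: str,
                       entity: Optional[str] = None) -> str:
    """Builds [environment]-com.[your-company].[domain].[subdomain(s)].[app name].[entity/event name]."""
    parts = ["com", company, domain, *subdomains, app_name]
    if entity:
        # The entity/event name is optional.
        parts.append(entity)
    return f"{environment}-" + ".".join(parts)


def producer_metrics_mbean(client_id: str) -> str:
    # Same shape as the producer-metrics MBean namespace listed earlier in this post.
    return f"kafka.producer:type=producer-metrics,client-id={client_id}"


client_id = producer_client_id("prod", "acme", "payments", ["settlement"],
                               "settlement-service", "payment")
print(client_id)
# prod-com.acme.payments.settlement.settlement-service.payment
print(producer_metrics_mbean(client_id))
# kafka.producer:type=producer-metrics,client-id=prod-com.acme.payments.settlement.settlement-service.payment
```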

Why do we need an entity/event name?

When your application has multiple producers, client.id needs to be unique for each one. Therefore, the ‘entity/event’ in the last section of the client.id name disambiguates them. You don’t need to define an entity/event name if you only use one producer for the application.

Don’t we need a ‘version’ part?

Other naming conventions define a ‘version’ as part of their respective names. This is only necessary when the client is related to state; for example, Consumers and Streams apps must store committed offsets.

Producers, on the other hand, are completely stateless. Adding a ‘version’ part would only make sense if we keep multiple Producer application versions running side-by-side. Even then, one would argue that versioning the application itself would be a better strategy than versioning the Producer client.id.

`transactional.id`

The Kafka documentation defines transactional.id as follows:

The TransactionalId to use for transactional delivery. This enables reliability semantics which span multiple producer sessions since it allows the client to guarantee that transactions using the same TransactionalId have been completed prior to starting any new transactions. If no TransactionalId is provided, then the producer is limited to idempotent delivery. If a TransactionalId is configured, enable.idempotence is implied

There are a few “small” differences between client.id and transactional.id:

client.id doesn’t need to be unique (but I strongly recommend it). transactional.id MUST be unique.
client.id is more “visible” towards developers (through O11Y). transactional.id is mostly opaque, operating behind the scenes in the transaction management subsystem.
client.id can change, although it would make your O11Y information very confusing. transactional.id MUST be stable between restarts.

Other than that, there is nothing special about transactional.id so I recommend using the same naming convention that I have proposed for client.id in the section above.

Consumers second

We have sorted producers and they are happily producing data. It’s time to look at the other side: consumers.

They too do a very “simple” job:

Get a bunch of topic/partitions assigned as part of the consumer group partition assignment process.
Connect to the brokers acting as leaders for those topic/partitions.
Regularly (attempt) to pull new data from the assigned topic/partitions.
When there is something available, read it (as byte arrays) through the connection.
When it arrives in the application space, deserialise the data into actual objects.

A few configuration settings play a role in this process.

`group.id`

The Kafka documentation defines group.id as follows:

A unique string that identifies the consumer group this consumer belongs to. This property is required if the consumer uses either the group management functionality by using subscribe(topic) or the Kafka-based offset management strategy.

We want a naming convention that makes mapping Consumer applications to domains and teams easy. Furthermore, these names should be descriptive enough to understand what the Consumer application aims to achieve.

The proposed naming convention is as follows:

[environment]-com.[your-company].[domain].[subdomain(s)].[app name].[entity/event-name]-[version]

Component	Description
`[environment]`	(Logical) environment that the consumer is part of. For more details, see https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#environment
`com.[your-company]`	Follows a “Java-like” namespacing approach to avoid collisions with other components emitting metrics to the centralised metric database
`[domain].[subdomain(s)]`	Leveraging DDD to organise your system “logically” based on business/domain components. Break down into a domain and subdomains as explained in https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#domain-subdomain-s
`[app-name]`	The app name should be specific enough to make it easy to find the codebase involved and the team that owns it.
`[entity/event-name]`	Describes what information the consumer is reading. It doesn’t need to include the full topic name since the context is already clear (e.g., payment, transaction, account). This field is not mandatory.
`[version]`	Only introduce or change this value if you need to run side-by-side versions of the app or simply start from scratch. Format: vXY (e.g., ‘v01’, ‘v14’). This field is not mandatory.
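A minimal sketch of the group.id convention as a helper function. The function name and the example values (including the company name `acme`) are assumptions for illustration.

```python
from typing import Optional, Sequence


def consumer_group_id(environment: str, company: str, domain: str,
                      subdomains: Sequence[str], app_name: str,
                      entity: Optional[str] = None,
                      version: Optional[int] = None) -> str:
    """Builds [environment]-com.[your-company].[domain].[subdomain(s)].[app name].[entity/event-name]-[version]."""
    parts = ["com", company, domain, *subdomains, app_name]
    if entity:
        parts.append(entity)
    base = f"{environment}-" + ".".join(parts)
    # The version is optional and formatted as vXY (e.g., v01, v14).
    return base if version is None else f"{base}-v{version:02d}"


group_id = consumer_group_id("prod", "acme", "payments", ["settlement"],
                             "settlement-service", "payment", version=2)
print(group_id)  # prod-com.acme.payments.settlement.settlement-service.payment-v02
```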

Why do we need an entity/event name?

When your application has multiple consumers, it needs a unique group.id for every one of them. Therefore, the ‘entity/event’ in the last section of the group.id name should disambiguate between them, and it becomes mandatory.

You don’t need to define an entity/event name if you only use one consumer for the application.

Why version the `group.id` value?

The Kafka Consumer uses group.id to define a consumer group for multiple instances of the same application. Those instances collaborate within the group, sharing partitions, picking up partitions from failed instances and committing offsets so other instances don’t process records that another instance has processed already.

Offsets are committed under the group.id name. Therefore, it is critical to use the same group.id value across application deployments to guarantee that it continues to consume from where it left off.

However, there are times when we might want to effectively reset the consumer. The easiest way to do that is to change the group.id. In this case, we can use ‘version’ to have a new consumer group that ignores where the previous deployment instances got up to and falls back to auto.offset.reset to decide where to start consuming.
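The effect of bumping the version can be sketched with a toy model of committed offsets for a single partition. This is a simplification of the broker's behaviour: the function name and values are hypothetical, the log is assumed to start at offset 0, and only the `earliest` and `latest` values of auto.offset.reset are modelled.

```python
def starting_offset(committed: dict, group_id: str, log_end: int, auto_offset_reset: str) -> int:
    """Toy model of where a consumer group starts reading on a single partition."""
    if group_id in committed:
        return committed[group_id]  # resume where the group left off
    # No committed offset for this group: fall back to auto.offset.reset.
    if auto_offset_reset == "earliest":
        return 0                    # assumes the log starts at offset 0
    if auto_offset_reset == "latest":
        return log_end
    raise ValueError("unsupported auto.offset.reset value in this sketch")


committed = {"prod-com.acme.payments.ledger-v01": 1200}

same_group = starting_offset(committed, "prod-com.acme.payments.ledger-v01", 1500, "earliest")
new_version = starting_offset(committed, "prod-com.acme.payments.ledger-v02", 1500, "earliest")

print(same_group)   # 1200
print(new_version)  # 0
```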

If I’m versioning my application, should I use it for the ‘version’ value?

Short answer: NO

Longer answer: you probably are (loosely) semantic versioning your application; every merged PR will represent a new version. You don’t want to change your group.id every time your application version changes. The ‘version’ mentioned in the group.id is very specific to the consumer group and how it manages offsets. Don’t mix the two together.

`group.instance.id`

The Kafka documentation defines group.instance.id as follows:

A unique identifier of the consumer instance provided by the end user. Only non-empty strings are permitted. If set, the consumer is treated as a static member, which means that only one instance with this ID is allowed in the consumer group at any time. This can be used in combination with a larger session timeout to avoid group rebalances caused by transient unavailability (e.g. process restarts). If not set, the consumer will join the group as a dynamic member, which is the traditional behavior.

In other words, while group.id identifies 1 or more instances that belong to a consumer group, group.instance.id identifies unique instances.

The main purpose of group.instance.id is to enable static membership of the consumer group. This helps reduce group rebalances when instances are briefly unavailable. The assumption is that it is better to delay whatever partitions are consumed by the temporarily missing instance than to rebalance the complete group, affecting all other instances.

I recommend using the same group.id naming convention PLUS something that identifies the instance uniquely and is stable between restarts.
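One possible way to do this, sketched under the assumption that the application runs as a Kubernetes StatefulSet (where a pod name such as `settlement-service-0` keeps its ordinal across restarts). The function names and example values are hypothetical.

```python
def ordinal_from_pod_name(pod_name: str) -> str:
    """Extracts the trailing ordinal from a StatefulSet-style pod name (e.g., 'app-3' -> '3')."""
    ordinal = pod_name.rsplit("-", 1)[-1]
    if not ordinal.isdigit():
        raise ValueError(f"no ordinal found in pod name: {pod_name!r}")
    return ordinal


def group_instance_id(group_id: str, instance_suffix: str) -> str:
    """group.id plus a suffix that is unique per instance and stable between restarts."""
    if not instance_suffix:
        raise ValueError("instance suffix must be a non-empty string")
    return f"{group_id}-{instance_suffix}"


instance_id = group_instance_id(
    "prod-com.acme.payments.settlement.settlement-service-v01",
    ordinal_from_pod_name("settlement-service-0"),
)
print(instance_id)  # prod-com.acme.payments.settlement.settlement-service-v01-0
```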

`client.id`

client.id serves the exact same purpose in consumers and producers. Therefore, I will refer you to the previous section for producer’s client.id for a naming convention proposal. See https://javierholguera.com/2024/09/12/naming-kafka-objects-i-producers-and-consumers/#client-id

Conclusions

Naming is difficult and requires care. However, investing in good naming conventions reduces accidental complexity, helps with debugging and diagnosing your system, and supports development through its end-to-end lifecycle.

In this post, I proposed multiple naming conventions that aim to be semantically meaningful, allow you to organize your system into sensible components, and support your system’s incremental evolution.

Software engineering trends that are reverting (I)

When I entered the software industry a long time ago, people who had been part of it warned me that software trends came and went and eventually returned. “This thing that you call ‘new’, I have seen it before”. I refused to believe it. Like a wannabe Barney Stinson, I thought ‘new’ was always the way.

I have been around long enough to see this phenomenon with my own eyes. In this series of posts, I want to call out a few examples of “trends” (i.e., new things) that a) aren’t new anymore and b) people are walking away from. The series starts with one of the most “controversial” trends in the last 10-15 years: microservices!

Microservices are so 2010s

Microservices are dead! I’m joking. They are not dead, but they are not the default option anymore. We are back to… monoliths.

While there have always been people who thought microservices weren’t a good idea, the inflexion point was the (in)famous blog post from Amazon Prime Video about replacing their serverless architecture with a good, old monolith (FaaS is just an “extreme” version of microservices).

Why was this more significant than the thousands of posts claiming microservices were unnecessary complexity, talking about distributed monoliths and criticising an architectural approach that came from FAANG and only suited FAANG? Well, because… it came from FAANG. The haters could claim that even a FAANG company had realised microservices weren’t a good idea (“We won!”).

Realistically, this would have been anecdotal if it weren’t for something more important than a bunch of guys finding a way to save money when they serve millions of daily viewers (do YOU have THAT problem?).

It’s the economy, stupid

The image above shows the US Fed’s official interest rates. Historically, interest rates have been pretty high (about 5%, according to a recent interview with Nassim Taleb on Bloomberg). From 2008 to post-COVID 2022, we experienced an anomaly: close to 0% rates for almost 15 years. Investors desperate to find good returns for their money poured billions into tech companies, hoping to land the next Google or Facebook/Meta.

Source: https://goingdigital.oecd.org/en/indicator/35

Lots of startups with huge funding rounds started to cosplay as future members of the FAANG club: copy their HR policies, copy their lovely offices, and, of course, copy their architectural solutions because, you know, we are going to be so great that we need to be ready, or we might die of success.

We all built Cloud Native systems with Shared-Nothing architectures that followed every principle in the Reactive Manifesto and were prepared to scale… to the moon! 🚀 Microservices were the standard choice unless you were more adventurous and wanted to go full AWS Lambda (or a similar FaaS offering) and embrace FinOps in its purest form.

The only drawback is that it was expensive (let’s ignore complexity for now, shall we?). That didn’t matter when the money was flowing, but now the music has stopped, and everybody is intensely staring at their cloud provider bill and wondering what they can do to pay a fraction of it.

What is next?

Downsizing all things.

Before	After	Comment
Microservices/FaaS	Monolith(s)	“Collapse” multiple codebases into one and deploy as a single unit. The hope is that teams have become more disciplined at modularising (unlikely) and “build systems” have become more efficient in managing large codebases (possibly).
Messaging (Kafka et al)	Avoid middleware as much as possible	Middleware is expensive technology. With monoliths, there will be fewer network calls that require it. Direct communication (e.g., HTTP, gRPC) will be the standard (again) when necessary. Chunkier monoliths will reduce network traffic compared to microservices.
NoSQL	Relational	Many NoSQL databases optimise for high throughput / low latency / high durability, which will happily be sacrificed for cost savings. Relational databases are easier to operate and run yourself (i.e., self-host), which is the cheapest option (some NoSQL, like CosmosDB or DynamoDB, can’t be self-hosted). On the complexity side, relational databases are seen as easier for developers to understand (until you see things like this).
Stream Processing	Gone except for truly big data	Stream Processing is expensive and complex. Most businesses won’t care enough about latency to pay for it, nor will they have volumes that require it.
Kubernetes	Cloud-specific container solutions	We should see a transition towards more “Heroku-like” execution platforms. It will be a tradeoff between flexibility (which K8S offers by the bucketload) and cost/simplicity. Sometimes, containers will be ditched too and replaced by language-specific solutions (like Azure Spring Apps) to raise the abstraction bar even higher.
Multi-region / Multi-AZ deployments	No multi-region unless compliance requirement. Fewer multi-AZ deployments	Elon has proved that a semi-broken Twitter is still good enough, so why wouldn’t companies building less critical software aim for 3-5 9s?
Event-Driven Architecture	Here to stay	This approach isn’t more or less expensive than Batch Processing (if anything, it’s cheaper) and still models business flows more accurately.

What are we gaining and losing?

Microservices are neither the silver bullet nor the worst idea ever. As with most things, they have PROs and CONs. If we ditch them (or push back harder against their adoption), we will win things and lose things.

What do we win?

It is easier to develop against a single codebase.
Local testing is simpler because running a single service on your machine is more straightforward than running ten. Remote testing is also more accessible, as hitting one API is less complicated than hitting many across the network.
It is also easier to deploy a single service than many.
Easier maintainability/evolvability. When a business process has been incorrectly modelled, it is easier to fix on a monolith (with, ideally, single data storage) than across many services with public APIs and different data storages.

What do we lose?

Once a codebase is large enough, it is tough to work against it. Software is fractal, which is also valid for “build systems”: you want to divide and conquer.
Deploying a single service can be more challenging if multiple people (or, even worse, teams) need to release changes simultaneously. More frequent deployments can alleviate the problem, but most companies don’t go from a dev branch to PROD in hours but days/weeks.
The blast radius for incorrect changes will be higher. Systems are more resilient when they are appropriately compartmentalized.
Organisations growing (are there any left?) will struggle to increase their team’s productivity linearly with the headcount when the monolith becomes the bottleneck for all software engineering activities.
FinOps and general cost observability against business value will massively suffer. A single monolith will lump everything together. With multiple teams involved, it will be harder to understand who is making good implementation decisions and who isn’t, as the cost will be amalgamated into a single data point.

Summary

Microservices are not dead. However, they are viewed with suspicion because they are expensive in terms of infrastructure cost and, indirectly, engineering hours due to their increased complexity. At the same time, they are crucial to unlocking organisational productivity as the engineering team grows beyond a bunch of guys sitting together.

As the industry turns its back on FAANG practices and we sacrifice various “-ilities” on the altar of cost savings, the future of microservices will be decided based on how often we identify when they are the absolute right solution and how well we articulate their case. When in doubt, the answer will be (and perhaps it should have always been) ‘NO’.

As a parting thought, I have been involved in 3 large-scale monolith refactors/rewrites to microservices. All these projects were incredibly complex, significantly delayed and more of a failure than a success (some never entirely completed). Starting with a monolith is, most of the time, the correct answer. However, delaying a transition to smaller, independent services is almost always as bad (if not worse) than starting with microservices would have been in the first place. We are entering a new era where short-term thinking will be even more prevalent than before.
