A mental model for Data Integrations

If you’ve ever worked on a data integration project and felt overwhelmed by all the moving parts, you’re not alone. It’s easy to get lost in the details or to over-engineer parts that don’t really need it.

This post introduces a simple model that can help you make sense of it all. By breaking data integrations into three clear levels (protocol, format, and schema), you’ll be better equipped to focus your energy where it matters and rely on proven tools where it doesn’t.

What is a “Data Integration”?

Whenever two systems connect or integrate, they inevitably share some form of information. For example, if System A requests System B to process a payment, System A must provide certain data for the payment to proceed.

However, this kind of functional exchange is not what I define as a “Data Integration.” Instead, a Data Integration occurs when the primary goal is to share data itself, rather than invoking another system’s functionality. In other words, Data Integration refers specifically to integrations where one system consumes data provided by another, rather than utilizing that system’s behavioral capabilities.

We now live in a world dominated by data-intensive applications, making Data Integrations the most critical form of system integration. These integrations have displaced traditional functional integrations, which used to prevail when data was less ubiquitous and valuable.

A 3-level model for data integrations

Every data integration can be broken down into three levels or layers, each with distinct responsibilities (and constraints). The beauty of it is that they are (almost) completely independent of each other and interchangeable.

I like my data integrations like I like my cake: layered

Protocol Level

This is the foundational level and it is all about “exchanging bytes” between systems, without concern for anything else.

This layer is typically hidden behind higher-level abstractions and, while all abstractions leak, you can be generally confident that you won’t need to look under the hood to check out what is happening here and how it works. Things just work.

Examples of protocols:

  • HTTP(S), which powers REST-based APIs.
  • SMTP to exchange emails.
  • (S)FTP to exchange files.
  • Proprietary or technology-specific protocols, like those used by SQL Server, PostgreSQL, Apache Kafka, Google’s gRPC, etc. They tend to be built on top of TCP.

As with all technologies, the more established a protocol is, the stronger and more reliable its abstractions become, requiring you to know less about underlying details.

At the same time, technology is subject to The Lindy Effect (the longer something has been around, the longer it is likely to last); therefore, investing time in learning these established protocols tends to deliver the highest ROI for your professional career.

Format Level

The next level is all about giving shape to those bytes. In other words, if we were to make an analogy with human communication, protocols are about deciding whether to talk by phone or to exchange letters, while formats are about choosing the language used (e.g., English, Spanish).

There is a long tail of formats, some more established, some more recent (and immature). There is also a surprising amount of variability in what is (generally) considered acceptable, even for well-established formats (for example, JSON’s loose specification leaves plenty of room for interpretation).

Examples of formats:

  • Good old text in UTF-8 or other Unicode encodings (e.g., UTF-16, UTF-32).
  • Internet/Web formats like JSON or XML.
  • File exchange formats like CSV.
  • Modern formats for Big Data (Avro, Parquet) and RPC/serialization (Protocol Buffers, Thrift, MessagePack), etc.
  • Proprietary formats, like what technologies like SQL Server or PostgreSQL use to send information back and forth between client and server.

While some of these formats will get regularly paired with certain protocols (for example, HTTP/JSON, CSV/SFTP, Kafka/Avro, gRPC/Proto), in many cases they are completely decoupled and can be swapped. For example, it’s entirely possible to use XML with HTTP, upload Avro files to an SFTP server or write a CSV as a UTF-8 message to Kafka (if you are crazy enough to consider it).

Schema Level

The final level is where application-specific logic and business requirements become concrete. In other words, this is the level where your specific use case, business scenario, API specification, etc. becomes reality. No abstraction, no commoditization: you are doing all the work here.

At this level, you need to do things like:

  • Data modelling to create schemas and data structures representing your business entities and relationships.
  • Data transformations to map input schemas to output schemas, including rules for data enrichment, filtering, aggregation, or reformatting (see the sketch after this list).
  • Non-functional requirements like:
    • Identifying sensitive and non-sensitive data, ensuring compliance with security standards.
    • Clarifying how frequently data is refreshed (e.g., daily, hourly, near-real time).
    • Tracking where data originated and how it has been transformed or moved.
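
To make this level more tangible, here is a minimal sketch of a schema-level transformation in Java. All types, fields and lookups below are invented for illustration: it maps a hypothetical input schema to an output schema, enriching the record with a country name and filtering out zero-amount payments.

import java.util.Optional;

record InboundPayment(String id, String countryCode, double amount, String currency) {}
record OutboundPayment(String id, String countryName, double amount, String currency) {}

interface CountryLookup {
  // Enrichment source, e.g. a reference table or an external service
  Optional<String> nameFor(String isoCode);
}

class PaymentMapper {
  private final CountryLookup countries;

  PaymentMapper(CountryLookup countries) {
    this.countries = countries;
  }

  // Schema-level work: filtering (drop zero-amount payments) + enrichment (resolve country name)
  Optional<OutboundPayment> map(InboundPayment in) {
    if (in.amount() <= 0) {
      return Optional.empty();
    }
    String countryName = countries.nameFor(in.countryCode()).orElse("UNKNOWN");
    return Optional.of(new OutboundPayment(in.id(), countryName, in.amount(), in.currency()));
  }
}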

How is this model useful?

There are several reasons why this three-level model is particularly useful for Data Integrations:

Clear separation of concerns

Each level has distinct responsibilities, making complex integrations easier to understand and manage. Engineers can tackle each level independently, which enables parallel work: different people can focus on different levels at the same time without stepping on each other.

Improved reusability

Since each level is independent and components at any level can easily be swapped, it becomes simpler to reuse existing code, expertise, and infrastructure.

For example, if an existing integration uses HTTP with JSON, it requires minimal effort to replace the JSON serializer with an XML serializer, while continuing to leverage the existing protocol-level (HTTP) and schema-level implementations.
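
As a sketch of that swap (assuming Jackson as the serialization library, with the jackson-dataformat-xml module for XML; the Payment type is invented), only the format-level mapper changes, while the schema-level object and the HTTP plumbing stay untouched:

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.xml.XmlMapper;

public class FormatSwapExample {
  // Illustrative schema-level type
  public record Payment(String id, String currency, double amount) {}

  public static void main(String[] args) throws Exception {
    Payment payment = new Payment("pay-123", "GBP", 10.50);

    String asJson = new ObjectMapper().writeValueAsString(payment);  // existing format
    String asXml = new XmlMapper().writeValueAsString(payment);      // swapped-in format

    // Either string can travel over the same protocol-level channel (e.g., an HTTP POST body)
    System.out.println(asJson);
    System.out.println(asXml);
  }
}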

Targeted commoditization

The Protocol level, as defined, is the most mature and heavily abstracted. Similarly, the Format level is also mature in most scenarios. This maturity drives commoditization, enabling the use of standard, off-the-shelf technologies to handle conversions between protocols and formats with minimal custom code.

For instance, technologies like Azure Data Factory or Apache Airflow can convert seamlessly between SMTP and HTTP, or between XML and JSON, using no-code or low-code interfaces, provided schema-level details remain the same.

This commoditization accelerates Data Integration development and allows engineers to concentrate on schema-level transformations, where the real business logic resides.

Summary

In this post, I shared a model for thinking about Data Integrations that has served me well over the years. It may not be the most sophisticated, but its simplicity makes it practical: it helps you distinguish the parts that truly require your attention (like the schema level) from those you can reliably delegate to off-the-shelf technology (such as the protocol and format levels).

Naming Kafka objects (III) – Kafka Connectors

We discussed naming conventions for Kafka topics and Kafka Producers/Consumers in the previous two posts. This time, we are focusing on Kafka Connect and the connectors running on it.

We will not discuss naming conventions related to Kafka Connect clusters themselves (e.g., config/storage/offset topic names, group.id, etc.). Those are normally managed by SysAdmin/DevOps teams, and these posts zoom in on developer-related naming conventions.

Kafka Connect in a few words

What is Kafka Connect?

Kafka Connect is a tool for streaming data between Apache Kafka and external systems, such as databases, cloud services, or file systems. It simplifies data integration by providing a scalable and fault-tolerant framework for connecting Kafka with other data sources and sinks, without the need to write custom code.

In other words, Kafka doesn’t exist in a vacuum; there are different “non-Kafka” systems that it needs to interact with, i.e., consume from and produce to. Kafka Connect simplifies this task massively by offering “connector plugins” that will translate between:

  • Protocols: SFTP->Kafka, Kafka->SFTP, Kafka->HTTP, Salesforce API->Kafka, etc.
  • Formats: CSV->Avro, Avro->JSON, Avro->Parquet, etc.

Theoretically, Kafka Connect can also translate between schemas (i.e., data mapping) via Single Message Transformations. However, I advise against using them except for the most trivial transformations.

Kafka Connect defines two types of connectors:

  • Source connectors consume data from non-Kafka systems (e.g., databases via CDC, file systems, other message brokers) and produce it to Kafka topics. Internally, they work as “Kafka Producers” with a client connection to the source data system.
  • Sink connectors consume data from Kafka topics and produce it to non-Kafka systems (e.g., databases, file systems, APIs). Internally, they work as “Kafka Consumers” with a client connection to the destination system.
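
For illustration, here is a minimal sink connector definition, sketched as the key/value pairs you would submit to the Connect REST API, expressed as a Java map. FileStreamSinkConnector ships with Apache Kafka, while the connector, topic and file names are made up:

import java.util.Map;

public class SinkConnectorConfigSketch {
  // Minimal sink connector config: it consumes from a Kafka topic (as a "Kafka Consumer")
  // and writes to a file, translating protocol and format along the way.
  static final Map<String, String> CONFIG = Map.of(
      "name", "payments-to-file-sink",                                        // connector name (naming conventions below)
      "connector.class", "org.apache.kafka.connect.file.FileStreamSinkConnector",
      "tasks.max", "1",
      "topics", "payments.events",                                            // what the underlying consumer reads
      "file", "/tmp/payments.txt",                                            // where the sink writes
      "key.converter", "org.apache.kafka.connect.storage.StringConverter",    // format level
      "value.converter", "org.apache.kafka.connect.json.JsonConverter"        // format level
  );
}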

Naming Conventions

Now that we have a basic understanding of Kafka Connect, let’s examine the most relevant settings that require precise, meaningful naming conventions.

connector.name

This is obviously the “number one” setting to define. A few things to consider:

  • It has to be globally unique within the Connect cluster. In other words, no two connectors in the cluster can share the same name.
  • It is part of the path to access information about the connector config and/or status via the Connect REST API (unless you use the expand option and get them all at once).
  • For Sink connectors, it determines the default underlying consumer group: the connector name with a connect- prefix. In other words, if your connector is called my-connector, the underlying consumer group will be called connect-my-connector by default.

With that in mind, the proposed naming convention is as follows:

[environment].[domain].[subdomain(s)].[connector name]-[connector version]

  • environment: (Logical) environment that the connector is part of. For more details, see https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#environment
  • domain.subdomain(s): Leveraging DDD to organise your system “logically” based on business/domain components. Break it down into a domain and subdomains as explained in https://javierholguera.com/2024/08/20/naming-kafka-objects-i-topics/#domain-subdomain-s
  • connector name: A descriptive name for what the connector is meant to do.
  • connector version: As the connector evolves, we might need to run side-by-side versions of it or reset the connector completely by giving it a new version. Format: vXY (e.g., ‘v01’, ‘v14’). This component is not mandatory; you can skip it if this is the first deployment.
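
As a hedged example (every component below is invented), a connector name following this convention, and the default consumer group it would produce for a sink connector, could look like this:

public class ConnectorNameExample {
  // [environment].[domain].[subdomain(s)].[connector name]-[connector version]
  static final String CONNECTOR_NAME = "prod.payments.cards.salesforce-accounts-v01";

  // If this were a sink connector, its default underlying consumer group would be:
  static final String DEFAULT_CONSUMER_GROUP = "connect-" + CONNECTOR_NAME;
  // i.e., "connect-prod.payments.cards.salesforce-accounts-v01"
}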

Do we really need [environment]?

This is a legit question. We said that the connector name must be (globally) unique in a given Kafka Connect cluster where it is deployed. A Kafka Connect cluster can only be deployed against a single Kafka cluster. Therefore, it can only sit in a single (physical) environment. If that is the case, isn’t the “environment” implicit?

Not necessarily:

  1. Your Kafka cluster might be serving multiple logical environments (DEV1, DEV2, etc.). As a result, a single Kafka Connect cluster might be sitting across multiple logical environments even if it belongs to a single physical environment. In this deployment topology, you might have the same connector in multiple logical environments, which would require the [environment] component to disambiguate and guarantee uniqueness.
  2. Alternatively, you might deploy multiple Kafka Connect clusters, each serving a single logical environment, against a single (physical) Kafka cluster. You might be tempted to think that in this scenario [environment] is not needed, since the connector name will be unique within its cluster. However, “behind the scenes”, sink connectors create a consumer group whose name matches the connector name (plus a connect- prefix). Therefore, if multiple Connect clusters run connectors with the same name, they will create the same consumer group, and all sorts of “issues” will arise: in practice, they end up forming one big consumer group targeting the topics across all logical environments in that physical Kafka cluster.

In summary, if you don’t use any concept of “logical environment(s)” and can guarantee that a given connector will be globally unique in the Kafka cluster, you don’t need the [environment] component.

consumer.override.group.id

Starting with Kafka 2.3.0, client configuration overrides can be set individually per connector by using the prefixes producer.override. and consumer.override. for source and sink connectors, respectively. These overrides are included with the rest of the connector’s configuration properties.

Generally, I don’t recommend playing with consumer.override.group.id. Instead, it is better to give an appropriate name to your connector (via connector.name), as per the previous section.

However, there might be scenarios where you can’t or don’t want to change your connector.name yet you still need to alter your default sink connector’s consumer group. Some examples:

  • You have already deployed your connectors without [environment] (or other components) in your connector.name and now you want to retrofit it into your consumer group names.
  • You have strict consumer group or connector.name naming conventions that aren’t compatible with each other.
  • You want to “rewind” your consumer group but, for whatever reason, don’t want to change the connector.name.

In terms of a naming convention, I would recommend the simplest option possible:

[environment or any-other-component].[connector.name]

In other words, I believe your consumer group name should track your connector.name as closely as possible to avoid misunderstandings.
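
As a sketch (all names invented, and note that the Connect worker must allow overrides via connector.client.config.override.policy), the override sits alongside the rest of the connector configuration and simply prepends the missing component to the connector name:

import java.util.Map;

public class GroupIdOverrideSketch {
  // Retrofitting an [environment] component into the consumer group of an
  // already-deployed sink connector, without renaming the connector itself.
  static final Map<String, String> CONFIG = Map.of(
      "name", "payments.cards.salesforce-accounts-v01",  // deployed without [environment]
      // ... rest of the connector configuration ...
      "consumer.override.group.id", "dev1.payments.cards.salesforce-accounts-v01"
  );
}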

consumer/producer.override.client.id

client.id was discussed in a previous post about Producer’s client.id and Consumer’s client.id.

As discussed in that post, it is responsible for a few things:

  • It shows up in logs, making it easier to correlate them with specific producer/consumer instances in an application that has many of them (like a Kafka Streams app or a Kafka Connect cluster).
  • It shows up in the namespace/path for JMX metrics coming from producers and consumers.

With that in mind, knowing that we already have a pretty solid, meaningful and (globally) unique connector.name convention, this is how we can name our producer/consumer client.id values.

  • Source connector: set producer.override.client.id to {connector.name}-producer
  • Sink connector: set consumer.override.client.id to {connector.name}-consumer
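
A sketch of how those overrides would look in practice, reusing the (invented) connector name from earlier:

import java.util.Map;

public class ClientIdOverrideSketch {
  // Source connector: name the underlying producer after the connector
  static final Map<String, String> SOURCE_OVERRIDE = Map.of(
      "producer.override.client.id", "prod.payments.cards.salesforce-accounts-v01-producer");

  // Sink connector: name the underlying consumer after the connector
  static final Map<String, String> SINK_OVERRIDE = Map.of(
      "consumer.override.client.id", "prod.payments.cards.salesforce-accounts-v01-consumer");
}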

Conclusion

We have discussed the most relevant properties that require naming conventions in Kafka Connect connectors. As usual, we aim for semantically meaningful values that we can use to “reconcile” what’s running in our systems with what every team (and developer) owns and maintains.

By now, we can see a consistent naming approach emerging, rooted in environments, DDD naming conventions and some level of versioning (when required).

Data modelling recommendations for Apache Avro

I am a big fan of Apache Avro, said nobody ever…

Actually, in the context of Kafka/Streaming/Messaging, I like Apache Avro quite a bit, especially when paired with a Schema Registry that “protects” your topics/streams from “pollution”. Other alternatives are either too complex (JSON Schema) or designed for ephemeral data (Protobuf).

That said, Avro tends to confuse people with a data modelling background in SQL/Relational databases. In this post, I want to share some general modelling recommendations that will make your schemas more robust, semantically meaningful and easier to maintain.

Avoid nullable fields

This is a classic recommendation that I extend to EVERYTHING:

  • Table fields
  • Objects
  • API schemas

After all, null is the root of all evil, right?

Reducing the number of nullable fields in your schemas is also not complicated. Let’s take the following example:

record MyRecord {
  // NO!!
  union { null, string } country;

  // BETTER!
  string country;

  // PERFECT!
  string country = "United Kingdom";
}


In many cases, fields can be made “not null” when using a reasonable default (e.g. country can default to UK, currency can default to GBP).

While Java represents these fields as nullable by default, when using the Builder pattern of the code-generated POJOs, calling build() fails if a non-nullable field hasn’t been given a value (and there is no default), guaranteeing that incorrect data isn’t instantiated (and published).
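
As a sketch, assuming MyRecord is the Avro code-generated POJO for the schema above (the field name matches the earlier example):

public class BuilderSketch {
  // Assumes MyRecord is the Avro code-generated class for the schema above
  static MyRecord buildRecord() {
    return MyRecord.newBuilder()
        .setCountry("Spain")  // omit this call (with no default defined) and build() throws
        .build();             // validation happens here, before anything can be published
  }
}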

Assign defaults whenever possible

This recommendation complements the previous one regarding nulls.

While not all fields will have natural candidates for a default (e.g., name), you should strive to define one whenever possible.

This is particularly important if new fields are added to an existing schema. The only way to maintain backward compatibility (i.e., reading old data with the new schema) is to give the new field a default. Otherwise, the new schema won’t be able to read old data because it will be missing the new field.
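
For example, using Avro’s Java SchemaBuilder (the record and field names are illustrative), the new field gets an explicit default so that the new schema can still read records written before the field existed:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class EvolvedSchemaSketch {
  // v2 of an illustrative schema: "currency" is new and gets a default,
  // so data written with v1 (without the field) can still be read with v2.
  static final Schema PAYMENT_V2 = SchemaBuilder.record("Payment")
      .fields()
      .requiredString("country")
      .name("currency").type().stringType().stringDefault("GBP")
      .endRecord();
}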

Use Logical Types

nulls are poor data modelling; they don’t define a type. Instead, they represent an absence of type altogether.

Programs are safer when we can constrain what is representable via restrictive types. If we apply this logic, embracing Avro’s logical types is the next sensible step.

record MyRecord {
  // NO!!
  float amount;

  // YES!
  decimal(18,8) amount;
}

Logical Types make schemas more semantically meaningful and reduce the possibility of error in downstream consumers. Using Logical Types, we guarantee that producers and consumers can restrict the data range to the most specific cases.

This recommendation includes using UTC-referencing Logical Types like date or timestamp-millis, which capture more semantic information than a plain number type.
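
On the Java side, a quick sketch of how a timestamp-millis logical type can be attached to the underlying long using Avro’s LogicalTypes helper:

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class TimestampFieldSketch {
  // A long on the wire, but explicitly tagged as a UTC timestamp in milliseconds
  static final Schema TIMESTAMP_MILLIS =
      LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));
}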

Be aware of union quirks

Avro unions are an implementation of tagged/discriminated unions: they represent a data structure that can “hold a value that could take on several different, but fixed, types”. They are used to represent nullability for Avro fields:

union { null, actualType } yourField;

They can also represent a field that might contain different data types.

record MyRecord {
  union {
    PaymentRequested,
    PaymentAccepted,
    PaymentReceived,
    PaymentReconciled
  } event;
}

Unions are excellent as part of rich schemas that capture all data types a field might contain, aiming to constrain the space of representable data/state.

However, unions have two counterpoints that should be considered:

  • Evolving the list of allowed types follows particular rules to maintain backward compatibility (see below).
  • Java doesn’t support them natively. Instead, union fields are represented as Object in the generated Java POJOs and require the use of instanceof to route logic based on the actual data type.

While using them in Java is not very ergonomic, unions are still the correct data type to represent multi-datatype fields and Java limitations shouldn’t stop us from leveraging them when appropriate.
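
A sketch of that routing, assuming the Payment* classes are Avro-generated, the record exposes the union field as an Object via getEvent(), and the handle*() methods are illustrative:

void route(MyRecord record) {
  Object event = record.getEvent();
  if (event instanceof PaymentRequested) {
    handleRequested((PaymentRequested) event);
  } else if (event instanceof PaymentAccepted) {
    handleAccepted((PaymentAccepted) event);
  } else if (event instanceof PaymentReceived) {
    handleReceived((PaymentReceived) event);
  } else if (event instanceof PaymentReconciled) {
    handleReconciled((PaymentReconciled) event);
  } else {
    throw new IllegalStateException("Unexpected event type: " + event.getClass());
  }
}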

Be careful when evolving Enums and Unions

Enums are a poor man’s unions: they represent multiple data types stripped of all their content except a “name” (the enum values). In other words, they are just tags, while Unions can define specific fields for each type.

This approach is common in languages where creating types (i.e., classes) is costly. Developers will optimise for speed and use enums to create a class hierarchy quickly, sacrificing type safety in the process.

In both cases, they require some care when evolving the list of allowed values to maintain backward compatibility:

  • Reader schemas (and consumers) must be changed first to incorporate new allowed values. If we changed producers (and their schemas) first instead, we would risk producing values that downstream consumers don’t understand, breaking them.
  • Only add new values at the end of the list, never in the middle or at the beginning. While this sounds like weird advice, some languages treat enums differently from Java. For example, C# is “order aware”: each enum value is assigned a number based on the order in which it is defined in the corresponding C# file, and that number is used during (de)serialization. Changing the order of existing values breaks this mapping and makes consumers fail; adding new elements at the end avoids the problem.
  • Never remove allowed elements from the list for the exact same reason explained above, but also because doing so would prevent consumers using the new schema from reading old data (that was using the removed elements).