Data modelling recommendations for Apache Avro

I am a big fan of Apache Avro, said nobody ever…

Actually, in the context of Kafka/Streaming/Messaging, I like Apache Avro quite a bit, especially when paired with a Schema Registry that “protects” your topics/streams from “pollution”. Other alternatives are either too complex (JSON Schema) or designed for ephemeral data (Protobuf).

That said, Avro tends to confuse people with a data modelling background in SQL/Relational databases. In this post, I want to share some general modelling recommendations that will make your schemas more robust, semantically meaningful and easier to maintain.

Avoid nullable fields

This is a classic recommendation that I extend to EVERYTHING:

  • Table fields
  • Objects
  • API schemas

After all, null is the root of all evil, right?

Reducing the number of nullable fields in your schemas is also not complicated. Let’s take the following example:

record MyRecord {
  // NO!!
  union { null, string } country;

  // BETTER!
  string country;

  // PERFECT!
  string country = "United Kingdom";
}


In many cases, fields can be made “not null” when using a reasonable default (e.g. country can default to UK, currency can default to GBP).

While Java represents these fields as nullable references, the Builder implementation in the code-generated POJOs makes build() fail when a non-nullable field has not been set (and has no default), guaranteeing that incorrect data isn't instantiated (or published).
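
As a minimal sketch (assuming a code-generated class MyRecord with a non-nullable country field and no default, i.e. the "BETTER!" variant above), forgetting to set the field makes the builder throw instead of silently producing a null:

// Hypothetical code-generated POJO from the schema above.
MyRecord record = MyRecord.newBuilder()
    // .setCountry("United Kingdom")  // forgetting the non-nullable field...
    .build();                         // ...makes build() throw at runtime, so the
                                      // invalid record is never created or published.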

Assign defaults whenever possible

This recommendation complements the previous one regarding nulls.

While not all fields will have natural candidates for a default (e.g., name), you should strive to define one whenever possible.

This is particularly important if new fields are added to an existing schema. The only way to maintain backward compatibility (i.e., reading old data with the new schema) is to give the new field a default. Otherwise, the new schema won’t be able to read old data because it will be missing the new field.
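
As a quick illustration, here is a sketch using Avro's SchemaCompatibility helper (the User schema and its new email field are made up for this example): old data stays readable only because the added field declares a default.

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class DefaultsAndCompatibility {
  public static void main(String[] args) {
    // v1: the schema old data was written with.
    Schema v1 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"}]}");

    // v2: adds an "email" field WITH a default, so it can still read v1 data.
    Schema v2 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"email\",\"type\":\"string\",\"default\":\"\"}]}");

    // Reading old (v1) data with the new (v2) schema.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(v2, v1).getType()); // COMPATIBLE

    // Without the default on "email", the same check reports INCOMPATIBLE.
  }
}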

Use Logical Types

nulls are poor data modelling; they don’t define a type. Instead, they represent an absence of type altogether.

Programs are safer when we can constrain what is representable via restrictive types. If we apply this logic, embracing Avro’s logical types is the next sensible step.

record MyRecord {
  // NO!!
  float amount;

  // YES!
  decimal(18,8) amount;
}

Logical Types make schemas more semantically meaningful and reduce the possibility of error in downstream consumers. Using Logical Types, we guarantee that producers and consumers can restrict the data range to the most specific cases.

This recommendation includes using UTC-referencing Logical Types like date or timestamp-millis, which capture more semantic information than a plain numeric type.
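
As a small sketch of how this plays out on the Java side, here is the decimal logical type round-tripping a value with Avro's built-in DecimalConversion (the schema is built inline just for illustration):

import java.math.BigDecimal;
import java.nio.ByteBuffer;
import org.apache.avro.Conversions;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class DecimalExample {
  public static void main(String[] args) {
    // A bytes field annotated with the decimal(18,8) logical type.
    Schema amountSchema = LogicalTypes.decimal(18, 8)
        .addToSchema(Schema.create(Schema.Type.BYTES));

    // The conversion validates values against the declared precision/scale.
    Conversions.DecimalConversion conversion = new Conversions.DecimalConversion();
    BigDecimal amount = new BigDecimal("42.12345678");

    ByteBuffer encoded = conversion.toBytes(amount, amountSchema, amountSchema.getLogicalType());
    BigDecimal decoded = conversion.fromBytes(encoded, amountSchema, amountSchema.getLogicalType());
    System.out.println(decoded); // 42.12345678
  }
}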

Be aware of union quirks

Avro unions are an implementation of tagged/discriminated unions: they represent a data structure that can “hold a value that could take on several different, but fixed, types”. They are used to represent nullability for Avro fields:

union { null, actualType } yourField;

They can also represent a field that might contain different data types.

record MyRecord {
  union {
    PaymentRequested,
    PaymentAccepted,
    PaymentReceived,
    PaymentReconciled
  } event;
}

Unions are excellent as part of rich schemas that capture all data types a field might contain, aiming to constrain the space of representable data/state.

However, unions come with two caveats that should be considered:

  • Evolving the list of allowed types follows particular rules to maintain backward compatibility (see below).
  • Java doesn’t support them naturally. Instead, union fields are represented as Objects in the Java POJOs and require the use of instanceof to route logic based on the actual data type.

While using them in Java is not very ergonomic, unions are still the correct data type to represent multi-datatype fields and Java limitations shouldn’t stop us from leveraging them when appropriate.
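
In practice, the routing code ends up looking something like this (a sketch assuming code-generated POJOs named after the payment events above; the handle* methods are hypothetical):

// The union field surfaces as Object in the generated POJO, so consumers
// fall back to instanceof checks to dispatch on the concrete event type.
void route(Object event) {
  if (event instanceof PaymentRequested) {
    handleRequested((PaymentRequested) event);
  } else if (event instanceof PaymentAccepted) {
    handleAccepted((PaymentAccepted) event);
  } else if (event instanceof PaymentReceived) {
    handleReceived((PaymentReceived) event);
  } else if (event instanceof PaymentReconciled) {
    handleReconciled((PaymentReconciled) event);
  } else {
    throw new IllegalArgumentException("Unknown event type: " + event.getClass());
  }
}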

Be careful when evolving Enums and Unions

Enums are poor man’s unions: they represent multiple data types stripped of all their content except a “name” (the enum values). In other words, they are just tags, while Unions can define specific fields for each type.

This approach is common in languages where creating types (i.e., classes) is costly. Developers optimise for speed and use enums to create a class hierarchy quickly, sacrificing type safety in the process.

In both cases, they require some care when evolving the list of allowed values to maintain backward compatibility:

  • Reader schemas (and consumers) must be changed first to incorporate new allowed values. If we changed producers (and their schemas) first instead, we would risk producing values that downstream consumers don't understand, breaking them.
  • Only add new values at the end of the list, never in the middle or at the beginning. While this sounds like weird advice, some languages treat enums differently from Java. For example, C# is “order aware”: each enum value is assigned a number based on the order in which it is defined in the corresponding C# file, and that number is used during (de)serialization. Changing the order of existing values breaks this mapping and makes consumers fail. The safe option is to add new elements at the end.
  • Never remove allowed elements from the list for the exact same reason explained above, but also because doing so would prevent consumers using the new schema from reading old data (that was using the removed elements).
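
A small sketch of the first rule, again using Avro's SchemaCompatibility helper (the Status enum is made up for this example): appending a symbol keeps an upgraded reader compatible with old data, while upgrading the producer first leaves old readers unable to cope with the new symbol.

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class EnumEvolution {
  public static void main(String[] args) {
    Schema v1 = new Schema.Parser().parse(
        "{\"type\":\"enum\",\"name\":\"Status\","
        + "\"symbols\":[\"REQUESTED\",\"ACCEPTED\"]}");

    // New symbol appended at the end of the list.
    Schema v2 = new Schema.Parser().parse(
        "{\"type\":\"enum\",\"name\":\"Status\","
        + "\"symbols\":[\"REQUESTED\",\"ACCEPTED\",\"RECEIVED\"]}");

    // Reader upgraded first: the new schema reads everything written with the old one.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(v2, v1).getType()); // COMPATIBLE

    // Producer upgraded first: old readers may hit an unknown symbol.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(v1, v2).getType()); // INCOMPATIBLE
  }
}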

Java type hinting in avro-maven-plugin

Recently, somebody shared with me the following problem: an Avro schema in Schema Registry has magically evolved into a slightly different version, albeit still backward compatible.

Schemas changing…

The first version of the schema looked like this:

{
    "type":"record",
    "name":"Key",
    "namespace":"my-namespace",
    "fields" [
        {
            "name":"UUID",
            "type":"string"
        }
    ],
    "connect.name":"my-topic.Key"
}

After it changed, the schema looked like this:

{
    "type":"record",
    "name":"Key",
    "namespace":"my-namespace",
    "fields" [
        {
            "name":"UUID",
            "type":{
                "type":"string",
                "avro.java.string":"String"
            }
        }
    ],
    "connect.name":"my-topic.Key"
}

To add to the confusion, this is a topic published by Kafka Connect using the MySQL Debezium plugin. Neither the database schema nor the Connect or Debezium versions had changed anywhere close to when the schema evolved.

How could this have happened?

The mystery guest…

Although nothing had changed in the stack that was polling record changes from the MySQL database and sending them to Kafka… there was a new element to consider.

After some conversations, it was apparent that there was a new application publishing records to the same topic, for testing. This application was:

  1. Downloading the schema from Schema Registry.
  2. Doing code-generation using avro-maven-plugin against the downloaded .avsc files from Schema Registry.
  3. Producing some records using the newly created Java POJO classes.

Those seem like the right steps. However, looking into the options of avro-maven-plugin, one stood out:

  /**  The Java type to use for Avro strings.  May be one of CharSequence,
   * String or Utf8.  CharSequence by default.
   *
   * @parameter property="stringType"
   */
  protected String stringType = "CharSequence";

Could it be the culprit?

stringType does more than you expect

While the description of the property suggests something as naive as instructing the Avro code generator what class to use for Avro strings… it does more than just that.

Comparing the code for POJOs generated with avro-maven-plugin, two things are different. Firstly, fields like the UUID in the schema above change their type from java.lang.CharSequence to java.lang.String; this is expected.

However, it also changes the internal Avro schema that every Java POJO carries in:

public static final org.apache.avro.Schema SCHEMA$;

Upon changing stringType to String, the resulting schema in SCHEMA$ contains the extended type definition that we saw at the beginning. This matters because the embedded schema is what gets sent to Schema Registry when producing records (only once; from then on the client uses the returned schema id).

Since Schema Registry does not reduce schemas to a canonical representation before comparing them, it takes the schema as is: even though both schemas are semantically identical, it registers a new version.
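
A quick way to see what will actually be registered, before any record is produced, is to print the schema embedded in the generated class (a sketch, assuming the code-generated POJO is called Key as in the schema above):

// Print the schema the serializer will send to Schema Registry.
System.out.println(Key.getClassSchema().toString(true));
// With stringType=String, the "UUID" field is rendered as:
//   {"type": "string", "avro.java.string": "String"}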

A solution?

Can we not use stringType = String? Yes, but then all POJOs are generated using CharSequence. In my opinion, that is the best option for mixed environments. After all, this extra hint in the schema only makes sense for Java consumers.

However, if you control the topic end to end (i.e., both producers and consumers), you might as well use stringType = String by default and guarantee that every client uses String instead of CharSequence.

In any case, both schemas are backward compatible with each other. A correct Avro library should result in the same schema representation in whatever language you have chosen to use.

Incompatible Avro schema in Schema Registry

My company uses Apache Kafka as the spine for its next-generation architecture. Kafka is a distributed append-only log that can be used as a pub-sub mechanism. We use Kafka to publish events once business processes have completed successfully, allowing a high degree of decoupling between producers and consumers.

These events are encoded using Avro schemas. Avro is a binary serialization format that enables a much more compact representation of data than, for instance, JSON. Given the high volume of events we publish to Kafka, using a compact format is critical.

In combination with Avro we use Confluent’s Schema Registry to manage our schemas. The registry provides a RESTful API to store and retrieve schemas.

Compatibility modes

The Schema Registry can control what schemas get registered, ensuring a certain level of compatibility between existing and new schemas. This compatibility can be set to one of the following four modes:

  • BACKWARD: a new schema is allowed if it can be used to read all data ever published into the corresponding topic.
  • FORWARD: a new schema is allowed if it can be used to write data that all previous schemas would be able to read.
  • FULL: a new schema is allowed only if it satisfies both the BACKWARD and FORWARD requirements.
  • NONE: a schema is allowed as long as it is valid Avro.

By default, Schema Registry sets BACKWARD compatibility, which is most likely your preferred option in a PROD environment, unless you want to have a hard time with consumers that don't quite understand events published with a newer, incompatible version of the schema.

Incompatible schemas

During the development phase, however, it is perfectly fine to replace schemas with incompatible ones. Schema Registry will prevent updating an existing schema to an incompatible newer version unless we change its default setting.

Fortunately, Schema Registry offers a complete API that allows us not only to register and retrieve schemas, but also to change some of its configuration. More specifically, it offers a /config endpoint to PUT new values for the compatibility setting.

The following command would change the compatibility setting to NONE for all schemas in the Registry:

curl -X PUT http://your-schema-registry-address/config \
     -d '{"compatibility": "NONE"}' \
     -H "Content-Type: application/json"

This way, the next registration would be allowed by the Registry as long as the newer schema is valid Avro. The configuration can also be set for a specific subject by simply appending its name (i.e., /config/subject-name).

Once the incompatible schema has been registered, the setting should be set back to a more cautious value.
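
If you prefer to script this instead of using curl, the same dance can be done with Confluent's Java client (a sketch, assuming the kafka-schema-registry-client dependency is available; the subject name is hypothetical):

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

public class RelaxCompatibility {
  public static void main(String[] args) throws Exception {
    SchemaRegistryClient client =
        new CachedSchemaRegistryClient("http://your-schema-registry-address", 100);

    // Relax compatibility for a single subject (hypothetical name)...
    client.updateCompatibility("my-topic-value", "NONE");

    // ...register the incompatible schema here...

    // ...and restore a safer setting afterwards.
    client.updateCompatibility("my-topic-value", "BACKWARD");
  }
}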

Summary

The combination of Kafka, Avro and Schema Registry is a great way to store your events in the most compact way possible, while still retaining the ability to evolve the corresponding schemas.

However, some of the limitations that Schema Registry imposes make less sense in a development environment. On some occasions, making incompatible changes in a simple way is necessary and recommendable.

The Schema Registry API allows changing the compatibility setting to accept schemas that, otherwise, would be rejected.