Data modelling recommendations for Apache Avro

I am a big fan of Apache Avro, said nobody ever…

Actually, in the context of Kafka/Streaming/Messaging, I like Apache Avro quite a bit, especially when paired with a Schema Registry that “protects” your topics/streams from “pollution”. Other alternatives are either too complex (JSON Schema) or designed for ephemeral data (Protobuf).

That said, Avro tends to confuse people with a data modelling background in SQL/Relational databases. In this post, I want to share some general modelling recommendations that will make your schemas more robust, semantically meaningful and easier to maintain.

Avoid nullable fields

This is a classic recommendation that I extend to EVERYTHING:

  • Table fields
  • Objects
  • API schemas

After all, null is the root of all evil, right?

Reducing the number of nullable fields in your schemas is also not complicated. Let’s take the following example:

record MyRecord {
  // NO!!
  union { null, string } country;

  // BETTER!
  string country;

  // PERFECT!
  string country = "United Kingdom";
}


In many cases, fields can be made “not null” when using a reasonable default (e.g. country can default to UK, currency can default to GBP).

While Java represents these fields as nullable references by default, the Builder pattern in the code-generated POJOs helps: calling build() fails if a non-nullable field has not been given a value (and there is no default), guaranteeing that incorrect data isn't instantiated (or published).
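
For example, a minimal sketch, assuming a MyRecord class generated from the schema above where country is a non-nullable string with no default:

MyRecord ok = MyRecord.newBuilder()
    .setCountry("United Kingdom")
    .build();   // succeeds: every non-nullable field has a value

MyRecord broken = MyRecord.newBuilder()
    .build();   // throws AvroRuntimeException: country is not set and has no default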

Assign defaults whenever possible

This recommendation complements the previous one regarding nulls.

While not all fields will have natural candidates for a default (e.g., name), you should strive to define one whenever possible.

This is particularly important when new fields are added to an existing schema. The only way to maintain backward compatibility (i.e., reading old data with the new schema) is to give the new field a default. Otherwise, the new schema won't be able to read old records, because they won't contain the new field.
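
A small sketch using Avro's generic Java API illustrates this; the inlined schema strings and the DefaultsDemo class name are just for illustration:

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class DefaultsDemo {
  public static void main(String[] args) throws Exception {
    // Old schema: only `country`
    Schema oldSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"MyRecord\",\"fields\":["
            + "{\"name\":\"country\",\"type\":\"string\"}]}");

    // New schema: adds `currency` WITH a default, so it can still read old data
    Schema newSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"MyRecord\",\"fields\":["
            + "{\"name\":\"country\",\"type\":\"string\"},"
            + "{\"name\":\"currency\",\"type\":\"string\",\"default\":\"GBP\"}]}");

    // Serialise a record with the old schema
    GenericRecord oldRecord = new GenericData.Record(oldSchema);
    oldRecord.put("country", "United Kingdom");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(oldSchema).write(oldRecord, encoder);
    encoder.flush();

    // Read it back with the new schema: the missing field is filled from the default
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord decoded =
        new GenericDatumReader<GenericRecord>(oldSchema, newSchema).read(null, decoder);
    System.out.println(decoded.get("currency")); // GBP
  }
}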

Use Logical Types

Nulls are poor data modelling; they don't define a type. Instead, they represent the absence of a type altogether.

Programs are safer when we can constrain what is representable via restrictive types. If we apply this logic, embracing Avro’s logical types is the next sensible step.

record MyRecord {
  // NO!!
  float amount;

  // YES!
  decimal(18,8) amount;
}

Logical Types make schemas more semantically meaningful and reduce the possibility of error in downstream consumers. By using them, producers and consumers agree on the most specific representation of each value (e.g., an exact decimal instead of a lossy float).

This recommendation includes the UTC-referencing Logical Types like date or timestamp-millis, which capture more semantic information than a plain numeric type.
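
The same can be expressed programmatically in Java; a small sketch using Avro's SchemaBuilder (the Payment record and occurredAt field are illustrative):

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

// The wire format is still a long, but the logical type tells every consumer
// that the value is a UTC instant in milliseconds, not an arbitrary number.
Schema timestampMillis = LogicalTypes.timestampMillis()
    .addToSchema(Schema.create(Schema.Type.LONG));

Schema payment = SchemaBuilder.record("Payment")
    .fields()
    .name("occurredAt").type(timestampMillis).noDefault()
    .endRecord();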

Be aware of union quirks

Avro unions are an implementation of tagged/discriminated unions: they represent a data structure that can “hold a value that could take on several different, but fixed, types”. They are used to represent nullability for Avro fields:

union { null, actualType } yourField;

They can also represent a field that might contain different data types.

record MyRecord {
  union {
    PaymentRequested,
    PaymentAccepted,
    PaymentReceived,
    PaymentReconciled
  } event;
}

Unions are excellent as part of rich schemas that capture all data types a field might contain, aiming to constrain the space of representable data/state.

However, unions come with two caveats that should be considered:

  • Evolving the list of allowed types follows particular rules to maintain backward compatibility (see below).
  • Java doesn’t support them natively. Instead, union fields are represented as Objects in the generated POJOs and require instanceof checks to route logic based on the actual data type.

While using them in Java is not very ergonomic, unions are still the correct data type to represent multi-type fields, and Java’s limitations shouldn’t stop us from leveraging them when appropriate, as in the sketch below.
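
A minimal sketch, assuming the generated MyRecord above exposes the union field via a getEvent() accessor returning Object, with hypothetical handler methods for each branch (Java 16+ pattern matching for instanceof):

// Route on the concrete type carried by the union field. Assumes
// code-generated classes for each branch (PaymentRequested, etc.).
Object event = myRecord.getEvent();

if (event instanceof PaymentRequested requested) {
    handleRequested(requested);
} else if (event instanceof PaymentAccepted accepted) {
    handleAccepted(accepted);
} else if (event instanceof PaymentReceived received) {
    handleReceived(received);
} else if (event instanceof PaymentReconciled reconciled) {
    handleReconciled(reconciled);
} else {
    throw new IllegalStateException("Unknown union branch: " + event.getClass());
}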

Be careful when evolving Enums and Unions

Enums are a poor man’s union: they represent multiple data types stripped of all their content except a “name” (the enum values). In other words, they are just tags, while unions can define specific fields for each type.

This approach is common in languages where creating types (i.e., classes) is costly. Developers optimise for speed and use enums to stand in for a class hierarchy, sacrificing type safety in the process.

In both cases, evolving the list of allowed values requires some care to maintain backward compatibility:

  • Reader schemas (and consumers) must be changed first to incorporate new allowed values. If we changed producers (and their schemas) first instead, we would risk producing values that downstream consumers don’t understand, breaking them (see the sketch after this list).
  • Only add new values at the end of the list, never in the middle or at the beginning. While this sounds like odd advice, some languages treat enums differently from Java. For example, C# is “order aware”: a number is assigned to each value based on the order in which it is defined in the corresponding C# file, and that number is used during (de)serialization. Reordering the values breaks this mapping and makes consumers fail, so new elements go at the end.
  • Never remove allowed elements from the list, for the same reason explained above, but also because doing so would prevent consumers using the new schema from reading old data (which may contain the removed elements).
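
To see why the order of changes matters, here is a small sketch using Avro’s built-in SchemaCompatibility checker (the PaymentStatus enum is illustrative):

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

// An old reader that only knows two symbols vs. a new writer that appends a third.
Schema oldReader = new Schema.Parser().parse(
    "{\"type\":\"enum\",\"name\":\"PaymentStatus\","
        + "\"symbols\":[\"REQUESTED\",\"ACCEPTED\"]}");
Schema newWriter = new Schema.Parser().parse(
    "{\"type\":\"enum\",\"name\":\"PaymentStatus\","
        + "\"symbols\":[\"REQUESTED\",\"ACCEPTED\",\"RECEIVED\"]}");

// The old reader cannot resolve the new symbol, which is why consumers (and
// their reader schemas) must be upgraded before producers start writing it.
SchemaCompatibility.SchemaPairCompatibility result =
    SchemaCompatibility.checkReaderWriterCompatibility(oldReader, newWriter);
System.out.println(result.getType()); // INCOMPATIBLE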
