16 Mar

Data Mesh 101: How to Move Beyond ETL and Create a Modern, Real-Time Data Engineering Paradise

Lorenzo Nicora

Many enterprises bite off more than they can chew when it comes to data and, as a result, don't get the results they were looking for.

They might buy the most bedazzling artificial intelligence and machine learning (AI/ML) tools but they get nowhere because they haven’t nailed the data engineering.

Their data house is built on sand.

The foundation of your data house has got to be an absolutely rock-solid process for getting data in, making it suitable for your needs and then making it available to the business (for new products, ML models and so on).

In this blog, I’ll explore why traditional data engineering approaches aren’t capable of building a strong enough foundation and how the data mesh can help you turn your data frown upside down.

The Sandy Foundation of Traditional Data Engineering

The heart of the traditional paradigm is extract, transform and load (ETL): a data integration process that combines data from multiple sources, cleans and organises it and then makes it available in a single, monolithic data repository.

Over time, however, the limitations of the approach have become more and more obvious.

The core flaw lies in how responsibility for data is split.

Let’s explain this with an example. Say you have an application team building a payments platform. They will have their own local database that is managing those transactions.

But in order to make that data available to the wider business, a separate data team has to come along, build some form of integration (this is the ETL), yoink that data from under the app team’s noses and chuck it in a central data platform.

The data team is extracting from someone else’s database: this is the original data sin!

The problem here is that the data team is external to that whole business domain (payments, in this case), so they don’t intimately understand the data or its context. In this way, many of the subtleties and nuances of the data are lost. Under the hood of a payments platform, for example, is a lot of complexity. A single payment can go through several stages, including interacting with a third party or being retried if it fails. These details will not make sense to the external data team, who are liable to generate bad insights if they build a report without the proper understanding.
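To make the payments point concrete, here is a minimal sketch. The event rows, status names and function names are all hypothetical, invented purely for illustration. It shows how a report written without domain knowledge can double count a payment that was retried, while a domain-aware version only counts each payment once, in its final state.

```python
# Hypothetical rows as they might sit in the app team's payments table.
# A single payment produces several rows as it moves through its lifecycle.
events = [
    {"payment_id": "p1", "status": "INITIATED", "amount": 100},
    {"payment_id": "p1", "status": "FAILED", "amount": 100},
    {"payment_id": "p1", "status": "RETRIED", "amount": 100},
    {"payment_id": "p1", "status": "SETTLED", "amount": 100},
    {"payment_id": "p2", "status": "INITIATED", "amount": 40},
    {"payment_id": "p2", "status": "SETTLED", "amount": 40},
]


def naive_total(rows):
    """What an outsider might write: treat every row as a separate payment."""
    return sum(row["amount"] for row in rows)


def settled_total(rows):
    """Domain-aware version: keep only the last row seen for each payment
    and count it only if that final state is SETTLED."""
    latest = {}
    for row in rows:  # rows are assumed to be in chronological order
        latest[row["payment_id"]] = row
    return sum(r["amount"] for r in latest.values() if r["status"] == "SETTLED")


naive = naive_total(events)
settled = settled_total(events)
```

Same table, two very different numbers. The only difference is knowing what the rows actually mean.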

What’s more, this split in responsibility means that the data team is effectively ‘stealing’ the data from the app team. They sneakily build their own pipeline and ETL process, which the app team might not even know exists!

As a result, the data team is totally exposed to whatever changes occur in that database, which they can neither control nor predict. The app team can make a change whenever they like, without realising that it will break the data team’s pipeline and completely mess up all the downstream systems.
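Here is a small sketch of that failure mode using Python's built-in sqlite3 module. The table, the column names and the migration are assumptions made up for this example. The data team's extract query is hard-coded against the app team's schema, so a perfectly reasonable change on the app side breaks it.

```python
import sqlite3


def extract_payments(conn):
    """The data team's extract step, hard-coded against the app team's columns."""
    return conn.execute("SELECT id, amount FROM payments").fetchall()


# The app team's local database (in memory, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO payments (id, amount) VALUES (1, 100)")

rows_before = extract_payments(conn)  # works today

# The app team ships a migration that renames the amount column.
# From their point of view this is an internal detail of their own service.
conn.execute(
    "CREATE TABLE payments_v2 (id INTEGER PRIMARY KEY, amount_minor_units INTEGER)"
)
conn.execute("INSERT INTO payments_v2 SELECT id, amount FROM payments")
conn.execute("DROP TABLE payments")
conn.execute("ALTER TABLE payments_v2 RENAME TO payments")

try:
    extract_payments(conn)
    pipeline_broken = False
except sqlite3.OperationalError:
    # The old column no longer exists, so the extract query fails.
    pipeline_broken = True
```

Nothing in the app team's own code fails here. The breakage only shows up in a pipeline they may not know about.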

It’s like trying to fix someone’s plumbing without them knowing and then they suddenly start running a bath.

Another major issue lies in how data is made available.

The primary view of data in an ETL world is data at rest.

Data starts at rest, is extracted and transformed, and is then placed at rest somewhere else. The tendency is always for data to come to rest somewhere fixed.

This is fine if you want to query a static dataset. But in a modern enterprise, many use cases and systems need to be updated continuously.

The data needs to be in continuous motion. But ETL only moves data in coarse-grained batches. So when you have downstream systems that need to be continuously updated, you have only one option: you have to fudge it.

So people use something called ‘reverse ETL’, which basically involves setting up another ETL pipeline from your central data repository (where you just moved your data) to business domains.

This is like turning on your tap to fill your bath with water, then turning on another tap to move the water from the bath to the sink so you can wash your hands.

You also need to build a tool to look at the data you just put at rest to watch for when it changes so it can be sent downstream again to another database. But then you need a tool to watch that new database for updates and to send it through to the next system. You end up with a whole network of tools watching databases.
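The sketch below shows that 'tools watching databases' pattern in miniature. The `Store` and `Watcher` classes and the three store names are hypothetical stand-ins, not real products. Each watcher polls one store for new rows and copies them to the next, so a change only reaches the end of the chain after every hop has had its turn.

```python
class Store:
    """Stand-in for a database: rows at rest, each with an increasing version."""

    def __init__(self):
        self.rows = []  # list of (version, value) tuples

    def write(self, value):
        self.rows.append((len(self.rows) + 1, value))

    def changes_since(self, version):
        return [(v, value) for v, value in self.rows if v > version]


class Watcher:
    """Polls one store for new rows and copies them into the next store."""

    def __init__(self, source, target):
        self.source = source
        self.target = target
        self.last_seen = 0  # high-water mark of what has been copied so far

    def poll(self):
        for version, value in self.source.changes_since(self.last_seen):
            self.target.write(value)
            self.last_seen = version


app_db, warehouse, crm = Store(), Store(), Store()
etl = Watcher(app_db, warehouse)       # app database -> central repository
reverse_etl = Watcher(warehouse, crm)  # central repository -> business domain

app_db.write("payment p1 settled")

# The pollers run on independent schedules. Here the reverse ETL job
# happens to run just before the ETL job, so it finds nothing new.
reverse_etl.poll()
etl.poll()
after_first_cycle = [value for _, value in crm.rows]

# Only on its next scheduled run does the reverse ETL job pick the row up.
reverse_etl.poll()
after_second_cycle = [value for _, value in crm.rows]
```

Every extra hop adds another watcher, another high-water mark to maintain and another polling interval of delay.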

The overall outcome is that advanced data use cases are nigh on impossible to realise, because the data itself is unreliable (due to the artificially split responsibility) and it moves through the organisation in an awkward stop-start fashion (due to the emphasis on data at rest).

For advanced data use cases that require trustworthy data sources and near-real-time streaming we need a better approach.

This is where the data mesh comes in.

The Data Mesh

The big shift that the data mesh enables is decentralisation. Instead of pulling everything into one central repository, data is organised along domain-driven lines. Each domain owns its own data and treats it as a product to be consumed by the rest of the organisation.

(If you want a more detailed introduction to the data mesh, check out this article).

There are two aspects of the data mesh that are relevant for our discussion here.

The first is that it inverts the traditional lines of responsibility.

The team that owns the app is also responsible for making that data available to others in a controlled way.

So, if we return to our app team example from earlier, they would be building the payments platform using data from their database, but they would also be prepping that data and making it available to the rest of the business, instead of an external data team doing so. Because they understand the data!

The way they control standards is by introducing data contracts. A data contract is like a legal contract that obliges the service provider (i.e. the app team) to provide data as a service to the consumer on agreed terms: guaranteeing data of a specific quality, with certain attributes, such as being up-to-date to within an hour or being available at a certain latency.
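There is no single standard format for a data contract, so treat the following as a rough sketch rather than a reference implementation. The `DataContract` class, its field names and the sample records are all assumptions made for illustration. It checks two of the guarantees mentioned above: that the agreed attributes are present with the right types, and that a record is up-to-date to within an hour.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract: which fields are promised and how fresh they must be."""

    required_fields: dict  # field name -> expected Python type
    max_staleness: timedelta

    def violations(self, record, now):
        """Return a list of human-readable problems (empty if the record complies)."""
        problems = []
        for name, expected_type in self.required_fields.items():
            if name not in record:
                problems.append(f"missing field: {name}")
            elif not isinstance(record[name], expected_type):
                problems.append(f"wrong type for field: {name}")
        updated_at = record.get("updated_at")
        if isinstance(updated_at, datetime) and now - updated_at > self.max_staleness:
            problems.append("record is staler than the contract allows")
        return problems


payments_contract = DataContract(
    required_fields={"payment_id": str, "amount": int, "updated_at": datetime},
    max_staleness=timedelta(hours=1),
)

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # arbitrary fixed clock
fresh = {"payment_id": "p1", "amount": 100, "updated_at": now - timedelta(minutes=5)}
stale = {"payment_id": "p2", "updated_at": now - timedelta(hours=3)}

fresh_problems = payments_contract.violations(fresh, now)
stale_problems = payments_contract.violations(stale, now)
```

In practice the producing team would run checks like this before publishing, so consumers never see a record that breaks the agreement.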

By making each team in the business contractually obliged to make their data available in a standardised way, you suddenly create a vast mesh of high-quality, consistent data streams!

The second aspect is that, once you have realigned the lines of responsibility and accountability, the data mesh is the perfect opportunity to shift your primary view of data from ‘at rest’ to ‘in motion’.

You technically don’t need data in motion to have a data mesh. But because you have matched your data streams up with your team structures, you can take advantage of this powerful way of making data available.

Let’s use the analogy of pools and rivers to unpack what we mean by data in motion.

With ETL, you take data from one pool and move it to another pool, where it can be queried. If it’s needed elsewhere, however, you have to move it again to a third pool. And a fourth. And so on. And each pool is updated one by one in a discontinuous fashion as fresh data ‘hops’ from pool to pool.

With the data mesh, data is a continuous river, with users ‘dipping their buckets’ into the river at various points to get the data they need. These local buckets contain data at rest that can be queried by business users and they are all continuously updated in real time from the main river.
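The river and the buckets can be sketched in a few lines. This is a toy, in-memory stand-in for a real streaming platform, and the `Stream` class, the event fields and the two views are hypothetical. One stream of payment events feeds two local 'buckets', each holding data at rest that is updated the moment a new event flows past.

```python
class Stream:
    """A toy in-memory 'river': every published event is pushed to all subscribers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, event):
        for handler in self._subscribers:
            handler(event)


payments = Stream()  # hypothetical stream owned by the payments domain

# Bucket 1: finance keeps the latest status of every payment.
status_by_payment = {}


def update_status(event):
    status_by_payment[event["payment_id"]] = event["status"]


# Bucket 2: marketing keeps a running total of settled payments per customer.
settled_by_customer = {}


def update_totals(event):
    if event["status"] == "SETTLED":
        customer = event["customer_id"]
        settled_by_customer[customer] = (
            settled_by_customer.get(customer, 0) + event["amount"]
        )


payments.subscribe(update_status)
payments.subscribe(update_totals)

payments.publish({"payment_id": "p1", "customer_id": "c1", "status": "INITIATED", "amount": 100})
payments.publish({"payment_id": "p1", "customer_id": "c1", "status": "SETTLED", "amount": 100})
payments.publish({"payment_id": "p2", "customer_id": "c1", "status": "SETTLED", "amount": 40})
```

Both buckets can be queried at any time like ordinary data at rest, and neither needs a poller or a batch job to stay current.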

But it goes further: if you have ten business domains, each streaming their own data, these are like the ten branches of the data river.

These branches can then be combined to create new branches. In the transformation of combining rivers you create new information. For example, you could combine the data from payments with customer data to create a new river of data suitable for personalised marketing campaigns.
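As a sketch of what combining two branches might look like, the function below enriches each payment with the latest known customer profile as events arrive. The function name, the event shapes and the `segment` field are assumptions for illustration, and the sketch simply skips payments for customers it has not seen yet, which a real system would need to handle more carefully.

```python
def enrich_payments(events):
    """Combine two interleaved streams into a new one.

    `events` yields (source, event) pairs in arrival order. Customer events
    update a small local lookup; each payment event is joined against the
    latest profile known at that moment and emitted downstream.
    """
    latest_profile = {}
    for source, event in events:
        if source == "customers":
            latest_profile[event["customer_id"]] = event
        elif source == "payments":
            profile = latest_profile.get(event["customer_id"])
            if profile is not None:  # unknown customers are skipped in this sketch
                yield {
                    "customer_id": event["customer_id"],
                    "segment": profile["segment"],
                    "amount": event["amount"],
                }


# Hypothetical events from the customer and payments domains, in arrival order.
interleaved = [
    ("customers", {"customer_id": "c1", "segment": "student"}),
    ("payments", {"customer_id": "c1", "amount": 100}),
    ("customers", {"customer_id": "c1", "segment": "graduate"}),
    ("payments", {"customer_id": "c1", "amount": 40}),
]

marketing_stream = list(enrich_payments(interleaved))
```

The output is itself a stream, so another domain could subscribe to it or combine it further without touching either of the original sources.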

The result is that data is constantly in motion throughout the whole system, so users and systems constantly have access to near-real-time information. Data is only at rest when it is being queried.


Going Beyond ETL

These two shifts in perspective have incredibly powerful downstream consequences. They completely shatter the limits of traditional ETL and provide a rock-solid foundation for much more advanced data use cases.

It’s important to note that how you approach data engineering isn’t a technicality. It opens up completely new horizons for what your business can do with data!

With high-quality, trustworthy data delivered as a product through these real-time streaming rivers, anyone in your business can start drawing on these resources to create novel streams of business value. Not least your data scientists, who will have the data they need to feed their real-time analytics projects and hungry AI/ML models!

This is a truly modern data engineering paradigm that will be an insane source of competitive advantage for enterprises that are able to execute on it successfully.

If you want to be competitive, you need to sort your data constraints, and that's where Mesh-AI can help. Identify the areas in your organisation that require the most attention and solve your most crucial data bottlenecks. Download our Data Maturity Assessment & Strategy Accelerator eBook.

