22 Sep

Data Mesh 101: Transcending the Piggies-in-the-Middle With Scalable Metadata Management

Tom Jenkin
“Metadata is your friend”

- Sarita Bakst, JPMorgan

The mantra of modern business is to be customer-obsessed. 

Even within your organisation, if you provide a service or product to other business units, success is measured by how well you fulfil the needs of the end user.

And then there’s your data department. 

Traditional data programs are almost always designed with a strong focus on the producers of the data, without considering how the end user consumes data in an organisation. 

As such, once enterprise data sets have been produced, they are either siloed within their respective teams or ploughed into a giant data lake somewhere in the organisation. 

We run into problems either way. If the data is siloed, there’s a tendency towards inconsistency and divergence in standards between different siloes. If the data is dumped in one place to try to manage that inconsistency, the quantity becomes overwhelming and the discovery process becomes bloated and slow. 

In both cases, it is left to what I like to call a ‘piggy-in-the-middle’ team of data engineers to plough through all this data (which varies greatly in terms of quality and how well-catalogued it is), making sense of it and preparing it for consumption by the business.

The issue is that these little piggies have no domain expertise: they have no idea where the data came from, nor where it’s going. So they huff and they puff...and do the best that they can. 

But the end result is a higgledy-piggledy jumble of different data sets that is varied in quality, uncertain of provenance and inconsistent in documentation. 

Data consumers, in turn, aren’t sure what data is available, where it is or how they can get access to it. 

So they have to either badger the central data team or go round the houses to the original teams that produced the data, hoping that they find what they need. 

If a product development team needs a certain data set, for example, they may have to talk to seven different teams to get what they want and then do loads of manual work to make it usable for their purposes. And even then, they aren’t sure whether they have the latest or best data available for the job. 

This creates lots of unplanned work for all teams involved, as everyone is running around, making and responding to last-minute requests for data. 

This situation is totally inefficient and not remotely scalable. And we've seen this state of affairs stifle innovation over and over again in many different companies. Some organisations try to remedy this situation by piling on technological solutions, but you cannot technologise your way out of poor data management.

The Power of Metadata

This is where a strong metadata capability is absolutely essential in helping teams to find and consume data. 

What that means at a high level is that all data owners need to make their data available with the consumer front-of-mind: we start producing data with the goal of turning it into a product the consumer can easily find and use. 

This requires a good, standardised data catalogue that aggregates and categorises all available data products along with their metadata: owners, source of origin, lineage, sample datasets and so on. It also requires global standards for how that metadata is captured and maintained. 
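
To make that a little more concrete, here is a minimal sketch in Python of the kind of metadata record a domain might publish for each data product. The field names and values are illustrative assumptions of mine, not tied to any particular catalogue tool.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class DataProductEntry:
        """One catalogue entry describing a data product and its metadata."""
        name: str                   # e.g. "orders.daily_summary"
        owner: str                  # accountable domain team
        domain: str                 # business domain the product belongs to
        source_systems: List[str]   # systems the underlying data originates from
        lineage: List[str]          # upstream data products this one derives from
        sample_location: str        # pointer to a sample dataset for consumers
        description: str = ""

    # A producing domain registers its product with consumer-facing metadata.
    # All names and locations below are made-up examples.
    orders_summary = DataProductEntry(
        name="orders.daily_summary",
        owner="orders-domain-team",
        domain="orders",
        source_systems=["order-service-postgres"],
        lineage=["orders.raw_events"],
        sample_location="s3://example-bucket/samples/orders/daily_summary.parquet",
        description="Daily aggregated order totals per region.",
    )
    print(orders_summary.owner)  # consumers can see who is accountable for the data

The point is not the exact schema: it’s that the owning domain, not a central team, fills these fields in.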

But to maintain the quality of the data catalogue, you can’t leave it up to the piggies-in-the-middle: they don’t have the domain expertise to execute properly on the domain-specific metadata requirements. Nor do they have the time and energy to do it for the whole business! 

You need the domains themselves to do it. 

So, we find ourselves in a position where:

  • Firstly, metadata can only be managed scalably and efficiently at a local level (otherwise the piggies-in-the-middle get overwhelmed), but…
  • Secondly, there needs to be consistency of metadata governance across all domains (otherwise the domain teams will end up with different formats, tools and processes, and data discovery will not be consistent). 

This is where data mesh comes in. 

Metadata Management in Data Mesh

Data mesh is an approach to data that supports distributed, democratised, self-serve access to data organised by business domain, not by pipeline stage.

If you want to read more about the core principles of data mesh, you can check out my introductory blog here. 

Because data mesh is organised around the concept of domain-driven design, both in terms of the platform and the team structure, domain teams are accountable end-to-end for the production and consumption of their data. 

When organised this way, you can create a federated structure for your data estate, where data governance standards are defined centrally, but local domain teams have the autonomy to execute those standards in whatever way is most appropriate for their particular environment. 

That means that the governance standards for metadata management can be upheld, along with standardised tools and processes to help domains to execute on these.

At the same time, the domains have end-to-end responsibility for making those standards a reality in whatever way is best for them, and they can introduce those standards at an early stage in the product development lifecycle, following data all the way from production to consumption.
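
As a rough illustration of what "centrally defined, locally executed" could look like, here is a small Python sketch. The required fields and the check itself are my own illustrative assumptions: the central platform team owns the standard, and each domain runs the same check against its own catalogue entries before publishing.

    # Centrally defined standard: the metadata every data product entry must carry.
    # The field list is an illustrative assumption, not a prescribed schema.
    REQUIRED_FIELDS = ["name", "owner", "domain", "source_systems", "lineage", "sample_location"]

    def governance_violations(entry: dict) -> list:
        """Return the governance problems found in one catalogue entry."""
        return [
            f"missing or empty metadata field: {field_name}"
            for field_name in REQUIRED_FIELDS
            if not entry.get(field_name)
        ]

    # A domain team runs the shared check locally before publishing its product.
    candidate_entry = {
        "name": "orders.daily_summary",
        "owner": "orders-domain-team",
        "domain": "orders",
        "source_systems": ["order-service-postgres"],
        "lineage": [],          # empty lineage gets flagged by the central standard
        "sample_location": "",  # so does a missing sample dataset
    }
    for violation in governance_violations(candidate_entry):
        print(violation)

The standard lives in one place; the work of meeting it happens in the domains.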

This sounds simple but, historically, data has never been organised this way, so it will take some effort up front to get rolling, probably as part of a larger data transformation. 

Customer-Obsession in Data Mesh

When set up this way, your domains can start to become customer-obsessed. 

Each domain has the means and the mandate to create highly scalable and repeatable data discovery capabilities and processes for their domain. 

As each domain establishes its data kingdom, the end result is a network (perhaps you could say...a mesh) of data nodes, each producing and consuming data in a highly organised, well-documented fashion. 

A massive benefit of this arrangement is that data quality is higher in this network, as it is no longer necessary to copy and slightly modify the same data sources every time a new requirement arises. 

Consumers, then, can ditch the piggies-in-the-middle and go straight to the relevant node on the mesh for their data needs. The tools and processes to enable them to do this are also provided from on high by the central hub in the data federation. 
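
For example, a consumer-side lookup against such a catalogue might be as simple as the following sketch (again with made-up entries and a hypothetical find_products helper), rather than a round of emails to other teams:

    # Hypothetical in-memory catalogue; in practice this would be a query against
    # the central catalogue service rather than a local list.
    catalogue = [
        {"name": "orders.daily_summary", "domain": "orders", "owner": "orders-domain-team"},
        {"name": "customers.profiles", "domain": "customers", "owner": "customer-domain-team"},
    ]

    def find_products(domain: str) -> list:
        """Return every data product published by the given domain."""
        return [product for product in catalogue if product["domain"] == domain]

    for product in find_products("orders"):
        print(product["name"], "owned by", product["owner"])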

Overall, this means that there is a much greater capacity to focus on providing a first-class service to the consumer, while relieving the producers of all the unplanned work that resulted from the previously crappy consumer experience (with consumers having to harass producers to get what they needed). 

Final Thoughts

The biggest impact this will have is that business units can finally TRUST their data. 

They know that the domains have the capacity and the means to provide high-quality, standardised data discovery at scale. That gives them much greater confidence, when moving forward with their own projects, that the data they need will be there when they need it. 

As data increasingly becomes the starting point for product development and digital innovation, getting the foundations right, like metadata management, can mean the difference between rapid innovation and a lot of tired little piggies.



