“Centralization doesn’t always work. We know we can create these huge data lakes if we want to. But sometimes, they end up becoming data swamps.”
Global Head of Data Management
Data in the enterprise is a tangled goat’s nest of complexity.
Dive in and we find data from thousands of different sources, scattered across hundreds of corners of the business, in a hundred different formats from all over the globe, potentially reaching back decades.
And with each new business region, merger or acquisition, the complexity jumps by an order of magnitude as a host of new platforms and technologies gets tacked onto the existing system.
And you can’t tell straight off the bat what’s useful and what’s not. Consider a large, international mortgage company: if your customers are renewing their mortgages, your data on those customers from 10, 20 or even 30 years ago is still very much relevant.
In amongst this goat’s-nest complexity, the question looms: what the hell is all this data, where is it and how can I use it? This is the question of data discovery.
If your data is discoverable then you know what it is, where it is, and it has been successfully transmuted into something meaningful, secure, private and easily consumable that your business teams can access directly and make use of.
This is astonishingly hard to do well, given the aforementioned goat’s-nest complexity.
But the goat is guarding a great treasure: your business teams being able to work with data unimpeded, spending their time directly on extracting insights and turning those into next-best-actions, data-driven product ideas and deeper understanding of customer and business alike.
So how can we slay the data goat and make our data deliciously discoverable for our teams?
Traditionally, folks have tried to deal with the scale and heterogeneity of enterprise data by centralising and dumping it all in one place: a data warehouse or data lake.
This approach can work in smaller organisations where there are fewer data sources and fewer data consumers (i.e. where the goat’s nest has not reached monstrous proportions). But when you get to the level of massive enterprises you run into issues.
So many different kinds of data accumulate that being able to ingest and make sense of it gets harder and harder. As more people pile into the platform, response times get slower. And as more teams create undocumented and slightly different copies of the same data, quality takes a hit.
The engineers stuck in the middle become a bottleneck, with limited knowledge of the business-relevance of the different data sets streaming into their platform.
Ultimately, your sophisticated platform ends up being a data jumble sale where it’s hard for the prospective buyer to find the valuable nugget they need.
The data mesh is a paradigm that questions the foundational assumption of data in the enterprise: that the data sources, the platform itself and the data team have to be centralised.
Instead, the data mesh approach decentralises the whole thing, supporting distributed, democratised, self-serve access to data organised by business domain, not by pipeline stage.
Critically, data is treated as a product to enable full end-to-end accountability across each domain, with one team responsible for the whole lifecycle of a given data product.
I’ve written about the core principles and benefits of the data mesh in much more detail, so go check out that post here.
When data is widespread and decentralised, as in the enterprise, treating it as a product and organising it by domain gives teams far more leeway to sort themselves out, and lets technologies be changed and scaled quickly.
This translates into much faster and more democratised data discovery.
Here are a few of the ways that a data mesh enables superior data discovery:
The mechanism for discovering data in a data mesh is the data contract, which is created whenever data is published.
It covers core properties such as data type, schema, frequency, quality, lifecycle and so on.
These properties are immutable across the data lifecycle, which allows consumers to discover and subscribe to specific contract properties, so that they are continually updated with relevant data as new numbers are crunched.
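As a concrete (and entirely hypothetical) sketch, a data contract could be modelled as an immutable record of these core properties, with consumers filtering on the ones they care about. All class, field and product names below are illustrative assumptions, not any published standard:

```python
from dataclasses import dataclass

# A minimal, hypothetical data contract. Field names are illustrative
# assumptions, not part of any standard.
@dataclass(frozen=True)  # frozen=True mirrors the contract's immutability
class DataContract:
    name: str              # the data product's identifier
    owner_domain: str      # the domain team accountable end-to-end
    schema: tuple          # (column, type) pairs the producer guarantees
    frequency: str         # how often new data is published
    quality_sla: float     # e.g. minimum fraction of valid records
    lifecycle: str         # retention / versioning policy

def matches(contract: DataContract, **wanted) -> bool:
    """True if the contract satisfies every requested property."""
    return all(getattr(contract, key) == value for key, value in wanted.items())

contract = DataContract(
    name="mortgage-renewals",
    owner_domain="lending",
    schema=(("customer_id", "string"), ("renewal_date", "date")),
    frequency="daily",
    quality_sla=0.99,
    lifecycle="retained-30-years",
)

# A consumer discovers (and could subscribe to) contracts by the
# properties it cares about, not by where the data physically lives.
assert matches(contract, owner_domain="lending", frequency="daily")
```

The point of the sketch is the shape, not the field list: producers publish a fixed, machine-readable promise, and consumers query against it.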
The governance for data contracts can be federated across data owners but managed in a centralised location, i.e. a data discovery platform. Data experts support the various domains in executing these contracts in their own way, with their specific domain expertise, but in alignment with global standards.
The result is standardised data cataloguing (and discovery) across the data mesh, while the data itself remains decentralised.
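One way to picture this split between federated ownership and centralised discovery: each domain registers its contract in a shared catalogue, but the data never leaves the domain. A minimal sketch, with every class, method and product name assumed for illustration:

```python
# A hypothetical central catalogue: every domain registers its contract
# here, but the data itself stays inside the domain.
class DataCatalogue:
    def __init__(self):
        self._contracts = {}  # product name -> contract properties

    def register(self, name, **properties):
        """Called by a domain team when it publishes a data product."""
        self._contracts[name] = properties

    def search(self, **wanted):
        """Return every product whose contract matches the requested properties."""
        return [
            name for name, props in self._contracts.items()
            if all(props.get(k) == v for k, v in wanted.items())
        ]

catalogue = DataCatalogue()

# Each domain registers on its own terms...
catalogue.register("mortgage-renewals", domain="lending", frequency="daily")
catalogue.register("branch-footfall", domain="retail", frequency="hourly")

# ...while discovery happens in one standardised place.
assert catalogue.search(frequency="daily") == ["mortgage-renewals"]
```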
Prior to the data mesh, all data had to be processed via a central team, which acted as a massive bottleneck for both data producers and consumers. It’s a terrible design in terms of scalability.
In a data mesh, both producers and consumers are freed from this dependence on a centralised team. And thinking about data in terms of products creates end-to-end accountability for specific domains.
So, producers are responsible for their data from source to consumer, including quality assurance as well as specifying access and control policies for the data in their domain, which they can do on their own terms.
Equally, consumers can go straight to the relevant domain endpoint to source their data, with no interference.
Often consumers of one domain are producers of another. The result is a network of data-producing and data-consuming nodes, each of which is responsible for keeping the flow of their data open for the other domains.
This creates a decentralised, accountable web capable of maximising the production and consumption of data across the organisation: a highly scalable, highly resilient network of data discovery processes.
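A toy sketch of such a web, with entirely hypothetical names: each domain node serves its own product directly from its endpoint, and the consumer of one domain can be the producer of another, with no central team in between:

```python
# Hypothetical sketch: each domain node serves its own data product
# directly; consumers call the domain's endpoint, not a central team.
class DomainNode:
    def __init__(self, name, produce):
        self.name = name
        self._produce = produce  # function that builds this node's product

    def serve(self):
        """The domain's endpoint: other domains call this directly."""
        return self._produce()

# The lending domain produces raw renewal records...
lending = DomainNode(
    "lending",
    lambda: [{"customer_id": "c1", "renewed": True}],
)

# ...and the analytics domain consumes lending's endpoint to build its
# own derived product, making it both a consumer and a producer.
analytics = DomainNode(
    "analytics",
    lambda: {
        "renewal_rate": sum(r["renewed"] for r in lending.serve())
        / len(lending.serve())
    },
)

assert analytics.serve() == {"renewal_rate": 1.0}
```

Scaling the mesh is then just a matter of adding more nodes: no single node has to know about, or wait on, the rest.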
Teams can then autonomously create and consume data products or even bring together independent data products in combination by themselves. The granularity of what consumers are able to achieve is massively broadened.
Just as the move from monoliths to microservices in application development resulted in a giant leap in granularity, scalability and resilience, the move to decentralised, domain-specific responsibility and communication brings these qualities to data discovery also.
In this way, and in concert with data contracts, producers and consumers can easily discover, understand and securely use all the data they need across the enterprise. And if you want to scale the platform, you just add more nodes on the mesh.
Another consequence of decentralisation is that relevant domain expertise is now omnipresent, which rapidly accelerates the data discovery lifecycle: each domain has a cross-functional team with the data engineering and domain expertise it requires.
In a centralised data platform, all data is fed into one central team to be processed and prepared. While this team are experts in handling data, they don’t actually know anything about where the data has come from, or where it’s going.
Imagine if developers had to send all their features to a central ‘feature team’ to be processed before they could be delivered to consumers. The central team would have thousands of features from all sorts of different apps to process... how well would they be able to do that job?
When data is turned into domain-specific products, with each domain taking on end-to-end responsibility, they have both the data expertise plus the domain expertise.
The whole process of finding, sorting, transforming, analysing and publishing data as products to be consumed is scaled and accelerated across teams with the accountability and expertise to get the job done as quickly and effectively as possible.
Critically, because responsibility is end-to-end, producers are directly incentivised to make their data highly discoverable. No more throwing data over the wall and hoping for the best!
This is how enterprises are then able to deal with the mammoth amount of data of all different kinds they have: divide and conquer!
To bring it back to our example at the beginning, this is how our mortgage company might start to get a handle on all its legacy data: by dividing it up into domains, each of which can draw on global standards and infrastructure to get the job done as it sees fit, since each domain team knows its own domain best!
The data mesh is both distributed and discoverable.
With centrally-governed-yet-locally-executed data contracts, end-to-end product thinking, cross-functional domain teams and scalable infrastructure, the whole data discovery enterprise can be made incredibly granular and scaled across your entire business.
The various nodes on the mesh are responsible for holding their respective data forts, drawing on centralised resources of infrastructure and standards, but managing their own domain in the way that they deem best as domain experts.
What emerges is more than just data discovery.
Interested in seeing our latest blogs as soon as they get released? Sign up for our newsletter using the form below, and also follow us on LinkedIn: https://www.linkedin.com/company/wearemesh-ai/