The company began as an e-commerce startup in Germany and has grown to become one of Europe’s leading online retailers, with more than 14,000 employees, over 2,000 of whom work in its tech department.
The company has been data-driven from the start, collecting data from many different sources and using it to make business decisions.
They began with a huge, on-premises big data warehouse that integrated all the data for analytics, but as the company grew, this setup ran into issues of scale, flexibility and communication.
To address this, they initially built a data lake and a company-wide messaging bus in the cloud that served three purposes: archiving all data that flows through the bus in the data lake, connecting to the legacy data warehouse (which still held much valuable data), and tracking web data.
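The archiving role of the bus can be sketched in a few lines. This is a hypothetical, in-memory illustration (the `MessageBus` class, topic names and payloads are all invented for the example, not the company's actual system): every published message is copied to the lake before it reaches any consumer, which is how the bus came to archive everything that flowed through it.

```python
from collections import defaultdict

class MessageBus:
    """Hypothetical sketch of a company-wide bus: every published
    message is also archived to the data lake (here, an in-memory dict)."""

    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> consumer callbacks
        self.data_lake = defaultdict(list)     # topic -> archived messages

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Purpose 1: archive everything that flows through the bus.
        self.data_lake[topic].append(message)
        # Purposes 2 and 3 would be ordinary subscribers, e.g. a legacy
        # warehouse connector or a web-tracking pipeline.
        for callback in self.subscribers[topic]:
            callback(message)

# A consumer (say, the warehouse connector) subscribes like any other.
bus = MessageBus()
received = []
bus.subscribe("orders", received.append)
bus.publish("orders", {"order_id": 1, "amount": 42.0})
```

The point of the design is that archiving is not optional or per-producer: it happens centrally, inside `publish`, which is also why producers could remain unaware their data was being stored.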
The major challenge with this messaging bus was that it was centralised.
This created a disconnect between producers and consumers, with a central data pipeline separating them.
The producers had no sense of ownership over their data and no idea what happened to it once they had created it. Many data producers were not even aware that their datasets were being stored and archived, never mind being used by others.
Likewise, the users were not familiar with the data either and were unsure how to use it. And the central data team could not ensure the quality of each of the thousands of datasets streaming through the central data pipeline: their main concern was simply keeping the data flowing, which led to low-quality data.
This made the central infrastructure team a key bottleneck that limited the scalability of the whole setup. They were hammered with requests from confused users and had to juggle those last-minute requests with fire-fighting.
The organisation decided to opt for a data mesh approach to move from centralised to decentralised ownership of their data.
This involved multiple shifts in mindset: from pipelines to domains as the key focus; from a centralised data lake to an ecosystem of data products; and from siloed data engineering teams to cross-functional domain data teams.
A key change was to make data producers responsible for the data they store in the system: they must opt in, which forces more conscious decisions around data. If they store a dataset, they know they have to support it.
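One way to picture this opt-in is a registry that refuses anonymous datasets. This is a minimal sketch under assumed names (`DataProduct`, `DataProductRegistry`, the team and channel names are all illustrative, not the company's real tooling): a dataset simply cannot enter the system without an owning team and a contact point.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor a producing team must fill in:
    storing data becomes an explicit, owned decision."""
    name: str
    owning_team: str   # the team that commits to supporting the dataset
    contact: str       # where consumers send their questions
    description: str

class DataProductRegistry:
    def __init__(self):
        self._products = {}

    def register(self, product: DataProduct) -> None:
        # Opt-in: no anonymous datasets; ownership metadata is mandatory.
        if not product.owning_team or not product.contact:
            raise ValueError("a data product needs an owner and a contact")
        self._products[product.name] = product

    def owner_of(self, name: str) -> str:
        return self._products[name].owning_team

registry = DataProductRegistry()
registry.register(DataProduct(
    name="orders.daily",
    owning_team="checkout-domain",
    contact="#checkout-data",
    description="One row per completed order, refreshed daily.",
))
```

Because registration is the only way in, consumers always know who to ask about a dataset, and producers always know which datasets they are on the hook for.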
Equally important was the creation of cross-functional teams of business analysts, domain experts and data engineers, so that each team had all the knowledge necessary to create high-quality data products.
A universal interoperability layer was provided so that central infrastructure could be consumed in a data-agnostic way.
They provided Spark clusters so that consumers could do data analytics work without having to worry about the underlying infrastructure. In this way, the organisation provides centralised services that are globally interoperable from any point.
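The consumer-facing effect of such a layer can be sketched as a thin, data-agnostic access service. Everything here is an assumption for illustration (the `AnalyticsService` class, the dataset names, and the stub backends are invented): the consumer asks for a dataset by name and gets rows back, without knowing whether a Spark cluster, the lake, or the legacy warehouse actually serves the query.

```python
class AnalyticsService:
    """Hypothetical data-agnostic access layer: consumers name a
    dataset; the backend behind it is an infrastructure detail."""

    def __init__(self, backends):
        self._backends = backends   # dataset name -> callable returning rows

    def read(self, dataset):
        return list(self._backends[dataset]())

# Two very different sources hidden behind the same interface.
# In production these callables might submit a Spark job or query
# the legacy warehouse; here they are stubs.
service = AnalyticsService({
    "web.clicks": lambda: [{"page": "/home", "clicks": 10}],
    "legacy.revenue": lambda: [{"year": 2019, "eur": 1_000_000}],
})
rows = service.read("web.clicks")
```

This is the sense in which consumers can "plug in whatever tools they want": the interface is uniform, so the tooling on top of it is the consumer's own choice.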
Combined, these innovations massively simplified data sharing. Central infrastructure provisioning alongside decentralised data ownership and cross-functional teams means users can access data autonomously and plug in whatever tools they want.
Using the data mesh, the organisation has managed to turn their ‘data swamp’ into a decentralised, domain-driven data architecture that is free from major bottlenecks.
The way they organise their data and their teams means that much more conscious decisions are being made around their data, resulting in a much higher degree of visibility, ownership and responsibility around datasets in the organisation.
This allows for much more fluid and dynamic movement of valuable data within the organisation.
The data products that emerge from this mesh for consumption are well-understood, high-quality and highly-available.