Let’s say you want to adopt and scale modern data capabilities (fancy analytics, AI/ML and such).
Enterprises have thrown a lot of time and money at this problem, often with dubious results.
Firstly, they treat it as a technology exercise.
There is often a temptation to assume that a more advanced technology will resolve widespread issues with data, but one of my learnings from the Big Data era is that new technology platforms are not immune to pre-existing data problems.
Instead, they could compound the problem if the data continues to be fragmented across the organisation and siloed from its legitimate users. For any new technology to deliver on its promise, the underlying data architecture issues need to be addressed.
Secondly, they don’t prepare the data with the consumer in mind, so data ends up being integrated in an ad hoc fashion.
Instead, the focus is on the producer: helping them produce data in the way that is easiest for them (and for the machines that read the data), regardless of whether it is useful for the business users who are going to turn it into business value!
With many enterprises gearing up to expand their data capabilities and adopt ML/AI, the real question is: how do you do this in a way that delivers the anticipated value and does not end up as yet another expensive technology-only exercise?
I propose that addressing data problems at the source, that is, near where they actually occur, is an essential part of enabling data innovation at scale.
Enterprises need to move from an ad-hoc integration logic to identifying and exposing trustworthy and secure data sources to their legitimate consumers.
In this blog, I’ll explain what that means, why it’s so critical and suggest a three-step process for getting started.
Addressing data at the source means that you prepare your data for consumption from the ground up. This needs to happen near where and when the data is produced to increase the (re)usability of the data for as many consumers as possible in the organisation.
By this, I don’t mean that you just add a shiny layer on top of your existing data silos. A certain amount of effort needs to be invested in identifying, defining and making these data sources available upfront.
Why would you want to do this?
One of the main challenges of building a new product or introducing a new capability for which data is central is finding the relevant data sources, connecting to them, and making them available in a way that is comprehensible to the consumer.
This is easier said than done: in practice, many organisations have a multitude of highly utilised data sets where the lineage and quality of the data are not well understood, or the data is not even suitable for the purpose.
In addition, in many organisations the data is maintained by technical custodians rather than the real data owners. Technical custodians frequently have a limited understanding of the data’s semantics and limited control over its lifecycle.
Instead, what we need is data that is discoverable, available, clearly owned and prepared for consumption.
When you can do this, your whole data setup is transformed: it becomes massively scalable, and the data can be freely used by anyone across the business for any use case they can imagine.
Critical to moving towards working with data at the source is what I am calling ‘fundamental data sources’.
The traditional way businesses look at data is by making a distinction between operational data and analytical data.
This distinction, however, is artificial.
Data itself is neutral. It’s only the data use cases that are either analytical or operational.
Following from their initial (false!) distinction, organisations have tailored systems (people and software) to support these segregated use cases. But in today’s world, the same data could be used to serve a wide range of use cases that span across the analytical and operational spectrum and utilise multiple data processing paradigms (machine learning, realtime streaming, etc.).
A more useful way to look at the problem is by introducing the idea of fundamental data sources.
My definition of a fundamental data source is a data set that is not only discoverable and available, but that also has a clear data owner responsible for maintaining it, along with a number of specialised views that serve its consumers.
A fundamental data source is essentially a highly-utilised, consumer-friendly data source.
Fundamental Data Source = Data + Data Owners + Specialised Views
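The formula above can be sketched in code. This is a minimal, in-memory illustration, with hypothetical names and structures; it is not a prescription for any particular platform:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class FundamentalDataSource:
    name: str                      # e.g. "customer-trades"
    owner: str                     # the business-domain team accountable for it
    records: List[dict]            # the underlying data itself
    views: Dict[str, Callable[[List[dict]], object]] = field(default_factory=dict)

    def register_view(self, view_name: str, builder: Callable) -> None:
        """Data Owners publish reusable, consumer-facing views here."""
        self.views[view_name] = builder

    def consume(self, view_name: str):
        """Consumers read through a specialised view, never the raw records."""
        return self.views[view_name](self.records)

# Illustrative usage: the owning team registers a view, consumers use it.
trades = FundamentalDataSource("customer-trades", "trading-domain-team", [
    {"customer": "A", "qty": 10}, {"customer": "A", "qty": -4},
])
trades.register_view("net-position", lambda rows: sum(r["qty"] for r in rows))
print(trades.consume("net-position"))  # → 6
```

The point of the sketch is the shape, not the code: the data, its owner and its consumer-facing views travel together as one unit.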
We do this because we have found that there are typically a limited number of data sources that underpin an organisation or a business domain, and from which other useful data sets are derived. In financial trading, for example, customers’ portfolios or profit and loss statements represent core business data, but they are often derived from fundamental data sources such as customers’ trades and stock prices.
It is not only about the raw data though. It is equally important to identify the real Data Owners of these data sources. These are the people or teams who belong to the same business domain, who understand the data and are tasked with maintaining it and making it available to the rest of the organisation.
Raw data at the source, however, is often not helpful for consumers on its own. It is usually formatted in a way that foregrounds technical concerns related to how it originated, rather than the concerns of the potential consumer.
For example, business events can take the form of text-based files fetched from a third-party service, or binary encoded messages over a low-latency transport. These formats are often not adequate for the higher-level use cases that consumers within the organisation need, for example a time series projection of the business events. This is where specialised views are created to turn raw data into usable data or information.
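As a toy illustration of such a specialised view, the sketch below (with assumed field names) projects raw, timestamped business events onto a per-day time series:

```python
# Illustrative raw business events, as they might arrive from a
# low-level feed (the field names are assumptions for this sketch).
raw_events = [
    {"ts": "2024-03-01T09:30:00", "symbol": "ACME", "price": 101.0},
    {"ts": "2024-03-01T15:45:00", "symbol": "ACME", "price": 103.0},
    {"ts": "2024-03-02T10:00:00", "symbol": "ACME", "price": 99.5},
]

def time_series_view(events):
    """Project raw events onto a per-day series (last price of the day wins)."""
    by_day = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        day = e["ts"][:10]            # keep only the date portion of the ISO timestamp
        by_day[day] = e["price"]      # later events overwrite earlier ones
    return by_day

print(time_series_view(raw_events))
# → {'2024-03-01': 103.0, '2024-03-02': 99.5}
```

The transformation itself is trivial; what matters is that it is built once, near the source, and reused by every consumer who needs a time-series shape.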
The fundamental data sources can then be identified and made available and consumable across the organisation to support all sorts of use cases.
To achieve this, I suggest a three-part approach:
The first part is identifying the fundamental data sources. This can take the form of a deliberate and concrete mapping activity to work out which data sources are essential to deliver the desired use cases.
It can be daunting or impractical to conduct such a mapping exercise top-down, at the scale of the whole organisation. Instead, starting from a set of existing use cases, such as a new product development, is more practical.
By following all the data sources needed to deliver the product, across different layers, teams and transformations, all the way down to the source, the fundamental data sources will emerge even if they were not clear from the outset. When this exercise is repeated across different data sources and use cases, commonalities emerge and the bigger picture starts to take shape.
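This bottom-up tracing can be thought of as a walk over a lineage graph. The dependency map below is entirely hypothetical (it reuses the trading example from earlier); data sets with no upstream dependencies surface as candidate fundamental sources:

```python
# A hypothetical dependency map: each data set lists the data sets
# it is derived from (all names are illustrative).
upstream = {
    "pnl-report":          ["customer-portfolios", "stock-prices"],
    "customer-portfolios": ["customer-trades", "stock-prices"],
    "customer-trades":     [],   # nothing upstream: a candidate fundamental source
    "stock-prices":        [],
}

def fundamental_sources(data_set, deps):
    """Walk the lineage down to data sets with no upstream dependencies."""
    if not deps[data_set]:
        return {data_set}
    found = set()
    for parent in deps[data_set]:
        found |= fundamental_sources(parent, deps)
    return found

print(sorted(fundamental_sources("pnl-report", upstream)))
# → ['customer-trades', 'stock-prices']
```

In practice the map is built by interviewing teams and tracing pipelines, not by code, but the traversal captures the logic of the exercise.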
It is important to emphasise that identifying the data sources must go hand in hand with identifying their Data Owners, as those are the people who will manage each data source and make it available for the rest of the organisation to consume by building a range of reusable data views.
Identifying fundamental data sources is just the first step. To draw any utility from them, the data needs to be made available to product teams to consume in a native way.
“Native” here implies that the data is accessible from the product's environment (think of a multi-account or a hybrid setup), and only minimal effort is required to enable consumption of the data.
In addition to that, the data needs to be made available in a number of specialised and reusable views, with a strong data contract. The views typically capture the requirements and needs of a class of use cases such as a streaming view or a timeseries view of business events.
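As one illustration of a strong data contract, a view can declare the fields and types it guarantees and check records before publishing them. The contract format, field names and check below are assumptions for the sketch, not a standard:

```python
# A lightweight sketch of a data contract for a daily time-series view:
# the view promises exactly these fields with these types.
CONTRACT = {"symbol": str, "day": str, "close": float}

def conforms(record, contract=CONTRACT):
    """True only if the record has exactly the promised fields and types."""
    return set(record) == set(contract) and all(
        isinstance(record[f], t) for f, t in contract.items()
    )

good = {"symbol": "ACME", "day": "2024-03-01", "close": 103.0}
bad  = {"symbol": "ACME", "close": "103.0"}   # missing field, wrong type

print(conforms(good), conforms(bad))  # → True False
```

A real contract would also cover semantics, freshness and evolution rules; the essential point is that consumers can rely on the view's shape without inspecting the producer's internals.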
Contrast this with ad hoc point-to-point data integration, which is a common but unscalable way to handle new data requirements.
The last piece of the jigsaw is to make sure that the people (emphasis on people!) who need the data in the organisation to do their job can actually find it. If the data is not discoverable, it cannot be used. It is that simple.
Data discovery needs to focus on the needs of data consumers: people and teams who need the data but do not necessarily have a full understanding of it yet. This is very different from traditional data catalogues, which tend to present deeply technical views of selected databases and are not optimised for the users who need to consume the data to build new products or business capabilities.
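A consumer-oriented catalogue can be as simple as searching business-level descriptions rather than technical schemas. The entries and matching logic below are purely illustrative:

```python
# A toy catalogue oriented at consumers: each entry carries a
# business-language description, not table and column internals.
catalogue = [
    {"name": "customer-trades", "owner": "trading-domain-team",
     "description": "every trade executed on behalf of a customer"},
    {"name": "stock-prices", "owner": "market-data-team",
     "description": "end-of-day and intraday prices for listed stocks"},
]

def discover(query):
    """Match on business language, not on technical schema details."""
    q = query.lower()
    return [e["name"] for e in catalogue
            if q in e["description"].lower() or q in e["name"]]

print(discover("prices"))  # → ['stock-prices']
```

A production catalogue would add ranking, ownership links and access requests, but the orientation is the point: consumers search in their own vocabulary.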
Addressing data issues at the source is a more fit-for-purpose approach to scaling data consumption that enterprises seeking to adopt modern data capabilities should consider.
Traditional approaches assume siloed and limited use cases for data, which is no longer true in today’s world. By putting the data at the centre and shaping systems around it (rather than the other way around), business value and return on investment can be achieved.
This approach is not a quick fix, so long-term commitment and strong sponsorship need to be secured at the outset.