Data is the deepest well of business value that enterprises have access to.
But organisations are realising the growing importance of not just producing valuable data, but of what I call data utilisation: the capacity to turn that data into something useful.
So not just deriving insights into your customers, for example, but automatically integrating those insights into your next batch of products.
In my experience working across many large enterprises, I have seen a number of common anti-patterns that get in the way of successfully scaling data utilisation.
These anti-patterns are sneaky and seemingly innocuous. They often go completely unnoticed, or are even celebrated as a necessary aspect of a functioning data programme!
In this blog, I’m going to highlight five of these ‘silent blockers’ that are preventing enterprises from turning their data into powerful business value.
Because companies have historically built their data programmes around narrow and limited use cases, many preside over extremely fragmented datasets that are locked away in discrete silos.
This siloed data rarely exists in a format that is well defined or actually suited for consumption by the larger enterprise. And because the silos were built on top of pre-existing structures, they rarely match business domain boundaries (and so are misaligned with the business they are meant to serve).
The consequence is that, even though the data could be incredibly valuable for new products, finding, accessing and utilising this data becomes virtually impossible.
In general, people who are in most need of consuming data do not necessarily have a deep understanding of the various attributes of the data set, such as the exact semantics, quality attributes or schema.
But traditional data catalogues seldom address this need. They tend to cover narrow data silos, to target technical users who are not the intended data consumers, and to serve as a technical reference for data workloads (e.g. an ETL process fetching a schema from a data catalogue).
This means that data discovery is not built around the needs of the consumers of the data!
And if the consumers of the data can’t find the data...they can’t use it! Simples.
Data discovery should, instead, be a human-centric capability that allows those who need the data in an organisation, for example product owners or stakeholders of business functions, to find it.
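To make that a little more concrete, here is a minimal sketch of what a consumer-oriented catalogue entry might look like. Everything in it is hypothetical: the `CatalogueEntry` fields, the dataset names and the `search` helper are illustrative assumptions, not the API of any real catalogue product. The point is simply that entries carry plain-language meaning and an owner, and that people can search by business terms rather than by table names.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    """A hypothetical catalogue record written for data consumers."""
    name: str          # technical identifier (illustrative only)
    description: str   # plain-language meaning of the data
    owner: str         # who to ask about semantics and quality
    domain: str        # business domain the data belongs to
    tags: list = field(default_factory=list)


# Illustrative entries; these datasets are made up for the example.
CATALOGUE = [
    CatalogueEntry(
        name="cust_txn_v2",
        description="Every purchase made by a customer, one row per order line",
        owner="Retail Sales team",
        domain="sales",
        tags=["orders", "revenue"],
    ),
    CatalogueEntry(
        name="mkt_px_tick",
        description="Tick-level market prices for listed instruments",
        owner="Market Data team",
        domain="trading",
        tags=["quotes", "prices"],
    ),
]


def search(catalogue, term):
    """Find entries whose description, domain or tags mention a business term."""
    term = term.lower()
    return [
        entry
        for entry in catalogue
        if term in entry.description.lower()
        or term in entry.domain.lower()
        or any(term in tag.lower() for tag in entry.tags)
    ]


# A product owner searches with their own vocabulary, not a table name.
matches = [entry.name for entry in search(CATALOGUE, "purchase")]
```

A real discovery capability would add far more (access requests, quality indicators, examples of the data), but even this shape puts the consumer's vocabulary first.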
Another way of blocking people from using data is to turn accessing it into a hurdle to be overcome.
And it’s not just a question of giving someone access to some complex, ad hoc data discovery process. This is a multifaceted issue: if consumers don’t have access to the right data, in the right environment, presented in the required views, with the right latency and performance...then that data isn’t usable!
For example, we need to consider the environments where the data lives, e.g. (multi-)cloud, on-premises or hybrid. Sometimes you might be able to migrate a whole chunk of data processing to the cloud, including both the data sets and the workloads. Other situations are hybrid in nature: a consumer has to access data that lives in a different location, such as a cloud-based third-party service or an in-house data set that lives in a different data centre.
Additionally, different consumers need to access different data sets with the least friction possible, and may even need different views of the same data. For example, market data needs to be accessed by co-located processes if low latency is a concern, whereas an end user with a cloud-based application might need access to a more convenient but less performant view of the same data.
Access to data might look like a technicality but, in fact, it needs to be thought of strategically. The risk otherwise is locking out users and teams from the data they need to deliver value.
Discovering and accessing the data is only half of the problem.
Even if data is discoverable and accessible, the data can be of poor or indeterminate quality and therefore untrustworthy.
This is a much more common problem than you would hope.
Data quality itself is a dynamic attribute. As a dataset evolves over time, its normal values can change and drift, so quality problems might evade the scrutiny of a static approach to measuring or verifying quality and trustworthiness.
A more modern and intelligent approach is needed to determine if a dataset is of an adequate quality, such as using machine learning to dynamically detect data trustworthiness issues.
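As a rough illustration of the difference between a static and a dynamic check, here is a small sketch. To be clear about what it is: a rolling z-score, which is a simple statistical stand-in and not a machine-learning model, and the window size, threshold, bounds and sample values are all assumptions made up for the example.

```python
from collections import deque
from statistics import mean, stdev


def static_check(value, low, high):
    """Static rule: flag a value only if it falls outside fixed bounds."""
    return not (low <= value <= high)


class RollingCheck:
    """Flag values that sit far from a baseline that moves with the data."""

    def __init__(self, window=5, threshold=4.0):
        self.history = deque(maxlen=window)  # most recent values only
        self.threshold = threshold           # allowed distance in std devs

    def is_anomalous(self, value):
        anomalous = False
        # Only judge once the window is full, so the baseline is meaningful.
        if len(self.history) == self.history.maxlen:
            mu = mean(self.history)
            sigma = stdev(self.history)
            # Skip the test when the window is perfectly flat (sigma == 0).
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True
        self.history.append(value)
        return anomalous


# Made-up readings: a slow upward drift with one bad value in the middle.
values = [100, 102, 101, 103, 102, 104, 105, 107, 106, 108, 300, 109]

# Generous fixed bounds, chosen once and never revisited.
static_flagged = [v for v in values if static_check(v, 0, 1000)]

check = RollingCheck()
flagged = [v for v in values if check.is_anomalous(v)]
```

With these sample values the fixed bounds let the bad reading through, while the rolling check tolerates the gradual drift and flags only the value of 300. A production approach would need to handle seasonality, missing data and much more.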
Additionally, data lineage (understanding where the data comes from and what derivations it underwent) is another vital component of data quality: it enables trust in a data set and, equally, makes it possible to derive that data further to meet new requirements.
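At its simplest, lineage can be thought of as a graph that records which inputs each dataset was derived from. The sketch below assumes a hypothetical `LINEAGE` mapping with made-up dataset names; real lineage tooling captures much more (transformations, versions, timestamps), so treat this purely as an illustration of the idea.

```python
# Hypothetical lineage records: each dataset lists its direct inputs.
LINEAGE = {
    "customer_insights": ["cleaned_orders", "crm_contacts"],
    "cleaned_orders": ["raw_orders"],
    "raw_orders": [],
    "crm_contacts": [],
}


def upstream(dataset, lineage):
    """Return every dataset that `dataset` was derived from, directly or not."""
    seen = set()
    stack = list(lineage.get(dataset, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(lineage.get(parent, []))
    return seen


def original_sources(dataset, lineage):
    """Return the upstream datasets that have no recorded inputs of their own."""
    return {d for d in upstream(dataset, lineage) if not lineage.get(d)}


ancestors = upstream("customer_insights", LINEAGE)
sources = original_sources("customer_insights", LINEAGE)
```

Being able to answer "where did this come from?" in one call is what lets a consumer decide whether to trust a dataset, and what lets an engineer see what would be affected before deriving something new from it.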
Traditionally, batch-based processing has reigned supreme in enterprises. That was OK for traditional use cases, but it is totally inadequate for modern, responsive use cases where users and customers expect a high quality of service as the norm.
This batch-based model is an unnecessary limit on what can be achieved with data, massively constricting data utilisation, yet is widely accepted as standard in many enterprises!
Today there are fewer and fewer reasons to restrict all processing to a batch-based model, especially when its benefits are limited.
For example, modern data processing frameworks try to unify the batch paradigm with the streaming paradigm to make both options viable while reducing the complexity of writing and operating data pipelines.
The icing on the cake is that many of these services are much more accessible today through managed or cloud-native flavours that all major cloud vendors provide (good examples are Apache Flink, which is available as a fully managed service on AWS, and Apache Beam, whose pipelines can run on the fully managed Google Cloud Dataflow).
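The idea behind unifying batch and streaming can be sketched in a few lines: write the processing logic once, against "a sequence of events", and let it run over either a finite dataset or a live feed. The example below is plain Python and deliberately does not use the API of Flink, Beam or any other framework; the `running_totals` function and the sample events are assumptions for illustration only.

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, float]  # (account key, amount); a made-up event shape


def running_totals(events: Iterable[Event]) -> Iterator[Event]:
    """Yield the running total per key after each event is processed."""
    totals: Dict[str, float] = {}
    for key, amount in events:
        totals[key] = totals.get(key, 0.0) + amount
        yield key, totals[key]


# Batch: the whole dataset is already available as a finite list.
batch = [("a", 10.0), ("b", 5.0), ("a", 2.5)]
batch_result = list(running_totals(batch))


def event_stream() -> Iterator[Event]:
    """Stand-in for a live feed; a real one would block waiting for events."""
    for event in batch:
        yield event


# Streaming: the same function handles events one at a time as they arrive.
stream_result = []
for update in running_totals(event_stream()):
    stream_result.append(update)
```

Both paths produce the same updates because the logic is identical; only the source differs. Real frameworks add the hard parts (windowing, state, fault tolerance, late data), which is exactly why a managed service is attractive.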
Familiarity with an existing tool or model is definitely advantageous, but it should not override a better approach if sticking with the familiar option results in a net increase in incidental complexity (e.g. managing a collection of dependent batch jobs manually). Over time, the added architectural and operational complexity of inadequate tooling will restrict what can be achieved with your data.
The issues we have discussed are part of the reality of many enterprises today that are trying to do more with their data. By their nature, these blockers can impact data-centric innovation dramatically and distract teams from delivering value.
There is no silver bullet here, unfortunately. But a strategic and federated approach to data that addresses these challenges in small steps, combined with modern, fit-for-purpose data capabilities, will help put things back on track.