I am having lots of discussions with colleagues and customers about the Data Mesh concept first articulated by Zhamak Dehghani – so many, in fact, that I am currently working on a white paper with a small group of colleagues. White papers are mighty endeavors accompanied by multiple rounds of reviews and feedback. So, whilst we work through that process, I thought that I would take this opportunity to quickly share some headline thoughts with you about why we are enthusiastic about the Data Mesh.
Designing and building analytic solutions is hard, for at least three reasons. First, requirements are often ambiguous and fluid. Second, these solutions rely on the repurposing of data that may have been generated by processes – and for purposes – unrelated to the current business problem. Third, integrating analytic insights into business processes requires complex trade-offs to be discovered, understood and assessed.
For these reasons, successful data and analytic platforms are – and arguably always have been – constructed incrementally and in stages. This is why successful data-driven organisations focus on the rapid delivery of data products aligned with real-world requirements.
Data practitioners have been relatively slow to adopt Agile software development methods – but where these methods are adopted and are combined with automation tools and DevOps processes, we have often seen 10X improvements in time-to-market for data products. This is the motivation for the development of Teradata’s DataOps frameworks and tooling.
The Data Mesh concept and Domain-Driven Design (DDD) principles give us a framework and approach for the intelligent decomposition of a large problem space (development of the data platform) into a set of smaller problems (individual data products) that are tractable using Agile development methods and “two pizza” development teams.
Fundamental to DDD is the idea of bounded context, i.e., the definition of explicit interrelationships between domains. Because “data love data” and frequently need to be combined across functional and domain boundaries, lightweight governance and data management processes that ensure that these interrelationships are “designed-in” to individual data products are critical. Understanding, defining and enforcing the minimum set of Primary Key / Foreign Key relationships required to reliably and accurately join and compare data across different domains is vitally important in this process, as are appropriate business, technical and operational metadata that enable data and data products to be discovered and re-used.
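To make the Primary Key / Foreign Key point concrete, here is a minimal sketch of a lightweight cross-domain check. Everything in it is hypothetical and for illustration only: the `ForeignKey` class, the `orphaned_keys` helper, the product names `sales.orders` and `crm.customers`, and the sample rows are assumptions, not part of any Teradata tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForeignKey:
    """A declared relationship between two data products (hypothetical model)."""
    child_product: str
    child_column: str
    parent_product: str
    parent_column: str


def orphaned_keys(products, fk):
    """Return the child key values that have no matching key in the parent product."""
    parent_keys = {row[fk.parent_column] for row in products[fk.parent_product]}
    child_keys = {row[fk.child_column] for row in products[fk.child_product]}
    return sorted(child_keys - parent_keys)


# Hypothetical data products owned by two different domains.
products = {
    "sales.orders": [
        {"order_id": 1, "customer_id": "C1"},
        {"order_id": 2, "customer_id": "C2"},
        {"order_id": 3, "customer_id": "C9"},  # no such customer in the CRM domain
    ],
    "crm.customers": [
        {"customer_id": "C1", "segment": "retail"},
        {"customer_id": "C2", "segment": "wholesale"},
    ],
}

fk = ForeignKey("sales.orders", "customer_id", "crm.customers", "customer_id")
orphans = orphaned_keys(products, fk)
print(orphans)
```

A check like this covers only the minimum set of keys needed to join across domains, which is the sense of "lightweight" intended here: it says nothing about how each domain models the rest of its data.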
It will often be appropriate to create enterprise domains to support the realization of cross-functional data products – and where interoperability has been designed in to underlying data products, these cross-functional data products can be built better, cheaper and faster.
“Lightweight” is a crucial qualifier. Over-engineering and over-modelling can slow the development of data products to a crawl. Especially when it is unclear which data will be frequently shared-and-compared – as it often is when developing MVP data products – “light integration” approaches like Teradata’s LIMA framework should often be preferred. “Bi-modal” analytics and “Data Labs” also have a role to play here.
Technical debt is a major drag on digital transformation initiatives. Re-use of data products is critical to the reduction of technical debt. Most data has very little value until it has gone through a process of cleansing and refinement. Wherever possible and practical we should do this once, rather than constructing “pipeline jungles” of redundant, overlapping data transformation processes to apply essentially the same transformations to essentially the same data over-and-over. Very many organisations are moving towards the use of Feature Stores to support their Machine Learning initiatives for precisely this reason.
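The "cleanse once, re-use many times" idea can be sketched in a few lines. This is an illustration under stated assumptions rather than a real pipeline: the function names, the cleansing rules and the sample records are all hypothetical.

```python
from collections import Counter


def cleanse_customers(raw_rows):
    """Cleanse once: normalise ids and country codes, drop rows without an id,
    and de-duplicate on the normalised id (the last record seen wins)."""
    cleansed = {}
    for row in raw_rows:
        customer_id = (row.get("customer_id") or "").strip().upper()
        if not customer_id:
            continue  # unusable without a key
        country = (row.get("country") or "").strip().upper()
        cleansed[customer_id] = {"customer_id": customer_id, "country": country}
    return list(cleansed.values())


def customers_per_country(customers):
    """Consumer 1: a reporting aggregate built on the cleansed product."""
    return Counter(row["country"] for row in customers)


def is_domestic_feature(customers, home_country):
    """Consumer 2: a simple ML feature built on the same cleansed product."""
    return {row["customer_id"]: row["country"] == home_country for row in customers}


# Hypothetical raw extract with inconsistent formatting and a duplicate.
raw = [
    {"customer_id": " c1 ", "country": "gb"},
    {"customer_id": "C1", "country": "GB "},
    {"customer_id": "", "country": "US"},
    {"customer_id": "c2", "country": "de"},
]

customers = cleanse_customers(raw)  # the transformation runs once
per_country = customers_per_country(customers)
domestic = is_domestic_feature(customers, "GB")
```

Both consumers read the same cleansed output, so a change to the cleansing rules is made in one place instead of in every downstream pipeline.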
Some commentators would have us believe that the most important part of the Data Mesh concept is the ability to rapidly provision containerised infrastructure. Bluntly, it isn’t. Provisioning infrastructure was never the “long pole in the tent”, even before Cloud deployment models made it even simpler and even quicker. The long pole in the tent is cleansing and semantically aligning data so that they can be reliably shared-and-compared. See this cautionary tale of wrangling fairly basic COVID metrics for one recent example of how complex this can be even within the context of a single domain (and just how negative the consequences can be).
Federating the development of complex data products does not automatically imply the federation of their deployment. In fact, a spectrum of deployment options is available to organisations deploying Data Mesh solutions. Because these different strategies are associated with fundamentally different engineering trade-offs, it is important that organisations frame these choices correctly and are intentional about their decisions. In general terms, there are three different strategies for Data Mesh deployment: (1) schema co-location, (2) schema connection, and (3) schema isolation. Note that these choices are not mutually exclusive and that most real-world implementations will continue to use a combination of these approaches.
Even at the low end, the data platforms in Global 3,000 organisations typically support 50+ analytic applications and run over a billion queries per year – with up to two orders of magnitude increases in query volumes likely during the next decade. Very many enterprise analytic workloads are characterised by: complex, stateful processing; repeated execution against continuously changing data; and embedded deployment in mission-critical business processes. In addition, improvements in the performance of multi-core CPUs continue to outpace improvements in the performance of network and storage sub-systems. For all of these reasons, the schema co-location and schema connection strategies continue to offer important performance, scalability and TCO advantages in very many scenarios. Note that schema connection strategies assume the use of a high-performance and scalable data fabric, like Teradata’s QueryGrid technology.
We are enthusiastic about the Data Mesh concept because it places intelligent decomposition front-and-centre in the rapid development of data platforms and complex data products. Our recommended approach to implementation of Data Mesh based architectures is to create separate schemas for each domain. Responsibility for data stewardship, data modelling, and population of the schema content is owned by experts with business knowledge about the specific domain under construction. This approach removes many of the bottlenecks associated with attempting to implement a single, centralized consolidation of all enterprise data into a single schema. The domain-oriented (and semantically linked, where appropriate) schemas provide a collection of data products aligned to areas of business focus within the enterprise.
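As a rough illustration of one schema per domain inside a single database instance, the sketch below uses Python's built-in `sqlite3` module, with SQLite's `ATTACH DATABASE` standing in for domain schemas. This is only an analogy chosen so that the example is self-contained and runnable; it is not how a Teradata system is configured, and the domain names, tables and rows are hypothetical.

```python
import sqlite3

# One connection plays the role of the single database instance.
conn = sqlite3.connect(":memory:")

# Each attached in-memory database acts as a separately owned domain schema.
for domain in ("crm", "sales"):
    conn.execute(f"ATTACH DATABASE ':memory:' AS {domain}")

# Each domain team models and populates its own schema.
conn.execute(
    "CREATE TABLE crm.customers (customer_id TEXT PRIMARY KEY, segment TEXT)"
)
conn.execute(
    "CREATE TABLE sales.orders ("
    "order_id INTEGER PRIMARY KEY, customer_id TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [("C1", "retail"), ("C2", "wholesale")],
)
conn.executemany(
    "INSERT INTO sales.orders VALUES (?, ?, ?)",
    [(1, "C1", 10.0), (2, "C1", 5.0), (3, "C2", 20.0)],
)

# A cross-domain data product: the shared customer_id key is the semantic link.
revenue_by_segment = conn.execute(
    """
    SELECT c.segment, SUM(o.amount)
    FROM sales.orders AS o
    JOIN crm.customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.segment
    ORDER BY c.segment
    """
).fetchall()
print(revenue_by_segment)
```

Because both schemas sit behind the same connection, the cross-domain join runs in one place without moving data between systems, while each domain keeps ownership of its own tables.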
Most large enterprises already operate across multiple geographies – and are increasingly leveraging multiple Cloud Service Providers (CSPs). That makes the Connected Data Warehouse fundamental to at-scale Data Mesh implementation. Within a CSP and within a geography, co-location of multiple schemas aligned to specific business domains within a single, scalable database instance gives the best of two worlds: agility in implementation and high performance in execution.
More on this topic in a fully-fledged white paper soon.