A Plague on Both Their Houses

Who are we to trust in the “architecture wars” that have broken out among some of the data and analytic platform and consultancy vendors? To Warehouse, to Lakehouse or to Mesh: that is the question.

Martin Willcox

10 February 2022, 7 min read

“Love all, trust a few”, one of Shakespeare's characters advises her son. But who are we to trust in the “architecture wars” that have broken out among some of the data and analytic platform and consultancy vendors? To Warehouse, to Lakehouse or to Mesh: that is the question.

For example, there’s an influential vendor in data and analytics that takes the view that essentially all Enterprise analytic data, structured and unstructured, high and low value, belongs in proprietary database file formats – the “Cloud Data Warehouse” design pattern.

And there’s another influential vendor in data and analytics that takes the view that interoperability is always more important than performance, throughput, scalability and reliability - and has therefore convinced itself that essentially all Enterprise analytic data should be stored in open file formats, the so-called “Data Lakehouse” design pattern.

They can’t both be right. And in fact, I would argue that they are both wrong. Or perhaps more precisely, that neither of them is right all of the time.

Flakey and Lumpy, part 1

The first vendor – let’s call them “Flakey” – is wrong because they overlook the importance of multi-temperature data management. Raw data is sometimes multi-structured and is mostly useless to most Enterprise users until it has been refined and aligned – and it is often accessed extremely infrequently (precisely because it is mostly useless to most Enterprise users). Regardless of how we process it and where we store it, there is a cost associated with transforming data from its source system representation to an alternate representation that enables a good analytic DBMS to efficiently provide all of the goodness that it provides. That cost is one that we should pay gladly for the refined data products that are accessed by millions of users and automated processes daily. But equally there may be little sense in paying it for raw sensor data that will be accessed exactly once by an ETL process – and maybe another half-a-dozen times afterwards by a handful of specialist engineers.

The second vendor – let’s call them “Lumpy” – is wrong because they underestimate the scope, diversity and criticality of the analytic workloads that successful production data platforms already support in large enterprises. Redundant data pipelines and data silos are a problem in large and complex organisations, because they drive cost and technical debt. And when you are operating at the kind of scale that consolidation of analytic applications and data demands in these large organisations, then performance, throughput, scalability and reliability matter. And I mean really matter. Bragging about your performance in a decades-obsolete, dumb-as-nails benchmark in which you needed 256 servers just to support four concurrent query streams eloquently makes the point that Lumpy doesn’t understand the importance of optimised data structures for these kinds of workloads, because it isn’t running these kinds of workloads.

The truth, as is so often the case, lies somewhere between the positions taken by the two extremists.

Layering and abstraction and the data platform

To understand why, let’s remind ourselves why layering and abstraction are so important in building successful data platforms.

When we build an analytic data platform, we are typically re-using data for purposes that were not envisaged when the data were created. We can characterise the raw data that arrives from the source systems where it is created as a tier 1 data product; a foundation that is necessary, but not sufficient. It may be incomplete – or duplicated. It may be inaccurate. It may be keyed on attributes that exist only in the source system, so that it can’t be usefully compared with any other data from across the Enterprise. And those keys may very well encapsulate PII that we would prefer not to share too widely. In short: the raw data is a mess and typically needs to be cleaned up before it is useful for analytics.

When we clean up the data, we typically align or integrate it so that it can be usefully compared with other data harvested from around the Enterprise, enabling us to optimise complex, end-to-end business processes. We might prefer “light integration” where we anticipate that the data are going to be re-used and shared only infrequently. Or we might prefer a more normalised representation, if we anticipate that the data are going to be re-used over-and-over in multiple different applications. Regardless, we have refined the data – and in the process converted the raw data to a new representation. Let’s call these data “tier 2” data products.
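To make the step from tier 1 to tier 2 concrete, here is a minimal Python sketch of one possible refinement pass. It is an illustration only: the record types `RawReading` and `RefinedReading`, the `refine` function and the `key_map` lookup are hypothetical names invented for this example, and a real pipeline would have far richer rules. It drops incomplete rows, removes duplicates and swaps source-system keys (which may embed PII) for enterprise-wide identifiers.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RawReading:
    """Hypothetical tier 1 record, exactly as the source system delivers it."""
    source_key: str            # only meaningful in the source system; may embed PII
    amount: Optional[float]    # may be missing


@dataclass(frozen=True)
class RefinedReading:
    """Hypothetical tier 2 record, aligned to an enterprise-wide key."""
    enterprise_id: int
    amount: float


def refine(raw: List[RawReading], key_map: Dict[str, int]) -> List[RefinedReading]:
    """Drop incomplete rows, remove duplicates and replace source keys."""
    seen = set()
    refined = []
    for row in raw:
        if row.amount is None:              # incomplete: nothing to analyse
            continue
        if row.source_key not in key_map:   # cannot be aligned with other data
            continue
        if row in seen:                     # exact duplicate of an earlier row
            continue
        seen.add(row)
        refined.append(RefinedReading(key_map[row.source_key], row.amount))
    return refined


# Assumed sample data: one duplicate, one incomplete row, one unmappable row.
raw_rows = [
    RawReading("jane.doe@example.com", 10.0),
    RawReading("jane.doe@example.com", 10.0),
    RawReading("x", None),
    RawReading("unknown", 5.0),
]
tier2 = refine(raw_rows, {"jane.doe@example.com": 1001, "x": 1002})
```

Note that the output no longer carries the source key at all, which is one simple way to keep PII-bearing keys out of the more widely shared tier.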

Our ultimate goal is not tidy data for tidy data’s sake, it’s business value. We can often build useful applications directly against tier 2 data – and frequently we can and we should build minimum-viable-product prototype analytic applications directly against tier 1 (although we may thank ourselves later if we at least go to the trouble of lightly integrating the data first – tier one-and-a-half, if you will). Equally, it may make sense to create a use-case-specific representation of a subset of the data to optimise the usability and performance of a given, mission-critical application – what we might call a “tier 3” data product.

Flakey and Lumpy, part 2

Lumpy are absolutely right that in general we should think carefully before lavishing the trappings of a proprietary database file format on the tier 1 data. These are typically “cold” data, in the sense that they are accessed only infrequently. Prioritising economy and interoperability makes absolute sense in tier 1, so long as we take care to manage the data appropriately so that the reliability of higher value data products built on the tier 1 foundation is not jeopardised. In a Cloud-first world, these data increasingly belong in open formats on object storage.

And Flakey are absolutely right that paying the cost of loading data to read-optimised file formats is the right choice for the vast majority of tier 1.5, 2 and 3 data products. By definition, Data Warehouses are read-intensive environments. Mostly we write once, update relatively infrequently – and read over-and-over. And repeatedly incurring the performance overhead of reading from open data formats is a poor strategy where that is what we are doing.

Towards the connected data platform

A smarter strategy is to leverage open file formats where they make sense – for the source image and other tier 1 data products – and to leverage optimised file formats everywhere else. Some call this the “Logical Data Warehouse” pattern. Others prefer “Enterprise Data Operating System” or “Connected Data Platform”. But regardless of what we call it, the implication is that we need a high-performance analytic DBMS with an optimised filesystem that can also dynamically query data in open file formats on object storage as efficiently as possible. This is a multi-temperature and also a multi-format data strategy – because we place data in different tiers and locations based on both how frequently it is accessed (“temperature”) and on its complexity.
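As a rough illustration of what a multi-temperature, multi-format placement rule could look like, consider the Python sketch below. Everything in it is an assumption made for the example: the `DataProduct` and `Store` types, the `place` function and the threshold of 100 reads per day are hypothetical, and the choice to promote a frequently read tier 1 product to optimised storage is one possible policy rather than a rule stated above.

```python
from dataclasses import dataclass
from enum import Enum


class Store(Enum):
    OPEN_FORMAT_OBJECT_STORAGE = "open file format on object storage"
    OPTIMISED_DBMS_STORAGE = "optimised file format in the analytic DBMS"


@dataclass(frozen=True)
class DataProduct:
    name: str
    tier: float            # 1, 1.5, 2 or 3, following the tiers described earlier
    reads_per_day: float   # a crude proxy for "temperature"


# Assumed cut-off between "cold" and "hot"; tune to your own workloads.
HOT_READS_PER_DAY = 100.0


def place(product: DataProduct, hot_threshold: float = HOT_READS_PER_DAY) -> Store:
    """Keep cold tier 1 data in open formats; put everything else in optimised storage."""
    if product.tier <= 1 and product.reads_per_day < hot_threshold:
        return Store.OPEN_FORMAT_OBJECT_STORAGE
    return Store.OPTIMISED_DBMS_STORAGE


# Assumed examples: raw sensor data read a handful of times a year,
# and a refined product read by many users and processes daily.
raw_sensor_feed = DataProduct("raw_sensor_feed", tier=1, reads_per_day=7 / 365)
customer_view = DataProduct("customer_view", tier=2, reads_per_day=1_000_000)

sensor_store = place(raw_sensor_feed)
customer_store = place(customer_view)
```

The point of the sketch is only that the decision is made per data product, from its tier and its access frequency, rather than once for the whole platform.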

(By the way, “as efficiently as possible” is an important qualifier. And it is important to understand that there are two general approaches to reading data in open file formats on object storage from an analytic DBMS or “engine”: engine-to-storage; and engine-to-engine. Each has its place. More on that in another blog, another time.)

Flakey and Lumpy, part 3

Lumpy will tell you that their technology is so advanced that the performance overhead of reading from unoptimised open file formats is negligible, but the concurrency limitations revealed by even their own favourite benchmarks suggest otherwise. And their commitment to data management doesn’t go much further than loudly claiming ACID compliance for their own favourite open file format, even if that perhaps ought more accurately to be characterised as multi-version read consistency.

Flakey want world domination: interoperability and multi-temperature data management are not high on their priority list. But their lack of support for sophisticated query optimisation and mixed-workload management makes it tough for their customers to build sophisticated analytic applications on the tier 1 and tier 2 data layers, so that they typically end up creating multiple, redundant tier 3 data products. And then spinning up multiple compute clusters every time more than eight users want to access one of those copies. There’s absolutely nothing wrong with a well-designed Data Mart or six, especially if those Marts are virtual. But tens and hundreds of Data Marts can quickly become a data management and a TCO disaster. The truth is that Flakey’s casual attitude to efficiency and “throw hardware at the problem” auto-scaling model works better for their shareholders than it does for yours. Or for the planet. Although to give them their due, even they can do better than a miserable 64-servers-per-query-stream ratio.

“Both and” beats “either / or”

As the Bard famously has another of his characters say: a plague on both their houses. Your data strategy should be driven by your business strategy, not the other way around – and technology should be subordinate to both. Show vendors that tell you “my way or the highway” the door - and ask them to close it quietly on their way out. Interoperability and multi-temperature data management matter. So do performance and scalability. And so does minimising the number of pipelines, platforms and silos that you need to develop, deploy, manage and – crucially – maintain and change. You shouldn’t have to trade one for the other – and when you combine the right design pattern with the right technology platform, you don’t have to.

Next time: overlaying the Data Mesh on the Connected Data Platform. Is the Data Mesh decentralized development and governance, architecture - or both?