As companies strive to become more agile in today’s ever-changing business world, a common theme is getting data faster and, in turn, getting insights from that data faster. That’s where the notion of schema-free queries often comes in: all sorts of unstructured data goes into files in Hadoop (the Hadoop Distributed File System, queried with, e.g., Hive or Drill) or into SQL and NoSQL databases that support late binding. Late binding, to get on the same page, is the practice of transforming and binding data based on relationships at program runtime, as opposed to early binding, where transformations are done when data moves from source systems into the database.
These databases or data stores often enable rapid exploration via schema-free queries. And, it’s true that rapid exploration is a key piece of any agile company’s foundation, just as it’s true that some corners of the technology world are evolving so quickly that having to slow down and put governance and forethought into data storage and structure can be the difference between success and failure.
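To make the late versus early binding distinction concrete, here is a minimal Python sketch. The event records, field names, and function names are all hypothetical and exist only for illustration; no particular database or product works exactly this way. The late-bound query decides what "amount" means every time it runs, while the early-bound load decides once, when the data arrives.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (no schema on write).
RAW_EVENTS = [
    '{"user": "a1", "amount": "19.99", "tag": "checkout"}',
    '{"user": "b2", "amount": "5.00", "tag": "checkout"}',
    '{"user": "c3", "tag": "browse"}',
]


def late_bound_total(raw_events):
    """Late binding: parse and interpret the data at query time."""
    total = 0.0
    for line in raw_events:
        event = json.loads(line)
        # The meaning and type of "amount" are decided here, inside the query.
        total += float(event.get("amount", 0))
    return total


def early_bound_load(raw_events):
    """Early binding: transform once, at load time, into a fixed structure."""
    rows = []
    for line in raw_events:
        event = json.loads(line)
        rows.append({
            "user": str(event["user"]),
            # The interpretation of "amount" is fixed here, for every consumer.
            "amount_cents": round(float(event.get("amount", 0)) * 100),
            "tag": str(event["tag"]),
        })
    return rows


def early_bound_total(rows):
    """Query the already-structured rows; no interpretation left to do."""
    return sum(row["amount_cents"] for row in rows) / 100
```

Both paths give the same answer on this data. The difference is where the interpretation lives: in every individual query, or in one governed load step.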
But with schema-free queries, it also pays to be prudent. If you’re not careful, they can make your data dishonest.
The fact that data isn’t wrapped in governance is fine (and preferred) for just poking around. We opt for schema-free queries in the first place because a lot is changing around us and new data sources are emerging regularly. Going schema-less is great for an initial prototype, but once we move past the prototype stage, the lack of schema quickly becomes a governance nightmare.
A Crumbling Analytics House Built on Schema-Free
Without governance, whatever you produce, whether it’s a dashboard or some metric read-out, could begin lying to you. This is the exact problem we faced in the mid-2000s during my tenure at eBay, when an entire experimentation platform, with hundreds of experiments built on late binding, was starting to fold like a house of cards. The reason was that the incoming data started changing on us, and there were no controls or governance in place to catch the change.
It only takes one developer upstream, going about their day-to-day work, to change the meaning of a tag, thinking they are the only one using it. Once that happens, everything built with that data could produce results that are anywhere from slightly to completely different. Plus, there is no lineage with schema-free queries, so you won’t even know that anything has been changed!
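As a rough illustration of the kind of control that was missing, the sketch below checks incoming records against a small declared contract. The field names, allowed tag values, and the check_record function are assumptions made up for this example, not a description of any system at eBay or Teradata.

```python
# Hypothetical contract for incoming records: field name -> expected type.
EXPECTED_SCHEMA = {
    "user": str,
    "amount": float,
    "tag": str,
}

# Hypothetical set of tag values that downstream consumers know how to read.
ALLOWED_TAGS = {"checkout", "browse"}


def check_record(record):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    # Every declared field must be present and have the declared type.
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {actual}"
            )
    # Fields nobody declared are reported rather than silently accepted.
    for field in record:
        if field not in EXPECTED_SCHEMA:
            problems.append(f"unexpected field: {field}")
    # A tag value outside the agreed set is a sign that upstream changed.
    tag = record.get("tag")
    if isinstance(tag, str) and tag not in ALLOWED_TAGS:
        problems.append(f"unknown tag value: {tag}")
    return problems
```

A check like this only catches changes that show up in the shape, type, or allowed values of the data. If an upstream developer changes what a tag means while its values look the same, only agreed definitions and lineage will surface it.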
Put simply, schema-free queries can quickly become a foundation for a house that crumbles after it’s built.
Don’t get me wrong: late binding is a must-have capability in today’s data infrastructure. We have long been working on getting more and more late binding features into our various products, with the latest example being high-performance binary JSON storage and processing natively within the Teradata database.
Building Trust in Your Data
While systems need to support both late and early binding, and both tight and loose coupling, the evolution towards schema (even if only for subsets of data) is a must-have step for any data product development process.