I recently passed 17 years at Teradata and a quarter of a century in the industry. In no particular order, here are ten things I’ve learned in those 20-odd years.
#1: Data-driven organisations are out-competing their peers and eating the world (witness Apple, Amazon, eBay, Facebook, Google, PayPal, etc., etc.).
#2: Connecting, integrating and sharing data is (mostly) a virtuous circle; managing data in silos is (almost always) a vicious cycle. Putting detailed sales, order and inventory data together and sharing them with partners and suppliers enabled Wal-Mart to dominate grocery Retail in the 90s, by creating a demand-driven supply chain that simultaneously improved sales and customer experience whilst crushing costs. And Amazon similarly dominates Retail today by combining purchase data with behavioural data to understand what customers want better than its competitors do - and by enabling partners to leverage the platform that it has created, generating even more data about even more customers. I have learned that if we are not optimising an end-to-end value chain, we are mostly doing it wrong.
#3: Managing, connecting, integrating and sharing data is often hard and never comes for free. That effort and expense need to be aligned with company strategy and cost-justified, because whilst all data have value, some data are more valuable than others - and the value of many datasets varies over time. I have learned that there is always, always, always (at least) one schema - but equally that it is a mistake to over-model data, especially where the value of those data and the extent to which they will be re-used are unclear. I have also learned that integration is not the goal – and to leave well alone when the cost of integration exceeds the benefit.
#4: The only constant in modern business is change - and data and data products delayed are business and societal value denied. I have learned that the simplest way to deliver value faster is to: start with the end in mind and build what is necessary, avoiding over-engineering; re-use and extend existing data assets and services wherever practicable, avoiding the creation of expensive and difficult-to-maintain data silos through repeated reimplementation; and automate wherever possible.
#5: Taking processing to data scales and performs better than shipping data to processing nine times out of ten – and optimising for acquisition and loading in a read-intensive environment is wrong, wrong, wrong. Design for access!
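To make point #5 concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, the column names and the helper functions are all hypothetical, invented for illustration, and an in-process SQLite database is only a stand-in for a real data platform. It compares pulling every row out of the database to aggregate in the application with asking the database to do the aggregation and return only the answer.

```python
import sqlite3
from collections import defaultdict


def build_demo_db():
    """Create an in-memory database with a hypothetical 'sales' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store TEXT, amount INTEGER)")
    # 1,000 made-up rows spread across three made-up stores.
    rows = [(f"store_{i % 3}", i) for i in range(1, 1001)]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    return conn


def totals_ship_data(conn):
    """Ship data to processing: fetch every row, then aggregate in Python."""
    rows = conn.execute("SELECT store, amount FROM sales").fetchall()
    totals = defaultdict(int)
    for store, amount in rows:
        totals[store] += amount
    # Second value: how many rows crossed the database boundary.
    return dict(totals), len(rows)


def totals_push_down(conn):
    """Take processing to data: the database aggregates, we fetch the result."""
    rows = conn.execute(
        "SELECT store, SUM(amount) FROM sales GROUP BY store"
    ).fetchall()
    return dict(rows), len(rows)


if __name__ == "__main__":
    conn = build_demo_db()
    shipped, shipped_rows = totals_ship_data(conn)
    pushed, pushed_rows = totals_push_down(conn)
    print("same answer:", shipped == pushed)
    print("rows moved when shipping data:", shipped_rows)
    print("rows moved when pushing down:", pushed_rows)
```

Both functions produce the same totals, but the second moves one row per store rather than one row per sale. On a toy table that difference is trivial; the argument in #5 is that it stops being trivial as the data grow.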
#6: Large and complex organisations are just that: large, complex and diverse – so a successful data platform is an open data platform that supports multiple tools, technologies, and languages. That said, tools and technologies that are simple to deploy, use, manage, maintain and optimise should often be preferred. Good old SQL may have its limitations, but there’s an awful lot to be said for a simple, declarative language when it comes to the optimisation of complex queries that feature join, merge, aggregation and sort processing. And all worthwhile analytics features a lot of join, merge, aggregation and sort processing.
#7: Data are a force for both good and ill - and ethical considerations should underpin the way that data are collected, managed, exploited and, crucially, protected and secured.
#8: Machine Learning will be ubiquitous and the basis of competitive advantage in many industries in the very near future – and Machine Learning is first and foremost a data problem. At the same time, organisations can’t Machine Learn everything – and whilst Machine Learning relies upon good data, good data underpin more than just Machine Learning. John Snow did not need a Convolutional Neural Network to change how we think about cholera - and simple A/B testing remains a powerful tool for even the most sophisticated e-Commerce platforms.
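As an illustration of how little machinery simple A/B testing needs, here is a minimal sketch of one common way to analyse an A/B test: a two-sided, pooled two-proportion z-test, written with nothing but the Python standard library. The function name and the conversion counts are hypothetical, and a real experiment would also need care over sample sizes, stopping rules and multiple comparisons.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variants A and B.

    conv_a, conv_b: number of conversions in each variant.
    n_a, n_b: number of visitors in each variant.
    Returns (z, p_value) for a two-sided test using the pooled estimate.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both variants need at least one visitor")
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Made-up numbers: 100/1000 conversions for A, 150/1000 for B.
    z, p = two_proportion_z_test(100, 1000, 150, 1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

The hard part of an experiment like this is rarely the arithmetic; it is having trustworthy data about who saw which variant and what they did next.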
#9: Data are moving to the Cloud. And we should increasingly think of the Cloud less as a place, more as a next-generation computing paradigm that provides: a rich ecosystem of composable services; on-demand infrastructure; API-driven everything; automated operations; usability and simplicity. In particular, object storage technologies have the potential to succeed where Hadoop failed and to provide Enterprises with a “data operating system” that will enable radical architectural simplification.
#10: Data architecture and data management aren’t cool right now – and since digitalisation will never achieve its full potential without them, as an industry we need to fix that.