Hindsight is a wonderful thing. Looking back, the promise of Hadoop-based Data Lakes was always going to be hard to live up to. Slurping up oceans of raw data and applying schema-on-read to analyse those data as and when needed was – and still is – appealing to those of us who wanted to move faster and to support a wider range of analytics. Alas, our visions of pristine Data Lakes just waiting to be fished for insights proved in many cases to be little more than a mirage. Not every Data Lake is a data swamp – like all technologies, the Hadoop stack has a sweet spot. But many organisations report that it has become harder and harder to plumb the murky depths of the Data Lake to find anything of value – and that few, other than dedicated data scientists, have the skill, time or inclination to plunge in.
I vividly remember describing the joys of schema-less data management at an industry event during the go-go Data Lake days. A 30-something guy at the back of the room cleared his throat, raised his hand – and politely enquired whether I knew how many people at the multi-national, multi-billion-dollar online travel company he worked for could make sense of raw, un-sessionized web-log data, un-filtered for bot traffic. As I recall, his best guess was about five – all of them in the website engineering group.
None of which is to say that the problems the Lake crowd were trying to fix weren’t real; data volume, velocity and variety were increasing rapidly – they still are. Organisations need to be able to ingest, refine and exploit those data (a) according to the different demands of different use-cases, (b) cost-effectively and (c) before the Sun runs out of hydrogen fuel. Traditional waterfall approaches to the management and integration of structured data add significant value – but also significant effort, cost and time. The result is not just tension between business and IT, but at least a three-way tension between the business, what we might characterise as “application IT”, and “data IT” or the CDO.
I was in the audience at another industry conference around the same time when a fully paid-up member of ‘Team Big Data’ described a fairly basic online reporting application he’d built. It only took him 12 months with a team of six developers! All of the data required were already in the company’s Data Warehouse, but his first move was to copy all of those data out into the Data Lake – and then to jump through hoops to overcome the performance and concurrency limitations of the Hadoop ecosystem. Leaving the data where they were, I think I could have built the same application in a couple of weeks with a good DBA and two good application developers!
Some Data Lakes did at least provide Data Scientists with the ultimate R&D environment, allowing them to collect diverse data and perform experiments, unconstrained by the need to create robust, consistent and re-usable processing rules and data structures. But in many organisations, tens of millions of dollars were spent meeting the needs of this tiny user audience – without any clear strategy or plan for how the resulting insights could or would be used to change the business. With no clear route to production, many of these R&D Lakes continue to slide into quiet irrelevance.
Data and analytics only have real value when they are used by organisations to improve performance by reducing costs, increasing customer satisfaction or driving new growth. In a time of huge economic uncertainty, what matters are time-to-value and agility. The best way I know to go faster is to eliminate unnecessary work – and to automate as much as possible of the rest. Re-using data products is the ultimate “eliminate unnecessary work” play – and it is how successful organisations are able to move rapidly from experimentation and testing to the deployment of predictive analytics in production and at scale.
As data and analytics migrate to the Cloud, organisations that continue to take a laissez-faire approach to data management are likely to fail for a second time, this time with Cloud Object Store-based Data Lakes. Used appropriately, Cloud Object Storage has the potential to enable radical architectural simplification at scale – to become the “Enterprise Data Operating System” that Hadoop once aspired to be.
Over the next few weeks, I’ll be sharing the Teradata approach to building a Cloud Analytic Architecture that leverages the benefits of Cloud ecosystems to enable us both to optimise complex, end-to-end business processes and to move quickly. Stay tuned to learn how we see the Cloud not as a place, but rather as a new computing paradigm that will enable the deployment of better data products that can be accessed and exploited more widely.