Governing Data Across the Analytical Ecosystem

The proliferation of multiple data platforms and shared responsibilities between IT and the Business requires a renewed focus on Data Governance

Mike Dampier

November 6, 2018 | 5 min read

Historically, data governance has been a challenging topic of conversation for decision makers. Technologists have faced an uphill battle trying to engage business consumers to actively take ownership of “their” data.

Organizations require a huge cultural shift to move from a single integrated Data Warehouse to an analytic ecosystem optimized to deliver analytic results while minimizing complexity in the cloud and on-premises. The rush to enable advanced analytics without proper planning leads to new ways to duplicate, orphan and silo data.

The proliferation of multiple data platforms and shared responsibilities between IT and the Business requires a renewed focus on Data Governance that most organizations haven’t even contemplated.

Figure 1. Tightly coupled data at the core of the Analytical Ecosystem gives way to loosely coupled and non-coupled data at the edge

What is the Cultural Shift?

Data scientists, business users, analysts and casual data consumers all want fast access to clean, related, pertinent, and timely data. Enabling and deploying new advanced analytics, machine learning and deep learning capabilities in a cost-effective manner across the entire analytical ecosystem places new demands on Business and IT. Explicitly, the shift is in who deploys data, where it is deployed, how tightly it is integrated, who provides business context and who governs what. That’s a lot of change from the days of a “Single Version of the Truth”.

Who deploys data?
This is the easy part. With the introduction of the Analytical Ecosystem, both IT and the Business have the ability to deploy data. But data deployed by IT and data deployed by the Business likely have very different lineage, provenance, and lifecycles. IT, you are still the owner of enterprise data (application sourced, trusted, transactional, and master data). It is IT's job to deploy that information in appropriate forms. Business, it is your job to integrate this data with your own sources, like external public domain data (weather) or purchased data (demographics/psychographics), in your sandboxes, lakes, etc.

Who governs what?
This is a bit more complex. If you guessed the group that ingested the data, then you would be partially correct. The Business still owns all of Data Governance from a quality perspective. The Business also still owns the business-related metadata associated with enterprise data. But now the Business also owns governing the business context (business and technical metadata) of the data they are ingesting to ensure that it can be integrated (if needed) with enterprise data. Did IT just shirk some of their responsibility? Not exactly; let’s discuss their new responsibilities next.

Where is data deployed?
This topic is actively debated among nearly all of my clients and across entire industries. Like the messaging coming from Hadoop vendors a few years ago, Cloud vendors are now advocating that all your data should be on their platform. Every time there is a new technology (and they are coming at us faster now), its vendors have the same old one-liner, “All your data should be here”. Should your data be in a cloud vendor’s Data Lake? Yes, some of your data should absolutely be in the Data Lake (on-premises or in the cloud). Some of your data should also be in Sandboxes, Labs and Data Marts. Additionally, your clean, curated and trusted data should be in your Data Warehouse.

Most importantly, the platforms should be interconnected via a high-speed fabric that enables physical and virtual projections. IT, this is your new governance responsibility. Deploying centrally managed, scalable data virtualization technologies enables sharing data at runtime without having to build complicated data synchronization processes. This has real business value for certain analytical use cases and should be deployed as a standard capability within your analytical ecosystem.
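
To make "sharing data at runtime" concrete, here is a minimal toy sketch. It assumes two separate SQLite databases stand in for a Data Warehouse and a Data Lake; real virtualization products work across different engines and networks, and the table names, columns, and the `build_ecosystem` and `customers_with_weather` helpers are hypothetical. The point is only that the join happens at query time, with no copy or synchronization job between the two stores.

```python
import sqlite3


def build_ecosystem() -> sqlite3.Connection:
    """Create two independent in-memory databases on one connection."""
    # The main database stands in for the IT-owned Data Warehouse.
    conn = sqlite3.connect(":memory:")
    # A second, separate database stands in for the Business-owned Data Lake.
    conn.execute("ATTACH DATABASE ':memory:' AS lake")

    # Trusted enterprise data lives in the warehouse.
    conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, region TEXT)")
    conn.executemany(
        "INSERT INTO main.customers VALUES (?, ?)",
        [(1, "north"), (2, "south")],
    )

    # Externally sourced data (weather) lives in the lake.
    conn.execute("CREATE TABLE lake.weather (region TEXT, avg_temp_c REAL)")
    conn.executemany(
        "INSERT INTO lake.weather VALUES (?, ?)",
        [("north", 4.5), ("south", 21.0)],
    )
    return conn


def customers_with_weather(conn: sqlite3.Connection) -> list:
    """Join across both stores at runtime; nothing is copied between them."""
    query = (
        "SELECT c.id, c.region, w.avg_temp_c "
        "FROM main.customers AS c "
        "JOIN lake.weather AS w ON w.region = c.region "
        "ORDER BY c.id"
    )
    return conn.execute(query).fetchall()
```

Calling `customers_with_weather(build_ecosystem())` returns one row per customer with the matching regional temperature, read from both stores in a single query.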

IT, it is also your responsibility to know where all the data is deployed, what the data’s technical metadata is, what the data’s lineage and provenance are, who is using the data, how the data is being used, and how frequently. Why? Because it is your responsibility to manage the overall cost to deploy and manage the Analytical Ecosystem. By enabling Self Service (including Data Ingestion) for the Business, you must take on new governance roles to ensure a fully functioning, self-provisioning and cost-effective environment.
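
As an illustration of what IT would need to track, the sketch below models a very small catalog. It is not a real governance product: the `DatasetRecord` and `Catalog` classes and all of their field names are hypothetical, chosen to mirror the items listed above (location, technical metadata, lineage, provenance, users, and frequency of use).

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetRecord:
    """One catalog entry (hypothetical structure)."""
    name: str
    platform: str                 # where the data is deployed, e.g. "sandbox"
    owner: str                    # "IT" or "Business"
    schema: Dict[str, str]        # technical metadata: column name -> type
    lineage: List[str] = field(default_factory=list)  # upstream dataset names
    provenance: str = "unknown"   # original source of the data
    access_counts: Counter = field(default_factory=Counter)  # user -> accesses

    def record_access(self, user: str) -> None:
        """Note that a user read this dataset once."""
        self.access_counts[user] += 1

    def total_accesses(self) -> int:
        return sum(self.access_counts.values())


class Catalog:
    """Registry of everything deployed across the ecosystem."""

    def __init__(self) -> None:
        self._records: Dict[str, DatasetRecord] = {}

    def register(self, record: DatasetRecord) -> None:
        self._records[record.name] = record

    def get(self, name: str) -> DatasetRecord:
        return self._records[name]

    def by_platform(self, platform: str) -> List[str]:
        """Names of datasets deployed on a given platform, sorted."""
        return sorted(r.name for r in self._records.values() if r.platform == platform)

    def unused(self) -> List[str]:
        """Datasets nobody has accessed: candidates for a cost review."""
        return sorted(r.name for r in self._records.values() if r.total_accesses() == 0)
```

With records registered at ingestion time and `record_access` called on each read, `unused()` gives IT a simple starting list of data that costs money to keep but that nobody is using.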