Article

Self-Service Analytics: Classifying Data and Analytic States

Learn how to better classify data and analytics within the analytic ecosystem by examining the various states of data and analytics within organizations.

September 17, 2019 | 4 min read

Dwayne Johnson

Paul Huibers, Ph.D.

In part 1 of our series on self-service analytics, we defined user personas and the importance of self-provisioning. In part 2, we described the value of enabling each persona and the need to get our arms around the data sprawl, which is further complicated by self-service analytics. In part 3, we'll step back to look at the various states of data and analytics within organizations to provide a better way to classify data and analytics within the analytic ecosystem.

Because the problem involves the entire analytic ecosystem, it is best to approach it from a conceptual perspective, starting with the data. At the highest level, data resides in one of two zones: Operational or Exploration.

The Operational Zone is a “production” area which has clearly defined, measurable and monitored business-driven service level agreements (SLAs), such as expectations around data quality, security, availability, accessibility, recoverability and data retention.

The Operational Zone can be broken down into three layers: Acquisition, Integration and Access. The Acquisition layer is used to acquire raw, unchanged data from source systems. It is also used to standardize and conform the data into minimal viable data products. The Integration layer establishes common keys and performs transformations, typically to near third normal form (3NF) structures, to enable reuse of high-quality trusted data. The Access layer provides easier, optimized access, typically in flattened or star structures, for specific business needs via views, materialized views and application programming interfaces (APIs).
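As a rough illustration of how the three layers relate, the Python sketch below passes a few invented records through acquisition, integration and access steps. The record fields, function names and in-memory lists are all assumptions made for this example; a real ecosystem would use ingest tooling, database tables and views rather than Python dictionaries.

```python
from collections import defaultdict

# Hypothetical raw records from source systems (invented for illustration).
RAW_ORDERS = [
    {"cust": " 001 ", "amount": "19.90", "src": "web"},
    {"cust": "002", "amount": "5.00", "src": "store"},
    {"cust": "001", "amount": "10.10", "src": "store"},
]
RAW_CUSTOMERS = [
    {"id": "001", "name": "Ada"},
    {"id": "002", "name": "Grace"},
]


def acquire(raw_rows):
    """Acquisition: keep an unchanged raw copy and a standardized copy."""
    raw_copy = [dict(r) for r in raw_rows]
    conformed = [
        {"cust": r["cust"], "amount": float(r["amount"]), "source": r["src"]}
        for r in raw_rows
    ]
    return raw_copy, conformed


def integrate(conformed_orders, customers):
    """Integration: resolve a common customer key, keep tables normalized."""
    customer_table = {c["id"]: {"name": c["name"]} for c in customers}
    order_table = []
    for order in conformed_orders:
        key = order["cust"].strip()  # assumed key-cleaning rule
        if key in customer_table:
            order_table.append(
                {"customer_id": key, "amount": order["amount"], "source": order["source"]}
            )
    return customer_table, order_table


def access_view(customer_table, order_table):
    """Access: a flattened per-customer summary for one business need."""
    totals = defaultdict(float)
    for order in order_table:
        totals[customer_table[order["customer_id"]]["name"]] += order["amount"]
    return {name: round(total, 2) for name, total in totals.items()}


raw, conformed = acquire(RAW_ORDERS)
customer_table, order_table = integrate(conformed, RAW_CUSTOMERS)
view = access_view(customer_table, order_table)
print(view)
```

The raw copy is never modified, which mirrors the idea that the Acquisition layer retains unchanged source data while later layers add structure on top of it.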

The Exploration Zone is a data test bed for doing experimental analytics on a variety of data in a less governed manner. Sandboxes (aka data labs) can be set up quickly and monitored. New experimental data loaded into these sandboxes can be joined with other analytic data stores in the Operational Zone. The exploration environment is not to be used for highly governed analytic delivery. As data and analytics within the Exploration Zone are operationalized, they move through the standard software development life cycle (SDLC) path into the Operational Zone.

Data fulfills business needs at different levels of data quality, reliability and integrity, which we will classify as Bronze, Silver and Gold.

Bronze data is “bring your own” user data. Users load the raw data, perform data cleansing and conform it into something more consumable for their analytic efforts.

Silver data comes into the analytic ecosystem through a formal ingest process into the acquisition layer. Data is generally cleansed and conformed into usable structures, sometimes referred to as minimal-viable-data-products. The data may be used by one or several users.

Gold data is Silver data that has been further refined (i.e. integrated) into highly reusable and trusted data, often referred to as trusted data products. Gold data can be leveraged across the enterprise as it has a high degree of data integrity and quality. As discussed in part 2, Gold data can be used by all user personas, but General Consumers and Data Analysts, who make up 90-95% of the total user population, use it exclusively.
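The three classes can be expressed as a small decision rule. The sketch below is one possible reading of the definitions above, not a prescribed implementation: the `Dataset` fields, the `classify` function and the mapping of Bronze to the Exploration Zone and Silver and Gold to the Operational Zone are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Zone(Enum):
    EXPLORATION = "exploration"
    OPERATIONAL = "operational"


class Tier(Enum):
    BRONZE = "bronze"
    SILVER = "silver"
    GOLD = "gold"


@dataclass(frozen=True)
class Dataset:
    name: str
    formally_ingested: bool  # arrived through the formal ingest process
    integrated: bool         # refined into a trusted data product


def classify(dataset: Dataset) -> Tier:
    """Assign a tier from how the data entered and how far it was refined."""
    if not dataset.formally_ingested:
        # "Bring your own" user data stays Bronze, however the user shaped it.
        return Tier.BRONZE
    return Tier.GOLD if dataset.integrated else Tier.SILVER


def zone_of(tier: Tier) -> Zone:
    """Assumed mapping: only Bronze data lives in the Exploration Zone."""
    return Zone.EXPLORATION if tier is Tier.BRONZE else Zone.OPERATIONAL


for ds in (
    Dataset("user_upload.csv", formally_ingested=False, integrated=False),
    Dataset("web_clickstream", formally_ingested=True, integrated=False),
    Dataset("customer_master", formally_ingested=True, integrated=True),
):
    tier = classify(ds)
    print(ds.name, tier.value, zone_of(tier).value)
```

Note that the rule deliberately ignores where the data is stored, since the classification describes quality and governance rather than storage technology.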

It is important not to associate the classification of data with data storage technologies. Bronze, Silver and Gold data may reside in object stores, distributed file systems (such as HDFS) or relational databases. It is less important where Bronze data is kept, as it is experimental. The user's primary concern is not performance but speed-to-insight. The user can decide what data storage technology they would like to use.

For Silver and Gold data, non-functional requirements drive the decisions on data storage technology, such as ingest volume, ingest latency, query latency, user concurrency and total-cost-of-ownership. IT makes these determinations as part of the operationalization process, as discussed in part 1.

Data states are a foundational concept, but analytics have states as well. The diagram below depicts our two zones: Exploration and Operational. Both zones have data and analytics.

Exploration Data typically consists of one-time or intermittent loads that are not recoverable, i.e. it carries limited SLAs.

Operational Data is data that has gone through a formal operationalization process to automate it and to ensure that data quality, security, metadata and other business-defined SLAs are met. Operational data includes raw source data, minimal viable data products and trusted data products.

Exploration Analytics have limited SLAs, which are more focused on resource availability, i.e. sandbox storage, CPU, GPU and memory. Exploration analytics may use both exploration and operational data, but they run at a lower priority than operational analytics when accessing operational data.

Operational Analytics refers to analytic code that has gone through a formal operationalization process to automate it and to ensure that tool compliance, regulatory compliance and all respective SLAs are met, e.g. availability, accessibility and performance. Operational analytics depend solely on operational data, and therefore the SLAs for operational data are critical to operational analytics.

The diagram below provides a few examples of analytics to better describe how the various states of analytics interact with the various states of data within the analytic ecosystem.

Bring-Your-Own-Data Exploration – Only leverages exploration data
AI Research – Leverages exploration data, but also leverages operational data
Ad-hoc Analysis – In this example, utilizes only operational data, but users may also bring their own data
BI Analytics – Utilizes only operational data, exclusively leveraging Gold level data (trusted-data-products)
Predictive Analytics – Utilizes only operational data, leveraging Silver data (minimal-viable-data-products) and Gold data (trusted-data-products) to maximize data quality and minimize data movement
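The access rules and the five examples above can be checked mechanically. The sketch below is illustrative only: the `Zone` enum, the function names and, in particular, the assignment of each example to an analytic state are assumptions, since the text does not state which state Ad-hoc Analysis or AI Research belongs to.

```python
from enum import Enum


class Zone(Enum):
    EXPLORATION = "exploration"
    OPERATIONAL = "operational"


# Which data zones each state of analytics may read, per the rules in the text:
# exploration analytics may use both; operational analytics only operational data.
ALLOWED_DATA = {
    Zone.EXPLORATION: {Zone.EXPLORATION, Zone.OPERATIONAL},
    Zone.OPERATIONAL: {Zone.OPERATIONAL},
}


def can_access(analytics: Zone, data: Zone) -> bool:
    """True if analytics in the given state may read data in the given zone."""
    return data in ALLOWED_DATA[analytics]


def runs_at_lower_priority(analytics: Zone, data: Zone) -> bool:
    """Exploration analytics yield to operational analytics on operational data."""
    return analytics is Zone.EXPLORATION and data is Zone.OPERATIONAL


# (assumed analytic state, data zones used) for each example in the list.
EXAMPLES = {
    "Bring-Your-Own-Data Exploration": (Zone.EXPLORATION, {Zone.EXPLORATION}),
    "AI Research": (Zone.EXPLORATION, {Zone.EXPLORATION, Zone.OPERATIONAL}),
    "Ad-hoc Analysis": (Zone.EXPLORATION, {Zone.OPERATIONAL}),
    "BI Analytics": (Zone.OPERATIONAL, {Zone.OPERATIONAL}),
    "Predictive Analytics": (Zone.OPERATIONAL, {Zone.OPERATIONAL}),
}


def violations(examples):
    """Names of examples that read a data zone their state may not access."""
    return [
        name
        for name, (analytics, used) in examples.items()
        if not all(can_access(analytics, zone) for zone in used)
    ]


print(violations(EXAMPLES))
```

If an operational example were changed to read exploration data, `violations` would report it, which is the kind of check that signals the data needs to be operationalized first.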

The diagram below shows the three states of data and three states of analytics within the analytic ecosystem, denoting which are exploration and operational. The checks indicate what class of data can be accessed by which class of analytics. As data and analytics become reusable they require more formal support, i.e. they need Operationalization.
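[Figure: states of data and analytics within the analytic ecosystem, with checks marking which class of analytics can access which class of data]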

By classifying the various states of data and analytics, we can begin to get our arms around the problem. We now have a conceptual view of all data and analytics within the analytic ecosystem. That’s powerful.

We can conceptually classify what we need to manage, but the question remains, how do we use these classifications to better manage them within the analytic ecosystem? In part 4, Enabling and Managing the Self-Service Analytics, we’ll discuss just that.
