
Self-Service Analytics: Classifying Data and Analytic States

Learn how to classify data and analytics within the analytic ecosystem by examining the states that data and analytics take within organizations.

September 17, 2019 | 4 min read
In part 1 of our series on self-service analytics, we defined user personas and the importance of self-provisioning. In part 2, we described the value of enabling each persona and the need to get our arms around data sprawl, which self-service analytics further complicates. In part 3, we’ll step back to look at the various states of data and analytics within organizations, which gives us a better way to classify both within the analytic ecosystem.
 
Because the problem involves the entire analytic ecosystem, it is best to approach it from a conceptual perspective, starting with the data. At the highest level, data resides in one of two zones: Operational or Exploration.
 
The Operational Zone is a “production” area which has clearly defined, measurable and monitored business-driven service level agreements (SLAs), such as expectations around data quality, security, availability, accessibility, recoverability and data retention.    
 
The Operational Zone can be broken down into three layers: Acquisition, Integration and Access. The Acquisition layer is used to acquire raw, unchanged data from source systems. It is also used to standardize and conform the data into minimal viable data products. The Integration layer establishes common keys and performs transformations, typically into near third normal form (3NF) structures, to enable reuse of high-quality trusted data. The Access layer provides easier, optimized access, typically in flattened or star structures, for specific business needs via views, materialized views and application programming interfaces (APIs).
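As a rough illustration of how the three layers hand data to one another, here is a minimal Python sketch. It is not the authors' implementation: the source records, the field names (`CustID`, `customer`, `amount`) and the function names are all invented for the example, and a real ecosystem would do this work in database and integration tooling rather than in application code.

```python
from collections import defaultdict

# Hypothetical raw records from two source systems; every field name is invented.
RAW_CRM = [{"CustID": " 001 ", "Name": "Acme Corp"}]
RAW_BILLING = [
    {"customer": "1", "amount": "250.00"},
    {"customer": "1", "amount": "100.00"},
]


def acquire(raw_records, key_field):
    """Acquisition: keep the raw record unchanged and add a standardized key."""
    return [
        {"raw": dict(r), "customer_key": int(str(r[key_field]).strip())}
        for r in raw_records
    ]


def integrate(customers, invoices):
    """Integration: relate both sources through the common key, kept as separate sets."""
    customer_table = {c["customer_key"]: c["raw"]["Name"] for c in customers}
    invoice_table = [(i["customer_key"], float(i["raw"]["amount"])) for i in invoices]
    return customer_table, invoice_table


def access_view(customer_table, invoice_table):
    """Access: flatten into one row per customer for a specific business need."""
    totals = defaultdict(float)
    for key, amount in invoice_table:
        totals[key] += amount
    return [
        {"customer_key": k, "name": customer_table[k], "total_billed": totals[k]}
        for k in sorted(totals)
    ]


customers = acquire(RAW_CRM, "CustID")
invoices = acquire(RAW_BILLING, "customer")
view = access_view(*integrate(customers, invoices))
print(view)
```

The point of the sketch is the separation of concerns: the raw data survives untouched in the first step, the common key is established before any business-specific shaping, and only the last step is tailored to a particular consumer.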
 
The Exploration Zone is a data test bed for doing experimental analytics on a variety of data in a less governed manner. Sandboxes (also known as data labs) can be set up quickly and monitored. New experimental data loaded into these sandboxes can be joined with other analytic data stores in the Operational Zone. The exploration environment is not to be used for highly governed analytic delivery. As data and analytics within the Exploration Zone are operationalized, they move through the standard software development life cycle (SDLC) path into the Operational Zone.
 
Data fulfills business needs at different levels of data quality, reliability and integrity, which we will classify as Bronze, Silver and Gold.      
 
Bronze data is “bring your own” user data. Users load the raw data, perform data cleansing and conform it into something more consumable for their analytic efforts. 
 
Silver data comes into the analytic ecosystem through a formal ingest process into the Acquisition layer. Data is generally cleansed and conformed into usable structures, sometimes referred to as minimal viable data products. The data may be used by one or several users.
 
Gold data is Silver data that has been further refined (i.e. integrated) into highly reusable and trusted data, often referred to as trusted data products. Gold data can be leveraged across the enterprise because it has a high degree of data integrity and quality. As discussed in part 2, Gold data can be used by all user personas, but General Consumers and Data Analysts, who make up 90-95% of the total user population, use it exclusively.
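The three classes can be expressed as a small decision rule. The Python sketch below is only one way to encode the definitions above; the `Dataset` type and its two boolean flags are hypothetical names introduced for illustration, not part of any product or of the authors' method.

```python
from dataclasses import dataclass
from enum import Enum


class DataClass(Enum):
    BRONZE = "bronze"  # "bring your own" data loaded by a user
    SILVER = "silver"  # formally ingested, cleansed and conformed
    GOLD = "gold"      # Silver data integrated into a trusted data product


@dataclass(frozen=True)
class Dataset:
    name: str
    formally_ingested: bool  # hypothetical flag: arrived via the formal ingest process
    integrated: bool         # hypothetical flag: refined into a trusted data product


def classify(ds: Dataset) -> DataClass:
    """Classify a dataset using the Bronze, Silver and Gold definitions."""
    # Gold is refined Silver, so data that never went through formal ingest
    # stays Bronze regardless of what the user has done to it.
    if not ds.formally_ingested:
        return DataClass.BRONZE
    return DataClass.GOLD if ds.integrated else DataClass.SILVER


for ds in (
    Dataset("user_upload.csv", formally_ingested=False, integrated=False),
    Dataset("web_clickstream", formally_ingested=True, integrated=False),
    Dataset("customer_master", formally_ingested=True, integrated=True),
):
    print(ds.name, "->", classify(ds).value)
```

Note that the rule says nothing about where the data is stored, which matches the classification being independent of storage technology.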
 
It is important not to associate the classification of data with data storage technologies. Bronze, Silver and Gold data may reside in object stores, distributed file systems (such as HDFS) or relational databases. It is less important where Bronze data is kept, as it is experimental. The user’s primary concern is speed-to-insight, not performance, so the user can decide which data storage technology to use.
 
For Silver and Gold data, non-functional requirements drive the decisions on data storage technology, such as ingest volume, ingest latency, query latency, user concurrency and total-cost-of-ownership. IT makes these determinations as part of the operationalization process, as discussed in part 1.
 
Data states are a foundational concept, but analytics have states as well. The diagram below depicts our two zones: Exploration and Operational. Both zones have data and analytics.   
 
Exploration Data typically consists of one-time or intermittent loads that are not recoverable, i.e. it has limited SLAs.
 
Operational Data has gone through a formal operationalization process to automate its delivery and to ensure that data quality, security, metadata and other business-defined SLAs are met. Operational data includes raw source data, minimal viable data products and trusted data products.
 
Exploration Analytics have limited SLAs, which are more focused on resource availability, i.e. sandbox storage, CPU, GPU and memory. Exploration analytics may use both exploration and operational data, but when they access operational data they run at a lower priority than operational analytics.
 
Operational Analytics refers to analytic code that has gone through a formal operationalization process to automate it and to ensure that tool compliance, regulatory compliance and all respective SLAs are met, e.g. availability, accessibility and performance. Operational analytics depend solely on operational data, and therefore the SLAs for operational data are critical to operational analytics.
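The access and priority rules in the four definitions above can be written down compactly. The following Python sketch is an illustrative encoding only; the `Zone` enum and both function names are assumptions made for the example, and real enforcement would live in workload management and access control tooling.

```python
from enum import Enum


class Zone(Enum):
    EXPLORATION = "exploration"
    OPERATIONAL = "operational"


def may_access(analytics: Zone, data: Zone) -> bool:
    """Operational analytics depend solely on operational data;
    exploration analytics may use both kinds of data."""
    if analytics is Zone.OPERATIONAL:
        return data is Zone.OPERATIONAL
    return True


def runs_at_lower_priority(analytics: Zone, data: Zone) -> bool:
    """Exploration analytics yield to operational analytics
    when they read operational data."""
    return analytics is Zone.EXPLORATION and data is Zone.OPERATIONAL


for analytics in Zone:
    for data in Zone:
        print(
            f"{analytics.value} analytics on {data.value} data:",
            "allowed" if may_access(analytics, data) else "not allowed",
            "(lower priority)" if runs_at_lower_priority(analytics, data) else "",
        )
```

Of the four combinations, only operational analytics on exploration data is ruled out, which is why exploration data must be operationalized before operational analytics can rely on it.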
[Figure: the Exploration and Operational zones, each containing data and analytics]
The diagram below provides a few examples of analytics to better describe how the various states of analytics interact with the various states of data within the analytic ecosystem. 
 
  1. Bring-Your-Own-Data Exploration – Leverages only exploration data
  2. AI Research – Leverages exploration data, but also leverages operational data
  3. Ad-hoc Analysis – In this example, utilizes only operational data, although analysts may also bring their own data
  4. BI Analytics – Utilizes only operational data, exclusively leveraging Gold data (trusted data products)
  5. Predictive Analytics – Utilizes only operational data, leveraging Silver data (minimal viable data products) and Gold data (trusted data products) to maximize data quality and minimize data movement
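The five examples can also be recorded as data and checked against the rule that operational analytics depend solely on operational data. This Python sketch rests on two assumptions that go beyond the text: it treats Bronze as exploration data and Silver and Gold as operational data, and for AI Research and Ad-hoc Analysis, where the list says only "operational data", it assumes both Silver and Gold are used.

```python
# Assumption: Silver and Gold are the operational classes; Bronze is exploration data.
OPERATIONAL_CLASSES = {"silver", "gold"}

# Data classes used by each example. The Silver/Gold entries for AI Research
# and Ad-hoc Analysis are assumed; the list only says "operational data".
EXAMPLES = {
    "Bring-Your-Own-Data Exploration": {"bronze"},
    "AI Research": {"bronze", "silver", "gold"},
    "Ad-hoc Analysis": {"silver", "gold"},
    "BI Analytics": {"gold"},
    "Predictive Analytics": {"silver", "gold"},
}


def uses_only_operational_data(classes):
    """True when every data class used is an operational class."""
    return classes <= OPERATIONAL_CLASSES


for name, classes in EXAMPLES.items():
    status = "only operational data" if uses_only_operational_data(classes) else "includes exploration data"
    print(f"{name}: {sorted(classes)} -> {status}")
```

Under these assumptions, the first two examples include exploration data, so any Bronze data they rely on would need to be operationalized before the analytics could be.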
[Figure: examples of analytics and the states of data they use]
The diagram below shows the three states of data and three states of analytics within the analytic ecosystem, indicating which belong to the Exploration Zone and which to the Operational Zone. The checks indicate which class of data can be accessed by which class of analytics. As data and analytics become reusable, they require more formal support, i.e. they need to be operationalized.
[Figure: matrix of data states and analytic states, with checks marking permitted access]
 
By classifying the various states of data and analytics, we can begin to get our arms around the problem. We now have a conceptual view of all data and analytics within the analytic ecosystem. That’s powerful.
We can now conceptually classify what we need to manage, but a question remains: how do we use these classifications to better manage data and analytics within the analytic ecosystem? In part 4, Enabling and Managing the Self-Service Analytics, we’ll discuss just that.

About Dwayne Johnson

Dwayne Johnson is a Principal Ecosystem Architect at Teradata, with over 20 years' experience in designing and implementing enterprise architecture for large analytic ecosystems. He has worked with many Fortune 500 companies in the management of data architecture, master data, metadata, data quality, security and privacy, and data integration. He takes a pragmatic, business-led and architecture-driven approach to solving the business needs of an organization.


About Paul Huibers, Ph.D.

Paul Huibers is a principal data scientist at Teradata, with many years’ experience identifying business value in data, and defining projects that lead to operationalization of analytics to achieve significant return on investment. With his background in chemical engineering, Paul focuses on the Industrial Intelligence, manufacturing, and high technology areas. His recent activities include the application of Artificial Intelligence and Deep Learning to business problems.
