Though vastly important, dealing with and analyzing time-based data pose several challenges, especially in today’s world where organizations seek answers in an agile manner while taking advantage of massive influx of data due to the advancement of IoT. If you think basic analytics are hard, just wait until you start dealing with IoT data!
Teradata Vantage is here to the rescue. Vantage allows you to overcome those challenges through our time-aware aggregations and optimizations features. Let’s take a look at what some of the common challenges are and how Vantage addresses those through our new innovations:
Aggregating time-based data could be trickyMany analytics are done by aggregating data at a certain time dimension level – Like every day, every week, or every year. Writing a time arithmetic SQL statement using TIMESTAMP or DATE format data could be very complex when you need to “bucket” the time values. One of the classic techniques to get around it is to create a date dimension table with which each row in a fact / transaction table is associated using a specific date key (see Figure 1). This makes it easy to bucket data based on the predefined time dimensionality.
Figure 1: Example of Time Dimension and Fact / Transaction Tables
But what if you want to aggregate data for every 5 minutes, 5 seconds, or even 5 milliseconds? Would you create a time dimension table that keeps track of each millisecond? A mere 1 day equates to 86.4 million of milliseconds. Imagine joining your fact table against that dimension table.
Teradata Vantage’s new GROUP BY TIME aggregation function makes it a snap to perform time arithmetic, complete with begin and end points to the interval. You want data aggregated by a day? Just specify your aggregation like SUM, COUNT, MAX, etc. and aggregate it using a new syntax, GROUP BY TIME (DAYS(1)). That’s it. Your business users want the aggregation to be changed to, say, every 13 minutes? Just change it to GROUP BY TIME (MINUTES(13)). Every microsecond? No problem. It’s that easy. It is this ease of use that radically simplifies the act of time analytics which then encourages and spurs even greater usage of time dimension in your business answers.
Visualizing time-based data could be frustrating
Many of us use BI and reporting tools to visualize our data, using bar charts and line charts. I am sure many of you faced a challenge of visualizing sparsely populated data resulting in missing time dimension values; when you are missing data for specific time dimension values, your chart skips the time dimension values resulting in an inconsistent-looking chart. A common technique to get around it may be performing a FULL OUTER JOIN between your fact / transaction table with aforementioned time dimension table so that you can force creation of the missing time dimension values. However this solution is far from optimized in terms of performance.
Teradata Vantage’s GROUP BY TIME function has a perfect answer to this challenge by its associated FILL clause option. The FILL option basically allows you to specify how to deal with the annoying missing values. Do you want to fill the missing values with nulls? Use FILL(NULLS). Want to use a constant value of, say, 10? Use FILL(10). It’s that easy. It also provides the option to simply replicate the last or next value to help you “smooth the curve” rather than having jagged drops or missing intervals.
Optimizing time-series and time-based data queries could be daunting
With the advance of IoT, the amount and granularity of data is increasing. More and more organizations are analyzing and discovering new insights from “time series” data. Time series data generally consists of a series of data sequenced by time, commonly in a set time interval. For example, data coming from a sensor data that pings every 5 seconds. Time series data is prevalent in IoT as sensors and devices generate and collect this type of data.
When you try analyzing this type of data which obviously could be of a large volume, one of the first challenges is determining how to store the data with an eye on query optimization. For example, would I always be accessing the data by hour, by sensor? One of the age-old techniques to deal with the challenge is to use aggregate (summary) tables in which values are pre-aggregated based on predefined combination of attributes, one of which is time. But why should you pre-aggregate data and lose precious insights that can exist in the raw data? What if you want to analyze the data outside the predefined set of attributes?
Teradata Vantage now offers a Primary Time Index (PTI) option, which allows you to distribute your data based on time buckets, in addition to other columns, so that you can perform time-based queries more efficiently. PTI uses the same data distribution technique of hash algorithm as our Primary Index (PI); a big difference is that PTI uses the hash on a column in which time dimension data is stored. You can specify the bucketing strategy based on your data and how it is analyzed. Do you analyze the data based on hour? Then you bucket based on HOUR. Do you analyze based on minute and another attribute, like each sensor? Then you can bucket based on MINUTE and Sensor ID.
The beauty of it is that Vantage’s optimizer knows where (which AMP) each row of the data is stored based on time. This makes accessing the data based on the time criterion very efficient.
Now, it’s important to emphasize here that the GROUP BY TIME and PTI capabilities I mention here are not only for time series or IoT data – They provide the same efficiency and benefits to ANY data as long as it is time-based. It also is important that GROUP BY TIME function does not require ANY change to your existing data as long as you are on Vantage.
All of these new time-aware aggregations and optimizations features available in Teradata Vantage ultimately help you with one thing: Getting the answers you need without spending a lot of time massaging and pre-aggregating the data or spending a lot of time writing complex queries. It’s time for you to stop wasting your precious time and start getting more values out of your time-based data.