“The best answers come from the best analytics operating on the most data.”
The first part of this statement is self-evident; some might even consider it complete on its own. But world-class analytics won’t do much good if there are only 11 data points to analyze.
What we really need for great analytics are gobs of data, so that the inevitable outliers and any missing values do not skew the outcomes. Lots of data is also needed to be able to “pan out” and see leading and trailing indicators, correlated variables, completely independent factors, and so on. The more data available, the more confidence we have in the results, especially when analyzing complex systems.
Consider the analysis of a jet engine for preventive maintenance purposes. With multiple measurements per second across numerous sensors, the number of simultaneous inputs is incredible. Here is just a small sample:
- Air temperature
- Air density
- Air humidity
- Air flow angle
- Air flow rate
- Fuel type
- Fuel quality
- Fuel flow rate
- Fuel temperature
- Oil temperature
- Oil volume
- Oil quality
- Exhaust flow
- Engine temperature
- Revolutions per minute
Having as many data points in as many combinations as possible provides the best chance that “normal” behavior can be modeled and that the root cause of anomalies can be isolated – hence more data yields better answers. Just as with computer monitors and television screens, the more pixels (data points) there are, the higher the resolution when zooming in to see precisely what is occurring – and why, to whom, and over what period.
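To make this concrete, here is a minimal, hypothetical sketch (not a Teradata method) of modeling “normal” behavior from many sensor readings and flagging deviations with a simple z-score check. The sensor values and the 2-standard-deviation threshold are illustrative assumptions:

```python
# Illustrative only: with many "normal" readings establishing a baseline,
# a large deviation stands out. Real predictive-maintenance models are far
# more sophisticated; this is a toy z-score check.
from statistics import mean, stdev

def find_anomalies(readings, threshold=2.0):
    """Return readings more than `threshold` standard deviations from the mean."""
    mu = mean(readings)
    sigma = stdev(readings)
    return [r for r in readings if abs(r - mu) > threshold * sigma]

# Made-up oil-temperature samples (degrees C); the spike is detectable only
# because the many normal points establish a tight baseline.
oil_temp = [88.1, 87.9, 88.3, 88.0, 88.2, 87.8, 88.1, 88.0, 140.5, 88.2]
print(find_anomalies(oil_temp))  # [140.5]
```

With only a handful of readings, the same spike could not be distinguished from normal variation – which is exactly the point about needing lots of data.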
So, where should all this data be stored so that it can be analyzed?
“Enterprises should use a combination of block storage and object storage for analytics.”
A cursory look at cloud storage pricing shows a broad range of capabilities at widely varying price points. The following table shows approximate list prices for three tiers of storage from Amazon Web Services (AWS):

| Storage tier | Approximate list price |
| --- | --- |
| Amazon Elastic Block Store (EBS) | ~$102 per TB per month |
| Amazon Simple Storage Service (S3) | ~$23 per TB per month |
| Amazon S3 Glacier | ~$4 per TB per month |
There’s a big price range across the three tiers – over 25X difference! – which raises the question, “Which should be used for what?”
Block storage is the most common storage type for business applications. It is ideal for databases because it provides consistent I/O performance and low-latency connectivity. It provides fixed-size raw storage capacity, and each storage volume can be treated as an independent disk drive.
With object storage – of which archive storage is a subset – one can store any kind of data in virtually unlimited quantities, including backup files, database dumps, and log files as well as unstructured data such as photos, videos, and music. Objects remain protected because multiple copies of the data (usually at least three) are stored across a distributed system.
Read/write performance is the key difference between the storage types, and as the saying goes, you get what you pay for. In fact, Teradata used to recommend that most data for analysis be kept in high-performance block storage and only backups be kept in object storage.
With the introduction of Native Object Store, however, users of Teradata Vantage can now easily query and join with data in Amazon S3, Azure Blob, and soon Google Cloud Storage, making it much easier to leverage ALL relevant data at a much lower price point.
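As a rough illustration, Native Object Store exposes object-store data through SQL (for example, the READ_NOS table operator). The sketch below only composes such a query in Python; the bucket path is a made-up placeholder, and exact NOS syntax and options vary by Vantage release:

```python
# Hypothetical sketch of querying object-store telemetry from Vantage.
# The bucket path below is a placeholder, not a real location.
LOCATION = "/s3/example-telemetry-bucket.s3.amazonaws.com/engine-logs/"

# READ_NOS lets Vantage read external object-store data directly with SQL.
query = f"""
SELECT TOP 10 *
FROM READ_NOS (
  USING LOCATION('{LOCATION}')
) AS engine_logs;
"""

# Submitting it would look roughly like this; it requires the teradatasql
# driver and a live Vantage system, so it is commented out here:
# import teradatasql
# with teradatasql.connect(host="example-host", user="dbc", password="...") as con:
#     with con.cursor() as cur:
#         cur.execute(query)
#         print(cur.fetchall())
```

The point is that data sitting in cheap object storage becomes queryable and joinable with ordinary SQL, rather than requiring a separate load step.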
So, what’s a cost-conscious Teradata DBA to do? Follow these new guidelines:
- Use block storage for data accessed most frequently, which is likely some of the most recent data. For our scenario, this might be all jet engine diagnostics and corporate information in the past 12 months. Let’s say it’s 10% of all corporate data for our scenario.
- Pick object storage for data that is less frequently accessed or of low or unknown value. Queries of object store data won’t be as fast as with block storage, but the financial savings – object storage costs roughly a quarter as much – are worth it. Let’s say it’s 40% of all our data.
- Put all remaining corporate data in archive storage and accept that on the rare occasions when it needs to be accessed, one could end up waiting a while (e.g., 3-5 hours is typical for AWS S3 Glacier retrieval) – but hey, it’s 25 times less expensive than block storage, so that’s OK.
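The three guidelines above can be sketched as a simple tiering rule. The access-pattern labels are assumptions made for this scenario, not official Teradata recommendations:

```python
# Illustrative tiering rule following the guidelines above; the labels and
# mapping are assumptions for this scenario, not official recommendations.
def pick_tier(access_frequency):
    """Map an access pattern to a storage tier."""
    if access_frequency == "frequent":    # hot, likely recent data
        return "block"                    # e.g., Amazon EBS
    if access_frequency == "occasional":  # less frequent or unknown value
        return "object"                   # e.g., Amazon S3
    return "archive"                      # rarely accessed; e.g., S3 Glacier

print(pick_tier("frequent"))    # block
print(pick_tier("occasional"))  # object
print(pick_tier("rare"))        # archive
```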
Thus, in our sample scenario we have a 10:40:50 ratio between block, object, and archive storage.
Said differently, the data that can be easily queried here is in a 1:4 ratio, or 20% in block storage and 80% in object storage. (We’ll consider archive data to be beyond the scope of everyday enterprise analytics but entirely appropriate for long-term retention.)
Financially, the distribution of where data is stored means the effective storage price for our scenario goes from $102 per TB per month for all-block storage down to only $39:
(20% * $102) + (80% * $23) = $20.40 + $18.40 ≈ $39
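The arithmetic can be checked in a few lines, using the per-TB prices from the formula above:

```python
# Blended monthly price per TB for the 20% block / 80% object split above.
BLOCK_PRICE = 102   # $/TB/month for block storage (from the scenario above)
OBJECT_PRICE = 23   # $/TB/month for object storage

blended = 0.20 * BLOCK_PRICE + 0.80 * OBJECT_PRICE
print(round(blended, 2))  # 38.8 -> roughly $39 per TB per month
```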
Thus, by capitalizing on new capabilities inherent in the latest release of Teradata Vantage and being prudent about where data is stored, we can reduce the effective price per TB per month by over half.
So, what have we learned?
- The best answers come from the best analytics operating on the most data.
- Companies now have a choice of storage tiers to use for analytics.
- Picking the appropriate mix of storage tiers is a simple, effective way to align costs with value.
Bottom line: Saving money and being smart are essential for business success – and using object storage as part of the mix with block storage is essential for enterprise analytic success.
To learn more about how to get started with Vantage on AWS and Azure using NOS, tune in to this LinkedIn livestream on August 27 at 9:00 AM PT. See details here.