記事

How to Tackle Data Skew

Learn how to use use Teradata's Global Space Accounting to counter our biggest villain: data skew.

2020年12月1日 5 分で読める
Data Skew in Relational Databases

Teradata has been the best performing and leading relational data warehouse consistently for over 40 years, and the credit goes to its massively parallel architecture (MPP). Teradata is a Relational Database Management System (RDBMS) catering to data warehousing needs and offers high speed, high efficiency, and multi-user access by utilizing the concept of ‘Parallelism.’ Teradata is based on Massively Parallel Processing (MPP) Architecture. An MPP architecture follows a “Divide and Conquer” approach which means that to execute a task, Teradata’s processor divides the task evenly across the entire system.  

AMP (Access Module Processor) is virtual processor which is responsible for parallelism in Teradata and each AMP is associated with a virtual disk which is subset of physical disk. Each AMP receives equal share of physical disk and is responsible for all operations on that data, independent of other AMPs. This architecture is called Shared Nothing Architecture. 

In ideal scenarios, data should be distributed equally amongst all AMPs and they process equal amounts of data. But due to asymmetry in data, certain values may have higher frequency compared to others. In such cases the Teradata hashing algorithm responsible for data distribution on AMPs allocates different amounts of data to each AMP -- this is called Data Skewness. Data skewness results in asymmetric permanent space utilization on each AMP causing Permanent Space Skew.  

In addition to data skewness, there is one more type of skew called Processing Skew. For any operation to happen between data rows they must be on same AMP and in order to do that data is redistributed in spool. Redistributed data in spool may be skewed making some AMPs work on more rows compared to others, causing processing skew. Processing skew results in asymmetric spool space utilization on each AMP causing Spool Space Skew.  

In addition to permanent and spool skew, temporary space which is used for Global temporary tables could get skewed if data loaded in them is skewed causing Temporary Space Skew. 

To sum up, we can have skew in permanent, spool and temporary space due to uneven data distribution. With Teradata 15.10 and before (legacy), space accounting assumes uniform distribution of rows to the AMPs. Database and user space limits are set at the AMP level to total space divided by the number of AMPs in the system. Space limits were hard limits and were managed at the AMP level. Transactions where usage exceeded the hard limit even on a single AMP, such as long running load jobs, had to be resubmitted after increasing the space resulting in missing SLAs and wasted resources. 

The Saviour: Global Space Accounting 

Starting with version 16.10, Teradata has come up with a new feature named Global Space Accounting (GSA). Teradata recommends Global Space Accounting (GSA) at the database or user level when data is expected to be non-uniform. For example, large, unknown data sets that may be skewed are often loaded. Space hard limits can now be converted to soft limits at two levels i.e., soft limit at AMP level and System level.  These are controlled by two parameters: 

  • For AMP soft limit we have a parameter named SKEW FACTOR  

  • For Global soft limit we have parameter named GlobalSpaceSoftLimitPercent.  

SKEW can be defined for permanent, spool and temporary space and ranges between 0 to 100%. Skew at 0% allocates equal space to all AMPs and cannot extend their per AMP quota; skew at 100% can consume double the AMP quota. 

GlobalSpaceSoftLimitPercent is the percentage by which a database or user can exceed the maximum space limit (total space at system level) and is applied to PERM, SPOOL, or TEMP space. 

Consider the below hypothetical scenario to understand these parameters: 

  • Global space limit 1000 MB 

  • Soft Limit 10% 

  • Skew Factor 20% 

  • Number of Amps 5 

With these numbers, the global limit can exceed the 1000 MB limit by 10 percent (to 1100 MB). In addition, an AMP can exceed the 200 MB limit by 20% (to 240 MB) if the total across all AMPs does not exceed 1100 MB. Each of the AMPs can also receive an additional 20 MB because of the global soft limit. The cumulative effects of the skew factor and global soft limit can increase the maximum possible space allotment to an AMP to 240 + 20 = 260 MB, as long as the total space used across all AMPs does not exceed 1100. The user will receive an error message if either of the following limits are exceeded: 

  • The global limit of 1100,  

  • The per-AMP limit of 260. 

In this case, with Legacy Space Accounting the AMP limit of the 200 MB job would have failed.  

But with Global Space Accounting AMP 1,2,3 have crossed their 200 MB limits without the job failing. While space is increased for AMPS 1,2 and 3, AMPs 4 and 5 will not be able to utilize space up to 200 MB. It is the need that bases allocation which provides extra space to AMPs when required by temporarily reducing space for AMPs that do not need the space at that moment. 

We also receive alerts when the soft limit is breached so that we can react in a timely manner to add space as needed to avoid job interruptions. 
Picture1.png
“Legacy Space Accounting” vs “Global Space Accounting”

We cannot blindly select one of these two approaches. And a wise decision that considers the underlining data is required as both cases have their pros and cons.

Legacy Space accounting does not have the overhead of dynamic space allocation as separate tables are maintained in case of GSA. It is recommended when the database is uniformly distributed across all AMPs in the system or when the global space is high enough to tackle skew in data.

When we are unaware of the dataset involved, GSA helps in cases where data is skewed, and the DBA can take actions as per alerts received on breach of soft limits, thus preventing job aborts. Skew limits should be selected judiciously as too high limits in the case of poor design or lack of stats may cause excessive spool utilization on some AMPs, while too low limits may cause requests using excessive spool space to abort. Skew limits cannot solve space problems, especially when actual use is close to the limits. If there is not enough space, it is best to increase the space limits instead of allowing more skew.

With the introduction of Multiple Hash Maps from Teradata 16.10, tables can now reside on a subset of AMPs and thus some AMPs may consume more space relative to others. GSA may be helpful in such scenarios where it is providing more space to AMPs when required.

Regardless, it is best to consult the Teradata Support Centre before making changes to space accounting.

Tags

Abhinav Gupta について

Abhinav Gupta has been associated with Teradata since 2015. He is a Teradata Vantage 16.20 certified master and works as a Teradata Performance DBA. His expertise includes performance tuning wherein he adds value to customers by optimizing their jobs, including reports and loads to run efficiently and meet SLAs. Prior to Teradata, he worked as an Oracle operations DBA.  Abhinav Guptaの投稿一覧はこちら

Shilpa Agrawal について

Shilpa Agrawal has been associated with Teradata since 2015. She is a Teradata Vantage 16.10 certified Advance DBA and is working as a Teradata Performance DBA. In her journey as a performance DBA, she has helped customers make their applications and overall system run efficiently and meet SLA. She is also part of the Performance COP and Configuration teams. Prior to Teradata, she worked as an Oracle operations DBA.
  Shilpa Agrawalの投稿一覧はこちら

最新情報をお受け取りください

メールアドレスをご登録ください。ブログの最新情報をお届けします。



テラデータはソリューションやセミナーに関する最新情報をメールにてご案内する場合があります。 なお、お送りするメールにあるリンクからいつでも配信停止できます。 以上をご理解・ご同意いただける場合には「はい」を選択ください。

テラデータはお客様の個人情報を、Teradata Global Privacy Policyに従って適切に管理します。