Data Merge made easy with the APS Upsert

Comparing and merging datasets typically become time-consuming operations as data volumes increase. The traditional approaches, either loading the data into a staging table and comparing its content against the destination table, or using an ETL tool to run the comparison, share the same downside: there is a lot of data movement going on, and you have to touch the data at least twice. When you have to design a solution that can handle large volumes of data, or when you are already running out of time, there is an alternative!

The Microsoft Analytics Platform System (APS) offers a very powerful feature to upsert data (insert or update), read directly from a flat file or an ETL data source, into a distributed table with just a single command: the data gets loaded at maximum speed into multiple small temporary tables, and each of them is merged in parallel against the final destination table distributions without locking the destination table!

How does it work?

To start at the beginning, let's look at the merge concept in its basic form. Imagine you have to update a table to accurately reflect sales details: when products are returned to a store and money is refunded to your customers, the sales history table needs to be updated accordingly. Typically you will load all incoming sales records into a staging table or ODS and update the DWH at a later stage with a MERGE command (a great feature introduced in SQL Server 2008). This means you read all the new sales transactions you have received; for each new order number a record gets inserted into the destination table, and when an order number already exists, the entire record has to be compared and updated accordingly.
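In its simplest form, such a staging-to-DWH merge could look roughly like the sketch below. The staging table name is hypothetical; the columns match the lineitem table defined next:

MERGE INTO Blog.dbo.lineitem_CCI_Blog AS tgt
USING Blog.dbo.lineitem_staging AS src          -- hypothetical staging table
    ON tgt.l_ordernumber = src.l_ordernumber
WHEN MATCHED THEN
    UPDATE SET tgt.l_shipdate   = src.l_shipdate,
               tgt.l_orderkey   = src.l_orderkey,
               tgt.l_quantity   = src.l_quantity,
               tgt.l_returnflag = src.l_returnflag
WHEN NOT MATCHED THEN
    INSERT (l_ordernumber, l_shipdate, l_orderkey, l_quantity, l_returnflag)
    VALUES (src.l_ordernumber, src.l_shipdate, src.l_orderkey, src.l_quantity, src.l_returnflag);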

 

CREATE TABLE [Blog].[dbo].[lineitem_CCI_Blog] (
    [l_ordernumber] int NOT NULL,
    [l_shipdate]    datetime NOT NULL,
    [l_orderkey]    int NOT NULL,
    [l_quantity]    int NOT NULL,
    [l_returnflag]  char(1)
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH([l_ordernumber]));

 

Source data:


Imagine the following two records have to be processed, resulting in one updated and one new record:

2015001|2015-05-20|4|8|2|  -- Update

2015004|2015-05-22|1|5|0|  -- Insert

With a single APS dwloader command you can start a direct merge based on any key column(s) specified with the -K option: if the merge key exists in the destination table, the row is updated; if the merge key doesn't exist in the destination table, the row is appended.

dwloader.exe -i h:\blog\delta.tbl -M UPSERT -K L_ordernumber -S 172.16.254.1 -E -c -rt value -rv 100 -R G:\newdata\lineItem.tbl.rejects -e ascii -t "|" -r \r\n  -T blog.dbo.lineitem_blog 

Command output:

[2015-05-24 11:36:18] Starting Load

[2015-05-24 11:36:18] Connected to Microsoft SQL Server 2012 Parallel Data Warehouse (10.0.5108.1)

[2015-05-24 11:36:18] Load has started

[2015-05-24 11:36:18] Status: Running, Run Id: 112 – Total Rows Processed: 0, Total Rows Rejected: 0

[2015-05-24 11:36:20] Status: Completed, Run Id: 112 – Total Rows Processed: 2, Total Rows Rejected: 0

[2015-05-24 11:36:20] Load is complete

Result:


How to handle multiple updates to the same record

Assume we would like to process the following two changes: an update of the shipdate column from 5/20 to 5/21, followed by an update to both shipdate and l_orderkey:

2015001|2015-05-21|4|8|2|  -- Update 1

2015001|2015-05-20|3|8|2|  -- Update 2

This means multiple changes to the same record have to be applied; if they are both within the same source file, the dwloader upsert command will abort with the following error message:

Status: Error, Run Id: 119 – Error Code: 8672 – Message: The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.

[2015-06-18 22:03:07] Load has Failed.

The chance that this will happen in the real world is pretty large, so we will have to take precautions by selecting multiple columns that together form a unique key representing a unique row. This way, when there are multiple updates to the same record, we will record and keep them both.
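If, on the other hand, you only want to apply the latest change per order (in the spirit of the GROUP BY hint in the error message), you could dedupe the data in a staging table first. A rough sketch, assuming a staging table with a load_sequence column; neither is part of the original example:

;WITH latest AS (
    SELECT l_ordernumber, l_shipdate, l_orderkey, l_quantity, l_returnflag,
           ROW_NUMBER() OVER (PARTITION BY l_ordernumber
                              ORDER BY load_sequence DESC) AS rn   -- hypothetical ordering column
    FROM Blog.dbo.lineitem_staging                                 -- hypothetical staging table
)
SELECT l_ordernumber, l_shipdate, l_orderkey, l_quantity, l_returnflag
FROM latest
WHERE rn = 1;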

Putting the APS Upsert to the test

It's time to pull out the 75 GB / 600 million row lineitem flat file once more and merge its contents with a 763 GB CCI table. With the upsert -m option specified, APS performs and commits loads in parallel and a merge operation is executed for each table distribution.

(Note: without this option only a single merge operation is started per APS compute node and the 8 distributions are processed sequentially, so the merge phase will take 7 times longer.)

Let’s create a unique merge key based upon 4 columns:

dwloader.exe -i h:\75\lineitem.tbl -M UPSERT -m
    -K l_shipdate,l_orderkey,L_discount,l_suppkey
    -S 172.16.254.1 -E -c -rt value -rv 100
    -R G:\TPCH\lineItem.tbl.rejects -e ascii -t "|" -r \r\n
    -T blog.dbo.lineitem_763GB_2

 

Via the APS admin console we can track the progress; the merge operation on 600 million rows took only 31 minutes in total to complete! 5 minutes 17 seconds to stage the data (DMS_LOAD phase), followed by 26 minutes to merge the data and insert the 600 million rows (LOAD_INSERT phase) on a 4-node APS. The duration of the LOAD_INSERT phase depends on the size of the destination table and the quality of the data (please read the paragraph below on trailing spaces!).

APS Admin Console - Easy tracking of data merge operations

During the CREATE STAGING phase an empty table is created with the DDL copied from the destination table; it is dropped again during the LOAD_CLEANUP phase.

A separate distributed temp table is created to prestage the data

All 8 distributions in each compute node run a MERGE statement in parallel. With APS most data will typically already be in memory, but just by monitoring SSMS I've seen 270+ MB/sec of data being read from disk per compute node when needed.

APS merge operation uncovered
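Next to the admin console, a load run can also be followed from SSMS on the control node. A minimal sketch against the loader DMV; the Run Id reported by dwloader corresponds to run_id, and the exact column set may differ slightly between appliance updates:

SELECT TOP 10 *
FROM sys.dm_pdw_loader_backup_runs   -- holds information about load, backup and restore runs
ORDER BY run_id DESC;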

 

ETL integration

Besides the dwloader utility, which reads from (gzipped) flat files, you can also use the SSIS PDW Destination adapter or Informatica PowerCenter to integrate the upsert functionality into existing ETL processes.

To compose the merge key, tick the boxes in the 'K' column in front of the input columns:

SSIS PDW Destination Adapter

 

For distributed tables the table's distribution key has to be part of the merge key. If you don't include it, the upsert operation will fail and the following error message will be displayed:

[SQL Server PDW Destination [75]] Error: An error occurred with the following error message: "The server could not initialize the load. Details: The KeyColumn parameter must include the Distribution Column when doing an UPSERT operation".  
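If you are not sure which column a table is distributed on, the catalog views can tell you before you compose the merge key. A quick sketch; the table name is just an example:

SELECT c.name AS distribution_column
FROM sys.pdw_column_distribution_properties AS d
JOIN sys.columns AS c
    ON c.object_id = d.object_id AND c.column_id = d.column_id
WHERE d.object_id = OBJECT_ID('dbo.lineitem_blog')   -- example table
  AND d.distribution_ordinal = 1;                    -- 1 = the hash distribution column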

Trailing Spaces

The most frequently transmitted character ever across a wire over the last decades must be, by far, the 'space' character. The APS dwloader utility is very forgiving of data type incompatibilities and is optimized to trim unnecessary characters at the source before starting to load the data into the APS database:

Merging a ~400 GB file with 1 billion rows that primarily contains record updates and many spaces, as shown in the picture above, against a 7.2 billion row table completes within 4.5 hours, of which ~3 hours extra are needed to remove the spaces at the source.

Admin console - load pane details

 

Wrap up

If you spend a lot of time executing data merge operations, it might be worth checking the source data of your most time-consuming jobs for trailing spaces; a quick win might be around the corner! And when you see the amount of data to process steadily increasing, or when you have to reduce the time it takes to merge data, consider the massively parallel processing (MPP) data merge capabilities of the Microsoft Analytics Platform System (APS) to handle lots of data for you, all in one shot!


Restoring a database onto a larger APS

For benchmarking purposes I was restoring a backup taken on a 4-node APS onto an 8-node APS (Microsoft Analytics Platform System), and the unexpected happened: I started to receive warnings that the system was running low on available disk space! (For those of you who have never worked with APS: an 8-node full-rack APS can store hundreds of terabytes!)
It turned out that some colleagues had been way too generous in allocating database space. To me, this was an opportunity to explore the storage management, and I learned something new about restoring APS databases that makes perfect sense once you get it! Below is a walkthrough and some recommendations in case you have to restore an APS database yourself.

The APS database to Backup

The database to move onto a larger APS is 24 TB in size and is called ‘Blog’:

Fig. 1

The database is distributed in 4 equally sized portions across the 4 compute nodes SQLCMP01-SQLCMP04:


 

Backup of an APS database

To back up the database we first have to assign access permissions, via the 'sp_pdw_add_network_credentials' command, to a Windows account that has write access to a file share on a target server. On the backup server you have to set up a file share called 'backups' and grant read/write permissions. And then start the 'BACKUP DATABASE' command.

Backup Database
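For reference, the two steps look roughly like this; the server name, share, account and password below are placeholders:

-- Register a Windows account with write access to the 'backups' share (placeholders).
EXEC sp_pdw_add_network_credentials 'backupserver.contoso.local', 'CONTOSO\backupuser', '**********';

-- Full backup of the 'Blog' database to that share.
BACKUP DATABASE Blog TO DISK = '\\backupserver.contoso.local\backups\Blog_Full';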

The full backup to the remote file share completed in 36 minutes 14 seconds and resulted in a backup set of 1.3 TB on disk. The dataset from each compute node was saved in parallel into a separate userdata_x_0.bak file. (The primary bottleneck is the write throughput of the file share we are writing to: ~640 MB/sec.) (See Fig. 4)

Fig. 3: 4 Node Backup Set

 

Restoring an APS database

To restore this backup onto another APS, we just have to do the same: assign access permissions to the backup server and start the 'RESTORE DATABASE' command.

sp_pdw_add_network_credentials
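Again roughly, with the same placeholder names as above:

-- Grant the destination appliance access to the same share, then restore.
EXEC sp_pdw_add_network_credentials 'backupserver.contoso.local', 'CONTOSO\backupuser', '**********';

RESTORE DATABASE Blog FROM DISK = '\\backupserver.contoso.local\backups\Blog_Full';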

 

Although it was a straightforward backup & restore exercise executed by the book, I noticed the following: a backup set that gets restored onto an appliance with a larger number of compute nodes will grow in allocated database size, in proportion to the number of compute nodes. APS initially restores the database onto 4 nodes to match the source configuration, and then internally redistributes the data onto all 8 nodes; on the extra nodes identically sized allocations are created. (See Fig. 6)


The restore operation resulted in a database with twice the original size on the larger destination system:

Restore

 

How to check data volume in use

There are a couple of simple steps to check the actual disk space allocated within a database and to free up over-allocated space.

 

DBCC PDW_SHOWSPACEUSED

To check the actual allocated data you can run the DBCC PDW_SHOWSPACEUSED command.

USE BLOG;
DBCC PDW_SHOWSPACEUSED;

 

This command outputs, per compute node, the allocated space in kilobytes. In our case approximately 1.6 TB of compressed clustered columnstore data resides inside the database; the 24 TB allocated for the database is therefore a bit of overkill and we can safely free up some of the space.

4 Node
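The same DBCC command also accepts a table name if you want to drill down into a single table; the table below is just an example:

DBCC PDW_SHOWSPACEUSED ("dbo.lineitem_CCI_Blog");   -- example table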

To free up space we can use the ALTER DATABASE command and define the desired new space allocation. We can also shrink the transaction log down to a minimal size (which also minimizes the time spent on log file recreation). To change the database size to a maximum of 5 TB with a 5 GB log:

ALTER DATABASE BLOG SET (AUTOGROW = ON, DISTRIBUTED_SIZE = 5000 GB);
DBCC SHRINKLOG (SIZE = 5000 MB);

 

 

Auto redistribution of data during Restore

After the initial restore onto the 4 compute nodes completed, a 'CREATE TABLE AS SELECT' (CTAS) command is issued and all the distributed tables are automatically redistributed across all 8 compute nodes; the DBCC PDW_SHOWSPACEUSED output shows that the data got nicely spread across all the nodes:

Result of the Restore onto 8 compute nodes

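The redistribution during the restore is handled internally, but conceptually it is comparable to a manual CTAS redistribution. An illustrative sketch only, reusing the lineitem table from the first part of this post:

-- Illustrative only: rebuild a distributed table so its rows are spread across all distributions.
CREATE TABLE dbo.lineitem_CCI_Blog_redist
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH(l_ordernumber))
AS
SELECT * FROM dbo.lineitem_CCI_Blog;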

Wrap-up

Even on an APS system that has the capacity to store hundreds of terabytes, it may be worth checking the overall space allocated to databases every now and then. Also, when planning to restore a database onto a larger APS, check the overall database size upfront to calculate the space needed on the destination, since the allocated size will grow at ratio with the number of compute nodes. If needed, you can free up unused space from the database and its log files upfront by running the ALTER DATABASE and DBCC SHRINKLOG commands.
