Comparing and merging datasets typically become time-consuming operations as data volumes grow. The traditional approaches, loading data into a staging table and comparing its content against the destination table, or using an ETL tool to run the compare, share the same downside: a lot of data movement. You have to touch the data at least twice. When you have to design a solution that can handle large volumes of data, or when you are already running out of time, there is an alternative!
The Microsoft Analytics Platform System (APS) offers a very powerful feature to upsert data (insert or update), read directly from a flat file or ETL data source, into a distributed table with just a single command: the data is loaded at maximum speed into multiple small temporary tables, and each of them is merged in parallel against the final destination table distributions without locking the destination table!
How does it work?
To start at the beginning, let’s look at the merge concept in its basic form. Imagine you have to update a table to accurately reflect sales information details: when products have been returned to a store and money has been refunded to your customers, the sales history table needs to be updated accordingly. Typically you load all incoming sales records into a staging table or ODS and update the DWH at a later stage with a MERGE command (a great feature introduced in SQL Server 2008). This means you read all the new sales transactions you have received; for each new order number a record is inserted into the destination table, and when an order number already exists, the entire record is compared and updated accordingly.
CREATE TABLE [Blog].[dbo].[lineitem_CCI_Blog]
(
    [l_ordernumber] int NOT NULL,
    [l_shipdate]    datetime NOT NULL,
    [l_orderkey]    int NOT NULL,
    [l_quantity]    int NOT NULL,
    [l_returnflag]  char(1)
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH([l_ordernumber]));
Imagine the following 2 records have to be processed, resulting in 1 updated and 1 new record:
2015001|2015-05-20|4|8|2| -- Update
2015004|2015-05-22|1|5|0| -- Insert
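The insert-or-update semantics behind these two records can be sketched as follows. This is a minimal Python illustration of the concept, not APS code (APS runs the equivalent as a MERGE statement); the values of the pre-existing destination row are assumed for illustration:

```python
# Minimal sketch of upsert (insert-or-update) keyed on the order number.
# Illustrative only: APS executes this logic as a MERGE per distribution.

def upsert(destination, incoming, key="l_ordernumber"):
    """Overwrite the row when the key already exists, append it otherwise."""
    for row in incoming:
        destination[row[key]] = row  # update if key exists, insert otherwise
    return destination

# Assumed pre-existing destination row for order 2015001.
dest = {
    2015001: {"l_ordernumber": 2015001, "l_shipdate": "2015-05-19",
              "l_orderkey": 4, "l_quantity": 8, "l_returnflag": "1"},
}
# The two incoming records from the example above.
delta = [
    {"l_ordernumber": 2015001, "l_shipdate": "2015-05-20",
     "l_orderkey": 4, "l_quantity": 8, "l_returnflag": "2"},   # update
    {"l_ordernumber": 2015004, "l_shipdate": "2015-05-22",
     "l_orderkey": 1, "l_quantity": 5, "l_returnflag": "0"},   # insert
]
upsert(dest, delta)  # 1 row updated, 1 row appended
```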
With a single APS DWLoader command you can start a direct merge based on any key column specified with the -K option: if the merge key exists in the destination table, the row is updated; if the merge key doesn’t exist in the destination table, the row is appended.
dwloader.exe -i h:\blog\delta.tbl -M UPSERT -K L_ordernumber -S 172.16.254.1 -E -c -rt value -rv 100 -R G:\newdata\lineItem.tbl.rejects -e ascii -t "|" -r \r\n -T blog.dbo.lineitem_blog
[2015-05-24 11:36:18] Starting Load
[2015-05-24 11:36:18] Connected to Microsoft SQL Server 2012 Parallel Data Warehouse (10.0.5108.1)
[2015-05-24 11:36:18] Load has started
[2015-05-24 11:36:18] Status: Running, Run Id: 112 – Total Rows Processed: 0, Total Rows Rejected: 0
[2015-05-24 11:36:20] Status: Completed, Run Id: 112 – Total Rows Processed: 2, Total Rows Rejected: 0
[2015-05-24 11:36:20] Load is complete
How to handle multiple updates to the same record
Assume we would like to process the following 2 changes: an update to the shipdate column from 5/20 to 5/21, followed by an update to both the shipdate and l_orderkey columns:
2015001|2015-05-21|4|8|2| -- Update 1
2015001|2015-05-20|3|8|2| -- Update 2
This means multiple changes to the same record have to be applied; if they are both within the same source file, the DWLoader upsert command will abort with the following error message:
Status: Error, Run Id: 119 – Error Code: 8672 – Message: The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.
[2015-06-18 22:03:07] Load has Failed.
The chance that this will happen in the real world is quite high, so we have to take precautions by selecting multiple columns that together form a unique key representing a unique row. This way, when there are multiple updates to the same record, both are recorded and kept.
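The effect of widening the merge key can be sketched as follows. With a composite key, the two changes to order 2015001 above no longer collide in the merge; they land as two distinct rows. The key columns chosen here are illustrative; pick whatever combination makes a source row unique in your data:

```python
# Sketch: a composite merge key keeps both changes to the same order number
# as two distinct rows instead of tripping the MERGE "same row more than
# once" error. Key columns below are an illustrative assumption.

KEY_COLS = ("l_ordernumber", "l_shipdate", "l_orderkey", "l_quantity")

def composite_key(row):
    return tuple(row[c] for c in KEY_COLS)

def upsert(destination, incoming):
    for row in incoming:
        destination[composite_key(row)] = row  # keys differ: both rows kept
    return destination

dest = {}
updates = [
    {"l_ordernumber": 2015001, "l_shipdate": "2015-05-21",
     "l_orderkey": 4, "l_quantity": 8, "l_returnflag": "2"},   # update 1
    {"l_ordernumber": 2015001, "l_shipdate": "2015-05-20",
     "l_orderkey": 3, "l_quantity": 8, "l_returnflag": "2"},   # update 2
]
upsert(dest, updates)  # both updates survive as separate rows
```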
Putting the APS Upsert to the test
It’s time to pull out the 75 GB/600 million row lineitem flat file once more and merge its contents with a 763 GB CCI table. With the upsert -m option specified, APS performs and commits loads in parallel: a merge operation is executed for each distribution table.
(Note: without this option only a single merge operation is started per APS compute node; with the 8 distributions processed sequentially, the merge phase will take 7 times longer.)
Let’s create a unique merge key based upon 4 columns:
dwloader.exe -i h:\75\lineitem.tbl -M UPSERT -m
-S 172.16.254.1 -E -c -rt value -rv 100
-R G:\TPCH\lineItem.tbl.rejects -e ascii -t "|" -r \r\n
Via the APS admin console we can track the progress; the merge operation on 600 million rows took only 31 minutes in total to complete! 5 minutes 17 seconds to stage the data (DMS_LOAD phase), followed by 26 minutes to merge and insert the 600 million rows (LOAD_INSERT phase) on a 4-node APS. The duration of the LOAD_INSERT phase depends on the size of the destination table and the quality of the data (please read the paragraph below on trailing spaces!).
All 8 distributions in each compute node run a merge statement in parallel. With APS most data will typically already be in memory, but just by monitoring SSMS I’ve seen 270+ MB/sec of data being read from disk per compute node when needed.
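The stage-then-merge-per-distribution pattern described above can be sketched as follows. This is a simplified simulation, not APS internals: the row data, the distribution function, and the distribution count are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of the MPP merge pattern: rows are hash-distributed on the
# distribution column, then every distribution is merged independently
# and in parallel. Hashing and row data are simplified assumptions.

DISTRIBUTIONS = 8

def distribution_of(row):
    return row["l_ordernumber"] % DISTRIBUTIONS  # APS uses its own hash

def merge_distribution(dest, staged):
    for row in staged:
        dest[row["l_ordernumber"]] = row  # insert-or-update, per distribution
    return dest

# Assumed incoming batch: 100 rows with distinct order numbers.
incoming = [{"l_ordernumber": n, "l_quantity": n % 10}
            for n in range(2015001, 2015101)]

# Stage phase: split incoming rows into one small temporary batch
# per distribution (the DMS_LOAD phase in the load log).
staged = [[] for _ in range(DISTRIBUTIONS)]
for row in incoming:
    staged[distribution_of(row)].append(row)

# Merge phase: one merge per distribution, all running in parallel
# (the LOAD_INSERT phase in the load log).
dests = [{} for _ in range(DISTRIBUTIONS)]
with ThreadPoolExecutor(max_workers=DISTRIBUTIONS) as pool:
    dests = list(pool.map(merge_distribution, dests, staged))

total = sum(len(d) for d in dests)
```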
Besides the dwloader utility, which reads from (gzipped) flat files, you can also use the SSIS PDW Destination adapter or Informatica PowerCenter to integrate the upsert functionality into existing ETL processes.
To compose the merge key, tick the boxes in the ‘K’ column in front of the input columns:
For distributed tables the table’s distribution key has to be part of the merge key. If you don’t include it, the upsert operation will fail and the following error message will be displayed:
[SQL Server PDW Destination ] Error: An error occurred with the following error message: "The server could not initialize the load. Details: The KeyColumn parameter must include the Distribution Column when doing an UPSERT operation".
The most frequently transmitted character ever across a wire over the last decades must be, by far, the ‘space’ character. The APS DWLoader utility is very forgiving about data type incompatibilities and is optimized to trim unnecessary characters at the source before it starts loading the data into the APS database:
Merging a ~400 GB file with 1 billion rows that primarily contains record updates and many trailing spaces, as shown in the picture above, against a 7.2 billion row table completes within 4.5 hours, where ~3 extra hours are needed to remove the spaces at the source.
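The trimming work being done here amounts to stripping trailing padding from each delimited field before the compare, as in this small sketch (the sample line and delimiter follow the lineitem format used earlier):

```python
# Sketch: stripping trailing spaces from delimited fields before the merge,
# so fixed-width padding does not make otherwise-identical rows compare
# as different. The sample line is illustrative.

def normalize_line(line, sep="|"):
    """Split a delimited line and strip trailing whitespace per field."""
    return [field.rstrip() for field in line.split(sep)]

raw = "2015001|2015-05-20   |4|8|2   |"
clean = normalize_line(raw)
# clean == ['2015001', '2015-05-20', '4', '8', '2', '']
```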
If you spend a lot of time executing data merge operations, it might be worth checking the source data of your most time-consuming jobs for trailing spaces; a quick win might be around the corner! When you see the amount of data coming your way steadily increasing, or when you have to reduce the time it takes to merge data quickly, consider the massively parallel processing (MPP) data merge capabilities of the Microsoft Analytics Platform System (APS) to handle lots of data, all in one shot, for you!