How to Process a SSAS MOLAP cube as fast as possible – Part 2

Special thanks to Dirk Gubbels for his contribution on the data type optimization part and review! (Follow Dirk on Twitter: @QualityQueries).

In part 1 we looked at a method to quantify the work that gets done by SQL Server Analysis Services and found that the OLE DB provider with a network packet size of 32767 gives the best throughput while processing a single partition and maxing out the contribution of a single CPU.

In this second part we will focus on how to leverage 10 cores or more (64!) and benefit from every one of the CPUs available in your server while processing multiple partitions in parallel. I hope the tips and approach will help you to test and determine the maximum processing capacity of the cubes on your SSAS server and process them as fast as possible!

Quick Wins

If you have more than 10 cores in your SSAS server, the first thing you’ll notice when you start processing multiple partitions in parallel is that the Windows performance counter ‘% Processor Time’ of the msmdsrv process is steady at 1000%, which means 10 full CPUs are 100% busy processing. Also, the ‘Rows read/sec’ counter will top out and produce a steady flat line, similar to the one below, at 2 million rows read/sec (== 200K rows read/sec per CPU):

Flatline @ 2 Million Rows read/sec 
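If you log the ‘Rows read/sec’ counter to a file, a flat line can also be spotted programmatically. The snippet below is only a rough heuristic of my own, not something Performance Monitor provides: the function name, the 5% tolerance and the sample values are all assumptions chosen for illustration.

```python
from statistics import mean


def is_flat_line(samples, tolerance=0.05):
    """Heuristic: True when every sample stays within +/- tolerance of the mean.

    samples   -- 'Rows read/sec' values collected at a fixed interval
    tolerance -- allowed relative deviation (0.05 == 5%), an arbitrary choice
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    avg = mean(samples)
    if avg == 0:
        return False
    return all(abs(s - avg) / avg <= tolerance for s in samples)


# Made-up sample series: one capped around 2 million, one still ramping up.
capped = [1_990_000, 2_010_000, 2_000_000, 1_995_000, 2_005_000]
ramping = [400_000, 900_000, 1_500_000, 2_100_000, 2_600_000]

print("capped series flat: ", is_flat_line(capped))
print("ramping series flat:", is_flat_line(ramping))
```

A flat series is only a hint that some resource is capped; which resource it is still has to be found by profiling.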

In our search for maximum processing performance we will raise this number to match the number of cores by modifying the Data Source properties: change ‘Maximum number of connections’ from 10 to the number of cores in your server. Our test server has 32 physical cores plus 32 Hyper-Threaded ones = 64 logical cores available.

1) # Connections

By default each cube will open up a maximum of 10 connections to a data source. This means that up to 10 partitions are processed at the same time. See the picture below: 10x status ‘In Progress’ for the AdventureWorks cube, which is slightly expanded to span multiple years:

By default Maximum 10 connections

Just by changing the number of connections to 64, the processing of 64 partitions in parallel results in an average throughput of over 5 million rows read/sec, utilizing 40 cores (yellow line).

This seems a great number already, but it is effectively only (5 million rows / 40 cores =) 125K rows per core, and we still see a flat line when looking at the effective throughput; this tells us that we are hitting the next bottleneck. Also, the CPU usage as visible in Windows Task Manager isn’t at its full capacity yet!
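The per-core figures quoted here are simply the ‘Rows read/sec’ value divided by the number of fully busy cores. A minimal sketch of that arithmetic, using only the numbers mentioned in this post (the helper name is my own):

```python
def rows_per_core(rows_per_sec, cpu_percent):
    """Rows read/sec divided by the number of fully busy cores.

    cpu_percent is the '% Processor Time' of msmdsrv, where 100 == one core.
    """
    busy_cores = cpu_percent / 100.0
    if busy_cores <= 0:
        raise ValueError("cpu_percent must be positive")
    return rows_per_sec / busy_cores


# Figures quoted in the text above.
baseline = rows_per_core(2_000_000, 1000)  # 10 connections, 10 busy cores
tuned = rows_per_core(5_000_000, 4000)     # 64 connections, 40 busy cores

print(f"{baseline:,.0f} rows/sec per core with 10 connections")
print(f"{tuned:,.0f} rows/sec per core with 64 connections")
```

Total throughput went up, but each core is doing less work than before, which is why it pays to look for the next bottleneck.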

Increasing the number of connections from 10 to 64 active connections. 

Time to fire up another Xperf or Kernrate session to dig a bit deeper and zoom into the CPU ticks that are spent by the data provider:

Command syntax:

Kernrate -s 60 -w -v 0 -i 80000 -z sqlncli11 -z msmdsrv -z oleaut32 -z sqloledb -nv msmdsrv.exe -a -x -j c:\websymbols > SSAS_trace.txt

Kernrate base trace

This shows an almost identical result to the profiling of a single partition in part 1 of this blog.

By profiling around a bit and checking both the OLE DB and some SQL Native Client sessions you will, surprisingly, find that most of the CPU ticks are spent on… data type conversions.

zooming into SQLNCLI 

The other steps make sense and include lots of data validation; for example, while it fetches new rows it checks for invalid characters etc. before the data gets pushed into an AS buffer. But the number 1 CPU consumer, CDataSource::DataConvert, is an area that we can optimize!

(To download a local copy of the symbol files yourself, install the Windows Debugger by searching the net for ‘windbg download’ and run the symchk.exe utility to download the symbols that belong to all resident processes into the folder c:\websymbols\:

"C:\Program Files (x86)\Windows Kits\8.1\Debuggers\x64\symchk.exe" /r /ip * /s SRV*c:\websymbols\*http://msdl.microsoft.com/download/symbols )

2) Eliminate Data type conversions

This is an important topic: if the data types of your data source and the cube don’t match, the transport driver will need a lot of time to do the conversions, and this affects the overall processing capacity. Basically, Analysis Services has to wait for the conversion to complete before it can process the new incoming data, and this should be avoided.

Let’s go over an AdventureWorksDW2012 Internet_sales partition as an example:

By looking at the table or query that is the source for the partition, we can determine it uses a range from the FactInternetSales table. But what data types are defined under the hood?

To get to all data type information just ‘right click’ on the SSAS Database name and script the entire DB into a new query Editor Window.

Search through the XML for the query source name that is used for the partition, like: msprop:DbTableName="FactInternetSales"

Script the entire SSAS DB for easy lookup of datatypes.

 
These should match the SQL Server data types; check especially for unsignedByte, short, String lengths and Doubles (slow) vs floats (fast). (A word of caution here: Double and Float are both approximate floating-point types, whereas types such as Decimal and Money are exact, so don’t swap one kind for the other without checking the values.)

A link to a list of how to map the Data types is available here.

How can we quickly check and align the data types? Going over them all manually one by one isn’t fun, as you probably just found out. By searching the net I ran into a really nice and useful utility written by John Tunnicliffe called ‘CheckCubeDataTypes’ that does the job for us; it compares a cube’s data source view with the data types/sizes of the corresponding dimensional attribute. (Kudos John!) Unfortunately, even after making sure the data types are aligned, running Kernrate again shows that DataConvert is still the number one consumer of CPU ticks on the SSAS side.

3) Optimize the data types at the source

To prove that this conversion is our next bottleneck we can also create a view on the database source side and explicitly cast all fields to make sure they match the cube definition. (This is also an option for testing in environments where you don’t own the cube source and databases.)

As a best practice, CAST all columns even if you think the data types are right, and exclude from the view the columns that are not used for processing the Measure Group. (For example, to process the FactInternetSales Measure Group from the AdventureWorks2012 DW cube we don’t need [CarrierTrackingNumber], [SalesOrderNumber], [PromotionKey] and [CustomerPONumber].) Every bit that we don’t have to push over the wire and process from the database source is a pure win. Just create a view with the name ‘Speed’, like the one below, to give it a try.

Create a database view with all fields cast explicitly
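Typing out a long list of CASTs by hand is error prone, so the statement can be generated from a column-to-type mapping. The snippet below is a minimal sketch: the helper name, the `dbo` schema and the handful of columns and types are assumptions for illustration; take the real types from your scripted cube definition.

```python
# Illustrative column -> target type mapping. These types are assumptions;
# use the ones found in your own scripted SSAS database.
COLUMNS = {
    "ProductKey": "int",
    "OrderDateKey": "int",
    "CustomerKey": "int",
    "OrderQuantity": "smallint",
    "UnitPrice": "money",
    "SalesAmount": "money",
}


def build_cast_view(view_name, source_table, columns):
    """Return a CREATE VIEW statement that casts every listed column explicitly.

    Columns that are not in the mapping are simply left out of the view.
    """
    select_list = ",\n".join(
        f"    CAST([{col}] AS {sql_type}) AS [{col}]"
        for col, sql_type in columns.items()
    )
    return (
        f"CREATE VIEW [dbo].[{view_name}] AS\n"
        f"SELECT\n{select_list}\n"
        f"FROM [dbo].[{source_table}];"
    )


sql = build_cast_view("Speed", "FactInternetSales", COLUMNS)
print(sql)
```

Review the generated statement before running it against the source database, and point the partition query at the view afterwards.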

(Note: always be careful when changing data types!

For example, in the picture above, using the ‘Money’ data type is okay because it is already used for FactInternetSales, but Money is not a replacement for all Decimals (it only keeps 4 digits behind the decimal point and doesn’t provide the same range), so be careful when casting data types and double-check that you don’t lose any data!)

Result: by using the data type optimized Speed view as source, the total throughput increased from 5 to 6.6-6.8 million rows read/sec at 4600% CPU usage (== 147K rows/CPU). That’s up to 36% faster. We’re getting there!
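To make the warning about Money concrete, here is a small sketch that tests whether a decimal value would survive a cast to money unchanged. The range constants are the documented limits of the SQL Server money type; the function name and the sample values are made up for illustration.

```python
from decimal import Decimal

# Documented range of the SQL Server money type.
MONEY_MIN = Decimal("-922337203685477.5808")
MONEY_MAX = Decimal("922337203685477.5807")
FOUR_PLACES = Decimal("0.0001")


def fits_money(value):
    """True if value fits in money without overflow or loss of decimals."""
    if not (MONEY_MIN <= value <= MONEY_MAX):
        return False
    # money keeps exactly 4 decimal places; anything beyond that is lost.
    return value == value.quantize(FOUR_PLACES)


# Made-up sample values.
for sample in ("3578.27", "0.12345", "1E15"):
    print(sample, "->", fits_money(Decimal(sample)))
```

Running a check like this over the distinct values of a column (or doing the equivalent comparison in SQL) tells you quickly whether the cast is safe.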

The picture also shows that one of the physical CPU sockets (look at the 2nd line of 16 cores in Numa Node 1) is completely max’d out:

With the Data types aligned  an extra 1.6 Million Rows Read/sec are processed 

4) Create a ‘Static Speed’ View for testing

If you would like to take the database performance out of the equation, something I found useful is to create a static view in the database with all the values pre-populated. This way there will still be a few logical reads from the database, but significantly less physical I/O.

Approach:

1) Copy the original query from the cube:

Double-click on a cube Partition

2) Request just the SELECT TOP (1):

SELECT TOP (1) From...

3) Create a Static view:

Add these values to a view named ‘Static_Speed’ and cast them all:

Cast all static values 

4) Create an additional test partition that queries the new Static_Speed view

Add as many Partitions as the number of CPU's in the SSAS system

5) Copy this test partition multiple times

Create at least as many test partitions as there are cores in your server, or more:

Script the test partition as created in step 4):

Scripted test partition

Create multiple new partitions from it by just changing the <ID> and <Name> ; these will run the same query using just the static view. This way you can test the impact of your modifications to the view quickly and at scale!
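Copying the script by hand 64 times gets tedious, so the copies can be generated. The snippet below is a minimal sketch that works on a simplified, hypothetical fragment; a real scripted partition contains more elements. It assumes that the first `<ID>` and `<Name>` elements in the script are those of the partition, and the partition name is made up.

```python
import re

# Simplified, hypothetical fragment of a scripted test partition.
TEMPLATE = """<Partition>
  <ID>Static_Speed_Test</ID>
  <Name>Static_Speed_Test</Name>
  <!-- the query binding against the static view goes here -->
</Partition>"""


def clone_partition(script, suffix):
    """Return a copy of the script with the suffix appended to <ID> and <Name>.

    Only the first occurrence of each element is changed (assumed to be the
    partition's own ID and Name).
    """
    def rename(tag, text):
        pattern = rf"(<{tag}>)(.*?)(</{tag}>)"
        new_text, count = re.subn(
            pattern,
            lambda m: f"{m.group(1)}{m.group(2)}_{suffix}{m.group(3)}",
            text,
            count=1,
        )
        if count != 1:
            raise ValueError(f"no <{tag}> element found")
        return new_text

    return rename("Name", rename("ID", script))


# One test partition per logical core on a 64-core server.
copies = [clone_partition(TEMPLATE, f"{i:02d}") for i in range(1, 65)]
print(copies[0])
```

Each generated script can then be executed in a query window to create the partition.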

6) Processing the test partitions

Process these newly created test partitions, which only query the static view, and select at least as many of them as the number of CPUs you have available in your SSAS server.

Determine the maximum processing capacity of your cube server
by monitoring the ‘Rows Read/sec’!

 

Wrap Up

If you have a spare moment to check out the workload performance counters of your most demanding cube servers, you may find that there is room for improvement. If you see flat lines during cube processing, I hope they will now catch your eye; by increasing the number of connections and checking that you don’t spend your CPU cycles on data type conversions you may get a similar improvement of over 3x, as shown in the example above. And if Task Manager shows that just one of the NUMA nodes is completely max’d out, that might indicate it’s time to start looking into some of the msmdsrv.ini file settings…


How to Process a SSAS MOLAP cube as fast as possible – Part 1

Recently, with some colleagues, I was working on a project with a serious challenge: there was this Analysis Services 2012 system with 40 physical cores, half a terabyte of RAM and 10TB of SSD storage waiting to get pushed to its limits, but it was installed via the famous ‘next, next, finish’ setup approach and we had to tune the box from scratch. We also had to pull the data from a database running on another box, which means the data processing is impacted by the network round-tripping overhead.

With a few simple but effective tricks for tuning the basics and a methodology on how to check upon the effective workload processed by Analysis Server you will see there’s a lot to gain! If you take the time to optimize the basic throughput, your cubes will process faster and I’m sure, one day, your end-users will be thankful! This Part 1 is about tuning just the processing of a single partition.

Quantifying a baseline

So, where to start? Well, to quantify the effective processing throughput, just looking at Windows Task Manager and checking whether the CPUs run at 100% load isn’t enough; the metric that works best for me is the ‘Rows read/sec’ counter that you can find in the Windows Performance Monitor under the MSOLAP Processing object.

Just for fun… looking back in history, the first SSAS 2000 cube I ever processed was capable of handling 75,000 rows read/sec, but that was before partitioning was introduced. 8 years ago, on a 64 CPU Unisys ES7000 server with SQL Server and SSAS 2005 running side by side, I managed to process many partitions in parallel and effectively process 5+ million rows read/sec (== 85K rows read/sec per core).

The year 2006: Processing 300+ billion rows of retail sales data with SSAS 2005. 

Establishing a baseline – Process a single Partition

Today, with SSAS 2012, your server should be able to process much more data; if you run SQL Server and SSAS side by side on a server or on your laptop you will be surprised at how fast you can process a single partition; expect 250-450K rows read/sec while maxing out a single CPU at 100%.

As an impression of processing a single partition on a server running SSAS 2012 and SQL 2012 side by side using the SQL Server Native Client: the % Processor Time of the SSAS process (MSMDSRV.exe) is a flat line at 100%. Does this mean we reached maximum processing capacity? Well… no! There is an area where we will find a lot of quick wins; let’s see whether we can move data from A (the SQL Server) to B (the Analysis Server) faster.

Sample baseline:  290K rows read/sec - 100% Processor time of the SSAS process (msmdsrv) , maxing out a single CPU.

100% CPU?

Max’ing out with a flat line at 100% load (== a single CPU) may look like we are limited by a hardware bottleneck. But just to be sure, let’s profile for a minute to see where we really spend our CPU ticks. My favorite tool for a quick & dirty check is Kernrate (or Xperf if you prefer).

Command line:

Kernrate -s 60 -w -v 0 -i 80000 -z sqlncli11 -z msmdsrv -nv msmdsrv.exe -a -x -j c:\symbols;

Surprisingly, more than half of our time isn’t spent in Analysis Services (or SQL Server) at all, but in the SQL Native Client data provider! Let’s see what we can do to improve this.

Kernrate output of profiling 60 seconds of MOLAP cube processing. 

Quick Wins

1) Tune the BIOS settings & Operating system

Quick wins sometimes come from something that you may overlook completely, like checking the BIOS settings of the server. There is a lot to gain there; expect a 30% improvement or more if you disable a couple of energy saving options. (It’s up to you to revert them and save the planet when testing is done…)

For example: 

– Enter the BIOS Power options menu and see if you can disable settings like ‘Processor Power Idle state’.

– In the Windows Control Panel, set the Server Power Plan to max. throughput (up to Windows Server 2008 R2 this is like pressing the turbo switch; on Windows Server 2012 the effect is marginal but still worth it).

Control Panel- Power Options 

2) Testing multiple data providers

As the Kernrate profiling shows, a lot of time is spent in the network stack for reading the data from the source. This applies to both side by side (local) processing and pulling the data in over the network.

Since the data provider has a significant impact on how fast SSAS can consume incoming data, let’s check for a moment what other choices we have available; just double-click on the cube Data Source.

Configuring Connection String options.

 

Switching from the SQL Native Client to the Native OLE DB\ Microsoft OLE DB Provider for SQL Server brings the best result: 32% higher throughput!

Use the Native OLE DB Provider for best throughput. 

SSAS is still using a single CPU to process a single partition, but the overall throughput is significantly higher when using the OLE DB Provider for SQL Server:

OLE DB Provider for SQL Server 

To summarize: with just a couple of changes the overall throughput per core just doubled!

 

Reading source data from a remote Server faster

If you run SSAS on a separate server and you have to pull all the data from a database running on another box, expect the base throughput to be significantly lower due to processing in the network stack and round-tripping overhead. The tricks that apply to side by side processing also apply in this scenario:

1) Establish the partition processing baseline against the remote server.

Fewer rows are processed when reading from a remote server (see fig.); also, the MSMDSRV process is effectively utilizing only 1/2 of a CPU. The impact of transporting the data from A to B over the network is significant and worth optimizing. Let’s focus our efforts on optimizing this first.

Reading from a data source over the network.

 

2)  Increase the network Packet Size from 4096 bytes  to 32 Kbyte.

Get more work done with each network packet sent over the wire by increasing the packet size from 4096 to 32767; this property can also be set via the Data Source – Connection String: just select ‘All’ on the left and scroll down till you see the ‘Packet Size’ field.

Change the packet size via the Cube Data Source Connection 'All' section.
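Setting the field in the dialog ends up as a ‘Packet Size’ keyword in the connection string. The snippet below is a minimal sketch of what such a string could look like for the OLE DB Provider for SQL Server; the server and database names are placeholders, the helper function is my own, and the bounds are the 512 to 32767 range that SQL Server accepts for the network packet size.

```python
MIN_PACKET_SIZE = 512
MAX_PACKET_SIZE = 32767


def build_connection_string(server, database, packet_size=MAX_PACKET_SIZE,
                            provider="SQLOLEDB.1"):
    """Build an OLE DB connection string with an explicit packet size."""
    if not MIN_PACKET_SIZE <= packet_size <= MAX_PACKET_SIZE:
        raise ValueError(
            f"packet size must be between {MIN_PACKET_SIZE} and {MAX_PACKET_SIZE}"
        )
    parts = {
        "Provider": provider,
        "Data Source": server,
        "Integrated Security": "SSPI",
        "Initial Catalog": database,
        "Packet Size": str(packet_size),
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())


# 'MySqlServer' is a placeholder server name.
conn = build_connection_string("MySqlServer", "AdventureWorksDW2012")
print(conn)
```

Comparing the generated string with the one shown in the Data Source dialog is a quick way to confirm that the setting was actually saved.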

The throughput gain is significant:

Quick win : increase network packet size from 4KB to 32KB. 

Summary

When you have a lot of data to process with your SQL Server Analysis Services cubes, every second less that you spend on updating and processing may count for your end-users; by monitoring the throughput while processing a single partition from a Measure Group you can set the foundation for further optimizations. With the tips described above the effective processing capacity on a standard server more than doubled. Every performance gain achieved at this basic level will pay back later while processing multiple partitions in parallel and helps you to provide information faster!

In part II we will zoom into optimizing the processing of multiple partitions in parallel.
