Maximizing SQL Server 2008 R2 Table scan speed from DSI Solid State Storage

Scenario:

Imagine you have a single table with 400 Million rows and 46 GByte in size that you want to read as fast as possible. How can you check if you run @ top speed?

This blog post is all about squeezing the maximum out of both SQL Server and the available hardware and showing you how to measure the achieved throughput and durations the way I do.  in the previous post I showed what throughput figures are possible with the SQLIO utility, now it’s time to see if we can get the same throughput with SQL Server 2008 R2 !

 

Test 3a) Table scan speed from a single table stored in a single database file

image

Step 1 is to test with an ‘out of the box’ configuration and see that reading a table with 400 million rows / 46 GByte of size with LineItem data takes 31.2 seconds.

The Windows Performance monitor Logical Disk counter “Disk Read Bytes/sec” shows an average of  around 1400 MB/sec reading from the DSI 3500 Solid State box:

 clip_image002

The SQL Profiler  shows that approx. 5.5 million logical (8KB) reads are needed to retrieve the 46 GByte of data.

clip_image004

 

 

 

 

 

What are we waiting for?

The top 2 from the Waitstats dmv (select * from sys.dm_os_wait_stats)   shows 80.1% of the time SQL was waiting, is spend by LATCH_EX and 9.3% by PAGEIOLATCH_SH: 

clip_image006

To read a page from disk and put this into cache, the worker thread will take out a EX (Exclusive) latch on the BUF associated with the page. 
(See also this CSS post for more information: http://blogs.msdn.com/psssql/archive/2009/07/08/q-a-on-latches-in-the-sql-server-engine.aspx)

Or looking into the exclusive LATCH_EX  wait type as described in BOL:  a LATCH_EX  occurs when waiting for a EX (exclusive) latch.

This does not include buffer latches or transaction mark latches.  A breakdown of LATCH_* waits is available in sys.dm_os_latch_stats. 

Following the books and drilling down into the latch waits with the sys.dm_os_latch_stats dmv shows the following:

clip_image008

On the "ACCESS_METHODS_DATASET_PARENT”  latch wait type isn’t much information available, but think of this as a memory semaphore that is used to distribute the pages scanned from disk to the operators downstream in the query plan, synchronizing child dataset access to the parent dataset during parallel operations.

From several experiments I have noticed that this type of  latch wait time doesn’t expose a direct relation to the overall duration of the query execution while table scanning a single table , so lets call it SQL internal housekeeping.

The PAGEIOLATCH_SH wait event occurs when a task is waiting for a latch for a buffer that is in an I/O request. When a page is not in the buffer cache, SQL Server has to first allocate a buffer page, and then puts a exclusive PageIOLatch_ex latch on the buffer while the page is transferred from disk to cache.

Meanwhile, SQL Server puts a PageIOLatch_sh request on the buffer on behalf of the user. After the write to cache finishes, the PageIOLatch_ex latch is released. This allows the user to read the buffer page after which the PageIOLatch_sh is released. The 12.1% contribution of the Buffer Latches  as primary wait reason makes more sense.

Since we know from the SQLIO tests that we use only half the available throughput bandwidth an easy option to get more I/O requests being serviced  is to add  more database files and more LUN’s to distribute the IO load. 

 

Test 3b)  Adding a second LUN and database file

Surprisingly with the table striped over 2 database files and 2 LUNS in the same filegroup , in my tests it takes longer to read the same 400 Million rows; on average 36.9  seconds instead of  31 seconds. That’s strange… as shown with the SQLIO tests (and also by common sense) you would expect that adding a second LUN  and more files would speed up data retrieval.  … interesting.  The graph below the average throughput (Disk Read Bytes/sec) slightly dropped to about an average of 1300 MB/sec and the disk transfer timings are average 1 millisecond. 

image

image 

 

Test 3c) Table scan speed from a partitoned table with 96 partitions/24 database files and 24 FileGroups

What else can we do to speed up table scanning ? About a year ago while Thomas Kejser (from he SQLCAT team http://sqlcat.com/ ) and I where preparing a presentation for the PASS European Conference 2009 conference (btw please checkout the new 2010 conference website @ http://www.sqlpass.org/summit/eu2010/ ) when we came up with the following trick to speed up the tablescanning significantly :  partition the table and apply a special hashing algorithm to equally distribute the data across the partitions.

This will enable multi threading data retrieval from multiple partitions.

 

Steps to take:

1)  Optimal number of partitions should be equal to the number of available cores (96)

2) Create multiple Filegroups to hold the partitions (24)
3) Partition the tables into #cores partitions 

4) Use the CREATE PARTITION FUNCTION command

image 

5) Use the CREATE PARTITION SCHEME command  and bind the partition function to Filegroups

image 

6) Add a hash column to the table (tinyint, just one byte per row)
7) Calculate a good hash distribution , for example, use hashbytes with a modulo function
8) Build a clustered index on the table, on the partition scheme

image

9) Insert data  with some magic hashing algorithm key:

image

 

10)  Check if all partitions are about equally filled: 

Run the  Select * from sys.partitions where  object_name(object_id)=’LINEITEM_HASH96key_SSD’

and check out the amount of rows per partition:

image

 

11) Rerun the command :  SELECT count (*) from LINEITEM_Hash96Key_SSD

Result of table partitioning on overall throughput and duration is shown below: it takes only 19.88 seconds  instead of 31.2 seconds,

so 11.3 seconds or 36% faster.  That’s an interesting gain!

image

image

Waitstats

the no. 1 waittype , with 78% , is time spend by waiting for PageIOLatch_SH,  with an average of 180 Millsec per event, means a  disk bottleneck; 

We are maxing out on our single LUN.

image

Latchwaits

image

 

 

Test 3d)  Table scanning from a partitoned table with 96 partitions, 24 db files, 24 FileGroups and PAGE Compression

 

A last tip: enabling Page compression on the table will also speed up  scanning: 

command:  ALTER TABLE [dbo].LINEITEM_hash96key_SSD REBUILD WITH  (DATA_COMPRESSION = PAGE )

 

Result:  we can now scan 400 million rows / 46 GByte of data in only  6.2  seconds..   or in only 20% of the time.

clip_image010

 image

Wait_types

image

Latches:

image 

Additional information

World’s Fastest Flash-Based SSD:  The DSI 3500 Solid State:  http://www.dynamicsolutions.com/main-menu/dsi3500 

Unisys whitepaper: Microsoft SQL Server 2008 Data and Backup Compression:   http://www9.unisys.com/eprise/main/admin/corporate/doc/41371394.pdf 

Wrap up

The examples above show that SQL table scan operations can be optimized greatly; partitioning a table or applying some sort of compression  makes a significant difference in overall scan speed.

Henk posted at 2010-1-2 Category: SQL 2008 R2, Storage: SAN & SSD | Tags: , ,

4 Responses Leave a comment

  1. #1mbourgon @ 2010-5-25 16:27

    Question: how long does it take with just the compression and not the partitioning & randomization?

  2. Henk @ 2010-5-25 22:30

    Hi Michael,
    scanning a regular page compressed table with 400 mill rows (18.39 GB) from SSD takes 11.8 seconds which is also a significant gain (vs the original 31.2 seconds…)

  3. #2robert towne @ 2011-9-27 01:28

    Any chance you could post the actual script? I am curious if the files are all on the same LUN or spread out across multiple LUN’s. The filegroup to file mapping is typically something not shown in articles but could be a very important detail.

    Thanks,
    Robert

  4. Henk @ 2013-4-11 18:53

    Hi Daniel,

    “ACCESS_METHODS_DATASET_PARENT” WAITS (to me) typically means that the Resource Governor (even when RG is not enabled!) is queying some queries (they will show up as status ‘runnable’ instead of ‘running’ ;
    if you enable the Resource Gov. and reduce the ‘memory Grant %’ from 25% to say 10% you will see that the “ACCESS_METHODS_DATASET_PARENT” Waits will go down.

    – Henk

Leave a Reply

You must be logged in to post a comment.