Speeding up Reading from a SQL OLE DB or ADO.Net Data Source

With small datasets on a test system you will probably hardly notice the difference, but when you are building SSIS packages that need to process Gigabytes of data in a production environment on a day to day basis, the following Performance tip to optimize reading from the SQL data source is one to remember. 

1) Data Flow OLE DB Source creation

Very likely when you select your OLE DB (or ADO.Net) Source,  you pick the default data access mode “Table or view”  and the table with the data you need;

OLE DB Source Editor 

In our case,  it’s a TestTable which has 4 columns,   and lets deselect the columns we don’t need ;

 clip_image006

2) Measuring the throughput

To check quickly the throughput of the Data Source  use the Row Count trick; count the rows and measure the elapsed time.

When we execute the package it takes 61.2 seconds to run the SSIS package, which is like 490 000 rows /sec.

Use the Row Count to  measure throughput

Surprisingly if you look at the SQL Profiler trace, you will notice that SQL Server executed from a  SSIS package a “Select * from [dbo].[TestTable] request which took 61.2 seconds !

This is strange; we just selected Column1 to execute the Rowcount on?  Or may be not…

Select * from Heaven

3) Be Selective

The default Table or View selection will always execute a “Select * from Heaven” and can become very costly if your database contains many columns (I will show you later the impact) but what’s important to realize is that the SSIS Progress tab will not warn you for this behavior… Typically if there are unused data columns or any room for improvement along the line you will see a warning in the Progress tab tune your package to avoid unnecessary data movement.

Pipeline warning

To increase the processing speed of you packages overcome the full table scans and transfer only the data that you really need. Be as selective as possible. In this case to retrieve just Col1:  Select a different Data Access mode, pick “SQL Command” (instead of “Table or view”)  and build a query SELECT Col1 from dbo.TestTable”. 

clip_image009 

This query will only create 1 column as output which will show up in  the SSIS OLE DB Source Editor as a single available external  Column:

Only Columns needed available 

Result: with just column1  being retrieved,  it still takes 179148 reads, but only 26.9 seconds to get all the data transferred, instead of 61.2 seconds :

 Profiler trace of narrowed down column selection

Make it a habit to question yourselves where the untagged columns come from.

3a) Impact on Large Datasets

Just to show you the impact of what it take to scan a table with 400 million row , and 17 column wide data set..:

Howto create Bad Performance...

Duration: 29 minutes 55 seconds to scroll through all 400 Million columns:

image

When you build a much more selective query by typing a statement or by using the Query Builder :

Retrieving the same L_Shipdates date takes only  6 minutes 17 seconds , which is  23 minutes and 38 seconds , or 79%… LESS Time!

SSIS Query Builder

Of course, by building you package using WHERE clauses and filter out the specific ranges you really need will make the optimization complete.

 

4) How to check existing SSIS Packages for the “Select * FROM heaven”

This an interesting question; at the moment I think the simplest way to get a hint where to start looking is to use a lightweight SQL Profiler trace that checks for:

1) SSIS Package executions + start date & time,
2) Table scans commands
3) The Row Count

With SQL Profiler , the trace setup will look like this:  Apply 2 filters by clicking the “Column Filters” button:

– “ApplicationName” Like:  “SSIS-%”  and 
– “TextData” Like  “Select * from %”

Applying Profiler filter to capture tablescans TextData filter

As third step, select the “Show all columns” option , look for the RowRounts column, and scroll in all the way down to the TSQL section to select SQL:BatchCompleted:

Select the RowCounts Column for SQL: BatchCompleted

Result:

Result

 

5) Additional Reading

Luckily I found one location where this tip is written down also; Great SSIS book on page 414  of the book with lots of insights “Professional Microsoft SQL Server 2008 Integration Services, written by my friend Grant Dickinson and other great guys like Brian Knight, Erik Veerman, Douglas Hinson, Darren Herbold.   

 

Wrap up

With the walkthrough example above I showed you the impact on the overall performance when using the default SSIS “Table or view’ data source selection on small and large data set, and also the effect on the overall runtime of your SSIS packages. Check your most demanding (or long running) packages on their selectivity and narrow down the number of columns you are requesting from the database where possible. Move only the data around that you really need!

Henk posted at 2010-1-20 Category: SSIS | Tags: , ,

8 Responses Leave a comment

  1. #1Shahriar Nikkhah @ 2010-8-27 17:14

    Very very good, thank you.

  2. #2Abhay Desai @ 2010-12-2 07:32

    Very good article

  3. #3bali @ 2011-10-26 11:27

    Hi,

    Instead of embedding sql query in the ssis package and even avoiding view to select required columns, can we create some custom components that just retrieves only selected columns(at design time).

  4. Henk @ 2011-10-27 13:35

    Hi Bali,

    Thanks for thinking with me on this topic; what type of custom component are you thinking of ?

    To limit the output send out to SSIS you have to be as selective as possible on the SQLServer side ;
    For example, you could also build a View into your database and query the view from SSIS ?

    Create view FASTREAD
    as
    SELECT l_Shipdate
    FROM dbo.lineitem_Has96Key

    To my best understanding this is still the best and fastest way;
    (In SQL 2012 there will be a new index type ; the column store index that significant increases querying of a specific column too ! )

    Brgds, Henk

  5. #4Manish Mahajan @ 2012-3-15 13:33

    Thank you for the informative article.
    I am going to check this out on an SSIS package which is giving me a lot of troubles and see how it goes.
    Regards,
    Manish

  6. #5Venkat @ 2013-6-26 10:49

    Hi,

    Very good article ,too informative. The topic is on “Speeding up Reading from a SQL OLE DB or ADO.Net Data Source”. I dint get the difference between OLEDB and ADO.NET source components. May i know the major difference please.

    Thanks,
    Venkat

  7. Henk @ 2013-6-26 15:24

    Hi Venkat,
    Both connection types are used quite often and the techniques/tips I describe will work with both; from what I have measured, oledb is typically (~15%) faster than ado.net (please try for yourselves with the SSIS Rowcount component as listed under 2) !)

  8. #6Lennon @ 2015-11-12 09:02

    Hi,

    Great exercise, however in my case this doesn’t seem to work that well. I need to extract a dataset +- 1.4 billion rows which I need to push to another table. Is there any suggestion I could try out?

    Many thanks,
    Lennon

Leave a Reply

You must be logged in to post a comment.