Data Warehouse Performance Part 1 - I/O
In the last couple of weeks the number of visitors to this site has exploded. This is mainly because SQLServerCentral.com featured one of my articles on their home page and in their daily newsletter on Sept 1st. I have been intending to start a series of articles on data warehouse performance optimisation and availability covering the databases, cubes, ETL and reporting, and this increase in readership has given me the incentive to get started.
I will start by describing a few options you have for optimising the I/O for reporting or data mining. It is difficult to entirely separate reporting from the data load process, as changes that benefit one may adversely impact the other. I have highlighted when this is the case.
Typical Data MartA typical data mart is the result of ETL (Extract, Transform and Load) of data from multiple sources, and is then used for reporting purposes or as the base for an Analysis Services cube.
Normally a star schema or snowflake schema is used. Both of these schemas consist of a central fact table and multiple dimension tables. The dimension tables do not tend to be very big, but the fact tables can be huge. It is not unusual to see fact tables with more than 1 billion rows.
With tables this size it is important that the amount of data pulled from disk is kept to a minimum. It is also important that disk reads are as fast as possible.
Disk SubsystemUnlike OLTP systems data warehouses often use RAID 5. This slows down the ETL process, as RAID 5 writes are considerably slower than RAID 1, but has other benefits. The read performance is much better, close to RAID 0, and the costs are lower. Cost is often the overiding factor when several terabytes of storage have to be found.
Of course, if you have a bigger budget you can go for RAID 10. This requires more disks, so is more expensive, but performs better due to the lack of parity calculation.
A SAN may use its own proprietory versions of RAID and some of these will stripe across hundreds of disks. Beware some of the performance claims from the manufacturers though. The quoted figures are normally only achievable when reading or writing to the cache. An 8GB cache is not very useful when you are querying a 2TB database. In fact, the ETL processes will often benefit from disabling the read cache, so it is all available for writing.
Another way of boosting I/O performance is through multipath Software. This dynamically balances the load across HBAs and controllers.
IndexingBefore discussing indexes, I want to reiterate a point I have made in SQL WHERE Clause Optimization. Optimize your SQL. Use SARGable queries. Think about how SQL Server will service your queries. Look at the execution plans. There is no point adding multiple indexes to a table if they are never used. Data warehouse performance is no different from OLTP performance in this respect.
The nature of a data warehouse is such that it can be indexed up to the eyeballs, to allow for every conceivable query if necessary. The impact on ETL can be reduced using table partitioning (see below).
Where there are an excessive number of RID and Key lookups you should look at creating covering indexes. A covering index includes all columns referenced in the query and can make a massive difference to query times.
Indexing can also be used to reduce the amount of explicit sorting performed by a query. An index on the column(s) being sorted brings the data back in the required order.
On very large fact tables, you should normally create a clustered index on the date column. This increases the likelihood of an efficient execution plan when a date range is specified in the query. It also reduces the chance of page splits, as often data is loaded in date order.
Table PartitioningTable partitioning is a feature introduced in SQL Server 2005 that allows the data in a table to be spread across multiple filegroups, and hence disks, based on ranges of values of a particular column.
This improves availability, as index rebuilds can be performed one partition at a time, but it also makes data loading much faster.
A non-logged bulk insert is by far the quickest method, but this cannot be done if there are indexes on the table. For smaller tables you can drop the indexes first, bulk load the data, then re-create the indexes again. But consider a 1 billion row fact table with several indexes. It would not be possible to re-create the indexes in the time available.
If the fact table is partitioned, the solution is simple:
- Create a staging table, with no indexes or constraints;
- Bulk load the data;
- Build the indexes;
- Add constraints;
- Switch the table into the main fact table with the SWITCH command;
I've strayed into ETL performance here, which is the subject of the next article in this series, but I wanted to mention it as I believe table partitioning should be designed into any sizeable data warehouse project right from the start.
CompressionWhen people think of compression, their first thought is the row or page compression built into SQL Server 2008.
You are right, but there is another kind. Use the smallest possible data types.
It is common for the source data to use an integer to store a small range of values. If so, use a tinyint or smallint instead. The same applies to other data types. smalldatetime instead of datetime, varchar instead of char.
The cumulative effect is that each row is smaller, and hence more rows will fit on a page. If you can double the number of rows on each page, your best case is a halving of I/O. Definitely worth doing, and it doesn't need SQL Server 2008 either.
If you are using SQL Server 2008, I strongly recommend applying page compression to your fact tables. The improvement in data warehouse performance can be quite impressive. The only time you wouldn't do this is if the server is already CPU bound, as there is a small CPU overhead in decompressing the data.
LockingThe locking that is so important in an OLTP system is not required in a data warehouse. The overhead of taking and escalating all the shared locks has a noticeable impact on performance.
Fortunately the solution is simple: After the data load, set the database read-only. It can be made read-write again before the next data load. SQL Server notices this when it compiles the execution plans and does not take out any locks. This is equivalent to specifying the NOLOCK hint, but avoids having to explicitly include it in each query.
MaintenanceQueries against a data warehouse typically result in range scans against the data, with SQL Server performing sequential reads with large block sizes. Fragmentation can severly impact the performance of these reads, so you should ensure that indexes are reorganized or rebuilt whenever necessary. Typically, older data may be fairly static, requiring only occasional index maintenance whereas newer data may need defragmenting after every data load. As already mentioned, this is a big advantage of partitioning, as the partition containing the most recently loaded data can be targeted leaving the rest of the table untouched.
Statistics are also important here. I recommend you turn off the automatic statistics updates, and run a regular job instead. Dimension tables can use the FULLSCAN option as these tend to be quite small and plans will benefit from the more accurate statistics.
SummaryData warehouse performance is very dependent on fast I/O. There are other areas not covered here, such as parallelism, memory, network, etc. These will be discussed in a future article.
The next article will focus on ETL performance with SSIS. Watch this space, or subscribe to my RSS feed (orange XML/RSS button at the top of each page).