SQL Server Logo

Automating the building of BI-Solutions

Coming from a "Old fashioned" development environment (that is: C/C++, CSV, Unix and the likes) I have always wondered why BI-projects don’t seem to get the idea of a structured software development.

There are some technological cornerstones of a good, well structured software development process:

  • Source control systems – Never loose code again, unlimited undo. Tag your releases
  • Automated, daily builds – Make sure you project is ready to deploy and release. Never again spend hours on manually building your project
  • Bug tracking and issue logs – Speaks for itself. A necessary tool for proper project management
  • Multiple environments – At the very least: Development, Test and Production.
  • Change management – Well documented and scripted procedures for moving your code into production

I don’t know about you – but I have yet to see a Business Intelligence projects use all of the above. Some, actually a lot, of the examples I have worked on did not even use ANY of the above. My theory is that BI-development community is in a sense "immature" to the ways of structured software development.

Continue reading…

Ancient Cube

The Many Pitfalls of Attribute Properties in AS 2005

In AS 2000 dimension design, once the database has been loaded, is a 10 minute task. Declare your hierarchies, set up a few member properties, a short look at aggregations settings and off you go. There really was no such thing as a multiple hierarchies. Member properties was a design afterthought – as anyone that has ever designed a virtual dimension can testify. Though MDX feature powerful query abilities – there were still queries that were best served by the relations database.

Enter AS 2005. Dimensions design has become much more powerful. We now have the full flexibility of the relational model build into the UDM. However, this power comes with a heavy price tag.

Continue reading…

Ancient Cube

Arbitrary Shapes in AS 2005

I have recently returned from a performance tuning assignment at one of the largest AS-cube installations in the world. Working on such large installations always teach you a thing or two about the internal workings of the server products in use. Learning this new stuff is what I love the most about performance tuning. My colleague and me had a conference call with the people at Microsoft – and they have been really helpful about sharing their knowledge. I wanted to pass on some of it here.

Continue reading…

How fast can SQL Server 2005 possibly be

I am currently making a relatively deep (read: nerdy) study of how to optimally load a huge data warehouse.

Based on my experience with ETL tools, SQL tuning and index optimization I know that there are several approaches one may take when loading a billion row data warehouse with million row dimension tables. However, very few of them are optimal or even viable…

On question that must be answered to find the optimal stratagy is the following:

Let us assume we have an optimally configured SQL Server system (all sp_configure settings and trace flags optimized for our hardware).

Futhermore, assume we have the following:

  • A 1 GB Full-Duplex network card with the best drivers available
  • The best disk system money can buy (Real spindles – not Solid state)
  • The fastest CPU of each class: x86, x64 or IA-64 architecture (there is no parallel execution happening here – yet)
  • Plenty of RAM allocated to sqlserv.exe

Now, let us run these three simple statements against our warehouse database:

  • SELECT SurrogateKey FROM DimensionTable WHERE EntityKey = @ek
  • UPDATE DimensionTable SET EntityKey = @ek WHERE SurrogateKey = @sk
  • INSERT INTO DimensionTable (SurrogateKey, EntityKey) VALUES (@sk, @ek)

These are (simplified) version of the queries used to respectively make key lookups, type 1 and type 2 dimension changes

Assume the following about the execution of the above statements

  • We are using the OLEDB driver subsystem to communicate with SQL Server
  • Optimal indexes are in place to support all queries
  • No page splitting occurs
  • The Fillfactor is 100% in the indexes
  • The query plan for all statements is in the plan cache
  • The SELECT can be serviced from the buffer pool by looking at only 3 pages in the index B-tree (optimistic guess)
  • The UPDATE and the INSERT each require only one I/O operation (optimistic guess)

For our warehouse architecture we can consider two viable scenarios:

  • Statements executed from another machine on the network to the server (crossing the 1GB Full-Duplex link)
  • Statements executed from same machine (doing in memory movement where possible)

Now my question is: How fast can these statements be if we have the best software, tuning and hardware available?…

Or in Data Warehouse terms:

"What is the absolute lowest time used pr. row when loading a dimension table using a naïve (read: straightforward) approach to dimension loading"

BTW: (An example of such a naïve approach is the one employed by the SSIS 2005 SCD transformation)

In a later post I will explore a non-naïve approach which i consider optimal for dimension loading for dimension of any size.