In this blog, I will argue for and against key architectural choices that face the data modeler who wants to build a big data warehouse. I will do my utmost to avoid hand waving and references to “Logical Models”. Indeed, I find that talking about a “Logical Model”, whatever that is nowadays, distracts from the real issues that must be solved by the designer and that such model provide very little value in a real life implementations. I will therefore move directly back and forth from between conceptual (business user understandable) design and the actual implementation.
My goal is to help the engineer/designer construct a workable data model. By “workable”, I mean a model that can both be maintained and understood by users – both business users, operation management teams, ETL developers, DBA and BI developers.
The discussion is by no means specific to the Microsoft database products – it just happens to the platform I know the best. I will therefore beg your forgiveness in advance for restricting myself to Microsoft SQL Server examples. If it is any comfort, I am sure users of Oracle, DB2, TeraData or other database platforms will find the concepts I speak about familiar. When there are platform specific differences I know about, I will try to highlight them. It is my hope that comments in this blog posted by other platform users will enrich the discussion with information specific to their platform of choice.
With these expectations set, and before we dig into the details, let me give you a quick overview of the architecture I will be arguing for:
The purple data stores here are all physical: they represent data stored on disk or other durable media.
The cube at the end is optional, it may not even be physically stored on disk. For example, tools like Business Objects, Microstrategy and Microsoft BI Semantic Model and UDM all allow a "SQL pass through” mode that exposes a relational model as if it was a multi dimensional cube.
Even the data Data Mart is optional in some cases – which I will describe.
Every orange arrow represents a physical move of the data using some ETL tool of SQL based data movement. This is not “logical” or “virtual” moves of data – I truly mean that data is physically move from one table to another.
At this point, I may already have disagreement (I imagine staging is up for debate and so is the mart strategy above). There is also a big question about the data modeling strategies to employ in each storage area.
I promised to make dig into the details, so even if you already now disagree with me, follow my blog as I explore the arguments for this particular implementation of a data warehouse.