Defining the Good Data Model

Designing data models is fun – at least if you are a geek like me. But as much as I enjoy the academic thrill of building something complex, I am aware that it is humans who eventually have to see and maintain my beautiful (data) model. How can we design something that humans can understand?

Humans are buggy! In general, they don’t deal well with complexity. You can blame modern education, you can scream and shout, or lament the fact that the IT industry is riddled with incompetence; you may even throw Kimball or Inmon books at the wall in anger. But the empirical tests all show the same thing: the wetware is the final test of the model.

I will venture a definition of what constitutes a “good data model” for a DW/BI solution:

Good DW model ::=

  1. Data in a good model can be easily consumed
  2. Large data changes in a good model are scalable
  3. A good model provides predictable performance
  4. A good model can adapt to changes in requirements, but not at the expense of 1-3

Ad 1) This is the requirement that makes the model usable and maintainable by humans. For now, note that if you are serving data to an end user, you will most likely need something that is largely de-normalized, or at least looks like it is. If anyone reading this has successfully implemented a large, 3rd normal form data model that non-computer-science users find delightful to query for reporting purposes – please comment; I would be curious to know how you did it.
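To make the contrast concrete, here is a hypothetical sketch (the table and column names are mine, not taken from any particular model). Against a de-normalized star schema, a typical business question is a short query over one fact table and a couple of dimensions; against a normalized source model, the user has to know and walk the whole join path before the aggregation can even begin.

    -- De-normalized star schema: pick measures, filter on dimension attributes.
    SELECT d.CalendarYear,
           p.ProductCategory,
           SUM(f.SalesAmount) AS TotalSales
    FROM FactSales AS f
    JOIN DimDate    AS d ON d.DateKey    = f.DateKey
    JOIN DimProduct AS p ON p.ProductKey = f.ProductKey
    WHERE d.CalendarYear = 2012
    GROUP BY d.CalendarYear, p.ProductCategory;

    -- The same question against a 3rd normal form source model: five joins
    -- across the order and product hierarchies, and this is a simple example.
    SELECT YEAR(oh.OrderDate) AS CalendarYear,
           pc.CategoryName    AS ProductCategory,
           SUM(ol.LineAmount) AS TotalSales
    FROM OrderHeader        AS oh
    JOIN OrderLine          AS ol ON ol.OrderId       = oh.OrderId
    JOIN Product            AS pr ON pr.ProductId     = ol.ProductId
    JOIN ProductSubcategory AS ps ON ps.SubcategoryId = pr.SubcategoryId
    JOIN ProductCategory    AS pc ON pc.CategoryId    = ps.CategoryId
    WHERE oh.OrderDate >= '2012-01-01' AND oh.OrderDate < '2013-01-01'
    GROUP BY YEAR(oh.OrderDate), pc.CategoryName;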

Ad 2) Our goal is to design data warehouses. This means that we have taken on the task of collecting and aggregating all the information in an organisation. We can generally buy more hardware (if we put up a fight with infrastructure) – but we cannot significantly reduce incoming data sizes and still maintain flexibility. Scalability of the model means that we maintain load speeds gracefully as data grows, by adding more hardware resources.
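As one concrete illustration (my own example, in SQL Server syntax – the definition above does not prescribe any particular technique), a model can stay load-scalable by declaring large fact tables as partitioned tables, so that the cost of each load tracks the new delta rather than everything loaded so far, and the work can be spread over more hardware as volumes grow:

    -- Hypothetical, date-partitioned fact table; all names are made up.
    CREATE PARTITION FUNCTION pfSalesDate (int)
        AS RANGE RIGHT FOR VALUES (20120101, 20120201, 20120301);

    CREATE PARTITION SCHEME psSalesDate
        AS PARTITION pfSalesDate ALL TO ([PRIMARY]);

    CREATE TABLE FactSales (
        DateKey     int   NOT NULL,
        ProductKey  int   NOT NULL,
        SalesAmount money NOT NULL
    ) ON psSalesDate (DateKey);

    -- The periodic load goes into a staging table with the same structure
    -- (and a CHECK constraint matching the target range), then switches into
    -- the newest partition as a metadata-only operation.
    ALTER TABLE StageSales SWITCH TO FactSales PARTITION 4;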

Ad 3) If there is one thing end users hate more than slow performance, it is unpredictable performance. The same is true for the poor DBA who has to get up in the middle of the night to fix a batch run that did not complete in time. The good data model seeks to minimise risk by creating predictable response times. Note that this may even happen at the cost of some performance: the predictably “good enough” is not always the fastest (though we take delight in our design when it is). For now, note that predictable performance is not always in reach – but it is a goal we aspire to.

Ad 4) I foresee that this will turn out to be one of the central points of contention in the discussions to follow. Changes are a fact of life. If you have not yet experienced the intense pain of changing a beautifully designed data model halfway through a project, I recommend you go work a few years for any organisation led by a guy with an MBA degree who believes 10% growth is sustainable forever and that the organisation must react quarterly to “market changes”. I am sure you will find plenty of such employers willing to teach you the nature of change in return for good mammon. For now, I will assume that you believe that “change happens” – though we shall later look more closely at what form such change typically takes.

Requirements 1-3 must often be balanced against requirement 4. The data model must be flexible in some way; it must remain agile. Yet I will not sacrifice the usability of the model on the altar of agility. And staring into the abyss of slow performance while chanting protective spells of “flexibility” is not the game I am in either. I hope to show you that doing a good job on 1-3 will often remove some of the fears bred by requirement 4.