Written by Luis Miguel – Originally Published on Sept. 14, 2014
One of the most practical reasons for the use of surrogate keys when loading dimensions in a data warehouse is the need to aggregate data from disparate systems.
Implementing a data warehouse works well because it brings together data from two or more systems that exist independent of each other but need to be analyzed together.
Very frequently, companies acquire other businesses or are a conglomerate of small companies that operate independently. Modern data management techniques allow us to see a global perspective through the use of a data warehouse.
One of the challenges for a data warehouse architect is to build a system that homogenizes those autonomous data sources. In the example I use in this post, we see that a holdings company owns two paint distribution/retail operations. One of the entities in both of the transactional systems is named “colors”, and color is a dimension in the data warehouse because we need to analyze sales and other performance indicators by color.
The color tables in each of the transactional systems uses a primary key (color_id); in this case, the primary keys are integers. As expected, these keys cannot be reused in the dimension because they overlap/conflict, but we need to resolve this inconsistency to populate the color dimension and the sales fact table.
To demonstrate, the database for the Colorama store has a color named “White” with a primary key of “1”, while the primary key for “White” in the Paintmart system is “2”.
In this very simple example I’m going to use an intermediate “keymap” table where I’m going to do the mapping (and translation) from the transactional systems’ primary keys (color_id) to the dimension table surrogate key (dim_color_key), so that we have a consistent unique key for every color no matter where it comes from. We then use this surrogate key in both the color dimension and the sales fact tables.
This same technique can be used to do the translation of other entity to dimension data, for example in this scenario the same situation is present in the “size” table.