Slowly changing dimension (SCD) is a data warehousing concept coined by the wonderful Ralph Kimball. The SCD concept deals with moving a specific set of data from one state to another. Imagine a human resources (HR) system with an Employee table. As the following image shows, Stephen Jiang is a Sales Manager with ten sales representatives in his team:
Today, Stephen Jiang got his promotion to the Vice President of Sales role, so his team has grown in size from 10 to 17. Stephen is the same person, but his role has now changed, as shown in the following image:
Another example is when a customer’s address changes in a sales system. Again, the customer is the same, but their address is now different. From a data warehousing standpoint, we have different options for dealing with the data depending on the business requirements, which leads us to different types of SCDs. It’s important to note that the data changes in the transactional source systems (in our examples, the HR system or a sales system). We move and transform the data from the transactional systems via ETL (Extract, Transform, and Load) processes and land it in a data warehouse, where the SCD concept kicks in. SCD is about how changes in the source systems are reflected in the data in the data warehouse. These kinds of changes in the source system don’t happen very often, hence the term slowly changing. Many SCD types have been developed over the years; covering all of them is out of the scope of this post, but for your reference, we cover the first three types as follows.
SCD type zero (SCD 0)
With this type of SCD, we ignore all changes in a dimension. So, when a person’s residential address changes in the source system (an HR system, in our example), we don’t change the landing dimension in our data warehouse. In other words, we ignore the changes within the data source. SCD 0 is also referred to as fixed dimensions.
SCD type 1 (SCD 1)
With an SCD 1 type, we overwrite the old data with the new. A good example of an SCD 1 type is when the business doesn’t need the customer’s old address and only needs to keep the customer’s current address.
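To make the difference between SCD 0 and SCD 1 concrete, here is a minimal Python sketch. It is only an illustration, not how any particular ETL tool works: the function names, the customer ID 101, and the addresses are all made up, and I am assuming that an SCD 0 load still adds customers it has never seen before.

```python
# Hypothetical customer dimension, keyed by the source system's customer ID.

def apply_scd0(dimension, incoming):
    """SCD 0: existing rows are never touched.
    Assumption: brand-new customers are still added."""
    for key, row in incoming.items():
        dimension.setdefault(key, dict(row))
    return dimension


def apply_scd1(dimension, incoming):
    """SCD 1: overwrite existing rows with the latest values; no history is kept."""
    for key, row in incoming.items():
        dimension[key] = dict(row)
    return dimension


# Made-up sample data: the customer moved to a new address in the source system.
warehouse = {101: {"name": "Sample Customer", "address": "1 Old Street"}}
source = {101: {"name": "Sample Customer", "address": "9 New Avenue"}}

# Work on copies so both behaviours start from the same warehouse state.
scd0_result = apply_scd0({k: dict(v) for k, v in warehouse.items()}, source)
scd1_result = apply_scd1({k: dict(v) for k, v in warehouse.items()}, source)

print(scd0_result[101]["address"])  # 1 Old Street
print(scd1_result[101]["address"])  # 9 New Avenue
```

With SCD 0 the warehouse keeps the old address; with SCD 1 the old address is gone for good, which is exactly why SCD 1 only fits when the business has no use for the previous value.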
SCD type 2 (SCD 2)
With this type of SCD, we keep the history of data changes in the data warehouse when the business needs to keep both the old and the current data. In an SCD 2 scenario, we need to maintain the historical data, so we insert a new row of data into the data warehouse whenever the data in a transactional system changes. A change in the transactional system is one of the following:
- Insertion: When a new row is inserted into the table
- Updating: When an existing row of data is updated with new data
- Deletion: When a row of data is removed from the table
Let’s continue with our previous example of a Human Resources system and the Employee table. Inserting a new row of data into the Employee dimension in the data warehouse for every change within the source system means the same employee appears more than once in the Employee dimension. Therefore, we can’t use the employee key coming from the source system as the primary key of the dimension. Hence, we need to introduce a new set of columns to guarantee the uniqueness of every row of data, as follows:
- A new key column that guarantees rows’ uniqueness in the Employee dimension. This new key column is simply an index representing each row of data stored in a data warehouse dimension. The new key is a so-called surrogate key. While the surrogate key guarantees that each row in the dimension is unique, we still need to maintain the source system’s primary key. The source system’s primary keys are called business keys or alternate keys in the data warehousing world.
- A Start Date and an End Date column that represent the timeframe during which a row of data is in its current state.
- Another column that shows the status of each row of data.
SCD 2 is the most common type of SCD. With the required columns in place, let’s revisit our scenario when Stephen Jiang was promoted from Sales Manager to Vice President of Sales. The following screenshot shows the data in the Employee dimension in the data warehouse before Stephen got the promotion:
The EmployeeKey column is the surrogate key of the dimension, and the EmployeeBusinessKey column is the business key (the primary key of the employee in the source system); the Start Date column shows the date Stephen Jiang started his job as North American Sales Manager, the End Date column has been left blank (null), and the Status column shows Current. Now, let’s have a look at the data after Stephen gets the promotion, which is illustrated in the following screenshot:
As the above image shows, Stephen Jiang started his new role as Vice President of Sales on 13/10/2012 and finished his job as North American Sales Manager on 12/10/2012. So, the data is transformed while moving from the source system into the data warehouse. As you can see, handling SCDs is one of the most important tasks in the ETL processes.
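The promotion scenario above can be sketched in a few lines of Python. Treat this as a rough illustration under my own assumptions rather than a real ETL implementation: the apply_scd2_change function, the business key "E-1001", the original start date, and the "Expired" status label are all hypothetical (the post only names the Current status), and the surrogate key is simply the highest existing key plus one, so it won’t match the key in the screenshot.

```python
from datetime import date, timedelta


def apply_scd2_change(dimension, business_key, attributes, effective_date):
    """Close the current row for business_key (if any) and append a new current row."""
    for row in dimension:
        if row["EmployeeBusinessKey"] == business_key and row["Status"] == "Current":
            # The old version ends the day before the new one starts.
            row["EndDate"] = effective_date - timedelta(days=1)
            row["Status"] = "Expired"  # assumed label for a historical row

    # Surrogate key: highest existing key plus one (illustrative only).
    next_key = max((row["EmployeeKey"] for row in dimension), default=0) + 1
    new_row = {
        "EmployeeKey": next_key,
        "EmployeeBusinessKey": business_key,
        **attributes,
        "StartDate": effective_date,
        "EndDate": None,
        "Status": "Current",
    }
    dimension.append(new_row)
    return new_row


dim_employee = [
    {
        "EmployeeKey": 272,
        "EmployeeBusinessKey": "E-1001",  # hypothetical business key
        "Name": "Stephen Jiang",
        "Title": "North American Sales Manager",
        "StartDate": date(2011, 1, 1),  # hypothetical start date
        "EndDate": None,
        "Status": "Current",
    }
]

apply_scd2_change(
    dim_employee,
    "E-1001",
    {"Name": "Stephen Jiang", "Title": "Vice President of Sales"},
    date(2012, 10, 13),
)

for row in dim_employee:
    print(row["EmployeeKey"], row["Title"], row["StartDate"], row["EndDate"], row["Status"])
```

The same function also covers an insertion in the source system: when no current row exists for the business key, nothing is closed and a new current row is simply appended. Deletions are left out to keep the sketch short.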
Let’s see what SCD 2 means when it comes to data modeling in Power BI. The first question is: Can we implement SCD 2 directly in Power BI Desktop without having a data warehouse? To answer this question, we must remember that we always prepare the data before loading it into the model. On the other hand, we create a semantic layer when building a data model in Power BI. In a previous post, I explained the different components of a BI solution, including the ETL and the semantic layer, but I’ll repeat it here. In a Power BI solution, we handle the ETL processes using Power Query, and the data model is the semantic layer. The semantic layer, by definition, is a view of the source data (usually a data warehouse), optimised for reporting and analytical purposes. The semantic layer is not meant to replace the data warehouse and is not another version of the data warehouse either.
So the answer is that we cannot implement the SCD 2 functionality purely in Power BI. Either we need a data warehouse that keeps the historical data, or the transactional system needs a mechanism for maintaining the historical data, such as a temporal mechanism. A temporal mechanism is a feature that some relational database management systems, such as SQL Server, offer to provide information about the data stored in a table at any point in time instead of keeping the current data only. To learn more, see the SQL Server documentation on temporal tables.
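Whether the history lives in a data warehouse or in a temporal table, the question we want to answer is the same: what did this row look like on a given date? The following Python sketch shows that idea over SCD 2 style rows. It is not SQL Server’s temporal table syntax, just a plain illustration; the row_as_of function, the business key "E-1001", and the original start date are hypothetical.

```python
from datetime import date


def row_as_of(history, business_key, as_of):
    """Return the version of business_key that was valid on as_of, or None."""
    for row in history:
        started = row["StartDate"] <= as_of
        not_ended = row["EndDate"] is None or as_of <= row["EndDate"]
        if row["EmployeeBusinessKey"] == business_key and started and not_ended:
            return row
    return None


history = [
    {
        "EmployeeBusinessKey": "E-1001",  # hypothetical business key
        "Title": "North American Sales Manager",
        "StartDate": date(2011, 1, 1),  # hypothetical start date
        "EndDate": date(2012, 10, 12),
    },
    {
        "EmployeeBusinessKey": "E-1001",
        "Title": "Vice President of Sales",
        "StartDate": date(2012, 10, 13),
        "EndDate": None,  # still the current version
    },
]

print(row_as_of(history, "E-1001", date(2012, 10, 12))["Title"])
print(row_as_of(history, "E-1001", date(2012, 10, 13))["Title"])
```

The first lookup returns the Sales Manager version and the second returns the Vice President version, because the two date ranges meet without overlapping.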
After we load the data into the data model in Power BI Desktop, we have all the current and historical data in the dimension tables. Therefore, we have to be careful when dealing with SCDs. For instance, the following screenshot shows reseller sales for employees:
At first glance, the numbers seem to be correct. Well, they may be right; they may be wrong. It depends on what the business expects to see on a report. Look at Image 4, which shows Stephen’s changes. Stephen had some sales values when he was a North American Sales Manager (EmployeeKey 272). But after his promotion (EmployeeKey 277), he is not selling anymore. We didn’t consider SCD when we created the preceding table, which means it includes Stephen’s sales values (EmployeeKey 272). But is this what the business requires? Does the business expect to see all employees’ sales without considering their status? For more clarity, let’s add the Status column to the table.
What if the business needs to show sales values only for employees whose status is Current? In that case, we would have to factor the SCD into the equation and filter out Stephen’s sales values. Depending on the business requirements, we might need to add the Status column as a filter in the visualizations, while in other cases, we might need to modify the measures by adding the Start Date, End Date, and Status columns to filter the results. The following screenshot shows the results when we use visual filters to take out Stephen’s sales:
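Here is the same filtering logic as a small Python sketch, outside of Power BI, just to show what the Status filter does to the totals. Everything except Stephen’s two EmployeeKey values (272 and 277) is made up for illustration: the sales amounts, the second employee, the "Expired" status label, and the sales_by_employee function.

```python
dim_employee = [
    {"EmployeeKey": 272, "Name": "Stephen Jiang", "Status": "Expired"},  # assumed label
    {"EmployeeKey": 277, "Name": "Stephen Jiang", "Status": "Current"},
    {"EmployeeKey": 281, "Name": "Sample Employee", "Status": "Current"},  # hypothetical
]

# Made-up sales rows; each one points at a surrogate key in the dimension.
fact_sales = [
    {"EmployeeKey": 272, "SalesAmount": 1000.0},
    {"EmployeeKey": 281, "SalesAmount": 250.0},
    {"EmployeeKey": 281, "SalesAmount": 150.0},
]


def sales_by_employee(dim, fact, current_only=False):
    """Total sales per employee name, optionally keeping only Current dimension rows."""
    names = {
        row["EmployeeKey"]: row["Name"]
        for row in dim
        if not current_only or row["Status"] == "Current"
    }
    totals = {}
    for sale in fact:
        name = names.get(sale["EmployeeKey"])
        if name is not None:  # sales tied to a filtered-out row are dropped
            totals[name] = totals.get(name, 0.0) + sale["SalesAmount"]
    return totals


print(sales_by_employee(dim_employee, fact_sales))
# {'Stephen Jiang': 1000.0, 'Sample Employee': 400.0}
print(sales_by_employee(dim_employee, fact_sales, current_only=True))
# {'Sample Employee': 400.0}
```

Stephen’s sales are tied to his old row (EmployeeKey 272), so filtering on Current removes them entirely. Whether that is the right answer depends on what the business wants to see.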
Dealing with SCDs is not always as simple as this. Sometimes, we need to make some changes to our data model.
So, does all of the above mean we cannot implement any types of SCDs in Power BI? The answer, as always, is “it depends.” In some scenarios, we can implement a solution similar to the SCD 1 functionality, which I explain in another blog post. But we are out of luck when it comes to implementing the SCD 2 functionality purely in Power BI.
Have you used SCDs in Power BI? I’m curious to know about the challenges you faced, so please share your thoughts in the comments section below.