I recently came across a rather unusual data composition, at least for me, and wanted to share the solution to the business problem.
Given is a single source file that contains facts and some dimensional identification. In this case, we have a single measure, let's call it A, and several columns of data that are to become dimensions. In particular, we have a consumer C and a salesman S. C is identified by a unique identifier with which we can look up the basic data in another system, so those rows are in the clear. The salesman S, on the other hand, is identified not by one (1) attribute, but by up to three (3) attributes with which we can look up basic data in three (3) separate systems. By up to three (3) attributes, I mean that sometimes only one of the three is present.
The thing is that one attribute of salesman S is more important than the other two. So let's call the attributes S1, S2 and S3, where S1 is more important than S2, and S3 is the least important. From some of the S2 attributes, we can deduce S1 via a separate system. The same goes for S3, from which we can deduce some S1 entries. The goal is to have as many records as possible with S1 populated.
The idea is to isolate the rows that already contain a populated S1 attribute into a flow of their own, which we name Flow1. Part of the data contract is that we cannot deduce S2 or S3 from any of the rows that contain S1, in order to use that data to populate other rows. Rows in Flow1 we tag as Type1.
The rows that only contain the attribute S2 we then send off into a flow of their own, Flow2, where we look up S1 in an external system. The rows that have a match are sent back into the first flow, Flow1. The rows that don't match we send on to Flow3.
Flow3 is the flow where we use the S3 attribute to deduce the S1 attribute. S3 can have multiple S1 attributes related to it in the external system, hence we distinguish between the ones that have one and only one relation and those that have more than one. If matched by exactly one S1 attribute, we send the rows back to Flow1. Otherwise we tag these entries as Type2 if multiple S1 attributes are related, or as Unknown if S3 is not present.
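The three-flow routing above can be sketched in a few lines. This is only an illustration of the logic, not the actual implementation; the names (Row, lookup_s1_by_s2, lookup_s1s_by_s3) and the hard-coded lookup tables are hypothetical stand-ins for the external systems.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Row:
    s1: Optional[str] = None
    s2: Optional[str] = None
    s3: Optional[str] = None
    tag: Optional[str] = None

def lookup_s1_by_s2(s2):
    # Stand-in for the external S2 -> S1 lookup system.
    return {"s2x": "s1x"}.get(s2)

def lookup_s1s_by_s3(s3):
    # Stand-in for the external S3 -> S1 lookup; may relate several S1s.
    return {"s3y": ["s1y"], "s3z": ["s1a", "s1b"]}.get(s3, [])

def route(row: Row) -> Row:
    # Flow1: S1 is already present.
    if row.s1:
        row.tag = "Type1"
        return row
    # Flow2: try to derive S1 from S2, then rejoin Flow1.
    if row.s2:
        s1 = lookup_s1_by_s2(row.s2)
        if s1:
            row.s1, row.tag = s1, "Type1"
            return row
    # Flow3: derive S1 from S3 only when the relation is unambiguous.
    matches = lookup_s1s_by_s3(row.s3) if row.s3 else []
    if len(matches) == 1:
        row.s1, row.tag = matches[0], "Type1"
    elif len(matches) > 1:
        row.tag = "Type2"
    else:
        row.tag = "Unknown"
    return row
```

A row entering with only S2 = "s2x" would come out of Flow2 with S1 populated and tagged Type1, while a row whose S3 relates to several S1s ends up tagged Type2.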
So how do we go about sorting and selecting the right rows for our dimensions?
The idea is to tag each flow with a score, much like suggested in this excellent blog post by Todd McDermid (blog|twitter), depicted below, only with a twist:
Sometimes rows in Flow1, where S1 and S2 are correlated, can have different values for S3, so we need to address that on top. And since S3 is not numerical, we cannot eliminate the ones we don't like in the aggregate component. We apply the same logic for this issue as with S2, with some help from the external source. During the time slice that the dataset represents, S1 can be associated with more than one S3, although only one at a time. For example, let's say S1 correlates to S3a in the first fifteen (15) rows, then correlates to S3b for another fifteen (15) rows, and then switches back to S3a. Consider the relation as if a salesman made sales in his first shop (S3a), then in his second shop (S3b), and then the day after it repeats again. To solve this properly, we would have to implement a many-to-many design (see M2M for in-depth information). But in our case it is not important, at this point, to know in which store the seller made the sale, only whether he/she is the sole seller in that particular store.
Our twist, in reference to the solution proposed in the above-mentioned blog post, is to add an additional score for S3. We end up grouping by BusinessKey and aggregating with MIN/MAX, whichever is appropriate, on both scores.
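The group-and-aggregate step can be sketched as follows, assuming the convention that a lower score means a more trustworthy flow, so MIN picks the preferred row per BusinessKey. The column names and score values here are illustrative, not taken from the actual solution.

```python
def pick_best(rows):
    # Keep, per BusinessKey, the row with the minimal (score, s3_score)
    # pair - the twist being the second score that resolves S3 conflicts.
    best = {}
    for r in rows:
        k = r["business_key"]
        if k not in best or (r["score"], r["s3_score"]) < (
            best[k]["score"],
            best[k]["s3_score"],
        ):
            best[k] = r
    return best

rows = [
    {"business_key": "BK1", "score": 1, "s3_score": 2, "s1": "s1a"},
    {"business_key": "BK1", "score": 1, "s3_score": 1, "s1": "s1a"},
    {"business_key": "BK2", "score": 3, "s3_score": 1, "s1": None},
]
```

For BK1 the two candidate rows tie on the primary score, and the secondary S3 score breaks the tie, which mirrors what the MIN aggregate on both score columns does in the data flow.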
This blog post is meant as inspiration, and thoughts or corrections are as welcome as ever – the learning curve never ends 🙂