This month marks the 72nd T-SQL Tuesday. Adam Machanic (b|l|t) started the T-SQL Tuesday blog party in December of 2009. Each month an invitation is sent out on the first Tuesday of the month, inviting bloggers to participate in a common topic. On the second Tuesday of the month all the bloggers post their contribution to the event for everyone to read. The host sums up all the participants’ entries at the end of the week.
This month Mickey Stuewe (b|l|t) is the host and the topic is …
Data Modeling Gone Wrong
I am really looking forward to the other entries – because “cleverness” never seems to know any boundaries when it comes to Database Design or Modeling, and I just know there are some crazy things going on out there. So be sure to tune in to the host summary, which will appear on Mickey’s blog in the near future.
For my own part, I will refer to the latest “bright idea” that I came across. The application in question was based on SQL Server and was used to track price quotes. It was in production use at the time, enhancements were still being developed, and the business depended heavily on it being “on-line”.
In general there were three (3) types of quotes. For each type, there was a set of tables in the database that was almost identical to the other two. Only a few attributes were different, depending on type, but an inconsistent naming convention still gave away the major relations between tables in the database.
This meant that the application relied on three (3) sets of almost identical tables. This could most definitely have been designed differently, but it’s not the real cluster f**k of this application, so I will not go into detail here.
In every table of the database, there was a column whose name had the postfix ‘key‘. There was also a column whose name had the postfix ‘id‘ in all of the tables. At first sight that looked like two tech leads having a ball. But actually it was not. I discovered later that there had been only one “architect” behind this application. The good thing about that was that the problem was easy to “contain”, since that particular employee was no longer working on the project.
After some investigation and data profiling, I slowly homed in on the fact that the data in columns blablabla_key and blablabla_id wasn’t related in any way. Nor was key in one table related to key in any other table. Neither was id. In fact, there were no foreign keys defined in any of the tables, so no referential integrity was ensured. In theory, even if I found the right match, crappy data could actually obscure the fact that there really was a relation.
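The kind of profiling described above can be sketched as a value-overlap check between a candidate child column and a candidate parent column. This is only a minimal illustration, assuming Python's built-in sqlite3 module as a stand-in for SQL Server; the tables, the data, and the helper `overlap_ratio` are all hypothetical and not taken from the actual application.

```python
import sqlite3

def overlap_ratio(conn, child_table, child_col, parent_table, parent_col):
    """Share of distinct non-null child values that also occur in the parent column.

    Identifiers are interpolated into the SQL text, so only pass trusted names.
    """
    sql = f"""
        SELECT COUNT(*),
               SUM(CASE WHEN EXISTS (SELECT 1
                                     FROM {parent_table} p
                                     WHERE p.{parent_col} = c.v)
                        THEN 1 ELSE 0 END)
        FROM (SELECT DISTINCT {child_col} AS v
              FROM {child_table}
              WHERE {child_col} IS NOT NULL) c
    """
    total, matched = conn.execute(sql).fetchone()
    return (matched or 0) / total if total else 0.0

# Hypothetical demo data: ValidityId happens to carry quote keys,
# while ValidityKey is unrelated to anything in Quote.
profile_conn = sqlite3.connect(":memory:")
profile_conn.executescript("""
    CREATE TABLE Quote (QuoteKey INTEGER, QuoteId INTEGER);
    CREATE TABLE Validity (ValidityKey INTEGER, ValidityId INTEGER);
    INSERT INTO Quote VALUES (101, 1), (102, 2), (103, 3);
    INSERT INTO Validity VALUES (900, 101), (901, 102), (902, 104);
""")

id_to_key = overlap_ratio(profile_conn, "Validity", "ValidityId", "Quote", "QuoteKey")
key_to_key = overlap_ratio(profile_conn, "Validity", "ValidityKey", "Quote", "QuoteKey")
```

A high ratio only hints at a relation; with dirty data a real relation can still score low, which is exactly the problem when no foreign keys are defined.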
Further investigation led to the conclusion that id and key were in no way inter-related either. So id in one table was not related to key in another, similarly named table. No, it turns out that for every quote the business operates with a term called validity. The id of the respective validity, one (1) of three (3), is related to a specific quote key, which was in no way reflected in the naming convention of either tables or columns. E.g.: QuoteKey was related to ValidityId, in each of the respective quote type schemas. But that’s not it. For each of the three types of quote, two validities had been combined into one table. In fact, additional logic had to be applied: if one validity was not present in said table, the other should take over. Sort of a fail-safe validity. This meant that keys, if not present, were to be treated differently…
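The fail-safe rule can be illustrated with a LEFT JOIN and COALESCE. This is a sketch under loud assumptions: it uses Python's sqlite3 module instead of SQL Server, and it guesses that the two validities sit side by side as two columns in one row. The column names `PrimaryValidTo` and `FallbackValidTo`, the `Customer` column, and all data are hypothetical.

```python
import sqlite3

validity_conn = sqlite3.connect(":memory:")
validity_conn.executescript("""
    CREATE TABLE Quote (QuoteKey INTEGER PRIMARY KEY, Customer TEXT);
    -- Assumed layout: both validities combined into one row per quote.
    CREATE TABLE Validity (
        ValidityId      INTEGER PRIMARY KEY,  -- carries the quote key
        PrimaryValidTo  TEXT,                 -- hypothetical column
        FallbackValidTo TEXT                  -- hypothetical column
    );
    INSERT INTO Quote VALUES (101, 'A'), (102, 'B'), (103, 'C');
    INSERT INTO Validity VALUES
        (101, '2015-12-31', '2016-06-30'),    -- primary present
        (102, NULL,         '2016-06-30');    -- primary missing, fallback takes over
""")

effective = validity_conn.execute("""
    SELECT q.QuoteKey,
           COALESCE(v.PrimaryValidTo, v.FallbackValidTo) AS EffectiveValidTo
    FROM Quote q
    LEFT JOIN Validity v ON v.ValidityId = q.QuoteKey  -- the undocumented relation
    ORDER BY q.QuoteKey
""").fetchall()
```

Quote 103 has no validity row at all, so its effective validity comes back as NULL. That is the case where a missing key has to be treated differently by whatever consumes the result.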
Needless to say, I spent a good amount of time trying to figure out how to combine the data, so we could do some proper business intelligence on top. Conference call after conference call proved fruitless, and still to this day I am not sure I could have asked the developers any differently. The disconnect between me and them was huge, too huge to even identify, it seems. Asking one thing and getting a seemingly sane answer just made it even more difficult, because none of the answers held up when querying the database. In the end, brute-force profiling of the database made the relations stick out.
So, to wrap up: get serious about naming conventions, even if they seem to be a pain. Secondly, and more importantly, do define the relations in the database. There are several benefits of doing that, even though you will probably meet someone who will argue against it.
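As a small illustration of what a declared relation buys you, here is a sketch that again assumes Python's sqlite3 module rather than SQL Server, with hypothetical table and column names. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`, whereas SQL Server enforces a trusted constraint by default.

```python
import sqlite3

fk_conn = sqlite3.connect(":memory:")
fk_conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific opt-in
fk_conn.executescript("""
    CREATE TABLE Quote (
        QuoteKey INTEGER PRIMARY KEY
    );
    CREATE TABLE Validity (
        ValidityKey INTEGER PRIMARY KEY,
        -- The column name matches the column it references,
        -- and the relation is declared instead of implied.
        QuoteKey INTEGER NOT NULL REFERENCES Quote (QuoteKey)
    );
    INSERT INTO Quote VALUES (101);
    INSERT INTO Validity VALUES (1, 101);
""")

try:
    # No quote 999 exists, so this row would be an orphan.
    fk_conn.execute("INSERT INTO Validity VALUES (2, 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

validity_rows = fk_conn.execute("SELECT COUNT(*) FROM Validity").fetchone()[0]
```

With the constraint in place the database itself refuses the orphan row, and the relation is documented in the schema rather than in someone's head.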
As a smart exit remark, I wanted to end with a quote (only fitting for this post), but unfortunately I wasn’t able to find the original source for it. I did some extensive searching on Twitter yesterday, without any luck. To the best of my knowledge, this is something Adam Machanic (b|l|t) is supposed to have said. Don’t kill me if it’s not:
“Performance doesn’t matter, when you have Dirty Data.”
What I read from this quote is that if you have dirty data, no performance in the world will ever make up for that. This is where Foreign Key constraints become your best friend. There are actually several good articles out there on Foreign Key Constraints:
Do Foreign Key Constraints Help Performance? by Grant Fritchey (b|l|t)
Foreign Key Constraints: Friend or Frenemy? by Mickey Stuewe (b|l|t)
Do Foreign Keys Matter for Insert Speed? by Jeremiah Peschka (b|l|t)
Thanks to Mickey Stuewe for hosting such a brilliant topic. I very much look forward to reading the other entries!