Ok, I thought it would be a good idea to write down a little bit about my experience in building Data Warehouses. Since 1997 I have been involved in several large-scale Data Warehouse implementations, mainly in the retail, telco, and banking industries. I have had hands-on experience building some of them with different technologies, and I have also observed other implementations while performing "Data Warehouse Assessment" studies for companies that built Data Warehouses without my intervention (what a pity). So I think I have spent enough time in the field to talk about what can go wrong or right.
Some of the experiences I will outline here are from a purely technical point of view, some apply to specific technologies, and some are fundamental and about the methodology you need to follow. Most probably some are too obvious, but from time to time I still observe that people need these sorts of small tricks.
I will do my best to give you a short and entertaining list of principles. At this point, if you do not agree with spending your valuable time reading about Data Warehouse techniques that might not apply to your daily life or might be obsolete, then you may leave; I cannot blame you. But for the rest of you who have time and are wondering about some insights from field experience, please read on.
Please bear in mind that the ideas expressed here are my subjective opinions, so I am sure there are other people around who might argue just the opposite. But I still believe there are some points here that could be useful for those brave souls in the field trying to resolve practical problems.
Business ownership in a Data Warehouse project is a key success factor. Even though we use cutting-edge technology products and techniques and talk in technical jargon, at the end of the day, what we are trying to solve is a business pain. So the project should be owned and directed by business groups, and even the project budget should be funded by the business.
This might seem too obvious, but even today I still observe IT owning Data Warehouse projects. This is a serious risk: you might end up with a multi-terabyte, multi-million project finished while the business pain is still there, or with the business still complaining about the reports or analyses that are vital to their daily work. Worse, with your multi-terabyte baby in place, no one is interested in using it, or the business goes hunting for BI solutions that duplicate your data and effort.
To cut it short: a Data Warehouse project is a business initiative, not an IT-owned technology implementation.
Create a multi-phase implementation plan with an acceptable timeline for each step. A complex Data Warehouse implementation might take a full two years of development, and unfortunately, if you choose to deliver it in one phase, by the time you have finished everything the problems you set out to resolve will have changed and business interest will have disappeared, regardless of who owns the project.
The best practice for enterprise Data Warehouse projects is to have a subject-area deliverable every 3-4 months, so that you solve business problems as you go and keep people interested in and supporting the project. Having said that, for practical reasons like the initial load and configuration, the first step might take a little longer.
Now you've got the idea: a multi-phase approach, solving a problem in each phase and never letting the business focus disappear.
It is hard to keep up with the speed of change in the technology field, and Data Warehousing is no exception. In-memory processing, cloud computing, NoSQL, self-service BI, Hadoop: all the concepts we talk about in our industry have a place in Data Warehousing today.
It is wise to follow new-generation solutions in technology, but bear in mind that not every technology is equal in Data Warehousing. It also does not mean your current technology preferences will work as you wish when it comes to a multi-terabyte Data Warehouse project.
If you have a one-vendor approach in your IT department, you might need to look at other tools outside your comfort zone, or be prepared for the risk of a Data Warehouse restructuring project in the future.
Please carefully inspect each component's ability to meet your SLAs with the business, and make sure that the components you choose will keep up with the growth of your business and hence your data. Your database should scale well as your data grows, your ETL tool should let you implement business changes in a timely manner without compromises, your BI selection should let business users explore self-service analytics, and so on.
This is critical. You need to design for performance and implement the fastest techniques for both reads (user queries) and writes (loads, data movements). The technology you prefer should provide enough performance for user requirements. Do not assume that just any database technology will give you the best performance; it is not a commodity, it is still innovation.
It is quite common, especially in large-scale implementations, for the data load window to take hours and sometimes finish after the start of business hours; we also observe that poor choices or designs result in user queries taking minutes if not hours.
If you fail to meet the performance requirements, you will need to make unexpected spending in the near future. Will you have any budget left?
Get a piece of paper and do the math. You can use the great tool called Excel too. Write down the tables in your physical model along with their current (estimated or real) sizes and future growth.
No, this is not as simple as you think. You should consider a lot of details, including how your business will grow and how your tables will grow: daily, monthly, annually? How many updates, how many inserts? And how do you think you should handle deletes? How much history will be needed as hot or cold data? How will you purge your tables? How many concurrent users will the system support? Get prepared for all of this and more.
Doing the sizing calculations correctly will help you estimate your software and hardware costs. Plus, you will be able to design your physical data model and transformations correctly.
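The sizing arithmetic above can be sketched in a few lines. This is a minimal back-of-the-envelope calculator, not a definitive model: the row size, daily insert volume, growth rate, and storage overhead factor below are all hypothetical placeholders you would replace with your own estimates.

```python
# Back-of-the-envelope sizing sketch for one fact table.
# All figures (row size, daily volume, growth rate, overhead) are
# hypothetical placeholders -- substitute your own estimates.

def table_size_gb(rows, avg_row_bytes, overhead=1.3):
    """Raw data size plus an assumed ~30% overhead for indexes and metadata."""
    return rows * avg_row_bytes * overhead / 1024**3

def projected_rows(daily_inserts, days, annual_growth=0.10):
    """Rows accumulated over `days`, with insert volume growing each year."""
    rows, rate = 0, daily_inserts
    for day in range(days):
        if day and day % 365 == 0:
            rate *= 1 + annual_growth   # assumed annual growth in daily volume
        rows += rate
    return int(rows)

rows_3y = projected_rows(daily_inserts=2_000_000, days=3 * 365)
print(f"rows after 3 years: {rows_3y:,}")
print(f"estimated size: {table_size_gb(rows_3y, avg_row_bytes=200):.0f} GB")
```

Repeat this per table, and the totals feed directly into your hardware and license cost estimates.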
If it works with 100 rows, and there is any possibility it will grow to 100 million rows, then take my advice and test it with 100 million rows before going to production.
Test it with your peak load. Estimate your concurrency, and see what happens to your system when 100 concurrent users throw full table scans at your large fact tables.
Estimate your daily, weekly, monthly, and end-of-year data and query load, together with the initial load, and test it with your code and components. Make sure that your code will work under extreme conditions with acceptable performance.
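A concurrency test of the kind described above can be as simple as firing N simultaneous queries and recording latencies. This is a sketch only: `run_query` is a placeholder you would wire to your actual database driver, and the query text is an illustrative example, not a real benchmark.

```python
# Minimal concurrency smoke-test sketch: fire N simultaneous queries
# and summarize the observed latencies. `run_query` is a placeholder --
# replace the sleep with a real call through your DB driver.
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(query):
    """Placeholder: times one query; the sleep stands in for execution."""
    start = time.perf_counter()
    time.sleep(0.01)
    return time.perf_counter() - start

def load_test(query, concurrency=100):
    """Run `concurrency` copies of the query at once, return p50/max latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(run_query, [query] * concurrency))
    latencies.sort()
    return {"p50": latencies[len(latencies) // 2], "max": latencies[-1]}

print(load_test("SELECT SUM(amount) FROM sales_fact", concurrency=100))
```

Run it with your peak concurrency estimate and your heaviest real queries, and compare the p50 and max against the SLAs you agreed with the business.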
Do the data cleaning
It is a dirty business, really. Do not trust the cleanliness of the data unless you perform a data assessment. If you let uncleaned data leak into user reports, you will be in big trouble: you will lose business users' trust in the Data Warehouse data, and it will be very challenging to win it back.
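A data assessment does not have to start big. Here is a tiny profiling sketch that counts the kinds of dirt that typically leak into reports: missing keys, duplicate keys, and out-of-range values. The field names (`customer_id`, `amount`) are hypothetical; adapt the checks to your own model.

```python
# Minimal data-assessment sketch: profile a batch of extracted rows
# for nulls, duplicates, and out-of-range values before loading them.
# Field names are illustrative placeholders.

def assess(rows, key="customer_id"):
    seen = set()
    report = {"rows": len(rows), "null_keys": 0,
              "duplicate_keys": 0, "negative_amounts": 0}
    for row in rows:
        k = row.get(key)
        if k is None:
            report["null_keys"] += 1
        elif k in seen:
            report["duplicate_keys"] += 1
        else:
            seen.add(k)
        if (row.get("amount") or 0) < 0:
            report["negative_amounts"] += 1
    return report

batch = [{"customer_id": 1, "amount": 10.0},
         {"customer_id": 1, "amount": -5.0},    # duplicate key, negative
         {"customer_id": None, "amount": 3.0}]  # missing key
print(assess(batch))
```

Running checks like this on every incoming batch, and rejecting or quarantining what fails, is far cheaper than rebuilding user trust after a bad report ships.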
Get your data
In a large enterprise there are possibly many source systems performing different functions. Sometimes they are alive 24/7, so extracting the data is subject to SLAs with the source system owners. Usually, extraction is performed in a time window when the source system is less busy, probably after the daily jobs finish around midnight.
As a best practice, extract the data without any transformations on the source system, i.e. no joins, no group-bys, etc. Extract the data and finish your work on the source system as soon as possible.
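A plain, no-transformation extract can look like the sketch below: select the raw columns, stream them to a flat file in batches, and get off the source system quickly. The table name and connection are placeholders, and `sqlite3` merely stands in for whatever driver your source database uses.

```python
# Straight extract sketch: SELECT the raw table (no joins, no group-bys)
# and stream it to a CSV in batches. sqlite3 stands in for your real
# source system driver; table and path are placeholders.
import csv
import sqlite3

def extract_table(conn, table, out_path, batch=50_000):
    cur = conn.execute(f"SELECT * FROM {table}")  # no transformations
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(col[0] for col in cur.description)  # header row
        while rows := cur.fetchmany(batch):                 # stream in batches
            writer.writerows(rows)
```

Batched fetching keeps memory flat however large the table is, and writing a raw file first means any heavy transformation work happens on your side of the fence, not on the busy source system.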
But source system designs sometimes make this job a lot more complex than it seems. Sometimes, due to the system design, there is no clue that a record was just updated, or no history of updates at all. A simple case: a customer changes his or her address, and the required fields are simply updated in the source system with no indication of the previous address. In simple technical terms, when the data is moved into the Data Warehouse we could apply the same update and lose the history, i.e. the customer's previous location. Most of the time this is a loss of a very important piece of information required by the business. There are alternatives for solving this problem, like using a Change Data Capture (CDC) tool to capture exactly what data changed, or getting the data as a whole and finding the real updates during the transformation process. It all depends on your strategy and budget.
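One common way to keep the history the source system throws away is a Type 2 slowly changing dimension: instead of updating the row in place, you close out the old version and insert a new one. The sketch below works on in-memory rows for illustration; the column names (`customer_id`, `address`, `start_date`, `end_date`) are illustrative, not a prescribed schema.

```python
# Type 2 slowly changing dimension sketch: preserve history by closing
# the current row and appending a new version instead of updating in
# place. In-memory rows; column names are illustrative.
from datetime import date

def apply_scd2(dimension, incoming, today=None):
    """Close the open row when a tracked attribute changes, add a new version."""
    today = today or date.today()
    current = next((r for r in dimension
                    if r["customer_id"] == incoming["customer_id"]
                    and r["end_date"] is None), None)
    if current and current["address"] == incoming["address"]:
        return  # nothing changed, keep the current version open
    if current:
        current["end_date"] = today  # close out the old version
    dimension.append({"customer_id": incoming["customer_id"],
                      "address": incoming["address"],
                      "start_date": today, "end_date": None})

dim = [{"customer_id": 7, "address": "Old St 1",
        "start_date": date(2020, 1, 1), "end_date": None}]
apply_scd2(dim, {"customer_id": 7, "address": "New Ave 9"},
           today=date(2021, 6, 1))
# dim now holds both the closed-out old address and the current one
```

The same effect can be achieved in SQL with a MERGE or an update-plus-insert pair; the point is that the warehouse keeps every version with its validity dates, even when the source keeps only the latest.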
Now, let's stop here for a while; there will be more in Episode II. I hope you've enjoyed it so far, and that you will join me for the upcoming part, which you can find here if you want to jump in directly.