Let’s continue from where we left off. For those who are wondering what Episode I was about, here is the link.
You need an Enterprise Data Warehouse Data Model to integrate all of your data; it is, in effect, an abstraction of your business. Easy to say, hard to implement. I am not going into the details of data modelling, but I want to stress a few points here. There are a number of so-called industry models on the market for specific businesses like telco, retail, banking, insurance etc. You can use them as a framework, drawing on industry experts’ best practices, to make sure you do not have to re-implement everything as your business changes and that you did not miss anything during the modelling phase. Developing your own data model instead of using a generic industry one is, of course, always an option.
Whatever modelling approach you choose, do not de-normalize your model just to help your database cope with it; instead, choose the right database technology, one that can deal with many table joins and detailed data. But do not implement an “over-normalized” model either, since at the end of the day you need to write SQL (or use tools producing SQL) to query it. This comes from field experience and should be considered in practical terms. I remember a data model implemented exactly as the generic industry model looks on paper; the result was that they needed to write code a couple of A4 pages long for a simple report, and unfortunately no tool was able to query it. You don’t want to end up like this, do you?
Use the available techniques to minimize the data you have to deal with during both read and data load operations. In some database technologies, partitioning will help you divide data into smaller chunks, which are easier to manage and to query as well.
Think about a scenario with a large table holding 3 years of historical data, billions of rows, where users normally access the table with date predicates, e.g. “bring my sales within the last three months”. Partitioning will enable you to access only three months of the data instead of three years, a huge improvement in read costs, plus your database will be pleased to handle fewer rows in join operations and calculations.
Partitioning will also help you with data maintenance. If you plan to keep only 16 months of data in your fact table, you need to remove the oldest month’s data regularly; with partitioning in place this is only a data dictionary operation instead of a full table scan. Partitioning will also help you in joins and even in data loads.
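As an illustration, here is a minimal sketch in Oracle-style SQL; the `sales` table, its columns and partition names are hypothetical, and the exact syntax varies by database:

```sql
-- Hypothetical fact table, range-partitioned by month (Oracle-style syntax).
CREATE TABLE sales (
  sale_date   DATE,
  customer_id NUMBER,
  amount      NUMBER
)
PARTITION BY RANGE (sale_date) (
  PARTITION p2016_01 VALUES LESS THAN (DATE '2016-02-01'),
  PARTITION p2016_02 VALUES LESS THAN (DATE '2016-03-01'),
  PARTITION p2016_03 VALUES LESS THAN (DATE '2016-04-01')
);

-- A date predicate lets the optimizer prune to the relevant partitions,
-- so only the last three months are scanned, not three years of data.
SELECT SUM(amount)
FROM   sales
WHERE  sale_date >= ADD_MONTHS(TRUNC(SYSDATE, 'MM'), -3);

-- Ageing out the oldest month is a data dictionary operation,
-- not a multi-hour DELETE driving a full table scan.
ALTER TABLE sales DROP PARTITION p2016_01;
```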
It is best to divide.
Don’t fall into the loop hole.
Please bear in mind that you are dealing with large volumes of data, and this might be different from your everyday development habits. Think in terms of sets: instead of writing “for/while” loops (hope you are still with me, dear reader), especially on large data sets, try to implement the same logic in SQL. Remember, SQL might be old but it is still your friend.
Let me tell you a real-life example about this. In a meeting with a customer that had a very large IT department and multi-billion-dollar investments, they mentioned they were running a PL/SQL program which lasted 5 days (yes, 5 days) and complained about the performance of the Oracle database. OK, Oracle has some deficiencies in batch processing, parallelism etc., but this seemed a little bit awkward to me, so I wanted to take a look at the code. I checked their four A4 pages of PL/SQL for a few minutes and then realized that they had three nested loops over a 50-million-row data set. Well, I said, let me send you one of our consultants to re-write this code (I was managing the best Data Warehouse consultant team on the planet at that time). Half a day’s work, and the same result, again using PL/SQL but this time with dynamic SQL, in half an hour or so. Impressive, isn’t it?
Even though we have faster CPUs and disks, in-memory computing and better software these days, the same old rule still applies: think in sets, avoid loops.
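To make the contrast concrete, here is a sketch with hypothetical table and column names; the commented-out loop is the shape to avoid, and the single statement below it is the set-based equivalent:

```sql
-- Row-by-row (avoid): a PL/SQL-style loop touching millions of rows one at a time.
-- FOR r IN (SELECT customer_id, amount FROM sales) LOOP
--   UPDATE customers SET total_spent = total_spent + r.amount
--   WHERE  customer_id = r.customer_id;
-- END LOOP;

-- Set-based (prefer): one SQL statement, the database does the heavy lifting.
MERGE INTO customers c
USING (
  SELECT customer_id, SUM(amount) AS total
  FROM   sales
  GROUP  BY customer_id
) s
ON (c.customer_id = s.customer_id)
WHEN MATCHED THEN UPDATE SET c.total_spent = s.total;
```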
If you did not fall into the “real-time” trap, then you will probably be performing daily loads into your Data Warehouse. So you either have SLAs with your source system owners that allow extracts from their systems, or you are using CDC technology to capture changed data. Either way, there will be a point in time when your integration starts.
For relatively small data sizes this should not be a problem: transformations start at midnight and finish within a few hours, together with maintenance jobs. But if the data to be processed each day is large, your batch window might become a bottleneck. You need to finish everything before 9:00 A.M. and plan accordingly.
In a typical scenario, your data load operation is assumed to run every day, within certain hours as described above. But what happens if you fail to run it for a period of time? Then, assuming that you still perform extract operations from your source systems, you will have a backlog of daily data to be loaded when the system becomes available. You need a clean strategy for such cases to make sure the backlog is cleared in a timely manner. In other words, you must have a kind of ongoing initial-load strategy after the project goes into production. This might include dropping indexes in the database, turning off summary calculations during the load, and so on.
Compression is great, although in some database technologies, like Oracle and Teradata, it unfortunately comes as an extra option rather than out of the box. There is a cost to compressing a table, but since the volume of data you read from disk will be smaller, your performance will be better, I promise.
Having said that, please check your database’s behavior regarding compression. If there are updates (merges) to compressed tables, it is quite possible you will end up with a larger table than the non-compressed version of it.
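In Oracle-style syntax, for instance, compressing a table is a one-liner; the table names here are hypothetical, and the behavior under updates is exactly what you should test first:

```sql
-- Create a table compressed from the start, suited to bulk (direct-path) loads.
CREATE TABLE sales_compressed COMPRESS AS SELECT * FROM sales;

-- Or compress an existing table by rebuilding it in place.
ALTER TABLE sales MOVE COMPRESS;

-- Beware: conventional UPDATEs/MERGEs into compressed blocks can leave the
-- table larger than its uncompressed version; re-check segment sizes after loads.
```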
And only if necessary!
For a conventional DBA, tuning = creating indexes, but it is usually just the opposite in Data Warehouse systems. Over-indexing is a common and obvious mistake in this field.
In a Data Warehouse database, queries tend to be analytical and perform full table scans. Indexes do not help full-table-scan operations; worse, they can make them slower. If your optimizer chooses to perform an index scan on one of your large tables, you might end up with a query running for hours (or even days).
The logic behind this is simple: with indexes you have at least two disk accesses, one for the index and one for the data, plus the operation will be row by row (remember, we think and implement in sets, not in rows). If you are doing single-row retrieval, as in a typical OLTP scenario, that’s fine; but if you are reading more than 5–10% of the data (table or partition), a full table scan without indexes would be better. Should I mention that “better” means faster here?
But that’s not the whole story. Indexing might have an adverse impact on data loads as well. If you are loading data into a large table, all of the indexes defined on it need to be updated. There are techniques to alter this behavior, like taking the indexes offline and rebuilding them once the data load finishes, which is quicker than updating the indexes during the load, but this of course brings an “index re-build” cost.
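The offline-then-rebuild pattern looks roughly like this in Oracle-style SQL; the index name is hypothetical:

```sql
-- Before the bulk load: take the index offline so the load does not
-- maintain it row by row (Oracle-style syntax).
ALTER INDEX sales_cust_idx UNUSABLE;

-- ... run the bulk load into the underlying table here ...

-- After the load: rebuild the index once, in a single set-based operation.
ALTER INDEX sales_cust_idx REBUILD;
```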
Do not over-index!
A data warehouse is basically an integration project. You have your data sources, and your aim is to integrate all of your meaningful data into one enterprise data model, which might have layers like facts and summaries. In each and every step, you need to modify and integrate source data according to the business rules agreed with your business groups.
So basically you will embed all of your business logic into some piece of software during this transformation. What happens when something changes? A business rule, an application, or even the data type of a field in the source system might change. If you have utilized an ETL tool and applied all of the transformations with it, the change would be relatively simple. But if you just used scripting to transform your data into its final form, it is quite possible that you will need to dig into thousands of lines of scripting code, make the changes, test them, and so on. Don’t do that; manage your metadata with a GUI-based, state-of-the-art ETL tool. There are plenty of alternatives on the market, even cost-effective open-source tools like Talend.
On this subject, I want to stress one point. There are still some products on the market offering high-performance BI via in-memory technologies but requiring you to load the data and do the transformations using scripting. Please be aware that once such a tool is in place, after a while the story is the same: you need to maintain the scripts, thousands of lines of code, with no metadata management. The result is predictable: a huge maintenance cost.
Defining foreign keys is a great mechanism to guarantee the referential integrity of your data: it is automatically guaranteed that no child record exists without a parent. However, there is a downside to this mechanism in a Data Warehouse system if you implement it at the database level. The database has no other means than a lookup read to check whether the record you are inserting really has a parent, as imposed by your foreign key. This will make your life miserable during data loads: it will convert your bulk inserts into row-by-row operations, and for each and every record you insert there will be an extra disk access, which is very expensive.
But foreign keys are very helpful for understanding your data model. You can keep them disabled if your database permits, so that there is no extra lookup to check the constraint and your data load is not turned into a row-by-row operation. With foreign keys in place but disabled, your model will still be understandable and can be reverse-engineered by a design tool.
Here comes the question then: if I disable foreign keys, how can I make sure that there will be no parentless child in my tables? Remember the batch window? That is the place to check your records; if there are records with no parent, you need to report them in a structured manner, i.e. usually put them into maintenance tables and log them as errors. You can do this check in the batch window using set operations, which will be much faster than letting the database do it itself.
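A sketch of both halves, in Oracle-style SQL with hypothetical table and constraint names:

```sql
-- Keep the foreign key documented in the data dictionary, but disabled,
-- so bulk loads stay bulk loads (Oracle-style syntax).
ALTER TABLE sales
  ADD CONSTRAINT fk_sales_customer
  FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
  DISABLE NOVALIDATE;

-- During the batch window: find parentless children with one set-based
-- anti-join and park them in a maintenance table for error reporting.
INSERT INTO sales_orphans
SELECT s.*
FROM   sales s
WHERE  NOT EXISTS (
         SELECT 1
         FROM   customers c
         WHERE  c.customer_id = s.customer_id
       );
```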
Where possible. Some database technologies let you tune for parallelism; some are designed for it. Depending on your hardware resources, you can tune the database or have specific tables utilize parallelism. A careful implementation of parallelism will increase performance and let you enjoy the full power of your hardware.
Parallelism is the default in some database architectures, like Exasol, so no extra effort is needed.
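Where you do have to ask for it, as in Oracle-style SQL, it can be declared on a table or requested per query; the table name and degree below are hypothetical:

```sql
-- Declare a default degree of parallelism for operations on a large table.
ALTER TABLE sales PARALLEL 8;

-- Or request parallelism for a single query with an optimizer hint.
SELECT /*+ PARALLEL(s, 8) */ customer_id, SUM(amount)
FROM   sales s
GROUP  BY customer_id;
```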
This two-episode-long piece is aimed at giving you some insight from field experience, with very little theory. I also tried to keep it free of any specific product. I am quite aware that each and every subject needs more explanation and case studies for specific products, and some might already have changed with evolving technologies, but my promise at the beginning was to keep this list as entertaining as possible, so I tried to summarize.
Even though there are many changes in the enterprise hardware and software technology areas these days and some of this information might be obsolete, I still believe there are cases where it will be useful and might save you a few more sleeping hours.
If you believe your Data Warehouse implementation misses something, your query performance is poor, data loads prevent users from accessing data during business hours, or vendors are knocking on your door for more investment and you are suspicious about whether it is really necessary, we would be more than happy to assist you. Our Data Warehouse Assessment study is aimed at taking an external look at already running Data Warehouse implementations and producing an assessment of the current state, suggesting a possible road-map to make it better. We also help customers choose the right technology and the right implementation plan. Feel free to give us a call.