What is the best methodology to use when creating a data warehouse? Well, first off, let’s discuss some of the reasons why you would want to use a data warehouse and not just use your operational system:
- You need to integrate many different sources of data in near real-time. This will allow for better business decisions because users will have access to more data. Plus this will save users lots of time because they won’t waste precious time retrieving data from multiple sources
- You have tons of historical data that you need to gather in one easily accessible place in which it will have common formats, common keys, common data model, and common access methods.
- You need to keep historical records, even if the source transaction systems does not
- You can restructure the data and rename tables and fields so if makes more sense to the users
- You need to use master data management to consolidate many tables, such as customers, into one table
- Users are running reports directly against operational systems, causing performance problems. Instead, create a data warehouse so users can run reports off of that. Plus, the data warehouse is optimized for read access, resulting in faster report generation
- Having an easy to use data warehouse allows users to create their own reports without having to get IT involved
- Improve data quality by cleaning up data as it is imported into the data warehouse (providing more accurate data) as well as providing consistent codes and descriptions
- Having one version of the truth, so each department will produce results that are in line with all the other departments, providing consistency
- Having a data warehouse makes it easy to create business intelligence solutions on top of it, such as SSAS cubes
- Companies that have implemented data warehouses and complementary BI systems have generated more revenue and saved more money than companies that haven’t invested in BI systems and data warehouses
Once you decide to build a data warehouse, the next step is deciding between a normalized versus dimensional approach for the storage of data in the data warehouse.
The dimensional approach, made popular by in Ralph Kimball (website), states that the data warehouse should be modeled using a Dimensional Model (star schema or snowflake). The normalized approach, also called the 3NF model, made popular by Bill Inmon (website), states that the data warehouse should be modeled using an E-R model/normalized model.
In a dimensional approach, data is partitioned into either “facts”, which are generally numeric transaction data, or “dimensions“, which are the reference information that gives context to the facts. A key advantage of a dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data warehouse tends to operate very quickly. The main disadvantage of the dimensional approach is that In order to maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated. Plus, if you are used to working with a normalized approach, it can take a while to fully understand the dimensional approach and to become efficient in building one.
In the normalized approach, the data in the data warehouse are stored following database normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). The normalized structure divides data into entities, which creates several tables in a relational database. When applied in large enterprises the result is dozens of tables that are linked together by a web of joins. Furthermore, each of the created entities is converted into separate physical tables when the database is implemented. The main advantage of this approach is that it is straightforward to add information into the database. A disadvantage of this approach is that, because of the number of tables involved, it can be difficult for users both to join data from different sources into meaningful information and then access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.
The final step in building a data warehouse is deciding between using a top-down versus bottom-up design methodology.
Kimball is a proponent of an approach to data warehouse design described as bottom-up in which dimensional data marts are first created to provide reporting and analytical capabilities for specific business areas such as “Sales” or “Production”. These data marts are eventually integrated together to create a data warehouse using a bus architecture, which consists of conformed dimensions between all the data marts. So the data warehouse ends up being segmented into a number of logically self-contained and consistent data marts, rather than a big and complex centralized model. Business value can be returned as quickly as the first data marts can be created, and the method lends itself well to an exploratory and iterative approach to building data warehouses so that no master plan is required upfront.
Inmon is one of the leading proponents of the top-down approach to data warehouse design, in which the data warehouse is designed using a normalized enterprise data model. He has defined a data warehouse as a centralized repository for the entire enterprise. Dimensional data marts containing data needed for specific business processes or specific departments are created from the enterprise data warehouse only after the complete data warehouse has been created. In the Inmon vision the data warehouse is at the center of the “Corporate Information Factory” (CIF), and the newer DW 2.0, which provides a logical framework for delivering business intelligence (BI) and business management capabilities. The work is a long-term, construction will last a long time, but the return is expected to be a long-lasting and reliable data architecture.
Kimball is the most frequently used methodology, especially if you are using the Microsoft BI stack. It is popular because business users can see some results quickly, with the risk you may create duplicate data or may have to redo part of a design because there was no master plan. With Inmon there is a master plan and usually you will not have to redo anything, but if could be a while before you see any benefits, and the up-front cost is significant. And another risk is by the time you start generating results, the business source data has changed or there is changed priorities and you may have to redo some work anyway.