Tuesday, October 19, 2010

Data Warehouse Concepts

There are four separate and distinct components to be considered as we explore the data warehouse environment—operational source systems, data staging area, data presentation area, and data access tools.

Operational Source Systems:
These are the operational systems of record that capture the transactions of the business. The source systems should be thought of as outside the data warehouse because presumably we have little to no control over the content and format of the data in these operational legacy systems. The main priorities of the source systems are processing performance and availability. Queries against source systems are narrow, one-record-at-a-time queries that are part of the normal transaction flow and severely restricted in their demands on the operational
system. We make the strong assumption that source systems are not queried in the broad and unexpected ways that data warehouses typically are queried. The source systems maintain little historical data, and if you have a
good data warehouse, the source systems can be relieved of much of the responsibility for representing the past.

Data Staging Area:
The data staging area of the data warehouse is both a storage area and a set of processes commonly referred to as extract-transformation-load (ETL). The data staging area is everything between the operational source systems and the data presentation area. It is somewhat analogous to the kitchen of a restaurant, where raw food products are transformed into a fine meal. In the data warehouse, raw operational data is transformed into a warehouse deliverable fit for user query and consumption. Similar to the restaurant’s kitchen, the backroom data staging area is accessible only to skilled professionals. The data warehouse kitchen staff is busy preparing meals and simultaneously cannot be responding to customer inquiries. Customers aren’t invited to eat in the kitchen. It certainly isn’t safe for customers to wander into the kitchen. We wouldn’t want our data warehouse customers to be injured by the dangerous equipment, hot surfaces, and sharp knifes they may encounter in the kitchen, so we prohibit them from accessing the staging area. Besides, things happen in the kitchen that customers just shouldn’t be privy to.

Extraction is the first step in the process of getting data into the data warehouse environment. Extracting means reading and understanding the source data and copying the data needed for the data warehouse into the staging area for further manipulation. Once the data is extracted to the staging area, there are numerous potential transformations, such as cleansing the data (correcting misspellings, resolving domain conflicts, dealing with missing elements, or parsing into standard formats), combining data from multiple sources, deduplicating data, and assigning warehouse keys.

Data Presentation:
The data presentation area is where data is organized, stored, and made available for direct querying by users, report writers, and other analytical applications. Since the backroom staging area is off-limits, the presentation area is the data warehouse as far as the business community is concerned. We typically refer to the presentation area as a series of integrated data marts. A data mart is a wedge of the overall presentation area pie.
The industry sometimes refers to 3NF models as ER models. ER is an acronym for entity relationship. Entity-relationship diagrams (ER diagrams or ERDs) are drawings of boxes and lines to communicate the relationships between tables. Both 3NF and dimensional models can be represented in ERDs because both consist of joined relational tables; the key difference between 3NF and dimensional models is the degree of normalization. Since both model types can be presented as ERDs, we’ll refrain from referring to 3NF models as ER models; instead, we’ll call them normalized models to minimize confusion.

Data in the queryable presentation area of the data warehouse must be dimensional, must be atomic, and must adhere to the data warehouse bus architecture. If the presentation area is based on a relational database, then these dimensionally modeled tables are referred to as star schemas.

Data Access Tools:
The final major component of the data warehouse environment is the data access tool(s). We use the term tool loosely to refer to the variety of capabilities that can be provided to business users to leverage the presentation area for analytic decision making. By definition, all data access tools query the data in the data warehouse’s presentation area. Querying, obviously, is the whole point of using the data warehouse.

Dimensional Modeling Primer
A data access tool can be as simple as an ad hoc query tool or as complex as a sophisticated data mining or modeling application. Ad hoc query tools, as powerful as they are, can be understood and used effectively only by a small percentage of the potential data warehouse business user population. The majority of the business user base likely will access the data via prebuilt parameter-driven analytic applications. Approximately 80 to 90 percent of the potential users will be served by these canned applications that are essentially finished templates that do not require users to construct relational queries directly

Additional Considerations:
Before we leave the discussion of data warehouse components, there are several other concepts that warrant discussion.
Metadata is all the information in the data warehouse environment that is not the actual data itself. Metadata is akin to an encyclopedia for the data warehouse.

Dimensional Modeling Vocabulary:
Throughout this book we will refer repeatedly to fact and dimension tables. Contrary to popular folklore, Ralph Kimball didn’t invent this terminology. As best as we can determine, the terms dimensions and facts originated from a joint research project conducted by General Mills and Dartmouth University in the 1960s. In the 1970s, both AC Nielsen and IRI used these terms consistently to describe their syndicated data offerings, which could be described accurately today as dimensional data marts for retail sales data. Long before simplicity was a lifestyle trend, the early database syndicators gravitated to these concepts for simplifying the presentation of analytic information. They understood that a database wouldn’t be used unless it was packaged simply.

Fact Table:
A fact table is the primary table in a dimensional model where the numerical performance measurements of the business are stored. A row in a fact table corresponds to a measurement. A measurement is a row in a fact table. All the measurements in a fact table must be at the same grain.
The most useful facts are numeric and additive, such as dollar sales amount. There are facts that are semiadditive and still others that are nonadditive. Semiadditive facts can be added only along some of the dimensions, and nonadditive facts simply can’t be added at all.

The most useful facts in a fact table are numeric and additive.
All fact tables have two or more foreign keys, as designated by the FK notation that connect to the dimension tables’ primary keys. For example, the product key in the fact table always will match a specific product key in the product dimension table. When all the keys in the fact table match their respective primary keys correctly in the corresponding dimension tables, we say that the tables satisfy referential integrity. We access the fact table via the dimension tables joined to it.

The fact table itself generally has its own primary key made up of a subset of the foreign keys. This key is often called a composite or concatenated key. Every fact table in a dimensional model has a composite key, and conversely, every table that has a composite key is a fact table. Another way to say this is that in a dimensional model, every table that expresses a many-to-many relationship must be a fact table. All other
tables are dimension tables.

Dimension Tables:
Dimension tables are integral companions to a fact table. The dimension tables contain the textual descriptors of the business. In a well-designed dimensional model, dimension tables have many columns or attributes. These attributes describe the rows in the dimension table. We strive to include as many meaningful texts like descriptions as possible. Each dimension is defined by its single primary key, designated by the PK notation which serves as the basis for referential integrity with any given fact table to which it is joined. Dimension attributes serve as the primary source of query constraints, groupings, and report labels.

Dimension table attributes play a vital role in the data warehouse. Since they are the source of virtually all interesting constraints and report labels, they are key to making the data warehouse usable and understandable. In many ways, the data warehouse is only as good as the dimension attributes. The power of the data warehouse is directly proportional to the quality and depth of the dimension attributes. The more time spent providing attributes with verbose business terminology, the better the data warehouse is. The more time spent populating the values in an attribute column, the better the data warehouse is. The more time spent ensuring the quality of the values in an attribute column, the better the data warehouse is.

Dimension tables are the entry points into the fact table. Robust dimension attributes deliver robust analytic slicing and dicing capabilities. The dimensions implement the user interface to the data warehouse. The best attributes are textual and discrete. Attributes should consist of real words rather than cryptic abbreviations. Typical attributes for a product dimension would include a short description (10 to 15 characters), a long description (30 to 50 characters), a brand name, a category name, packaging type, size, and numerous other product characteristics. Although the size is probably numeric, it is still a dimension attribute because it behaves more like a textual description than like a numeric measurement. Size is a discrete and constant descriptor of a specific product.

Dimensional Modeling Myths:
Despite the general acceptance of dimensional modeling, some misperceptions continue to be disseminated in the industry. We refer to these misconceptions as dimensional modeling myths.

Myth 1. Dimensional models and data marts are for summary data only. This first myth is the root cause of many ill-designed dimensional models. Because we can’t possibly predict all the questions asked by business users, we need to provide them with queryable access to the most detailed data so that they can roll it up based on the business question at hand. Data at the lowest level of detail is practically impervious to surprises or changes. Our data marts also will include commonly requested summarized data in dimensional schemas. This summary data should complement the granular detail solely to provide improved performance for common queries, but not attempt to serve as a replacement for the details.

Myth 2. Dimensional models and data marts are departmental, not enterprise, solutions. Rather than drawing boundaries based on organizational departments, we maintain that data marts should be organized around business processes, such as orders, invoices, and service calls. Multiple business functions often want to analyze the same metrics resulting from a single business process. We strive to avoid duplicating the core measurements in multiple databases around the organization.

Myth 3. Dimensional models and data marts are not scalable. Modern fact tables have many billions of rows in them. The dimensional models within our data marts are extremely scalable. Relational DBMS vendors have embraced data warehousing and incorporated numerous capabilities into their products to optimize the scalability and performance of dimensional models.
A corollary to myth 3 is that dimensional models are only appropriate for retail or sales data. This notion is rooted in the historical origins of dimensional modeling but not in its current-day reality. Dimensional modeling has been applied to virtually every industry, including banking, insurance, brokerage, telephone, newspaper, oil and gas, government, manufacturing, travel, gaming, health care, education, and many more. In this book we use the retail industry to illustrate several early concepts mainly because it is an industry to which we have all been exposed; however, these concepts are extremely transferable to other businesses.

Myth 4. Dimensional models and data marts are only appropriate when there is a predictable usage pattern. A related corollary is that dimensional models aren’t responsive to changing business needs. On the contrary, because of their symmetry, the dimensional structures in our data marts are extremely flexible and adaptive to change. The secret to query flexibility is building the fact tables at the most granular level. In our opinion, the source of myth 4 is the designer struggling with fact tables that have been prematurely aggregated based on the designer’s unfortunate belief in myth 1 regarding summary data. Dimensional models that only deliver summary data are bound to be problematic. Users run into analytic brick walls when they try to drill down into details not available in the summary tables. Developers also run into brick walls because they can’t easily accommodate new dimensions, attributes, or facts with these prematurely summarized tables. The correct starting point for your dimensional models is to express data at the lowest detail possible for maximum flexibility and extensibility.

Myth 5. Dimensional models and data marts can’t be integrated and therefore lead to stovepipe solutions. Dimensional models and data marts most certainly can be integrated if they conform to the data warehouse bus architecture. Presentation area databases that don’t adhere to the data warehouse bus architecture will lead to standalone solutions. You can’t hold dimensional modeling responsible for the failure of some organizations to embrace one of its fundamental tenets.

Common Pitfalls to Avoid:
While we can provide positive recommendations about dimensional data warehousing, some readers better relate to a listing of common pitfalls or traps into which others have already stepped. Borrowing from a popular late-night television show, here is our favorite top 10 list of common errors to avoid while building your data warehouse. These are all quite lethal errors—one alone may be sufficient to bring down your data warehouse initiative.

Pitfall 10. Become overly enamored with technology and data rather than focusing on the business’s requirements and goals.
Pitfall 9. Fail to embrace or recruit an influential, accessible, and reasonable management visionary as the business sponsor of the data warehouse.
Pitfall 8. Tackle a galactic multiyear project rather than pursuing more manageable, while still compelling, iterative development efforts.
Pitfall 7. Allocate energy to construct a normalized data structure, yet run out of budget before building a viable presentation area based on dimensional models.
Pitfall 6. Pay more attention to backroom operational performance and ease of development than to front-room query performance and ease of use.
Pitfall 5. Make the supposedly queryable data in the presentation area overly complex. Database designers who prefer a more complex presentation should spend a year supporting business users; they’d develop a much
better appreciation for the need to seek simpler solutions.
Pitfall 4. Populate dimensional models on a standalone basis without regard to a data architecture that ties them together using shared, conformed dimensions.
Pitfall 3. Load only summarized data into the presentation area’s dimensional structures.
Pitfall 2. Presume that the business, its requirements and analytics, and the underlying data and the supporting technology are static.
Pitfall 1. Neglect to acknowledge that data warehouse success is tied directly to user acceptance. If the users haven’t accepted the data warehouse as a foundation for improved decision making, then your efforts have been exercises in futility.

Four-Step Dimensional Design Process:
1. Select the business process to model.
2. Declare the grain of the business process. Declaring the grain means specifying exactly what an individual fact table row represents. The grain conveys the level of detail associated with the fact table measurements. It provides the answer to the question, “How do you describe a single row in the fact table?”
Example grain declarations include:
An individual line item on a customer’s retail sales ticket as measured by a scanner device
A line item on a bill received from a doctor
An individual boarding pass to get on a flight
A daily snapshot of the inventory levels for each product in a warehouse
A monthly snapshot for each bank account

3. Choose the dimensions that apply to each fact table row. Dimensions fall out of the question, “How do businesspeople describe the data that results from the business process?” We want to decorate our fact tables with a robust set of dimensions representing all possible descriptions that take on single values in the context of each measurement. If we are clear about the grain, then the dimensions typically can be identified quite easily. With the choice of each dimension, we will list all the discrete, textlike attributes that will flesh out each dimension table. Examples of common dimensions include date, product, customer, transaction type, and status.
4. Identify the numeric facts that will populate each fact table row. Facts are determined by answering the question, “What are we measuring?” Business users are keenly interested in analyzing these business process performance measures. All candidate facts in a design must be true to the grain defined in step 2. Facts that clearly belong to a different grain must be in a separate fact table. Typical facts are numeric additive figures such as quantity ordered or dollar cost amount.

Select the Business Process:
The first dimensional model built should be the one with the most impact—it should answer the most pressing business questions and be readily accessible for data extraction.

Declare the Grain:
Preferably you should develop dimensional models for the most atomic information captured by a business process. Atomic data is the most detailed information collected; such data cannot be subdivided further.

Choose the Dimensions:
A careful grain statement determines the primary dimensionality of the fact table. It is then often possible to add more dimensions to the basic grain of the fact table, where these additional dimensions naturally take on only one value under each combination of the primary dimensions. If the additional dimension violates the
grain by causing additional fact rows to be generated, then the grain statement must be revised to accommodate this dimension.

Surrogate Keys
Every join between dimension and fact tables in the data warehouse should be based on meaningless integer surrogate keys. You should avoid using the natural operational production codes. None of the data warehouse keys should be smart, where you can tell something about the row just by looking at the key.

What is Metadata?
Metadata is the background information which describes the content, quality, condition, and other appropriate characteristics of the data. We know that metadata is “data about data,” and it defines the content of a data object. Primary function is of enabling policy and providing access to data. The policies enabled include data protection, intelligent archiving, and storage optimization, which are the key services of Information Lifecycle Management.

Two Primary Purposes of Metadata:
Providing access to data objects: Ability to find a specific object many years after storing it, via either content-based or file-based metadata, is key to achieving business value from archived data.
Enabling policy- based storage management: Policy is automated by the combination of policy management systems, automated data movers, data classification, and metadata.

What is Metadata Management?

Metadata management refers to the activities associated with ensuring that metadata is created at the point of file creation and that this information is collected, stored in a repository for use by multiple applications, and controlled to remove inconsistencies and redundancies. It is the act of imposing management discipline on the collection and control of metadata.

Metadata is a key to information lifecycle implementation, which is a storage strategy that balances the cost of storing and managing information with its business value.
“A dimension that means the same thing with every possible fact table to which it can be joined”

Conformed dimensions are used to analyze the facts from two or more data marts.
Suppose you have a “shipping” data mart (shipped to whom and when) and a “sales” data mart (purchased what and when). Both marts require a “customer” dimension and a “time” dimension. If they have the same dimensional attributes, then you have conforming dimensions, allowing you to extract and manipulate facts for a particular customer from both marts such that it becomes easy to analyze whether late shipments have affected sales to that customer or not .
Suppose now that you add a “marketing” data mart to help you analyze product promotions. Again, with conformed customer and time dimensions, you’re able to analyze the effects of a particular product promotion on sales. (Analyzing facts from more than one fact table in this way is termed “drilling across”)

Conformed Dimensions makes possible a single dimension table to be used across multiple databases/data marts. Without the strict adherence to conformed dimensions, data warehouse cannot function as an integrated whole.
 A single dimension can be used easily against multiple fact tables.
 User Interfaces and data content are consistent every time the dimension is used.
 Consistent interpretation of attributes (and rollups) across data marts.
 These dimensions become enterprise property and can be used later in other marts as you evolve the enterprise data warehouse.

What is a dimension?
A data warehouse dimension provides the means to "slice and dice" data in a data warehouse. Dimensions provide structured labeling information to otherwise unordered numeric measures.
The primary function of dimensions is threefold: to provide filtering, grouping and labeling.

Types of Dimensions:

Role-Playing Dimensions:
Role Playing dimension refers to a dimension that can play different roles in a fact table depending on the context. In a data warehouse, a role playing dimension can be easily identified as it will have multiple foreign keys of the same dimension in the fact table.
For example, If we take a record of Order Line Fact from any Retail mart it will have an ordered date (Day of order), scheduled shipping date (Supposed to be shipped), and shipment date (Actual date shipped), invoice date, and payment date.
For populating so many different date columns we don’t need to create a separate date dimension for each of these dates. We only need to create one generic date dimension which can be used for "Date of Order", “Date of Shipment”, “Date of Payment”, as well as "Date of Delivery" or "Date of Return" by creating different views of the same date dimension.

Junk dimensions:
Junk dimensions constitute of extra data elements about a particular transaction (fact) like yes-no flags/operation stamps/free text attributes etc. Instead of having these within the fact itself, we can put such data into a separate dimension i.e. Junk dimension.

Degenerate dimensions:
A degenerate dimension is a data element that is dimensional in nature but stored in a fact table.
For Eg: If you have a dimension that only has Order Number and Order Line Number, you would have a 1:1 relationship with the Fact table. Do you want to have two tables with a billion rows or one table with a billion rows? Therefore, this would be a degenerate dimension and Order Number and Order Line Number would be stored in the Fact table.

Degenerate Dimension:
Degenerate dimensions typically are reserved for operational transaction identifiers. They should not be used as an excuse to stick a cryptic code in the fact table without joining to a descriptive decode in a dimension table.

Junk Dimensions:
A junk dimension is a convenient grouping of typically low-cardinality flags and indicators. By creating an abstract dimension, we remove the flags from the fact table while placing them into a useful dimensional framework.