
Friday, June 7, 2013

Ralph Kimball Design Tips

  1. Design Tip #68 Simple Drill-Across in SQL
  2. Design Tip #155 Going Agile? Start with the Bus Matrix
  3. Design Tip #156 An Excel Macro for Drilling Across
  4. Slowly Changing Dimensions Are Not Always as Easy as 1, 2, 3
  5. Design Tip #152 Slowly Changing Dimension Types 0, 4, 5, 6 and 7
  6. Kimball Core Concepts
  7. The 10 Essential Rules of Dimensional Modeling
  8. Essential Steps for the Integrated Enterprise Data Warehouse, Part 1
  9. Essential Steps for the Integrated Enterprise Data Warehouse, Part 2
  10. The 38 subsystems of ETL
  11. Alda's interviewing tips for uncovering business requirements
  12. Surrounding the ETL Requirements
  13. http://www.kimballgroup.com/1996/11/02/causal-not-casual-dimensions/
  14. Aggregate navigation with almost no metadata
  15. http://www.kimballgroup.com/2010/09/13/industry-standard-data-models-fall-short/
  16. http://www.kimballgroup.com/2009/05/29/the-10-essential-rules-of-dimensional-modeling/
  17. http://www.kimballgroup.com/2004/10/10/fables-and-facts/
  18. http://www.kimballgroup.com/2003/04/05/the-soul-of-the-data-warehouse-part-two-drilling-across/
  19. http://www.kimballgroup.com/2003/03/20/the-soul-of-the-data-warehouse-part-one-drilling-down/
  20. http://www.kimballgroup.com/2003/04/22/the-soul-of-the-data-warehouse-part-3-handling-time/
  21. http://www.kimballgroup.com/2003/01/01/fact-tables-and-dimension-tables/
  22. http://www.kimballgroup.com/1996/03/01/drilling-down-up-and-across/
  23. http://www.kimballgroup.com/1995/11/01/the-aggregate-navigator/
  24. Design Tip #147 Durable “Super-Natural” Keys
  25. http://www.kimballgroup.com/2008/09/22/slowly-changing-dimensions-part-2/
  26. http://www.kimballgroup.com/2008/08/21/slowly-changing-dimensions/
  27. Design Tip #100 Keep Your Keys Simple
  28. Design Tip #85 Using Smart Date Keys to Partition Large Fact Tables
  29. Design Tip #83 Abstract Generic Dimensions
  30. Design Tip #80 Adding a Row Change Reason Attribute
  31. Design Tip #79 When Do Dimensions Become Dangerous? 
  32. Slowly Changing Dimensions Are Not Always as Easy as 1, 2, 3
  33. Design Tip #51: Latest Thinking On Time Dimension Tables
  34. Design Tip #48: De-Clutter With Junk (Dimensions) By Margy
  35. Design Tip #25: Designing Dimensional Models For Parent-Child Applications 
  36. Design Tip #16: Hot Swappable Dimensions 
  37. Design Tip #15: Combining SCD Techniques 
  38. Design Tip #12: Accurate Counting With A Dimensional Supplement
  39. Design Tip #11: Accurate Counts Within A Dimension 
  40. Design Tip #6 showing the correlation between dimensions
  41. Design Tip #5: Surrogate Keys For The Time Dimension
  42. http://www.kimballgroup.com/2000/03/20/mystery-dimensions/
  43. http://www.kimballgroup.com/2000/02/09/many-alternate-realities/
  44. http://www.kimballgroup.com/1999/08/03/when-a-slowly-changing-dimension-speeds-up/
  45. http://www.kimballgroup.com/1996/05/01/monster-dimensions/
  46. http://www.kimballgroup.com/1996/11/02/causal-not-casual-dimensions/
  47. Design Tip #145 Time Stamping Accumulating Snapshot Fact Tables 
  48. http://www.kimballgroup.com/2011/11/01/design-tip-140-is-it-a-dimension-a-fact-or-both/
  49. http://www.kimballgroup.com/2011/04/05/design-tip-133-factless-fact-tables-for-simplification/
  50. http://www.kimballgroup.com/2009/09/02/design-tip-116-add-uncertainty-to-your-fact-table/
  51. http://www.kimballgroup.com/2009/01/29/exploit-your-fact-tables/
  52. http://www.kimballgroup.com/2008/11/05/fact-tables/
  53. http://www.kimballgroup.com/2007/12/11/design-tip-97-modeling-data-as-both-a-fact-and-dimension-attribute/
  54. http://www.kimballgroup.com/2007/10/02/design-tip-95-patterns-to-avoid-when-modeling-headerline-item-transactions/
  55. http://www.kimballgroup.com/2006/08/02/design-tip-82-turning-the-fact-table-on-its-head/
  56. http://www.kimballgroup.com/2003/06/03/design-tip-46-another-look-at-degenerate-dimensions/
  57. http://www.kimballgroup.com/2002/06/13/design-tip-37-modeling-a-pipeline-with-an-accumulating-snapshot/
  58. http://www.kimballgroup.com/2003/01/07/design-tip-42-combining-periodic-and-accumulating-snapshots/
  59. http://www.kimballgroup.com/2001/02/27/design-tip-20-sparse-facts-and-facts-with-short-lifetimes/
  60. http://www.kimballgroup.com/2000/09/15/design-tip-13-when-a-fact-table-can-be-used-as-a-dimension-table/
  61. http://www.kimballgroup.com/1996/09/02/factless-fact-tables/
  62. http://www.kimballgroup.com/1997/08/02/a-dimensional-modeling-manifesto/
  63. http://www.kimballgroup.com/2009/07/08/design-tip-114-avoiding-alternate-organization-hierarchies/
  64. http://www.kimballgroup.com/2011/02/03/design-tip-131-easier-approaches-for-harder-problems/
  65. http://www.kimballgroup.com/2003/01/01/fact-tables-and-dimension-tables/
  66. http://www.kimballgroup.com/2012/10/02/design-tip-149-facing-the-re-keying-crisis/
  67. http://www.kimballgroup.com/2011/11/01/design-tip-140-is-it-a-dimension-a-fact-or-both/
  68. http://www.kimballgroup.com/2011/08/02/design-tip-137-creating-and-managing-shrunken-dimensions/
  69. http://www.kimballgroup.com/2011/06/28/design-tip-136-adding-a-mini-dimension-to-a-bridge-table/
  70. http://www.kimballgroup.com/2010/06/02/design-tip-124-alternatives-for-multi-valued-dimensions/
  71. http://www.kimballgroup.com/2009/11/04/design-tip-118-managing-backlogs-dimensionally/
  72. http://www.kimballgroup.com/2009/06/03/design-tip-113-creating-using-and-maintaining-junk-dimensions/
  73. http://www.kimballgroup.com/2009/05/05/design-tip-112-creating-historical-dimension-rows/
  74. http://www.kimballgroup.com/2008/09/03/design-tip-105-snowflakes-outriggers-and-bridges/
  75. http://www.kimballgroup.com/2008/05/06/design-tip-101-slowly-changing-vocabulary/
  76. http://www.kimballgroup.com/2006/04/01/design-tip-78-late-arriving-dimension-rows/
  77. http://www.kimballgroup.com/2005/07/05/design-tip-69-identifying-business-processes/
  78. http://www.kimballgroup.com/2005/06/03/design-tip-68-simple-drill-across-in-sql/
  79. http://www.kimballgroup.com/2005/01/11/design-tip-63-building-a-change-data-capture-system/
  80. http://www.kimballgroup.com/2004/08/02/design-tip-57-early-arriving-facts/
  81. http://www.kimballgroup.com/2002/11/06/design-tip-41-drill-down-into-a-more-detailed-bus-matrix/
  82. http://www.kimballgroup.com/2001/11/19/design-tip-31-designing-a-real-time-partition9/
  83. http://www.kimballgroup.com/2001/10/15/design-tip-29-graceful-modifications-to-existing-fact-and-dimension-tables/
  84. http://www.kimballgroup.com/2001/08/01/design-tip-26-adding-an-audit-dimension-to-track-lineage-and-confidence/
  85. http://www.kimballgroup.com/2001/03/20/design-tip-21-declaring-the-grain/
  86. http://www.kimballgroup.com/2001/04/09/design-tip-22-variable-depth-customer-dimensions/
  87. http://www.kimballgroup.com/2007/02/07/pick-the-right-approach-to-mdm/
  88. http://www.kimballgroup.com/2001/02/27/design-tip-20-sparse-facts-and-facts-with-short-lifetimes/
  89. http://www.kimballgroup.com/2001/02/05/design-tip-19-replicating-dimensions-correctly/
  90. http://www.kimballgroup.com/2000/12/08/design-tip-16-hot-swappable-dimensions/
  91. http://www.kimballgroup.com/2000/09/15/design-tip-13-when-a-fact-table-can-be-used-as-a-dimension-table/




April 5, 2003
http://www.intelligententerprise.com//030405/606warehouse1_1.jhtml

The Soul of the Data Warehouse, Part Two: Drilling Across

By Ralph Kimball
The three fundamental themes that make up the soul of every data warehouse are drilling down, drilling across, and handling time. In Part One of "The Soul of the Data Warehouse," I showed that drilling down was nothing more than adding a row header, any row header, to an existing query. Although we often grouse about SQL's limitations as a report writer, when it comes to drilling down, SQL gracefully expands an existing query whenever a new row header is added. This simple result led to the powerful realization that when data is organized in a symmetrical, predictable fashion starting at the most atomic level, all queries and applications benefit.
If drilling down is the most fundamental maneuver in a data warehouse, drilling across is a close second. From the perspective of an answer set, drilling across adds more data to an existing row. Note that this result isn't what you get from a UNION of rows from separate queries. It's better described as the column accretion from separate queries.
Drilling across by adding another measured fact to the SELECT list from the existing fact table mentioned in the query is a trivial accomplishment. What's more interesting and important is adding another measured fact from a new fact table.
The issues raised by this simple view of drilling across are at the heart of data warehouse architecture. These issues boil down to an observation and a choice.
Drill-across observation: The new fact table called for in the drill-across operation must share certain dimensions with the fact table in the original query.
Certain dimensions will be named in the original query because they contribute row headers. Remember that these row headers are the basis of the grouping that creates the answer-set row. These dimensions will appear in the FROM clause of the SQL code and will be joined to the fact table through the relationship of a foreign key to primary key. The new fact table must also support exactly these same row headers, or the context of the answer-set row is meaningless.
Drill-across choice: Either send a single, simultaneous SQL request to the two fact tables or send two separate requests.
Although sending a single SQL request to the two fact tables seems cleaner, this choice can become a showstopper. Sending a single request means mentioning both fact tables in the FROM clause of the SQL code and joining both fact tables in some way to the common dimension tables I just discussed. This commingling of two fact tables in the same SQL statement causes these problems:
  • Because the two fact tables will be joined together either directly or through the common dimensions, the query must specify whether the many-to-many relationship between the two fact tables is handled with inner or outer joins. This fundamental challenge arises from the relational model. It's effectively impossible to get this right, even if you're an expert SQL programmer. Depending on the relative cardinality of the two fact tables, your aggregated numeric totals can either be too low or too high, or both! Even if you don't believe me, you have to deal with the next bullet point.
  • The vast majority of queries the relational database receives are generated by powerful query tools and report writers, and you have no direct control over the SQL they emit. Nor would you want that control: some of these tools generate mind-boggling reams of SQL, and you can't effectively intervene.
  • Emitting a single SQL statement precludes you from requesting data from separate table spaces, separate machines, or separate database vendors. You're stuck in the same table space on the same machine talking to one database vendor. If you can easily avoid this problem, why take on these restrictions?
  • Finally, if you emit a single SQL statement involving both fact tables, you'll almost certainly be unable to use any of the powerful query-rewrite tools that perform aggregate navigation. Aggregate navigation is the most cost-effective way to make dramatic gains in data warehouse performance. For more on aggregate navigation, see my column "Aggregate Navigation with (Almost) No Metadata" (August 1996) in the DBMS magazine archives accessible through http://www.ralphkimball.com/ or at www.dbmsmag.com/9608d54.html.

Implementing Drill-Across

If you've followed the logic of the observation and the choice, the architecture to support drill-across begins to emerge. I use some modern data warehouse words to describe the two key aspects of this architecture:
  1. All fact tables in a drill-across query must use conformed dimensions.
  2. The actual drill-across query consists of a multi-pass set of separate requests to the target fact tables followed by a simple sort/merge on the identical row headers returned from each request.
The simplest definition of conformed dimensions is that two instances of a conformed dimension are identical. So if two fact tables have a "Customer" dimension, then "Customer" is conformed if the two dimensions are exactly the same. But this definition is unnecessarily restrictive. Here's the precise definition of conformed dimensions: Two dimensions are conformed if the fields that you use as common row headers have the same domain.
When you bring two separate queries together in a drill-across operation, both queries must have the same number of row headers, arranged from left to right in the same order. All the rest of the columns (the computed facts) in the two queries, by definition, are not row headers. In other words, an independent examination of both queries shows that neither query has rows that duplicate the same row headers. To put it another way, the row headers form a unique key for each row of the answer set.
To sort/merge the two queries, you must sort them the same way. At this point, it becomes possible to merge the rows of the two queries together in a single pass. The resulting merged answer set has a single set of row headers plus the combined set of computed facts returned from both queries. Because traditional sort/merge is the same as an outer join, it is possible for a row in the final merged answer set to have nulls for either the first set of computed facts or the second set, but not both!
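The sort/merge step can be sketched in a few lines of Python; the two dictionaries stand in for the answer sets returned by the two separate queries (the region names and numbers are hypothetical):

```python
# Answer sets of two separate queries, keyed by the identical row header
# (region). The values are the computed facts from each query.
orders = {"East": 500, "West": 300}
shipments = {"East": 450, "North": 120}

# Outer sort/merge on the common row headers: each merged row carries the
# combined facts, with None (null) where one query returned no matching
# row -- but never null on both sides at once.
merged = [(region, orders.get(region), shipments.get(region))
          for region in sorted(set(orders) | set(shipments))]
print(merged)
# [('East', 500, 450), ('North', None, 120), ('West', 300, None)]
```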
Once you've visualized the sort/merge step in drilling across, you really understand conformed dimensions. With conformed dimensions, the only thing you care about is matching row headers. If the contents of the respective fields you're using for the sort/merge are drawn from the same domains, then the match makes sense. If you try to match row headers from two dimensions that aren't conformed, you're guaranteed to get garbage. The sort/merge will fail, and the SQL engine will post the results from the two queries on separate lines — and probably in separate sorting locations in the merged answer set.

Amazing Magic

In my classes, I sometimes describe conformed dimensions as either dimensions that are exactly equal (the trivial case) or dimensions where "one is a subset of the other." For example, a brand dimension may be a subset of a more detailed product dimension. In this case, you can drill across two fact tables, one at the brand level with a brand dimension (such as a forecast), and the other at a detailed product level with a product dimension (such as sales transactions). Assume that the product dimension is a nice flat table containing the low cardinality brand attribute.
If the row header of the two queries is simply "brand," then some amazing magic takes place. The engine automatically aggregates both data sets to the brand level, which is exactly the right level for the drill-across. If the names of the brands are drawn from the same domain, you can complete the drill-across (of forecast vs. actual) by confidently merging the rows with the same brand names. Many commercial query and report-writing tools perform this drill-across operation.
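Here is a minimal end-to-end sketch of that brand-level drill-across using SQLite; all table names, columns, and figures are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical tables: sales at the product grain, forecast at the
    -- brand grain; the brand names are drawn from the same domain.
    CREATE TABLE product   (product_key INTEGER PRIMARY KEY, brand TEXT);
    CREATE TABLE brand_dim (brand_key INTEGER PRIMARY KEY, brand TEXT);
    CREATE TABLE sales     (product_key INTEGER, amount REAL);
    CREATE TABLE forecast  (brand_key INTEGER, amount REAL);
    INSERT INTO product VALUES (1,'Acme'), (2,'Acme'), (3,'Zenith');
    INSERT INTO brand_dim VALUES (10,'Acme'), (20,'Zenith');
    INSERT INTO sales VALUES (1,100), (2,50), (3,75);
    INSERT INTO forecast VALUES (10,160), (20,70);
""")

# Both queries use the same row header, brand; SQL rolls the detailed
# sales facts up to the brand level automatically.
actual = dict(con.execute(
    "SELECT p.brand, SUM(s.amount) FROM sales s "
    "JOIN product p USING (product_key) GROUP BY p.brand"))
plan = dict(con.execute(
    "SELECT b.brand, SUM(f.amount) FROM forecast f "
    "JOIN brand_dim b USING (brand_key) GROUP BY b.brand"))

# Merge the rows with matching brand names: actual vs. forecast.
report = {b: (actual.get(b), plan.get(b)) for b in sorted(actual)}
print(report)
# {'Acme': (150.0, 160.0), 'Zenith': (75.0, 70.0)}
```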
You can see that it's possible to conform two dimensions even when they have some incompatible fields. You just need to be careful to avoid using these incompatible fields as row headers in drill-across queries. This allowance lets a dimension contain some private fields that are meaningful only to a local user group.

Centralizing vs. Conforming

One final thought: In a recent guest column, my colleague Margy Ross pointed out that the decision to physically centralize a data warehouse has very little to do with conformed dimensions. (See Resources below.) Her point was that if you combine two data sets in a drill-across operation, you have to label them the same way, which is true whether the data sets are tightly administered on a single piece of hardware by one DBA or are loosely administered by remote IT organizations that merely agree to create a set of overlapping labels.
Ralph Kimball (founder of the Ralph Kimball Group) co-invented the Star Workstation at Xerox and founded Red Brick Systems. He has three best-selling data warehousing books in print, including The Data Warehouse Toolkit, Second Edition (Wiley, 2002). He teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. You can reach him through his Web site, http://www.ralphkimball.com/.

RESOURCES

Related Article at IntelligentEnterprise.com:
"To Be or Not To Be Centralized," by Margy Ross, Feb. 1, 2003
"The Soul of the Data Warehouse, Part One: Drilling Down," March 20, 2003


Design Tip #152: Slowly Changing Dimension Types 0, 4, 5, 6 and 7

Ralph introduced the concept of slowly changing dimension (SCD) attributes in 1996. Dimensional modelers, in conjunction with the business’s data governance representatives, must specify the data warehouse’s response to operational attribute value changes. Most Kimball readers are familiar with the core SCD approaches: type 1 (overwrite), type 2 (add a row), and type 3 (add a column). Since legibility is a key component of the Kimball mantra, we sometimes wish Ralph had given these techniques more descriptive names, such as “overwrite” instead of “type 1.” But at this point, the SCD type numbers are part of our industry’s vernacular.
We have written about more advanced SCD patterns, such as the 2005 article entitled “Slowly Changing Dimensions are Not Always as Easy as 1, 2, 3.” However, we’ve not consistently named the more advanced and hybrid techniques. With the third edition of The Data Warehouse Toolkit (Wiley, 2013), we’ve decided to assign “type numbers” to several techniques that have been described, but not precisely labeled in the past. Our hope is that more specific technique names will facilitate clearer communication between DW/BI team members.
Type 0: Retain Original
With type 0, the dimension attribute value never changes, so facts are always grouped by this original value. Type 0 is appropriate for any attribute labeled “original,” such as a customer’s original credit score, or any durable identifiers. Type 0 also applies to most date dimension attributes.
Type 4: Add Mini-Dimension
The type 4 technique is used when a group of dimension attributes is split off into a separate mini-dimension. This approach is useful when dimension attribute values are relatively volatile. Frequently used attributes in multi-million-row dimension tables are also mini-dimension design candidates, even if they don't change frequently. A surrogate key is assigned to each unique profile, or combination of attribute values, in the mini-dimension. The surrogate keys of both the base dimension and the mini-dimension profile are captured as foreign keys in the fact table.
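The profile-key assignment can be sketched in a few lines; the attribute names and values here are hypothetical:

```python
# Source customer rows; income_level and credit_band are the volatile
# attributes being split off into the mini-dimension.
customers = [
    {"customer_id": "C1", "income_level": "High", "credit_band": "A"},
    {"customer_id": "C2", "income_level": "Low",  "credit_band": "C"},
    {"customer_id": "C3", "income_level": "High", "credit_band": "A"},
]

profiles = {}    # (income_level, credit_band) -> mini-dimension surrogate key
fact_keys = []   # (customer key, profile key) pairs destined for the fact table
for cust in customers:
    profile = (cust["income_level"], cust["credit_band"])
    # One surrogate key per unique combination of attribute values.
    key = profiles.setdefault(profile, len(profiles) + 1)
    fact_keys.append((cust["customer_id"], key))

print(profiles)   # {('High', 'A'): 1, ('Low', 'C'): 2}
print(fact_keys)  # [('C1', 1), ('C2', 2), ('C3', 1)]
```

Note that C1 and C3 share a profile row, which is why the mini-dimension stays small even when the base dimension has millions of rows.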

The following type 5, 6, and 7 techniques are hybrids that combine the basics to meet a common pair of requirements: accurately preserve historical attribute values, and also report historical facts according to current attribute values. The hybrid approaches provide more analytic flexibility, albeit with greater complexity.
Type 5: Add Mini-Dimension and Type 1 Outrigger          
The type 5 technique builds on the type 4 mini-dimension by embedding a “current profile” mini-dimension key in the base dimension, where it is overwritten as a type 1 attribute. This approach, called type 5 because 4 + 1 equals 5, allows the currently assigned mini-dimension attribute values to be accessed along with the rest of the base dimension's attributes without linking through a fact table. Logically, we typically represent the base dimension and the current mini-dimension profile outrigger as a single table in the presentation layer. The outrigger attributes should have distinct column names, like “Current Income Level,” to differentiate them from the attributes in the mini-dimension linked to the fact table. The ETL team must overwrite the type 1 mini-dimension reference whenever the current profile changes over time. If the outrigger approach does not deliver satisfactory query performance, the mini-dimension attributes could be physically embedded (and updated) in the base dimension.

Type 6: Add Type 1 Attributes to Type 2 Dimension
Type 6 builds on the type 2 technique by also embedding current attributes in the dimension so that fact rows can be filtered or grouped by either the type 2 value in effect when the measurement occurred or the attribute’s current value. The type 6 moniker was suggested by an HP engineer in 2000 because it’s a type 2 row with a type 3 column that’s overwritten as a type 1; both 2 + 3 + 1 and 2 x 3 x 1 equal 6. With this approach, the current attributes are updated on all prior type 2 rows associated with a particular durable key, as illustrated by the following sample rows:
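The type 6 update logic can be sketched with hypothetical data; a single "department" attribute stands in for the tracked attributes:

```python
# Each type 2 row also carries a "current" copy of the department
# attribute. On a change, a new type 2 row is added and the current
# column is type-1-overwritten on every row sharing the durable key.
dim = [
    {"surrogate_key": 1, "durable_key": "P100",
     "department": "Toys", "current_department": "Toys",
     "row_current": True},
]

def apply_type6_change(dim, durable_key, new_value):
    """Add a type 2 row, then overwrite the current column everywhere."""
    for row in dim:
        if row["durable_key"] == durable_key:
            row["current_department"] = new_value   # type 1 overwrite
            row["row_current"] = False
    dim.append({"surrogate_key": max(r["surrogate_key"] for r in dim) + 1,
                "durable_key": durable_key,
                "department": new_value,            # type 2 historical value
                "current_department": new_value,
                "row_current": True})

apply_type6_change(dim, "P100", "Games")
for row in dim:
    print(row["surrogate_key"], row["department"], row["current_department"])
# 1 Toys Games
# 2 Games Games
```

The first row preserves the historical "Toys" value for facts loaded while it was in effect, yet also lets those old facts be rolled up under the current "Games" assignment.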

Type 7: Dual Type 1 and Type 2 Dimensions
With type 7, the fact table contains dual foreign keys for a given dimension: a surrogate key linked to the dimension table where type 2 attributes are tracked, plus the dimension’s durable supernatural key linked to the current row in the type 2 dimension to present current attribute values.
Type 7 delivers the same functionality as type 6, but it’s accomplished via dual keys instead of physically overwriting the current attributes with type 6. Like the other hybrid approaches, the current dimension attributes should be distinctively labeled to minimize confusion.
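A minimal sketch of the dual-key pattern, with hypothetical customer data:

```python
# Type 2 customer dimension: C1 was Bronze, is now Gold.
dim = [
    {"surrogate_key": 1, "durable_key": "C1", "segment": "Bronze", "current": False},
    {"surrogate_key": 2, "durable_key": "C1", "segment": "Gold",   "current": True},
]
# The fact row carries BOTH keys: the surrogate key of the row in effect
# when the fact occurred, and the durable key for current-value joins.
facts = [{"customer_key": 1, "customer_durable_key": "C1", "amount": 100}]

by_surrogate = {r["surrogate_key"]: r for r in dim}
current_by_durable = {r["durable_key"]: r for r in dim if r["current"]}

fact = facts[0]
historical = by_surrogate[fact["customer_key"]]["segment"]             # as-was
current = current_by_durable[fact["customer_durable_key"]]["segment"]  # as-is
print(historical, current)   # Bronze Gold
```

The same fact can thus be reported under its historically accurate segment or rolled up under the customer's current segment, without overwriting anything in the dimension.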
Finally, here’s a figure from The Data Warehouse Toolkit, Third Edition that highlights the implications of each slowly changing dimension technique for the analysis of fact table performance metrics. As we’ve warned in the past, there’s more to consider than simply 1, 2 and 3!
Slowly Changing Dimensions Are Not Always as Easy as 1, 2, 3

Unlike most OLTP systems, a major objective of a data warehouse is to track history. So, accounting for change is one of the analyst’s most important responsibilities. A sales force region reassignment is a good example of a business change that may require you to alter the dimensional data warehouse. We’ll discuss how to apply the right technique to account for the change historically. Hang on to your hats — this is not an easy topic.
Data warehouse design presumes that facts, such as customer orders or product shipments, will accumulate quickly. But the supporting dimension attributes of the facts, such as customer name or product size, are comparatively static. Still, most dimensions are subject to change, however slow. When dimensional modelers think about changing a dimension attribute, the three elementary approaches immediately come to mind: slowly changing dimension (SCD) types 1, 2 and 3.
As a quick review: Type 1 is most appropriate when processing corrections; this technique won’t preserve historically accurate associations. The changed attribute is simply updated (overwritten) to reflect the most current value. With a type 2 change, a new row with a new surrogate primary key is inserted into the dimension table to capture changes. Both the prior and new rows contain as attributes the natural key (or durable identifier), the most-recent-row flag, and the row effective and expiration dates. With type 3, another attribute is added to the existing dimension row to support analysis based on either the new or prior attribute value. This is the least commonly needed technique.
These three fundamental techniques are adequate for most situations. However, what happens when you need variations that build on these basics to serve more analytically mature data warehouse users? Business folks sometimes want to preserve the historically accurate dimension attribute associated with a fact (such as at the time of sale or claim), but maintain the option to roll up historical facts based on current dimension characteristics. That’s when you need hybrid variations of the three main types. We’ll lead you through the hybrids in this column.

The Mini Dimension with “Current” Overwrite

When you need historical tracking but are faced with semi-rapid changes in a large dimension, pure type 2 tracking is inappropriate. If you use a mini dimension, you can isolate volatile dimension attributes in a separate table rather than track changes in the primary dimension table directly. The mini-dimension grain is one row per “profile,” or combination of attributes, while the grain of the primary dimension might be one row per customer. The number of rows in the primary dimension may be in the millions, but the number of mini-dimension rows should be a fraction of that. You capture the evolving relationship between the mini dimension and primary dimension in a fact table. When a business event (transaction or periodic snapshot) spawns a fact row, the row has one foreign key for the primary dimension and another for the mini-dimension profile in effect at the time of the event.
Profile changes sometimes occur outside of a business event, for example when a customer’s geographic profile is updated without a sales transaction. If the business requires accurate point-in-time profiling, a supplemental factless fact table with effective and expiration dates can capture every relationship change between the primary and profile dimensions. One more embellishment with this technique is to add a “current profile” key to the primary dimension. This is a type 1 attribute, overwritten with every profile change, that’s useful for analysts who want current profile counts regardless of fact table metrics or want to roll up historical facts based on the current profile. You’d logically represent the primary dimension and profile outrigger as a single table to the presentation layer if doing so doesn’t harm performance. To minimize user confusion and error, the current attributes should have column names that distinguish them from the mini-dimension attributes. For example, the labels should indicate whether a customer’s marketing segment designation is a current one or an obsolete one that was effective when the fact occurred — such as “historical marital status at time of event” in the profile mini dimension and “current marital status” in the primary customer dimension.

Type 2 with “Current” Overwrite

Another variation for tracking unpredictable changes while supporting rollup of historical facts to current dimension attributes is a hybrid of type 1 and type 2. In this scenario, you capture a type 2 attribute change by adding a row to the primary dimension table. In addition, you have a “current” attribute on each row that you overwrite (type 1) for the current and all previous type 2 rows. You retain the historical attribute in its own original column. When a change occurs, the most current dimension row has the same value in the uniquely labeled current and historical (“as was” or “at time of event”) columns.
You can expand this technique to cover not just the historical and current attribute values but also a fixed, end-of-period value as another type 1 column. Although it seems similar, the end-of-period attribute may be unique from either the historical or current perspective. Say a customer’s segment changed on Jan. 5 and the business wants to create a report on Jan. 10 to analyze last period’s data based on the customer’s Dec. 31 designation. You could probably derive the right information from the row effective and expiration dates, but providing the end-of-period value as an attribute simplifies the query. If this query occurs frequently, it’s better to have the work done once, during the ETL process, rather than every time the query runs. You can apply the same logic to other fixed characteristics, such as the customer’s original segment, which never changes.

Instead of having the historical and current attributes reside in the same physical table, the current attributes could sit in an outrigger table joined to the dimension natural key. The same natural key, such as Customer ID, likely appears on multiple type 2 dimension rows with unique surrogate keys. The outrigger contains just one row of current data for each natural key in the dimension table; the attributes are overwritten whenever change occurs. To promote ease of use, have the core dimension and outrigger of current values appear to the user as one — unless this hurts application performance.

Type 2 with Natural Keys in the Fact Table

If you have a million-row dimension table with many attributes requiring historical and current tracking, the last technique we described becomes overly burdensome. In this situation, consider including the dimension natural key as a fact table foreign key, in addition to the surrogate key for type 2 tracking. This technique gives you essentially two dimension tables associated with the facts, but for good reason. The type 2 dimension has historically accurate attributes for filtering or grouping based on the effective values when the fact table was loaded. The dimension natural key joins to a table with just the current type 1 values. Again, the column labels in this table should be prefaced with “current” to reduce the risk of user confusion. You use these dimension attributes to summarize or filter facts based on the current profile, regardless of the values in effect when the fact row was loaded. Of course, if the natural key is unwieldy or ever reassigned, then you should use a durable surrogate reference key instead.
This approach delivers the same functionality as the type 2 with “current” overwrite technique we discussed earlier; that technique spawns more attribute columns in a single dimension table, while this approach relies on two foreign keys in the fact table.
While it’s uncommon, we’re sometimes asked to roll up historical facts based on any specific point-in-time profile, in addition to reporting them by the attribute values when the fact measurement occurred or by the current dimension attribute values. For example, perhaps the business wants to report three years of historical metrics based on the attributes or hierarchy in effect on Dec. 1 of last year. In this case, you can use the dual dimension keys in the fact table to your advantage. You first filter on the type 2 dimension table’s row effective and expiration dates to locate the attributes in effect at the desired point in time. With this constraint, a single row for each natural or durable surrogate key in the type 2 dimension has been identified. You can then join to the natural/durable surrogate dimension key in the fact table to roll up any facts based on the point-in-time attribute values. It’s as if you’re defining the meaning of “current” on the fly. Obviously, you must filter on the type 2 row dates or you’ll have multiple type 2 rows for each natural key, but that’s fundamental in the business’s requirement to report any history on any specified point-in-time attributes. Finally, only unveil this capability to a limited, highly analytic audience. This embellishment is not for the faint of heart.
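A minimal sketch of this point-in-time rollup, with hypothetical effective and expiration dates:

```python
from datetime import date

# Type 2 dimension rows for one durable key, with row effective and
# expiration dates; 9999-12-31 marks the currently effective row.
dim = [
    {"durable_key": "C1", "segment": "Bronze",
     "effective": date(2002, 1, 1), "expiration": date(2002, 11, 30)},
    {"durable_key": "C1", "segment": "Gold",
     "effective": date(2002, 12, 1), "expiration": date(9999, 12, 31)},
]
# Fact rows carry the durable key alongside the usual surrogate key.
facts = [
    {"customer_durable_key": "C1", "amount": 100},
    {"customer_durable_key": "C1", "amount": 40},
]

# Step 1: filter the type 2 rows to the single row in effect "as of"
# the chosen date -- defining the meaning of "current" on the fly.
as_of = date(2002, 12, 31)
profile = {r["durable_key"]: r["segment"] for r in dim
           if r["effective"] <= as_of <= r["expiration"]}

# Step 2: join facts through the durable key and roll up by the
# point-in-time attribute value.
rollup = {}
for f in facts:
    segment = profile[f["customer_durable_key"]]
    rollup[segment] = rollup.get(segment, 0) + f["amount"]
print(rollup)   # {'Gold': 140}
```

Without the as-of filter in step 1, every durable key would match multiple type 2 rows and the rollup would double-count, which is exactly the pitfall the column warns about.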

Series of Type 3 Attributes

Say you have a dimension attribute that changes with a predictable rhythm, such as annually, and the business needs to summarize facts based on any historical value of the attribute (not just the historically accurate and current, as we’ve primarily been discussing). For example, imagine the product line is recategorized at the start of every fiscal year and the business wants to look at multiple years of historical product sales based on the category assignment for the current year or any prior year.
This situation is best handled with a series of type 3 dimension attributes. On every dimension row, have a “current” category attribute that can be overwritten, as well as attributes for each annual designation, such as “2004 category” and “2003 category.” You can then group historical facts based on any annual categorization.
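A minimal sketch, assuming hypothetical product and category data:

```python
# Each product row carries the current category plus one type 3 column
# per annual recategorization.
products = {
    "P1": {"current_category": "Outdoor", "category_2004": "Outdoor",
           "category_2003": "Sporting Goods"},
    "P2": {"current_category": "Outdoor", "category_2004": "Garden",
           "category_2003": "Garden"},
}
sales = [("P1", 100), ("P2", 60)]   # (product id, amount)

def rollup(sales, products, column):
    """Group historical sales under the chosen year's categorization."""
    totals = {}
    for product_id, amount in sales:
        category = products[product_id][column]
        totals[category] = totals.get(category, 0) + amount
    return totals

print(rollup(sales, products, "category_2004"))    # {'Outdoor': 100, 'Garden': 60}
print(rollup(sales, products, "current_category")) # {'Outdoor': 160}
```

The same facts regroup cleanly under any year's scheme because every row carries a value for every annual column, which is exactly what unpredictable per-row changes cannot provide.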
This seemingly straightforward technique isn’t appropriate for the unpredictable changes we described earlier. Customer attributes evolve uniquely. You can’t add a series of type 3 attributes to track the prior attribute values (“prior-1,” “prior-2” and so on) for unpredictable changes, because each attribute would be associated with a unique point in time for nearly every row in the dimension table.

Balance Power against Ease of Use

Before using hybrid techniques to support sophisticated change tracking, remember to maintain the equilibrium between flexibility and complexity. Users’ questions and answer sets will vary depending on which dimension attributes are used for constraining or grouping. Given the potential for error or misinterpretation, hide the complexity (and associated capabilities) from infrequent users.
Modelling free text comments