Data Warehousing Interview Questions and Answers
What is Data Warehousing?**********
A data warehouse is the main repository of an organization’s historical data, its corporate memory. It contains the raw material for management’s decision support system. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operational systems (Ref: Wikipedia). Data warehousing is the collection of data designed to support management decision making. Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time. A data warehouse is a repository of integrated information, available for queries and analysis.
Different people have different definitions for a data warehouse. The most popular definition came from Bill Inmon, who provided the following:
Classic Definition :
Bill Inmon :
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.
Ralph Kimball provided a more concise definition of a data warehouse:
A data warehouse is a copy of transaction data specifically structured for query and analysis.
This is a functional view of a data warehouse. Kimball did not address how the data warehouse is built, as Inmon did; rather, he focused on the functionality of a data warehouse.
Further references:
http://en.wikipedia.org/wiki/Data_warehouse
http://www.1keydata.com/datawarehousing/data-warehouse-definition.html
http://www.1keydata.com/datawarehousing/datawarehouse.html
http://www.cs.sunysb.edu/~cse634/presentations/DataWarehousing-part-1.pdf
http://www.dkms.com/papers/kbmskwbak.pdf
http://www.b-eye-network.com/view/9752
http://www.dwbiconcepts.com/data-warehousing/18-dwbi-basic-concepts/87-what-is-a-data-warehouse-guide.html
http://www.youtube.com/watch?v=eiRhRxPuEU8
What are the fundamental stages of Data Warehousing?
Offline Operational Databases – Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server, where the processing load of reporting does not impact the operational system’s performance.
Offline Data Warehouse – Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure.
Real Time Data Warehouse – Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order, a delivery, or a booking).
Integrated Data Warehouse – Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.
Different data warehousing systems have different structures. Some may have an ODS (operational data store), while some may have multiple data marts. Some may have a small number of data sources, while some may have dozens of data sources. In view of this, it is far more reasonable to present the different layers of a data warehouse architecture rather than discussing the specifics of any one system.
In general, all data warehouse systems have the following layers:
- Data Source Layer
- Data Extraction Layer
- Staging Area
- ETL Layer
- Data Storage Layer
- Data Logic Layer
- Data Presentation Layer
- Metadata Layer
- System Operations Layer
Each component is discussed individually below:
Data Source Layer
This represents the different data sources that feed data into the data warehouse. The data sources can be of any format: plain text files, relational databases, other types of databases, and Excel files can all act as data sources.
Many different types of data can be a data source:
Operations -- such as sales data, HR data, product data, inventory data, marketing data, systems data.
Web server logs with user browsing data.
Internal market research data.
Third-party data, such as census data, demographics data, or survey data.
All these data sources together form the Data Source Layer.
Data Extraction Layer
Data gets pulled from the data source into the data warehouse system. Some minimal data cleansing may happen here, but major data transformation is unlikely at this stage.
Staging Area
This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart. Having one common area makes it easier for subsequent data processing / integration.
ETL Layer
This is where data gains its "intelligence", as logic is applied to transform the data from a transactional nature to an analytical nature. This layer is also where data cleansing happens.
Data Storage Layer
This is where the transformed and cleansed data sit. Based on scope and functionality, 3 types of entities can be found here: data warehouse, data mart, and operational data store (ODS). In any given system, you may have just one of the three, two of the three, or all three types.
Data Logic Layer
This is where business rules are stored. Business rules stored here do not affect the underlying data transformation rules, but they do affect what the report looks like.
Data Presentation Layer
This refers to the information that reaches the users. It can take the form of a tabular or graphical report in a browser, an emailed report that is automatically generated and sent every day, or an alert that warns users of exceptions, among others.
Metadata Layer
This is where information about the data stored in the data warehouse system is stored. A logical data model would be an example of something that's in the metadata layer.
System Operations Layer
This layer includes information on how the data warehouse system operates, such as ETL job status, system performance, and user access history.
Further References
http://www.1keydata.com/datawarehousing/data-warehouse-architecture.html
http://en.wikipedia.org/wiki/Architectural_pattern_(computer_science)#Examples
http://en.wikipedia.org/wiki/Enterprise_architecture
http://docs.oracle.com/cd/B28359_01/server.111/b28313/concept.htm#i1006409
What is Dimensional Modeling?
The dimensional data model involves two types of tables and differs from third normal form. It uses a fact table, which contains the measurements of the business, and dimension tables, which contain the context (the dimensions of the calculation) for those measurements.
What is Fact table?
A fact table contains the measurements of business processes, as well as the foreign keys to the dimension tables. For example, if your business process is “paper production”, then “average production of paper by one machine” or “weekly production of paper” would be considered measurements of that business process.
A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts with the same level of aggregation. Though most facts are additive, they can also be semi-additive or non-additive. Additive facts can be aggregated by simple arithmetical addition. A common example of this is sales. Non-additive facts cannot be added at all. An example of this is averages. Semi-additive facts can be aggregated along some of the dimensions and not along others. An example of this is inventory levels, where you cannot tell what a level means simply by looking at it.
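As a minimal illustration of additive, non-additive, and semi-additive facts, the following sketch uses plain Python and a few hypothetical fact rows (date, store, sales amount, average unit price, inventory level); it is not tied to any particular warehouse product.

```python
# Hypothetical daily fact rows: (date, store, sales_amount, avg_unit_price, inventory_level)
daily_rows = [
    ("2023-01-01", "North", 100.0, 5.0, 40),
    ("2023-01-02", "North", 200.0, 4.0, 35),
    ("2023-01-01", "South", 150.0, 3.0, 60),
    ("2023-01-02", "South", 250.0, 5.0, 55),
]

# Additive fact: sales_amount can be summed across every dimension (date, store).
total_sales = sum(r[2] for r in daily_rows)                        # 700.0

# Non-additive fact: an average cannot simply be summed; it must be recomputed
# from additive components (total amount / total units) at the new grain.
total_units = sum(r[2] / r[3] for r in daily_rows)                 # units = amount / price
recomputed_avg_price = total_sales / total_units                   # correct weighted average

# Semi-additive fact: inventory_level adds across stores for one date,
# but summing it across dates would double-count stock.
stock_on_jan_1 = sum(r[4] for r in daily_rows if r[0] == "2023-01-01")  # 100

print(total_sales, recomputed_avg_price, stock_on_jan_1)
```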
Requirements of Fact Tables
You must define a fact table for each star schema. From a modeling standpoint, the primary key of the fact table is usually a composite key that is made up of all of its foreign keys.
Further read here :
http://docs.oracle.com/cd/B28359_01/server.111/b28313/logical.htm
What is Dimension table?
A dimension table contains the textual attributes of the measurements stored in the fact tables. A dimension table is a collection of hierarchies, categories, and logic that a user can use to traverse hierarchy nodes.
A dimension is a structure, often composed of one or more hierarchies, that categorizes data. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Several distinct dimensions, combined with facts, enable you to answer business questions. Commonly used dimensions are customers, products, and time.
Dimension data is typically collected at the lowest level of detail and then aggregated into higher level totals that are more useful for analysis. These natural rollups or aggregations within a dimension table are called hierarchies.
Further read here:
http://en.wikipedia.org/wiki/Dimension_table
http://docs.oracle.com/cd/B28359_01/server.111/b28313/logical.htm
What are Hierarchies ?
Hierarchies are logical structures. They are used to describe business relationships and common access patterns in the database, and they can be used to define aggregations by organizing data into ordered levels. Within a hierarchy, each level is logically connected to the levels above and below it, and data values at lower levels aggregate into the data values at higher levels.
Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure.
Within a hierarchy, each level is logically connected to the levels above and below it. Data values at lower levels aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For example, in the product dimension, there might be two hierarchies—one for product categories and one for product suppliers.
Dimension hierarchies also group levels from general to granular. Query tools use hierarchies to enable you to drill down into your data to view different levels of granularity. This is one of the key benefits of a data warehouse.
When designing hierarchies, you must consider the relationships in business structures. For example, a divisional multilevel sales organization.
Hierarchies impose a family structure on dimension values. For a particular level value, a value at the next higher level is its parent, and values at the next lower level are its children. These familial relationships enable analysts to access data quickly.
Levels
A level represents a position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the month, quarter, and year levels. Levels range from general to specific, with the root level as the highest or most general level. The levels in a dimension are organized into one or more hierarchies.
Level relationships specify the top-to-bottom ordering of levels, from the most general (the root) to the most specific information. They define the parent-child relationships between the levels in a hierarchy. Hierarchies are also essential components in enabling more complex query rewrites. For example, the database can roll up existing sales revenue from a quarterly basis to a yearly aggregation when the dimensional dependencies between quarter and year are known. Figure 2-2 illustrates a typical dimension hierarchy.
Figure 2-2 Typical Levels in a Dimension Hierarchy
Reference Oracle.com
Further read here :
http://docs.oracle.com/cd/B28359_01/server.111/b28313/logical.htm
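As a small illustration of the rollups a time hierarchy enables, the following sketch (plain Python, hypothetical data and names) aggregates monthly sales up to the quarter and year levels.

```python
# Roll a month-level measure up a time hierarchy: month -> quarter -> year.
from collections import defaultdict

monthly_sales = {  # (year, month) -> sales amount (hypothetical figures)
    (2023, 1): 100, (2023, 2): 120, (2023, 3): 90,
    (2023, 4): 110, (2023, 5): 130, (2023, 6): 95,
}

def month_to_quarter(month: int) -> int:
    return (month - 1) // 3 + 1

quarterly = defaultdict(float)
yearly = defaultdict(float)
for (year, month), amount in monthly_sales.items():
    quarterly[(year, month_to_quarter(month))] += amount  # month rolls up to quarter
    yearly[year] += amount                                # and ultimately to year

print(dict(quarterly))  # {(2023, 1): 310.0, (2023, 2): 335.0}
print(dict(yearly))     # {2023: 645.0}
```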
What are the different methods of loading dimension tables?
There are two different ways to load data into dimension tables.
Conventional (slow):
All constraints and keys are validated against the data before it is loaded; this way, data integrity is maintained.
Direct (fast):
All constraints and keys are disabled before the data is loaded. Once the data is loaded, it is validated against all the constraints and keys. If data is found to be invalid or dirty, it is not included in the index, and all further processing is skipped for that data.
What is OLTP?
OLTP is an abbreviation of On-Line Transaction Processing. An OLTP system is an application that modifies data the instant it receives it and has a large number of concurrent users.
What is OLAP?******
OLAP is an abbreviation of On-Line Analytical Processing. An OLAP system is an application that collects, manages, processes, and presents multidimensional data for analysis and management purposes.
What is the difference between OLTP and OLAP?*****
Data Source
OLTP: Operational data; OLTP data is the original source of the data.
OLAP: Consolidated data; OLAP data comes from various sources.
Process Goal
OLTP: Snapshot of the business processes that perform the fundamental business tasks.
OLAP: Multi-dimensional views of business activities for planning and decision making.
Queries and Process Scripts
OLTP: Simple, quick-running queries run by users.
OLAP: Complex, long-running queries run by the system to update the aggregated data.
Database Design
OLTP: Normalized, smaller database. Speed is not an issue because the database is smaller, so normalization does not degrade performance. OLTP adopts an entity-relationship (ER) model and an application-oriented database design.
OLAP: De-normalized, larger database. Speed is an issue because the database is larger; de-normalization improves performance since fewer tables need to be scanned while performing tasks. OLAP adopts a star, snowflake, or fact constellation schema and a subject-oriented database design.
Describe the foreign key columns in the fact table and dimension tables.
Foreign keys of dimension tables are the primary keys of the entity (source) tables.
Foreign keys of fact tables are the primary keys of the dimension tables.
What is Data Mining?*******
Data Mining is the process of analyzing data from different perspectives and summarizing it into useful information.
What is the difference between view and materialized view?
A view takes the output of a query and makes it appear as a virtual table; it can be used in place of tables.
A materialized view provides indirect access to table data by storing the results of a query in a separate schema object.
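As a minimal illustration, the sketch below uses Python's built-in sqlite3 module, which supports views but has no native materialized views; a plain summary table stands in for the materialized view and must be refreshed explicitly. All table and column names are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('North', 100), ('North', 200), ('South', 150);

    -- A view stores only the query; it is re-executed every time it is read.
    CREATE VIEW v_sales_by_region AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;

    -- The summary table stores the query RESULT, like a materialized view would.
    CREATE TABLE mv_sales_by_region AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
""")

con.execute("INSERT INTO sales VALUES ('South', 50)")
print(con.execute("SELECT * FROM v_sales_by_region").fetchall())   # sees the new row
print(con.execute("SELECT * FROM mv_sales_by_region").fetchall())  # stale until refreshed

# Explicit refresh step that a real materialized view automates or schedules:
con.executescript("""
    DELETE FROM mv_sales_by_region;
    INSERT INTO mv_sales_by_region
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
""")
```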
What is ER Diagram?
Entity Relationship Diagrams are a major data modelling tool; they help organize the data in your project into entities and define the relationships between the entities. This process has proved to enable the analyst to produce a good database structure, so that the data can be stored and retrieved in the most efficient manner.
An entity-relationship (ER) diagram is a specialized graphic that illustrates the interrelationships between entities in a database. It is a type of diagram used in data modeling for relational databases. These diagrams show the structure of each table and the links between tables.
What is ODS?*******
ODS is an abbreviation of Operational Data Store: a database structure that is a repository for near real-time operational data rather than long-term trend data. The ODS may further become the enterprise's shared operational database, allowing operational systems that are being re-engineered to use the ODS as their operational database.
What is ETL?********
ETL is an abbreviation of extract, transform, and load. ETL is software that enables businesses to consolidate their disparate data while moving it from place to place, and it does not really matter that the data is in different forms or formats; the data can come from any source. ETL is powerful enough to handle such data disparities. First, the extract function reads data from a specified source database and extracts a desired subset of data. Next, the transform function works with the acquired data – using rules or lookup tables, or creating combinations with other data – to convert it to the desired state. Finally, the load function is used to write the resulting data to a target database.
ETL is also commonly described in terms of four steps: extract, clean, conform, and deliver.
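As an illustration, here is a minimal, hypothetical ETL sketch in Python: extract rows from a CSV source, apply a simple cleansing transformation, and load them into a target SQLite table. The file name, table name, and columns are assumptions made up for the example.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read rows lazily from the source file."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)   # e.g. columns: order_id, amount, country

def transform(rows):
    """Transform: basic cleansing and conforming of values."""
    for row in rows:
        yield {
            "order_id": int(row["order_id"]),
            "amount": round(float(row["amount"]), 2),   # cleanse numeric values
            "country": row["country"].strip().upper(),  # conform country codes
        }

def load(rows, con):
    """Load: write the transformed rows into the target table."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders (order_id INTEGER, amount REAL, country TEXT)"
    )
    con.executemany(
        "INSERT INTO fact_orders (order_id, amount, country) "
        "VALUES (:order_id, :amount, :country)",
        rows,
    )
    con.commit()

if __name__ == "__main__":
    con = sqlite3.connect("warehouse.db")
    load(transform(extract("orders.csv")), con)
```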
Further reference :
http://en.wikipedia.org/wiki/Extract,_transform,_load
http://en.wikipedia.org/wiki/Data_cleansing
http://en.wikipedia.org/wiki/Data_integration
http://en.wikipedia.org/wiki/Data_profiling
What is VLDB?
VLDB is abbreviation of Very Large DataBase. A one terabyte database would normally be considered to be a VLDB. Typically, these are decision support systems or transaction processing applications serving large numbers of users.
Is an OLTP database design optimal for a data warehouse?
No. OLTP database tables are normalized, which adds time for queries to return results. Additionally, an OLTP database is smaller and does not contain data over a long period (many years), which needs to be analyzed. An OLTP system is basically an ER model, not a dimensional model. If a complex query is executed on an OLTP system, it may cause heavy overhead on the OLTP server and affect the normal business processes.
If de-normalization improves data warehouse performance, why is the fact table in normal form?
Foreign keys of fact tables are the primary keys of the dimension tables. Since the fact table consists mainly of columns that are primary keys of other tables, it is by that fact itself in normal form.
What are lookup tables?*****
A lookup table is a table used against the target table, based on the primary key of the target; it updates the target by allowing through only modified (new or updated) records, based on the lookup condition.
What are Aggregate tables?**********
An aggregate table contains a summary of existing warehouse data, grouped to certain levels of dimensions. It is always faster to retrieve data from an aggregate table than to visit the original table, which may have millions of records. Aggregate tables reduce the load on the database server, increase query performance, and return results quickly.
What is real time data-warehousing?***
Data warehousing captures business activity data. Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly.
What are conformed dimensions?******
Conformed dimensions mean exactly the same thing with every possible fact table to which they are joined; they are common across the cubes.
Either the columns are the same, or one is a subset of the other with the same values.
What is conformed fact?*******
Conformed facts are the facts which can be used across multiple Data Marts in combination with multiple dimension tables accordingly.
How do you load the time dimension?
Time dimensions are usually loaded by a program that loops through all possible dates that may appear in the data. 100 years may be represented in a time dimension, with one row per day.
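A minimal sketch of such a date-looping loader, with hypothetical column names, might look like this:

```python
from datetime import date, timedelta

def build_time_dimension(start: date, end: date):
    """Generate one row per day between start and end, inclusive."""
    rows, day = [], start
    while day <= end:
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),   # surrogate-style integer key
            "full_date": day.isoformat(),
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "day_of_week": day.strftime("%A"),
            "is_weekend": day.weekday() >= 5,
        })
        day += timedelta(days=1)
    return rows

# Example: one month of rows; a real load might span decades (one row per day).
dim_date = build_time_dimension(date(2000, 1, 1), date(2000, 1, 31))
print(len(dim_date), dim_date[0])
```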
What is a level of Granularity of a fact table?********
Level of granularity means the level of detail that you put into the fact table in a data warehouse; in other words, how much detail you are willing to store for each transactional fact.
What are non-additive facts?*****
Non-additive facts are facts that cannot be summed up across any of the dimensions present in the fact table. However, they are not considered useless; if the dimensions change, the same facts can still be useful.
What is a factless fact table?******
A fact table that does not contain any numeric fact columns is called a factless fact table.
What are slowly changing dimensions (SCD)?
SCD is an abbreviation of Slowly Changing Dimension. SCD applies to cases where an attribute of a record varies over time.
There are three different types of SCD.*********
1) SCD1: The new record replaces the original record. Only one record exists in the database: the current data.
2) SCD2: A new record is added to the dimension table (for example, the customer dimension). Two records exist in the database: the current data and the previous, historical data.
3) SCD3: The original record is modified to include the new data. One record exists in the database: the new information is attached to the old information in the same row.
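The following minimal sketch illustrates the three strategies on a customer's address, using plain Python dicts in place of dimension rows; the column names are hypothetical and the logic is only an outline of what a real ETL tool would do.

```python
from datetime import date

def scd1_overwrite(row, new_address):
    """SCD1: overwrite the attribute in place; no history is kept."""
    row["address"] = new_address
    return row

def scd2_add_row(rows, customer_id, new_address):
    """SCD2: expire the current row and append a new one; full history is kept."""
    today = date.today().isoformat()
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["is_current"] = False
            row["end_date"] = today
    rows.append({"customer_id": customer_id, "address": new_address,
                 "start_date": today, "end_date": None, "is_current": True})
    return rows

def scd3_add_column(row, new_address):
    """SCD3: keep limited history in an extra column of the same row."""
    row["previous_address"] = row["address"]
    row["address"] = new_address
    return row
```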
What is hybrid slowly changing dimension?*****
Hybrid SCDs are a combination of both SCD1 and SCD2. It may happen that, in a table, some columns are important and we need to track changes to them (i.e., capture their historical data), whereas for other columns we do not care even if the data changes.
What is BUS Schema?*********
A BUS schema is composed of a master suite (set) of conformed dimensions and standardized definitions of facts.
The Enterprise Bus Matrix :
The Enterprise Bus Matrix is a Data Warehouse planning tool and model created by Ralph Kimball, and is part of the Data Warehouse Bus Architecture. The Matrix is the logical definition of one of the core concepts of Kimball’s approach to Dimensional Modeling – Conformed dimensions.[1]
The Bus Matrix defines part of the Data Warehouse Bus Architecture and is an output of the business requirements phase in the Kimball Lifecycle. It is applied in the following phases of dimensional modeling and development of the Data Warehouse. The matrix can be categorized as a hybrid model, being part technical design tool, part project management tool, and part communication tool. [2]
Background:
The Enterprise Bus Matrix stems from the issue of how one goes about creating the overall Data Warehouse environment. Historically there has been the structured, centralized, planned approach on one hand, and more loosely defined, department-specific solutions developed in a more independent manner on the other. Autonomous projects can result in a range of isolated, stovepipe data marts. Naturally, each approach has its issues: the overall visionary approach often struggles with long delivery cycles and a lack of reaction time, as formality and scope issues are evident; on the other hand, the development of isolated data marts leads to stovepipe systems that lack synergy in development. Over time this approach will lead to a so-called data-mart-in-a-box architecture [3], where lack of interoperability and cohesion is apparent and can hinder the realization of an overall enterprise Data Warehouse. As an attempt to handle this matter, Ralph Kimball introduced the enterprise bus.
The Bus Matrix:
In short, the purpose of the bus matrix is one of high abstraction and visionary planning at the Data Warehouse architectural level. By dictating coherency in the development and implementation of an overall Data Warehouse, the Bus Architecture approach enables an overall vision of broader enterprise integration and consistency while at the same time dividing the problem into more manageable parts [2] – all in a technology- and software-independent manner. [4]
The bus matrix and architecture build upon the concept of conformed dimensions: creating a structure of common dimensions that, ideally, can be used across the enterprise by all business processes related to the DW and the corresponding fact tables from which they derive their context. According to Kimball and Margy Ross's article "Differences of Opinion" [5], the enterprise data warehouse built on the bus architecture "identifies and enforces the relationship between business process metrics (facts) and descriptive attributes (dimensions)".
The concept of a bus [6] is well known in the language of Information Technology, and it is what the conformed dimension concept reflects in the Data Warehouse: creating the skeletal structure where all parts of a system connect, ensuring interoperability and consistency of data, while at the same time allowing for future expansion. This makes the conformed dimensions act as the integration 'glue', creating a robust backbone for the enterprise Data Warehouse. [7]
Establishment and applicability:
Figure 1 [8] shows the basis for a single-document planning tool for the whole of the DW implementation: a graphical overview in which the enterprise's core business processes or events, each corresponding to a fact table and typically complemented by a major source system, form the horizontal rows, and the groups of contextual data, the common conformed dimensions, form the vertical columns.
In this way the shared dimensions are defined, as each process indicates which dimensions it applies to through the cells (figure 2). [2] By this definition and coordination of conformed dimensions and processes, the development of the overall DW bus architecture is realized. [2] The matrix identifies the shared dimensions related to processes and fact tables, and can be a tool for planning, prioritizing what needs to be approached, coordinating implementation, and communicating the importance of conformed dimensions.
Kimball extends the bus matrix in detail, as seen in figure 3 [2], by introducing the other steps of the data warehouse methodology: the fact tables, their granularity, and finally the description of the needed facts for each process. This structures and specifies what is needed across the enterprise in a more specific manner, further illustrating how the matrix can be used as a planning tool.
Further References :
http://en.wikipedia.org/wiki/Enterprise_Service_Bus
http://en.wikipedia.org/wiki/Scalability
What are Stovepipes ?*********
Stovepipes are independent data marts.
Stovepipe data marts are small data warehouses that support the needs of individual business units but cannot be integrated at the corporate level, because they do not conform with enterprise-wide data definitions.
Lack of integration with central metadata results in the development of independent data marts, often called stovepipe data marts.
IS departments have known for years that they should not build stovepipe applications. In spite of the well-known problems with stovepipe applications, most data warehouses built today incorporate stovepipe data marts.
To minimize development risk and provide rapid return on investment, many organizations build data warehouses “bottom-up”, one business area at a time. Although the bottom-up approach minimizes development risk, it encourages the implementation of stovepipe data marts. In their rush to deploy individual data marts, many organizations do not take the time to implement the infrastructure required to support consolidation of data across data marts. The result is a profusion of stovepipe data marts that satisfy the requirements of individual business units, but cannot be used to support corporate requirements for an integrated view of data across data marts. Subsequent attempts to install the infrastructure required to integrate data marts with a central repository often fail. The failure is due to the complexity and risk associated with making major changes to an operational data warehousing facility that is growing rapidly and supports large numbers of business end users.
It is extremely difficult for organizations to avoid building stove-pipe data marts. For example, a business unit may use departmental funds to build a data mart that supports their specific DSS requirements. The data mart maintains local metadata (a local semantic layer) that provides intuitive access to data in the data mart. Subsequently, another business unit identifies the need to build a data mart for their area of the business. They build a data mart that meets their business requirements, but their local metadata does not conform with the metadata definitions contained in the first data mart. Neither data mart conforms with definitions stored in the central metadata repository.
http://www.mimno.com/avoiding-mistakes2.html
http://www.mimno.com/avoiding-mistakes3.html
http://www.mimno.com/index.html
What is a Star Schema?******
A star schema is a way of organizing tables, with a central fact table joined directly to its dimension tables, so that results can be retrieved from the database quickly in a warehouse environment.
What is a Snowflake Schema?*****
In a snowflake schema, each dimension has a primary dimension table, to which one or more additional dimension tables can join. The primary dimension table is the only table that can join to the fact table.
Differences between star and snowflake schema?******
Star schema – A single fact table with N dimensions; all dimensions are linked directly to the fact table. This schema is de-normalized and results in simple joins and less complex queries, as well as faster results.
Snowflake schema – Dimensions that are extended into additional dimension tables are known as a snowflake schema; dimensions may be interlinked or have one-to-many relationships with other tables. This schema is normalized and results in more complex joins and queries, as well as slower results.
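To make the structural difference concrete, here is a hypothetical side-by-side sketch of the two shapes as SQLite DDL issued from Python; table and column names are invented for the example.

```python
import sqlite3

star = """
-- Star: the de-normalized product dimension holds the category text directly.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                          product_name TEXT, category_name TEXT);
CREATE TABLE fact_sales  (product_key INTEGER REFERENCES dim_product,
                          date_key INTEGER, amount REAL);
"""

snowflake = """
-- Snowflake: the product dimension is normalized into a category outrigger.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT,
                           category_key INTEGER REFERENCES dim_category);
CREATE TABLE fact_sales   (product_key INTEGER REFERENCES dim_product,
                           date_key INTEGER, amount REAL);
"""

sqlite3.connect(":memory:").executescript(star)       # one join to reach the category
sqlite3.connect(":memory:").executescript(snowflake)  # two joins to reach the category
```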
What is Difference between ER Modeling and Dimensional Modeling?
ER modeling is used for normalizing the OLTP database design. Dimensional modeling is used for de-normalizing the ROLAP/MOLAP design.
What is degenerate dimension table?
Columns in a fact table that are neither foreign keys to dimension tables nor measures are called degenerate dimensions.
Wiki Reference :
According to Ralph Kimball, in a data warehouse, a degenerate dimension is a dimension key in the fact table that does not have its own dimension table, because all the interesting attributes have been placed in analytic dimensions. The term "degenerate dimension" was originated by Ralph Kimball.
As per Anand Maheshwari
Degenerate dimensions commonly occur when the fact table's grain is a single transaction (or transaction line). Transaction control header numbers assigned by the operational business process are typically degenerate dimensions, such as order, ticket, credit card transaction, or check numbers. These degenerate dimensions are natural keys of the "parents" of the line items.
Even though there is no corresponding dimension table of attributes, degenerate dimensions can be quite useful for grouping together related fact tables rows. For example, retail point-of-sale transaction numbers tie all the individual items purchased together into a single market basket. In health care, degenerate dimensions can group the claims items related to a single hospital stay or episode of care.
Why is Data Modeling Important?
Data modeling is probably the most labor-intensive and time-consuming part of the development process. The goal of the data model is to make sure that all data objects required by the database are completely and accurately represented. Because the data model uses easily understood notations and natural language, it can be reviewed and verified as correct by the end-users.
In computer science, data modeling is the process of creating a data model by applying a data model theory to create a data model instance. A data model theory is a formal data model description. When data modelling, we are structuring and organizing data. These data structures are then typically implemented in a database management system. In addition to defining and organizing the data, data modeling will impose (implicitly or explicitly) constraints or limitations on the data placed within the structure.
Managing large quantities of structured and unstructured data is a primary function of information systems. Data models describe structured data for storage in data management systems such as relational databases. They typically do not describe unstructured data, such as word processing documents, email messages, pictures, digital audio, and video.
Further Reference :
http://en.wikipedia.org/wiki/Database_model
http://en.wikipedia.org/wiki/Three_schema_approach
http://upload.wikimedia.org/wikipedia/commons/2/24/Operational_Data_Store_Architecture_Pattern.jpg
http://en.wikipedia.org/wiki/Blackboard_system
What is surrogate key?*******
A surrogate key is a substitution for the natural primary key. It is just a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it is unique for each row in the table. It is useful because the natural primary key can change, which makes updates more difficult. Surrogate keys are always integer or numeric values.
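A minimal sketch of the idea, with hypothetical names: surrogate keys are assigned independently of the natural key, so the natural key (for example, a customer number) can change without breaking warehouse joins.

```python
import itertools

class SurrogateKeyGenerator:
    """Assign a stable integer surrogate key for each natural key seen."""

    def __init__(self):
        self._next = itertools.count(1)     # simple monotonically increasing integer
        self._by_natural_key = {}

    def key_for(self, natural_key):
        if natural_key not in self._by_natural_key:
            self._by_natural_key[natural_key] = next(self._next)
        return self._by_natural_key[natural_key]

keys = SurrogateKeyGenerator()
print(keys.key_for("CUST-0042"))   # 1
print(keys.key_for("CUST-0099"))   # 2
print(keys.key_for("CUST-0042"))   # 1 again: the surrogate stays stable
```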
What is Data Mart?*****
A data mart (DM) is a specialized version of a data warehouse (DW). Like data warehouses, data marts contain a snapshot of operational data that helps business people strategize based on analyses of past trends and experiences. The key difference is that the creation of a data mart is predicated on a specific, predefined need for a certain grouping and configuration of select data. A data mart configuration emphasizes easy access to relevant information (Reference: Wikipedia). Data marts are designed to help managers make strategic decisions about their business.
What is the difference between OLAP and Data warehouse?
A data warehouse is the place where the data is stored for analysis, whereas OLAP is the process of analyzing the data: managing aggregations and partitioning information into cubes for in-depth visualization.
What is a Cube and Linked Cube with reference to data warehouse?*********
Cubes are logical representations of multidimensional data. The edges of the cube contain dimension members, and the body of the cube contains data values. The linking in a linked cube ensures that the data in the cubes remains consistent.
http://en.wikipedia.org/wiki/OLAP_cube
What is junk dimension?********
A number of very small dimensions might be lumped together to form a single dimension, a junk dimension; the attributes are not closely related. Grouping random flags and text attributes and moving them into a separate sub-dimension is known as a junk dimension.
What are Conformed dimension ?
A conformed dimension is a set of data attributes that have been physically implemented in multiple database tables using the same structure, attributes, domain values, definitions and concepts in each implementation. A conformed dimension cuts across many facts.
Dimensions are conformed when they are either exactly the same (including keys) or one is a perfect subset of the other. Most important, the row headers produced in the answer sets from two different conformed dimensions must be able to match perfectly.
Conformed dimensions are either identical or strict mathematical subsets of the most granular, detailed dimension. Dimension tables are not conformed if the attributes are labeled differently or contain different values. Conformed dimensions come in several different flavors. At the most basic level, conformed dimensions mean exactly the same thing with every possible fact table to which they are joined. The date dimension table connected to the sales facts is identical to the date dimension connected to the inventory facts.[1]
Junk dimension: A junk dimension is a convenient grouping of typically low-cardinality flags and indicators. By creating an abstract dimension, these flags and indicators are removed from the fact table while placing them into a useful dimensional framework. [2] A junk dimension is a dimension table consisting of attributes that do not belong in the fact table or in any of the existing dimension tables. The nature of these attributes is usually text or various flags, e.g. non-generic comments or just simple yes/no or true/false indicators. These kinds of attributes are typically remaining when all the obvious dimensions in the business process have been identified, and thus the designer is faced with the challenge of where to put these attributes that do not belong in the other dimensions.
One solution is to create a new dimension for each of the remaining attributes, but due to their nature, it could be necessary to create a vast number of new dimensions, resulting in a fact table with a very large number of foreign keys. The designer could also decide to leave the remaining attributes in the fact table, but this could make the row length of the table unnecessarily large if, for example, an attribute is a long text string.
The solution to this challenge is to identify all the attributes and then put them into one or several junk dimensions. One junk dimension can hold several true/false or yes/no indicators that have no correlation with each other, so it is often convenient to convert the indicators into more descriptive attributes. An example would be an indicator about whether a package has arrived: instead of indicating this as "yes" or "no", it would be converted into "arrived" or "pending" in the junk dimension. The designer can choose to build the dimension table so that it holds every indicator combined with every other indicator, so that all combinations are covered. This sets up a fixed size for the table itself, which would be 2^x rows, where x is the number of indicators (see the sketch below). This solution is appropriate in situations where the designer expects to encounter a lot of different combinations and where the possible combinations are limited to an acceptable level. In a situation where the number of indicators is large, thus creating a very big table, or where the designer only expects to encounter a few of the possible combinations, it would be more appropriate to build each row in the junk dimension as new combinations are encountered. To limit the size of the tables, multiple junk dimensions might be appropriate in other situations, depending on the correlation between various indicators.
Junk dimensions are also appropriate for holding attributes such as non-generic comments from the fact table. Such attributes might consist of data from an optional comment field when a customer places an order, and as a result will probably be blank in many cases. Therefore the junk dimension should contain a single row representing the blanks, with a surrogate key that will be used in the fact table for every row returned with a blank comment field. [3]
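The sketch below illustrates the pre-built 2^x approach described above: generating one junk-dimension row for every combination of a few low-cardinality indicators. The flag names and values are hypothetical.

```python
import itertools

flags = {
    "payment_received": [True, False],
    "gift_wrapped": [True, False],
    "delivery_status": ["arrived", "pending"],   # descriptive values instead of yes/no
}

# One row per combination of indicator values, each with its own surrogate key.
junk_dimension = [
    dict(zip(flags, combo), junk_key=i + 1)
    for i, combo in enumerate(itertools.product(*flags.values()))
]

print(len(junk_dimension))   # 8 rows: 2 * 2 * 2 combinations
print(junk_dimension[0])     # e.g. {'payment_received': True, ..., 'junk_key': 1}
```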
Degenerate dimension: A dimension key, such as a transaction number, invoice number, ticket number, or bill-of-lading number, that has no attributes and hence does not join to an actual dimension table. Degenerate dimensions are very common when the grain of a fact table represents a single transaction item or line item, because the degenerate dimension represents the unique identifier of the parent. Degenerate dimensions often play an integral role in the fact table's primary key. [4]
Role-playing dimensions: Dimensions are often recycled for multiple applications within the same database. For instance, a "Date" dimension can be used for "Date of Sale", as well as "Date of Delivery" or "Date of Hire". This is often referred to as a "role-playing dimension".
What is snapshot with reference to data warehouse?********
You can disconnect the report from the catalog to which it is attached by saving the report with a snapshot of the data.
What is active data warehousing?******
An active data warehouse provides information that enables decision-makers within an organization to manage customer relationships nimbly, efficiently and proactively.
What is the difference between data warehousing and business intelligence?
Data warehousing deals with all aspects of managing the development, implementation and operation of a data warehouse or data mart including meta data management, data acquisition, data cleansing, data transformation, storage management, data distribution, data archiving, operational reporting, analytical reporting, security management, backup/recovery planning, etc. Business intelligence, on the other hand, is a set of software tools that enable an organization to analyze measurable aspects of their business such as sales performance, profitability, operational efficiency, effectiveness of marketing campaigns, market penetration among certain customer groups, cost trends, anomalies and exceptions, etc. Typically, the term “business intelligence” is used to encompass OLAP, data visualization, data mining and query/reporting tools.
Data Retention and Purging in a Data Warehouse
http://www.dwbiconcepts.com/data-warehousing/18-dwbi-basic-concepts/85-data-retention-and-purging-in-a-data-warehouse.html
What is a data warehouse - A 101 guide to modern data warehousing
http://www.dwbiconcepts.com/data-warehousing/18-dwbi-basic-concepts/87-what-is-a-data-warehouse-guide.html
What is Data Integration (DI)?
http://www.dwbiconcepts.com/data-warehousing/18-dwbi-basic-concepts/92-data-integration.html
Kimball vs Inmon Design approaches
http://docs.oracle.com/cd/B28359_01/server.111/b28313/concept.htm#i1006409
http://www.1keydata.com/datawarehousing/inmon-kimball.html
http://www.inmoncif.com/products/A%20TALE%20OF%20TWO%20ARCHITECTURES.pdf
http://www.nagesh.com/publications/technology/173-inmon-vs-kimball-an-analysis.html
http://bennyaustin.wordpress.com/2010/05/02/kimball-and-inmon-dw-models/
http://searchdatamanagement.techtarget.com/answer/Ralph-Kimball-vs-Bill-Inmon-approaches-to-data-warehouse-design
http://www.olap.it/Articoli/Battle%20of%20the%20giants%20-%20comparing%20Kimball%20and%20Inmon.pdf
http://oracle-dba-enciclopedia.blogspot.in/2011/08/oracle-1z0-515-data-warehousing-11g.html
http://opensourceanalytics.com/2006/03/14/data-mart-vs-data-warehouse-the-great-debate/
http://www.olap.it/Articoli/Battle%20of%20the%20giants%20-%20comparing%20Kimball%20and%20Inmon.pdf
http://infonitive.com/?p=255
http://danlinstedt.com/datavaultcat/wasting-money-on-inmon-vs-kimball-wars/
http://www.information-management.com/infodirect/19990901/1400-1.html
https://dspace.ist.utl.pt/bitstream/2295/630366/1/licao_6.pdf
http://www.maop.org/Fall_2005/Bertman_Jeffrey/Dispelling%20Myths.pdf
http://www.exforsys.com/tutorials/msas/data-warehouse-design-kimball-vs-inmon.html
http://www.kimballgroup.com/html/PDFs/Fables_Facts.pdf
Top 50 Data Warehousing Interview Questions with Answers
http://www.dwbiconcepts.com/tutorial/24-interview-questions/6-top-50-dwbi-interview-questions-with-answers.html
Oracle BI By Bakboord My view on Oracle Business Intelligence (Applications)
Multiple Fact Reporting on (Non-)Conforming dimensions
Multiple Fact Reporting on (Non-)Conforming dimensions – Part II http://obibb.wordpress.com/2010/05/31/multiple-fact-reporting-on-non-conforming-dimensions/
http://obibb.wordpress.com/tag/conformed/
Data Warehousing and Analytics on Unstructured Data / Big Data using Hadoop on Microsoft platform http://siddhumehta.blogspot.in/2011/07/data-warehousing-and-analytics-on.html
Developing a Dimensional Model w/ conformed dimensions and facts http://blog.tallan.com/2008/09/23/developing-a-dimensional-model-w-conformed-dimensions-and-facts/
Data Warehousing Blog Posts http://tdwi.org/blogs/blogs-data-warehousing.aspx
Surrogate Keys http://www.jamesserra.com/archive/2012/01/surrogate-keys/
The Data Warehousing Blog http://dwhblog.org/blog/
A Real World Experience
Difference between ETL and ELT http://www.jamesserra.com/archive/2012/01/difference-between-etl-and-elt/
Normalizing Your Database http://www.jamesserra.com/archive/2012/01/normalizing-your-database/
Filter on a dimension with a fact that don't depend on that dimension https://forums.oracle.com/forums/thread.jspa?messageID=3609020&filter and non-conforming dimensions
https://forums.oracle.com/forums/thread.jspa?threadID=1075609&tstart=0
Partitioned Fact Tables
http://blog.tallan.com/2008/12/04/partitioned-fact-tables/
SCD via SQL Stored Procedure
http://blog.tallan.com/2008/09/08/scd-via-sql-stored-procedure/
A Warehousing Overview
http://blog.tallan.com/2008/07/31/a-warehousing-overview/
http://www.dwhworld.com/2010/07/data-marts/
http://www.dwhworld.com/2010/01/data-warehouse-interview-questions-part-1/
http://www.dwhworld.com/2010/01/data-warehouse-interview-questions-part-2/
- Drill Down/Up: in information technology, to move vertically, either from summary information to detailed data by focusing in on something (drill down) or from detailed data to summary information (drill up). To drill down through a series of folders, for example, on a desktop means to go through the hierarchy of folders to find a specific file, or to click through drop-down menus in a GUI. To drill down through a database is to access information by starting with a general category and moving through the hierarchy of field to file to record.
- Drill through is a term used to describe the act of exploring or browsing items such as folders, files, or related components. Generally associated with drill down and drill up, which indicate vertical movements between components, drill through is an action in which you move horizontally between two items via a related link. An example of drill through is the case of two reports that are in a master/detail relation with each other: by clicking a master item on the master report, you reach the details of the clicked item on the details report. Drill through means you are drilling through the same report to get the data. When you are hiding/toggling some items with respect to another item, you will see a plus (+) symbol, and by expanding it you can show the hidden data. This feature is known as drill through.
- Drill across means you are drilling across reports to get the data: when you jump or navigate from one report to another, you are drilling across the reports. This is configured through the navigation property, and it lets you show data from other reports.
- Slice and Dice is the term used for changing dimensions after viewing the cube: see things by location, then change to view by product. You are slicing the data from a different perspective.
- A database shard ("sharding") is the phrase used to describe a horizontal partition in a database or search engine. The idea behind sharding is to split data among multiple machines while ensuring that the data is always accessed from the correct place. Since sharding spreads the database across multiple machines, the database programmer specifies explicit sharding rules to determine which machine any piece of data will be stored on. Sharding may also be referred to as horizontal scaling or horizontal partitioning.
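A minimal sketch of such an explicit sharding rule, with hypothetical shard names and keys: route each row to one of several shards based on a stable hash of its key, so the same key always lands on the same machine.

```python
import hashlib

SHARDS = ["shard_0", "shard_1", "shard_2", "shard_3"]

def shard_for(customer_id: str) -> str:
    # Use a stable hash (not Python's randomized hash()) so the same key maps
    # to the same shard across processes and restarts.
    digest = hashlib.sha1(customer_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("CUST-0042"))   # always the same shard for this customer
print(shard_for("CUST-0099"))
```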