7 July 2018
A data warehouse is a storage platform that contains historical data derived from transactional and relational databases. Because it brings together data from different sources, it helps separate the analysis workload from the transaction workload. It is designed for query and analysis rather than transaction processing, allowing a business to consolidate data from several sources.
In addition to being an RDBMS (Relational Database Management System), a data warehouse often includes an ETL (Extract, Transform, Load) solution, an OLAP (Online Analytical Processing) engine, and client analysis tools, and it can manage the process of gathering data and delivering it to business users.
A data warehouse consists of three tightly integrated components.
System tables : The system tables store metadata about the data in the data tables. Both the data tables and the system tables are stored as files using the underlying operating system’s file system.
ETL is the process of reading data from one or more sources, applying transformations to the data, and writing it to another data source. Conceptually, it consists of three steps: extract, transform, and load. These steps need not be sequential; an application does not have to extract all the data before moving on to the transform step. Once it has extracted a portion of the data, it may run the three steps in parallel.
The extract step involves reading data from one or more operational systems. The data source could be a database, an API, or a file. The source database can be a relational database or a NoSQL data store. A file can be in CSV, JSON, XML, Parquet, ORC, Avro, Protocol Buffers, or some other format. The transform step involves cleaning and modifying the source data according to a set of rules. For example, rows with invalid data may be dropped, or columns with null values may be populated with a default value.
Transformations may include concatenating two columns, splitting a column into multiple columns, encoding a column, translating a column from one encoding to another, or applying some other change to make the data ready for the destination system.
The load step writes the data to a destination data source. The destination can be a database or a file.
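The three steps can be sketched as a small pipeline. This is a minimal illustration using only the Python standard library; the file name `orders.csv`, the column names, and the cleaning rules are assumptions made for the example, not part of any real system.

```python
# Minimal ETL sketch: extract from a CSV file, transform the rows,
# load them into a SQLite database. All names are illustrative.
import csv
import sqlite3

def extract(path):
    """Extract: read rows from a CSV source file as dictionaries."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Transform: drop invalid rows, fill null values, convert types."""
    for row in rows:
        if not row["order_id"]:                        # drop rows with invalid data
            continue
        row["country"] = row["country"] or "UNKNOWN"   # populate null values
        row["amount"] = float(row["amount"])
        yield row

def load(rows, conn):
    """Load: write the cleaned rows into the destination database."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders "
                 "(order_id TEXT, country TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        ((r["order_id"], r["country"], r["amount"]) for r in rows))
    conn.commit()
```

Because the functions are generators, the pipeline streams: a row can be transformed and loaded before the next row is extracted, matching the point above that the steps need not run strictly one after another.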
Generally, ETL is used for data warehousing: data is collected from a number of different operational systems, cleansed, transformed, and stored in a data warehouse. However, ETL is not unique to data warehousing. For example, it can be used to share data between two disparate systems, or to convert data from one format to another. Similarly, migrating data from a legacy system to a new system is an ETL process.
Metadata is collected to improve supportability of the environment and to orient users to the meaning of the data.
There are two types of metadata used in an Enterprise Data Warehouse:
Technical Metadata – data about how data is sourced, loaded, transformed, and managed.
Business Metadata – data that helps users understand the business context, meaning, and appropriate usage of the data.
Metadata provides benefits in many areas, and several major types of metadata are collected and made available.
A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. These are dimensions that change slowly over time, rather than on a regular, time-based schedule. In a data warehouse, we need such dimensional attributes in order to report on historical data. Examples of these dimensions include customer address, geography, and employee salary.
Tracking the history of dimension records is considered, and implemented as, one of the most critical ETL tasks.
The most popular approaches to dealing with SCDs are given below.
Type 0 - The passive method. In this method, no special action is performed upon dimensional changes. Some dimension data may remain as it was when first inserted, while other data may be overwritten.
Type 1 - Overwriting the old value. In a Type 1 SCD, the new data overwrites the existing data. The existing data is therefore lost, as it is not stored anywhere else. This is the default type of dimension; you do not need to specify any additional information to create a Type 1 SCD.
With this method, no history of dimension changes is kept in the database. The old dimension value is simply overwritten by the new one. This type is easy to maintain and is often used for data whose changes are caused by processing corrections (e.g., removing special characters or correcting spelling errors).
Type 2 - Creating a new additional record. With this methodology, the full history of dimension changes is kept in the database. You capture an attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain the natural key (or another durable identifier) as an attribute. ‘Effective date’ and ‘current indicator’ columns are also used in this method.
A Type 2 SCD retains the full history of values. When the value of a chosen attribute changes, the current record is closed, and a new record is created with the changed data values; this new record becomes the current record. Each record contains an effective time and an expiration time to identify the period during which the record was active.
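A Type 2 update can be sketched in a few lines of SQL. The following is an illustrative sketch against SQLite, assuming a customer dimension with a surrogate key, a natural key, an address attribute, and the effective-date and current-indicator columns described above; all table and column names are invented for the example.

```python
# Type 2 SCD sketch: expire the current row, then insert a new current
# row with a fresh surrogate key. Names are illustrative assumptions.
import sqlite3

def scd2_update(conn, customer_id, new_address, today):
    """Apply a Type 2 change to the customer's address."""
    row = conn.execute(
        "SELECT surrogate_key, address FROM dim_customer "
        "WHERE customer_id = ? AND is_current = 1", (customer_id,)).fetchone()
    if row and row[1] == new_address:
        return                                  # nothing changed
    if row:
        # close the current record by setting its expiration date
        conn.execute(
            "UPDATE dim_customer SET is_current = 0, expiration_date = ? "
            "WHERE surrogate_key = ?", (today, row[0]))
    # insert the new current record; the open-ended expiration date
    # marks it as active
    conn.execute(
        "INSERT INTO dim_customer "
        "(customer_id, address, effective_date, expiration_date, is_current) "
        "VALUES (?, ?, ?, '9999-12-31', 1)", (customer_id, new_address, today))
    conn.commit()
```

After two updates for the same natural key, the table holds both rows: the prior address with `is_current = 0` and a closed expiration date, and the new address as the single current record.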
Type 3 - Adding a new column. With this type, usually only the current and previous values of the dimension are kept in the database. The new value is loaded into the ‘current/new’ column and the old one into the ‘old/previous’ column. Generally speaking, the history is limited to the number of columns created for storing historical data. A Type 3 SCD stores two versions of values for certain selected attributes: each record stores the previous value and the current value of the selected attribute. When the value of a selected attribute changes, the current value is stored as the old value and the new value becomes the current value.
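In contrast to Type 2, a Type 3 change touches only one row: the current value shifts into the ‘previous’ column and is then overwritten. A minimal sketch, again with invented table and column names:

```python
# Type 3 SCD sketch: one row per customer, with paired
# current/previous columns. Names are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id TEXT PRIMARY KEY, "
             "current_address TEXT, previous_address TEXT)")
conn.execute("INSERT INTO dim_customer VALUES ('C1', '1 Main St', NULL)")

def scd3_update(conn, customer_id, new_address):
    """Shift the current value into the 'previous' column, then overwrite."""
    conn.execute(
        "UPDATE dim_customer SET previous_address = current_address, "
        "current_address = ? WHERE customer_id = ?",
        (new_address, customer_id))
    conn.commit()
```

Note the limitation the text describes: a second change discards the oldest value, because history is capped by the number of columns set aside for it.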
Type 4 - Using a historical table. In this method, a separate historical table is used to track all historical changes to each dimension’s attributes.
Type 6 - Combining the approaches of types 1, 2, and 3 (1+2+3=6).
The Integration Layer is the heart of the Integrated Data Warehouse. The Integration Layer contains the lowest possible granularity available from an authoritative source, in near Third Normal Form (3NF). The term ‘near 3NF’ is used because there may be requirements for slight denormalization of the base data. This is done on an exception basis. All 3NF tables will be defined using the Natural (or Business) keys of the data. Surrogate keys will not be used.
Data in the higher layers of the architecture are derived from data in this layer. This layer includes all corporate data that has business value to more than one business area, meaning that it has corporate value. Contained in this layer is the ‘base’ business data.
The source of the data in this layer is a combination of the operational systems, base data, master data and possibly applications that are resident on the EDW (e.g., Marketing Applications, Supply Chain Applications, or other applications which create data on the EDW). Aggregation/summary tables that have broad business use could also be located here.
The Integration Layer follows a set of guiding principles.
Enterprise Base Tables : Near-3NF representation of all business data. These may also include summary tables that are applicable across multiple business areas; an example could be Store/Item/Day sales created by summarizing all sales transactions.
This layer consists of Views that access the tables contained in the Integration Layer. Views are used to define a ‘virtual’ dimensional star schema model to hide the complexity associated with normalized data in the Integration Layer.
A semantic / data access layer provides ease of use for BI developers and ad hoc users. This layer presents data in a format that is easy to use and eliminates the most common joins of the physical tables. The Semantic / Data Access Layer structures provide users with a view of the data. All applications and users consume the data via views. Tableau is the current BI tool for reporting, and SQL Assistant is used for ad hoc queries. Security will be managed per line of business through separate sets of semantic views.
The Integration Layer is the heart of the data warehouse and includes corporate data in a near 3NF data model.
Enterprise Views : One-to-one views on the base tables. These include views with write permissions for ETL and ELT applications, as well as views that provide read access to base tables. There may also be views that define corporate metrics and logical structures used across business areas.
Business Intelligence Reporting Views : Used by the reporting front end and most ad hoc queries. These may combine Enterprise and Performance Layer access, and may include views that create star schemas or dimensional models to simplify data usage.
Security Views : Used to limit access to any sensitive data based on access rights.
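The layering above can be illustrated with plain SQL views over one base table. The sketch below uses SQLite for convenience; the table, view, and column names (including the sensitive `customer_ssn` column) are assumptions made for the example, not part of any real warehouse.

```python
# View-layer sketch: a near-3NF base table with an enterprise view,
# a reporting view, and a security view on top. Names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Integration Layer: near-3NF base table, natural keys
    CREATE TABLE sales_base (store_id TEXT, item_id TEXT, sale_day TEXT,
                             amount REAL, customer_ssn TEXT);

    -- Enterprise View: one-to-one read view on the base table
    CREATE VIEW v_sales AS SELECT * FROM sales_base;

    -- BI Reporting View: pre-aggregated for the reporting front end
    CREATE VIEW v_sales_by_store_day AS
        SELECT store_id, sale_day, SUM(amount) AS total_amount
        FROM v_sales GROUP BY store_id, sale_day;

    -- Security View: hides the sensitive column from ad hoc users
    CREATE VIEW v_sales_public AS
        SELECT store_id, item_id, sale_day, amount FROM sales_base;
""")
```

Because all consumers query the views rather than the base table, the warehouse team can restructure the Integration Layer, or tighten security, without breaking reports.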
If performance requirements dictate better response times than these normalized tables in the Integration Layer can deliver, denormalized versions of the tables can be created in the Performance Layer, either as physical tables or as other performance structures such as aggregate join indexes (AJIs).
Physical objects will only be created when a need is demonstrated, based on performance requirements and SLAs. Core metrics and key performance indicators that are used across business areas are best defined in this layer.
It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star is the fact table, and the points of the star are the dimension tables. Usually the fact tables in a star schema are in third normal form (3NF), whereas the dimension tables are denormalized. Although the star schema is the simplest architecture, it is the most commonly used today and is recommended by Oracle. It is an important special case of the snowflake schema and is more effective at handling simpler queries.
The snowflake schema represents a dimensional model which is also composed of a central fact table and a set of constituent dimension tables which are further normalized into sub-dimension tables. In a snowflake schema implementation, Warehouse Builder uses more than one table or view to store the dimension data. Separate database tables or views store data pertaining to each level in the dimension.
For each star schema it is possible to construct a fact constellation schema (for example, by splitting the original star schema into multiple star schemas, each describing facts at another level of the dimension hierarchies). The fact constellation architecture contains multiple fact tables that share many dimension tables.
The main shortcoming of the fact constellation schema is a more complicated design because many variants for particular kinds of aggregation must be considered and selected. Moreover, dimension tables are still large.
Ralph Kimball, the data warehousing guru, proposes three cases where snowflake implementation is not only acceptable but is also the key to a successful design:
Large customer dimensions where, for example, 80 percent of the fact table measurements involve anonymous visitors about whom you collect little detail, and 20 percent involve reliably registered customers about whom you collect much detailed data by tracking many dimensions
Financial product dimensions for banks, brokerage houses, and insurance companies, because each of the individual products has a host of special attributes not shared by other products
Multi-enterprise calendar dimensions because each organization has idiosyncratic fiscal periods, seasons, and holidays
A fact table typically has two types of columns: foreign keys to dimension tables, and measures, which contain the numeric facts. A fact table can contain facts at a detail or aggregated level.
A dimension is a structure, usually composed of one or more hierarchies, that categorizes data. If a dimension has no hierarchies or levels, it is called a flat dimension or list. The primary keys of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional value; they are normally descriptive, textual values. Dimension tables are generally smaller than fact tables.
Typical fact tables store data about sales, while dimension tables store data about geographic regions (markets, cities), clients, products, times, and channels.
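Putting the pieces together, a minimal star schema looks like this. The sketch uses SQLite, and every table, column, and value is an assumption invented for the example: a sales fact table whose rows carry foreign keys into a product dimension and a region dimension, plus a numeric measure.

```python
# Star schema sketch: one fact table joined to two dimension tables.
# All names and data are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                              name TEXT, category TEXT);
    CREATE TABLE dim_region  (region_key INTEGER PRIMARY KEY,
                              city TEXT, market TEXT);
    -- Fact table: foreign keys to the dimensions plus a measure
    CREATE TABLE fact_sales  (product_key INTEGER REFERENCES dim_product,
                              region_key  INTEGER REFERENCES dim_region,
                              amount REAL);
""")
conn.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                 [(1, "Widget", "Tools"), (2, "Gadget", "Toys")])
conn.executemany("INSERT INTO dim_region VALUES (?,?,?)",
                 [(10, "Austin", "US"), (20, "Lyon", "EU")])
conn.executemany("INSERT INTO fact_sales VALUES (?,?,?)",
                 [(1, 10, 100.0), (2, 10, 40.0), (1, 20, 60.0)])

# The typical star-schema query: join the fact table to its
# dimensions and aggregate the measure by dimensional attributes.
rows = conn.execute("""
    SELECT d.category, r.market, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product d ON f.product_key = d.product_key
    JOIN dim_region  r ON f.region_key  = r.region_key
    GROUP BY d.category, r.market
    ORDER BY d.category, r.market
""").fetchall()
```

The query shape shows why the design works: every analytical question becomes a join from the central fact table out to the descriptive dimension tables, followed by a group-by over the textual attributes.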